vectara-ingest includes a number of crawlers that make it easy to crawl data sources and index the results into Vectara.
If remove_boilerplate is enabled, vectara-ingest uses Goose3 and JusText in Indexer.index_url to improve text extraction from HTML content, keeping relevant material while excluding ads and other irrelevant content.
vectara-ingest currently supports crawling and indexing web content in 42 languages. To determine the language of a given webpage, we use the langdetect package, and adjust the use of Goose3 and JusText accordingly.
Let's go through some of the main crawlers to explain how they work and how to customize them to your needs. This also provides good background for creating (and contributing) new types of crawlers.
...
website_crawler:
  urls: [https://vectara.com]
  pos_regex: []
  neg_regex: []
  num_per_second: 10
  pages_source: crawl
  max_depth: 3 # only needed if pages_source is set to 'crawl'
  html_processing:
    ids_to_remove: [td-123]
    tags_to_remove: [nav]
    classes_to_remove: []
  keep_query_params: false
  crawl_report: false
  remove_old_content: false
  ray_workers: 0
...
The website crawler indexes the content of a given website. It supports two modes for finding pages to crawl (defined by pages_source):
- sitemap: in this mode the crawler retrieves the sitemap for each of the target websites (specified in the urls parameter) and indexes all the URLs listed in each sitemap. Note that some sitemaps are partial and do not list all content of the website; in those cases, crawl may be a better option.
- crawl: in this mode, for each URL specified in urls, the crawler starts there and crawls the website recursively, following links up to max_depth hops away. If you'd like to crawl only the URLs specified in the urls list (without any further hops), use max_depth=0.
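The effect of max_depth in crawl mode can be pictured as a breadth-first traversal that stops following links past a given number of hops. The sketch below is illustrative only: it walks a hypothetical in-memory link graph (LINKS) instead of fetching real pages, and the function name crawl is an assumption, not the crawler's actual code.

```python
from collections import deque

# Hypothetical link graph standing in for real pages: URL -> outgoing links.
LINKS = {
    "https://example.com": ["https://example.com/a", "https://example.com/b"],
    "https://example.com/a": ["https://example.com/a/1"],
    "https://example.com/b": [],
    "https://example.com/a/1": ["https://example.com/a/1/x"],
    "https://example.com/a/1/x": [],
}

def crawl(start_urls, max_depth):
    """Return URLs reachable from start_urls within max_depth hops, in BFS order."""
    seen = set(start_urls)
    queue = deque((url, 0) for url in start_urls)
    visited = []
    while queue:
        url, depth = queue.popleft()
        visited.append(url)
        if depth >= max_depth:
            continue  # do not follow links beyond the depth limit
        for link in LINKS.get(url, []):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return visited
```

With max_depth=0 only the starting URLs are returned, which matches the "no further hops" behavior described above.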
Other parameters:
- num_per_second specifies the number of calls per second when crawling the website, to allow rate-limiting. Defaults to 10.
- pos_regex defines one or more (optional) regular expressions defining URLs to match for inclusion.
- neg_regex defines one or more (optional) regular expressions defining URLs to match for exclusion.
- keep_query_params: if true, maintains the full URL including query params. If false, removes query params from collected URLs.
- crawl_report: if true, creates a file under ~/tmp/mount called urls_indexed.txt that lists all URLs crawled.
- remove_old_content: if true, removes any URL that currently exists in the corpus but is NOT in this crawl. CAUTION: this removes data from your corpus. If crawl_report is true then the list of URLs associated with the removed documents is listed in urls_removed.txt.
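As a rough illustration of how pos_regex, neg_regex and keep_query_params could combine when deciding which URLs to keep, here is a minimal sketch. The helper names normalize_url and url_allowed are hypothetical, and matching with re.match (anchored at the start of the URL) is an assumption; the crawler's own matching rules may differ.

```python
import re
from urllib.parse import urlsplit

def normalize_url(url, keep_query_params=False):
    """Drop the query string unless keep_query_params is set."""
    if keep_query_params:
        return url
    return urlsplit(url)._replace(query="").geturl()

def url_allowed(url, pos_regex, neg_regex):
    """Keep a URL if it matches any inclusion pattern (or none are given)
    and matches no exclusion pattern."""
    if pos_regex and not any(re.match(p, url) for p in pos_regex):
        return False
    return not any(re.match(p, url) for p in neg_regex)
```

Empty pos_regex and neg_regex lists, as in the example configuration, let every URL through.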
The html_processing configuration defines a set of special instructions that can be used to ignore some content when extracting text from HTML:
- ids_to_remove defines an (optional) list of HTML IDs that are ignored when extracting text from the page.
- tags_to_remove defines an (optional) list of HTML semantic tags (like header, footer, nav, etc) that are ignored when extracting text from the page.
- classes_to_remove defines an (optional) list of HTML "class" types that are ignored when extracting text from the page.
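To make the three html_processing rules concrete, the following sketch skips the text of any element selected by id, tag or class. It uses only Python's standard html.parser and assumes well-formed HTML in which every non-void element is closed; the class FilteredText and the function extract_text are hypothetical names, and this is not how vectara-ingest itself implements the feature.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}

class FilteredText(HTMLParser):
    """Collect page text, skipping elements selected by id, tag or class."""

    def __init__(self, ids_to_remove=(), tags_to_remove=(), classes_to_remove=()):
        super().__init__()
        self.ids = set(ids_to_remove)
        self.tags = set(tags_to_remove)
        self.classes = set(classes_to_remove)
        self.skip_depth = 0  # > 0 while inside a removed element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.skip_depth:
            self.skip_depth += 1  # nested element inside a removed one
            return
        attr = dict(attrs)
        classes = set((attr.get("class") or "").split())
        if tag in self.tags or attr.get("id") in self.ids or classes & self.classes:
            self.skip_depth = 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth and data.strip():
            self.chunks.append(data.strip())

def extract_text(html, **rules):
    parser = FilteredText(**rules)
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)
```

For example, calling extract_text(page, ids_to_remove=["td-123"], tags_to_remove=["nav"]) mirrors the example configuration above.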
Note: when specifying regular expressions it's recommended to use single quotes (as opposed to double quotes) to avoid issues with escape characters.
ray_workers, if defined, specifies the number of Ray workers to use for parallel processing. ray_workers=0 means don't use Ray. ray_workers=-1 means use all available cores.
Note that Ray with Docker does not work on Mac M1/M2 machines.
...
database_crawler:
  db_url: "postgresql://<username>:<password>@my_db_host:5432/yelp"
  db_table: yelp_reviews
  select_condition: "city='New Orleans'"
  doc_id_columns: [postal_code]
  text_columns: [business_name, review_text]
  metadata_columns: [city, state, postal_code]
The database crawler can be used to read data from a relational database and index relevant columns into Vectara.
- db_url specifies the database URI including the type of database, host/port, username and password if needed.
  - For MySQL: "mysql://username:password@host:port/database"
  - For PostgreSQL: "postgresql://username:password@host:port/database"
  - For Microsoft SQL Server: "mssql+pyodbc://username:password@host:port/database"
  - For Oracle: "oracle+cx_oracle://username:password@host:port/database"
- db_table: the table name in the database.
- select_condition: optional condition to filter rows in the table by.
- doc_id_columns defines one or more columns that will be used as a document ID, and will aggregate all rows associated with this value into a single Vectara document. The crawler will also use the content in these columns (concatenated) as the title for that row in the Vectara document. If this is not specified, the code will aggregate every rows_per_chunk (default 500) rows.
- text_columns: a list of column names that include textual information we want to use as the main text indexed into Vectara. The code concatenates these columns for each row.
- title_column is an optional column name that will hold textual information to be used as title at the document level.
- metadata_columns: a list of column names that we want to use as metadata.
In the above example, the crawler would
- Include all rows in the yelp_reviews table of the "yelp" database that are from the city of New Orleans (SELECT * FROM yelp_reviews WHERE city='New Orleans')
- Group all rows that have the same value for postal_code into the same Vectara document
- Index each such Vectara document with several sections (one per row), each representing the textual fields business_name and review_text and including the metadata fields city, state and postal_code.
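The grouping described above can be sketched with Python's built-in sqlite3 module. This is only an illustration under stated assumptions: an in-memory SQLite table stands in for the real database, the function group_rows and the " - " separator used to concatenate columns are hypothetical, and the sample rows are made up.

```python
import sqlite3
from collections import defaultdict

def group_rows(conn, table, select_condition, doc_id_columns,
               text_columns, metadata_columns):
    """Group rows by doc_id_columns; each row becomes one section of its document."""
    # Table name and condition come from trusted configuration, not user input.
    query = f"SELECT * FROM {table}"
    if select_condition:
        query += f" WHERE {select_condition}"
    conn.row_factory = sqlite3.Row
    docs = defaultdict(list)
    for row in conn.execute(query):
        doc_id = " - ".join(str(row[c]) for c in doc_id_columns)
        docs[doc_id].append({
            "text": " - ".join(str(row[c]) for c in text_columns),
            "metadata": {c: row[c] for c in metadata_columns},
        })
    return dict(docs)

# Tiny in-memory stand-in for the yelp_reviews table (made-up rows).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE yelp_reviews (business_name, review_text, city, state, postal_code)"
)
conn.executemany(
    "INSERT INTO yelp_reviews VALUES (?, ?, ?, ?, ?)",
    [
        ("Cafe A", "Great beignets", "New Orleans", "LA", "70116"),
        ("Bar B", "Loud but fun", "New Orleans", "LA", "70116"),
        ("Diner C", "Solid gumbo", "New Orleans", "LA", "70130"),
        ("Grill D", "Too salty", "Austin", "TX", "78701"),
    ],
)
docs = group_rows(conn, "yelp_reviews", "city='New Orleans'", ["postal_code"],
                  ["business_name", "review_text"], ["city", "state", "postal_code"])
```

Here docs holds one entry per postal code, each with one section per matching row; the Austin row is filtered out by the select condition.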
...
hfdataset_crawler:
  dataset_name: "coeuslearning/hotel_reviews"
  split: "train"
  select_condition: "city='New York City, USA'"
  start_row: 0
  num_rows: 55
  title_column: hotel
  text_columns: [review]
  metadata_columns: [city, hotel]
The Hugging Face dataset crawler can be used to read data from a Hugging Face dataset and index relevant columns into Vectara.
- dataset_name: the Hugging Face dataset name.
- split: the "split" of the dataset in the HF datasets hub (e.g. "train", "test" or "corpus"; look at the dataset card to determine).
- select_condition: optional condition to filter rows in the dataset by.
- start_row: if specified, skips the specified number of rows from the start of the dataset.
- num_rows: if specified, limits the dataset size to the specified number of rows.
- id_column: optional column to use as the ID. Must be unique if used.
- text_columns: a list of column names that include textual information we want to use as the main text indexed into Vectara. The code concatenates these columns for each row.
- title_column is an optional column name that will hold textual information to be used as title at the document level.
- metadata_columns: a list of column names that we want to use as metadata.
In the above example, the crawler would
- Include all rows in the dataset that are from New York City
- Index each Vectara document with a title taken from the value of hotel and text from review. The metadata will include the fields city and hotel.
- Include only the first 55 rows matching the condition
...
csv_crawler:
  file_path: "/path/to/Game_of_Thrones_Script.csv"
  select_condition: "Season='Season 1'"
  doc_id_columns: [Season, Episode]
  text_columns: [Name, Sentence]
  metadata_columns: ["Season", "Episode", "Episode Title"]
  column_types: []
  separator: ','
  sheet_name: "my-sheet"
The csv crawler is similar to the database crawler, but instead of pulling data from a database, it uses a local CSV or XLSX file.
- select_condition: optional condition to filter rows in the table by.
- doc_id_columns defines one or more columns that will be used as a document ID, and will aggregate all rows associated with this value into a single Vectara document. This will also be used as the title. If this is not specified, the code will aggregate every rows_per_chunk (default 500) rows.
- text_columns: a list of column names that include textual information we want to use.
- title_column is an optional column name that will hold textual information to be used as title.
- metadata_columns: a list of column names that we want to use as metadata.
- column_types: an optional dictionary of column name and type (int, float, str). If unspecified, or for columns not included, the default type is str.
- separator: a string that will be used as a separator in the CSV file (default ','). Relevant only for CSV files.
- sheet_name: the name of the sheet in the XLSX file to use. Relevant only for XLSX files.
In the above example, the crawler would
- Read the data from the local CSV file under /path/to/Game_of_Thrones_Script.csv, keeping only the rows where Season is 'Season 1'
- Group all rows that have the same values for both Season and Episode into the same Vectara document
- Index each such Vectara document with several sections (one per row), each representing the textual fields Name and Sentence and including the metadata fields Season, Episode and Episode Title.
Note that the type of file is determined by its extension (e.g. CSV vs XLSX).
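To illustrate how separator and column_types might be applied when reading a CSV file, here is a minimal sketch using the standard csv module. The function read_rows and the CASTS table are hypothetical; the sample data is made up and uses ';' only to show a non-default separator.

```python
import csv
import io

# Maps the type names accepted by column_types to Python constructors.
CASTS = {"int": int, "float": float, "str": str}

def read_rows(text, separator=",", column_types=None):
    """Parse CSV text and cast columns; columns not listed stay as str."""
    column_types = column_types or {}
    rows = []
    for row in csv.DictReader(io.StringIO(text), delimiter=separator):
        rows.append({
            col: CASTS[column_types.get(col, "str")](value)
            for col, value in row.items()
        })
    return rows

sample = "Season;Episode;Name\nSeason 1;1;jon snow\n"
rows = read_rows(sample, separator=";", column_types={"Episode": "int"})
```

In this sketch only Episode is converted to an integer; Season and Name keep the default str type.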
...
bulkupload_crawler:
  json_path: "/path/to/file.JSON"
The Bulk Upload crawler accepts a single JSON file that is an array of Vectara JSON document objects, in the format specified by the Vectara documentation. It then iterates through these document objects, and uploads them one by one to Vectara.
Apart from json_path, the bulk upload crawler has no parameters.
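The iterate-and-upload loop can be sketched as follows. This is a simplified illustration: bulk_upload is a hypothetical function, the upload_one callback stands in for the real call that sends a document to Vectara, and the document fields in the sample file are placeholders rather than the Vectara document schema.

```python
import json
import os
import tempfile

def bulk_upload(json_path, upload_one):
    """Load a JSON array of document objects and hand each one to upload_one."""
    with open(json_path, encoding="utf-8") as f:
        documents = json.load(f)
    if not isinstance(documents, list):
        raise ValueError("expected a JSON array of document objects")
    for doc in documents:
        upload_one(doc)
    return len(documents)

# Usage with a throwaway file and a stand-in uploader that just records documents.
uploaded = []
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "file.json")
    with open(path, "w", encoding="utf-8") as f:
        json.dump([{"id": "doc-1"}, {"id": "doc-2"}], f)
    count = bulk_upload(path, uploaded.append)
```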
...
rss_crawler:
  source: bbc
  rss_pages: [
    "http://feeds.bbci.co.uk/news/rss.xml", "http://feeds.bbci.co.uk/news/world/rss.xml", "http://feeds.bbci.co.uk/news/uk/rss.xml",
    "http://feeds.bbci.co.uk/news/business/rss.xml", "http://feeds.bbci.co.uk/news/politics/rss.xml",
    "http://feeds.bbci.co.uk/news/health/rss.xml", "http://feeds.bbci.co.uk/news/education/rss.xml",
    "http://feeds.bbci.co.uk/news/science_and_environment/rss.xml", "http://feeds.bbci.co.uk/news/technology/rss.xml",
    "http://feeds.bbci.co.uk/news/entertainment_and_arts/rss.xml", "http://feeds.bbci.co.uk/news/world/middle_east/rss.xml",
    "http://feeds.bbci.co.uk/news/world/us_and_canada/rss.xml", "http://feeds.bbci.co.uk/news/world/asia/rss.xml",
    "http://feeds.bbci.co.uk/news/world/europe/rss.xml"
  ]
  days_past: 90
  delay: 1
The RSS crawler can be used to crawl URLs listed in RSS feeds such as on news sites. In the example above, the rss_crawler is configured to crawl various newsfeeds from the BBC.
- source specifies the name of the RSS data feed.
- rss_pages defines one or more RSS feed locations.
- days_past specifies the number of days backward to crawl; for example, with a value of 90 as in this example, the crawler will only index news items that were published within the last 90 days.
- delay defines the number of seconds to wait between news articles, so as to make the crawl more friendly to the hosting site.
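The days_past cutoff and the delay between articles can be sketched like this. The functions recent_items and crawl_items are hypothetical, and each feed item is assumed to be a dictionary with a timezone-aware "published" datetime; a real feed parser would supply that field in its own format.

```python
import time
from datetime import datetime, timedelta, timezone

def recent_items(items, days_past, now=None):
    """Keep feed items published within the last days_past days."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=days_past)
    return [item for item in items if item["published"] >= cutoff]

def crawl_items(items, delay, handle):
    """Process items one at a time, sleeping delay seconds between them."""
    for i, item in enumerate(items):
        if i > 0:
            time.sleep(delay)  # be polite to the hosting site
        handle(item)
```

Passing now explicitly makes the cutoff reproducible, which is convenient when testing the filter.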
...
hackernews_crawler:
  max_articles: 1000
  days_past: 3
  days_past_comprehensive: false
The hackernews crawler can be used to crawl stories and comments from Hacker News.
- max_articles specifies a limit to the number of stories crawled.
- days_past specifies the number of days backward to crawl, based on the top, new, ask, show and best story lists. For example, with a value of 3 as in this example, the crawler will only index stories if the story or any comment in the story was published or updated in the last 3 days.
- days_past_comprehensive: if true, the crawler performs a comprehensive search for ALL stories published within the last days_past days (which takes longer to run).
...
docs_crawler:
  base_urls: ["https://docs.vectara.com/docs"]
  pos_regex: [".*vectara.com/docs.*"]
  neg_regex: [".*vectara.com/docs/rest-api/.*"]
  num_per_second: 10
  extensions_to_ignore: [".php", ".java", ".py", ".js"]
  docs_system: docusaurus
  remove_code: true
  html_processing:
    ids_to_remove: []
    tags_to_remove: [footer]
  crawl_report: false
  remove_old_content: false
  ray_workers: 0
The Docs crawler processes and indexes content published on different documentation systems. It supports the following parameters:
- base_urls defines one or more base URLs for the documentation content.
- pos_regex defines one or more (optional) regular expressions defining URLs to match for inclusion.
- neg_regex defines one or more (optional) regular expressions defining URLs to match for exclusion.
- extensions_to_ignore specifies one or more file extensions that we want to ignore and not index into Vectara.
- docs_system is a text string specifying the documentation system crawled, and is added to the metadata under "source".
- ray_workers, if defined, specifies the number of Ray workers to use for parallel processing. ray_workers=0 means don't use Ray. ray_workers=-1 means use all available cores.
- num_per_second specifies the number of calls per second when crawling the website, to allow rate-limiting. Defaults to 10.
- crawl_report: if true, creates a file under ~/tmp/mount called urls_indexed.txt that lists all URLs crawled.
- remove_old_content: if true, removes any URL that currently exists in the corpus but is NOT in this crawl. CAUTION: this removes data from your corpus. If crawl_report is true then the list of URLs associated with the removed documents is listed in urls_removed.txt.
The html_processing configuration defines a set of special instructions that can be used to ignore some content when extracting text from HTML:
- ids_to_remove defines an (optional) list of HTML IDs that are ignored when extracting text from the page.
- tags_to_remove defines an (optional) list of HTML semantic tags (like header, footer, nav, etc) that are ignored when extracting text from the page.
Note: when specifying regular expressions it's recommended to use single quotes (as opposed to double quotes) to avoid issues with escape characters.
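As a small illustration of the extensions_to_ignore check, the sketch below looks at the extension of the URL path only, ignoring any query string. The function should_skip is a hypothetical helper, and case-insensitive matching is an assumption.

```python
import posixpath
from urllib.parse import urlsplit

def should_skip(url, extensions_to_ignore):
    """True if the URL path ends in one of the ignored file extensions."""
    extension = posixpath.splitext(urlsplit(url).path)[1].lower()
    return extension in {ext.lower() for ext in extensions_to_ignore}
```

With the example configuration, a link to a .py file would be skipped while a regular documentation page would not.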
...
discourse_crawler:
  base_url: "https://discuss.vectara.com"
The discourse forums crawler requires a single parameter, base_url, which specifies the home page for the public forum we want to crawl.
In the secrets.toml file you should have DISCOURSE_API_KEY="<YOUR_DISCOURSE_KEY>", which provides the authentication the crawler needs to access this data.
The mediawiki crawler can crawl content in any MediaWiki-powered website, such as Wikipedia, and index it into Vectara.
...
wikimedia_crawler:
  project: "en.wikipedia"
  api_url: "https://en.wikipedia.org/w/api.php"
  n_pages: 1000
The mediawiki crawler first looks at media statistics to determine the most viewed pages in the last 7 days, and then based on that picks the top n_pages to crawl.
- api_url defines the base URL for the wiki.
- project defines the mediawiki project name.
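The "most viewed pages, then top n_pages" selection can be pictured as summing daily view counts and keeping the largest totals. In this sketch, top_pages is a hypothetical function and daily_views is assumed to be a list of per-day dictionaries mapping page title to view count; how the crawler actually obtains and aggregates the statistics is not shown here.

```python
from collections import Counter

def top_pages(daily_views, n_pages):
    """Sum per-day view counts and return the n_pages most viewed titles."""
    totals = Counter()
    for day in daily_views:
        totals.update(day)  # adds each day's counts to the running totals
    return [title for title, _ in totals.most_common(n_pages)]
```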
...
github_crawler:
  owner: "vectara"
  repos: ["getting-started", "protos", "slackbot", "magazine-search-demo", "web-crawler", "Search-UI", "hotel-reviews-demo"]
  crawl_code: false
  num_per_second: 2
The GitHub crawler indexes content from GitHub repositories into Vectara.
- repos: list of repository names to crawl.
- owner: GitHub repository owner.
- crawl_code: by default the crawler indexes only issues and comments; if this is set to True it will also index the source code (but that's usually not recommended).
- num_per_second specifies the number of calls per second when crawling the website, to allow rate-limiting. Defaults to 10.
It is highly recommended to add a GITHUB_TOKEN to your secrets.toml file under the specific profile you're going to use. The GITHUB_TOKEN (see the GitHub documentation for how to create one for yourself) helps you avoid running into rate limits.
...
jira_crawler:
  jira_base_url: "https://vectara.atlassian.net/"
  jira_username: [email protected]
  jira_jql: "created > -365d"
The JIRA crawler indexes issues and comments into Vectara.
- jira_base_url: the base URL of your Jira instance.
- jira_username: the user name that the crawler should use (JIRA_PASSWORD should be separately defined in the secrets.toml file).
- jira_jql: a Jira JQL condition on the issues identified; in this example it is configured to only include items from the last year.
...
twitter_crawler:
  bearer_token: the Twitter API bearer token. see https://developer.twitter.com/en/docs/twitter-api/getting-started/getting-access-to-the-twitter-api
  userhandles: a list of user handles to pull tweets from
  num_tweets: number of most recent tweets to pull from each user handle
  clean_tweets: whether to remove username / handles from tweet text (default: True)
The twitter crawler indexes the top num_tweets mentions from Twitter.
- bearer_token: the Twitter developer authentication token.
- userhandles: a list of user handles to look for in mentions.
- num_tweets: number of recent tweets mentioning each handle to pull.
- clean_tweets: if True, removes all usernames/handles from the tweet text.
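The kind of cleanup that clean_tweets refers to could look roughly like the following. This is a sketch only: clean_tweet is a hypothetical helper, and treating a handle as "@" followed by word characters is an assumption about what gets removed.

```python
import re

def clean_tweet(text):
    """Remove @handles and collapse the whitespace they leave behind."""
    without_handles = re.sub(r"@\w+", "", text)
    return re.sub(r"\s+", " ", without_handles).strip()
```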
To set up Notion you will have to create a Notion integration, and share the pages you want indexed with this connection.
...
notion_crawler:
  remove_old_content: false
The notion crawler indexes Notion pages into Vectara:
- remove_old_content: if true, removes any document that currently exists in the corpus but is NOT in this crawl. CAUTION: this removes data from your corpus.
For this crawler, you need to specify NOTION_API_KEY (which is associated with your custom integration) in the secrets.toml file.
The HubSpot crawler has no specific parameters, except the HUBSPOT_API_KEY that needs to be specified in the secrets.toml file. The crawler indexes the emails in your HubSpot instance. It also uses the clean_email_text() function in core/utils.py, which takes the email message as a parameter and cleans it to make it more presentable, including handling of the > indentation character.
The crawler uses Presidio Analyzer and Anonymizer to mask PII (personally identifiable information) in the email text.
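To give a feel for the kind of cleanup involved in handling the > indentation character, here is a minimal sketch that strips leading reply markers from each line. It is not the actual clean_email_text() from core/utils.py; strip_quote_markers is a hypothetical function written for illustration.

```python
import re

def strip_quote_markers(message):
    """Remove leading '>' reply-indentation markers from each line."""
    lines = [re.sub(r"^\s*(>\s?)+", "", line) for line in message.splitlines()]
    return "\n".join(lines)
```

Lines without a leading > are left untouched, including their original indentation.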
...
gdrive_crawler:
  permissions: ['Vectara', 'all']
  days_back: 365
  ray_workers: 0
  delegated_users:
    - [email protected]
    - [email protected]
The gdrive crawler indexes the content of your Google Drive folder.
- days_back: include only files created within the last N days.
- permissions: list of displayName values to include. We recommend including your company name (e.g. Vectara) and all to include all non-restricted files.
- delegated_users: list of user emails in your organization.
- ray_workers: 0 if not using Ray, otherwise specifies the number of Ray workers to use.
This crawler identifies Google Drive files based on the list of delegated users. For each user it looks at the files that the user either created or has access to, limited to files that have "accessible by all" permissions (so that "restricted" files are not included).
Note that this crawler uses a Google Drive service account to access files, and you need to include a credentials.json file in the main vectara-ingest folder. For more information see the Google documentation under "Service account credentials".
...
folder_crawler:
  path: "/Users/ofer/Downloads/some-interesting-content/"
  extensions: ['.pdf']
The folder crawler indexes all files specified from a local folder.
- path: the local folder location.
- extensions: list of file extensions to be included. If one of those extensions is '*' then all files are crawled, disregarding any other extensions in that list.
- source: a string that is added to each file's metadata under the "source" field.
- metadata_file: an optional CSV file for metadata. Each row should have a filename column as key to match the file in the folder, and 1 or more additional columns used as metadata. This file should be in the path folder, but will be ignored for indexing purposes.
Note that the local path you specify is mapped into a fixed location in the Docker container, /home/vectara/data, but that is an implementation detail that you don't need to worry about in most cases: just specify the path to your local folder and this mapping happens automatically.
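The extensions rule, including the '*' wildcard, can be sketched as follows. The function select_files is hypothetical, and walking subfolders recursively is an assumption made for the sake of the example; the demo at the end uses a temporary folder with two empty files.

```python
import os
import tempfile

def select_files(path, extensions):
    """List files under path whose extension is allowed; '*' allows everything."""
    allow_all = "*" in extensions
    allowed = {ext.lower() for ext in extensions}
    selected = []
    for root, _dirs, files in os.walk(path):
        for name in sorted(files):
            if allow_all or os.path.splitext(name)[1].lower() in allowed:
                selected.append(os.path.join(root, name))
    return selected

# Demo on a throwaway folder containing one PDF and one text file.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("a.pdf", "b.txt"):
        open(os.path.join(tmp, name), "w").close()
    pdf_only = [os.path.basename(p) for p in select_files(tmp, [".pdf"])]
    everything = [os.path.basename(p) for p in select_files(tmp, [".pdf", "*"])]
```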
...
s3_crawler:
  s3_path: s3://my-bucket/files
  extensions: ['*']
The S3 crawler indexes all content that's in a specified S3 bucket path.
- s3_path: a valid S3 location where the files to index reside.
- extensions: list of file extensions to be included. If one of those extensions is '*' then all files are crawled, disregarding any other extensions in that list.
- metadata_file: an optional CSV file for metadata. Each row should have a filename column as key to match the file in the folder, and 1 or more additional columns used as metadata. This file should be in the same s3_path folder, but will be ignored for indexing purposes.
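A metadata_file of the shape described above can be turned into a per-file lookup along these lines. The function load_metadata is a hypothetical helper and the sample CSV content is made up; only the filename key column comes from the description.

```python
import csv
import io

def load_metadata(csv_text):
    """Map each filename to a dict of its remaining metadata columns."""
    metadata = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        filename = row.pop("filename")
        metadata[filename] = row
    return metadata

sample = "filename,author,year\nreport.pdf,Alice,2023\n"
lookup = load_metadata(sample)
```

All values stay as strings in this sketch, since the CSV itself carries no type information.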
...
yt_crawler:
  playlist_url: <some-youtube-playlist-url>
The YouTube crawler loads all videos from a playlist, extracts the subtitles into text (or transcribes the audio if subtitles don't exist), and indexes that text.
- playlist_url: a valid YouTube playlist URL.
...
slack_crawler:
  days_past: 30
  channels_to_skip: ["alerts"]
  retries: 5
To use the slack crawler you need to create a Slack bot app and give it permissions. The steps are as follows:
- Create a Slack App: Log in to your Slack workspace and navigate to the Slack API website. Click on "Your Apps" and then "Create New App." Provide a name for your app, select the workspace where you want to install it, and click "Create App."
- Configure Basic Information: In the app settings, you can configure various details such as the app name, icon, and description. Make sure to fill out the necessary information accurately.
- Install the Bot to Your Workspace: Once you've configured your app, navigate to the "Install App" section. Click on the "Install App to Workspace" button to add the bot to your Slack workspace. This step will generate an OAuth access token that you'll need to use to authenticate your bot.
- Add User Token Scope: To add user token scope, navigate to the "OAuth & Permissions" section in your app settings. Under the "OAuth Tokens for Your Workspace" section, you'll need to add the users:read, channels:read and channels:history scopes.
- Save Changes: Make sure to save any changes you've made to your app settings.
- Place the generated user token in secrets.toml as SLACK_USER_TOKEN= <user_token>
- Edgar crawler: crawls SEC Edgar annual reports (10-K) and indexes those into Vectara.
- fmp crawler: crawls information about public companies using the FMP API.
- PMC crawler: crawls medical articles from PubMed Central and indexes them into Vectara.
- Arxiv crawler: crawls the top (most cited or latest) Arxiv articles about a topic and indexes them into Vectara.