This is a demo of how to use Weaviate.
What is Weaviate, you might ask? According to the official website:
Weaviate is an open source vector search engine that stores both objects and vectors, allowing for combining vector search with structured filtering with the fault-tolerance and scalability of a cloud-native database, all accessible through GraphQL, REST, and various language clients.
In other words, Weaviate is a search engine built with semantic search in mind, so it is optimized for fast vector search. It also provides traditional text search and lets you add filters to a request. All of this happens in a single request on Weaviate's side, which makes it fast, modern, and flexible.
OK, but what is vector search? In layman's terms, instead of looking for an exact or partial word-by-word match, we look for the text whose vector representation is closest to the vector representation of the search text. A vector representation is just a list of numbers that tries to encode the meaning of a text, so that texts with different words but similar meaning get similar vectors, while texts with different meanings get very different vectors.
If that sounds confusing, here is an example. With vector search, if you type:
"how pandemic affects on gas prices in US"
you will get an article titled:
"Oil slides more than 3% as virus cases mount"
The words are different, but the meaning is quite close.
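To make "closest vector" concrete, here is a minimal sketch of cosine similarity, one common way to measure how close two vectors are. The three-number vectors below are invented for illustration only; real vectors come from a model (such as a text2vec-transformers module) and have hundreds of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy vectors, hand-picked for illustration (not model output).
query = [0.9, 0.8, 0.1]           # "how pandemic affects on gas prices in US"
oil_article = [0.8, 0.9, 0.2]     # "Oil slides more than 3% as virus cases mount"
sports_article = [0.1, 0.2, 0.9]  # an unrelated article

print("query vs oil article:   ", round(cosine_similarity(query, oil_article), 3))
print("query vs sports article:", round(cosine_similarity(query, sports_article), 3))
```

The oil article scores much higher than the unrelated one even though it shares almost no words with the query, which is the behaviour a vector search engine relies on.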
Examples of how to use Weaviate can be found in:
- quick start guide: under each request there is a link to Weaviate's console, where you can run the request in real time
- YouTube channel: you can start with the introduction video
- and of course in this repository :)
- data
- raw: contains a CSV file with articles from the CNBC website, scraped by data.world
- docs
- api_calls.rest: endpoint call examples
- notebooks: the notebooks can also be rendered with nbviewer
- 1.EDA: exploratory data analysis notebook where each column of the dataset is described
- 2.data_preprocessing: for this demo project, data preprocessing is done in a notebook
- 3.weaviate_search_examples: here you can find different variants of search requests
- src
- config
- config.yaml: config file
- data
- data_loader.py: this data loader parses the schema from a running Weaviate instance and automatically creates all required objects and references. More about it in this Readme
- docker
- docker-compose.yml: speaks for itself. To create a new docker-compose file, you can use the Weaviate docker-compose generator
- models
- schema.json: specifies classes, properties, data types and references for the data. More about it in this Readme
- utils: various utils files
- config
- main.py: script that loads data into a running Weaviate instance and, as a check, runs a search request and prints the result.
- pyproject.toml: package dependencies are stored here and managed by Poetry
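As a rough illustration of what "parsing the schema to find objects and references" can look like, here is a minimal sketch. It is not the repository's data_loader.py. The class and property names are borrowed from the example query later in this README, the data types are assumptions, and split_properties is a hypothetical helper. The idea is that a property whose data type names another class is a cross-reference, while everything else is a plain value.

```python
# Hypothetical schema fragment in the shape Weaviate uses for class definitions.
# Names come from the example query in this README; data types are assumed.
SCHEMA = {
    "classes": [
        {
            "class": "Article",
            "properties": [
                {"name": "title", "dataType": ["text"]},
                {"name": "short_description", "dataType": ["text"]},
                {"name": "published_at", "dataType": ["date"]},
                {"name": "hasAuthors", "dataType": ["Author"]},
            ],
        },
        {
            "class": "Author",
            "properties": [{"name": "name", "dataType": ["string"]}],
        },
    ]
}


def split_properties(schema):
    """Group each class's properties into plain values and cross-references."""
    class_names = {cls["class"] for cls in schema["classes"]}
    result = {}
    for cls in schema["classes"]:
        plain, references = [], []
        for prop in cls.get("properties", []):
            # A property is a cross-reference when its data type names a class.
            if any(data_type in class_names for data_type in prop["dataType"]):
                references.append(prop["name"])
            else:
                plain.append(prop["name"])
        result[cls["class"]] = {"properties": plain, "references": references}
    return result


print(split_properties(SCHEMA))
```

A loader built this way can create all objects from the plain properties first and add the references in a second pass, once both ends of each reference exist.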
To use this project (with the provided CNBC dataset), follow these steps:
- Install the required packages. Dependencies are managed by Poetry and listed in the pyproject.toml file, which also specifies the required Python version. Once Poetry is installed and a virtual environment is created (in case you don't want Poetry to create it automatically), run:
poetry install
- Run the preprocessing notebook (notebooks/2.data_preprocessing.ipynb). It creates a CSV file in the data/interim folder that is later used to send data to the Weaviate instance.
- Run the docker-compose file src/docker/docker-compose.yml. This downloads all necessary images and starts a Weaviate instance with the specified modules.
NOTE: this project uses a custom text2vec-transformers image, which can be built from this repository (there you can also find out how to create a text2vec-transformers module with your desired way of text vectorization). If it doesn't fit your needs, here is a list of prebuilt images; just replace the image value (line 27 in docker-compose.yml) with an image name from that list.
- Load data into the running Weaviate instance. To do this, simply run:
python main.py
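Loading only works once the instance has finished starting, which can take a while after the containers come up. The sketch below checks Weaviate's readiness endpoint (/v1/.well-known/ready) before you run main.py. The base URL http://localhost:8080 is an assumption; use whatever host and port your docker-compose file exposes. The helper names are hypothetical.

```python
import urllib.error
import urllib.request


def ready_url(base_url):
    """Build the readiness-probe URL for a Weaviate instance."""
    return base_url.rstrip("/") + "/v1/.well-known/ready"


def is_ready(base_url, timeout=2.0):
    """Return True if the instance answers the readiness probe with HTTP 200."""
    try:
        with urllib.request.urlopen(ready_url(base_url), timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx status: not ready yet.
        return False


if __name__ == "__main__":
    # Assumed address; adjust to your setup.
    print("Weaviate ready:", is_ready("http://localhost:8080"))
```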
Once the Weaviate instance is running with data in it, you can use Weaviate's console. Just open in your browser:
... provide the host and port (most likely http://0.0.0.0:8080) and click the "Connect" button.
Now you can create a search request in GraphQL format (examples can be found in 3.weaviate_search_examples.ipynb, chapter 2).
For example, this request returns the top 5 articles about oil prices and Saudi Arabia and US relations with a certainty of more than 0.8:
{
  Get {
    Article(
      limit: 5
      nearText: {
        concepts: ["Saudi Arabia US relations and oil prices"],
        certainty: 0.8,
      }
    )
    {
      published_at
      title
      short_description
      _additional {
        certainty
      }
      hasAuthors {
        ... on Author {
          name
        }
      }
    }
  }
}
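The same kind of request can also be sent without the console, by posting it to Weaviate's GraphQL endpoint (/v1/graphql). This is a minimal sketch using only the standard library. The base URL http://localhost:8080 and the helper names build_query and run_query are assumptions, and the query is a shortened version of the one above.

```python
import json
import urllib.request


def build_query(concept, certainty=0.8, limit=5):
    """Return a GraphQL nearText query for the Article class."""
    # json.dumps yields a quoted, escaped string literal, which is
    # adequate for simple text inside a GraphQL query.
    return """
{
  Get {
    Article(
      limit: %d
      nearText: {concepts: [%s], certainty: %s}
    ) {
      title
      short_description
      _additional { certainty }
    }
  }
}""" % (limit, json.dumps(concept), certainty)


def run_query(query, base_url="http://localhost:8080"):
    """POST the query to the GraphQL endpoint and return the decoded JSON."""
    request = urllib.request.Request(
        base_url.rstrip("/") + "/v1/graphql",
        data=json.dumps({"query": query}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


if __name__ == "__main__":
    query = build_query("Saudi Arabia US relations and oil prices")
    result = run_query(query)  # requires a running instance with data loaded
    for article in result["data"]["Get"]["Article"]:
        print(article["_additional"]["certainty"], article["title"])
```

Matching articles are returned under data.Get.Article in the response, one entry per article with the fields requested in the query.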
To run the black formatter and the flake8 linter before each commit, add them to the .git/hooks folder, either manually or with the helper script:
sh .add_git_hooks.sh
This script puts a pre-commit file into the project's .git/hooks folder and makes it executable.