Merge pull request #4 from supabase-community/or/refactor-2
Remove namespace option + add README docs
olirice authored Jun 24, 2024
2 parents 4a6294d + 3572a1d commit 0e8b753
Showing 12 changed files with 142 additions and 86 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/tests.yml
@@ -50,4 +50,4 @@ jobs:
       - name: Run tests
         run: poetry run pytest
         env:
-          PINECONE_APIKEY: ${{secrets.PINECONE_APIKEY}}
+          PINECONE_API_KEY: ${{secrets.PINECONE_API_KEY}}
74 changes: 57 additions & 17 deletions README.md
@@ -28,11 +28,12 @@

---

A CLI for migrating vector workloads from [Pinecone](https://www.pinecone.io/) to [Pgvector](https://github.com/pgvector/pgvector) on [Supabase](https://supabase.com).
A CLI for migrating data from vector databases to [Supabase](https://supabase.com).

Additional data sources will be added soon.
Supported data sources include:
- [Pinecone](https://docs.pinecone.io/home)
- (more soon)

# Use

```
vec2pg --help
@@ -51,28 +52,67 @@ vec2pg --help
╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
```

## Pinecone
## Migration Guide

### Pinecone

```
vec2pg pinecone migrate --help
```

```
Usage: vec2pg pinecone migrate [OPTIONS] PINECONE_INDEX PINECONE_API_KEY
POSTGRES_CONNECTION_STRING
╭─ Arguments ──────────────────────────────────────────────────────────────────────────────────────────────────╮
│ * pinecone_index TEXT [default: None] [required] │
│ * pinecone_api_key TEXT [env var: PINECONE_API_KEY] [default: None] [required] │
│ * postgres_connection_string TEXT [env var: POSTGRES_CONNECTION_STRING] [default: None] [required] │
╰──────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
╭─ Options ────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ --help Show this message and exit. │
╰──────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
```

Usage: vec2pg pinecone migrate [OPTIONS] PINECONE_APIKEY PINECONE_INDEX
PINECONE_NAMESPACE POSTGRES_CONNECTION_STRING
╭─ Arguments ───────────────────────────────────────────────────────────────────────────────────────────────────╮
│ * pinecone_apikey TEXT [env var: PINECONE_APIKEY] [default: None] [required] │
│ * pinecone_index TEXT [env var: PINECONE_INDEX] [default: None] [required] │
│ * pinecone_namespace TEXT [env var: PINECONE_NAMESPACE] [default: None] [required] │
│ * postgres_connection_string TEXT [env var: POSTGRES_CONNECTION_STRING] [default: None] [required] │
╰───────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
╭─ Options ──────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ --help Show this message and exit. │
╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯


To migrate from a [Pinecone serverless](https://www.pinecone.io/blog/serverless/) index to Postgres you'll need:

- A Pinecone API Key

![pinecone api key](/assets/pinecone_api_key.png)

- The Pinecone serverless index name

![pinecone serverless index name](/assets/pinecone_index_name.png)

- A Supabase instance

From the Supabase instance we need the connection parameters. Retrieve them [here](https://supabase.com/dashboard/project/_/settings/database).

![supabase connection parameters](/assets/supabase_connection_params.png)

Then substitute those values into a valid Postgres connection string:
```
postgresql://<User>:<Password>@<Host>:<Port>/postgres
```
e.g.
```
postgresql://postgres.ahqsutirwnsocaaorimo:<Password>@aws-0-us-east-1.pooler.supabase.com:6543/postgres
```
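
If the password contains characters with special meaning in a URI (for example `@` or `/`), they need to be percent-encoded before being substituted into the connection string. The following is a minimal sketch in Python, not part of vec2pg: `build_connection_string` is a hypothetical helper, and the default port `6543` and database name `postgres` are simply taken from the example above.

```python
from urllib.parse import quote


def build_connection_string(
    user: str,
    password: str,
    host: str,
    port: int = 6543,  # assumption: pooler port from the example above
    dbname: str = "postgres",
) -> str:
    """Assemble a Postgres URI, percent-encoding the user and password."""
    # safe="" makes quote() encode "/" as well, which it leaves alone by default.
    return (
        f"postgresql://{quote(user, safe='')}:{quote(password, safe='')}"
        f"@{host}:{port}/{dbname}"
    )
```

The result can be exported as `POSTGRES_CONNECTION_STRING` or passed to the CLI as the final argument.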

Then we can call `vec2pg pinecone migrate`, passing our values. You can supply all parameters directly to the CLI, but it's a good idea to pass the Pinecone API key (`PINECONE_API_KEY`) and Supabase connection string (`POSTGRES_CONNECTION_STRING`) as environment variables to avoid logging credentials to your shell's history.

![sample output](/assets/pinecone_to_supabase_output.png)
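
One way to keep credentials off the command line is to launch the CLI from a small script that places them in the child process environment. The sketch below is an illustration under stated assumptions: `build_migrate_invocation` and `run_migration` are hypothetical helpers, and the sketch relies on the `PINECONE_API_KEY` and `POSTGRES_CONNECTION_STRING` environment variables listed in the help output, so that only the index name is passed as a positional argument.

```python
import os
import subprocess
from typing import Dict, List, Tuple


def build_migrate_invocation(
    pinecone_index: str,
    pinecone_api_key: str,
    postgres_connection_string: str,
) -> Tuple[List[str], Dict[str, str]]:
    """Return (argv, env) for `vec2pg pinecone migrate` with secrets in env."""
    # Only the index name appears on the command line.
    argv = ["vec2pg", "pinecone", "migrate", pinecone_index]
    # Copy the current environment and add the two credentials to the copy.
    env = dict(os.environ)
    env["PINECONE_API_KEY"] = pinecone_api_key
    env["POSTGRES_CONNECTION_STRING"] = postgres_connection_string
    return argv, env


def run_migration(
    pinecone_index: str,
    pinecone_api_key: str,
    postgres_connection_string: str,
) -> int:
    """Run the CLI as a child process and return its exit code."""
    argv, env = build_migrate_invocation(
        pinecone_index, pinecone_api_key, postgres_connection_string
    )
    return subprocess.run(argv, env=env, check=False).returncode
```

Because the credentials travel through `env` rather than `argv`, they do not show up in shell history or in the process argument list.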

The CLI provides a progress bar to monitor the migration.

On completion, you can view a copy of the Pinecone index data in Supabase Postgres at `vec2pg.<pinecone index name>`.

![view results](/assets/view_results.png)

From there you can transform and manipulate the data in Postgres using SQL.
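
As one illustration of working with the migrated data, the sketch below builds a nearest-neighbour query against the migrated table. It assumes the column layout created by this commit (`id`, `values`, `namespace`, `metadata`) and pgvector's `<->` (Euclidean distance) operator. `nearest_neighbors_query` is a hypothetical helper, not part of vec2pg, and the `%s` placeholders are meant to be bound by a driver such as psycopg.

```python
from typing import List, Tuple


def nearest_neighbors_query(
    pinecone_index: str,
    embedding: List[float],
    namespace: str = "",
    limit: int = 5,
) -> Tuple[str, tuple]:
    """Build (sql, params) selecting the rows closest to `embedding`."""
    # The table name is interpolated, so reject names that would break quoting.
    if '"' in pinecone_index:
        raise ValueError("index name must not contain double quotes")
    # Same naming scheme as the migration: schema vec2pg, quoted index name.
    table = f'vec2pg."{pinecone_index}"'
    sql = (
        f"select id, metadata from {table} "
        "where namespace = %s "
        'order by "values" <-> %s::vector '
        "limit %s"
    )
    # pgvector accepts vectors in the text form "[x1,x2,...]".
    vector_literal = "[" + ",".join(str(x) for x in embedding) + "]"
    return sql, (namespace, vector_literal, limit)
```

With an open psycopg connection `conn`, the result could be used as `conn.execute(sql, params).fetchall()`.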


# Requisites
- Python >= 3.8
@@ -84,7 +124,7 @@ To run the tests you will need
- docker
- [Pinecone API key](https://docs.pinecone.io/guides/get-started/authentication#find-your-pinecone-api-key)

The Pinecone API key should be stored as an environment variable `PINECONE_APIKEY`
The Pinecone API key should be stored as an environment variable `PINECONE_API_KEY`

Run the tests
```
Binary file added assets/pinecone_api_key.png
Binary file added assets/pinecone_index_name.png
Binary file added assets/pinecone_to_supabase_output.png
Binary file added assets/supabase_connection_params.png
Binary file added assets/view_results.png
98 changes: 55 additions & 43 deletions src/vec2pg/plugins/pinecone.py
@@ -1,4 +1,4 @@
-from typing import Annotated, Optional
+from typing import Annotated

 import numpy as np
 import psycopg
@@ -10,46 +10,44 @@
 app = typer.Typer()

 # Env Var Names
-PINECONE_APIKEY = "PINECONE_APIKEY"
+PINECONE_API_KEY = "PINECONE_API_KEY"
 PINECONE_NAMESPACE = "PINECONE_NAMESPACE"
 PINECONE_INDEX = "PINECONE_INDEX"
 POSTGRES_CONNECTION_STRING = "POSTGRES_CONNECTION_STRING"
 POSTGRES_SCHEMA_NAME = "POSTGRES_SCHEMA_NAME"
 POSTGRES_TABLE_NAME = "POSTGRES_TABLE_NAME"


-def to_qualified_table_name(pinecone_index, pinecone_namespace) -> str:
-    table_name = f"{pinecone_index}_{pinecone_namespace or 'all'}"
+def to_qualified_table_name(pinecone_index: str) -> str:
+    table_name = f"{pinecone_index}"
     return f'vec2pg."{table_name}"'


 @app.command()
 def migrate(
-    pinecone_apikey: Annotated[str, typer.Argument(envvar=PINECONE_APIKEY)],
-    pinecone_index: Annotated[str, typer.Argument(envvar=PINECONE_INDEX)],
-    pinecone_namespace: Annotated[
-        Optional[str], typer.Argument(envvar=PINECONE_NAMESPACE)
-    ],
+    pinecone_index: str,
+    pinecone_api_key: Annotated[str, typer.Argument(envvar=PINECONE_API_KEY)],
     postgres_connection_string: Annotated[
         str, typer.Argument(envvar=POSTGRES_CONNECTION_STRING)
     ],
 ):
     # Init Pinecone client and index
-    client = Pinecone(api_key=pinecone_apikey)
+    client = Pinecone(api_key=pinecone_api_key)
     index = client.Index(pinecone_index)
-    vector_count = index.describe_index_stats()["total_vector_count"]
+
+    index_description = index.describe_index_stats()
+    index_namespaces = [key for key in index_description["namespaces"]]
+    vector_count = index_description["total_vector_count"]

     # Prep the database with minimal requirements
     conn = psycopg.connect(postgres_connection_string, autocommit=True)
     conn.execute("create extension if not exists vector")
     conn.execute("create schema if not exists vec2pg")

     # Setup the Postgres table
-    qualified_name = to_qualified_table_name(pinecone_index, pinecone_namespace)
+    qualified_name = to_qualified_table_name(pinecone_index)
     conn.execute(f"drop table if exists {qualified_name}")  # type: ignore
-    create_table_query = (
-        f"create table {qualified_name} (id text, values vector, metadata json)"
-    )
+    create_table_query = f"create table {qualified_name} (id text, values vector, namespace text, metadata json)"
     conn.execute(create_table_query)  # type: ignore

     # Make psycopg aware of the vector type
@@ -58,31 +56,45 @@ def migrate(
     # Progress bar
     with tqdm(total=vector_count) as pbar:

-        # Iterate through the pinecone index
-        for ids in index.list(
-            namespace=pinecone_namespace,
-            limit=100,
-        ):
-            batch_result = index.fetch(ids, namespace=pinecone_namespace)
-            records = [
-                (rec["id"], np.array(rec["values"]), rec.get("metadata"))
-                for rec in batch_result.vectors.values()
-            ]
-
-            cur = conn.cursor()
-
-            with cur.copy(
-                f"""
-                copy {qualified_name}(id, values, metadata)
-                from stdin with (format binary)
-                """  # type: ignore
-            ) as copy:
-                copy.set_types(["text", "vector", "json"])
-
-                for rec in records:
-                    copy.write_row(rec)
-
-                while conn.pgconn.flush() == 1:
-                    pass
-
-            pbar.update(len(records))
+        for pinecone_namespace in index_namespaces:
+
+            # Iterate through the pinecone index
+            for ids in index.list(
+                namespace=pinecone_namespace,
+                limit=100,
+            ):
+                batch_result = index.fetch(ids, namespace=pinecone_namespace)
+                records = [
+                    (
+                        rec["id"],
+                        np.array(rec["values"]),
+                        pinecone_namespace,
+                        rec.get("metadata"),
+                    )
+                    for rec in batch_result.vectors.values()
+                ]
+
+                cur = conn.cursor()
+
+                with cur.copy(
+                    f"""
+                    copy {qualified_name}(id, values, namespace, metadata)
+                    from stdin with (format binary)
+                    """  # type: ignore
+                ) as copy:
+                    copy.set_types(["text", "vector", "text", "json"])
+
+                    for rec in records:
+                        copy.write_row(rec)
+
+                    while conn.pgconn.flush() == 1:
+                        pass
+
+                pbar.update(len(records))
+
+    typer.echo(
+        f"Pinecone index {pinecone_index} successfully written to Postgres table "
+        + typer.style(
+            qualified_name, fg=typer.colors.BLACK, bg=typer.colors.WHITE, bold=True
+        )
+    )
23 changes: 15 additions & 8 deletions tests/conftest.py
@@ -12,6 +12,7 @@
 from parse import parse
 from pinecone import Pinecone, ServerlessSpec
 from pinecone.data.index import Index
+from typer.testing import CliRunner

 from vec2pg.plugins import pinecone

@@ -114,12 +115,7 @@ def cursor(maybe_start_pg: None, postgres_connection_string: str):

 @pytest.fixture(scope="session")
 def pinecone_client() -> Pinecone:
-    return Pinecone(api_key=environ[pinecone.PINECONE_APIKEY])
-
-
-@pytest.fixture(scope="session")
-def pinecone_namespace():
-    return "ns1"
+    return Pinecone(api_key=environ[pinecone.PINECONE_API_KEY])


 @pytest.fixture(scope="session")
@@ -131,7 +127,7 @@ def pinecone_index_name():

 @pytest.fixture(scope="session")
 def pinecone_index(
-    pinecone_client, pinecone_namespace, pinecone_index_name
+    pinecone_client, pinecone_index_name
 ) -> Generator[Index, None, None]:

     pinecone_client.create_index(
@@ -143,16 +139,22 @@ def pinecone_index(

     index = pinecone_client.Index(pinecone_index_name)

+    # insert dummy records in 2 different namespaces
     index.upsert(
         vectors=[
             {"id": "vec1", "values": [1.0, 1.5], "metadata": {"key": "val"}},
             {"id": "vec2", "values": [2.0, 1.0]},
             {"id": "vec3", "values": [0.1, 3.0]},
+        ],
+        namespace="",
+    )
+    index.upsert(
+        vectors=[
             {"id": "vec4", "values": [1.0, -2.5]},
             {"id": "vec5", "values": [3.0, -2.0]},
             {"id": "vec6", "values": [0.5, -1.5]},
         ],
-        namespace=pinecone_namespace,
+        namespace="foo",
     )

     # Indexes are eventually consistent....
@@ -171,3 +173,8 @@ def pinecone_index(
         yield index
     finally:
         pinecone_client.delete_index(pinecone_index_name)
+
+
+@pytest.fixture(scope="session")
+def cli_runner():
+    yield CliRunner()
7 changes: 7 additions & 0 deletions tests/test_help.py
@@ -0,0 +1,7 @@
+from typer.testing import CliRunner
+
+from vec2pg.cli import app
+
+
+def test_app_help_does_not_error(cli_runner: CliRunner):
+    cli_runner.invoke(app, ["--help"])
16 changes: 7 additions & 9 deletions tests/test_pinecone.py
@@ -5,7 +5,7 @@
 from typer.testing import CliRunner

 from vec2pg.cli import app
-from vec2pg.plugins.pinecone import PINECONE_APIKEY, to_qualified_table_name
+from vec2pg.plugins.pinecone import PINECONE_API_KEY, to_qualified_table_name


 def test_pinecone_subcommand_does_not_error() -> None:
@@ -22,28 +22,26 @@ def test_index_is_good(pinecone_index: Index) -> None:


 def test_pinecone_migrate(
-    pinecone_namespace,
-    pinecone_index_name,
-    postgres_connection_string,
+    pinecone_index_name: str,
+    postgres_connection_string: str,
     cursor,
+    cli_runner: CliRunner,
 ) -> None:
-    runner = CliRunner()
-    result = runner.invoke(
+    result = cli_runner.invoke(
         app,
         [
             "pinecone",
             "migrate",
-            environ[PINECONE_APIKEY],
             pinecone_index_name,
-            pinecone_namespace,
+            environ[PINECONE_API_KEY],
             postgres_connection_string,
         ],
     )

     print(result.stdout)
     assert result.exit_code == 0

-    qualified_name = to_qualified_table_name(pinecone_index_name, pinecone_namespace)
+    qualified_name = to_qualified_table_name(pinecone_index_name)

     recs = cursor.execute(
         f"select id, values, metadata from {qualified_name}"
8 changes: 0 additions & 8 deletions tests/test_template.py

This file was deleted.
