Remote Vector Index Build Component — Remote Vector Service Client #2393

Open · Tracked by #2391
jed326 opened this issue Jan 14, 2025

See #2391 for background information

Overview

Following up on the RFCs, this is the first part of the low-level design for the Vector Index Build Component. The Vector Index Build Component is a logical component that we split into two subcomponents with the following responsibilities:

  1. Object Store I/O Component
    1. Upload flat vectors to Object Store
    2. Download graph file from object store
  2. Remote Vector Service Client Component
    1. Signal to Remote Vector Index Build Service to begin graph construction after vector files have been uploaded
    2. Receive a signal from Remote Vector Index Build Service to begin graph file download after graph construction is completed

This document contains the low-level design for [2] the Remote Vector Service Client Component, covering the design for the client the vector engine will use to interact with the remote vector build service as well as the workflows associated with using this client.

Tenets

The key tenets of the client are straightforward:

  1. Signal to the remote vector service that vector blob upload is complete and ready for graph construction
  2. Receive a signal from the remote vector service that graph construction is complete and graph download is ready
  3. Handle failures so that they do not fail the merge/flush operation

High level overview:

[Image: high-level overview diagram]

Alternatives Considered

1. [Recommended] REST requests with polling

In this approach we submit the graph build request via REST request and then use REST requests to poll the vector build service for completion status.

Pros:

  • Simplest implementation
  • Vector build service itself could still implement a queue to receive the build requests

Cons:

  • Higher request count to the remote vector build service; we need to configure a polling interval and smart retry logic
  • A state machine would be required to keep track of the build progress

2. Persistent connection (gRPC, websockets)

In this approach we open a persistent connection between the OpenSearch cluster and the remote vector build service and keep the connection open until the graph construction is complete.

Pros:

  • Lower request count to remote vector build service

Cons:

  • Need to build robust reconnect logic. How do we handle the case where vector build completes or fails during a disconnect?
  • The vector build service is multi-tenant, so we may be bottlenecked on the number of concurrent persistent connections to the remote vector build service. Specifically, any request queueing may lead to unexpected outcomes.
  • If we design a stateless system then it would be difficult to retry only specific actions, for example retrying only the graph upload part if that were to fail
  • There is very little data transfer happening through this client

3. REST callback

In this approach we submit a graph build request via REST request and expose a REST callback endpoint that the remote vector build service notifies when graph construction is complete.

Pros:

  • No need to maintain a persistent connection or poll for results

Cons:

  • Very difficult to pass a notification from the transport layer down to the index writer in the middle of segment merge operations, especially as GPU builds are per-segment rather than per-shard
  • The REST callback would require a coordination layer to figure out which node / shard / segment the callback is associated with, and this would have to exist outside of the segment merge
  • No way to get intermediate status, so we would have to heuristically determine how long to wait for the notification. Since graph builds may take on the order of hours, this could waste a lot of time if we need to fall back to the CPU build path

4. Queue based mechanism

In this approach we submit a graph build request to a queue rather than directly to the remote vector build service. We also consume graph build completion notifications through a separate queue.

Pros:

  • Same as [3]
  • Load shedding / balancing is easier with the queue in front of the remote vector build service

Cons:

  • A queue-based implementation can make it difficult to prioritize tasks, as we would not know the priority of any task until it is consumed from the queue
  • The additional queue infrastructure adds both cost and complexity
  • Same as [3]

Workflow

In addition to performing status checks, we also need to fall back to the local CPU build in remote failure scenarios. Below is a high-level workflow overview with highlighted components representing usage of the remote vector service client. In this diagram we do not make distinctions between failure statuses and failed HTTP requests; this will be clarified in the sections further below.

[Image: high-level workflow diagram]

Polling Based Client

We will implement a simple HTTP client that performs POST / GET requests against a configurable remote vector service endpoint.
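
To make the shape of this client concrete, below is a minimal sketch of what its surface could look like. The interface, method, and type names are illustrative assumptions, not a final API.

```java
import java.io.IOException;
import java.util.Map;

// Hypothetical client surface for the polling-based design; all names here are illustrative.
public interface RemoteVectorBuildServiceClient {

    // POST /build: submit a graph build request for an uploaded vector blob; returns the job ID.
    String triggerBuild(Map<String, Object> buildRequestBody) throws IOException;

    // GET /status: poll the current task status
    // (RUNNING_GRAPH_BUILD / FAILED_GRAPH_BUILD / COMPLETED_GRAPH_BUILD).
    String getStatus(String jobId) throws IOException;

    // POST /cancel: cancel an in-progress build; used for operational support.
    void cancelBuild(String jobId) throws IOException;
}
```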

Vector Build Service Client Configurations

This section covers the configurations we will expose so that a user can point the client at their remote vector build service.

  1. Cluster setting to store remote vector service endpoint
    1. We can also consider an index setting override for the remote vector service endpoint, as we may find certain types of indices perform better on certain types of specialized hardware. This is also logic that the vector build service itself could handle though. We are not planning on this for now.
  2. Cluster settings to store auth header information. For now we will support the following auth headers:
    1. Basic Auth
    2. API Keys

ml-commons connector docs for reference
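
As a rough illustration of how these could be registered, here is a minimal sketch using the OpenSearch `Setting` API. The setting keys, properties, and the choice of where auth material lives (plain cluster settings vs. keystore-backed secure settings) are assumptions for illustration only.

```java
import org.opensearch.common.settings.Setting;

// Hypothetical cluster settings; keys, defaults, and properties are illustrative, not final.
public final class RemoteVectorBuildSettings {

    // Endpoint of the remote vector build service, updatable dynamically at the cluster level.
    public static final Setting<String> REMOTE_BUILD_ENDPOINT = Setting.simpleString(
        "knn.remote.vector.build.service.endpoint",
        Setting.Property.NodeScope,
        Setting.Property.Dynamic
    );

    // Selects the auth scheme (e.g. "basic" or "api_key"); the credentials themselves would
    // more likely live in the keystore as secure settings rather than plain cluster settings.
    public static final Setting<String> REMOTE_BUILD_AUTH_TYPE = Setting.simpleString(
        "knn.remote.vector.build.service.auth.type",
        Setting.Property.NodeScope,
        Setting.Property.Dynamic
    );

    private RemoteVectorBuildSettings() {}
}
```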

Trigger Vector Build

POST /build

Input:
- type: The remote object store type (s3 / azure / gcs, etc)
- container: The name of the container (s3 bucket, azure container, gcs bucket)
- Vector file: Full file path to the vector file, including the container base path
- Index parameters: JSON object including all required graph parameters
- Tenant ID: Unique identifier for the cluster making the request. This can be used for billing, authorization, etc.

Output:
- Job ID: Unique identifier both the vector engine and remote vector build service will use to associate the vector build task.

This API needs to be idempotent in order to support retries.

Additionally, we do not create a separate task ID to track the vector build status, because this would require the vector build service to internally maintain a mapping between the task ID and the graph file being built, and this mapping would need to be persisted after graph construction is complete in order to signal the vector engine to download the graph file from the object store. In failure scenarios, this makes it complicated for the vector build service to determine how long to persist task IDs after graph construction is complete.

The key invariant is that the vector blob path is unique to the specific segment being worked on, so that path can be used to associate a status request with a given graph build request. Moreover, since there is a 1:1 mapping between the constructed graph file and the vector blob, any status request could simply check for the existence of a graph file to determine whether a graph build is complete (whether to do so is left up to the implementation of the remote vector build service).
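
For illustration, a /build call could look roughly like the following. The endpoint, JSON field names, and file path are placeholders assumed for this example; the real client would reuse a shared HTTP client with the configured auth headers.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class TriggerBuildExample {
    public static void main(String[] args) throws Exception {
        // Request body mirroring the input spec above; all values are placeholders.
        String body = """
            {
              "type": "s3",
              "container": "my-vector-bucket",
              "vector_file": "base/path/segment_0_vectors.knnvec",
              "index_parameters": { "algorithm": "hnsw", "m": 16, "ef_construction": 100 },
              "tenant_id": "cluster-uuid-1234"
            }
            """;

        HttpRequest request = HttpRequest.newBuilder()
            .uri(URI.create("https://remote-vector-build-service.example.com/build"))
            .header("Content-Type", "application/json")
            .POST(HttpRequest.BodyPublishers.ofString(body))
            .build();

        // The response is expected to carry the job ID used by subsequent /status and /cancel calls.
        HttpResponse<String> response = HttpClient.newHttpClient()
            .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println("Build response: " + response.body());
    }
}
```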

Get Vector Build Status

For the vector build status, the key design decision is the verbosity of the status outputs and the subsequent state machine implementation:

1. [Recommended] Low verbosity for maximum compatibility

The possible task statuses in this solution would look like:

  1. RUNNING_GRAPH_BUILD -- Graph build task is in progress. This state represents all time between when the build request is submitted and when the graph upload is complete.
  2. FAILED_GRAPH_BUILD -- Graph build task has failed
  3. COMPLETED_GRAPH_BUILD -- Graph build task has completed, including graph upload

Pros

  1. Fewest states, for simplicity
  2. Specific retry implementation logic would be determined by the vector build service rather than implicitly defined by the client (see: Status Request Failure Response)
  3. Fewer configurations needed for retries and failure scenarios

Cons

  1. No granular visibility into remote vector build service components, such as vector download time, graph build time, graph upload time

GET /status

Input:
- Job ID: Unique identifier both the vector engine and remote vector build service will use to associate the vector build task.

Output:
- Task Status:
    1. RUNNING_GRAPH_BUILD -- Graph build task is in progress. This state represents all time between when the build request is submitted and when the graph upload is complete.
    2. FAILED_GRAPH_BUILD -- Graph build task has failed
    3. COMPLETED_GRAPH_BUILD -- Graph build task has completed, including graph upload
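
Under the recommended low-verbosity option, the client-side handling reduces to mapping three statuses onto three actions. A minimal sketch follows; the enum name and parsing behavior are assumptions, not the final implementation.

```java
// Hypothetical client-side representation of the low-verbosity statuses.
public enum RemoteBuildStatus {
    RUNNING_GRAPH_BUILD,    // keep polling
    FAILED_GRAPH_BUILD,     // re-submit /build or fall back to the CPU build path
    COMPLETED_GRAPH_BUILD;  // download the graph file from the object store

    // Parse the task status field from a /status response; unknown or missing values are
    // treated as failures so the merge/flush path falls back rather than hanging.
    public static RemoteBuildStatus fromResponse(String taskStatus) {
        try {
            return RemoteBuildStatus.valueOf(taskStatus);
        } catch (IllegalArgumentException | NullPointerException e) {
            return FAILED_GRAPH_BUILD;
        }
    }
}
```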

2. High verbosity for granular visibility

The possible task statuses in this solution would look like:

  1. PENDING_VECTOR_DOWNLOAD
  2. RUNNING_VECTOR_DOWNLOAD
  3. FAILED_VECTOR_DOWNLOAD
  4. PENDING_GRAPH_BUILD
  5. RUNNING_GRAPH_BUILD
  6. FAILED_GRAPH_BUILD
  7. PENDING_GRAPH_UPLOAD
  8. RUNNING_GRAPH_UPLOAD
  9. COMPLETED_GRAPH_UPLOAD
  10. FAILED_GRAPH_UPLOAD

Pros

  1. Granular visibility into remote vector build service components, such as vector download time, graph build time, graph upload time, etc.

Cons

  1. Additional complexity involved in managing state transitions between all the success and failure states
  2. More complexity in designing the remote vector build service component as the client is strictly dictating the states the vector build service needs to maintain.
  3. More tightly coupled client/service
  4. Retry logic (in state machine) will need to be handled client side

Cancel Vector Build

We also provide a cancellation API so that operators can cancel specific graph build tasks.

POST /cancel

Input:
- Job ID: Unique identifier both the vector engine and remote vector build service will use to associate the vector build task.

Output:
- Request acknowledgment

Internal State Machine

Because we want to proceed with less verbose statuses and leave the more specific retry implementation up to the remote vector build service itself, we do not need to (and do not want to) maintain a complicated state machine for each remote vector build. The following diagram contains the internal state machine for each remote vector build task as well as the state transitions based on remote vector service client responses.

[Image: internal state machine diagram]

Since the states and transitions are very straightforward, we will not maintain this state machine as a DAG or any other data structure from within the segment merge/flush operation; instead, we use the term “state machine” as a way to formalize the expected outcomes of each API response.
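
Below is a minimal sketch of the per-build polling loop, reusing the hypothetical client interface and status enum sketched above. The interval, timeout, and error handling are assumptions and would be driven by cluster settings rather than method parameters.

```java
import java.io.IOException;
import java.util.concurrent.TimeUnit;

// Minimal sketch of the per-build polling loop; not the final implementation.
public class RemoteBuildPoller {

    // Returns true if the remote build completed, false if it failed or timed out,
    // in which case the caller re-submits /build or falls back to the local CPU build.
    public boolean waitForCompletion(RemoteVectorBuildServiceClient client, String jobId,
                                     long pollIntervalSeconds, long timeoutSeconds) throws InterruptedException {
        long deadline = System.nanoTime() + TimeUnit.SECONDS.toNanos(timeoutSeconds);
        while (System.nanoTime() < deadline) {
            try {
                switch (RemoteBuildStatus.fromResponse(client.getStatus(jobId))) {
                    case COMPLETED_GRAPH_BUILD:
                        return true;   // caller proceeds to download the graph file
                    case FAILED_GRAPH_BUILD:
                        return false;  // caller retries /build or falls back to the CPU build
                    case RUNNING_GRAPH_BUILD:
                        break;         // keep polling
                }
            } catch (IOException e) {
                // Transport-level failures are governed by the request retry policy (see Request Failures).
            }
            TimeUnit.SECONDS.sleep(pollIntervalSeconds);
        }
        return false;  // timed out
    }
}
```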

Failure scenarios, including retry logic, are discussed in the next section below: Status Request Failure Response.

Failure Scenarios

This section covers the various failure scenarios related to the client and how we would handle each one; in particular, we need to distinguish between retriable and non-retriable results.

Status Request Failure Response

This covers the scenario where we do not receive any request failures, but the /status API indicates that the graph build failed. To retry in this scenario, we will need to submit another /build request to the remote vector build service.

The number of times we will re-submit the /build request will be controlled by a cluster setting, and the specific failure retry implementation will be left up to the remote vector build service. For example, if a failure happens in the graph upload step, we will leave it up to the remote vector build service to decide whether to retry only the graph upload to the object store or to start from scratch and rebuild the graph. This type of retry across nodes is naturally unsynchronized, as it is up to the job scheduler of the remote vector build service to schedule the graph build jobs.

Request Failures

This covers any failure responses received when calling the /build and the /status APIs. From the remote vector service client perspective, the main information available to us to make a determination on whether to retry or not is the HTTP status code, and for that we should follow the AWS SDK retry standards on transient errors (source). This means the following status codes will be eligible for retry:

  • 429
  • 500
  • 502
  • 503
  • 504
  • 509

It will be up to the specific remote vector build service to implement these status codes — for example it’s left up to a specific service to choose whether to throw 403 or 404 when a request is received for a non-existent vector blob.

For this type of failure scenario we will provide a separate client retry/backoff + jitter configuration (a cluster setting separate from the one described in Status Request Failure Response) to retry failed HTTP requests. To mitigate the unlikely scenario of synchronized retries across nodes, we will implement retries with exponential backoff + jitter.
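
A minimal sketch of what such a retry policy could look like is below. The retriable code set mirrors the list above, while the base delay, cap, and attempt limit are assumed to come from cluster settings rather than the constants shown here.

```java
import java.util.Set;
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.TimeUnit;

// Minimal sketch of exponential backoff + full jitter for failed HTTP requests.
public final class HttpRetryPolicy {

    // Status codes eligible for retry, per the list above.
    private static final Set<Integer> RETRIABLE_STATUS_CODES = Set.of(429, 500, 502, 503, 504, 509);

    public static boolean isRetriable(int httpStatusCode) {
        return RETRIABLE_STATUS_CODES.contains(httpStatusCode);
    }

    // Full jitter: sleep a random duration in [0, min(cap, base * 2^attempt)].
    public static void backoff(int attempt, long baseMillis, long capMillis) throws InterruptedException {
        long ceiling = Math.min(capMillis, baseMillis * (1L << Math.min(attempt, 30)));
        long sleepMillis = ThreadLocalRandom.current().nextLong(ceiling + 1);
        TimeUnit.MILLISECONDS.sleep(sleepMillis);
    }

    private HttpRetryPolicy() {}
}
```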

Metrics

This section covers metrics specific to the remote vector service client and its usage. Other metrics emitted by the remote vector build service itself will be handled in a separate document.

  1. Build Request Success/Failure Count
  2. Build Request Retry Count
  3. Status Request Success/Failure Count
  4. Overall Graph Build Success/Failure Count
  5. Overall Graph Build Retry Count

Today the k-NN stats API only supports cluster- and node-level stats, so we can gather these metrics at the cluster/node level and expose them via the k-NN stats API.

As a separate item, we should explore supporting index/shard-level k-NN stats, as it would be valuable to see which indices are using and benefiting the most from the remote vector build service.

Future Improvements

Although a polling-based client will be the simplest implementation in the first iteration, we may encounter scaling problems as adoption of the feature increases. In a future low-level design we will further explore how to design the state machine and state transitions so that they are forward compatible with any client architecture changes. For now we are keeping the number of statuses as small as possible to make doing so easier in the future.
