Remote Vector Index Build Component — Remote Vector Service Client #2393
Labels: Features, indexing-improvements, Roadmap:Vector Database/GenAI
See #2391 for background information
Overview
Following up on the RFCs, this is the first part of the low-level design for the Vector Index Build Component. The Vector Index Build Component is a logical component that we further split into two subcomponents with their respective responsibilities:
This document contains the low-level design for the [2] Remote Vector Service Client Component, covering the design of the client the vector engine will use to interact with the remote vector build service, as well as the workflows associated with using this client.
Tenets
The key tenets of the client are straightforward:
High level overview:
Alternatives Considered
1. [Recommended] REST requests with polling
In this approach we submit the graph build request via REST request and then use REST requests to poll the vector build service for completion status.
Pros:
Cons:
2. Persistent connection (gRPC, websockets)
In this approach we open a persistent connection between the OpenSearch cluster and the remote vector build service and keep the connection open until the graph construction is complete.
Pros:
Cons:
3. REST callback
In this approach we submit a graph build request via REST request and expose a REST callback endpoint for the remote vector build service to notify when the graph construction is complete.
Pros:
Cons:
4. Queue based mechanism
In this approach we submit a graph build request to a queue rather than directly to the remote vector build service. We also consume graph build completed notifications through a separate queue.
Pros:
Cons:
Workflow
In addition to performing status checks, we also need to fall back to the local CPU build in remote failure scenarios. Below is a high-level workflow overview, with highlighted components representing usage of the remote vector service client. In this diagram we do not distinguish between failure statuses and failed HTTP requests; this will be clarified in the sections further below.
Polling Based Client
We will implement a simple HTTP client that performs POST/GET requests against a configurable remote vector service endpoint.
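As a rough sketch of what such a client could look like, the JDK's built-in java.net.http.HttpClient is sufficient for simple POST/GET polling. The paths, query parameter, and payload handling below are illustrative assumptions, not the final API surface.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

// Illustrative sketch only: endpoint paths and payload shapes are assumptions.
public class RemoteVectorServiceClient {

    private final HttpClient httpClient = HttpClient.newBuilder()
        .connectTimeout(Duration.ofSeconds(10))
        .build();

    // Configurable remote vector build service endpoint (see client configurations below).
    private final URI endpoint;

    public RemoteVectorServiceClient(URI endpoint) {
        this.endpoint = endpoint;
    }

    // Submits a graph build request; the vector blob path identifies the segment being built.
    public HttpResponse<String> triggerBuild(String buildRequestJson) throws Exception {
        HttpRequest request = HttpRequest.newBuilder(endpoint.resolve("/build"))
            .header("Content-Type", "application/json")
            .POST(HttpRequest.BodyPublishers.ofString(buildRequestJson))
            .build();
        return httpClient.send(request, HttpResponse.BodyHandlers.ofString());
    }

    // Polls the build status for the given vector blob path.
    public HttpResponse<String> getBuildStatus(String vectorBlobPath) throws Exception {
        HttpRequest request = HttpRequest.newBuilder(endpoint.resolve("/status?vector_blob_path=" + vectorBlobPath))
            .GET()
            .build();
        return httpClient.send(request, HttpResponse.BodyHandlers.ofString());
    }
}
```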
Vector Build Service Client Configurations
This section covers all of the configurations we will expose so that a user can point the client at their remote vector build service.
ml-commons connector docs for reference
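As a rough illustration, the client configuration could be registered as dynamic cluster settings. The setting keys, defaults, and types below are placeholders only; the final names are not decided in this document.

```java
import org.opensearch.common.settings.Setting;
import org.opensearch.common.settings.Setting.Property;

// Hypothetical setting keys and defaults; shown only to illustrate the shape of the configuration.
public class RemoteVectorServiceClientSettings {

    // Endpoint of the remote vector build service the client will call.
    public static final Setting<String> REMOTE_BUILD_SERVICE_ENDPOINT =
        Setting.simpleString("knn.remote_build_service.endpoint", Property.NodeScope, Property.Dynamic);

    // How often (in seconds) to poll GET /status while a build is in progress.
    public static final Setting<Integer> REMOTE_BUILD_POLL_INTERVAL_SECONDS =
        Setting.intSetting("knn.remote_build_service.poll_interval_seconds", 30, Property.NodeScope, Property.Dynamic);

    // Maximum number of times a failed build is re-submitted before falling back to a local CPU build.
    public static final Setting<Integer> REMOTE_BUILD_MAX_RETRIES =
        Setting.intSetting("knn.remote_build_service.max_build_retries", 1, Property.NodeScope, Property.Dynamic);
}
```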
Trigger Vector Build
POST /build
This API needs to be idempotent in order to support retries.
Additionally, we do not create a task ID to track the vector build status, because this would require the vector build service to internally maintain a mapping between the task ID and the graph file being built, and this mapping would need to be persisted after graph construction is complete in order to signal the vector engine to download the graph file from the object store. In failure scenarios, this makes it complicated for the vector build service to determine how long to persist task IDs after graph construction is complete.
The key invariant is that the vector blob path is unique to the specific segment being worked on, so that path can be used to associate a status request with a given graph build request. Moreover, since there is a 1:1 mapping between the constructed graph file and the vector blob, any status request could simply check for the existence of a graph file to determine whether a graph build is complete (whether to do so is left up to the implementation of the remote vector build service).
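To make the idempotency argument concrete, a hypothetical build request could carry the vector blob path plus whatever build parameters the service needs, and that same path would then key the status check. The field names and values below are assumptions, and the client class is the sketch from earlier in this document.

```java
// Hypothetical trigger-build usage; the request body fields are illustrative assumptions.
public class TriggerBuildExample {
    public static void main(String[] args) throws Exception {
        // The vector blob path is unique to the segment being built, so retrying the
        // same request is idempotent and no separate task ID needs to be persisted.
        String vectorBlobPath = "<object-store-path-unique-to-the-segment>";
        String buildRequestJson = """
            {
              "vector_blob_path": "%s",
              "dimension": 128,
              "engine": "faiss",
              "index_parameters": { "m": 16, "ef_construction": 100 }
            }
            """.formatted(vectorBlobPath);

        RemoteVectorServiceClient client =
            new RemoteVectorServiceClient(java.net.URI.create("https://remote-build-service.example.com"));
        client.triggerBuild(buildRequestJson);

        // Later, the same path correlates a status poll with the original build request.
        client.getBuildStatus(vectorBlobPath);
    }
}
```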
Get Vector Build Status
For the vector build status the key design decision is the verbosity of the status outputs and subsequent state machine implementations:
1. [Recommended] Low verbosity for maximum compatibility
The possible task statuses in this solution would look like:
Pros
Cons
2. High Verbosity for granular visibility
The possible task statuses in this solution would look like:
Pros
Cons
Cancel Vector Build
We also provide a cancellation API for operational support in order to cancel specific graph build tasks.
Internal State Machine
Because we want to proceed with less verbose statuses and leave the more specific retry implementation up to the remote vector build service itself, we do not need to (and do not want to) maintain a complicated state machine for each remote vector build. The following diagram contains the internal state machine for each remote vector build task, as well as the state transitions based on remote vector service client responses.
Since the states and transitions are very straightforward, we will not maintain this state machine as a DAG or any other data structure within the segment merge/flush operation; instead, we use the term “state machine” as a way to formalize the expected outcomes of each API response.
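A rough sketch of how these states and transitions could be expressed in code is shown below. The state and status names themselves are assumptions, since the final status vocabulary is intentionally kept small and is decided in the status API section above.

```java
// Illustrative only: state names are assumptions; the final status set is intentionally small.
public enum RemoteBuildState {
    BUILD_SUBMITTED,   // POST /build accepted by the remote vector build service
    BUILD_IN_PROGRESS, // GET /status reports the graph is still being built
    BUILD_SUCCEEDED,   // GET /status reports completion; download the graph file from the object store
    BUILD_FAILED;      // GET /status reports failure; re-submit /build or fall back to the local CPU build

    // The "state machine" is simply the expected outcome of each API response,
    // evaluated inline during the segment merge/flush, not a persisted DAG.
    static RemoteBuildState fromStatusResponse(String status) {
        switch (status) {
            case "IN_PROGRESS": return BUILD_IN_PROGRESS;
            case "COMPLETED":   return BUILD_SUCCEEDED;
            case "FAILED":      return BUILD_FAILED;
            default: throw new IllegalArgumentException("Unknown status: " + status);
        }
    }
}
```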
Failure scenarios, including retry logic, are discussed in the next section below: Status Request Failure Response.
Failure Scenarios
This section covers the various failure scenarios related to the client and how we would handle each one; in particular, we need to distinguish between retriable and non-retriable results.
Status Request Failure Response
This covers the scenario where we do not receive any request failures, but the /status API indicates that the graph build failed. To retry in these scenarios, we will need to submit another /build request to the remote vector build service.
The number of times we re-submit the /build request will be controlled by a cluster setting, and the specific failure retry implementation will be left up to the remote vector build service. For example, if a failure happens in the graph upload step, we leave it up to the remote vector build service to decide whether to retry only the re-upload of the graph to the object store or to start from scratch and rebuild the graph. This type of retry across nodes is naturally unsynchronized, as it is up to the job scheduler of the remote vector build service to schedule the graph build jobs.
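A sketch of how the bounded re-submission and CPU fallback could look from the vector engine side is below. The retry limit is assumed to come from the cluster setting mentioned above, and buildRemotely()/buildLocally() are placeholder methods, not existing APIs.

```java
// Sketch: re-submit /build up to a configurable limit, then fall back to the local CPU build.
public abstract class RemoteBuildWithFallback {

    protected abstract boolean buildRemotely(String vectorBlobPath); // POST /build, then poll /status to a terminal state
    protected abstract void buildLocally(String vectorBlobPath);     // existing CPU-based graph build

    public boolean buildGraph(String vectorBlobPath, int maxBuildRetries) {
        for (int attempt = 0; attempt <= maxBuildRetries; attempt++) {
            if (buildRemotely(vectorBlobPath)) {
                return true; // graph file is ready in the object store
            }
            // /status reported FAILED: whether the service retries internally (e.g. re-upload only)
            // is left to the remote vector build service; we simply submit a fresh build request.
        }
        buildLocally(vectorBlobPath); // fall back to the local CPU build
        return false;
    }
}
```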
Request Failures
This covers any failure responses received when calling the /build and /status APIs. From the remote vector service client perspective, the main information available to us for deciding whether to retry is the HTTP status code, and for that we should follow the AWS SDK retry standards on transient errors (source). This means the following status codes will be eligible for retry:
It will be up to the specific remote vector build service to implement these status codes — for example it’s left up to a specific service to choose whether to throw 403 or 404 when a request is received for a non-existent vector blob.
For this type of failure scenario we will provide a separate client retry/backoff + jitter configuration (a separate cluster setting from Status Request Failure Response) to be used to retry failed HTTP requests. In order to mitigate the unlikely scenario of synchronized retries across nodes we will implement retries with exponential backoff + jitter.
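A minimal sketch of exponential backoff with full jitter for transient HTTP failures is shown below. The set of retriable status codes and the timing constants are assumptions for illustration, not final values.

```java
import java.util.Set;
import java.util.concurrent.ThreadLocalRandom;

// Sketch of retrying failed HTTP requests with exponential backoff + jitter.
// Status codes and timing constants are illustrative assumptions.
public class TransientFailureRetry {

    // Codes commonly treated as transient (throttling / server-side errors); assumption, not a final list.
    private static final Set<Integer> RETRIABLE_STATUS_CODES = Set.of(429, 500, 502, 503, 504);

    public static boolean isRetriable(int statusCode) {
        return RETRIABLE_STATUS_CODES.contains(statusCode);
    }

    // Full jitter: sleep a random duration in [0, min(cap, base * 2^attempt)].
    public static long backoffMillis(int attempt, long baseMillis, long capMillis) {
        long exponential = Math.min(capMillis, baseMillis * (1L << Math.min(attempt, 20)));
        return ThreadLocalRandom.current().nextLong(exponential + 1);
    }
}
```

Randomizing the sleep within the exponential window is what keeps retries from different nodes from lining up, which addresses the synchronized-retry concern noted above.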
Metrics
This section covers metrics specific to the remote vector service client and its usage. Other metrics emitted by the remote vector build service itself will be handled in a separate document.
Today the k-NN stats API only supports cluster- and node-level stats, so we can gather these metrics at the cluster/node level and expose them via the k-NN stats API.
As a separate item, we should explore supporting index/shard-level k-NN stats, as it would be valuable to see which indices are using and benefiting the most from the remote vector build service.
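As an illustration of the node-level counters mentioned above, simple atomic counters surfaced through the existing stats mechanism would be enough; the metric names below are placeholders, not final stat keys.

```java
import java.util.Map;
import java.util.concurrent.atomic.LongAdder;

// Node-level counters for remote vector service client usage.
// Metric names are placeholders; they would be surfaced through the k-NN stats API.
public class RemoteBuildClientStats {
    private final LongAdder buildRequestsSubmitted = new LongAdder();
    private final LongAdder buildRequestsFailed = new LongAdder();
    private final LongAdder remoteBuildsSucceeded = new LongAdder();
    private final LongAdder remoteBuildsFellBackToCpu = new LongAdder();

    public void onBuildSubmitted() { buildRequestsSubmitted.increment(); }
    public void onBuildRequestFailed() { buildRequestsFailed.increment(); }
    public void onRemoteBuildSucceeded() { remoteBuildsSucceeded.increment(); }
    public void onFallbackToCpuBuild() { remoteBuildsFellBackToCpu.increment(); }

    public Map<String, Long> asMap() {
        return Map.of(
            "remote_build_requests_submitted", buildRequestsSubmitted.sum(),
            "remote_build_requests_failed", buildRequestsFailed.sum(),
            "remote_builds_succeeded", remoteBuildsSucceeded.sum(),
            "remote_builds_fallback_to_cpu", remoteBuildsFellBackToCpu.sum());
    }
}
```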
Future Improvements
Although a polling-based client will be the simplest implementation for the first iteration, we may encounter scaling problems as adoption of the feature increases. In a future low-level design we will further explore how to design the state machine and state transitions in a way that is forward compatible with any client architecture changes. For now, we are keeping the number of statuses as small as possible to make this easier in the future.