Allow OpenSearchReader to output to MaterializedDataset consisting of refs #1029

Draft
wants to merge 2 commits into
base: main

austintlee
Contributor

@austintlee austintlee commented Nov 18, 2024

Sycamore connectors (implementations of BaseDBReader) produce a MaterializedDataset via from_items:

return from_items(items=[{"doc": doc.serialize()} for doc in self.read_docs()])

This becomes a problem for queries that need to scan a large number of documents, because every serialized document is loaded into memory when the dataset is constructed.

Broadly, we can take two approaches to this problem. The first is to reduce the amount of data that gets loaded into memory when constructing a MaterializedDataset: we serialize only references to the docs, and when we deserialize, we fetch the docs from storage on demand.
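To make the reference idea concrete, here is a minimal sketch, not the code in this PR. `DocRef`, `read_refs`, `fetch`, and the in-memory `FAKE_INDEX` are hypothetical stand-ins (the dict plays the role of the backing store), and the sketch stops at building the item list rather than calling from_items.

```python
import json
from dataclasses import dataclass
from typing import Dict, Iterator, List

# Hypothetical stand-in for the backing store (e.g. an OpenSearch index).
FAKE_INDEX: Dict[str, dict] = {
    "doc-1": {"doc_id": "doc-1", "text": "a" * 1000},
    "doc-2": {"doc_id": "doc-2", "text": "b" * 1000},
}


@dataclass(frozen=True)
class DocRef:
    """Lightweight reference: enough to locate a document, not its body."""

    index: str
    doc_id: str

    def serialize(self) -> bytes:
        return json.dumps({"index": self.index, "doc_id": self.doc_id}).encode()

    @staticmethod
    def deserialize(raw: bytes) -> "DocRef":
        data = json.loads(raw)
        return DocRef(index=data["index"], doc_id=data["doc_id"])


def read_refs(index: str) -> Iterator[DocRef]:
    """Yield references only; document bodies stay in storage."""
    for doc_id in FAKE_INDEX:
        yield DocRef(index=index, doc_id=doc_id)


def fetch(ref: DocRef) -> dict:
    """Resolve a reference into the full document on demand."""
    return FAKE_INDEX[ref.doc_id]


# Items in the same shape as today, but each holds a small ref instead of a full doc.
items: List[dict] = [{"doc": ref.serialize()} for ref in read_refs("my-index")]

# Later, e.g. inside a map step, each ref is resolved only when it is needed.
docs = [fetch(DocRef.deserialize(item["doc"])) for item in items]
```

The dataset's memory footprint then scales with the number of references rather than with the size of the documents; the cost is an extra fetch per document at deserialization time.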

Alternatively, we can introduce a custom Datasource (e.g. SqlDatasource or MongoDatasource) for each of our connectors. With this approach, we get the streaming and read parallelism that are native to Ray.
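The shape of that second approach can be illustrated without Ray. The sketch below is plain Python and deliberately does not use Ray's Datasource API; `make_read_tasks`, the fake document bodies, and the round-robin partitioning are all assumptions made for illustration. It only shows the two properties mentioned above: the read is split into independent tasks, and each task yields batches lazily instead of materializing everything up front.

```python
from typing import Callable, Iterator, List

Batch = List[dict]
ReadTask = Callable[[], Iterator[Batch]]


def make_read_tasks(doc_ids: List[str], parallelism: int, batch_size: int) -> List[ReadTask]:
    """Split the id space into independent, lazily evaluated read tasks."""
    parallelism = max(1, min(parallelism, len(doc_ids) or 1))
    # Round-robin partitioning; a real connector might split by shard or slice.
    partitions = [doc_ids[i::parallelism] for i in range(parallelism)]

    def make_task(partition: List[str]) -> ReadTask:
        def read() -> Iterator[Batch]:
            # Each batch is produced only when the consumer asks for it.
            for start in range(0, len(partition), batch_size):
                ids = partition[start:start + batch_size]
                # Hypothetical fetch: a real task would query the store here.
                yield [{"doc_id": i, "text": f"body of {i}"} for i in ids]

        return read

    return [make_task(p) for p in partitions]


all_ids = [f"doc-{n}" for n in range(10)]
tasks = make_read_tasks(all_ids, parallelism=3, batch_size=2)

# Tasks are run sequentially here; a scheduler could run them in parallel.
streamed = [doc["doc_id"] for task in tasks for batch in task() for doc in batch]
```

In Ray, a Datasource hands the scheduler a list of read tasks in a similar spirit, so a per-connector implementation would mostly need to decide how to partition the query.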

In this PR, I demonstrate a working example of the first approach for the OpenSearch reader/connector.
