Allow OpenSearchReader to output to MaterializedDataset consisting of refs #1029

austintlee · 2024-11-18T09:44:57Z

Sycamore connectors (implementations of BaseDBReader) produce a MaterializedDataset via from_items:

sycamore/lib/sycamore/sycamore/connectors/base_reader.py

Line 83 in 6039f8c

return from_items(items=[{"doc": doc.serialize()} for doc in self.read_docs()])

and this creates a memory issue for queries that need to scan a large number of documents, since every serialized document is materialized up front.

Broadly, there are two approaches to addressing this problem. The first is to reduce the amount of data loaded into memory when constructing a MaterializedDataset: we serialize only references to the docs, and when we deserialize, we fetch the docs from storage on demand.
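To make the reference-based idea concrete, here is a minimal, library-free sketch. It is not the code in this PR: `DocRef`, `FAKE_INDEX`, `read_refs`, `resolve` and `iter_docs` are hypothetical names, and a plain dict stands in for the OpenSearch index.

```python
from dataclasses import dataclass
from typing import Dict, Iterator, List

# Hypothetical stand-in for an OpenSearch index: doc_id -> full document body.
FAKE_INDEX: Dict[str, dict] = {
    "a": {"doc_id": "a", "text": "x" * 1000},
    "b": {"doc_id": "b", "text": "y" * 1000},
}


@dataclass(frozen=True)
class DocRef:
    """Lightweight reference to a stored doc (hypothetical, not a Sycamore class)."""

    index: str
    doc_id: str


def read_refs(index: str) -> List[dict]:
    """Build dataset rows from references only, so the rows stay small."""
    return [{"ref": DocRef(index, doc_id)} for doc_id in FAKE_INDEX]


def resolve(ref: DocRef) -> dict:
    """Fetch the full document on demand (here: a dict lookup)."""
    return FAKE_INDEX[ref.doc_id]


def iter_docs(rows: List[dict]) -> Iterator[dict]:
    """Resolve one row at a time instead of loading every document up front."""
    for row in rows:
        yield resolve(row["ref"])


rows = read_refs("my-index")
first = next(iter_docs(rows))
```

The rows handed to the dataset carry only an index name and an id; the document body is fetched when a row is actually consumed. The trade-off is one extra fetch from storage per document at read time.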

Alternatively, we can introduce a custom Datasource for each of our connectors, along the lines of SqlDatasource and MongoDatasource. With this approach, we get the streaming and read parallelism that are native to Ray.
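For comparison, the second approach can be pictured roughly as follows. This sketch deliberately does not use Ray's Datasource API; it only mirrors the general idea of splitting one large read into independent tasks that each yield pages lazily. `ALL_IDS`, `fetch_page` and `make_read_tasks` are hypothetical names, and the list of ids stands in for the hits of a large query.

```python
from typing import Callable, Iterator, List

# Hypothetical document ids standing in for the hits of a large query.
ALL_IDS = [f"doc-{i}" for i in range(10)]

ReadTask = Callable[[], Iterator[List[dict]]]


def fetch_page(start: int, stop: int) -> List[dict]:
    """Stand-in for one paged request to the backing store."""
    return [{"doc_id": d} for d in ALL_IDS[start:stop]]


def make_read_tasks(parallelism: int, page_size: int = 2) -> List[ReadTask]:
    """Split the id range into at most `parallelism` contiguous slices."""
    n = len(ALL_IDS)
    step = -(-n // parallelism)  # ceiling division

    def task(lo: int, hi: int) -> ReadTask:
        def read() -> Iterator[List[dict]]:
            # Yield one page at a time so a slice is never fully in memory.
            for start in range(lo, hi, page_size):
                yield fetch_page(start, min(start + page_size, hi))

        return read

    return [task(lo, min(lo + step, n)) for lo in range(0, n, step)]


tasks = make_read_tasks(parallelism=3)
# Run the tasks sequentially here; a scheduler could run them in parallel.
streamed = [row["doc_id"] for t in tasks for page in t() for row in page]
```

Each task is independent of the others, which is what lets an execution engine schedule them in parallel and stream their output downstream.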

In this PR, I demonstrate a working example of the first approach for the OpenSearch reader/connector.

Allow OpenSearchReader to output to MaterializedDataset consisting of…

47d4df4

… refs

austintlee requested review from bsowell, eric-anderson and karanataryn November 18, 2024 16:10

Fix a bug in iter_rows.

b68d2bb

austintlee mentioned this pull request Dec 17, 2024

Allow Doc reconstruct via function #1072

Merged
