
Allow Doc reconstruct via function #1072

Merged
merged 10 commits into from
Dec 19, 2024
Conversation

austintlee (Contributor):

No description provided.

for data in self.output:
doc_id = data["_source"]["parent_id"] or data["_id"]
if doc_id not in unique:
# The reconstructor may or may not fetch the document from a cache
Collaborator:
I'd remove the comment; it's assuming a particular implementation.

Contributor Author:

removed.

@@ -93,7 +98,17 @@ class OpenSearchReaderQueryResponse(BaseDBReader.QueryResponse):
def to_docs(self, query_params: "BaseDBReader.QueryParams") -> list[Document]:
assert isinstance(query_params, OpenSearchReaderQueryParams)
result: list[Document] = []
if not query_params.reconstruct_document:
if query_params.doc_reconstructor is not None:
Collaborator:
How are you planning to implement parallel reconstruction?
I think this is another place where making the reconstructor a class would help. It could have a member variable .parallel_reconstruct, where the first stage does dict -> (simplified Doc) and the second stage is a Map from simplified doc -> full doc.

Contributor Author:

Maybe we can discuss this idea in #1029
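The two-stage design the Collaborator sketches could look roughly like this. This is a minimal sketch under stated assumptions: the `DocReconstructor` name, the `parallel_reconstruct` flag, and the stand-in `Document` are all illustrative, not the merged API.

```python
from dataclasses import dataclass
from typing import Callable

# Minimal stand-in for sycamore's Document, for illustration only.
@dataclass
class Document:
    doc_id: str
    text: str = ""

class DocReconstructor:
    """Hypothetical class-based reconstructor.

    When parallel_reconstruct is True, stage one maps the raw hit dict to a
    lightweight Document (doc_id only), and stage two expands each lightweight
    Document into a full one, so the expansion can run as a parallel Map.
    """

    def __init__(self, reconstruct_fn: Callable[[str], Document],
                 parallel_reconstruct: bool = False):
        self.reconstruct_fn = reconstruct_fn
        self.parallel_reconstruct = parallel_reconstruct

    def to_simple_doc(self, data: dict) -> Document:
        # Stage one: dict -> simplified Document (doc_id only).
        doc_id = data["_source"].get("parent_id") or data["_id"]
        return Document(doc_id=doc_id)

    def expand(self, doc: Document) -> Document:
        # Stage two: simplified Document -> full Document.
        return self.reconstruct_fn(doc.doc_id)
```

Splitting the work this way keeps the cheap dict-to-id mapping in the read stage while letting the expensive full reconstruction be distributed.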

lib/sycamore/sycamore/reader.py (outdated; resolved)
original_docs = (
context.read.binary(path, binary_format="pdf")
.partition(partitioner=UnstructuredPdfPartitioner())
.materialize(cache_dir)
Collaborator:
This should not work; the default materialize naming is content-hash + doc_id:

def doc_to_name(doc: Document, bin: bytes) -> str:

You need to override the naming function to match line 137.

Contributor Author:

Ok
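A naming override along the lines the reviewer suggests might look like the following. This is a hypothetical sketch using the `doc_to_name` signature quoted above; how the override is actually passed to `materialize` is not shown in this diff.

```python
from types import SimpleNamespace

def doc_to_name(doc, binary: bytes) -> str:
    # Default materialize naming is content-hash + doc_id; naming the file
    # after doc_id alone lets a reader that looks pickles up by doc_id
    # find the right file.
    return f"{doc.doc_id}.pickle"

# Demo with a stand-in doc object (SimpleNamespace mimics a Document
# that only needs a doc_id attribute here):
fake = SimpleNamespace(doc_id="1")
name = doc_to_name(fake, b"")
```

The key point is that the name must be computable from the doc_id alone, since that is all the reconstruction path has at lookup time.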

data = pickle.load(open(f"{cache_dir}/{found}", "rb"))
return Document(**data)

path = str(TEST_DIR / "resources/data/pdfs/Ray.pdf")
Collaborator:
I would recommend fake documents rather than running the partitioner; this test is testing far more than necessary. As a side benefit, it becomes very easy to verify that you're reconstructing via pickle: the fake documents can do doc.hidden_var = ; and then you check that hidden_var pops out on the far side. Since it wouldn't be written to OpenSearch, you know it's doing pickle reconstruction.

Also, it seems better to test with multiple documents, which is easier if they're fake.
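A fake-document setup along these lines could be sketched as follows. The field names, file layout, and `reconstruct` helper are illustrative assumptions, not the PR's code; the idea is that the "hidden" field never reaches OpenSearch, so its survival proves pickle reconstruction.

```python
import os
import pickle
import tempfile

# Fake documents carrying a field that is never written to OpenSearch.
hidden = "pickle-only-marker"
dicts = [
    {"doc_id": "1", "hidden": hidden, "text_representation": "doc one"},
    {"doc_id": "2", "hidden": hidden, "text_representation": "doc two"},
]

# Write each doc to a pickle file named by doc_id, standing in for the
# materialize cache the test would use.
cache_dir = tempfile.mkdtemp()
for d in dicts:
    with open(os.path.join(cache_dir, f"{d['doc_id']}.pickle"), "wb") as f:
        pickle.dump(d, f)

def reconstruct(doc_id: str) -> dict:
    # Stand-in for the PR's reconstruct_fn: load the pickled dict back.
    with open(os.path.join(cache_dir, f"{doc_id}.pickle"), "rb") as f:
        return pickle.load(f)
```

If the round-tripped documents still carry "hidden", they cannot have come back through the search index.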

retrieved_materialized = sorted(retrieved_docs.take_all(), key=lambda d: d.doc_id)
query_materialized = query_docs.take_all()
retrieved_materialized_reconstructed = sorted(retrieved_docs_reconstructed.take_all(), key=lambda d: d.doc_id)

Collaborator:
I would have expected assert original_docs == retrieved_docs to verify they are completely the same. Then you can remove most of the other checks.

Once you add multiple docs per the earlier comment, you'll need to sort by doc_id.

pass

@abstractmethod
def get_doc_id(self, data) -> str:
Collaborator:
For now, I'd get rid of the abstract base class. It's not clear to me that this is the right abstraction, and we won't know until we implement it for a few more databases; e.g. it assumes that there are source fields and that the document comes back as a dictionary.

Contributor Author:
Sure

def get_required_source_fields(self) -> list[str]:
return ["parent_id"]

def get_doc_id(self, data) -> str:
Collaborator:
data : dict

Contributor Author:
ok

doc_reconstructor=doc_reconstructor,
kwargs=query_kwargs,
)
if query is None:
Collaborator:
Remove the default in OpenSearchReaderQueryParams; that change makes sure this code and the other code stay in sync.

Contributor Author:
ok

expected_count = len(original_docs)
actual_count = get_doc_count(os_client, TestOpenSearchRead.INDEX)
# refresh should have made all ingested docs immediately available for search
assert actual_count == expected_count, f"Expected {expected_count} documents, found {actual_count}"
Collaborator:
FYI: Golang taught me to use got & want, which are shorter and still clear. Fine to keep as is.

retrieved_materialized = sorted(retrieved_docs.take_all(), key=lambda d: d.doc_id)
query_materialized = query_docs.take_all()
retrieved_materialized_reconstructed = sorted(retrieved_docs_reconstructed.take_all(), key=lambda d: d.doc_id)

with OpenSearch(**TestOpenSearchRead.OS_CLIENT_ARGS) as os_client:
os_client.indices.delete(TestOpenSearchRead.INDEX)
# with OpenSearch(**TestOpenSearchRead.OS_CLIENT_ARGS) as os_client:
Collaborator:
remove obsolete line.

Contributor Author:
ok

dicts = [
{
"doc_id": "1",
"hidden": hidden,
Collaborator:
# make sure we read from pickle files -- this part won't be written into opensearch.

Contributor Author:
added

reconstruct_document=True,
doc_reconstructor=OpenSearchDocumentReconstructor(TestOpenSearchRead.INDEX, doc_reconstructor),
)
.filter(lambda doc: doc.doc_id == "1")
Collaborator:
Why not read them all, sort and compare? The filter seems unnecessary. You could drop all the docs except for 1 and the test would remain the same.

Contributor Author:
ok

for i in range(len(doc.elements) - 1):
assert doc.elements[i].element_index < doc.elements[i + 1].element_index
assert len(retrieved_docs_reconstructed) == 1
assert retrieved_docs_reconstructed[0].data["hidden"] == hidden
Collaborator:
Does assert retrieved_docs_reconstructed == original_docs not work?
That's the condition that we want to be true.

Contributor Author:
Yes, I will add this check (original_docs is after .explode(), but you meant docs).
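The whole-set comparison being discussed could be sketched as below. Plain dicts stand in for `Document` objects, and `by_id` is an illustrative helper; the point is to sort both sides by doc_id before comparing, so the distributed read's ordering doesn't cause false failures.

```python
# Two sets of documents that are equal in content but arrive in
# different orders, as they may from a distributed read.
original_docs = [
    {"doc_id": "2", "hidden": "pickle-only-marker"},
    {"doc_id": "1", "hidden": "pickle-only-marker"},
]
retrieved_docs_reconstructed = [
    {"doc_id": "1", "hidden": "pickle-only-marker"},
    {"doc_id": "2", "hidden": "pickle-only-marker"},
]

def by_id(docs):
    # Canonical ordering so the equality check ignores read order.
    return sorted(docs, key=lambda d: d["doc_id"])

same = by_id(original_docs) == by_id(retrieved_docs_reconstructed)
```

One whole-set equality assertion like this subsumes most of the per-field spot checks in the test.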

os_client.indices.delete(TestOpenSearchRead.INDEX)
os_client.indices.create(TestOpenSearchRead.INDEX, **TestOpenSearchRead.INDEX_SETTINGS)
os_client.indices.refresh(TestOpenSearchRead.INDEX)
# with OpenSearch(**TestOpenSearchRead.OS_CLIENT_ARGS) as os_client:
Collaborator:
remove obsolete line.

Contributor Author:
done

def __init__(self, index_name: str, reconstruct_fn: Callable[[str, str], Document]):
self.index_name = index_name
self.reconstruct_fn = reconstruct_fn

def get_required_source_fields(self) -> list[str]:
return ["parent_id"]

def get_doc_id(self, data) -> str:
def get_doc_id(self, data: dict) -> str:
return data["_source"]["parent_id"] or data["_id"]

def reconstruct(self, data) -> Document:
Collaborator:
data: dict here as well.

Contributor Author:
ok

)
.filter(lambda doc: doc.doc_id == "1")
# .filter(lambda doc: doc.doc_id == "1")
Collaborator:
remove obsolete line.

Contributor Author:
removed

@austintlee austintlee merged commit 80d7ab9 into main Dec 19, 2024
10 of 14 checks passed