Local doc cache support for document reconstruct #1066

Open
wants to merge 3 commits into
base: main

Conversation

austintlee
Contributor

No description provided.

@@ -126,9 +133,21 @@ def to_docs(self, query_params: "BaseDBReader.QueryParams") -> list[Document]:
doc.properties[DocumentPropertyTypes.SOURCE] = DocumentSource.DB_QUERY
assert doc.doc_id, "Retrieved invalid doc with missing doc_id"
if not doc.parent_id:
if query_params.document_cache is not None:
cached_doc = query_params.document_cache.get(f"{index_name}:{doc.doc_id}")
Contributor

We probably want to move this cache-key logic into a shared method?

Contributor Author

Yes.
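
A minimal sketch of what such a shared helper could look like. The name `doc_cache_key` and the dict-backed cache are assumptions for illustration and are not part of this PR; only the `index_name:doc_id` key format is taken from the diff above.

```python
def doc_cache_key(index_name: str, doc_id: str) -> str:
    """Build the cache key for one document in one index (hypothetical helper)."""
    return f"{index_name}:{doc_id}"


# A plain dict stands in for the real document cache in this sketch.
cache: dict[str, dict] = {}
cache[doc_cache_key("my_index", "doc-1")] = {"doc_id": "doc-1"}

# Readers and writers call the same helper, so the key format lives in one place.
cached_doc = cache.get(doc_cache_key("my_index", "doc-1"))
missing_doc = cache.get(doc_cache_key("other_index", "doc-1"))
```

With a helper like this, the read path and the write path cannot drift apart if the key format ever changes.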

# doc_ids = list(unique_docs.keys())
doc_ids = [d for d in list(unique_docs.keys()) if d not in list(cached_docs.keys())]

# We can't safely exclude embeddings since we might need them for 'rerank', e.g.
Contributor

Do we need to do embedding-style reranking locally? Won't the kNN result already be ordered? Or do we expect to reorder based on something else afterwards?

Contributor Author

No. When I wrote the comment, I thought we used embeddings for reranking. In general, though, I don't know if we can safely drop them in the context of document reconstruct. Maybe document reconstruct should be a set of options, not just a boolean.
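
One way the "set of options" idea could be sketched is below. `ReconstructOptions`, its fields, `excluded_source_fields`, and the `"embedding"` field name are all hypothetical; nothing here is an existing API of this project.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReconstructOptions:
    """Hypothetical replacement for a single reconstruct_document boolean."""

    enabled: bool = False
    include_embeddings: bool = True  # keep embeddings unless the caller opts out
    use_cache: bool = False


def excluded_source_fields(opts: ReconstructOptions) -> list[str]:
    """Return the fields to leave out of the follow-up query (assumed field name)."""
    if opts.enabled and not opts.include_embeddings:
        return ["embedding"]
    return []


default_opts = ReconstructOptions(enabled=True)
lean_opts = ReconstructOptions(enabled=True, include_embeddings=False)
```

Defaulting `include_embeddings` to `True` keeps the conservative behavior discussed above: embeddings are only dropped when a caller explicitly asks for it.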

-doc.elements.sort(key=lambda e: e.element_index if e.element_index is not None else float("inf"))
+if doc.doc_id not in cached_docs:
+    doc.elements.sort(key=lambda e: e.element_index if e.element_index is not None else float("inf"))
+    if query_params.document_cache is not None:
Contributor

Can cached_docs have anything in it if document_cache is None?

Contributor Author

It'll be an empty dict.
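
The behavior described here can be illustrated with a small sketch. The helpers `lookup_cached` and `ids_to_fetch` are hypothetical, and a plain dict stands in for the document cache; the point is only that an absent cache yields an empty `cached_docs`, so every document ID is still fetched.

```python
from typing import Optional


def lookup_cached(doc_ids: list[str], document_cache: Optional[dict]) -> dict:
    """Collect cache hits; with no cache configured the result is an empty dict."""
    cached_docs = {}
    if document_cache is not None:
        for doc_id in doc_ids:
            hit = document_cache.get(doc_id)
            if hit is not None:
                cached_docs[doc_id] = hit
    return cached_docs


def ids_to_fetch(unique_docs: dict, cached_docs: dict) -> list[str]:
    """Keep only the IDs that were not served from the cache."""
    # Membership tests on a dict are O(1); no need to build a list of keys first.
    return [doc_id for doc_id in unique_docs if doc_id not in cached_docs]


unique_docs = {"a": "doc a", "b": "doc b"}
no_cache_hits = lookup_cached(list(unique_docs), None)
some_hits = lookup_cached(list(unique_docs), {"a": "cached a"})
```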

@@ -224,7 +225,7 @@ def opensearch(
     index_name: str,
     query: Optional[Dict] = None,
     reconstruct_document: bool = False,
-    query_kwargs: dict[str, Any] = {},
+    query_kwargs=None,
Contributor

Can you change this to an Optional? Just for consistency.

Contributor Author

Ok.
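
For reference, a minimal sketch of the `Optional` pattern being requested. The function `normalize_query_kwargs` is hypothetical and only illustrates how a `None` default is typically annotated and normalized; it is not the reader's actual signature.

```python
from typing import Any, Optional


def normalize_query_kwargs(query_kwargs: Optional[dict[str, Any]] = None) -> dict[str, Any]:
    """Return a fresh dict so callers never share or mutate a default value."""
    if query_kwargs is None:
        return {}
    # Copy so that later changes do not leak back into the caller's dict.
    return dict(query_kwargs)


first = normalize_query_kwargs()
first["size"] = 10  # mutating one result must not affect the next call
second = normalize_query_kwargs()

caller_kwargs = {"size": 5}
copied = normalize_query_kwargs(caller_kwargs)
copied["size"] = 50
```

A `None` default avoids the shared-state pitfall of a mutable `{}` default, and the `Optional[...]` annotation keeps the signature consistent with the neighboring `query: Optional[Dict] = None` parameter.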

2 participants