-
Notifications
You must be signed in to change notification settings - Fork 49
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Local doc cache support for document reconstruct #1066
base: main
Are you sure you want to change the base?
Conversation
@@ -126,9 +133,21 @@ def to_docs(self, query_params: "BaseDBReader.QueryParams") -> list[Document]: | |||
doc.properties[DocumentPropertyTypes.SOURCE] = DocumentSource.DB_QUERY | |||
assert doc.doc_id, "Retrieved invalid doc with missing doc_id" | |||
if not doc.parent_id: | |||
if query_params.document_cache is not None: | |||
cached_doc = query_params.document_cache.get(f"{index_name}:{doc.doc_id}") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we probably want to move this get_key logic to a shared method?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes.
# doc_ids = list(unique_docs.keys()) | ||
doc_ids = [d for d in list(unique_docs.keys()) if d not in list(cached_docs.keys())] | ||
|
||
# We can't safely exclude embeddings since we might need them for 'rerank', e.g. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to do embedding style reranking locally? Won't the knn result just be ordered? Or do we expect to reorder based on something else after
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No. When I wrote the comment, I thought we use embeddings for reranking. In general, though, I don't know if we can safely drop them in the context of document reconstruct. Maybe document reconstruct should be a set of options not just a boolean.
doc.elements.sort(key=lambda e: e.element_index if e.element_index is not None else float("inf")) | ||
if doc.doc_id not in cached_docs: | ||
doc.elements.sort(key=lambda e: e.element_index if e.element_index is not None else float("inf")) | ||
if query_params.document_cache is not None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can cached_docs have anything if document_cache is None?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It'll be an empty dict.
@@ -224,7 +225,7 @@ def opensearch( | |||
index_name: str, | |||
query: Optional[Dict] = None, | |||
reconstruct_document: bool = False, | |||
query_kwargs: dict[str, Any] = {}, | |||
query_kwargs=None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you change to an optional? Just for consistency
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok.
No description provided.