Refactoring PDF loaders: all #28970
Draft: pprados wants to merge 26 commits into langchain-ai:master from pprados:pprados/pdf_loaders
pprados force-pushed the pprados/pdf_loaders branch 7 times, most recently from ad07a86 to 7cd0649 (January 3, 2025 14:28)
**Description:** bump gritql dependency, to use new binary names from [here](getgrit/gritql#565) **Issue:** fixes langchain-ai#27822
Co-authored-by: Erick Friis <[email protected]>
…ai#28975) Remove redundant word for improved sentence fluency Co-authored-by: Erick Friis <[email protected]>
Description: Document update. A minor typo is fixed. Install lxml as required. Issue: - Dependencies: - Twitter handle: @sathesh --------- Co-authored-by: Erick Friis <[email protected]>
## Description To integrate ModelScope inference API endpoints for both Embeddings, LLMs and ChatModels, install the package `langchain-modelscope-integration` (as discussed in issue langchain-ai#28928 ). This is necessary because the package name `langchain-modelscope` was already registered by another party. ModelScope is a premier platform designed to connect model checkpoints with model applications. It provides the necessary infrastructure to share open models and promote model-centric development. For more information, visit GitHub page: [ModelScope](https://github.com/modelscope).
Believe the current implementation raises PydanticUserError following [this](https://github.com/pydantic/pydantic/releases/tag/v2.10.1) Pydantic release. Resolves langchain-ai#28989
…ser` (langchain-ai#28959) - **Description:** Fix the `body` keyword argument for AzureAIDocumentIntelligenceParser` - **Issue:** langchain-ai#28948
Description: Add a missing 'has' verb in the Streaming Conceptual Guide.
**Description:** This PR updates the codebase to reflect the deprecation of the AgentType feature. It includes the following changes: Documentation Update: Added a deprecation notice to the AgentType class comment. Provided a reference to the official LangChain migration guide for transitioning to LangGraph agents. Reference Link: https://python.langchain.com/docs/how_to/migrate_agent/ **Twitter handle:** @hrrrriiiishhhhh --------- Co-authored-by: Chester Curme <[email protected]>
…langchain-ai#28984) ### Description - In the example, remove `llama-2-13b-chat`, `mixtral-8x7b-instruct-v0-1`. - Fix llm friendli streaming implementation. - Update examples in documentation and remove duplicates. ### Issue N/A ### Dependencies None ### Twitter handle `@friendliai`
…Template` (langchain-ai#28969) - **Description:** Very small change in Docstring for `BasePromptTemplate` - **Issue:** langchain-ai#28966
accross -> across
…to `auto` (langchain-ai#28961) - **Description:** `DuckDuckGoSearchAPIWrapper` default value for backend has been changed to avoid User Warning - **Issue:** langchain-ai#28957
This PR is to correct a simple typo in how-to guides section.
Before: ![Screenshot 2025-01-02 at 1 49 30 PM](https://github.com/user-attachments/assets/cb30526a-fc0b-439f-96d1-962c226d9dc7) After: ![Screenshot 2025-01-02 at 1 49 38 PM](https://github.com/user-attachments/assets/32c747ea-6391-4dec-b778-df457695d197)
- In this PR, I have updated the AzureML Endpoint with the latest endpoint. - **Description:** I have changed the existing `/chat/completions` to `/models/chat/completions` in libs/community/langchain_community/llms/azureml_endpoint.py - **Issue:** langchain-ai#25702 --------- Co-authored-by: = <=>
…angchain-ai#28914) This commit updates the documentation and package registry for the FalkorDB Chat Message History integration. **Changes:** - Added a comprehensive example notebook falkordb_chat_message_history.ipynb demonstrating how to use FalkorDB for session-based chat message storage. - Added a provider notebook for FalkorDB - Updated libs/packages.yml to register FalkorDB as an integration package, following LangChain's new guidelines for community integrations. **Notes:** - This update aligns with LangChain's process for registering new integrations via documentation updates and package registry modifications. - No functional or core package changes were made in this commit. --------- Co-authored-by: Chester Curme <[email protected]>
…Demonstration (langchain-ai#28938) ## Description This pull request updates the documentation for FAISS regarding filter construction, following the changes made in commit `df5008f`. ## Issue None. This is a follow-up PR for documentation of [langchain-ai#28207](langchain-ai#28207) ## Dependencies: None. --------- Co-authored-by: Chester Curme <[email protected]>
…in-ai#28902) Problem: "Optional" object is used in one example without importing, which raises the following error when copying the example into IDE or Jupyter Lab ![image](https://github.com/user-attachments/assets/3a6c48cc-937f-4774-979b-b3da64ced247) Solution: Just importing Optional from typing_extensions module, this solves the problem! --------- Co-authored-by: Erick Friis <[email protected]>
pprados force-pushed the pprados/pdf_loaders branch from 7cd0649 to 5b0eba0 (January 3, 2025 16:50)
pprados force-pushed the pprados/pdf_loaders branch from fe2e4a7 to d9a1b0c (January 3, 2025 17:03)
eyurtsev pushed a commit that referenced this pull request (Jan 7, 2025):
- **Refactoring PDF loaders step 1**: "community: Refactoring PDF loaders to standardize approaches" - **Description:** Declare CloudBlobLoader in __init__.py. file_path is Union[str, PurePath] everywhere - **Twitter handle:** pprados This is one part of a larger Pull Request (PR) that is too large to be submitted all at once. This specific part focuses on preparing the update of all parsers. For more details, see [PR 28970](#28970). @eyurtsev it's the start of a PR series.
@pprados FYI Docling has meanwhile been added: |
Refactoring all PDF loaders and parsers: community
Description: refactoring of PDF parsers and loaders. See below
Issue: missing locks, parameter inconsistency, missing lazy approach, split loader and parser, etc.
Twitter handle: pprados
Add tests and docs: `docs/docs/integrations` directory
Lint and test: done
Rationale
Even though `Document` has a `page_content` parameter (rather than text or body), we believe it's not good practice to work with pages. Indeed, this approach creates memory gaps in RAG projects. If a paragraph spans two pages, the beginning of the paragraph is at the end of one page, while the rest is at the start of the next. With a page-based approach, there will be two separate chunks, each containing part of a sentence. The corresponding vectors won't be relevant. These chunks are unlikely to be selected when there's a question specifically about the split paragraph. If one of the chunks is selected, there's little chance the LLM can answer the question. This issue is worsened by the injection of headers, footers (if parsers haven't properly removed them), images, or tables at the end of a page, as most current implementations tend to do.
Why is it important to unify the different parsers? Each has its own characteristics and strategies, more or less effective depending on the family of PDF files. One strategy is to identify the family of the PDF file (by inspecting the metadata or the content of the first page) and then select the most efficient parser in that case. By unifying parsers, downstream code doesn't need to deal with the specifics of different parsers, as the result is similar for each. We'll propose a parser using this strategy in another PR.
The PR
We propose a substantial PR to improve the different PDF parser integrations. All my clients struggle with PDFs. I took the initiative to address this issue at its root by refactoring the various integrations of Python PDF parsers. The goal is to standardize a minimum set of parameters and metadata and bring improvements to each one (bug fixes, feature additions).
Don't worry about the size of the PR. In the end, there are only two modified files. The rest is just updating unit tests and docs.
langchain_community/document_loaders/pdf.py
langchain_community/document_loaders/parsers/pdf.py
langchain_community/tests/integration_tests/document_loaders/pdf.py
langchain_community/tests/integration_tests/document_loaders/parsers/pdf.py
docs/docs/integrations/document_loaders/*pdf*.ipynb
docs/docs/how_to/document_loader_pdf.ipynb
docs/docs/how_to/document_loader_custom.ipynb
To qualify all the code, we worked in a separate, parallel project that follows the `langchain-common` structure. We understand that it's important to ensure that changes don't have a significant impact on existing code, so this project lets us run PDF readings before and after the modifications and compare the results of the historical implementation with the new one. The only difference is the name used to import the classes.
The whole parallel project is available here. Open the `compare_old_new` directory in your development environment and use a diff tool to identify the differences.
Metadata
All parsers use lowercase keys for PDF file metadata, except `PDFPlumberParser`. For this particular case, we've added a dictionary wrapper that warns when keys with uppercase letters are used.
Images
The current implementation in LangChain involves asking each parser for the text on a page, then retrieving images to apply OCR. The text extracted from images is then appended to the end of the page text, which may split paragraphs across pages, worsening the RAG model’s performance.
To avoid this, we modified the strategy for injecting OCR results from images. Now, the result is inserted between two paragraphs of text (`\n\n` or `\n`), just before the end of the page. This allows a half-paragraph to be combined with the first paragraph of the following page.
Currently, the LangChain implementation uses RapidOCR to analyze images and extract any text. This algorithm is designed to work with Chinese and English, not other languages. Since the implementation uses a function rather than a method, it's not possible to override it. We have modified the various parsers to allow for selecting the algorithm used to analyze images. It's now possible to use RapidOCR, Tesseract, or a multimodal LLM to get a description of the image.
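The insertion strategy described above can be illustrated with a small, self-contained sketch. The function name `insert_image_text` and its fallback behaviour are assumptions made for illustration; this is not the code from the PR.

```python
def insert_image_text(page_text: str, image_text: str) -> str:
    """Insert text derived from images before the last paragraph of a page.

    Prefers a blank-line break, falls back to a single newline, and
    appends at the end only when the page contains no break at all.
    """
    if not image_text:
        return page_text
    for sep in ("\n\n", "\n"):
        pos = page_text.rfind(sep)
        if pos != -1:
            head = page_text[:pos]
            tail = page_text[pos + len(sep):]
            # The trailing (possibly unfinished) paragraph stays last, so it
            # can be joined with the start of the next page.
            return f"{head}{sep}{image_text}{sep}{tail}"
    return f"{page_text}\n\n{image_text}"


page = "First paragraph.\n\nA paragraph that continues on"
print(insert_image_text(page, "[image: chart caption]"))
```

With this ordering, the unfinished paragraph remains the last thing on the page instead of being cut off from its continuation by the image text.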
To standardize this, we propose a new abstract class:
For converting images to text, the possible formats are: text, markdown, and HTML. Why is this important? If it's necessary to split a result based on the origin of the text fragments, it's possible to do so at the level of image translations. An identification rule such as `![text](...)` or `<img …/>` allows us to identify text fragments originating from an image.
Tables
Tables present in PDF files are another challenge. Some algorithms can detect part of them. This typically involves a specialized process, separate from the text flow. That is, the text extracted from the page includes each cell's content, sometimes in columns, sometimes in rows. This text is challenging for the LLM to interpret. Depending on the capabilities of the libraries, it may be possible to detect tables, then identify the cell boxes during text extraction to inject the table in its entirety. This way, the flow remains coherent. It’s even possible to add a few paragraphs before and after the table to prompt an LLM to describe it. Only the description of the table will be used for embedding.
Tables identified in PDF pages can be translated into markdown (if there are no merged cells) or HTML (which consumes more tokens). LLMs can then make use of them.
Unfortunately, this approach isn’t always feasible. In such cases, we can apply the approach used for images, by injecting tables and images between two paragraphs in the page’s text flow. This is always better than placing them at the end of the page.
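As a rough illustration of the markdown translation mentioned above, the sketch below renders a table without merged cells as a Markdown pipe table. It assumes the table has already been extracted as a list of rows of strings, with the first row as the header; `table_to_markdown` is a hypothetical helper, not part of LangChain.

```python
from typing import Optional, Sequence


def table_to_markdown(rows: Sequence[Sequence[Optional[str]]]) -> str:
    """Render rows (first row is the header) as a Markdown pipe table."""
    if not rows:
        return ""
    width = max(len(row) for row in rows)

    def fmt(row: Sequence[Optional[str]]) -> str:
        # Empty cells become "", pipes are escaped, newlines are flattened.
        cells = [
            (cell or "").replace("|", "\\|").replace("\n", " ").strip()
            for cell in row
        ]
        cells += [""] * (width - len(cells))  # pad short rows
        return "| " + " | ".join(cells) + " |"

    lines = [fmt(rows[0]), "| " + " | ".join(["---"] * width) + " |"]
    lines += [fmt(row) for row in rows[1:]]
    return "\n".join(lines)


print(table_to_markdown([["Name", "Qty"], ["apple", "3"], ["pear", None]]))
```

Merged cells cannot be expressed in this format, which is why the text falls back to HTML for those tables.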
Combining Pages
As mentioned, in a RAG project, we want to work with the text flow of a document, rather than by page. A mode is dedicated to this, which can be configured to specify the character to use for page delimiters in the flow. This could simply be `\n`, `------\n` or `\f` to clearly indicate a page change, or `<!-- PAGE BREAK -->` for seamless injection in a Markdown viewer without a visual effect.
Why is it important to identify page breaks when retrieving the full document flow? Because we generally want to provide a URL with the chunk's location when the LLM answers. While it's possible to reference the entire PDF, this isn't practical if it's more than two pages long. It's better to indicate the specific page to display in the URL. Therefore, assistance is needed so that chunking algorithms can add the page metadata to each chunk. The choice of delimiter helps the algorithm calculate this parameter.
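To make the role of the delimiter concrete, here is a minimal sketch of how a chunking step could recover the page of a chunk from a single-flow document. The helpers `join_pages` and `page_of_offset` are hypothetical names used only for this example.

```python
from typing import List


def join_pages(pages: List[str], delimiter: str = "\f") -> str:
    """Concatenate page texts into one flow, separated by the delimiter."""
    return delimiter.join(pages)


def page_of_offset(full_text: str, offset: int, delimiter: str = "\f") -> int:
    """Return the zero-based page index that contains character `offset`.

    Counts how many delimiters occur before the offset.
    """
    return full_text.count(delimiter, 0, offset)


pages = ["Intro text that ends mid-", "sentence on page two.", "Last page."]
flow = join_pages(pages)
chunk_start = flow.index("sentence")
print(page_of_offset(flow, chunk_start))  # 1
```

This only works when the delimiter cannot occur inside a page, which is one reason to prefer `\f` or an explicit marker over a plain `\n`.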
Similarly, we've added metadata in all parsers with the total number of pages in the document. Why is this important? If we want to reference a document, we need to determine if it's relevant. A reference is valid if it helps the user quickly locate the fragment within the document (using the page and/or a chunk excerpt). But if the URL points to a PDF file without a page number (for various reasons) and the file has a large number of pages, we want to remove the reference that doesn't assist the user. There's no point in referencing a 100-page document! The `total_pages` metadata can then be used. We recommend this approach in an extension to LangChain that we propose for managing document references: langchain-reference.
Compatibility
We have tried, as much as possible, to maintain compatibility with the previous version. This is reflected in preserving the order of parameters and using the default values for each implementation so that the results remain similar. The unit and integration tests for the various parsers have not been modified; they are still valid.
Ideally, we would prefer an interface like:
but this could break compatibility for positional arguments.
Perhaps it would be feasible to plan a migration for LangChain v1.0 by modifying the default parameters to make them mandatory during the transition to v1.0. At that point, we could reintroduce default values.
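The compatibility concern with positional arguments can be shown with a generic Python sketch. Both classes and their parameter lists are hypothetical stand-ins, not the signatures proposed in the PR; the point is only that a keyword-only signature rejects existing positional call sites.

```python
from typing import Optional


class PositionalStyleParser:
    """Stand-in for an existing parser whose options can be passed by position."""

    def __init__(self, password: Optional[str] = None, extract_images: bool = False):
        self.password = password
        self.extract_images = extract_images


class KeywordOnlyParser:
    """Stand-in for a stricter interface: every option must be named."""

    def __init__(
        self,
        *,
        password: Optional[str] = None,
        mode: str = "page",
        extract_images: bool = False,
    ):
        self.password = password
        self.mode = mode
        self.extract_images = extract_images


# An existing call site that relies on argument order keeps working...
PositionalStyleParser("secret", True)

# ...but the same call fails against the keyword-only signature.
try:
    KeywordOnlyParser("secret", True)
except TypeError as exc:
    print(f"TypeError: {exc}")
```

Keeping the historical parameter order, as described above, avoids this break at the cost of a less uniform interface.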
Normalisation
The `AzureAIDocumentIntelligenceParser` class introduces the `mode` parameter, which accepts the values `single`, `page`, and `markdown`.
The deprecated `UnstructuredPDFLoader` class introduces the `mode` parameter, which accepts the values `single`, `paged`, and `markdown`.
Based on this model, we are extending the `mode` parameter to most parsers, with the values `single`, `page`, and `markdown`. `paged` is declared deprecated.
The different `Loader` and `BlobParser` classes now offer the following parameters:
`file_path`: `str` or `PurePath` with the file name.
`password`: `str` with the file password, if needed.
`mode`: to return a single document per file or one document per page (extended with `elements` in the case of Unstructured or other specific parsers).
`pages_delimiter`: to specify how to join pages (`\f` by default).
`extract_images`: to enable image extraction (already present in most Loaders/Parsers).
`images_to_text`: to specify how to handle images (invoking OCR, LLM, etc.).
`extract_tables`: to allow extraction of tables detected by underlying libraries, for certain parsers.
The integration of image texts is now between two paragraphs.
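The handling of the `mode` values, including the deprecation of `paged`, could be sketched as follows. `normalize_mode` is a hypothetical helper and only covers the three common values named above; parser-specific values such as `elements` are left out.

```python
import warnings

# The common values named in the text; parser-specific extras are omitted.
_VALID_MODES = ("single", "page", "markdown")


def normalize_mode(mode: str) -> str:
    """Map the deprecated 'paged' value to 'page' and validate the rest."""
    if mode == "paged":
        warnings.warn(
            "mode='paged' is deprecated; use mode='page' instead",
            DeprecationWarning,
            stacklevel=2,
        )
        return "page"
    if mode not in _VALID_MODES:
        raise ValueError(f"mode must be one of {_VALID_MODES}, got {mode!r}")
    return mode


print(normalize_mode("single"))
```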
For the `images_to_text` parameter, we propose three functions:
`convert_images_to_text_with_rapidocr()`
`convert_images_to_text_with_tesseract()`
`convert_images_to_description()`
Here’s how it’s used:
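As a rough illustration only, the sketch below shows the general idea of passing an image-to-text strategy as a callable. Every name in it (`ImagesToText`, `describe_by_size`, `ToyParser`) is a hypothetical stand-in, and the callable signature is an assumption; it does not reproduce the API of the three functions listed above.

```python
from typing import Callable, Iterable, Iterator, List, Optional

# Assumed shape of a strategy: images in (raw bytes here), one text per image out.
ImagesToText = Callable[[Iterable[bytes]], Iterator[str]]


def describe_by_size(images: Iterable[bytes]) -> Iterator[str]:
    """Stand-in for an OCR or LLM call; it only reports the image size."""
    for image in images:
        yield f"[image: {len(image)} bytes]"


class ToyParser:
    """Minimal parser-like object that delegates image handling to a callable."""

    def __init__(
        self,
        *,
        extract_images: bool = False,
        images_to_text: Optional[ImagesToText] = None,
    ):
        self.extract_images = extract_images
        self.images_to_text = images_to_text

    def image_texts(self, images: List[bytes]) -> List[str]:
        # Nothing is produced unless extraction is enabled and a strategy is set.
        if not (self.extract_images and self.images_to_text):
            return []
        return list(self.images_to_text(images))


parser = ToyParser(extract_images=True, images_to_text=describe_by_size)
print(parser.image_texts([b"\x89PNG", b"\xff\xd8\xff\xe0\x00"]))
```

Swapping the OCR engine then means passing a different callable, without touching the parser itself.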
Tables
Some parsers are able to extract tables, but this is not integrated into LangChain. We've added the necessary features to take this into account.
PyMuPDFLoader
PDFPlumberLoader
ZeroxPDFLoader
UnstructuredPDFLoader
Metadata
The different parsers offer a minimum set of common metadata:
source
page
total_pages
creationdate
creator
producer
All keys are in lowercase.
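The lowercase-key convention, together with the kind of warning wrapper mentioned earlier for `PDFPlumberParser`, might look like the sketch below. `LowerKeyMetadata` is a hypothetical class written for illustration and is not the wrapper shipped in the PR.

```python
import warnings
from typing import Any, Mapping


class LowerKeyMetadata(dict):
    """Dict that stores lowercase keys and warns on lookups using uppercase."""

    def __init__(self, data: Mapping[str, Any]):
        super().__init__({key.lower(): value for key, value in data.items()})

    def _norm(self, key: Any) -> Any:
        if isinstance(key, str) and key != key.lower():
            warnings.warn(
                f"Metadata key {key!r} is deprecated; use {key.lower()!r}",
                DeprecationWarning,
                stacklevel=3,
            )
            return key.lower()
        return key

    def __getitem__(self, key: Any) -> Any:
        return super().__getitem__(self._norm(key))

    def __contains__(self, key: Any) -> bool:
        return super().__contains__(self._norm(key))

    def get(self, key: Any, default: Any = None) -> Any:
        return super().get(self._norm(key), default)


meta = LowerKeyMetadata({"Producer": "SomeTool", "CreationDate": "2025-01-03"})
print(sorted(meta))       # ['creationdate', 'producer']
print(meta["producer"])   # SomeTool
```

Old code that still reads `meta["Producer"]` keeps working but receives a `DeprecationWarning`.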
Tests
We propose matrix tests to validate all parsers compatible with the new approach.
test_standard_parameters()
test_parser_with_table()
To validate all the parsers, we retrieved all the PDF files used by each parser for its own tests, and ran every LangChain parser against every one of these files. This ensures that there are no crashes when parsing a PDF file.
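The "every parser against every file" idea can be sketched with plain Python. The parsers here are toy callables and the file names are invented; a real test suite would plug in actual parsers and sample PDFs.

```python
from itertools import product
from typing import Callable, Dict, List, Tuple


def run_matrix(
    parsers: Dict[str, Callable[[str], List[str]]],
    files: List[str],
) -> List[Tuple[str, str, str]]:
    """Run each parser on each file and collect (parser, file, error type)."""
    failures = []
    for (name, parse), path in product(parsers.items(), files):
        try:
            parse(path)
        except Exception as exc:  # a crash is exactly what the matrix looks for
            failures.append((name, path, type(exc).__name__))
    return failures


def ok_parser(path: str) -> List[str]:
    return [f"text of {path}"]


def flaky_parser(path: str) -> List[str]:
    if path == "broken.pdf":
        raise ValueError("cannot parse")
    return [f"text of {path}"]


print(run_matrix({"ok": ok_parser, "flaky": flaky_parser}, ["a.pdf", "broken.pdf"]))
```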
New features of parsers
We summarize the modifications for each parser.
New loader / parsers
New parsers will be introduced in a separate pull request.
UnstructuredPDF
LlamaIndexPDF
PyMuPDF4LLM
PDFRouter
DoclingPDF
PDFMulti
For example, with the unification of parsers, it will be possible to choose the parser according to the characteristics of the PDF file.
This will be present in other PRs.
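To give an idea of what choosing a parser from the characteristics of a file could mean, here is a minimal routing sketch based on metadata. The function `pick_parser`, the route format, and the mapping of producers to parser names are all invented for illustration; the actual router is left to the later PRs.

```python
import re
from typing import Dict, List, Optional, Tuple

# Each route: (parser name, {metadata key: regex that must match its value}).
Route = Tuple[str, Dict[str, str]]


def pick_parser(
    metadata: Dict[str, Optional[str]],
    routes: List[Route],
    default: str,
) -> str:
    """Return the first parser whose rules all match the file metadata."""
    for parser_name, rules in routes:
        if all(
            re.search(pattern, str(metadata.get(key) or ""), re.IGNORECASE)
            for key, pattern in rules.items()
        ):
            return parser_name
    return default


# Hypothetical routing table, for illustration only.
routes: List[Route] = [
    ("parser_for_office_exports", {"producer": r"microsoft"}),
    ("parser_for_latex", {"creator": r"latex"}),
]

print(pick_parser({"producer": "Microsoft Word", "creator": None}, routes, "fallback"))
```

Because all parsers share the same parameters and metadata, the code that consumes the result does not need to know which parser was picked.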
Succession of PR