Refactoring PDF loaders: 02 PyMuPDF #29063

pprados · 2025-01-07T08:46:24Z

Refactoring PDF loaders step 2: "community: Refactoring PDF loaders to standardize approaches"
Description: Update PyMuPDFParser/Loader
Twitter handle: pprados

This is one part of a larger Pull Request (PR) that is too large to be submitted all at once.
This part focuses on preparing the update of all parsers.

For more details, see PR 28970.

@eyurtsev, this is the continuation of the PDFLoader modifications.

pprados · 2025-01-07T16:19:05Z

@eyurtsev I rebased the code onto master ;-)

eyurtsev

Great, will take a look in the AM.

eyurtsev

Left two major comments, a few stylistic comments, and some nits.

Let's tackle the two major comments:

Define the standardized structure of metadata
Create a dedicated ImageParser which is a blob parser

libs/community/langchain_community/document_loaders/parsers/pdf.py

eyurtsev · 2025-01-08T23:02:35Z

+    for k, v in metadata.items():
+        if type(v) not in [str, int]:
+            v = str(v)
+        if k.startswith("/"):


Bug? The file path could be an absolute path on the local machine, so doesn't this look like an error right now?

No, the check applies to the metadata keys, not the values. Some PDF parsers use keys like /creationDate.
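To make the distinction in this exchange concrete, here is a minimal sketch of a key-cleaning loop shaped like the hunk above. The function name `normalize_metadata` is hypothetical, and the body of the `startswith("/")` branch is an assumption (the hunk is cut off before it); here the leading slash is simply dropped. It is not the PR's actual implementation.

```python
from typing import Any


def normalize_metadata(metadata: dict[str, Any]) -> dict[str, Any]:
    """Hypothetical helper mirroring the loop in the reviewed hunk."""
    cleaned: dict[str, Any] = {}
    for key, value in metadata.items():
        # Same check as the hunk: keep str and int values, stringify the rest.
        if type(value) not in [str, int]:
            value = str(value)
        # Only the KEY is inspected, so a value that is an absolute path
        # (e.g. "/home/user/doc.pdf") is left untouched.
        if key.startswith("/"):
            key = key[1:]  # assumption: the leading slash is dropped
        cleaned[key] = value
    return cleaned


example = normalize_metadata(
    {
        "/CreationDate": "D:20250107",
        "source": "/home/user/doc.pdf",
        "pages": 3,
        "tags": ["a"],
    }
)
print(example)
```

The `source` value keeps its leading slash because only keys are tested, which is the point made in the reply.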

libs/community/langchain_community/document_loaders/parsers/pdf.py

eyurtsev · 2025-01-09T03:32:07Z

@@ -78,6 +203,192 @@ def extract_from_images_with_rapidocr(
    return text


+# Type to change the function to convert images to text.
+CONVERT_IMAGE_TO_TEXT = Optional[Callable[[Iterable[np.ndarray]], Iterator[str]]]


MAJOR:

Why not use an ImageBlobParser with the regular Blob-to-Document interface? It would allow reusing the image logic for images that do not originate from PDFs (e.g., to reuse it in a web crawler).

A PDF parser would then accept an image parser as part of its initializer:

class PDFParser(...):
    def __init__(self, ..., *, ..., image_blob_parser: Optional[BlobParser] = None):
        pass

If the image_blob_parser is provided, then it'll be used for OCR purposes.

I've done it! There are now 3 ImageBlobParsers!
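The design discussed in this thread can be sketched as follows. This is an illustration under stated assumptions, not the code merged in the PR: `Blob`, `Document` and `BlobParser` are small stand-ins defined here so the example runs on its own, and `FakeImageBlobParser` and `_extract_page` are hypothetical placeholders for real OCR and real PDF extraction.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Iterator, Optional


@dataclass
class Blob:
    """Stand-in for a raw-bytes blob (not the real LangChain class)."""
    data: bytes
    mimetype: str = "application/octet-stream"


@dataclass
class Document:
    """Stand-in for a parsed document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


class BlobParser(ABC):
    """Stand-in for the regular Blob-to-Document interface."""

    @abstractmethod
    def lazy_parse(self, blob: Blob) -> Iterator[Document]:
        ...


class FakeImageBlobParser(BlobParser):
    """Hypothetical image parser; a real one would run OCR on blob.data."""

    def lazy_parse(self, blob: Blob) -> Iterator[Document]:
        yield Document(f"<image: {len(blob.data)} bytes>", {"mimetype": blob.mimetype})


class PDFParser(BlobParser):
    def __init__(self, *, image_blob_parser: Optional[BlobParser] = None) -> None:
        # The image parser is injected, so it can also be reused outside PDFs.
        self.image_blob_parser = image_blob_parser

    def _extract_page(self, blob: Blob) -> tuple[str, list[Blob]]:
        # Placeholder: pretend the blob is one page of text plus one image.
        return blob.data.decode("utf-8"), [Blob(b"\x89PNG", "image/png")]

    def lazy_parse(self, blob: Blob) -> Iterator[Document]:
        text, images = self._extract_page(blob)
        if self.image_blob_parser is not None:
            # Delegate every extracted image to the injected parser.
            for image in images:
                for doc in self.image_blob_parser.lazy_parse(image):
                    text += "\n" + doc.page_content
        yield Document(text)


with_ocr = list(PDFParser(image_blob_parser=FakeImageBlobParser()).lazy_parse(Blob(b"hello")))
without_ocr = list(PDFParser().lazy_parse(Blob(b"hello")))
print(with_ocr[0].page_content)
```

Because the image parser only depends on the blob interface, the same object could be handed to a non-PDF loader, which is the reuse argument made in the review comment.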

pprados marked this pull request as ready for review January 7, 2025 09:16

dosubot bot added the size:XXL, community, and Ɑ: doc loader labels Jan 7, 2025

ccurme assigned eyurtsev Jan 7, 2025

pprados added 7 commits January 7, 2025 17:08

Prepare the integration of new versions of PDFLoader.

21759e2

Add file_path with PurePath. Add CloudBlobLoader in __init__. Replace Dict/List with dict/list.

Fix Line too long

4607354

Fix Line too long

668dc9c

Fix Line too long

7a5b5c5

Fix Line too long

6340ded

Update PyMuPDF

4845781

Fix tu

3beda82

pprados force-pushed the pprados/02-pymupdf branch from 039819c to 3beda82 Compare January 7, 2025 16:09

eyurtsev reviewed Jan 8, 2025

pprados mentioned this pull request Jan 8, 2025

Refactoring PDF loaders: all #28970

Draft

2 tasks

eyurtsev reviewed Jan 9, 2025

pprados added 3 commits January 9, 2025 16:48

Fix review - step 1

743a83e

Fix all remarques

b623750

Merge remote-tracking branch 'upstream/master' into pprados/02-pymupdf

20f5a41

pprados marked this pull request as draft January 10, 2025 12:45

pprados force-pushed the pprados/02-pymupdf branch from 0d99673 to 3fe4ec5 Compare January 10, 2025 13:40

pprados force-pushed the pprados/02-pymupdf branch 2 times, most recently from 4342991 to 760267b Compare January 10, 2025 14:05

pprados force-pushed the pprados/02-pymupdf branch 3 times, most recently from 9fc89e0 to d30b26d Compare January 10, 2025 14:47

pprados force-pushed the pprados/02-pymupdf branch 2 times, most recently from 6765dbf to df1d4d5 Compare January 10, 2025 15:09

Fix remarques

91234f0

pprados force-pushed the pprados/02-pymupdf branch from df1d4d5 to 91234f0 Compare January 10, 2025 15:37

pprados marked this pull request as ready for review January 10, 2025 15:46

dosubot bot added the 🤖:docs label Jan 10, 2025
