FEAT: Enhance PDFConverter to support text injection into existing PDFs #641

KutalVolkan · 2025-01-12T14:31:34Z

Overview:
This PR introduces enhancements to the PDF Converter, enabling direct modification of existing PDFs. It builds upon the existing functionality, supporting both template-based and non-template-based PDF generation while adding new capabilities for precise text injection.

Key Features:

Modify Existing PDFs:
- Inject text at specified coordinates on specific pages of an existing PDF (e.g., a CV).
- Customize text with font, size, and color options.
Backward Compatibility:
- Maintains support for template-based PDF generation using SeedPrompt.
- Retains non-templated PDF generation from plain string prompts.

Related Issues

FEAT: PDF Injection for RAG Vulnerabilities #541

Next Steps (DONE):

Update tests to cover the new functionality.
Add pypdf==5.1.0 to the pyproject.toml file to ensure the correct dependency is installed for the new feature.
Clean up the demo notebook and make the path to the CV file more generic.

romanlutz · 2025-01-17T19:13:56Z

pyrit/prompt_converter/pdf_converter.py


 from fpdf import FPDF
+from pypdf import PdfReader, PdfWriter, PageObject


We have to update pyproject.toml to add this dependency and remove fpdf.

Hello Roman,

Depending on the features we want, we need both dependencies: pypdf is used for reading and modifying existing PDFs (e.g., CVs), while fpdf is for creating PDFs from scratch. Keeping fpdf allows us to create custom templates, as mentioned in the issue.

By keeping both, we would support:

Templated PDF Generation: YAML-based templates with SeedPrompt for dynamic data injection.

Non-Templated PDF Generation: Directly converting plain string prompts into PDFs.

Reading and Modifying PDFs: Supporting existing PDFs and post-creation edits.

I'd prefer to keep both dependencies since it allows us to handle not only CVs but also generate custom cover letters from scratch. It enables us to test injecting malicious input into both CVs and cover letters.

For example, we could let an LLM generate cover letters for us and then test how injecting malicious inputs into these cover letters might trick an AI recruiter, not just through CV injections but also through targeted manipulations of the cover letter.

Wdyt?

romanlutz · 2025-01-17T19:16:43Z

pyrit/prompt_converter/pdf_converter.py

+                    x = item.get("x", 10)
+                    y = item.get("y", 10)


Is 10 just an arbitrary location you chose or does it mean something?

The 10 is a default offset to avoid placing text at the edge of the page (0,0). It ensures better readability and positioning. While it serves a practical purpose, the choice of 10 is somewhat arbitrary and can be adjusted as needed.
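
A minimal sketch of how that default behaves, mirroring the `item.get("x", 10)` pattern from the diff. The helper name `resolve_position` and the sample items are hypothetical and not part of the PR.

```python
DEFAULT_OFFSET = 10  # same fallback value as in the diff


def resolve_position(item: dict) -> tuple:
    """Hypothetical helper: return (x, y), falling back to the default offset."""
    x = item.get("x", DEFAULT_OFFSET)
    y = item.get("y", DEFAULT_OFFSET)
    return x, y


# Sample injection items: fully specified, partially specified, and empty.
print(resolve_position({"x": 50, "y": 700}))  # (50, 700)
print(resolve_position({"x": 50}))            # (50, 10)
print(resolve_position({}))                   # (10, 10)
```

Note that `dict.get` only falls back when the key is missing; an explicit `"x": None` would be passed through as `None`.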

romanlutz · 2025-01-17T19:20:41Z

pyrit/prompt_converter/pdf_converter.py

+        reader = PdfReader(self._existing_pdf)
+        writer = PdfWriter()
+
+        for page_number, page in enumerate(reader.pages):


Some readers just read content once and then essentially don't keep the content in memory. Does this one work that way? If so, this will stop working if you call it twice. Just asking since it's not obvious to me.

The PdfReader loads the cross-reference table and essential metadata into memory during initialization, which allows for repeated access to reader.pages.

This behavior was verified through testing, which confirmed that reader.pages provides consistent results across multiple accesses.

from io import BytesIO

from fpdf import FPDF
from pypdf import PdfReader


def test_pdf_reader_repeated_access():
    # Create a simple PDF in memory
    pdf = FPDF()
    pdf.add_page()
    pdf.set_font("Arial", size=12)
    pdf.cell(200, 10, txt="Test Content", ln=True)
    pdf_bytes = BytesIO()
    pdf.output(pdf_bytes)
    pdf_bytes.seek(0)

    # Use PdfReader to read the PDF
    reader = PdfReader(pdf_bytes)
    first_access = reader.pages[0].extract_text()
    second_access = reader.pages[0].extract_text()

    # Assertions to verify behavior
    assert first_access == second_access, "Repeated access should return consistent data"
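
The concern behind the question can be illustrated with the standard library alone: a file-like stream is consumed as it is read, which is why the test above rewinds `pdf_bytes` with `seek(0)` before handing it to `PdfReader`. The sketch below uses placeholder bytes rather than a real PDF.

```python
from io import BytesIO

stream = BytesIO()
stream.write(b"%PDF-placeholder")  # placeholder payload, not a valid PDF

# After writing, the position sits at the end, so a read returns nothing.
assert stream.read() == b""

# Rewinding makes the content readable again...
stream.seek(0)
first = stream.read()

# ...but a second read without another seek(0) is empty: the stream was consumed.
second = stream.read()

print(first)   # b'%PDF-placeholder'
print(second)  # b''
```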

romanlutz · 2025-01-17T19:21:27Z

pyrit/prompt_converter/pdf_converter.py

+            writer.add_page(page)
+
+        output_pdf = BytesIO()
+        writer.write(output_pdf)


Where is that written to?

The writer.write(output_pdf) method writes the contents of the PDF into the BytesIO object, output_pdf. A BytesIO object is an in-memory stream that mimics file-like behavior but stores its data in the system's RAM instead of writing it to disk.
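
A small standard-library sketch of that behavior follows. The function `write_report` is a hypothetical stand-in for `writer.write`; it only assumes that the target exposes a binary `write` method.

```python
from io import BytesIO
from typing import BinaryIO


def write_report(target: BinaryIO) -> None:
    """Hypothetical stand-in for writer.write(): accepts any binary file-like object."""
    target.write(b"example bytes")


output_pdf = BytesIO()
write_report(output_pdf)

# getvalue() returns the whole buffer regardless of the current stream position.
data = output_pdf.getvalue()
print(len(data))  # 13

# Nothing has touched the disk so far. To persist the result, write the bytes explicitly:
# with open("out.pdf", "wb") as f:
#     f.write(data)
```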

rdheekonda · 2025-01-17T19:35:49Z

tests/unit/converter/test_pdf_converter.py

+            font_size=12,
+            font_color=(0, 0, 0),
+        )
+    assert "y_pos exceeds page height" in str(excinfo.value)


Good Unit Test Coverage:

Having unit tests to cover edge cases for this error-prone functionality would be valuable.

Consider adding unit tests for the following scenarios:

- Empty injection items
- Handling of injection items with non-existent page numbers
- Non-standard fonts, once the code has been updated
- Injection on the last page
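
A few of the scenarios listed above could be checked along the following lines. This is a sketch under stated assumptions: only the message `y_pos exceeds page height` comes from the test in the diff, while `validate_injection_items`, the `page` key, and the other messages are hypothetical.

```python
def validate_injection_items(items: list, page_sizes: list) -> None:
    """Hypothetical validator: raise ValueError for items that cannot be placed."""
    for item in items:
        page = item.get("page", 0)  # the "page" key is an assumption
        if not 0 <= page < len(page_sizes):
            raise ValueError(f"page {page} does not exist")
        width, height = page_sizes[page]
        x = item.get("x", 10)
        y = item.get("y", 10)
        if x < 0 or x > width:
            raise ValueError("x_pos exceeds page width")
        if y < 0 or y > height:
            raise ValueError("y_pos exceeds page height")


def error_message(items: list, sizes: list) -> str:
    """Return the validation error message, or an empty string if validation passes."""
    try:
        validate_injection_items(items, sizes)
    except ValueError as exc:
        return str(exc)
    return ""


sizes = [(595, 842), (595, 842)]  # two pages, roughly A4 in points (assumed)

assert error_message([], sizes) == ""                                 # empty injection items
assert "does not exist" in error_message([{"page": 5}], sizes)        # non-existent page
assert error_message([{"page": 1, "x": 50, "y": 100}], sizes) == ""   # last page
assert "y_pos exceeds page height" in error_message([{"page": 0, "y": 900}], sizes)
```

In the actual test suite these checks would more naturally use `pytest.raises`, as the existing test with `excinfo` suggests.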

romanlutz · 2025-01-17T19:37:03Z

doc/code/converters/pdf_converter.py

+initialize_pyrit(memory_db_type=IN_MEMORY)
+
+dbdata_path = get_default_dbdata_path()
+cv_pdf_path = dbdata_path / "volkan_kutal_cv.pdf"  # Dynamically resolve the CV PDF path


We might have to add a basic PDF to the repo for this?

+1, under assets maybe

feat: Enhance PDFConverter with CV text injection and orchestrator example

b5ee6c3

KutalVolkan changed the title ~~[DRAFT] FEAT: Enhance PDFConverter with existing PDFs text injection~~ [DRAFT] FEAT: Enhance PDFConverter to support text injection into existing PDFs Jan 12, 2025

feat: added tests and dependency, clean up notebook demo, run successfully pre-commit hooks

a4bfd30

KutalVolkan changed the title ~~[DRAFT] FEAT: Enhance PDFConverter to support text injection into existing PDFs~~ FEAT: Enhance PDFConverter to support text injection into existing PDFs Jan 12, 2025

KutalVolkan marked this pull request as ready for review January 12, 2025 18:37

feat: add out-of-bounds validation for overlay injection and tests

47412c6

romanlutz reviewed Jan 17, 2025
