Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

FEAT: Enhance PDFConverter to support text injection into existing PDFs #641

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

KutalVolkan
Copy link
Contributor

@KutalVolkan KutalVolkan commented Jan 12, 2025

Overview:
This PR introduces enhancements to the PDF Converter, enabling direct modification of existing PDFs. It builds upon the existing functionality, supporting both template-based and non-template-based PDF generation while adding new capabilities for precise text injection.

Key Features:

  1. Modify Existing PDFs:

    • Inject text at specified coordinates on specific pages of an existing PDF e.g,. CV
    • Customize text with font, size, and color options.
  2. Backward Compatibility:

    • Maintains support for template-based PDF generation using SeedPrompt.
    • Retains non-templated PDF generation from plain string prompts.

Related Issues

Next Steps (DONE):

  • Update tests to cover the new functionality.
  • Add pypdf==5.1.0 to the pyproject.toml file to ensure the correct dependency is installed for the new feature.
  • Clean up the demo notebook and make the path to the CV file more generic.

@KutalVolkan KutalVolkan changed the title [DRAFT] FEAT: Enhance PDFConverter with existing PDFs text injection [DRAFT] FEAT: Enhance PDFConverter to support text injection into existing PDFs Jan 12, 2025
@KutalVolkan KutalVolkan changed the title [DRAFT] FEAT: Enhance PDFConverter to support text injection into existing PDFs FEAT: Enhance PDFConverter to support text injection into existing PDFs Jan 12, 2025
@KutalVolkan KutalVolkan marked this pull request as ready for review January 12, 2025 18:37

from fpdf import FPDF
from pypdf import PdfReader, PdfWriter, PageObject
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we have to update pyproject.toml to add this dependency and remove fpdf

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hello Roman,

Depending on the features we want, we need both dependencies, pypdf is used for reading and modifying existing PDFs (e.g., CVs), while fpdf is for creating PDFs from scratch. This allows us to create custom templates, as mentioned in issue.

By keeping both, we would support:

  • Templated PDF Generation: YAML-based templates with SeedPrompt for dynamic data injection.
  • Non-Templated PDF Generation: Directly converting plain string prompts into PDFs.
  • Reading and Modifying PDFs: Supporting existing PDFs and post-creation edits.

I'd prefer to keep both dependencies since it allows us to handle not only CVs but also generate custom cover letters from scratch. It enables us to test injecting malicious input into both CVs and cover letters.

For example, we could let an LLM generate cover letters for us and then test how injecting malicious inputs into these cover letters might trick an AI recruiter, not just through CV injections but also through targeted manipulations of the cover letter.

Wdyt?

Comment on lines +174 to +175
x = item.get("x", 10)
y = item.get("y", 10)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is 10 just an arbitrary location you chose or does it mean something?

Copy link
Contributor Author

@KutalVolkan KutalVolkan Jan 19, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The 10 is a default offset to avoid placing text at the edge of the page (0,0). It ensures better readability and positioning. While it serves a practical purpose, the choice of 10 is somewhat arbitrary and can be adjusted as needed.

reader = PdfReader(self._existing_pdf)
writer = PdfWriter()

for page_number, page in enumerate(reader.pages):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some readers just read content once and then essentially don't keep the content in memory. Does this one work that way? If so, this will stop working if you call it twice. Just asking since it's not obvious to me.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PdfReader loads the cross-reference table and essential metadata into memory during initialization, which allows for repeated access to reader.pages .

This behavior was verified through testing, which confirmed that reader.pages provides consistent results across multiple accesses.

def test_pdf_reader_repeated_access():
    # Create a simple PDF in memory
    pdf = FPDF()
    pdf.add_page()
    pdf.set_font("Arial", size=12)
    pdf.cell(200, 10, txt="Test Content", ln=True)
    pdf_bytes = BytesIO()
    pdf.output(pdf_bytes)
    pdf_bytes.seek(0)

    # Use PdfReader to read the PDF
    reader = PdfReader(pdf_bytes)
    first_access = reader.pages[0].extract_text()
    second_access = reader.pages[0].extract_text()

    # Assertions to verify behavior
    assert first_access == second_access, "Repeated access should return consistent data"

writer.add_page(page)

output_pdf = BytesIO()
writer.write(output_pdf)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

where is that written to?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The writer.write(output_pdf) method writes the contents of the PDF into the BytesIO object, output_pdf. A BytesIO object is an in-memory stream that mimics file-like behavior but stores its data in the system's RAM instead of writing it to disk.

font_size=12,
font_color=(0, 0, 0),
)
assert "y_pos exceeds page height" in str(excinfo.value)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good Unit Test Coverage:

Having unit tests to cover edge cases for this error-prone functionality would be valuable.

Consider adding unit tests for the following scenarios:

  • Empty Injection Items

  • Handling of injection items with non-existent page numbers

  • Non-standard fonts, once the code has been updated

  • Injection on the last page

initialize_pyrit(memory_db_type=IN_MEMORY)

dbdata_path = get_default_dbdata_path()
cv_pdf_path = dbdata_path / "volkan_kutal_cv.pdf" # Dynamically resolve the CV PDF path
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We might have to add a basic PDF to the repo for this?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, under assets maybe

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants