-
Notifications
You must be signed in to change notification settings - Fork 398
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
FEAT: Enhance PDFConverter to support text injection into existing PDFs #641
base: main
Are you sure you want to change the base?
FEAT: Enhance PDFConverter to support text injection into existing PDFs #641
Conversation
…fully pre-commit hooks
|
||
from fpdf import FPDF | ||
from pypdf import PdfReader, PdfWriter, PageObject |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we have to update pyproject.toml to add this dependency and remove fpdf
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hello Roman,
Depending on the features we want, we need both dependencies, pypdf
is used for reading and modifying existing PDFs (e.g., CVs), while fpdf
is for creating PDFs from scratch. This allows us to create custom templates, as mentioned in issue.
By keeping both, we would support:
- Templated PDF Generation: YAML-based templates with SeedPrompt for dynamic data injection.
- Non-Templated PDF Generation: Directly converting plain string prompts into PDFs.
- Reading and Modifying PDFs: Supporting existing PDFs and post-creation edits.
I'd prefer to keep both dependencies since it allows us to handle not only CVs but also generate custom cover letters from scratch. It enables us to test injecting malicious input into both CVs and cover letters.
For example, we could let an LLM generate cover letters for us and then test how injecting malicious inputs into these cover letters might trick an AI recruiter, not just through CV injections but also through targeted manipulations of the cover letter.
Wdyt?
x = item.get("x", 10) | ||
y = item.get("y", 10) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is 10 just an arbitrary location you chose or does it mean something?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The 10
is a default offset to avoid placing text at the edge of the page (0,0)
. It ensures better readability and positioning. While it serves a practical purpose, the choice of 10
is somewhat arbitrary and can be adjusted as needed.
reader = PdfReader(self._existing_pdf) | ||
writer = PdfWriter() | ||
|
||
for page_number, page in enumerate(reader.pages): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some readers just read content once and then essentially don't keep the content in memory. Does this one work that way? If so, this will stop working if you call it twice. Just asking since it's not obvious to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The PdfReader
loads the cross-reference table and essential metadata into memory during initialization, which allows for repeated access to reader.pages
.
This behavior was verified through testing, which confirmed that reader.pages
provides consistent results across multiple accesses.
def test_pdf_reader_repeated_access():
# Create a simple PDF in memory
pdf = FPDF()
pdf.add_page()
pdf.set_font("Arial", size=12)
pdf.cell(200, 10, txt="Test Content", ln=True)
pdf_bytes = BytesIO()
pdf.output(pdf_bytes)
pdf_bytes.seek(0)
# Use PdfReader to read the PDF
reader = PdfReader(pdf_bytes)
first_access = reader.pages[0].extract_text()
second_access = reader.pages[0].extract_text()
# Assertions to verify behavior
assert first_access == second_access, "Repeated access should return consistent data"
writer.add_page(page) | ||
|
||
output_pdf = BytesIO() | ||
writer.write(output_pdf) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
where is that written to?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The writer.write(output_pdf)
method writes the contents of the PDF into the BytesIO
object, output_pdf
. A BytesIO
object is an in-memory stream that mimics file-like behavior but stores its data in the system's RAM instead of writing it to disk.
font_size=12, | ||
font_color=(0, 0, 0), | ||
) | ||
assert "y_pos exceeds page height" in str(excinfo.value) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good Unit Test Coverage:
Having unit tests to cover edge cases for this error-prone functionality would be valuable.
Consider adding unit tests for the following scenarios:
-
Empty Injection Items
-
Handling of injection items with non-existent page numbers
-
Non-standard fonts, once the code has been updated
-
Injection on the last page
initialize_pyrit(memory_db_type=IN_MEMORY) | ||
|
||
dbdata_path = get_default_dbdata_path() | ||
cv_pdf_path = dbdata_path / "volkan_kutal_cv.pdf" # Dynamically resolve the CV PDF path |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We might have to add a basic PDF to the repo for this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1, under assets
maybe
Overview:
This PR introduces enhancements to the PDF Converter, enabling direct modification of existing PDFs. It builds upon the existing functionality, supporting both template-based and non-template-based PDF generation while adding new capabilities for precise text injection.
Key Features:
Modify Existing PDFs:
Backward Compatibility:
SeedPrompt
.Related Issues
Next Steps (DONE):
pypdf==5.1.0
to thepyproject.toml
file to ensure the correct dependency is installed for the new feature.