Ready To Train PDF (PDF-RTT)

This project is a Python-based PDF preprocessing tool. It provides various operations such as removing headers and footers, marking bounding boxes, removing tables, excluding lines, and saving the result as HTML or TXT.
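To make the header and footer removal step more concrete, here is a minimal sketch of one way repeated page headers could be detected: lines near the top of the first page are kept as header candidates if a similar line appears near the top of every other page. This is an illustration only, not the project's implementation; the function name `repeated_top_lines`, its `threshold` parameter, and the sample pages are hypothetical.

```python
from difflib import SequenceMatcher


def repeated_top_lines(pages, max_lines=10, threshold=0.9):
    """Return lines from the top of the first page that recur on every other page.

    `pages` is a list of pages, each given as a list of text lines.
    """
    if len(pages) < 2:
        return []
    repeated = []
    for candidate in pages[0][:max_lines]:
        # Keep the candidate only if each other page has a similar line
        # within its own first `max_lines` lines.
        if all(
            any(
                SequenceMatcher(None, candidate, line).ratio() >= threshold
                for line in page[:max_lines]
            )
            for page in pages[1:]
        ):
            repeated.append(candidate)
    return repeated


# Hypothetical sample input: three pages sharing the same first line.
pages = [
    ["ACME Report 2023", "Introduction", "This document describes..."],
    ["ACME Report 2023", "Methods", "We collected..."],
    ["ACME Report 2023", "Results", "The model reached..."],
]
headers = repeated_top_lines(pages, max_lines=2)
print(headers)
```

The same idea applies to footers by comparing the last lines of each page instead of the first.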

Getting Started

These instructions will get you a copy of the project up and running on your local machine for development and testing purposes.

Prerequisites

You need to have Python installed on your machine. You can download Python from python.org.

Installing

Clone the repository to your local machine:

git clone git@github.com:LordWaif/pdf-rtt.git

Install the Poppler build dependencies and download the spaCy Portuguese models:

apt install build-essential libpoppler-cpp-dev pkg-config python3-dev
python -m spacy download pt
python -m spacy download pt_core_news_sm

Install the required packages:

pip install -r requirements.txt
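After installation, a quick check like the one below can confirm that the Python-side dependencies are importable. It relies on the fact that spaCy models are installed as regular Python packages. The helper name `missing_modules` is hypothetical, and the sketch does not verify the Poppler system libraries.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# An empty list means both spaCy and the Portuguese model are installed.
print(missing_modules(["spacy", "pt_core_news_sm"]))
```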

Usage

The main functionality of the project is encapsulated in the preprocess_pdf function. Here is a basic usage example:

from preprocesser import preprocess_pdf

# file, out, html, txt, and pages are placeholders; define them before calling.
preprocess_pdf(
    file,
    isBbox=True,
    out_file_bbox=out,
    out_path_html=html,
    out_path_txt=txt,
    pages=pages,
    min_chain=5,
    max_lines_header=10,
    max_lines_footer=10,
    cross_similarities_header=False,
    cross_similarities_footer=True,
    verbose=True,
    slice_window=3
)
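When processing several files, it can help to derive the output paths from each input name. The sketch below assumes that `out_file_bbox`, `out_path_html`, and `out_path_txt` accept filesystem path strings and that the bounding-box output is a PDF; neither is confirmed here. The helper `output_kwargs` and the `_bbox.pdf` naming scheme are hypothetical.

```python
from pathlib import Path


def output_kwargs(pdf_path, out_dir):
    """Build output-path keyword arguments for one input PDF (assumed to be path strings)."""
    stem = Path(pdf_path).stem
    out_dir = Path(out_dir)
    return {
        "out_file_bbox": str(out_dir / f"{stem}_bbox.pdf"),  # hypothetical naming
        "out_path_html": str(out_dir / f"{stem}.html"),
        "out_path_txt": str(out_dir / f"{stem}.txt"),
    }


kwargs = output_kwargs("docs/report.pdf", "out")
print(kwargs)
# Hypothetical call, with preprocess_pdf imported as in the example above:
# preprocess_pdf("docs/report.pdf", isBbox=True, **kwargs)
```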

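API

Run the API server with uvicorn and start a Celery worker for the rtt_queue queue:

uvicorn service:app --reload --port 19002 --host 0.0.0.0
celery -A celery_app worker --pool=threads --loglevel=info -Q rtt_queue

The services can also be started with Docker Compose:

docker-compose -p rtt_services up -d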

License

This project is licensed under the MIT License. See the LICENSE file for details.
