GitHub - seayoung1112/PageNLPParser: a text NLP parser for the learnbyread project

Workflow to parse books and write them to the database:

1 (optional) convert pdf to txt:

Install pdfminer (a Python module), then run pdf2txt.py -o output_file original_pdf_file

2 run preprocess.py to get a .pre file

In the preprocessed file, paragraphs are separated by one blank line (two newline characters, \n\n), and each chapter title is a paragraph of its own marked with a [--TITLE--] tag.

Open this file in a text editor with grammar checking and correct any mistakes (how many there are depends on the quality of the source).
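The .pre format described in this step can be read back with a few lines of Python. This is only a sketch, not the code in preprocess.py: the helper name read_pre is hypothetical, and since the position of the [--TITLE--] tag inside a title paragraph is not specified here, the sketch assumes the tag may appear anywhere in the paragraph and simply strips it.

```python
TITLE_TAG = "[--TITLE--]"


def read_pre(text):
    """Split .pre text into (is_title, paragraph) tuples.

    Hypothetical helper: assumes paragraphs are separated by a blank
    line and that a title paragraph contains TITLE_TAG somewhere.
    """
    items = []
    for block in text.split("\n\n"):
        # Collapse line breaks and repeated spaces inside one paragraph.
        para = " ".join(block.split())
        if not para:
            continue  # skip empty blocks caused by extra blank lines
        is_title = TITLE_TAG in para
        if is_title:
            para = para.replace(TITLE_TAG, "").strip()
        items.append((is_title, para))
    return items
```

Each tuple tells a later stage whether the paragraph is a chapter title or body text.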

3 run PageParser class (Java)

The text is split into sentences using NLP and then paginated according to a word-count setting. The final format stored in MongoDB is:

Page: {index: int, title: string(optional), book: string, sentencese: array}

A blank string appears in sentencese as a separator between paragraphs.
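The pagination step can be illustrated in Python, although the real implementation is the Java PageParser class. Everything below is an assumption made for illustration: the function names, the max_words parameter, the regex used as a naive stand-in for the NLP sentence splitter, the choice to start a new page at each chapter title, and the choice to let a page break fall between sentences of the same paragraph. Only the Page fields (index, title, book, sentencese) and the blank-string separator come from the description above.

```python
import re


def split_sentences(paragraph):
    # Naive stand-in for the NLP sentence splitter used by the Java code.
    return [s for s in re.split(r"(?<=[.!?])\s+", paragraph.strip()) if s]


def paginate(items, book, max_words=300):
    """Turn (is_title, paragraph) tuples into Page-shaped dictionaries."""
    pages = []
    current = {"index": 0, "book": book, "sentencese": []}
    words = 0

    def flush():
        nonlocal current, words
        # Drop a trailing paragraph separator before closing the page.
        while current["sentencese"] and current["sentencese"][-1] == "":
            current["sentencese"].pop()
        if current["sentencese"]:
            pages.append(current)
        current = {"index": len(pages), "book": book, "sentencese": []}
        words = 0

    for is_title, para in items:
        if is_title:
            # Assumption: a chapter title always starts a new page.
            flush()
            current["title"] = para
            continue
        for sentence in split_sentences(para):
            n = len(sentence.split())
            if words and words + n > max_words:
                flush()
            current["sentencese"].append(sentence)
            words += n
        # Blank string separates this paragraph from the next one.
        current["sentencese"].append("")
    flush()
    return pages
```

The title key is present only on pages that open a chapter, which matches the optional title field in the Page format.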
