Skip to content

seayoung1112/PageNLPParser

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

8 Commits
 
 
 
 
 
 
 
 

Repository files navigation

Workflow to parse and write books to database:

1 (optional) convert pdf to txt:

install pdfminer (python module) run pdf2txt.py -o output_file original_pdf_file

2 run preprocess.py to get a .pre file

the format of preprocessed file will be each paragraph separated by one blank line (two \n\n) and each chapter title will be a single paragraph with a [--TITLE--] tag

open this file with an text editor with grammer checking and correct possible mistakes (depending on the quality of source)

3 run PageParser class (Java)

the text will be divided to sentences using NLP then be paginated according to a word count setting. the final format stored in mongodb will be :

Page: {index: int, title: string(optional), book: string, sentencese: array}

and blank string will appear in sentencese as a separator between paragraphs

About

an text nlp parser for learnbyread project

Resources

Stars

Watchers

Forks

Packages

No packages published