This package does not work for me... #7

Open
sch82812121 opened this issue Dec 30, 2013 · 6 comments

Comments

@sch82812121

Somehow, it does not seem to work with my directory layout (Ubuntu 10.04).
This seems to be a tesseract-related issue (Cuneiform seems to work)...

pdfocr -i beleg0059.pdf -o b59.pdf
Input file is /home/samba-shares/family/scans/beleg0059.pdf
Output file is /home/samba-shares/family/scans/b59.pdf
Using working dir /tmp/d20131230-26500-1fddng
Getting info from PDF file

Warning: no info dictionary found
NumberOfPages: 4

Converting 4 pages

Extracting page 1
Converting page 1 to ppm
Running OCR on page 1
read_variables_file:Can't open /usr/share/tesseract-ocr/tessdata/configs/1.hocrread_variables_file:Can't open /usr/share/tesseract-ocr/tessdata/configs/hocrerror: Could not find variable 'P6'
mv: cannot stat `1.hocr.html': No such file or directory

Error while running OCR on page 1

Extracting page 2
Converting page 2 to ppm
Running OCR on page 2
read_variables_file:Can't open /usr/share/tesseract-ocr/tessdata/configs/2.hocrread_variables_file:Can't open /usr/share/tesseract-ocr/tessdata/configs/hocrerror: Could not find variable 'P6'
mv: cannot stat `2.hocr.html': No such file or directory

Error while running OCR on page 2

Extracting page 3
Converting page 3 to ppm
Running OCR on page 3
read_variables_file:Can't open /usr/share/tesseract-ocr/tessdata/configs/3.hocrread_variables_file:Can't open /usr/share/tesseract-ocr/tessdata/configs/hocrerror: Could not find variable 'P6'
mv: cannot stat `3.hocr.html': No such file or directory

Error while running OCR on page 3

Extracting page 4
Converting page 4 to ppm
Running OCR on page 4
read_variables_file:Can't open /usr/share/tesseract-ocr/tessdata/configs/4.hocrread_variables_file:Can't open /usr/share/tesseract-ocr/tessdata/configs/hocrerror: Could not find variable 'P6'
mv: cannot stat `4.hocr.html': No such file or directory
Error while running OCR on page 4
Merging together PDF files
/tmp/d20131230-26500-1fddng/-new.pdf not found as file or resource.
Error: Failed to open PDF file:
/tmp/d20131230-26500-1fddng/
-new.pdf
Errors encountered. No output created.
Done. Input errors, so no output created.
Updating PDF info for /home/samba-shares/family/scans/b59.pdf
/tmp/d20131230-26500-1fddng/merged.pdf not found as file or resource.
Error: Failed to open PDF file:
/tmp/d20131230-26500-1fddng/merged.pdf
Errors encountered. No output created.
Done. Input errors, so no output created.
Cleaning up temporary files

@perrette

Hi, and thanks for sharing your code!

I have the same issue as @sch82812121, with a log such as:

Converting 4 pages
==========
Extracting page 1
Converting page 1 to ppm
Running OCR on page 1
Tesseract Open Source OCR Engine v3.03 with Leptonica
mv: cannot stat ‘1.hocr.html’: No such file or directory
Error while running OCR on page 1
==========

and so on for each of 4 pages. Maybe you have an idea where this could come from?
Best,
Mahé

@johanovic

I can confirm this bug on Ubuntu 14.04.

@ashwin

ashwin commented Jun 9, 2014

I'm on Ubuntu 14.04 and seeing the same error.

@xylo

xylo commented Jun 29, 2014

Same on my Ubuntu 14.04.

@hankschwie

Hi!

Tesseract version 3.03 no longer uses the .html extension for hOCR files; it uses .hocr instead. To fix this, edit the source code in pdfocr.rb so that line 336 looks like this:

sh "tesseract", "-l", language, basefn+'.ppm', basefn, "hocr"
and remove or comment out the next line

sh "mv", basefn+'.hocr.hocr', basefn+'.hocr'

However, an even better solution is found here: snowboard975@4d274c9
so long
hank
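
As an alternative to hard-coding one extension, a script could accept whichever file Tesseract actually produced. The following is only a sketch, not code from pdfocr or from the commit linked above: `normalize_hocr_output` is a hypothetical helper, and it assumes Tesseract was invoked with `basefn+'.hocr'` as the output base, so that the result is named either `<basefn>.hocr.html` (the name the `mv` in the logs looks for) or `<basefn>.hocr.hocr` (Tesseract 3.03, per the comment above).

```ruby
require 'fileutils'

# Hypothetical helper, not part of pdfocr.
# Assumption: tesseract was run with basefn + '.hocr' as its output base,
# so the hOCR file is either <basefn>.hocr.html or <basefn>.hocr.hocr.
# Renames whichever one exists to <basefn>.hocr and returns that path,
# or returns nil if no hOCR output can be found.
def normalize_hocr_output(basefn)
  target = basefn + '.hocr'
  candidates = [basefn + '.hocr.html', basefn + '.hocr.hocr']

  found = candidates.find { |path| File.file?(path) }
  if found
    FileUtils.mv(found, target)
    return target
  end

  # Nothing to rename: succeed only if the target is already in place.
  File.file?(target) ? target : nil
end
```

A caller could treat a `nil` return as the "Error while running OCR on page N" case instead of letting `mv` fail.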

@mmcraedhcu

Same on my Ubuntu 14.04

Wish this issue was titled something more relevant... This is an important fix
