This repository has been archived by the owner on Oct 3, 2022. It is now read-only.
Thank you very much for ocrodjvu. I am using ocrodjvu with the options --engine=tesseract -l deu. Versions are:
tesseract: 3.02
ocrodjvu: 0.7.16
With the attached page I get the following exception:
/usr/share/ocrodjvu/lib/hocr.py:435: EncodingWarning: byte 0x10 in position 25317: control character
  contents = utils.sanitize_utf8(contents)
Exception while processing page 1:
Traceback (most recent call last):
  File "/usr/share/ocrodjvu/lib/cli/ocrodjvu.py", line 418, in page_thread
    result = self.process_page(page)
  File "/usr/share/ocrodjvu/lib/cli/ocrodjvu.py", line 401, in process_page
    page_size=size
  File "/usr/share/ocrodjvu/lib/engines/tesseract.py", line 271, in extract_text
    return self._hocr.extract_text(stream, **kwargs)
  File "/usr/share/ocrodjvu/lib/hocr.py", line 473, in extract_text
    scan_result = scan(doc.find('/body'), settings)
  File "/usr/share/ocrodjvu/lib/hocr.py", line 374, in scan
    for zone in _scan(node, settings, settings.page_size):
  File "/usr/share/ocrodjvu/lib/hocr.py", line 239, in _scan
    return get_children(node)
  File "/usr/share/ocrodjvu/lib/hocr.py", line 198, in get_children
    result += _scan(child, settings, page_size)
  File "/usr/share/ocrodjvu/lib/hocr.py", line 257, in _scan
    children = get_children(node)
  File "/usr/share/ocrodjvu/lib/hocr.py", line 198, in get_children
    result += _scan(child, settings, page_size)
  File "/usr/share/ocrodjvu/lib/hocr.py", line 257, in _scan
    children = get_children(node)
  File "/usr/share/ocrodjvu/lib/hocr.py", line 198, in get_children
    result += _scan(child, settings, page_size)
  File "/usr/share/ocrodjvu/lib/hocr.py", line 257, in _scan
    children = get_children(node)
  File "/usr/share/ocrodjvu/lib/hocr.py", line 198, in get_children
    result += _scan(child, settings, page_size)
  File "/usr/share/ocrodjvu/lib/hocr.py", line 285, in _scan
    raise errors.MalformedHocr("character zones intermixed with non-character zones")
MalformedHocr: Malformed hOCR document: character zones intermixed with non-character zones
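The EncodingWarning at the top of the log reports a control byte (0x10) in the hOCR that Tesseract produced. For anyone trying to narrow this down, the following is a minimal diagnostic sketch in Python 3 for locating such bytes in a saved hOCR file. It is not ocrodjvu's `utils.sanitize_utf8`; the helper names `find_control_bytes` and `strip_control_bytes` and the sample input are hypothetical, and whether the control byte is related to the MalformedHocr error is not established by the log.

```python
# Hypothetical diagnostic helpers; not part of ocrodjvu or Tesseract.
# In UTF-8, bytes below 0x20 never occur inside multi-byte sequences,
# so scanning the raw bytes for them is safe.

ALLOWED = {0x09, 0x0A, 0x0D}  # tab, line feed, carriage return


def find_control_bytes(data: bytes):
    """Return (offset, byte value) pairs for C0 control bytes in data."""
    return [(i, b) for i, b in enumerate(data) if b < 0x20 and b not in ALLOWED]


def strip_control_bytes(data: bytes) -> bytes:
    """Return data with C0 control bytes (except tab, LF, CR) removed."""
    return bytes(b for b in data if b >= 0x20 or b in ALLOWED)


if __name__ == "__main__":
    # Made-up hOCR fragment containing a stray 0x10 byte.
    sample = b"<span class='ocrx_word'>Wort\x10</span>\n"
    for offset, value in find_control_bytes(sample):
        print("byte 0x%02x in position %d" % (value, offset))
    print(strip_control_bytes(sample))
```

Running the first helper over the raw Tesseract hOCR output for the attached page would show which word or element carries the stray byte, which may help in building a smaller test case.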
jwilk changed the title from "Crash with tesseract" to "Tesseract: 3.02: Malformed hOCR document: character zones intermixed with non-character zones" on Feb 11, 2019
Issue reported by anonymous at Bitbucket:
Attachment: t-p-086.pgm.djvu.zip