no OCR after conversion (wrongly OCR'ed djvus?) #20
Comments
The two files have a common issue: an unrecognized type of text annotation. I added this new "region" list expression type, and both files now have text annotations. I originally put the corresponding warnings as debug messages (only viewable via ...).

One thing I've noticed regarding the second file ("Психология наровод и наций") is that it is very large (between 400 and 600 MiB, depending on the optimizations). This may cause performance issues with PDF viewers.

The converted text layer of the first file ("The Socialist System") is bonkers. The DjVu looks a little better, but is still weird.

As discussed previously, I don't know how to assign text to specified boxes in PDF files, which is what needs to be done here.
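To make the "region" point concrete, here is a minimal sketch of walking a DjVu text layer in the s-expression form printed by `djvused -e print-txt`. It is not dpsprep's actual code: `parse_sexpr` and `iter_text` are hypothetical names, the sample page is invented, and escape handling is simplified (octal escapes are not decoded). The zone types listed are the ones documented for djvused.

```python
import re

# Zone types documented for the DjVu hidden text layer, "region" included.
ZONE_TYPES = {"page", "column", "region", "para", "line", "word", "char"}

# Tokens: parentheses, double-quoted strings, or bare symbols/numbers.
TOKEN_RE = re.compile(r'\(|\)|"(?:\\.|[^"\\])*"|[^\s()"]+')


def parse_sexpr(text):
    """Parse a single s-expression into nested lists of str and int."""
    tokens = TOKEN_RE.findall(text)
    pos = 0

    def read():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        if tok == "(":
            items = []
            while tokens[pos] != ")":
                items.append(read())
            pos += 1  # consume the closing parenthesis
            return items
        if tok.startswith('"'):
            # Simplified: drop the backslash of one-character escapes only.
            return re.sub(r"\\(.)", r"\1", tok[1:-1])
        if re.fullmatch(r"-?\d+", tok):
            return int(tok)
        return tok  # a bare symbol such as "page" or "word"

    return read()


def iter_text(zone):
    """Yield (zone_type, text, box) for every zone that carries text."""
    kind, x0, y0, x1, y1, *children = zone
    if kind not in ZONE_TYPES:
        raise ValueError(f"unknown zone type: {kind!r}")
    text = "".join(c for c in children if isinstance(c, str))
    if text:
        yield kind, text, (x0, y0, x1, y1)
    for child in children:
        if isinstance(child, list):
            yield from iter_text(child)  # descend into nested zones


# Invented sample: a page whose lines are wrapped in a "region" zone.
sample = (
    '(page 0 0 2550 3300 (region 100 200 2400 3100 '
    '(line 100 3000 900 3050 (word 100 3000 400 3050 "Hello") '
    '(word 450 3000 900 3050 "world"))))'
)
words = list(iter_text(parse_sexpr(sample)))
```

A walker that recurses through every known zone type, rather than matching only specific ones, keeps finding the words when an intermediate level such as `region` appears.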
Yes, the first file (Kornai) has a rubbish text layer after the dpsprep update. It may be that it was OCR'ed in a vertical orientation and the pages were later rotated to horizontal, while the coordinates of the text layer somehow remained the same. That is how it looks to me when I inspect the DjVu text layer: it is obviously vertical. I have seen several such files. I re-OCR'ed it.

The situation is better with the second file (Психология): I can copy text from the converted PDF just as I can from the DjVu. So this update improved the conversion :). Thank you.

Some files do become very big after conversion, that's true, but quite rarely. DjVu's compression algorithms are still better than PDF's. My converted second file is 308 MB. I'm using options ...
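The rotation hypothesis can be checked roughly. The sketch below assumes DjVu-style coordinates (origin at the bottom-left, y pointing up) and word boxes given as `(x0, y0, x1, y1)`; `rotate_box_ccw` and `looks_vertical` are hypothetical helpers, and the 50% threshold is an arbitrary choice, not something taken from dpsprep or DjVuLibre.

```python
def rotate_box_ccw(box, page_height):
    """Map a box to where it lands after the page is rotated 90 degrees
    counter-clockwise. A point (x, y) moves to (page_height - y, x)."""
    x0, y0, x1, y1 = box
    return (page_height - y1, x0, page_height - y0, x1)


def looks_vertical(boxes, threshold=0.5):
    """Heuristic: flag a text layer as rotated when most word boxes are
    taller than they are wide."""
    if not boxes:
        return False
    tall = sum(1 for x0, y0, x1, y1 in boxes if (y1 - y0) > (x1 - x0))
    return tall / len(boxes) > threshold


# A wide, short word near the top-left of a 1000 x 2000 page...
upright = (100, 1900, 400, 1950)
# ...becomes a tall, narrow box near the bottom-left once the page is rotated.
rotated = rotate_box_ccw(upright, page_height=2000)
```

If most word boxes of a page come out taller than wide, the text layer was probably produced for a different page orientation, and re-OCR is likely the simpler fix.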
This issue continues the part of #16 about OCR, but with other files.
Two files. The first is the Kornai file. I can copy text correctly from the DjVu file in DjVu4, but not in Okular, where I also can't see the blue text boxes. Evince lets me see the boxes and copy text (even correctly), but the selection looks very strange: the boxes have the wrong orientation and placement (I was copying the first paragraph).
There is no OCR after conversion. Something is wrong with the DjVu file; I doubt that it can be solved without re-OCR.
File 2.djvu has a correct OCR layer (with many mistakes, but that shouldn't matter, I think) that can be seen in Okular and other viewers, and I can copy text from them correctly. Yet there is no OCR after conversion. This case is stranger, because the DjVu file seems normal.