Support Python 3 #39
Comments
This issue has become worse, as Ubuntu 20.04 specifically does not offer pip for python anymore, so even "manual" non-package installation is becoming very difficult. |
@jwilk, you added the "wontfix" label. Would you accept pull requests which replace Python2 by Python3 support? |
Great! |
Fixed by pull request |
A quick test is OK, thanks. |
@bastien-roucaries In my case, I could not extract the hocr from a djvu file using djvu2hocr. It complained that the argument to write was bytes instead of string. Note that the method encode converts string to bytes in the given encoding. I had to make the following modifications to '''lib/cli/djvu2hocr.py''': At line 331, replace At line 345, replace At line 277, replace |
@bastien-roucaries yes, @Dominic-Mayers's changes are needed to work around an error. Now it works fine on my Debian machine. This is the error I got before applying them:
> ~/src/py/ocrodjvu$ djvu2hocr ~/99tech.djvu
Converting /home/farid/fin/stock/books/murphy/99tech.djvu:
Traceback (most recent call last):
File "/usr/local/bin/djvu2hocr", line 26, in <module>
cli.main(sys.argv)
File "/usr/local/share/ocrodjvu/lib/cli/djvu2hocr.py", line 331, in main
sys.stdout.write(hocr_header.encode('UTF-8'))
TypeError: write() argument must be str, not bytes |
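The traceback above comes from Python 3's text-mode `sys.stdout`, which accepts only `str`. A minimal sketch of the two usual ways around it follows. The `hocr_header` value and the in-memory stand-in for `sys.stdout` are assumptions for illustration, not the actual ocrodjvu code:

```python
import io

# Hypothetical stand-in for the header string built in djvu2hocr.py.
hocr_header = '<?xml version="1.0" encoding="UTF-8"?>\n'

def write_text(stream, text):
    """Write str to a text stream (what sys.stdout is in Python 3)."""
    stream.write(text)

def write_bytes(stream, text, encoding='UTF-8'):
    """Write encoded bytes to the binary buffer underneath a text stream."""
    stream.buffer.write(text.encode(encoding))
    stream.buffer.flush()

# Stand-in for sys.stdout: a text wrapper around an in-memory binary buffer.
raw = io.BytesIO()
fake_stdout = io.TextIOWrapper(raw, encoding='UTF-8', newline='')

write_text(fake_stdout, hocr_header)   # option 1: drop the .encode() call
fake_stdout.flush()
write_bytes(fake_stdout, hocr_header)  # option 2: keep bytes, use .buffer

try:
    fake_stdout.write(hocr_header.encode('UTF-8'))  # the failing pattern
except TypeError as exc:
    error_message = str(exc)
```

With the real `sys.stdout`, option 1 relies on the locale encoding of the terminal, while option 2 forces UTF-8 regardless of locale, which may matter when the output is piped to another tool.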
I confirm. FYI, I tried to convert the resulting hOCR with hocr2djvused and got |
I saw four forks with a Python 3 conversion and merged the successful parts. The remaining issues are string/bytes issues with the optional ocrad and gocr. I guess something has to be done with TextIOWrapper in common.py to adapt the output of tesseract.py, cuneiform.py, ocrad.py and gocr.py. You can see the remaining issues with
or more specifically:
|
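The `TextIOWrapper` idea mentioned above could look roughly like this. It is only a sketch: the helper name `read_engine_output` is hypothetical, a child Python process stands in for an OCR engine, and nothing here reflects how `common.py` is actually structured:

```python
import io
import subprocess
import sys

def read_engine_output(argv, encoding='UTF-8'):
    """Run a command and return its stdout decoded as text."""
    proc = subprocess.Popen(argv, stdout=subprocess.PIPE)
    # proc.stdout is a binary pipe in Python 3; wrap it to read str.
    with io.TextIOWrapper(proc.stdout, encoding=encoding) as text_stream:
        text = text_stream.read()
    proc.wait()
    return text

# Stand-in command: a child Python process printing one line.
output = read_engine_output([sys.executable, '-c', 'print("hello")'])
```

Decoding once at the process boundary keeps the rest of the code working on `str`, so individual engine modules would not need their own `encode`/`decode` calls.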
I made the tests for gocr and ocrad work as well. For the gocr output I used BytesIO instead of StringIO. |
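Why `BytesIO` helps here can be sketched with made-up sample data (not real gocr output): in Python 3, `StringIO` accepts only `str` and `BytesIO` only bytes, so an in-memory buffer has to match the type in which the engine output arrives.

```python
import io

# Made-up bytes standing in for raw output captured from an OCR engine.
engine_output = b'<page>\n<line>caf\xc3\xa9</line>\n</page>\n'

# BytesIO accepts bytes and can be iterated line by line.
buffer = io.BytesIO(engine_output)
lines = [line.decode('UTF-8').rstrip('\n') for line in buffer]

# StringIO rejects bytes, which is the typical Python 2 to 3 pitfall.
try:
    io.StringIO(engine_output)
    stringio_accepts_bytes = True
except TypeError:
    stringio_accepts_bytes = False
```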
We should probably try to get it working on Python 3.10 as well: |
In the @rmast fork, GitHub Actions and some upstream changes have not been integrated yet, and issues are disabled (which is the default for forks). With the planned upstream retirement (#46) of both ocrodjvu and python-djvulibre, it seems like this partly active fork might remain the best choice. I still regularly use both didjvu and ocrodjvu (although only on Manjaro Linux with Python 2, which causes more and more pain when updating packages), so I thought about actually using the fork. It would probably still need some work to incorporate the upstream changes, as well as some modifications to make it compatible with the latest Python versions (as I already did for didjvu, although this Python 3 fork might become obsolete as well in the far future, due to gamera4 relying on the deprecated distutils package). While I can imagine maintaining at least a basic version of ocrodjvu as well, I am not familiar with most of the underlying stuff at the moment. |
I forked some stuff to protect it from a maintainer taking it offline. I don't know if I will have enough time to be the main maintainer of those forks. I have no experience with GitHub Actions. We could try to do it together. I spent my last summer holiday improving the MRC compression of ocrmypdf using the djvu tricks from these jwilk repos, using not only tesseract but also easyocr for segmenting details of text parts into the foreground. Unfortunately my first proof of concept was delayed by struggles with cython and memory management in custom otsu histograms, so my holiday was over before the POC was live. By the way, didjvu is not mentioned in the jwilk retirement message.
I'm not an active user of any of these tools, only melancholic about losing useful algorithms that were thought out earlier for MRC compression. As with didjvu, where you did most of the migration, I can try to fix things that might confuse you, but be prepared to drop lots of functionality you don't use yourself, unless someone else claims to still be using it. I tend to bring comparable open source functionality into similar PDF MRC compression. PDF is what I use when I scan a document and share it with my peers. Gamera4 still has minimal maintenance, so distutils might still be taken care of. I was able to revive a functional pip installer for Python 2.7, as the main pip download doesn't support 2.7 anymore. |
I read your issues in the Gamera-4 repo. There are more issues, and support might be dropped there as well, mostly due to Python being a moving target, just as with these jwilk repos. We might inspect the dependencies of didjvu on Gamera-4. As far as I have noticed so far, the main dependency is on the djvu binarizer. Do you actively use any of the other binarizers? My effort this summer was giving life to yet another binarizer, based on otsu of easyocr segments. |
There is GitHub Actions support in this upstream repository now, so this might be limited to mostly copy-and-paste, although some changes are required (see my didjvu fork for example). I might have a look at it and might decide to "modernize" the code as I did for didjvu, in case I find enough time to do so.
I am aware of that.
I use both didjvu and ocrodjvu on a regular basis at the moment - and I might keep maintaining at least the bits which I actually use as far as I am able to. While I am rather familiar with Python development, the actual DJVU and image processing stuff is something I only have a rough understanding of.
If you look at the corresponding issue there, the future is not really clear. I just started fixing some deprecated stuff to test Python 3.11 compatibility, but especially with distutils the migration path for some functionality is not even clear to the upstream developers.
I just looked through the code of didjvu: It seems like the only important imports are from |
I just did my first real test with the aggregated Python3 port. Apart from the fact that the |
Nice.
We did those upgrade activities to make those repos survive the deprecation of python 2.7, and I'm glad they do.
|
The main reason for retiring, for example, the jwilk and Gamera repos would be that Python is a moving target, which is too tedious to follow.
I wonder whether converting them to a more stable language would help preserve them.
It's probably too much of an effort right now, but there are AI solutions for translating Python to C++ or Java nowadays:
https://morioh.com/p/81aa0e33b28a
|
This is still a matter of taste and of the actual code base. If the code has been modernized, there should not be any real issues for plain Python code. The biggest problems mostly arise from Python 2 code which has been made compatible with Python 3, but never actually modernized. In my experience, Python 3 tends to be rather stable, except that its C/C++ APIs might change (as we see for gamera-4). For this reason, maintaining Python code should mostly be easy enough. |
Thanks so much for this excellent software! I have been using it for years to run OCR on scans and it has never failed me.
Would it be possible to add Python 3 support? Unfortunately, Python 2 development has been officially frozen and Python 2 will no longer receive updates: https://www.python.org/doc/sunset-python-2/