Skip to content

Commit

Permalink
(Obsidian) added notes re OCR (pre-)processing
Browse files Browse the repository at this point in the history
# Conflicts:
#	MuPDF
  • Loading branch information
GerHobbelt committed Nov 24, 2023
1 parent c8bb978 commit 4d3246f
Showing 1 changed file with 12 additions and 1 deletion.
Original file line number Diff line number Diff line change
Expand Up @@ -6,5 +6,16 @@

and, IFF possible:

3. *hopefully* add the ability to manipulate these behaviours through user-provided *scripting* for customized behaviours for individual input files.
3. *hopefully* add the ability to manipulate these behaviours through user-provided *scripting* for customized behaviours for individual input files. This SHOULD include arbitrary *page image* preprocessing flows, inspired on the algorithms included in `unpaper` and `libprecog` (a.k.a. `PRLib`).

The key to this (scriptable) flow is that image (pre)processing is just not a single forward movement, but should allow for a preprocessing *graph* to be set up so we can create masks, etc., which are then to be applied to later stages in hat preprocessing graph

1. Also add the feature to tesseract to provide a separate image for the *segmentation phase* in that codebase and/or override that phase entirely by allowing external processes (or the preprocessor) to deliver a list of segments (*bboxes*) to be OCR-ed.

We note this as we saw that the LSTM OCR engine accepts/expects color or greyscale "original image data", but the thresholding in tesseract, while okay, delivers a mask that's sometimes unsuitable for *segmentation*, while it is fine for OCR (old skool tess v3 style): for *segmentation* we want fattened characters, possibly even connected together, i.e. some subtle shrink+grow / opening&closing before we binarize and feed *that* binarized image to the segmentation logic.
Meanwhile we feed *another*, *thinner*, binarized image to the old skool tesseract v3 engine.
*Plus* we use a *thicker* binarized image as a *denoise/background-removal mask*, applying it to the color/greyscale "original image" data, after which we normalize the result in order to nicely span the entire available *dynamic range* in colorspace (grey or RGB), so as to feed the LSTM engine the best possible grey/color pixels, without any disturbance by further-away background noise pixels, thanks to the applied mask.

> **Our Reason To Want This**: we've observed some very strange LSTM outputs when feeding that engine *unmasked* greyscale image data, which carried JPEG-artifacts!

0 comments on commit 4d3246f

Please sign in to comment.