From 4d3246fe7a37a48ca0cbc6c99b6451da6619fe6b Mon Sep 17 00:00:00 2001
From: Ger Hobbelt
Date: Tue, 5 Sep 2023 22:31:41 +0200
Subject: [PATCH] (Obsidian) added notes re OCR (pre-)processing

# Conflicts:
#	MuPDF
---
 .../bezoar - OCR and related document page prep.md | 13 ++++++++++++-
 1 file changed, 12 insertions(+), 1 deletion(-)

diff --git a/docs-src/Notes/Progress in Development/Considering the Way Forward/Tools To Be Developed/bezoar - OCR and related document page prep.md b/docs-src/Notes/Progress in Development/Considering the Way Forward/Tools To Be Developed/bezoar - OCR and related document page prep.md
index b4d298a4..e5548484 100644
--- a/docs-src/Notes/Progress in Development/Considering the Way Forward/Tools To Be Developed/bezoar - OCR and related document page prep.md
+++ b/docs-src/Notes/Progress in Development/Considering the Way Forward/Tools To Be Developed/bezoar - OCR and related document page prep.md
@@ -6,5 +6,16 @@ and, IFF possible:
 
 
 
-3. *hopefully* add the ability to manipulate these behaviours through user-provided *scripting* for customized behaviours for individual input files.
+3. *hopefully* add the ability to manipulate these behaviours through user-provided *scripting* for customized behaviours for individual input files. This SHOULD include arbitrary *page image* preprocessing flows, inspired by the algorithms included in `unpaper` and `libprecog` (a.k.a. `PRLib`).
+
+   The key to this (scriptable) flow is that image (pre)processing is not just a single forward pass: it should allow a preprocessing *graph* to be set up, so we can create masks, etc., which are then applied in later stages of that preprocessing graph.
+
+4. Also add a feature to tesseract that accepts a separate image for the *segmentation phase* in that codebase, and/or overrides that phase entirely by allowing external processes (or the preprocessor) to deliver a list of segments (*bboxes*) to be OCR-ed.
+
+   We note this because we saw that the LSTM OCR engine accepts/expects color or greyscale "original image data", while the thresholding in tesseract, though okay, delivers a mask that is sometimes unsuitable for *segmentation*, even though it is fine for OCR (old skool tesseract v3 style): for *segmentation* we want fattened characters, possibly even connected together, i.e. some subtle shrink+grow / opening & closing before we binarize and feed *that* binarized image to the segmentation logic.
+   Meanwhile we feed *another*, *thinner*, binarized image to the old skool tesseract v3 engine.
+   *Plus* we use a *thicker* binarized image as a *denoise/background-removal mask*, applying it to the color/greyscale "original image" data, after which we normalize the result so it nicely spans the entire available *dynamic range* of the colorspace (grey or RGB): that way the LSTM engine gets the best possible grey/color pixels, without any disturbance from further-away background noise pixels, thanks to the applied mask.
+
+   > **Our Reason To Want This**: we've observed some very strange LSTM outputs when feeding that engine *unmasked* greyscale image data which carried JPEG artifacts!
+
 
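
Note (not part of the patch above): a minimal sketch of the three binarizations the note describes, using OpenCV purely as a stand-in for whatever `unpaper`/`libprecog`-style primitives the preprocessing graph would eventually expose. Kernel size, the erode/dilate choices and the Otsu thresholding are illustrative assumptions, not the actual bezoar pipeline.

```cpp
// Sketch only: OpenCV as a placeholder toolkit; parameters are guesses.
#include <opencv2/opencv.hpp>

int main(int argc, char** argv)
{
    if (argc < 2) return 1;

    // The colour "original image data" is kept around for the LSTM path.
    cv::Mat original = cv::imread(argv[1], cv::IMREAD_COLOR);
    cv::Mat grey;
    cv::cvtColor(original, grey, cv::COLOR_BGR2GRAY);

    cv::Mat kernel = cv::getStructuringElement(cv::MORPH_ELLIPSE, cv::Size(3, 3));

    // 1. "Fattened" binarization for segmentation: a grey-level erosion thickens
    //    dark strokes (possibly merging neighbouring glyphs) before thresholding.
    cv::Mat fattened, seg_bin;
    cv::erode(grey, fattened, kernel);
    cv::threshold(fattened, seg_bin, 0, 255, cv::THRESH_BINARY | cv::THRESH_OTSU);

    // 2. "Thinner" binarization, fed to the classic (v3-style) OCR engine as-is.
    cv::Mat ocr_bin;
    cv::threshold(grey, ocr_bin, 0, 255, cv::THRESH_BINARY | cv::THRESH_OTSU);

    // 3. "Thicker" binarization used as a denoise / background-removal mask:
    //    keep original pixels only where the dilated ink mask is set, then
    //    stretch the survivors over the full dynamic range for the LSTM engine.
    cv::Mat ink_mask;
    cv::threshold(grey, ink_mask, 0, 255, cv::THRESH_BINARY_INV | cv::THRESH_OTSU);
    cv::dilate(ink_mask, ink_mask, kernel, cv::Point(-1, -1), /*iterations=*/2);

    cv::Mat lstm_input(original.size(), original.type(), cv::Scalar::all(255));
    original.copyTo(lstm_input, ink_mask);                          // apply the mask
    cv::normalize(lstm_input, lstm_input, 0, 255, cv::NORM_MINMAX); // span full range

    cv::imwrite("seg_bin.png", seg_bin);
    cv::imwrite("ocr_bin.png", ocr_bin);
    cv::imwrite("lstm_input.png", lstm_input);
    return 0;
}
```

In the envisioned scriptable graph, each of these three outputs would simply be a node, with the mask node feeding a later "apply mask + normalize" node rather than being hard-wired in sequence as above.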
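For the "deliver a list of segments (*bboxes*) to be OCR-ed" idea: the closest thing in tesseract's current public API is `TessBaseAPI::SetRectangle()`. Below is a rough sketch of driving it from an externally produced bbox list; the bbox values and input filename are made-up placeholders, and the actual feature the note asks for (a separate segmentation image and/or a full segment hand-over) would still have to be added to tesseract itself.

```cpp
// Sketch only: approximates "externally supplied segments" with SetRectangle().
#include <tesseract/baseapi.h>
#include <leptonica/allheaders.h>
#include <cstdio>

struct BBox { int x, y, w, h; };   // illustrative segment record

int main()
{
    tesseract::TessBaseAPI api;
    if (api.Init(nullptr, "eng") != 0)
        return 1;

    // The masked/normalized page produced by the preprocessing stage (placeholder name).
    Pix* image = pixRead("lstm_input.png");
    api.SetImage(image);

    // Segments as they might be delivered by an external segmenter or the
    // preprocessing graph; the values here are made up for illustration.
    BBox segments[] = { { 100, 120, 800, 40 }, { 100, 180, 800, 40 } };

    for (const BBox& b : segments) {
        api.SetRectangle(b.x, b.y, b.w, b.h);   // restrict recognition to this bbox
        char* text = api.GetUTF8Text();
        std::printf("[%d,%d %dx%d] %s", b.x, b.y, b.w, b.h, text ? text : "");
        delete[] text;
    }

    api.End();
    pixDestroy(&image);
    return 0;
}
```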