diff --git a/book/_quarto.yml b/book/_quarto.yml
index 71536cd..ba3a972 100644
--- a/book/_quarto.yml
+++ b/book/_quarto.yml
@@ -7,14 +7,16 @@ project:
     - _headers
 
 book:
-  title: "Text recognition and analysis"
+  title: "Text digitization, recognition and analysis"
   author: "Koen Hufkens"
   date: "2025/01/07"
   page-navigation: true
   chapters:
     - index.qmd
     - intro.qmd
+    - digitization.qmd
     - ml_methods.qmd
+    - software.qmd
     - open_source.qmd
     - references.qmd
   favicon: "figures/favicon.ico"
diff --git a/book/digitization.qmd b/book/digitization.qmd
new file mode 100644
index 0000000..2b822dd
--- /dev/null
+++ b/book/digitization.qmd
@@ -0,0 +1,31 @@
+---
+title: "Image acquisition"
+format: html
+---
+
+Although this course focuses on text recognition and analysis, it is important to note that image acquisition, the quality of the images, and the consistent collection of meta-data are key to all subsequent processing. If you start a project where digitization is not yet complete, you should consider the importance of the digitization step within the context of all subsequent post-processing and text recognition workflows.
+
+The quality of the collected image data and the availability of meta-data have a profound impact on your workflow. Preemptively addressing image quality and meta-data issues can save significant time and effort, even if it requires more time for planning and data collection.
+
+Some general guidelines for digitization therefore include:
+
+- ensuring a proper digitization setup
+  - high-quality optics (a high f-stop value for sharpness)
+  - uniform, shadowless illumination using multiple lights and ring lights
+  - avoiding harsh flash-based setups (to protect sensitive manuscripts)
+- ensuring a fixed digitization protocol
+  - a fixed sequence of tasks
+  - well-documented procedures
+  - collection of meta-data where feasible
+- ensuring dynamic back-ups to prevent data loss
+
+Finally, if these aspects fall outside your domain expertise, reach out to your local collection managers for support and input.
+
+```{r}
+#| label: fig-digitization
+#| fig-cap: "The COBECORE digitization station, including a reproduction stand, cold lights, a DSLR camera and a black matte background"
+#| fig-align: "center"
+#| out-width: "50%"
+#| echo: FALSE
+knitr::include_graphics("./images/digistation.jpg")
+```
\ No newline at end of file
diff --git a/book/images/digistation.jpg b/book/images/digistation.jpg
new file mode 100644
index 0000000..5e9374c
Binary files /dev/null and b/book/images/digistation.jpg differ
diff --git a/book/images/image_augmentation.png b/book/images/image_augmentation.png
new file mode 100644
index 0000000..c7336b3
Binary files /dev/null and b/book/images/image_augmentation.png differ
diff --git a/book/index.qmd b/book/index.qmd
index 26384a6..3a9a903 100644
--- a/book/index.qmd
+++ b/book/index.qmd
@@ -2,4 +2,4 @@
 These are the materials for the course "Text recognition and analysis" given 6-7 Feb. 2025 at the Leibniz-Institut für Europäische Geschichte (IEG), Mainz. This book will serve as a reference during the course, and as a general introduction and reference for all things Handwritten Text Recognition / Optical Character Recognition (HTR/OCR).
-This reference gives an overview of the most common tools for historical (handwritten) text recognition. It will discuss the practical issues of such projects and how to resolve them efficiently and cost-effectively.
+This reference gives an overview of the most common tools for historical (handwritten) text recognition. I will also briefly discuss the initial digitization and potential citizen science components of such projects, leveraging my experience leading the [Congo basin eco-climatological data recovery and valorisation project](https://cobecore.org/). The course discusses the practical issues of such projects and how to resolve them efficiently and cost-effectively. It is a practical tool, not a theoretical machine learning reference, and will give you an idea of what it takes to start a data recovery effort.
diff --git a/book/ml_methods.qmd b/book/ml_methods.qmd
index 1c41b80..0681f45 100644
--- a/book/ml_methods.qmd
+++ b/book/ml_methods.qmd
@@ -1,13 +1,13 @@
 ---
-title: "ML and computer vision"
+title: "Basics of machine learning and computer vision"
 format: html
 ---
 
-To understand the software (frameworks) for HTR/OCR solutions a brief introduction in ML and computer vision methods is required. This allows you to understand potential pitfalls better. As highlighted in Figure 2.1, there are two main ML components to HTR/OCR transcription workflows, a segmentation component and a text transcription component.
+As highlighted in Figure 2.1, there are two main ML components to HTR/OCR transcription workflows: a segmentation component and a text transcription component. To understand the software (frameworks) for HTR/OCR solutions, a brief introduction to ML and computer vision methods is required. This will allow you to better understand potential pitfalls.
 
 ## Computer vision
 
-Although computer vision methods, broadly, include ML methods the classical approaches differ significantly from ML methods and merit a small mention. Classic computer vision methods do not rely the machine learning methods as discussed below LINK, but rather on pixel (region) or image based transformation. These methods are often used in the pre-processing of images before a machine learning algorithm is applied. Classical examples are the removal of [uneven lighting across an image using adaptive histogram equalization](https://en.wikipedia.org/wiki/Adaptive_histogram_equalization), the detection of structuring elements such as [linear features using a Hough transform](https://en.wikipedia.org/wiki/Hough_transform), or the [adaptive thresholding of an image](https://en.wikipedia.org/wiki/Thresholding_(image_processing)) from colour to black-and-white only. These algorithms also serve an important role in the creation of additional data from a single reference dataset, through data augmentation LINK.
+Although computer vision, broadly speaking, includes ML methods, the classical approaches differ significantly from them. Classic computer vision methods, as discussed below LINK, rely on pixel (region) or image-based transformations. These methods are often used in the pre-processing of images before a machine learning algorithm is applied (FIGURE LINK). Classical examples are the removal of [uneven lighting across an image using adaptive histogram equalization](https://en.wikipedia.org/wiki/Adaptive_histogram_equalization), the detection of structuring elements such as [linear features using a Hough transform](https://en.wikipedia.org/wiki/Hough_transform), or the [adaptive thresholding of an image](https://en.wikipedia.org/wiki/Thresholding_(image_processing)) from colour to black-and-white only. These algorithms also serve an important role in the creation of additional data from a single reference dataset, through data augmentation LINK.
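+
+As a minimal illustration of such classical pre-processing, the sketch below (assuming [OpenCV](https://opencv.org/) is installed; the file name and parameter values are placeholders) equalizes the lighting of a scan and binarizes it:
+
+```python
+import cv2
+
+# load a scanned page as a grayscale image
+img = cv2.imread("scan.jpg", cv2.IMREAD_GRAYSCALE)
+
+# even out illumination with (contrast limited) adaptive histogram equalization
+clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
+img_eq = clahe.apply(img)
+
+# adaptive thresholding: binarize each pixel relative to its local neighbourhood
+binary = cv2.adaptiveThreshold(
+    img_eq, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
+    cv2.THRESH_BINARY, 11, 2
+)
+cv2.imwrite("scan_binary.jpg", binary)
+```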
 
 ```{r}
 #| label: fig-cv
@@ -20,24 +20,28 @@
 knitr::include_graphics("https://docs.opencv.org/3.4/ada_threshold.jpg")
 ```
 
 ## Machine Learning
 
-### Principles
+The machine learning components of text segmentation and transcription rely on common machine learning algorithms and logic. To better understand these tasks, and how training methods influence the success of these models, I'll summarize some of these common building blocks. These are popularized and simplified descriptions meant to broaden the understanding of these processes; for in-depth discussions I refer to the linked articles in the text and the machine learning textbooks at the end of this course.
 
-The machine learning components of the segmentation and transcriptions ML models rely on common ML algorithms and logic. To better understand these tasks, and how training methods influences the success of these models, I'll summarize some of these common building blocks. These are vulgarized and simplified descriptions to increase the broad understanding of these processes, for in depth discussions I refer to the linked articles in the text.
-
-::: callout-note
-Those familiar with machine learning methods can skip this section.
-:::
-
-#### Model training
+```{r}
+#| label: fig-training
+#| fig-cap: "Machine Learning as summarized by XKCD (https://xkcd.com/1838/)"
+#| fig-align: "center"
+#| out-width: "50%"
+#| echo: FALSE
+knitr::include_graphics("https://imgs.xkcd.com/comics/machine_learning.png")
+```
 
-Machine learning models are
+Machine learning models are non-deterministic and rely on learning or training (an optimization method) on ground truth (reference) data. The simplest machine learning algorithm is a linear regression: one optimizes (trains) a slope and an intercept parameter to fit the observed response (ground truth) to explanatory variables (data). The more complex the task, the more parameters and data are required. Although oversimplified, the tongue-in-cheek XKCD cartoon is a good mental model of what happens at an abstract level: we shuffle model parameters until we get good correspondence between the data input and the ground truth observations.
 
-#### Data augmentation
+From this one can deduce a number of key take-home messages:
+
+- you need a sufficient amount of training data
+- you need an appropriate ML model and shuffling (optimization) algorithm
+- an ML model is limited by the representations within its training data
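+
+The toy sketch below (assuming only `numpy`; all data and settings are made up for illustration) makes the linear regression case concrete: we iteratively nudge ("shuffle") the slope and intercept until the line fits the synthetic ground truth.
+
+```python
+import numpy as np
+
+rng = np.random.default_rng(42)
+
+# synthetic ground truth: a noisy line with slope 2.5 and intercept 1.0
+x = rng.uniform(0, 10, 100)
+y = 2.5 * x + 1.0 + rng.normal(0, 1, 100)
+
+# training: gradient descent on the two parameters
+a, b = 0.0, 0.0
+learning_rate = 0.01
+for _ in range(5000):
+    error = (a * x + b) - y
+    a -= learning_rate * (error * x).mean()  # adjust the slope
+    b -= learning_rate * error.mean()        # adjust the intercept
+
+print(a, b)  # should be close to the true values 2.5 and 1.0
+```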
 
-#### Detecting patterns: convolutional neural networks (CNN)
+### Detecting patterns: convolutional neural networks (CNN)
 
-The analysis of images within the context of machine learning often (but not exclusively) happens using a convolutional neural networks (CNNs). Conceptually a CNN can be see as taking sequential sections of the image and summarizing them (i.e. convolve them) using a function (a filter), to a lower aggregated resolution (FIGURE XYZ). This reduces the size of the image, while at the same time while extracting a certain characteristic using a filter function. One of the most simple functions would be taking the average value across a 3x3 window.
+The analysis of images within the context of machine learning often (but not exclusively) happens using convolutional neural networks (CNNs). Conceptually, a CNN can be seen as taking sequential sections of the image and summarizing (i.e. convolving) them using a function (a filter) to a lower, aggregated resolution (FIGURE XYZ). This reduces the size of the image while summarizing a certain characteristic using a filter function. One of the simplest such functions would be taking the average value across a 3x3 window.
 
 ```{r}
 #| label: fig-convolution
@@ -50,13 +54,13 @@
 knitr::include_graphics("./images/Conv_no_padding_no_strides.gif")
 ```
 
 It is important to understand this concept within the context of text recognition and classification tasks in general. It highlights the fact that ML algorithms do not "understand" (handwritten) text. Where people can make sense of handwritten text by understanding the flow, in addition to recognizing patterns, ML approaches focus on patterns, shapes or forms. However, some form of memory can be included using other methods.
 
-#### Memory: recurrent neural networks
+### Memory and context: recurrent neural networks
 
-A second component to many recognition tasks is form of memory [RNN](https://en.wikipedia.org/wiki/Recurrent_neural_network) and [LSTM](https://en.wikipedia.org/wiki/Long_short-term_memory).
+A second component to many recognition tasks is a form of memory. Where a CNN encodes patterns, it does so without explicitly taking into account the relative position of these patterns and their relationship to adjacent ones. Here, Recurrent Neural Networks ([RNN](https://en.wikipedia.org/wiki/Recurrent_neural_network)) and Long Short-Term Memory ([LSTM](https://en.wikipedia.org/wiki/Long_short-term_memory)) networks provide a solution. These algorithms retain some of the information of adjacent data (either in time or space) to provide context at the current (time or space) position. Both approaches can be uni- or bi-directional: in the former the direction of processing matters, in the latter it does not.
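+
+As a small, hypothetical PyTorch sketch (assuming `torch` is installed; the sizes are arbitrary), the difference between the two is easy to see in the output shapes:
+
+```python
+import torch
+import torch.nn as nn
+
+x = torch.randn(1, 20, 64)  # one sequence of 20 steps, 64 features per step
+
+uni = nn.LSTM(64, 32, batch_first=True)
+bi = nn.LSTM(64, 32, batch_first=True, bidirectional=True)
+
+out_uni, _ = uni(x)  # shape (1, 20, 32): each step only "saw" what came before
+out_bi, _ = bi(x)    # shape (1, 20, 64): forward and backward context concatenated
+```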
 
-#### Negative space: connectionist temporal classification
+### Negative space: connectionist temporal classification
 
-In speech and written text much of the structure is defined not only by what is there, the spoken and written words, but also what is not there, the pauses and spacing. Taken to the extreme the expressionist / dadaist poem "Boem paukeslag" by [Paul van Ostaijen](https://en.wikipedia.org/wiki/Paul_van_Ostaijen) is an example of irregularity in text in typeset text. These irregularities or negative space in the pace of writing is another hurdle for text recognition algorithms.
+In speech and written text much of the structure is defined not only by what is there, the spoken and written words, but also by what is not there, the pauses and spacing. Taken to the extreme, the expressionist/dadaist poem "Boem paukeslag" by [Paul van Ostaijen](https://en.wikipedia.org/wiki/Paul_van_Ostaijen) is an example of irregularity in typeset text. These irregularities, or the negative space in the pace of writing, are another hurdle for text recognition algorithms. Generally, we want readable text as the output of our ML models, not dadaist impressions with large gaps.
 
 ```{r}
 #| label: fig-boem-paukenslag
@@ -67,7 +71,7 @@
 knitr::include_graphics("https://upload.wikimedia.org/wikipedia/commons/0/0c/Boempaukeslag.jpg")
 ```
 
-These issues in detecting uneven spacing are addressed using the [connectionist temporal classification (CTC)](https://en.wikipedia.org/wiki/Connectionist_temporal_classification). This function is applied to the RNN and LSTM output, where it [collapses a sequence of recurring labels](https://distill.pub/2017/ctc/) through oversampling to its most likely reduced form while respecting spacing and coherence.
+These issues in detecting uneven spacing are addressed using [Connectionist Temporal Classification (CTC)](https://en.wikipedia.org/wiki/Connectionist_temporal_classification). This function is applied to the RNN/LSTM output, where it [collapses a sequence of recurring labels](https://distill.pub/2017/ctc/), generated through oversampling, to its most likely reduced form.
 
 ```{r}
 #| label: fig-ctc-loss
@@ -78,19 +82,23 @@
 knitr::include_graphics("images/ctc_loss_Hannun.png")
 ```
 
-#### Transformers
-
-
-### Implementation
+### Data augmentation
+
+Sufficient data is key to training an ML model that performs well. However, at times you might be limited in the data you can access for training; a common issue is that only limited ground truth data (labels, text transcriptions) are available. Data augmentation is a way to slightly alter a smaller existing dataset in order to create a larger, partially artificial, dataset. Within the context of HTR/OCR one can generate slight variations of the same text image and label pair through computer vision (or machine learning) based alterations, such as rotating or skewing the image and introducing noise.
+
+```{r}
+#| label: fig-data-augmentation
+#| fig-cap: "Data augmentation examples on the French word Juillet"
+#| fig-align: "center"
+#| out-width: "100%"
+#| echo: FALSE
+knitr::include_graphics("./images/image_augmentation.png")
+```
 
-#### Segmentation
+## Implementation
 
-- CNN
+Putting all the pieces together, the most common ML implementations of text segmentation rely heavily on CNN-based segmentation networks, while text recognition often, if not always, takes the form of a CNN + (bidirectional) LSTM/RNN + CTC network. You might come across these terms when reading technical documentation on model architectures in text transcription frameworks. Depending on the implementation or framework used, data augmentation might be applied during training to broaden the scope of the model and increase the chances of out-of-distribution (OOD) generalization.
 
-#### Text Recognition
+### Out-of-distribution generalization in text transcription
 
-- CNN + biLSTM + CTC
-- CNN + RNN + CTC
\ No newline at end of file
+Handwritten text and old print vary greatly in shape, form and retained quality. This pushes trained models towards poor performance, as the chances of good OOD generalization are small. In short, two text styles are rarely the same, or similar enough, for a trained model to be transferred to a new, seemingly similar, transcription task.
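+
+To make the CNN + (bidirectional) LSTM/RNN + CTC recipe above concrete, here is a minimal, hypothetical PyTorch sketch of such a recognition network (assuming `torch` is installed; all layer sizes are illustrative, not a reference implementation):
+
+```python
+import torch.nn as nn
+
+class TextRecognizer(nn.Module):
+    """Sketch of a CNN + BiLSTM text line recognizer, trained with a CTC loss."""
+
+    def __init__(self, n_chars):
+        super().__init__()
+        # CNN: extract visual patterns from a text line image
+        self.cnn = nn.Sequential(
+            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
+            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
+        )
+        # BiLSTM: add left/right context along the writing direction
+        self.lstm = nn.LSTM(64 * 8, 128, bidirectional=True, batch_first=True)
+        # per-step character scores (+1 for the CTC "blank" label)
+        self.head = nn.Linear(2 * 128, n_chars + 1)
+
+    def forward(self, x):                     # x: (batch, 1, 32, width)
+        features = self.cnn(x)                # (batch, 64, 8, width / 4)
+        features = features.permute(0, 3, 1, 2).flatten(2)
+        context, _ = self.lstm(features)      # (batch, width / 4, 256)
+        return self.head(context)             # fed to nn.CTCLoss during training
+```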
diff --git a/book/open_source.qmd b/book/open_source.qmd
index e4077ae..0ebd8a4 100644
--- a/book/open_source.qmd
+++ b/book/open_source.qmd
@@ -7,15 +7,15 @@ format: html
 
 ### A solved methodology
 
- So, what is holding back universal open-source HTR/OCR?
-
-Data is what holds back HTR in practice. Given the many variations in handwritten text ML algorithms, which are generally based on pattern recognition and not an understanding of the writing process, need to be trained ("see") a wide variety of handwritten text characters to be able to firstly translate similarly styled handwritten text, secondly potentially apply this to other adjacent styles. How close two documents are in writing style determines how well a trained model will perform on this task. Consequently, the more variations in handwritten text styles you train an ML algorithm on the easier it will be to transcribe a wide variety of text styles. In short, the bottleneck in automated transcription is gathering sufficient training data (for your use case).
+Methodologically (see section XYZ) the problem of text transcription seems to be solved. So, what is holding back universal open-source HTR/OCR? Generally, data is what holds back HTR in practice. Given the many variations in handwritten text, ML algorithms need to be trained on ("see") a wide variety of handwritten text characters to, firstly, transcribe similarly styled handwritten text and, secondly, potentially apply this to other, adjacent, styles. How close two documents are in writing style determines how well a trained model will perform on the task. Consequently, the more variations in handwritten text styles you train an ML algorithm on, the easier it will be to transcribe a wide variety of text styles. In short, the bottleneck in automated transcription is gathering sufficient training data (for your use case).
 
 ### Precious data
 
-Unsurprisingly, although the ML code might be open-source many large training datasets are often a close guarded secrets, as are interfaces to make the generation of such datasets. It can be argued that within the context of FAIR research practices ML code disseminated without the training data, or model parameters, for a particular study is decidedly not open-source. A similar argument has been made within the context of the recent flurry of supposedly open-source Large Language Models (LLMs), such as ChatGPT.
+Unsurprisingly, although the ML code might be open-source, many large training datasets are not always shared as generously. It can be argued that, within the context of FAIR research practices, ML code disseminated without the training data, or model parameters, of a particular study is decidedly not open-source. A similar argument has been made within the context of the recent flurry of supposedly open-source Large Language Models (LLMs), such as ChatGPT.
+
+The lack of access to both the training data and a pre-trained model limits the re-use of the model in a new context. One cannot take the model and fine-tune it, i.e. let it "see" new text styles. In short, if you only have the underlying model code you always have to train a model from scratch (anew) using your own, often limited, dataset. This context is important to understand, as this is how transcription platforms will keep you tied to their paying service.
 
-The lack of access to both the trainig data, or a pre-trained model, limits the re-use of the model in a new context. One can not take a model and fine-tune it, i.e. let it "see" new text styles. In short, if you only have the underlying model code you always have to train a model from scratch (anew) using your own, often limited, dataset. This context is important to understand, as this is how various transcription platforms will keep you tied to their paying service. For example, Transkribus, although making the training process on data easy, using the open-source {pylaia} python library, will not allow you to export these model weights for offline use. These platforms, although providing a service, will also hold you hostage and will play on the network effect to enroll as many users as possible (i.e. sharing model weights internally - despite open-source claims of the framework).
+For example, Transkribus, although making the training of models on your data easy, and using the open-source {pylaia} python library, will not allow you to export the resulting model weights for offline use. These platforms, although providing a service, will also hold you hostage and play on the network effect to enroll as many users/colleagues as possible (i.e. sharing model weights internally).
 
 This lock-in situation often comes at a cost, which does not scale in favour of users and their own contributions. Using the recovery of climate data as a worked example, a cost break-down shows that, after training a custom model, the extraction of a table (1 credit), its fields (1 credit), and text detection and transcription (1 credit) requires 3 credits per page. For the 75K tables in the archive this represents 225K Transkribus credits which, with a data volume > 200GB requiring a Team plan, amounts to 60 000 EUR, assuming no re-runs are required (i.e. perfect results). Experience teaches that ML is often iterative, and the true costs will probably be far higher (> 150K EUR). The various vision APIs of Google or Amazon are cheaper, but don't allow for training, and perform poorly on cursive text.
diff --git a/book/software.qmd b/book/software.qmd
new file mode 100644
index 0000000..d7a1e4f
--- /dev/null
+++ b/book/software.qmd
@@ -0,0 +1,105 @@
+---
+title: "Software (platforms)"
+format: html
+---
+
+Within the context of text recognition and analysis there are a number of commercial and open-source options available. Below I'll list the most common frameworks and some of their advantages and disadvantages. An in-depth discussion on how best to choose a framework within the context of your project is given in the next chapter (REFERENCE).
+
+## Commercial
+
+### [Transkribus](https://www.transkribus.org/)
+
+A dominant player in the transcription of historical texts is the Transkribus platform. This platform provides a way to apply layout detection, text transcription and custom model training (with ground truth data generated on the platform) without coding. It offers commercial support options and a growing community of users, including a shared model zoo. The platform is currently built around the [PyLaia (python) library](LINK) (see also below).
+
+| Pro: | Con: |
+| ------------- | ------------- |
+| user friendly | expensive |
+| support / documentation | vendor lock-in |
+| allows custom model training | |
+
+### Google / Amazon / Microsoft APIs
+
+All three big tech platforms offer OCR based application programming interfaces (APIs) which you can access from (python) scripts.
+
+In particular, HTR/OCR is covered by:
+
+- [Microsoft Azure Document Intelligence](https://learn.microsoft.com/en-us/azure/ai-services/document-intelligence/)
+- [Google Vision AI](https://cloud.google.com/vision)
+- [Amazon Textract](https://aws.amazon.com/textract/)
+
+| Pro: | Con: |
+| ------------- | ------------- |
+| support / documentation | vendor lock-in |
+| scalability | requires programming |
+| relatively cheap | custom model training is often complex or not possible |
+
+> Multi-modal Large Language Models
+>
+> Increasingly, these toolboxes are being consolidated into multi-modal Large Language Models (LLMs). These LLMs will provide impressive results on common tasks, but will not perform well on less common or more complex data.
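+
+As an illustration of the API route above, a minimal sketch of a Google Vision text detection call (assuming the `google-cloud-vision` package is installed and credentials are configured; the file name is a placeholder):
+
+```python
+from google.cloud import vision
+
+client = vision.ImageAnnotatorClient()
+
+# read a scanned page and submit it for dense (document) text detection
+with open("scan.jpg", "rb") as f:
+    image = vision.Image(content=f.read())
+
+response = client.document_text_detection(image=image)
+print(response.full_text_annotation.text)
+```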
+
+## Open source
+
+### [eScriptorium](https://escriptorium.inria.fr/)
+
+eScriptorium is a software platform created to make text layout analysis and recognition easy. The underlying text recognition is based on the [Kraken framework](https://kraken.re), for which it serves as an interface. The interface allows the user to annotate data and train custom models, with no coding required, similar to Transkribus. Despite providing much the same features as Transkribus, eScriptorium is not a program as such, but a service to be run on a server or in a docker image. This does require knowledge of how to set up and manage docker instances, or do a full server install.
+
+| Pro: | Con: |
+| ------------- | ------------- |
+| user friendly | complex installation for novices |
+| OK documentation | |
+| full workflow control | |
+| interoperability | |
+| shared models | |
+
+#### Installation & Use
+
+A basic docker install is provided on [the project code pages](https://gitlab.com/scripta/escriptorium/-/wikis/docker-install). Further reading and related resources:
+
+- https://github.com/HTR-United/CREMMA-Medieval-LAT
+- https://help.transkribus.org/data-preparation
+- https://escriptorium.readthedocs.io/en/latest/quick-start/
+- https://ub-mannheim.github.io/eScriptorium_Dokumentation/Training-with-eScriptorium-EN.html
+- https://github.com/HTR-United/htr-united
+- https://github.com/OCR4all
+
+### [OCR4all](https://www.ocr4all.org/)
+
+OCR4all is an OCR platform built around the Calamari text recognition engine and the LAREX layout analysis tool. Similar to eScriptorium and Transkribus, it aims to make the transcription of documents easy, without the need for coding. As with eScriptorium, the setup is not a program as such, but a service to be run on a server or in a docker image.
+
+| Pro: | Con: |
+| ------------- | ------------- |
+| user friendly | complex installation for novices |
+| OK documentation | |
+| full workflow control | |
+| interoperability | |
+| shared models | |
+
+#### Installation & Use
+
+Quick docker install:
+
+```bash
+sudo docker run -p 1476:8080 \
+  -u `id -u root`:`id -g $USER` \
+  --name ocr4all \
+  -v $PWD/data:/var/ocr4all/data \
+  -v $PWD/models:/var/ocr4all/models/custom \
+  -it uniwuezpd/ocr4all
+```
+
+### [Tesseract](https://tesseract-ocr.github.io/tessdoc/)
+
+Tesseract is a popular open-source OCR program which, out of the box, does not allow for handwritten text recognition. However, the software does allow for the retraining of models. Having been a mainstay of OCR work in the open-source community, [a zoo of third-party software](https://tesseract-ocr.github.io/tessdoc/User-Projects-%E2%80%93-3rdParty.html) exists providing interfaces and additional functionality, as well as a [python interface (pytesseract)](https://github.com/madmaze/pytesseract).
+
+### Custom pipelines and libraries
+
+Most of the above mentioned software options are mature and require limited coding knowledge to operate. However, I would be remiss not to mention the underlying HTR/OCR libraries. Depending on the use case one could benefit from using these low-level libraries, rather than the more user friendly platforms built around them. The most prominent python libraries for HTR/OCR work are [Kraken](https://kraken.re/main/index.html) as used by eScriptorium, [PyLaia](https://gitlab.teklia.com/atr/pylaia) used by Transkribus, [EasyOCR](https://github.com/JaidedAI/EasyOCR) and [PaddleOCR](https://github.com/PaddlePaddle/PaddleOCR/blob/main/README_en.md).
+
+All these libraries provide machine learning setups to train handwritten text recognition models of the CNN + LSTM/RNN + CTC kind. In addition, Kraken and PaddleOCR provide document layout analysis (segmentation) options.
+
+| Pro: | Con: |
+| ------------- | ------------- |
+| flexible | complex installation |
+| full workflow control | coding required |
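+
+As a minimal illustration of working at the library level, a hypothetical EasyOCR sketch (assuming the `easyocr` package is installed; the file name is a placeholder):
+
+```python
+import easyocr
+
+# the reader downloads the detection and recognition models on first use
+reader = easyocr.Reader(["en", "fr"])
+
+# readtext() returns a list of (bounding box, text, confidence) tuples
+for bbox, text, confidence in reader.readtext("scan.jpg"):
+    print(f"{confidence:.2f}\t{text}")
+```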