updating content
khufkens committed Jan 8, 2025
1 parent dd5609f commit 5367ccf
Showing 8 changed files with 183 additions and 37 deletions.
4 changes: 3 additions & 1 deletion book/_quarto.yml
@@ -7,14 +7,16 @@ project:
- _headers

book:
title: "Text recognition and analysis"
title: "Text digitization, recognition and analysis"
author: "Koen Hufkens"
date: "2025/01/07"
page-navigation: true
chapters:
- index.qmd
- intro.qmd
- digitization.qmd
- ml_methods.qmd
- software.qmd
- open_source.qmd
- references.qmd
favicon: "figures/favicon.ico"
31 changes: 31 additions & 0 deletions book/digitization.qmd
@@ -0,0 +1,31 @@
---
title: "Image acquisition"
format: html
---

Although this course focuses on text recognition and analysis, it is important to note that image acquisition, image quality and the consistent collection of meta-data are key to all subsequent processing. If you start a project in which digitization is not yet complete, consider the digitization step within the context of all subsequent post-processing and text recognition workflows.

The quality of the collected image data and the availability of meta-data have a profound impact on your workflow. Addressing image quality and meta-data issues up front can save significant time and effort later, even if it takes more time during planning and data collection.

Some general guidelines for digitization therefore include:

- ensuring a proper digitization setup
  - high quality optics (high f-stop value for sharpness)
  - uniform shadowless illumination using multiple lights and ring lights
  - avoiding harsh flash-based setups (protecting sensitive manuscripts)
- ensuring a fixed digitization protocol
  - a fixed sequence of tasks
  - well documented
  - collecting meta-data when feasible
- ensuring dynamic back-ups to prevent data loss

Finally, if digitization is not within your domain of expertise, reach out to your local collection managers for support and input on all these aspects.

```{r}
#| label: fig-digitzation
#| fig-cap: "The COBECORE digitization station, including a reproduction stand, cold lights, a DSLR camera and a black matte background"
#| fig-align: "center"
#| out-width: "50%"
#| echo: FALSE
knitr::include_graphics("./images/digistation.jpg")
```
Binary file added book/images/digistation.jpg
Binary file added book/images/image_augmentation.png
2 changes: 1 addition & 1 deletion book/index.qmd
@@ -2,4 +2,4 @@

These are the materials for the course "Text recognition and analysis" given 6-7 Feb. 2025 at the Leibniz-Institut für Europäische Geschichte (IEG), Mainz. This book will serve as a reference during the course, and as a general introduction and reference for all things Handwritten Text Recognition / Optical Character Recognition (HTR/OCR).

This reference gives an overview of the most common tools for historical (handwritten) text recognition. It will discuss the practical issues of such projects and how to resolve them efficiently and cost-effectively.
This reference gives an overview of the most common tools for historical (handwritten) text recognition. In addition, I will briefly discuss the initial digitization and potential citizen science components of such projects, drawing on my experience leading the [Congo basin eco-climatological data recovery and valorisation project](https://cobecore.org/). The course covers the practical issues of such projects and how to resolve them efficiently and cost-effectively. It is a practical tool, not a theoretical machine learning reference, and it will give you an idea of what it takes to start a data recovery effort.
68 changes: 38 additions & 30 deletions book/ml_methods.qmd
@@ -1,13 +1,13 @@
---
title: "ML and computer vision"
title: "Basics of machine learning and computer vision"
format: html
---

To understand the software (frameworks) for HTR/OCR solutions, a brief introduction to ML and computer vision methods is required. This allows you to better understand potential pitfalls. As highlighted in Figure 2.1, there are two main ML components to HTR/OCR transcription workflows: a segmentation component and a text transcription component.
As highlighted in Figure 2.1, there are two main ML components to HTR/OCR transcription workflows: a segmentation component and a text transcription component. To understand the software (frameworks) for HTR/OCR solutions, a brief introduction to ML and computer vision methods is required. This allows you to better understand potential pitfalls.

## Computer vision

Although computer vision broadly includes ML methods, the classical approaches differ significantly from ML methods and merit a small mention. Classic computer vision methods do not rely on the machine learning methods discussed below LINK, but rather on pixel- (region-) or image-based transformations. These methods are often used in the pre-processing of images before a machine learning algorithm is applied. Classical examples are the removal of [uneven lighting across an image using adaptive histogram equalization](https://en.wikipedia.org/wiki/Adaptive_histogram_equalization), the detection of structuring elements such as [linear features using a Hough transform](https://en.wikipedia.org/wiki/Hough_transform), or the [adaptive thresholding of an image](https://en.wikipedia.org/wiki/Thresholding_(image_processing)) from colour to black-and-white only. These algorithms also serve an important role in the creation of additional data from a single reference dataset, through data augmentation LINK.
Although computer vision broadly includes ML methods, the classical approaches differ significantly from ML methods. Classic computer vision methods, as discussed below LINK, apply pixel- (region-) or image-based transformations. These methods are often used in the pre-processing of images before a machine learning algorithm is applied (FIGURE LINK). Classical examples are the removal of [uneven lighting across an image using adaptive histogram equalization](https://en.wikipedia.org/wiki/Adaptive_histogram_equalization), the detection of structuring elements such as [linear features using a Hough transform](https://en.wikipedia.org/wiki/Hough_transform), or the [adaptive thresholding of an image](https://en.wikipedia.org/wiki/Thresholding_(image_processing)) from colour to black-and-white only. These algorithms also serve an important role in the creation of additional data from a single reference dataset, through data augmentation LINK.
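To make the idea of a pixel (region) based transformation concrete, the sketch below binarizes a tiny grayscale image with a local-mean adaptive threshold. It is a minimal illustration in plain Python, not the implementation used by any particular library; the function name `adaptive_threshold` and the parameters `window` and `offset` are assumptions made for this example.

```python
def adaptive_threshold(image, window=3, offset=10):
    """Binarize a grayscale image (list of rows, values 0-255).

    Each pixel is compared with the mean of its local neighbourhood,
    so unevenly lit regions are judged against their own surroundings.
    """
    height, width = len(image), len(image[0])
    half = window // 2
    result = []
    for y in range(height):
        row = []
        for x in range(width):
            # Collect the neighbourhood, clipped at the image borders.
            values = [
                image[j][i]
                for j in range(max(0, y - half), min(height, y + half + 1))
                for i in range(max(0, x - half), min(width, x + half + 1))
            ]
            local_mean = sum(values) / len(values)
            # Pixels clearly darker than their surroundings become ink (0),
            # everything else becomes background (255).
            row.append(255 if image[y][x] > local_mean - offset else 0)
        result.append(row)
    return result


# A page with a dark left half and a bright right half, one ink pixel in each.
page = [
    [60, 60, 60, 200, 200, 200],
    [60, 20, 60, 200, 120, 200],
    [60, 60, 60, 200, 200, 200],
]
binary = adaptive_threshold(page)
```

A single global threshold could not separate both ink pixels from both backgrounds here, because the ink on the bright side (120) is lighter than the background on the dark side (60). Comparing against the local mean avoids this, although pixels near the lighting boundary can still be misclassified.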

```{r}
#| label: fig-cv
@@ -20,24 +20,28 @@ knitr::include_graphics("https://docs.opencv.org/3.4/ada_threshold.jpg")

## Machine Learning

### Principles
The machine learning components of text segmentation and transcription rely on common machine learning algorithms and logic. To better understand these tasks, and how training methods influence the success of these models, I'll summarize some of these common building blocks. These are deliberately simplified descriptions meant to build a broad understanding of these processes; for in-depth discussions I refer to the linked articles in the text and the machine learning textbooks at the end of this course.

The machine learning components of the segmentation and transcription ML models rely on common ML algorithms and logic. To better understand these tasks, and how training methods influence the success of these models, I'll summarize some of these common building blocks. These are deliberately simplified descriptions meant to build a broad understanding of these processes; for in-depth discussions I refer to the linked articles in the text.

::: callout-note
Those familiar with machine learning methods can skip this section.
:::

#### Model training
```{r}
#| label: fig-training
#| fig-cap: "Machine Learning as summarized by XKCD (https://xkcd.com/1838/)"
#| fig-align: "center"
#| out-width: "50%"
#| echo: FALSE
knitr::include_graphics("https://imgs.xkcd.com/comics/machine_learning.png")
```

Machine learning models are
Machine learning models are often non-deterministic and rely on learning or training (an optimization method) on ground truth (reference) data. The simplest machine learning algorithm is a linear regression. In a simple linear regression one optimizes (trains) a slope and an intercept parameter to fit the observed response (ground truth) to explanatory variables (data). The more complex the task, the more parameters and data are required. Although oversimplified, the very tongue-in-cheek cartoon by XKCD is a good mental model of what happens on an abstract level, where we shuffle model parameters until we get good correspondence between the data input and the ground truth observations.

#### Data augmentation
From this one can deduce a number of key take-home messages:

- a sufficient amount of training data is required
- an appropriate ML and shuffling (optimization) algorithm is required
- an ML model is limited by the representations within the training data
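The linear regression example can be sketched in a few lines of plain Python. This is a minimal illustration, assuming gradient descent as the optimization method; the function name `train_linear_regression`, the learning rate and the number of epochs are choices made for this example only.

```python
def train_linear_regression(xs, ys, learning_rate=0.05, epochs=2000):
    """Fit y = slope * x + intercept by gradient descent on the mean squared error."""
    slope, intercept = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Difference between the current predictions and the ground truth.
        errors = [slope * x + intercept - y for x, y in zip(xs, ys)]
        # Gradients of the mean squared error for both parameters.
        grad_slope = 2 / n * sum(e * x for e, x in zip(errors, xs))
        grad_intercept = 2 / n * sum(errors)
        # Nudge the parameters in the direction that reduces the error.
        slope -= learning_rate * grad_slope
        intercept -= learning_rate * grad_intercept
    return slope, intercept


# Ground truth generated from y = 2x + 1.
xs = [0, 1, 2, 3, 4]
ys = [1, 3, 5, 7, 9]
slope, intercept = train_linear_regression(xs, ys)
```

The loop is the "shuffling" of parameters from the cartoon, except that each adjustment is guided by the error instead of being random. Because the training data only cover values of x between 0 and 4, the fitted model says nothing reliable about relationships that are not represented in that range.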

#### Detecting patterns: convolutional neural networks (CNN)
### Detecting patterns: convolutional neural networks (CNN)

The analysis of images within the context of machine learning often (but not exclusively) happens using convolutional neural networks (CNNs). Conceptually a CNN can be seen as taking sequential sections of the image and summarizing them (i.e. convolving them) using a function (a filter), to a lower aggregated resolution (FIGURE XYZ). This reduces the size of the image while at the same time extracting a certain characteristic using a filter function. One of the simplest functions would be taking the average value across a 3x3 window.
The analysis of images within the context of machine learning often (but not exclusively) happens using convolutional neural networks (CNNs). Conceptually a CNN can be seen as taking sequential sections of the image and summarizing them (i.e. convolving them) using a function (a filter), to a lower aggregated resolution (FIGURE XYZ). This reduces the size of the image while at the same time summarizing a certain characteristic using a filter function. One of the simplest functions would be taking the average value across a 3x3 window.
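The 3x3 averaging filter mentioned above can be written out as a small sketch in plain Python. It assumes no padding and a stride of one; the function name `convolve` and the variable `mean_kernel` are illustrative. In a real CNN the filter values are not fixed like this but learned during training.

```python
def convolve(image, kernel):
    """Slide a square kernel over the image (no padding, stride 1)."""
    k = len(kernel)
    out_height = len(image) - k + 1
    out_width = len(image[0]) - k + 1
    output = []
    for y in range(out_height):
        row = []
        for x in range(out_width):
            # Weighted sum of the k x k window whose top-left corner is (x, y).
            total = sum(
                image[y + j][x + i] * kernel[j][i]
                for j in range(k)
                for i in range(k)
            )
            row.append(total)
        output.append(row)
    return output


# Every weight is 1/9, so each output value is the mean of a 3x3 window.
mean_kernel = [[1 / 9] * 3 for _ in range(3)]

image = [
    [0, 1, 2, 3],
    [4, 5, 6, 7],
    [8, 9, 10, 11],
    [12, 13, 14, 15],
]
smoothed = convolve(image, mean_kernel)
```

The 4x4 input shrinks to a 2x2 output, which illustrates how the summarizing step reduces the size of the image.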

```{r}
#| label: fig-convolution
@@ -50,13 +54,13 @@ knitr::include_graphics("./images/Conv_no_padding_no_strides.gif")

It is important to understand this concept within the context of text recognition and classification tasks in general. It highlights the fact that ML algorithms do not "understand" (handwritten) text. Where people can make sense of handwritten text by understanding the flow, in addition to recognizing patterns, ML approaches focus on patterns, shapes or forms. However, some form of memory can be included using other methods.

#### Memory: recurrent neural networks
### Memory and context: recurrent neural networks

A second component to many recognition tasks is a form of memory [RNN](https://en.wikipedia.org/wiki/Recurrent_neural_network) and [LSTM](https://en.wikipedia.org/wiki/Long_short-term_memory).
A second component to many recognition tasks is a form of memory. A CNN encodes patterns without explicitly taking into account their relative position or their relationship to adjacent patterns. Here, Recurrent Neural Networks ([RNN](https://en.wikipedia.org/wiki/Recurrent_neural_network)) and Long Short-Term Memory ([LSTM](https://en.wikipedia.org/wiki/Long_short-term_memory)) networks provide a solution. These algorithms allow some of the information from adjacent data (either in time or space) to be retained, providing context for the current (time or space) position. Both these approaches can be uni- or bi-directional. In the former, context comes only from what has already been processed, so the direction of processing matters; in the latter, the sequence is processed in both directions, so it does not.

#### Negative space: connectionist temporal classification
### Negative space: connectionist temporal classification

In speech and written text much of the structure is defined not only by what is there, the spoken and written words, but also by what is not there, the pauses and spacing. Taken to the extreme, the expressionist / dadaist poem "Boem paukeslag" by [Paul van Ostaijen](https://en.wikipedia.org/wiki/Paul_van_Ostaijen) is an example of irregularity in typeset text. These irregularities or negative space in the pace of writing are another hurdle for text recognition algorithms.
In speech and written text much of the structure is defined not only by what is there, the spoken and written words, but also by what is not there, the pauses and spacing. Taken to the extreme, the expressionist / dadaist poem "Boem paukeslag" by [Paul van Ostaijen](https://en.wikipedia.org/wiki/Paul_van_Ostaijen) is an example of irregularity in typeset text. These irregularities or negative space in the pace of writing are another hurdle for text recognition algorithms. Generally, we want readable text as the output of our ML models, not dadaist impressions with large gaps.

```{r}
#| label: fig-boem-paukenslag
@@ -67,7 +71,7 @@ In speech and written text much of the structure is defined not only by what is
knitr::include_graphics("https://upload.wikimedia.org/wikipedia/commons/0/0c/Boempaukeslag.jpg")
```

These issues in detecting uneven spacing are addressed using the [connectionist temporal classification (CTC)](https://en.wikipedia.org/wiki/Connectionist_temporal_classification). This function is applied to the RNN and LSTM output, where it [collapses a sequence of recurring labels](https://distill.pub/2017/ctc/) through oversampling to its most likely reduced form while respecting spacing and coherence.
These issues in detecting uneven spacing are addressed using [Connectionist Temporal Classification (CTC)](https://en.wikipedia.org/wiki/Connectionist_temporal_classification). CTC is applied to the RNN or LSTM output, which oversamples the text line, and [collapses a sequence of recurring labels](https://distill.pub/2017/ctc/) to its most likely reduced form.
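The collapsing step can be illustrated with a short sketch. It covers only the decoding side of CTC (merging repeated labels and dropping a special blank label), not the loss function used during training. The function name `ctc_collapse` and the use of `-` as the blank symbol are assumptions made for this example.

```python
from itertools import groupby


def ctc_collapse(labels, blank="-"):
    """Collapse a per-step label sequence into its reduced form.

    Consecutive repeats are merged first, then blanks are removed.
    A blank between two identical labels keeps them as separate characters.
    """
    merged = [label for label, _ in groupby(labels)]
    return "".join(label for label in merged if label != blank)


# One label per time step, as an oversampled network output might look.
decoded = ctc_collapse("hh-e-ll-lo--")
```

Here `decoded` is `"hello"`: the blank between the two runs of `l` is what allows the double letter to survive the merging of repeats.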

```{r}
#| label: fig-ctc-loss
@@ -78,19 +82,23 @@ These issues in detecting uneven spacing are addressed using the [connectionist
knitr::include_graphics("images/ctc_loss_Hannun.png")
```

#### Transformers



### Implementation
### Data augmentation

Sufficient data is key to training an ML model that performs well. However, at times you might be limited in the data you can access for training. A common issue is that only limited ground truth data (labels, text transcriptions) are available. Data augmentation is a way to slightly alter a smaller existing dataset in order to create a larger, partially artificial dataset. Within the context of HTR/OCR one can generate slight variations of the same text image and label pair through computer vision (or machine learning) based alterations, such as rotating, skewing and introducing noise to the image.
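As a rough sketch of this idea, the plain Python below creates several altered copies of one image and label pair by skewing the image and adding noise. It is a toy illustration on a tiny grayscale grid, not the augmentation pipeline of any specific framework; the names `shift_row`, `shear`, `add_noise` and `augment`, and all parameter values, are assumptions made for this example.

```python
import random


def shift_row(row, dx, fill=255):
    """Shift one row by dx pixels (assumes abs(dx) < len(row)), padding with background."""
    if dx == 0:
        return list(row)
    if dx > 0:
        return [fill] * dx + row[:-dx]
    return row[-dx:] + [fill] * (-dx)


def shear(image, slant):
    """Skew the image by shifting each row in proportion to its height above the baseline."""
    height = len(image)
    return [
        shift_row(row, round(slant * (height - 1 - y)))
        for y, row in enumerate(image)
    ]


def add_noise(image, amount, rng):
    """Add random brightness noise to every pixel, clipped to the 0-255 range."""
    return [
        [min(255, max(0, pixel + rng.randint(-amount, amount))) for pixel in row]
        for row in image
    ]


def augment(image, label, copies=4, seed=42):
    """Return altered copies of the image, each paired with the unchanged label."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(copies):
        slant = rng.uniform(-0.5, 0.5)
        pairs.append((add_noise(shear(image, slant), 20, rng), label))
    return pairs


# A tiny stand-in for a scanned word: dark ink (0) on a light background (255).
word_image = [
    [255, 0, 255, 255, 255],
    [255, 0, 255, 0, 255],
    [255, 0, 0, 0, 255],
]
pairs = augment(word_image, "Juillet")
```

The label is copied unchanged to every variant, which is the point of the technique: one transcription yields several training examples.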

```{r}
#| label: fig-data-augmentation
#| fig-cap: "Data augmentation examples on the French word Juillet"
#| fig-align: "center"
#| out-width: "100%"
#| echo: FALSE
knitr::include_graphics("./images/image_augmentation.png")
```

#### Segmentation
## Implementation

- CNN
Putting all the pieces together, the most common ML implementations of text segmentation rely heavily on CNN-based segmentation networks, while text recognition often, if not always, takes the form of a CNN + (bidirectional) LSTM/RNN + CTC network. When reading technical documentation on the architecture of models in text transcription frameworks you might come across these terms. Depending on the implementation or framework used, data augmentation might be provided during training to broaden the scope of the model and increase the chances of Out-Of-Distribution (OOD) generalization.

#### Text Recognition
### Out-of-distribution generalization in text transcription

- CNN + biLSTM + CTC
- CNN + RNN + CTC
Handwritten text and old print vary widely in shape, form and retained quality. This pushes trained models towards poor performance, as the chances of good OOD generalization are small. In short, two text styles are rarely the same, or similar enough, for a trained model to be transferred to a new, seemingly similar, transcription task.