Commit

new content

khufkens committed Jan 7, 2025
1 parent 9e65824 commit dd5609f
Showing 13 changed files with 622 additions and 264 deletions.
10 changes: 5 additions & 5 deletions book/_quarto.yml
@@ -7,20 +7,20 @@ project:
- _headers

book:
title: "Book Title"
title: "Text recognition and analysis"
author: "Koen Hufkens"
date: "2023/01/01"
date: "2025/01/07"
page-navigation: true
chapters:
- index.qmd
- intro.qmd
- basicr.qmd
- ggplot.qmd
- ml_methods.qmd
- open_source.qmd
- references.qmd
favicon: "figures/favicon.ico"
twitter-card: true
search: true
repo-url: https://github.com/bluegreen-labs/R_book_template/
repo-url: https://github.com/bluegreen-labs/text_recognition_and_analysis/
sharing: [twitter, facebook]
navbar:
title: ""
48 changes: 0 additions & 48 deletions book/basicr.qmd

This file was deleted.

55 changes: 0 additions & 55 deletions book/ggplot.qmd

This file was deleted.

Binary file added book/images/Conv_no_padding_no_strides.gif
4 changes: 4 additions & 0 deletions book/images/HTR_workflow.drawio.svg
Binary file added book/images/ctc_loss_Hannun.png
8 changes: 2 additions & 6 deletions book/index.qmd
@@ -1,9 +1,5 @@
# Preface {.unnumbered}

This is a Quarto book.
These are the materials for the course "Text recognition and analysis", given 6-7 Feb. 2025 at the Leibniz-Institut für Europäische Geschichte (IEG), Mainz. This book will serve as a reference during the course, and as a general introduction to and reference for all things Handwritten Text Recognition / Optical Character Recognition (HTR/OCR).

To learn more about Quarto books visit <https://quarto.org/docs/books>.

```{r}
1 + 1
```
This reference gives an overview of the most common tools for historical (handwritten) text recognition. It will discuss the practical issues of such projects and how to resolve them efficiently and cost-effectively.
19 changes: 16 additions & 3 deletions book/intro.qmd
@@ -1,9 +1,22 @@
# Introduction

This is a book created from markdown and executable code.
Understanding or translating large volumes of handwritten historical text is critical for historical analysis, the preservation of text, the dissemination of knowledge, and the valorization of archived measurements and/or other scientific observations. However, reading and processing these large volumes of historical texts (at scale) is often difficult and time-consuming. Automating this process would therefore help many historical analysis, data recovery and other digital preservation efforts.

See @knuth84 for additional discussion of literate programming.
Handwritten text recognition (HTR), contrary to optical character recognition (OCR) for typed texts, is a relatively complex process. Handwritten text (or old fonts) is surprisingly varied, with characters varying from one person (or book) to the next. These variations make HTR/OCR at times an intractable problem.

## The HTR/OCR workflow

Generally, an HTR/OCR workflow follows two general steps: line/text detection and text transcription. The former detects lines of written text; once detected, these lines or text elements are evaluated one by one using a text transcription method and combined to form the final digital text document.

HIGHLIGHT/REFERENCE THE COLOURED BITS IN THE FIGURE

```{r}
#| label: fig-workflow
#| fig-cap: "The HTR/OCR workflow, from image acquisition to transcribed HTR/OCR results."
#| fig-align: "left"
#| out-width: "100%"
#| echo: FALSE
knitr::include_graphics("./images/HTR_workflow.drawio.svg")
```

Depending on the framework or workflow, different machine learning (ML) methods for text detection and transcription will be used. It is also key to understand that, from a practical computer science perspective, the problem of HTR/OCR is solved. Although algorithmic improvements will continue to be developed, the current state-of-the-art ML methods perform well for many applications. Most of these algorithms are, in the abstract, relatively easy to understand, and with today's software libraries and platforms they are even quicker to implement. I will briefly discuss various algorithms in section XYZ. A list of common frameworks and software is given in chapters XYZ.
96 changes: 96 additions & 0 deletions book/ml_methods.qmd
@@ -0,0 +1,96 @@
---
title: "ML and computer vision"
format: html
---

To understand the software (frameworks) for HTR/OCR solutions, a brief introduction to ML and computer vision methods is required. This will allow you to better understand potential pitfalls. As highlighted in Figure 2.1, there are two main ML components to HTR/OCR transcription workflows: a segmentation component and a text transcription component.

## Computer vision

Although computer vision methods, broadly, include ML methods, the classical approaches differ significantly from ML methods and merit a small mention. Classic computer vision methods do not rely on the machine learning methods discussed below LINK, but rather on pixel- (region-) or image-based transformations. These methods are often used in the pre-processing of images before a machine learning algorithm is applied. Classical examples are the removal of [uneven lighting across an image using adaptive histogram equalization](https://en.wikipedia.org/wiki/Adaptive_histogram_equalization), the detection of structuring elements such as [linear features using a Hough transform](https://en.wikipedia.org/wiki/Hough_transform), or the [adaptive thresholding of an image](https://en.wikipedia.org/wiki/Thresholding_(image_processing)) from colour to black-and-white only. These algorithms also serve an important role in the creation of additional data from a single reference dataset, through data augmentation LINK.

```{r}
#| label: fig-cv
#| fig-cap: "Example of various thresholding methods as implemented in the OpenCV computer vision library (https://opencv.org)"
#| fig-align: "center"
#| out-width: "50%"
#| echo: FALSE
knitr::include_graphics("https://docs.opencv.org/3.4/ada_threshold.jpg")
```
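To make the thresholding idea concrete, here is a minimal mean-based adaptive threshold sketched in Python with numpy. This is a simplified illustration of the principle, not the OpenCV implementation, and the tiny test image is made up:

```python
import numpy as np

def adaptive_threshold(img, block=3, c=0):
    # Compare each pixel to the mean of its block x block neighbourhood;
    # pixels brighter than the local mean (minus offset c) become white.
    pad = block // 2
    padded = np.pad(img.astype(float), pad, mode="edge")
    out = np.zeros(img.shape, dtype=np.uint8)
    for i in range(img.shape[0]):
        for j in range(img.shape[1]):
            local_mean = padded[i:i + block, j:j + block].mean()
            out[i, j] = 255 if img[i, j] > local_mean - c else 0
    return out

# A dark 5x5 "page" with one brighter stroke down the middle column.
page = np.full((5, 5), 50, dtype=np.uint8)
page[:, 2] = 200
binary = adaptive_threshold(page)
print(binary[:, 2])  # the stroke binarizes to white (255)
```

Because the threshold is computed locally, the stroke survives binarization even when the page illumination varies, which a single global threshold would not handle.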

## Machine Learning

### Principles

The machine learning components of the segmentation and transcription models rely on common ML algorithms and logic. To better understand these tasks, and how training methods influence the success of these models, I'll summarize some of these common building blocks. These are simplified, popularized descriptions meant to build a broad understanding of these processes; for in-depth discussions I refer to the linked articles in the text.

::: callout-note
Those familiar with machine learning methods can skip this section.
:::

#### Model training

Machine learning models are trained by iteratively adjusting their internal parameters (weights) to minimize a loss function, a measure of the mismatch between the model's predictions and known reference (training) data.
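As a minimal illustration of this loss-minimization loop, here is a toy gradient-descent fit of a single weight in Python. The data and learning rate are made-up placeholders, not an HTR model, but the principle is the same:

```python
# Fit y = w * x by gradient descent on a mean squared error loss.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]  # generated with a "true" weight of 2

w = 0.0    # initial guess
lr = 0.05  # learning rate

for _ in range(200):
    # Gradient of the mean squared error with respect to w.
    grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
    w -= lr * grad

print(round(w, 3))  # converges to 2.0
```

Real models have millions of such weights and far noisier losses, but the adjust-and-re-evaluate cycle is identical.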

#### Data augmentation

Data augmentation artificially expands a limited training dataset by applying label-preserving transformations, such as rotation, skewing, blurring or added noise, to existing samples, exposing the model to more variation than the original data contains.

#### Detecting patterns: convolutional neural networks (CNN)

The analysis of images within the context of machine learning often (but not exclusively) happens using convolutional neural networks (CNNs). Conceptually, a CNN can be seen as taking sequential sections of the image and summarizing them (i.e. convolving them) using a function (a filter) to a lower, aggregated resolution (FIGURE XYZ). This reduces the size of the image while at the same time extracting a certain characteristic using the filter function. One of the simplest such functions is taking the average value across a 3x3 window.

```{r}
#| label: fig-convolution
#| fig-cap: "An example convolution of a 3x3 window across a larger blue image summarizing values (squares) to a smaller green image (by Kaivan Kamali at https://galaxyproject.org/)"
#| fig-align: "center"
#| out-width: "30%"
#| echo: FALSE
knitr::include_graphics("./images/Conv_no_padding_no_strides.gif")
```
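The moving-window average described above can be sketched in a few lines of Python with numpy. This is an illustrative sketch only; in a real CNN the filter weights are learned during training rather than fixed to an average:

```python
import numpy as np

def convolve_mean(img, k=3):
    # Slide a k x k averaging window over the image (no padding,
    # stride 1); the output is smaller than the input, as in the
    # animation above.
    h, w = img.shape
    out = np.empty((h - k + 1, w - k + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = img[i:i + k, j:j + k].mean()
    return out

img = np.arange(16, dtype=float).reshape(4, 4)
summary = convolve_mean(img)
print(summary.shape)  # a 4x4 image reduces to 2x2
```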

It is important to understand this concept within the context of text recognition and classification tasks in general: it highlights the fact that ML algorithms do not "understand" (handwritten) text. Where people make sense of handwritten text by following its flow in addition to recognizing patterns, ML approaches focus purely on patterns, shapes or forms. However, some form of memory can be included using other methods.

#### Memory: recurrent neural networks

A second component of many recognition tasks is a form of memory, provided by a [recurrent neural network (RNN)](https://en.wikipedia.org/wiki/Recurrent_neural_network) or its [long short-term memory (LSTM)](https://en.wikipedia.org/wiki/Long_short-term_memory) variant. These networks pass a hidden state from one position in a sequence to the next, so that earlier characters in a line can inform the interpretation of later ones.
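The recurrence at the heart of an RNN can be sketched as follows in Python with numpy (the weights here are random placeholders; a trained model would learn them):

```python
import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh):
    # The new hidden state mixes the current input with the previous
    # hidden state: this carried-over state is the network's "memory".
    return np.tanh(x_t @ W_xh + h_prev @ W_hh)

rng = np.random.default_rng(42)
W_xh = rng.normal(size=(4, 8))  # input-to-hidden weights
W_hh = rng.normal(size=(8, 8))  # hidden-to-hidden weights

h = np.zeros(8)  # start with an empty memory
for x_t in rng.normal(size=(5, 4)):  # a 5-step input sequence
    h = rnn_step(x_t, h, W_xh, W_hh)

print(h.shape)  # (8,): one state summarizing the whole sequence
```

An LSTM replaces this single `tanh` update with gated updates that control what is remembered and forgotten, which makes long sequences (such as full text lines) tractable.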

#### Negative space: connectionist temporal classification

In speech and written text much of the structure is defined not only by what is there, the spoken and written words, but also by what is not there, the pauses and spacing. Taken to the extreme, the expressionist/dadaist poem "Boem paukeslag" by [Paul van Ostaijen](https://en.wikipedia.org/wiki/Paul_van_Ostaijen) is an example of irregularity in typeset text. These irregularities, or the negative space in the pace of writing, are another hurdle for text recognition algorithms.

```{r}
#| label: fig-boem-paukenslag
#| fig-cap: "Boem paukeslag by Paul van Ostaijen"
#| fig-align: "left"
#| out-width: "30%"
#| echo: FALSE
knitr::include_graphics("https://upload.wikimedia.org/wikipedia/commons/0/0c/Boempaukeslag.jpg")
```

These issues in detecting uneven spacing are addressed using [connectionist temporal classification (CTC)](https://en.wikipedia.org/wiki/Connectionist_temporal_classification). This function is applied to the RNN/LSTM output, where it [collapses a sequence of recurring labels](https://distill.pub/2017/ctc/), produced through oversampling, to its most likely reduced form while respecting spacing and coherence.
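The collapsing rule itself is simple enough to sketch in Python (decoding only; the hard part, scoring all possible alignments during training, is covered in the linked Distill article):

```python
def ctc_collapse(labels, blank="-"):
    # CTC decoding rule: merge consecutive repeated labels, then drop
    # the blank symbol that separates genuine repetitions.
    out = []
    prev = None
    for ch in labels:
        if ch != prev and ch != blank:
            out.append(ch)
        prev = ch
    return "".join(out)

print(ctc_collapse("hh-e-ll-lloo"))  # "hello"
print(ctc_collapse("a--a"))          # "aa": the blank keeps the two a's apart
```

Note how the blank label lets the model represent both stretched-out characters and true double letters, exactly the uneven spacing problem described above.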

```{r}
#| label: fig-ctc-loss
#| fig-cap: "A visualization of the CTC algorithm adapted from Hannun, 'Sequence Modeling with CTC', Distill, 2017. doi: 10.23915/distill.00008"
#| fig-align: "left"
#| out-width: "100%"
#| echo: FALSE
knitr::include_graphics("images/ctc_loss_Hannun.png")
```

#### Transformers



### Implementation



#### Segmentation

- CNN

#### Text Recognition

- CNN + biLSTM + CTC
- CNN + RNN + CTC
43 changes: 43 additions & 0 deletions book/open_source.qmd
@@ -0,0 +1,43 @@
---
title: "Open Source"
format: html
---

## (not) Open-source HTR/OCR

### A solved methodology

So, what is holding back universal open-source HTR/OCR?

Data is what holds back HTR in practice. Given the many variations in handwritten text, ML algorithms, which are generally based on pattern recognition rather than an understanding of the writing process, need to be trained on ("see") a wide variety of handwritten text characters to be able to, firstly, transcribe similarly styled handwritten text and, secondly, potentially apply this to adjacent styles. How close two documents are in writing style determines how well a trained model will perform on this task. Consequently, the more variations in handwriting style an ML algorithm is trained on, the easier it becomes to transcribe a wide variety of text styles. In short, the bottleneck in automated transcription is gathering sufficient training data (for your use case).

### Precious data

Unsurprisingly, although the ML code might be open-source, many large training datasets are closely guarded secrets, as are the interfaces used to generate such datasets. It can be argued that, within the context of FAIR research practices, ML code disseminated without the training data or model parameters for a particular study is decidedly not open-source. A similar argument has been made within the context of the recent flurry of supposedly open-source Large Language Models (LLMs), such as ChatGPT.

The lack of access to both the training data and a pre-trained model limits the re-use of the model in a new context. One cannot take such a model and fine-tune it, i.e. let it "see" new text styles. In short, if you only have the underlying model code you will always have to train a model from scratch (anew) using your own, often limited, dataset. This context is important to understand, as it is how various transcription platforms keep you tied to their paid service. For example, Transkribus, although making the training process easy by building on the open-source {pylaia} python library, will not allow you to export the resulting model weights for offline use. These platforms, although providing a service, also hold you hostage and play on the network effect to enroll as many users as possible (i.e. sharing model weights internally, despite the open-source claims of the framework).

This lock-in situation often comes at a cost which does not scale in favour of users and their own contributions. Using the recovery of climate data as a worked example, a cost break-down shows that, after training a custom model, the extraction of tables (1 credit), their fields (1 credit), and text detection and transcription (1 credit) requires 3 credits per page. For the 75K tables in the archive this represents 225K Transkribus credits and, with a data volume of over 200 GB requiring a Team plan, roughly 60,000 EUR, under the assumption that no re-runs are required (i.e. perfect results). Experience teaches that ML is often iterative, and the true costs will probably be far higher (>150K EUR). The various vision APIs of Google or Amazon are cheaper, but do not allow for training and perform poorly on cursive text.
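The arithmetic behind this worked example, with the per-page credit costs from the text as stated assumptions:

```python
# Worked cost estimate for the climate-data archive example; the
# credit prices per processing step are the assumptions stated above.
credits_table = 1          # table detection
credits_fields = 1         # field extraction
credits_transcribe = 1     # text detection and transcription
credits_per_page = credits_table + credits_fields + credits_transcribe

pages = 75_000             # tables in the archive
total_credits = pages * credits_per_page
print(total_credits)  # 225000 credits, assuming a single perfect pass
```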

This shows that when tasks become large, with more complex workflows, alternatives might be cost-effective. Despite the usability merits of some of these platforms, how easy is it to escape the Faustian bargain of platforms (or APIs) and their lock-in?


## True open-source?

As shown, foregoing interoperability and independence of your processing might betray you in the long run should data volumes increase. So what are the open-source options in this context?

### eScriptorium + Kraken

- <https://github.com/HTR-United/CREMMA-Medieval-LAT>
- <https://help.transkribus.org/data-preparation>
- <https://escriptorium.readthedocs.io/en/latest/quick-start/>
- <https://ub-mannheim.github.io/eScriptorium_Dokumentation/Training-with-eScriptorium-EN.html>
- <https://kraken.re>
- <https://github.com/OCR4all>

### Custom pipelines




- <https://github.com/HTR-United/htr-united>
4 changes: 0 additions & 4 deletions book/summary.qmd
@@ -1,7 +1,3 @@
# Summary

In summary, this book has no content whatsoever.

```{r}
1 + 1
```
8 changes: 5 additions & 3 deletions renv.lock
@@ -374,10 +374,12 @@
},
"renv": {
"Package": "renv",
"Version": "0.16.0",
"Source": "Repository",
"Version": "1.0.11",
"OS_type": NA,
"NeedsCompilation": "no",
"Repository": "CRAN",
"Hash": "c9e8442ab69bc21c9697ecf856c1e6c7",
"Path": NA,
"Source": "Repository",
"Requirements": []
},
"rlang": {