Text correction benchmarks

Benchmarks and baselines for various text correction tasks. Lightweight and easy to use.

Installation

pip install text-correction-benchmarks

or

git clone https://github.com/ad-freiburg/text-correction-benchmarks
cd text-correction-benchmarks && pip install .

After installation you will have two commands available to you:

  • tcb.evaluate for evaluating model predictions on benchmarks
  • tcb.baseline for running baselines on benchmarks

Usage

This repository contains benchmarks for text correction tasks such as

  • Whitespace correction (wsc)
  • Spelling error correction (sec)
  • Word-level spelling error detection (sedw)
  • Sequence-level spelling error detection (seds)

in the following simple text file format:

  • Whitespace correction
    • corrupt.txt: Input text with whitespace errors (can also contain spelling errors, but they remain uncorrected in the groundtruth)
    • correct.txt: Groundtruth text without whitespace errors

    "Th isis a tset." > "This is a tset."

  • Spelling error correction
    • corrupt.txt: Input text with spelling errors (can also contain whitespace errors to make the task harder)
    • correct.txt: Groundtruth text without whitespace and spelling errors

    "Th isis a tset." > "This is a test."

  • Word-level spelling error detection
    • corrupt.txt: Input text with spelling errors (should not contain whitespace errors, since we assume they are already fixed)
    • correct.txt: Groundtruth label for each word in the input (split by whitespace), indicating whether a word contains a spelling error (1) or not (0)

    "This is a tset." > "0 0 0 1"

  • Sequence-level spelling error detection
    • corrupt.txt: Input text with spelling errors (can also contain whitespace errors)
    • correct.txt: Groundtruth label for each sequence in the input, indicating whether the sequence contains a spelling error (1) or not (0)

    "this is a tset." > "1"

In every format, one line corresponds to one benchmark sample, so line i of corrupt.txt belongs to line i of correct.txt.
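
Because the files are line-aligned, a benchmark can be loaded with a few lines of Python. The following is a minimal sketch, not part of the package: the helper names `read_lines`, `load_benchmark` and `check_word_labels` are made up for illustration, and the files are assumed to be UTF-8 encoded.

```python
from pathlib import Path


def read_lines(path):
    """Read a text file and return its lines without trailing newlines."""
    with open(path, encoding="utf8") as f:
        return [line.rstrip("\r\n") for line in f]


def load_benchmark(benchmark_dir):
    """Return a list of (corrupt, correct) pairs, one per benchmark sample."""
    benchmark_dir = Path(benchmark_dir)
    corrupt = read_lines(benchmark_dir / "corrupt.txt")
    correct = read_lines(benchmark_dir / "correct.txt")
    if len(corrupt) != len(correct):
        raise ValueError(
            f"expected the same number of lines, got {len(corrupt)} and {len(correct)}"
        )
    return list(zip(corrupt, correct))


def check_word_labels(pairs):
    """For word-level detection: check for one 0/1 label per whitespace-split word."""
    for i, (text, labels) in enumerate(pairs, start=1):
        words, labels = text.split(), labels.split()
        if len(words) != len(labels) or any(l not in ("0", "1") for l in labels):
            raise ValueError(f"sample {i}: labels do not match the input words")
```

The label check only applies to word-level spelling error detection, where correct.txt holds labels instead of text.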

Note that for some benchmarks we also provide splits other than test, e.g. dev or tuning, which can be used to assess the performance of your method during development. Final evaluations should always be done on the test split.

To evaluate predictions on a benchmark using tcb.evaluate, the following procedure is recommended:

  1. Run your model on benchmarks/<split>/<task>/<benchmark>/corrupt.txt
  2. Save your predictions in the expected format for the benchmark under benchmarks/<split>/<task>/<benchmark>/predictions/<model_name>.txt
  3. Evaluate your predictions on a benchmark using tcb.evaluate:
    # evaluate your predictions on the benchmark
    tcb.evaluate -b benchmarks/<split>/<task>/<benchmark>
    
    # optionally sort by some metric and highlight the best predictions
    tcb.evaluate -b benchmarks/<split>/<task>/<benchmark> --sort "<metric>" --highlight
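
Steps 1 and 2 can be scripted. The sketch below is an illustration under assumptions, not code from this repository: `write_predictions` is a hypothetical helper, and `predict` stands for any function of yours that maps one input line to one output line in the format of the task.

```python
from pathlib import Path


def write_predictions(benchmark_dir, model_name, predict):
    """Run `predict` on every line of corrupt.txt and save one prediction per line."""
    benchmark_dir = Path(benchmark_dir)
    with open(benchmark_dir / "corrupt.txt", encoding="utf8") as f:
        inputs = [line.rstrip("\r\n") for line in f]

    predictions = [predict(text) for text in inputs]
    # a prediction spanning several lines would shift all following samples
    if any("\n" in p for p in predictions):
        raise ValueError("each prediction must fit on a single line")

    out_dir = benchmark_dir / "predictions"
    out_dir.mkdir(exist_ok=True)
    out_file = out_dir / f"{model_name}.txt"
    with open(out_file, "w", encoding="utf8") as f:
        for prediction in predictions:
            f.write(prediction + "\n")
    return out_file
```

Afterwards the benchmark directory can be passed to tcb.evaluate as shown above.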

You can also evaluate across multiple benchmarks like so:

# when evaluating across multiple benchmarks you always need to specify a metric,
# otherwise you will get an error 

# listing multiple benchmarks
tcb.evaluate -b benchmarks/<split>/<task>/<benchmark1> \
    benchmarks/<split>/<task>/<benchmark2> ... -m "<metric>"

# using glob pattern
tcb.evaluate -b benchmarks/<split>/<task>/<gl*ob_patte*rn> -m "<metric>"

# you can highlight the best predictions per benchmark
tcb.evaluate -b benchmarks/<split>/<task>/<gl*ob_patte*rn> -m "<metric>" --highlight

Depending on the task the following metrics are calculated:

  • Whitespace correction
    • F1 (micro-averaged)
    • F1 (sequence-averaged)
    • Sequence accuracy
  • Spelling error correction
    • F1 (micro-averaged)
    • F1 (sequence-averaged)
    • Sequence accuracy
  • Word-level spelling error detection
    • Word accuracy
    • Binary F1 (micro-averaged)
  • Sequence-level spelling error detection
    • Binary F1
    • Sequence accuracy
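
To make some of these metrics concrete, here is a rough sketch of sequence accuracy, word accuracy and micro-averaged binary F1. It assumes the conventional definitions of these metrics; the exact computation inside tcb.evaluate is not described in this README and may differ in details. The function names are chosen for illustration only.

```python
def sequence_accuracy(predictions, targets):
    """Fraction of samples whose prediction matches the ground truth exactly."""
    if len(predictions) != len(targets):
        raise ValueError("need one prediction per ground truth line")
    if not targets:
        return 0.0
    return sum(p == t for p, t in zip(predictions, targets)) / len(targets)


def word_level_scores(predictions, targets):
    """Word accuracy and micro-averaged binary F1 over lines of 0/1 labels."""
    tp = fp = fn = tn = 0
    for pred_line, target_line in zip(predictions, targets):
        pred, target = pred_line.split(), target_line.split()
        if len(pred) != len(target):
            raise ValueError("need one predicted label per word")
        for p, t in zip(pred, target):
            if p == "1" and t == "1":
                tp += 1
            elif p == "1":
                fp += 1
            elif t == "1":
                fn += 1
            else:
                tn += 1
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    # micro-averaging pools the counts of all samples before computing F1;
    # returning 0.0 when there are no positives at all is a choice made here
    denominator = 2 * tp + fp + fn
    f1 = 2 * tp / denominator if denominator else 0.0
    return accuracy, f1
```

Both functions take the lines of a prediction file and the lines of correct.txt as lists of strings.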

Baselines

We also provide baselines for each task:

  • Whitespace correction:
    • Dummy (wsc_dummy)
  • Spelling error correction:
  • Word-level spelling error detection:
    • Dummy (sedw_dummy)
    • Out of dictionary (sedw_ood)
    • From spelling error correction [1] (sedw_from_sec)
  • Sequence-level spelling error detection:
    • Dummy (seds_dummy)
    • Out of dictionary (seds_ood)
    • From spelling error correction [1] (seds_from_sec)

The dummy baselines produce the predictions one gets by leaving the inputs unchanged.

[1] We can reuse spelling error correction baselines to detect spelling errors both on a word and sequence level. For the word level we simply predict that all words changed by a spelling corrector contain a spelling error. For the sequence level we predict that a sequence contains a spelling error if it is changed by a spelling corrector. All spelling error correction baselines or prediction files can be used as underlying spelling correctors for this purpose.
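
The idea of deriving detections from a spelling corrector can be sketched as follows. This is an illustration, not the implementation behind sedw_from_sec or seds_from_sec: the function names are hypothetical, and the word-level variant assumes that the corrector keeps the number of whitespace-separated words unchanged, since how the baselines align inputs and corrections otherwise is not described here.

```python
def words_changed(original, corrected):
    """Word-level labels: 1 for every word the corrector changed, 0 otherwise."""
    original_words, corrected_words = original.split(), corrected.split()
    # assumption: the corrector did not merge or split any words
    if len(original_words) != len(corrected_words):
        raise ValueError("corrector changed the number of words")
    return " ".join(
        str(int(o != c)) for o, c in zip(original_words, corrected_words)
    )


def sequence_changed(original, corrected):
    """Sequence-level label: 1 if the corrector changed the sequence at all."""
    return str(int(original != corrected))
```

For example, `words_changed("This is a tset.", "This is a test.")` gives `"0 0 0 1"`, and `sequence_changed("this is a tset.", "this is a test.")` gives `"1"`, matching the label formats shown in the Usage section.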

You can run a baseline using tcb.baseline:

# run baseline on stdin and output to stdout
tcb.baseline <baseline_name>

# run baseline on file and output to stdout
tcb.baseline <baseline_name> -f <input_file>

# run baseline on file and write predictions to file
tcb.baseline <baseline_name> -f <input_file> -o <output_file>

# some baselines require additional arguments and
# print an error message if you don't pass them,
# e.g. all dictionary-based baselines like the out of dictionary baseline
# for word-level spelling error detection need the path to a dictionary
# as an additional argument
tcb.baseline sedw_ood -f <input_file> --dictionary <dictionary_file>

Dictionaries can be found here.
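
As a rough illustration of what an out of dictionary baseline does, the sketch below flags every word that is not found in a set of known words. It is not the code behind sedw_ood or seds_ood. The dictionary is passed as a plain Python set because the format of the dictionary files is not described in this README, and the normalization (stripping surrounding punctuation, lowercasing, ignoring tokens without letters) is an assumption made for this example.

```python
import string


def ood_word_labels(text, dictionary):
    """Label each whitespace-split word with 1 if it is not in the dictionary."""
    labels = []
    for word in text.split():
        # assumed normalization: drop surrounding punctuation and lowercase
        core = word.strip(string.punctuation).lower()
        # tokens without any letters (numbers, punctuation) are never flagged
        known = not any(ch.isalpha() for ch in core) or core in dictionary
        labels.append("0" if known else "1")
    return " ".join(labels)


def ood_sequence_label(text, dictionary):
    """Label a sequence with 1 if at least one of its words is out of dictionary."""
    return "1" if "1" in ood_word_labels(text, dictionary).split() else "0"
```

With the dictionary `{"this", "is", "a", "test"}`, the input "This is a tset." gets the word-level labels "0 0 0 1".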

Predictions of the baselines and other models from the literature can be found in a subdirectory predictions in each benchmark, see e.g. here.


This repository is backed by the text-correction-utils package.