OCR Post Correction for Endangered Language Texts

Last update: Dec 31, 2022

📌 Coming soon: an update to the software including features from our paper on semi-supervised OCR post-correction, to be published in the Transactions of the Association for Computational Linguistics (TACL)!

Check out the paper here.

This repository contains code for models and experiments from the paper "OCR Post Correction for Endangered Language Texts".

Textual data in endangered languages is often found in formats that are not machine-readable, including scanned images of paper books. Extracting the text is challenging because there is typically no annotated data to train an OCR system for each endangered language. Instead, we focus on post-correcting the OCR output from a general-purpose OCR system.

📌 In the paper, we present a dataset containing annotations for documents in three critically endangered languages: Ainu, Griko, Yakkha.

📌 Our model reduces the recognition error rate by 34% on average, relative to a state-of-the-art OCR system.
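To make the error-rate claim concrete, here is a minimal sketch of how a character error rate and its relative reduction can be computed. It assumes the reported figure is a relative reduction in an edit-distance-based error rate; the function names and the example strings are made up for illustration and are not taken from the repository or the dataset.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, counted in characters."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete a reference character
                            curr[j - 1] + 1,      # insert a hypothesis character
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def character_error_rate(ref: str, hyp: str) -> float:
    """Edit distance normalized by the reference length."""
    if not ref:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)


def relative_reduction(before: float, after: float) -> float:
    """Fraction of the original error removed by post-correction."""
    return (before - after) / before


# Hypothetical strings, for illustration only.
reference = "hello world"
first_pass = "hel1o w0rld"      # two misrecognized characters
post_corrected = "hello w0rld"  # one error remains

cer_before = character_error_rate(reference, first_pass)
cer_after = character_error_rate(reference, post_corrected)
print(relative_reduction(cer_before, cer_after))
```

In this toy case post-correction fixes one of two errors, so half of the original error is removed.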

Learn more about the paper here!

OCR Post-Correction

The goal of OCR post-correction is to automatically correct errors in the text output from an existing OCR system.

The existing OCR system is used to obtain a first pass transcription of the input image (example below in the endangered language Griko):

The incorrectly recognized characters in the first pass are then corrected by the post-correction model.

Model

As seen in the example above, OCR post-correction is a text-based sequence-to-sequence task.

📌 We use a character-level encoder-decoder architecture with attention and add several adaptations for the low-resource setting. The paper has all the details!

📌 The model is trained in a supervised manner. The training data consists of first pass OCR outputs as the source with corresponding manually corrected transcriptions as the target.
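As a rough illustration of the supervised setup, the sketch below pairs first pass OCR lines with their corrected transcriptions and encodes them as character indices. The helper names, the special tokens, and the example strings are assumptions for this sketch; the repository's own data loading may differ.

```python
def make_pairs(ocr_lines, corrected_lines):
    """Align first pass OCR lines (source) with corrected lines (target)."""
    if len(ocr_lines) != len(corrected_lines):
        raise ValueError("source and target must have the same number of lines")
    return list(zip(ocr_lines, corrected_lines))


def build_char_vocab(lines, specials=("<pad>", "<s>", "</s>", "<unk>")):
    """Map every character seen in the data to an integer index."""
    vocab = {tok: i for i, tok in enumerate(specials)}
    for line in lines:
        for ch in line:
            vocab.setdefault(ch, len(vocab))
    return vocab


def encode(line, vocab):
    """Turn a line into character indices wrapped in start and end markers."""
    unk = vocab["<unk>"]
    return [vocab["<s>"]] + [vocab.get(ch, unk) for ch in line] + [vocab["</s>"]]


# Hypothetical one-line dataset, for illustration only.
pairs = make_pairs(["hel1o"], ["hello"])
vocab = build_char_vocab([text for pair in pairs for text in pair])
encoded = [(encode(src, vocab), encode(tgt, vocab)) for src, tgt in pairs]
```

Working at the character level keeps the vocabulary small, which suits the low-resource setting described above.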

📌 Some books that contain texts in endangered languages also contain translations of the text in another (usually high-resource) language. We incorporate an additional encoder in the model, with a multisource framework, to use the information from these translations if they are available.
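One way to picture the multisource idea is a decoder that attends separately over each encoder and combines the results. The sketch below uses plain dot-product attention and concatenates the two context vectors; both choices are assumptions made for illustration, so refer to the paper for the model's actual formulation.

```python
import math


def attention_context(query, states):
    """Dot-product attention: weighted average of encoder states."""
    scores = [sum(q * s for q, s in zip(query, state)) for state in states]
    top = max(scores)
    exps = [math.exp(score - top) for score in scores]  # numerically stable softmax
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(states[0])
    return [sum(w * state[d] for w, state in zip(weights, states)) for d in range(dim)]


def multisource_context(query, ocr_states, translation_states):
    """Attend over each encoder separately, then concatenate the two contexts."""
    return (attention_context(query, ocr_states)
            + attention_context(query, translation_states))


# Toy vectors standing in for learned encoder states (hypothetical values).
decoder_state = [0.0, 0.0]
ocr_states = [[1.0, 0.0], [3.0, 0.0]]
translation_states = [[0.0, 2.0]]
context = multisource_context(decoder_state, ocr_states, translation_states)
```

With a single encoder, the second term is simply dropped, which corresponds to the single-source model.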

We provide instructions for both single-source and multisource models:

The single-source model can be used for almost any document and is significantly easier to set up.
The multisource model can only be used if translations are available.

Dataset

This repository contains a sample from our dataset in sample_dataset, which you can use to train the post-correction model. Get the full dataset here!

However, this repository can be used to train OCR post-correction models for documents in any language!

🚀 If you want to use our model with a new set of documents, construct a dataset by following the steps here.

🚀 We'd love to hear about the new datasets and models you build: send us an email at [email protected]!

Running Experiments

Once you have a suitable dataset (e.g., sample_dataset or your own dataset), you can train a model and run experiments on OCR post-correction.

If you have your own dataset, you can use the utils/prepare_data.py script to create train, development, and test splits (see the last step here).
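For intuition, a split of this kind can be sketched as follows. This is not the code in utils/prepare_data.py; the function name, the split fractions, and the seed are assumptions chosen for illustration.

```python
import random


def split_pairs(pairs, dev_frac=0.1, test_frac=0.1, seed=0):
    """Shuffle (source, target) pairs and cut them into train, dev, and test."""
    items = list(pairs)
    random.Random(seed).shuffle(items)  # fixed seed keeps the split reproducible
    n_dev = int(len(items) * dev_frac)
    n_test = int(len(items) * test_frac)
    dev = items[:n_dev]
    test = items[n_dev:n_dev + n_test]
    train = items[n_dev + n_test:]
    return train, dev, test


# Hypothetical data: ten (OCR line, corrected line) pairs.
data = [(f"ocr line {i}", f"corrected line {i}") for i in range(10)]
train, dev, test = split_pairs(data)
```

Shuffling before cutting avoids putting all lines from one part of a book into a single split.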

The steps are described below, illustrated with sample_dataset/postcorrection. If using another dataset, simply change the experiment settings to point to your dataset and run the same scripts.

Requirements

Python 3 or later is required. Install the packages with pip:

pip install -r postcorr_requirements.txt

Training

The process of training the post-correction model has two main steps:

1. Pretraining with first pass OCR outputs.
2. Training with manually corrected transcriptions in a supervised manner.

For a single-source model, modify the experimental settings in train_single-source.sh to point to the appropriate dataset and desired output folder. It is currently set up to use sample_dataset.

Then run

bash train_single-source.sh

For multisource, use train_multi-source.sh.

Log files and saved models are written to the user-specified experiment folder for both the pretraining and training steps. For a list of all available hyperparameters and options, look at postcorrection/constants.py and postcorrection/opts.py.

Testing

For testing with a single-source model, modify the experimental settings in test_single-source.sh. It is currently set up to use sample_dataset.

Then run

bash test_single-source.sh

For multisource, use test_multi-source.sh.

Citation

Please cite our paper if this repository was useful.

@inproceedings{rijhwani-etal-2020-ocr,
    title = "{OCR} {P}ost {C}orrection for {E}ndangered {L}anguage {T}exts",
    author = "Rijhwani, Shruti  and
      Anastasopoulos, Antonios  and
      Neubig, Graham",
    booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.emnlp-main.478",
    doi = "10.18653/v1/2020.emnlp-main.478",
    pages = "5931--5942",
}
