Read Like Humans: Autonomous, Bidirectional and Iterative Language Modeling for Scene Text Recognition

Last update: Dec 31, 2022

The official code of ABINet (CVPR 2021, Oral).

ABINet uses a vision model and an explicit language model, trained end to end, to recognize text in the wild. The language model (BCN) learns a bidirectional language representation by simulating a cloze test, and additionally uses an iterative correction strategy.
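
As a rough illustration of the cloze idea: the ABINet paper describes masking the diagonal of the attention map in BCN, so that each character position is predicted from its left and right context but never from itself. The sketch below is a minimal pure-Python version of such a mask. The names `cloze_mask` and `masked_softmax` are illustrative and are not taken from this repository.

```python
import math

def cloze_mask(length):
    """Additive attention mask: -inf on the diagonal, 0 elsewhere."""
    return [[-math.inf if i == j else 0.0 for j in range(length)]
            for i in range(length)]

def masked_softmax(scores, mask_row):
    """Softmax over one row of attention scores after adding the mask row."""
    masked = [s + m for s, m in zip(scores, mask_row)]
    peak = max(masked)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in masked]
    total = sum(exps)
    return [e / total for e in exps]

mask = cloze_mask(4)
# With uniform raw scores, position 1 attends only to positions 0, 2 and 3.
weights = masked_softmax([0.0, 0.0, 0.0, 0.0], mask[1])
print(weights)
```

Because the masked position receives zero attention weight, the prediction for a character cannot simply copy that character from the input, which is what makes the task resemble a cloze test.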

Runtime Environment

We provide a pre-built Docker image, built from docker/Dockerfile.

Running in Docker

$ git clone git@github.com:FangShancheng/ABINet.git
$ docker run --gpus all --rm -ti --ipc=host -v $(pwd)/ABINet:/app fangshancheng/fastai:torch1.1 /bin/bash

(Untested) Alternatively, install the dependencies directly:
```
pip install -r requirements.txt
```

Datasets

Training datasets
1. MJSynth (MJ):
  - Use tools/create_lmdb_dataset.py to convert images into LMDB dataset
  - LMDB dataset BaiduNetdisk(passwd:n23k)
2. SynthText (ST):
  - Use tools/crop_by_word_bb.py to crop images from original SynthText dataset, and convert images into LMDB dataset by tools/create_lmdb_dataset.py
  - LMDB dataset BaiduNetdisk(passwd:n23k)
3. WikiText103, which is only used for pre-training language models:
  - Use notebooks/prepare_wikitext103.ipynb to convert text into CSV format.
  - CSV dataset BaiduNetdisk(passwd:dk01)
Evaluation datasets. The LMDB datasets can be downloaded from BaiduNetdisk(passwd:1dbv) or GoogleDrive.
1. ICDAR 2013 (IC13)
2. ICDAR 2015 (IC15)
3. IIIT5K Words (IIIT)
4. Street View Text (SVT)
5. Street View Text-Perspective (SVTP)
6. CUTE80 (CUTE)

The structure of data directory is

data
├── charset_36.txt
├── evaluation
│   ├── CUTE80
│   ├── IC13_857
│   ├── IC15_1811
│   ├── IIIT5k_3000
│   ├── SVT
│   └── SVTP
├── training
│   ├── MJ
│   │   ├── MJ_test
│   │   ├── MJ_train
│   │   └── MJ_valid
│   └── ST
├── WikiText-103.csv
└── WikiText-103_eval_d1.csv
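
Before training, it can help to confirm that the layout above is in place. The following is a minimal sketch that only checks for the paths listed in the tree. The helper name `missing_entries` and the default root `data` are assumptions for illustration and are not part of the repository.

```python
from pathlib import Path

# Relative paths copied from the directory tree above.
EXPECTED = [
    "charset_36.txt",
    "evaluation/CUTE80",
    "evaluation/IC13_857",
    "evaluation/IC15_1811",
    "evaluation/IIIT5k_3000",
    "evaluation/SVT",
    "evaluation/SVTP",
    "training/MJ/MJ_test",
    "training/MJ/MJ_train",
    "training/MJ/MJ_valid",
    "training/ST",
    "WikiText-103.csv",
    "WikiText-103_eval_d1.csv",
]

def missing_entries(data_root):
    """Return the expected relative paths that do not exist under data_root."""
    root = Path(data_root)
    return [rel for rel in EXPECTED if not (root / rel).exists()]

if __name__ == "__main__":
    missing = missing_entries("data")  # hypothetical location of the data directory
    if missing:
        print("Missing entries:")
        for rel in missing:
            print("  " + rel)
    else:
        print("Data directory matches the expected layout.")
```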

Pretrained Models

Get the pretrained models from BaiduNetdisk(passwd:kwck) or GoogleDrive. The performance of the pretrained models is summarized as follows:

| Model | IC13 | SVT | IIIT | IC15 | SVTP | CUTE | AVG |
|---|---|---|---|---|---|---|---|
| ABINet-SV | 97.1 | 92.7 | 95.2 | 84.0 | 86.7 | 88.5 | 91.4 |
| ABINet-LV | 97.0 | 93.4 | 96.4 | 85.9 | 89.5 | 89.2 | 92.7 |
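
The AVG column is not the plain mean of the six accuracies (for ABINet-LV that would be about 91.9). It appears to be consistent with an average weighted by test-set size. The sketch below checks this under stated assumptions: the sizes for IC13, IC15 and IIIT are read from the directory names `IC13_857`, `IC15_1811` and `IIIT5k_3000`, while the sizes for SVT (647), SVTP (645) and CUTE (288) are commonly reported values that this README does not state.

```python
# Accuracy (%) per benchmark, copied from the table above.
ACCURACY = {
    "ABINet-SV": {"IC13": 97.1, "SVT": 92.7, "IIIT": 95.2,
                  "IC15": 84.0, "SVTP": 86.7, "CUTE": 88.5},
    "ABINet-LV": {"IC13": 97.0, "SVT": 93.4, "IIIT": 96.4,
                  "IC15": 85.9, "SVTP": 89.5, "CUTE": 89.2},
}

# Test-set sizes. SVT, SVTP and CUTE are assumptions (not given in this README).
NUM_SAMPLES = {"IC13": 857, "SVT": 647, "IIIT": 3000,
               "IC15": 1811, "SVTP": 645, "CUTE": 288}

def weighted_average(acc, sizes):
    """Average accuracy weighted by the number of samples per benchmark."""
    total = sum(sizes[name] for name in acc)
    return sum(acc[name] * sizes[name] for name in acc) / total

for model, acc in ACCURACY.items():
    print(model, round(weighted_average(acc, NUM_SAMPLES), 1))
```

Under these assumed sizes, the weighted averages round to 91.4 and 92.7, matching the AVG column.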

Training

Pre-train vision model

CUDA_VISIBLE_DEVICES=0,1,2,3 python main.py --config=configs/pretrain_vision_model.yaml

Pre-train language model

CUDA_VISIBLE_DEVICES=0,1,2,3 python main.py --config=configs/pretrain_language_model.yaml

Train ABINet

CUDA_VISIBLE_DEVICES=0,1,2,3 python main.py --config=configs/train_abinet.yaml

Note:

You can set the checkpoint paths for the vision and language models separately to load specific pretrained models, or set them to None to train from scratch.

Evaluation

CUDA_VISIBLE_DEVICES=0 python main.py --config=configs/train_abinet.yaml --phase test --image_only

Additional flags:

--checkpoint /path/to/checkpoint set the path of evaluation model
--test_root /path/to/dataset set the path of evaluation dataset
--model_eval [alignment|vision] which sub-model to evaluate
--image_only disable dumping visualization of attention masks
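
To evaluate each benchmark in turn, the flags above can be combined programmatically. The sketch below only builds and prints the command lines and does not run anything. The helper `build_eval_command` is hypothetical, and pointing `--test_root` at one dataset directory per run is an assumption based on the data layout described earlier.

```python
import shlex

def build_eval_command(checkpoint, test_root, model_eval="alignment",
                       image_only=True, config="configs/train_abinet.yaml"):
    """Build the evaluation command using only the flags documented above."""
    if model_eval not in ("alignment", "vision"):
        raise ValueError("model_eval must be 'alignment' or 'vision'")
    cmd = ["python", "main.py", f"--config={config}", "--phase", "test",
           "--checkpoint", checkpoint,
           "--test_root", test_root,
           "--model_eval", model_eval]
    if image_only:
        # Disables dumping the attention-mask visualizations.
        cmd.append("--image_only")
    return cmd

# Directory names taken from the data layout in this README.
DATASETS = ["IC13_857", "SVT", "IIIT5k_3000", "IC15_1811", "SVTP", "CUTE80"]

for name in DATASETS:
    cmd = build_eval_command("/path/to/checkpoint", f"data/evaluation/{name}")
    print("CUDA_VISIBLE_DEVICES=0", shlex.join(cmd))
```

Each printed line can be pasted into a shell, or the list returned by `build_eval_command` can be passed to `subprocess.run` with `CUDA_VISIBLE_DEVICES` set in the environment.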

Visualization

Successful and failure cases on low-quality images:

Citation

If you find our method useful for your research, please cite

@inproceedings{fang2021read,
  title={Read Like Humans: Autonomous, Bidirectional and Iterative Language Modeling for Scene Text Recognition},
  author={Fang, Shancheng and Xie, Hongtao and Wang, Yuxin and Mao, Zhendong and Zhang, Yongdong},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year={2021}
}

License

This project is only free for academic research purposes, licensed under the 2-clause BSD License - see the LICENSE file for details.

Feel free to contact [email protected] if you have any questions.
