OCR-Pipeline-with-Keras
The keras-ocr package consists of two parts: a Detector and a Recognizer:
- Detector is responsible for creating bounding boxes for the words in the text.
- Recognizer is responsible for processing the batch of cropped word images from the initial image.
keras-ocr connects these two parts into a seamless pipeline. Out of the box, it can handle a wide range of images with text. But in a specific task, where the set of possible text images is greatly narrowed, the stock Recognizer performs poorly.
For this reason, we set the task of fine-tuning the Recognizer on a custom dataset.
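For reference, a minimal sketch of the out-of-the-box pipeline (the image paths here are placeholders):

import keras_ocr

# Downloads pretrained weights for both the Detector and the Recognizer
pipeline = keras_ocr.pipeline.Pipeline()

# Read images (replace with your own paths) and run detection + recognition
images = [keras_ocr.tools.read(path) for path in ['example1.jpg', 'example2.jpg']]
# prediction_groups is a list (one entry per image) of (word, box) tuples
prediction_groups = pipeline.recognize(images)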
Virtual environment and packages
$ python3 -m venv keras_ocr
$ source keras_ocr/bin/activate
$ pip install keras-ocr
And the TRDG library for synthetic text generation:
$ pip install trdg
Synthetic data generation
We use the TRDG library to generate synthetic text. All the necessary code is in data_generation.py. Things you need to know:
- You choose the template for generating text, e.g. if the template is "({}{}/{})", then all braces will be randomly filled with symbols from the alphabet. You need to specify your own instance of the StringTemplate class (see the sketch after this list).
- You choose the alphabet. In our example case it contains only digits. P.S. Some symbols are repeated in data_generation.py, hence the empirical probability of each symbol is defined as the fraction n_repeats / alphabet_size.
- You can choose your own fonts. To do this, follow these steps:
  - Download the needed fonts as .ttf files
  - Go to the trdg fonts directory ./keras_ocr/lib/python3.8/site-packages/trdg/fonts/
  - Create a directory: $ mkdir cs (cs means custom fonts); you can choose any desired name
  - Place the font files in this directory
  - (For Mac users only) Don't forget to remove .DS_Store from this folder
- You can choose an image background for the text. When creating the instance of GeneratorFromStrings in the function generate_data_units(...), provide a folder of background images with the image_dir argument.
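The snippet below is a rough sketch of how a template can be filled from the alphabet and fed to TRDG. The template-filling helper is an illustration, not the actual StringTemplate implementation in data_generation.py, and the font/background paths are placeholders:

import random
from trdg.generators import GeneratorFromStrings

# Alphabet of digits; repeating a symbol raises its empirical sampling probability
alphabet = list('0123456789') + ['0']  # '0' is sampled twice as often

def fill_template(template: str, alphabet) -> str:
    # Replace every '{}' placeholder with a random symbol from the alphabet
    return template.format(*(random.choice(alphabet) for _ in range(template.count('{}'))))

strings = [fill_template('({}{}/{})', alphabet) for _ in range(1000)]

# Custom fonts placed in trdg/fonts/cs and an optional folder of background images
generator = GeneratorFromStrings(
    strings,
    fonts=['./keras_ocr/lib/python3.8/site-packages/trdg/fonts/cs/my_font.ttf'],
    image_dir='backgrounds/',  # used when the image background type is selected
)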
High-level API in data_generation.py
data_generator = DataGenerator(string_templates=[StringTemplate('{}{}{}{}{}{}{}', 7)])
data_generator.generate(n_patches=20000, n_total_samples=550, path='DigitsBracketsDataset/train')
- n_patches -- number of different strings generated from the provided template
- n_total_samples -- total number of samples generated from the patches
- path -- directory to save the samples
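A test split can be generated the same way (the sample counts below are arbitrary); the fine-tuning code in the next section expects the images and a gt.txt labels file under the train/ and test/ subdirectories of the dataset root:

# Hypothetical call mirroring the train split above; adjust the counts to taste
data_generator.generate(n_patches=2000, n_total_samples=60, path='DigitsBracketsDataset/test')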
Fine-tuning the Recognizer
Follow the instructions in fine_tuning.ipynb. Don't forget to add the function get_custom_dataset(...) to datasets.py in the keras-ocr package directory (./keras_ocr/lib/python3.8/site-packages/keras_ocr/datasets.py):
def get_custom_dataset(path: str, split: str):
    """Get a custom recognition dataset generated by data_generation.py.

    Args:
        path: Path to the dataset root dir (containing the train/test dirs)
        split: Which split to load, 'train' or 'test'

    Returns:
        A recognition dataset as a list of (filepath, box, word) tuples
    """
    data = []
    if split == 'train':
        train_dir = os.path.join(path, 'train')
        data.extend(
            _read_born_digital_labels_file(
                labels_filepath=os.path.join(train_dir, 'gt.txt'),
                image_folder=train_dir,
            )
        )
    elif split == 'test':
        test_dir = os.path.join(path, 'test')
        data.extend(
            _read_born_digital_labels_file(
                labels_filepath=os.path.join(test_dir, 'gt.txt'),
                image_folder=test_dir,
            )
        )
    return data
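For reference, a rough sketch of the fine-tuning loop; fine_tuning.ipynb is the source of truth, and the dataset path, batch size and epoch count below are illustrative assumptions:

import keras_ocr

# Load the generated splits via the helper added above
train_labels = keras_ocr.datasets.get_custom_dataset('DigitsBracketsDataset', split='train')
test_labels = keras_ocr.datasets.get_custom_dataset('DigitsBracketsDataset', split='test')

# Pretrained recognizer; the notebook may additionally restrict the alphabet
recognizer = keras_ocr.recognition.Recognizer()
recognizer.compile()

batch_size = 8
train_img_gen, test_img_gen = [
    keras_ocr.datasets.get_recognizer_image_generator(
        labels=labels,
        height=recognizer.model.input_shape[1],
        width=recognizer.model.input_shape[2],
        alphabet=recognizer.alphabet,
    )
    for labels in (train_labels, test_labels)
]
train_gen, test_gen = [
    recognizer.get_batch_generator(image_generator=gen, batch_size=batch_size)
    for gen in (train_img_gen, test_img_gen)
]

recognizer.training_model.fit(
    train_gen,
    steps_per_epoch=len(train_labels) // batch_size,
    validation_data=test_gen,
    validation_steps=len(test_labels) // batch_size,
    epochs=10,
)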