~1000 book pages + OpenCV + python = page regions identified as paragraphs, lines, images, captions, etc.

Last update: Dec 06, 2022

Overview

cosc428-structor

I had an open-ended Computer Vision assignment to complete, and an out-of-copyright book that I wanted to turn into an ebook. Conventional OCR engines like Tesseract weren't able to accurately recognise the page structure, which led to many transcription errors. If I could tell Tesseract to ignore certain regions (like images or repeated headers), then I could greatly reduce the number of errors in the resulting ebook. Thus: for my assignment, I wrote a program that takes an image and uses computer vision magick to determine the page's structure. So far, my program can detect and locate:

lines of text,
paragraphs,
section titles,
images and their associated captions,
boilerplate like page numbers, and
chapter titles.

Ain't it grand?

Dependencies

The project is written in Python 2.7.3 and uses the cv2 library to interact with OpenCV. It also uses NumPy for some of the mathematical operations. On Windows, the best way to get these dependencies is to install the Python(x,y) suite (https://code.google.com/p/pythonxy/), which combines Python with a customisable set of scientific computing libraries.

Program Structure

The program's root is main.py, but this simply iterates through images in a folder and constructs a Page instance from each image. Thus, the real work happens in page.py.
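As a rough illustration of that loop (not the repository's actual main.py), a sketch might look like the following. The `Page` stub, the `filter_image_names` and `process_folder` helpers, and the list of extensions are all assumptions made for this example.

```python
import os

# Assumed set of extensions; the real main.py may accept others.
IMAGE_EXTENSIONS = (".png", ".jpg", ".jpeg", ".tif", ".tiff")


class Page(object):
    """Hypothetical stand-in for the Page class defined in page.py."""

    def __init__(self, path):
        self.path = path


def filter_image_names(names):
    """Keep only file names that look like images, in sorted order."""
    return sorted(n for n in names if n.lower().endswith(IMAGE_EXTENSIONS))


def process_folder(folder):
    """Construct one Page per image found directly inside `folder`."""
    names = filter_image_names(os.listdir(folder))
    return [Page(os.path.join(folder, name)) for name in names]
```

Calling `process_folder("./images")` would then return one `Page` object per image, in file-name order.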

page.py contains a few utility methods and the Page class. The constructor calls the appropriate methods in order to determine the logical structure of the page. This structure is stored in three objects: self.margin, self.content, and self.boilerplate (which contains such non-content text objects as the page number and header).

The getBuildingBlocks method is responsible for finding words, grouping words into textual lines, discarding marginal noise, and fitting a Margin instance around the remaining lines. Most of these tasks are performed by calling other functions.
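The grouping and margin-fitting steps can be sketched in plain Python. This is a simplified illustration, not the code in page.py: the `Box` tuple, the helper names, the area threshold used to discard noise, and the vertical-overlap rule for forming lines are all assumptions.

```python
from collections import namedtuple

# Hypothetical box type: top-left corner plus width and height, in pixels.
Box = namedtuple("Box", "x y w h")


def union(boxes):
    """Smallest Box enclosing every box in `boxes`."""
    left = min(b.x for b in boxes)
    top = min(b.y for b in boxes)
    right = max(b.x + b.w for b in boxes)
    bottom = max(b.y + b.h for b in boxes)
    return Box(left, top, right - left, bottom - top)


def drop_noise(words, min_area=20):
    """Discard boxes too small to be words (assumed noise criterion)."""
    return [b for b in words if b.w * b.h >= min_area]


def group_into_lines(words):
    """Group word boxes into lines: a word joins the current line when its
    vertical centre falls inside that line's vertical extent."""
    lines = []
    for word in sorted(words, key=lambda b: b.y + b.h / 2.0):
        centre = word.y + word.h / 2.0
        if lines:
            extent = union(lines[-1])
            if extent.y <= centre <= extent.y + extent.h:
                lines[-1].append(word)
                continue
        lines.append([word])
    # Order the words within each line from left to right.
    return [sorted(line, key=lambda b: b.x) for line in lines]


def fit_margin(lines):
    """Bounding box around all remaining lines."""
    return union([union(line) for line in lines])
```

A real page would need a more forgiving rule for skewed or tightly spaced lines; the centre-inside-extent test is only the simplest choice that shows the idea.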

The self.content object is found by passing the set of lines to the Content() constructor. This uses a state machine to group lines into figures, paragraphs, section titles, etc. The Content class, along with a class for each content type, is found in content.py.
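To give a feel for the state-machine approach, here is a much-reduced sketch that only distinguishes section titles from paragraphs. It is not the logic in content.py: the `Line` fields, the two states, and the `max_gap` threshold are assumptions chosen for the example.

```python
from collections import namedtuple

# Hypothetical per-line features; the real text-line class may differ.
Line = namedtuple("Line", "top height indented centred")


def group_lines(lines, max_gap=1.5):
    """Walk the lines top to bottom and return a list of (kind, lines) blocks.

    States: "BETWEEN" (no open paragraph) and "PARAGRAPH" (one is open).
    A centred line is treated as a one-line section title. An indented line,
    or a line separated from the previous one by more than `max_gap` line
    heights, starts a new paragraph.
    """
    blocks = []
    state = "BETWEEN"
    previous = None
    for line in lines:
        if line.centred:
            blocks.append(("title", [line]))
            state = "BETWEEN"
        else:
            gap = None
            if previous is not None:
                gap = line.top - (previous.top + previous.height)
            continues = (state == "PARAGRAPH"
                         and not line.indented
                         and gap is not None
                         and gap <= max_gap * line.height)
            if continues:
                blocks[-1][1].append(line)
            else:
                blocks.append(("paragraph", [line]))
            state = "PARAGRAPH"
        previous = line
    return blocks
```

Figures and captions would need further states and features (for example, whether a block contains an image region), which this sketch leaves out.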

The other files can generally be ignored when trying to understand the program; they are largely just convenience classes which represent page elements (such as points, geometric lines, words, text lines, and boxes), as well as supporting tools such as the Stopwatch.
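For readers curious what a supporting tool like the Stopwatch might involve, a minimal sketch follows. The interface shown here (a `lap` method and a `laps` list) is hypothetical and may not match the class in the repository.

```python
import time


class Stopwatch(object):
    """Hypothetical timer that records how long each processing step took."""

    def __init__(self):
        self.laps = []             # list of (label, seconds) tuples
        self._start = time.time()

    def lap(self, label):
        """Record the time since the previous lap (or construction)."""
        now = time.time()
        elapsed = now - self._start
        self.laps.append((label, elapsed))
        self._start = now
        return elapsed
```

A typical use would be to call `lap("find words")`, `lap("group lines")` and so on after each step, then inspect `laps` to see where the time goes.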

How to Run the Code

Run main.py using the Python interpreter. This will process each page in ./images, and for each page a series of 'snapshot' images will be displayed to illustrate the algorithm. To show only the final result for each image, set showSteps in main.py to False.

Comments

The getBuildingBlocks

Hello. I have recently been working on a document layout analysis task, and the description in README.md matches my needs very closely. But when I try to run the code as described in the README's "How to Run the Code" section, there are just some red lines on each double word, and no result for the detection and location of "lines of text", "paragraphs", "section titles", etc. So I would like to know what has happened to the code. Many thanks.

opened by lvbohui 3

Releases(v1.0)

v1.0(Nov 7, 2013)

This is the version that I used to write the first draft of my conference paper.

Owner

Chad Oliver
