~1000 book pages + OpenCV + python = page regions identified as paragraphs, lines, images, captions, etc.

Last update: Dec 06, 2022

Overview

cosc428-structor

I had an open-ended Computer Vision assignment to complete, and an out-of-copyright book that I wanted to turn into an ebook. Conventional OCR engines like Tesseract weren't able to accurately recognise the page structure, which led to many transcription errors. If I could tell Tesseract to ignore certain regions (like images or repeated headers), then I could greatly reduce the number of errors in the resulting ebook. Thus: for my assignment, I wrote a program that takes an image and uses computer vision magick to determine the page's structure. So far, my program can detect and locate:

lines of text,
paragraphs,
section titles,
images and their associated captions,
boilerplate like page numbers, and
chapter titles.

Ain't it grand?

Dependencies

The project is written in Python 2.7.3 and uses the cv2 library to interface with OpenCV. It also uses numpy for some of the mathematical operations. On Windows, the best way to get these dependencies is to install the Python(x,y) suite (https://code.google.com/p/pythonxy/), which combines Python with a customisable set of scientific computing libraries.
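Before running anything, it can help to confirm that both libraries import cleanly. The helper below is a small sketch, not part of the repository: `check_dependencies` is a hypothetical name, and it only reports whichever versions happen to be installed.

```python
import importlib


def check_dependencies(names=("cv2", "numpy")):
    """Map each module name to its version string, or None if it cannot be imported.

    Hypothetical helper for illustration; it is not part of the repository.
    """
    found = {}
    for name in names:
        try:
            module = importlib.import_module(name)
        except ImportError:
            found[name] = None
        else:
            # Not every module exposes __version__, so fall back to a marker.
            found[name] = getattr(module, "__version__", "unknown")
    return found


if __name__ == "__main__":
    for name, version in sorted(check_dependencies().items()):
        print("%s: %s" % (name, version if version else "NOT INSTALLED"))
```

Any entry reported as NOT INSTALLED needs to be installed before main.py will run.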

Program Structure

The program's root is main.py, but this simply iterates through images in a folder and constructs a Page instance from each image. Thus, the real work happens in page.py.
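The loop in main.py might look roughly like the sketch below. The extension list, the function names, and the `make_page` callback (which stands in for the real Page constructor, whose signature is not given here) are all assumptions for illustration.

```python
import os

# Assumed set of extensions; the original main.py may filter differently.
IMAGE_EXTENSIONS = (".png", ".jpg", ".jpeg", ".tif", ".tiff")


def list_page_images(folder):
    """Return the image paths in `folder`, sorted by file name."""
    return [
        os.path.join(folder, name)
        for name in sorted(os.listdir(folder))
        if name.lower().endswith(IMAGE_EXTENSIONS)
    ]


def process_folder(folder, make_page):
    """Build one page object per image.

    `make_page` is a hypothetical stand-in for the Page constructor.
    """
    return [make_page(path) for path in list_page_images(folder)]
```

Passing the constructor in as a callback keeps the folder traversal separate from the page analysis, which mirrors the split between main.py and page.py.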

page.py contains a few utility methods and the Page class. The constructor calls the appropriate methods in order to determine the logical structure of the page. This structure is stored in three objects: self.margin, self.content, and self.boilerplate (which contains such non-content text objects as the page number and header).

The getBuildingBlocks method is responsible for finding words, grouping words into textual lines, discarding marginal noise, and fitting a Margin instance around the remaining lines. Most of these tasks are performed by calling other functions.
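The grouping, filtering, and margin-fitting steps can be illustrated with a simplified sketch that starts from word bounding boxes that have already been found. This is not the repository's code: `Box`, the function names, and the `min_height` noise threshold are assumptions, and the real heuristics are likely more involved.

```python
from collections import namedtuple

# Hypothetical stand-in for the repository's box class: top-left corner plus size.
Box = namedtuple("Box", "x y w h")


def group_into_lines(words):
    """Group word boxes into lines: a word joins a line when its vertical
    centre falls within the vertical extent of that line's first word."""
    lines = []
    for word in sorted(words, key=lambda b: (b.y, b.x)):
        centre = word.y + word.h / 2.0
        for line in lines:
            ref = line[0]
            if ref.y <= centre <= ref.y + ref.h:
                line.append(word)
                break
        else:
            lines.append([word])
    # Order the words in each line from left to right.
    return [sorted(line, key=lambda b: b.x) for line in lines]


def bounding_box(boxes):
    """Smallest box that encloses every box in `boxes`."""
    left = min(b.x for b in boxes)
    top = min(b.y for b in boxes)
    right = max(b.x + b.w for b in boxes)
    bottom = max(b.y + b.h for b in boxes)
    return Box(left, top, right - left, bottom - top)


def build_blocks(words, min_height=5):
    """Return (lines, margin). Lines shorter than `min_height` pixels are
    dropped as noise; this threshold is a placeholder heuristic."""
    lines = [
        line for line in group_into_lines(words)
        if bounding_box(line).h >= min_height
    ]
    margin = bounding_box([b for line in lines for b in line]) if lines else None
    return lines, margin
```

Fitting the margin only after the noise has been discarded matters: a single speck near the edge of the scan would otherwise stretch the margin out to the page border.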

The self.content object is found by passing the set of lines to the Content() constructor. This uses a state machine to group lines into figures, paragraphs, section titles, etc. The Content class, along with a class for each content type, is found in content.py.
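To illustrate the state-machine idea, here is a minimal sketch that groups pre-classified lines into blocks. It is an assumption-laden simplification rather than the logic in content.py: the `Line` tuple, its `kind` and `indented` fields, and the transition rules (an indented line starts a new paragraph; unindented text after an image is a caption) are all hypothetical.

```python
from collections import namedtuple

# Hypothetical line record: `kind` is "text", "heading" or "image".
Line = namedtuple("Line", "text kind indented")


def group_lines(lines):
    """Group lines into (block_type, [texts]) pairs with a small state machine."""
    blocks = []      # finished blocks
    state = "idle"   # "idle", "paragraph" or "figure"
    current = []     # texts of the block being built
    for line in lines:
        if line.kind == "heading":
            next_state, starts_block = "idle", True
        elif line.kind == "image":
            next_state, starts_block = "figure", True
        elif state == "figure" and not line.indented:
            next_state, starts_block = "figure", False      # caption line
        elif state == "paragraph" and not line.indented:
            next_state, starts_block = "paragraph", False   # continuation
        else:
            next_state, starts_block = "paragraph", True    # new paragraph
        if starts_block and current:
            blocks.append((state, current))
            current = []
        if line.kind == "heading":
            blocks.append(("section_title", [line.text]))
        else:
            current.append(line.text)
        state = next_state
    if current:
        blocks.append((state, current))
    return blocks
```

The useful property of this design is that each line is classified using only the current state and the line's own features, so the page can be processed in a single top-to-bottom pass.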

The other files can generally be ignored when trying to understand the program; they are largely just convenience classes which represent page elements (such as points, geometric lines, words, text lines, and boxes), as well as supporting tools such as the Stopwatch.

How to Run the Code

Run main.py using the python interpreter. This will process each page in ./images, and for each page a series of 'snapshot' images will be displayed in order to illustrate the algorithm. To show only the final result for each image, set showSteps in main.py to False.

Comments

The getBuildingBlocks

Hello. I have recently been working on a document layout analysis task, and the description in README.md matches it closely. But when I run the code as described under "How to Run the Code", I only see a red line on each double word, with no result for the detection and location of "lines of text", "paragraphs", "section titles", etc. I would like to know what has happened to the code. Thank you very much.

opened by lvbohui 3

Releases (v1.0)

v1.0 (Nov 7, 2013)

This is the version that I used to write the first draft of my conference paper.

Owner

Chad Oliver
