Dirty, ugly, and hopefully useful OCR of Facebook Papers docs released by Gizmodo

Last update: Oct 28, 2021

Overview

Quick and Dirty OCR of Facebook Papers

Gizmodo has been working through the Facebook Papers and releasing the docs as they process and review them.

As luck would have it, I had some ugly but functional code lying around that does a first OCR pass on these docs. That code is in the pdf_to_image.py script. I'd welcome improvements to the code, especially to the image cleanup prior to OCR (lines 92-97, approx). I experimented with cleaning up the images via PIL and cv2, but the results were less accurate, almost certainly due to my lack of familiarity with both libraries.
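For anyone who wants to try the cleanup step, one common starting point is binarization with Otsu's method, which picks the dark/light cutoff from the image's own histogram. The sketch below is not taken from pdf_to_image.py and has not been tested against these docs. It works on a flat list of 0-255 grayscale values so it runs with no dependencies, and the function names are made up for illustration. In practice cv2.threshold with the cv2.THRESH_OTSU flag does the same job on a real image.

```python
def otsu_threshold(pixels):
    """Return the 0-255 cutoff that maximizes between-class variance."""
    hist = [0] * 256
    for value in pixels:
        hist[value] += 1

    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_cutoff, best_variance = 0, -1.0
    weight_dark = 0   # number of pixels at or below the candidate cutoff
    sum_dark = 0      # sum of their gray levels
    for cutoff in range(256):
        weight_dark += hist[cutoff]
        sum_dark += cutoff * hist[cutoff]
        if weight_dark == 0:
            continue  # nothing on the dark side yet
        weight_light = total - weight_dark
        if weight_light == 0:
            break     # everything is on the dark side, no split left to score
        mean_dark = sum_dark / weight_dark
        mean_light = (sum_all - sum_dark) / weight_light
        variance = weight_dark * weight_light * (mean_dark - mean_light) ** 2
        if variance > best_variance:
            best_variance, best_cutoff = variance, cutoff
    return best_cutoff


def binarize(pixels):
    """Map every pixel to pure black (0) or pure white (255)."""
    cutoff = otsu_threshold(pixels)
    return [255 if value > cutoff else 0 for value in pixels]
```

Photos of a screen often have uneven lighting, so a single global cutoff like this may well do worse than no cleanup at all. Compare the OCR output with and without it before keeping it.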

These Facebook Papers are especially challenging from an OCR perspective because many of them are pictures taken of a screen, so the base image quality isn't especially good. Because of this, not every document can be processed cleanly, and the documents that do get processed have some cruft in them.

With that said, the text pulled from these files simplifies the process of parsing through a large amount of data for keywords.
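As a rough illustration of that keyword pass, here is a minimal sketch that scans a folder of OCR'd text files for a term. It assumes the output is stored as UTF-8 .txt files in a single directory, which may not match this repo's layout. The function name, the directory name, and the search term are all hypothetical.

```python
from pathlib import Path


def find_keyword(text_dir, keyword):
    """Yield (file name, line number, line) for each case-insensitive match."""
    needle = keyword.lower()
    for path in sorted(Path(text_dir).glob("*.txt")):
        # errors="replace" keeps the scan going past any bytes OCR mangled
        with path.open(encoding="utf-8", errors="replace") as handle:
            for line_no, line in enumerate(handle, start=1):
                if needle in line.lower():
                    yield path.name, line_no, line.strip()


if __name__ == "__main__":
    # "text" is a placeholder directory name, not necessarily the repo's own
    for name, line_no, line in find_keyword("text", "integrity"):
        print(f"{name}:{line_no}: {line}")
```

Because OCR cruft can split or misspell words, an exact substring match will miss some hits. Searching for a few short fragments of a term is a cheap way to catch more of them.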

Other (Better) Options

This OCR should be seen as a first step. Text files are generally a decent starting point because they allow for a wide range of follow-on analysis.

Other, better options exist. For a comprehensive, contained analysis, those options will almost certainly be a better choice.

Want to help?

If you want to collaborate on this project, let me know!

Owner

Bill Fitzgerald

