Python library to extract tabular data from images and scanned PDFs

Overview

ExtractTable - API to extract tabular data from images and scanned PDFs

The motivation is to make it easy for developers to extract tabular data from images or scanned PDF files without worrying about the table area, column coordinates, rotation, and so on.

Prerequisite

API Key: All requests to ExtractTable are authorized by an API Key. FREE credits here. The same API Key can also be used for conversions on the browser at Web Pro.

Installation

pip install -U ExtractTable

Basic Usage

OK, enough selling. Let the ease of coding do the talking, and let the output persuade you to buy credits; start a timer and count the lines of code.

from ExtractTable import ExtractTable
et_sess = ExtractTable(api_key=YOUR_API_KEY)        # Replace your VALID API Key here
print(et_sess.check_usage())        # Checks the API Key validity as well as shows associated plan usage 
table_data = et_sess.process_file(filepath=Location_of_Image_with_Tables, output_format="df")

# To process a PDF, use the pages parameter ("1", "1,3-4", "all") of process_file
table_data = et_sess.process_file(filepath=Location_of_PDF_with_Tables, output_format="df", pages="all")
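
The pages values shown above ("1", "1,3-4", "all") use a common page-range syntax. As a rough illustration of how such a string can be read, here is a minimal sketch. The helper `expand_pages` and its `total_pages` argument are hypothetical and not part of ExtractTable; the service may interpret the value differently.

```python
def expand_pages(spec, total_pages):
    """Expand a page spec such as "1", "1,3-4" or "all" into page numbers.

    Hypothetical helper for illustration only; not part of ExtractTable.
    """
    spec = spec.strip().lower()
    if spec == "all":
        return list(range(1, total_pages + 1))

    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            # A range such as "3-4" covers both endpoints.
            start, end = (int(p) for p in part.split("-", 1))
        else:
            start = end = int(part)
        if start < 1 or end > total_pages or start > end:
            raise ValueError(f"invalid page range: {part!r}")
        pages.update(range(start, end + 1))
    return sorted(pages)


print(expand_pages("1,3-4", total_pages=5))
```

Checking the spec locally like this can catch a typo before a request is sent and credits are spent.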

Detailed Library Usage

The tutorial available at Open In Colab takes you through:

1. Installation
2. Import and check version
3. Create Session & Validate API Key
    3.1 Create Session with your API Key
    3.2 Validate the Key and check the plan usage
    3.3 Check Usage Details
4. Trigger the extraction process
    4.1 Accepted Input Types
    4.2 Process an IMAGE Input
    4.3 Process a PDF Input
    4.4 Output options
    4.5 Explore session objects
5. Explore the Output
    5.1 Output Structure
    5.2 Output Details
6. Make Corrections
    6.1 Split Merged Rows
    6.2 Split Merged Columns
    6.3 Fix Decimal Format
    6.4 Fix Date Format
7. Helpful Code Snippets
    7.1 Get text data
    7.2 Table output to Excel

Whoa, as simple as that?!

Certainly. Current ExtractTable users apply it to:

  • Bank Statement
  • Medical Records
  • Invoice Details
  • Tax forms
  • Tender Notices

It's up to you now to explore other uses.

Explore

Check the complete server response of the latest job with et_sess.ServerResponse.json():

{
    "JobStatus": <string>,                              # Status of the triggered Process  @ JOB-LEVEL
    "Pages": <integer>,                                 # Number of pages processed in this request @ PAGE-LEVEL
    "Tables": [<list of key-value objects of table>     # List of all tables found @ TABLE-LEVEL
        {
            "Page": <integer>,                              ## Page number in which this table is found
            "CharacterConfidence": <float>,                 ## Accuracy of Characters recognized from the input-page
            "LayoutConfidence": <float>,                    ## Accuracy of table layout's design decision
            "TableJson": <dict>,                            ## Table Cell Text in key-value format with index orientation - {row#: {col#: <str>}}
            "TableCoordinates": <dict>,                     ## Top-left & Bottom-right Cell Coordinates - {row#: {col#: <list(x1,y1,x2,y2)>}}
            "TableConfidence": <dict>                       ## Cell level accuracy of detected characters - {row#: {col#: <float>}}
        },
    {...}                                               ## ... more "Tables" objects
    ],
    "Lines": [<list of key-value objects>               # Pagewise Line details @ PAGE-LEVEL
        {
            "Page": <integer>,                          # Page number in which the lines are found
            "CharacterConfidence": <float>,             # Average Accuracy of all Characters recognized from the input-page
            "LinesArray": [
                <list of key-value objects of line>     # Ordered list of lines in this page @ LINE-LEVEL
                {
                    "Line": <str>,                          ## Detected text of the complete line
                    "WordsArray": [
                        <list of key-value objects>         ## Word level details in this line @ WORD-LEVEL
                        {
                            "Conf": <float>,                    ### Accuracy of recognized characters of the word
                            "Word": <str>,                      ### Detected text of the word
                            "Loc": [x1, y1, x2, y2]             ### Top-left & Bottom-right coordinates, w.r.t the input-page width-height dimensions
                        },
                    {...}                                   ### More "WordsArray" objects
                    ]
                },
            {...}                                       ## More "LinesArray" objects
            ]
        },
    {...}                                               # More Pagewise "Lines" details
    ]
}
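
The structure above can be walked with plain dictionaries. The following is a minimal sketch of turning each "TableJson" entry ({row#: {col#: <str>}}) into a list of rows, grouped by page. It assumes the row and column keys are integer-like strings, as JSON object keys usually are. The helper names and the sample values are made up for illustration and are not real API output.

```python
def table_json_to_rows(table_json):
    """Convert {row#: {col#: text}} into a list of rows (lists of strings)."""
    rows = []
    # Sort keys numerically so that "10" comes after "2".
    for row_key in sorted(table_json, key=int):
        cells = table_json[row_key]
        rows.append([cells[col_key] for col_key in sorted(cells, key=int)])
    return rows


def tables_by_page(response):
    """Group the tables of a response dict by their page number."""
    grouped = {}
    for table in response.get("Tables", []):
        grouped.setdefault(table["Page"], []).append(
            table_json_to_rows(table["TableJson"])
        )
    return grouped


# Made-up sample shaped like the structure above; not real API output.
sample_response = {
    "JobStatus": "Success",
    "Pages": 1,
    "Tables": [
        {
            "Page": 1,
            "CharacterConfidence": 0.98,
            "LayoutConfidence": 0.95,
            "TableJson": {
                "0": {"0": "Date", "1": "Amount"},
                "1": {"0": "2020-01-01", "1": "31.20"},
            },
        }
    ],
}

for page, tables in tables_by_page(sample_response).items():
    for rows in tables:
        print(page, rows)
```

With a live session, the same functions could be applied to the dictionary returned by et_sess.ServerResponse.json().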

Bug Reports

Bug reports and fixes are most welcome and are rewarded with API credits. For support, reach us at [email protected]

License

This project is licensed under the Apache License 2.0; see the LICENSE file for details.

Social Media

Follow us on social media for library updates and free credits.

Comments
  • bug: program hangs after processing some samples

    Describe the bug: The program keeps hanging after processing some samples. My API key prefix is o6No6aqYRhrQ2MWxtDDyTeHiiUg****

    bug 
    opened by franztao 5
  • bug: et_sess.save_output(output_folder, output_format="csv") writes output files whose names are missing some characters of the original file name

    Describe the bug: All my picture names have the suffix .png, for example "[email protected]_14-1-4.png".

    bug 
    opened by franztao 3
  • Found some bugs; listing them here

    Describe the bug:
    1. Text that spans columns is not recognized; when converted to a table, it is illogically split into two sides.
    2. The plus-minus sign is not recognized, e.g. 31.2 + 4.98.
    3. Subscripts and superscripts are not recognized.
    4. OCR loses some recognized characters.
    5. In long tables, the bottom part is not recognized.
    6. When a cell contains a chemical formula, the table is not recognized.

    Expected behavior: I can help solve these problems.

    bug 
    opened by franztao 2
  • question: what does LayoutConfidence mean?

    "CharacterConfidence": <float>,     # Average Accuracy of all Characters recognized from the input-page
    "LayoutConfidence": <float>,        ## Accuracy of table layout's design decision

    Please give a detailed description of CharacterConfidence and LayoutConfidence, or the code used to calculate them.

    good first issue 
    opened by franztao 2
  • Invalid cross-device link

    Describe the bug: On some operating systems, the output file cannot be saved to a temporary directory (say /tmp) and then moved to a new location. It throws the following error:

    os.replace(each_tbl_path, os.path.join(output_folder, input_fname+os.path.basename(each_tbl_path)))
    OSError: [Errno 18] Invalid cross-device link: '/tmp/tmp7hqcm0fh/_table_1.csv' -> '/var/www/python/app/tmp/details_table_1.csv'

    After checking the source code, it appears ExtractTable uses os.replace to move the file. This method does not support moving a file from one partition to another: https://stackoverflow.com/questions/42392600/oserror-errno-18-invalid-cross-device-link

    To Reproduce: I use Python 3.6 in a venv. You need two different filesystems, and must invoke save_output from the ExtractTable-py library so that the file is moved from one filesystem to the other. I have not tried it, but I think you can reproduce this bug by invoking os.replace directly, without calling ExtractTable-py.

    Expected behavior: The file is moved from one filesystem to the other. I think shutil.move would be preferable to os.replace for moving the file.

    bug 
    opened by Elegye 2
  • MakeCorrections API - How do you chain corrections

    Hi there, I'm trying to use multiple correction commands but it isn't working as the object becomes a list after the first correction. Is there something I'm missing here? Thanks!

    good first issue 
    opened by kylebutts 1
  • Can character OCR support LaTeX format?

    opened by franztao 1
  • Do you have tools to convert the ExtractTable output format to the COCO format (or another open-source detection format)?

    opened by franztao 1
  • Custom output path when the output_format is csv

    Is your feature request related to a problem? Please describe. When the output_format is set to csv the csv file is written to some random path in /tmp location.

    Describe the solution you'd like [optional, but helpful] Define a parameter in the process_file like output_file which takes the absolute path where the file needs to be written along with the file name

    opened by padmano 1
  • Is it possible to get the data in Excel while maintaining the table structure?

    opened by jcthink 1
  • Character and Layout Confidence

    Hi, I need a definition of Character and Layout Confidence, in particular how the values read by the code below are calculated mathematically. Thanks.

    for idx, each_table in enumerate(et_sess.ServerResponse.json()['Tables']):
        print("CharacterConfidence = ", each_table['CharacterConfidence'])
        print("LayoutConfidence = ", each_table['LayoutConfidence'])
    
    good first issue 
    opened by muhdzubair 1
  • Consider user hints on the table structure information

    Is your feature request related to a problem? Please describe. "While you do whatever you want, why not consider our hints?" is the feedback from developers in many instances.

    Describe alternatives you've considered: Developers are handling this with their own custom post-processing.

    Describe the solution you'd like [optional, but helpful] Pros: It may be worth a look, as most post-processing involves similar approaches that resolve the majority of issues. Cons: computing cost.

    feature/idea 
    opened by akshowhini 0
  • Capture Vertically center aligned columns

    Refer: https://stackoverflow.com/questions/58238981/extracting-table-from-a-pdf-table-without-vertical-lines

    Do not miss: Joelgeraci's comment to the question

    feature/idea 
    opened by akshowhini 0
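
For the "Invalid cross-device link" report above: os.replace relies on a rename, which fails with errno 18 when source and destination are on different filesystems, whereas shutil.move falls back to copying and then deleting the source. Below is a minimal sketch of the alternative suggested in that report. The helper `move_table_file` and its `prefix` argument are hypothetical; this is not ExtractTable's own code.

```python
import os
import shutil
import tempfile


def move_table_file(src_path, output_folder, prefix=""):
    """Move src_path into output_folder, optionally prefixing the file name.

    Hypothetical helper; shutil.move copies and deletes when a plain
    rename is not possible across filesystems.
    """
    os.makedirs(output_folder, exist_ok=True)
    dst_path = os.path.join(output_folder, prefix + os.path.basename(src_path))
    return shutil.move(src_path, dst_path)


# Demo with two temporary directories. They may share a filesystem,
# in which case shutil.move simply renames the file.
with tempfile.TemporaryDirectory() as src_dir, tempfile.TemporaryDirectory() as out_dir:
    src = os.path.join(src_dir, "_table_1.csv")
    with open(src, "w") as handle:
        handle.write("a,b\n1,2\n")
    moved = move_table_file(src, out_dir, prefix="details")
    print(os.path.basename(moved))
```

The demo cannot force a cross-device move, so it only shows the call pattern; the fallback behaviour is the documented behaviour of shutil.move.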
Releases(v2.4.0)