Camelot is a Python library that can help you extract tables from PDFs!

Overview

Camelot: PDF Table Extraction for Humans

Note: You can also check out Excalibur, the web interface to Camelot!


Here's how you can extract tables from PDFs. You can check out the PDF used in this example here.

>>> import camelot
>>> tables = camelot.read_pdf('foo.pdf')
>>> tables
<TableList n=1>
>>> tables.export('foo.csv', f='csv', compress=True) # json, excel, html, markdown, sqlite
>>> tables[0]
<Table shape=(7, 7)>
>>> tables[0].parsing_report
{
    'accuracy': 99.02,
    'whitespace': 12.24,
    'order': 1,
    'page': 1
}
>>> tables[0].to_csv('foo.csv') # to_json, to_excel, to_html, to_markdown, to_sqlite
>>> tables[0].df # get a pandas DataFrame!
Cycle Name  KI (1/km)  Distance (mi)  Percent Fuel Savings
                                      Improved Speed  Decreased Accel  Eliminate Stops  Decreased Idle
2012_2      3.30       1.3            5.9%            9.5%             29.2%            17.4%
2145_1      0.68       11.2           2.4%            0.1%             9.5%             2.7%
4234_1      0.59       58.7           8.5%            1.3%             8.5%             3.3%
2032_2      0.17       57.8           21.7%           0.3%             2.7%             1.2%
4171_1      0.07       173.9          58.1%           1.6%             2.1%             0.5%

Camelot also comes packaged with a command-line interface!

Note: Camelot only works with text-based PDFs and not scanned documents. (As Tabula explains, "If you can click and drag to select text in your table in a PDF viewer, then your PDF is text-based".)

You can check out some frequently asked questions here.

Why Camelot?

  • Configurability: Camelot gives you control over the table extraction process with tweakable settings.
  • Metrics: You can discard bad tables based on metrics like accuracy and whitespace, without having to manually look at each table.
  • Output: Each table is extracted into a pandas DataFrame, which integrates seamlessly into ETL and data analysis workflows. You can also export tables to multiple formats, including CSV, JSON, Excel, HTML, Markdown, and SQLite.
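
As a rough sketch of the Metrics point above, the helper below keeps only the tables whose parsing report passes an accuracy and a whitespace threshold. The function name keep_good_tables, the default thresholds, and the stand-in objects are illustrative assumptions, not part of Camelot's API; only the parsing_report keys come from the example earlier in this README.

```python
from types import SimpleNamespace


def keep_good_tables(tables, min_accuracy=90.0, max_whitespace=50.0):
    """Return the tables whose parsing report passes both thresholds."""
    kept = []
    for table in tables:
        report = table.parsing_report  # dict with 'accuracy', 'whitespace', ...
        if report["accuracy"] >= min_accuracy and report["whitespace"] <= max_whitespace:
            kept.append(table)
    return kept


# Stand-in objects that mimic only the `parsing_report` attribute (hypothetical data).
# With Camelot installed, you would pass the result of camelot.read_pdf(...) instead.
sample = [
    SimpleNamespace(parsing_report={"accuracy": 99.02, "whitespace": 12.24, "order": 1, "page": 1}),
    SimpleNamespace(parsing_report={"accuracy": 42.50, "whitespace": 80.00, "order": 2, "page": 1}),
]

good = keep_good_tables(sample)
print([t.parsing_report["order"] for t in good])
```

Suitable thresholds depend on the documents being processed, so the defaults here are only a starting point.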

See the comparison with similar libraries and tools.

Support the development

If Camelot has helped you, please consider supporting its development with a one-time or monthly donation on OpenCollective.

Installation

Using conda

The easiest way to install Camelot is with conda, which is a package manager and environment management system for the Anaconda distribution.

$ conda install -c conda-forge camelot-py

Using pip

After installing the dependencies (Tk and Ghostscript), you can also use pip to install Camelot:

$ pip install "camelot-py[base]"
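
Before running pip, it can help to confirm that both dependencies are visible to Python. The following is a minimal sketch, assuming the Ghostscript shared library is registered under the name "gs" (typical on Linux and macOS; Windows installs generally use a different library name, so a "missing" result there is not conclusive). check_dependencies is a hypothetical helper, not a Camelot function.

```python
from ctypes.util import find_library


def check_dependencies():
    """Report whether Tk and a Ghostscript shared library can be located."""
    try:
        import tkinter  # noqa: F401  (only checking that the import succeeds)
        has_tk = True
    except ImportError:
        has_tk = False

    # Assumption: the library is discoverable as "gs" on this platform.
    has_ghostscript = find_library("gs") is not None

    return {"tkinter": has_tk, "ghostscript": has_ghostscript}


status = check_dependencies()
for name, found in status.items():
    print(f"{name}: {'found' if found else 'missing'}")
```

If either entry is reported as missing, install it with your system's package manager before installing Camelot.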

From the source code

After installing the dependencies, clone the repo using:

$ git clone https://www.github.com/camelot-dev/camelot

and install Camelot using pip:

$ cd camelot
$ pip install ".[base]"

Documentation

The documentation is available at http://camelot-py.readthedocs.io/.

Wrappers

Contributing

The Contributor's Guide has detailed information about contributing issues, documentation, code, and tests.

Versioning

Camelot uses Semantic Versioning. For the available versions, see the tags on this repository. For the changelog, you can check out HISTORY.md.

License

This project is licensed under the MIT License; see the LICENSE file for details.
