Camelot: PDF Table Extraction for Humans
Camelot is a Python library that can help you extract tables from PDFs!
Note: You can also check out Excalibur, the web interface to Camelot!
Here's how you can extract tables from PDFs. You can check out the PDF used in this example here.
>>> import camelot
>>> tables = camelot.read_pdf('foo.pdf')
>>> tables
>>> tables.export('foo.csv', f='csv', compress=True) # json, excel, html, markdown, sqlite
>>> tables[0]
>>> tables[0].parsing_report
{
    'accuracy': 99.02,
    'whitespace': 12.24,
    'order': 1,
    'page': 1
}
>>> tables[0].to_csv('foo.csv') # to_json, to_excel, to_html, to_markdown, to_sqlite
>>> tables[0].df # get a pandas DataFrame!

| Cycle Name | KI (1/km) | Distance (mi) | Percent Fuel Savings | | | |
|---|---|---|---|---|---|---|
| | | | Improved Speed | Decreased Accel | Eliminate Stops | Decreased Idle |
| 2012_2 | 3.30 | 1.3 | 5.9% | 9.5% | 29.2% | 17.4% |
| 2145_1 | 0.68 | 11.2 | 2.4% | 0.1% | 9.5% | 2.7% |
| 4234_1 | 0.59 | 58.7 | 8.5% | 1.3% | 8.5% | 3.3% |
| 2032_2 | 0.17 | 57.8 | 21.7% | 0.3% | 2.7% | 1.2% |
| 4171_1 | 0.07 | 173.9 | 58.1% | 1.6% | 2.1% | 0.5% |

Camelot also comes packaged with a command-line interface!
Note: Camelot only works with text-based PDFs and not scanned documents. (As Tabula explains, "If you can click and drag to select text in your table in a PDF viewer, then your PDF is text-based".)
You can check out some frequently asked questions here.
Why Camelot?
- Configurability: Camelot gives you control over the table extraction process with tweakable settings.
- Metrics: You can discard bad tables based on metrics like accuracy and whitespace, without having to manually look at each table.
- Output: Each table is extracted into a pandas DataFrame, which integrates seamlessly into ETL and data analysis workflows. You can also export tables to multiple formats, including CSV, JSON, Excel, HTML, Markdown, and SQLite.
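As a rough illustration of the metrics point, the sketch below filters tables by the `accuracy` and `whitespace` values in their `parsing_report`. The helper name `keep_good_tables` and the threshold values are assumptions made for this example and are not part of Camelot; the stand-in objects only mimic the report shown in the example above, so the snippet runs without a PDF.

```python
from types import SimpleNamespace


def keep_good_tables(tables, min_accuracy=90.0, max_whitespace=50.0):
    """Return the tables whose parsing report passes both thresholds.

    Each item only needs a `parsing_report` dict with 'accuracy' and
    'whitespace' keys. The default thresholds are illustrative
    assumptions, not values recommended by Camelot.
    """
    kept = []
    for table in tables:
        report = table.parsing_report
        if report["accuracy"] >= min_accuracy and report["whitespace"] <= max_whitespace:
            kept.append(table)
    return kept


# Stand-in objects (hypothetical data) so the sketch runs without a PDF.
# With Camelot installed, the tables returned by camelot.read_pdf(...)
# would be passed in instead.
fake_tables = [
    SimpleNamespace(parsing_report={"accuracy": 99.02, "whitespace": 12.24, "order": 1, "page": 1}),
    SimpleNamespace(parsing_report={"accuracy": 41.50, "whitespace": 73.10, "order": 2, "page": 1}),
]

good = keep_good_tables(fake_tables)
print([t.parsing_report["order"] for t in good])
```

Reasonable thresholds depend on the documents involved, so it is worth inspecting a few reports by hand before settling on cut-offs.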
See comparison with similar libraries and tools.
Support the development
If Camelot has helped you, please consider supporting its development with a one-time or monthly donation on OpenCollective.
Installation
Using conda
The easiest way to install Camelot is with conda, which is a package manager and environment management system for the Anaconda distribution.
$ conda install -c conda-forge camelot-py
Using pip
After installing the dependencies (tk and ghostscript), you can also just use pip to install Camelot:
$ pip install "camelot-py[base]"
From the source code
After installing the dependencies, clone the repo using:
$ git clone https://www.github.com/camelot-dev/camelot
and install Camelot using pip:
$ cd camelot
$ pip install ".[base]"
Documentation
The documentation is available at http://camelot-py.readthedocs.io/.
Wrappers
- camelot-php provides a PHP wrapper on Camelot.
Contributing
The Contributor's Guide has detailed information about contributing issues, documentation, code, and tests.
Versioning
Camelot uses Semantic Versioning. For the available versions, see the tags on this repository. For the changelog, you can check out HISTORY.md.
License
This project is licensed under the MIT License; see the LICENSE file for details.
Owner
Camelot and Excalibur: PDF Table Extraction for Humans