A Python library for extracting text from PDFs without losing the formatting of the PDF content.

Overview

Multilingual PDF to Text

Install Package from PyPI

  1. Install it using pip.
pip install multilingual-pdf2text

The library uses Tesseract, which can be installed by following these instructions:

Tesseract Installation
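Before running the library, it can help to confirm that Tesseract is actually reachable from Python. The following is a minimal sketch, not part of this library: the helper name `list_tesseract_languages` is hypothetical, and it assumes the `tesseract` binary is on your PATH and supports the `--list-langs` option.

```python
import shutil
import subprocess


def list_tesseract_languages(binary="tesseract"):
    """Return the language codes the local Tesseract reports.

    Returns None if the binary cannot be found on PATH.
    """
    path = shutil.which(binary)
    if path is None:
        return None
    result = subprocess.run(
        [path, "--list-langs"], capture_output=True, text=True, check=False
    )
    # Some older Tesseract versions print the list to stderr instead of stdout.
    lines = (result.stdout or result.stderr).splitlines()
    # Skip blank lines and the header line, which ends with a colon.
    return [ln.strip() for ln in lines if ln.strip() and not ln.strip().endswith(":")]


if __name__ == "__main__":
    langs = list_tesseract_languages()
    if langs is None:
        print("Tesseract was not found on PATH; install it first.")
    else:
        print("Installed language data:", ", ".join(langs))
```

If the language you plan to use is missing from the printed list, its language data most likely still needs to be installed.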

Example Usage

  1. Use it in your code
from multilingual_pdf2text.pdf2text import PDF2Text
from multilingual_pdf2text.models.document_model.document import Document
import logging
logging.basicConfig(level=logging.INFO)

def main():
    # Create the document to extract, with its configuration.
    # `language` is a Tesseract language code (see the list below); 'spa' is Spanish.
    pdf_document = Document(
        document_path='/Users/shahrukh/Desktop/multilingual-pdf2text/example/example.pdf',
        language='spa',
    )
    # Run the extraction and print the result.
    pdf2text = PDF2Text(document=pdf_document)
    content = pdf2text.extract()
    print(content)

if __name__ == "__main__":
    main()
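The same calls can be applied to a whole folder of PDFs. This is a sketch under stated assumptions: the helper names `find_pdfs` and `extract_folder` are hypothetical and not part of the library, every file in the folder is assumed to be in the same language, and the result of `extract()` is stored as returned without assuming its type.

```python
from pathlib import Path


def find_pdfs(folder):
    """Return the PDF files directly inside `folder`, sorted by path."""
    return sorted(
        p for p in Path(folder).iterdir()
        if p.is_file() and p.suffix.lower() == ".pdf"
    )


def extract_folder(folder, language="eng"):
    """Run the extraction shown above on every PDF in `folder`.

    Returns a dict mapping each file name to whatever `extract()` returned.
    """
    # Imported here so that find_pdfs works without the library installed.
    from multilingual_pdf2text.pdf2text import PDF2Text
    from multilingual_pdf2text.models.document_model.document import Document

    results = {}
    for pdf_path in find_pdfs(folder):
        document = Document(document_path=str(pdf_path), language=language)
        results[pdf_path.name] = PDF2Text(document=document).extract()
    return results
```

Calling `extract_folder("example", language="spa")` would process every PDF in a folder named `example`; the folder name here is only an illustration.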

Tesseract supports the following languages:
Code Language

  • afr Afrikaans
  • amh Amharic
  • ara Arabic
  • asm Assamese
  • aze Azerbaijani
  • aze_cyrl Azerbaijani - Cyrillic
  • bel Belarusian
  • ben Bengali
  • bod Tibetan
  • bos Bosnian
  • bul Bulgarian
  • cat Catalan; Valencian
  • ceb Cebuano
  • ces Czech
  • chi_sim Chinese - Simplified
  • chi_tra Chinese - Traditional
  • chr Cherokee
  • cym Welsh
  • dan Danish
  • deu German
  • dzo Dzongkha
  • ell Greek, Modern (1453-)
  • eng English
  • enm English, Middle (1100-1500)
  • epo Esperanto
  • est Estonian
  • eus Basque
  • fas Persian
  • fin Finnish
  • fra French
  • frk German Fraktur
  • frm French, Middle (ca. 1400-1600)
  • gle Irish
  • glg Galician
  • grc Greek, Ancient (-1453)
  • guj Gujarati
  • hat Haitian; Haitian Creole
  • heb Hebrew
  • hin Hindi
  • hrv Croatian
  • hun Hungarian
  • iku Inuktitut
  • ind Indonesian
  • isl Icelandic
  • ita Italian
  • ita_old Italian - Old
  • jav Javanese
  • jpn Japanese
  • kan Kannada
  • kat Georgian
  • kat_old Georgian - Old
  • kaz Kazakh
  • khm Central Khmer
  • kir Kirghiz; Kyrgyz
  • kor Korean
  • kur Kurdish
  • lao Lao
  • lat Latin
  • lav Latvian
  • lit Lithuanian
  • mal Malayalam
  • mar Marathi
  • mkd Macedonian
  • mlt Maltese
  • msa Malay
  • mya Burmese
  • nep Nepali
  • nld Dutch; Flemish
  • nor Norwegian
  • ori Oriya
  • pan Panjabi; Punjabi
  • pol Polish
  • por Portuguese
  • pus Pushto; Pashto
  • ron Romanian; Moldavian; Moldovan
  • rus Russian
  • san Sanskrit
  • sin Sinhala; Sinhalese
  • slk Slovak
  • slv Slovenian
  • spa Spanish; Castilian
  • spa_old Spanish; Castilian - Old
  • sqi Albanian
  • srp Serbian
  • srp_latn Serbian - Latin
  • swa Swahili
  • swe Swedish
  • syr Syriac
  • tam Tamil
  • tel Telugu
  • tgk Tajik
  • tgl Tagalog
  • tha Thai
  • tir Tigrinya
  • tur Turkish
  • uig Uighur; Uyghur
  • ukr Ukrainian
  • urd Urdu
  • uzb Uzbek
  • uzb_cyrl Uzbek - Cyrillic
  • vie Vietnamese
  • yid Yiddish
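Because the `language` argument takes one of the codes above rather than a language name, a typo such as 'es' instead of 'spa' is easy to make. The sketch below shows one way to catch that before starting an extraction. The helper `require_known_language` is hypothetical and not part of the library, and `SOME_CODES` holds only a handful of codes copied from the list, so extend it to cover the languages you use.

```python
# A small sample of codes copied from the list above; not the full set.
SOME_CODES = {
    "afr", "ara", "chi_sim", "chi_tra", "deu",
    "eng", "fra", "spa", "spa_old",
}


def require_known_language(code, known_codes=SOME_CODES):
    """Return `code` unchanged if it is in `known_codes`, else raise ValueError."""
    if code not in known_codes:
        raise ValueError(f"Unknown Tesseract language code: {code!r}")
    return code


if __name__ == "__main__":
    print(require_known_language("spa"))
```

The returned value can be passed straight to the `language` argument used in the example usage.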
Owner
Shahrukh Khan
CS Grad Student @ Saarland University