A Python library for extracting text from PDFs while preserving the formatting of the PDF content.

Overview

Multilingual PDF to Text

Install Package from PyPI

  1. Install it using pip.
pip install multilingual-pdf2text

The library uses Tesseract, which can be installed by following these instructions:

Tesseract Installation
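
Before running an extraction, it can help to confirm that Tesseract is actually reachable. The following is a minimal sketch and not part of the library: the helper names `tesseract_available`, `parse_installed_languages`, and `installed_languages` are hypothetical, and the code assumes the executable is named `tesseract` and that `tesseract --list-langs` prints one header line followed by one language code per line.

```python
import shutil
import subprocess
from typing import List


def tesseract_available(binary: str = "tesseract") -> bool:
    """Return True if the given executable can be found on PATH."""
    return shutil.which(binary) is not None


def parse_installed_languages(listing: str) -> List[str]:
    """Parse the text printed by `tesseract --list-langs`.

    Assumption: the first non-empty line is a header and every
    following non-empty line is a single language code.
    """
    lines = [line.strip() for line in listing.splitlines() if line.strip()]
    return sorted(lines[1:])


def installed_languages(binary: str = "tesseract") -> List[str]:
    """Ask Tesseract which language packs are installed (empty if missing)."""
    if not tesseract_available(binary):
        return []
    result = subprocess.run(
        [binary, "--list-langs"], capture_output=True, text=True, check=False
    )
    # Fall back to stderr in case the listing is printed there.
    return parse_installed_languages(result.stdout or result.stderr)
```

If `installed_languages()` does not contain the code you plan to pass as `language`, the matching language pack probably still needs to be installed.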

Example Usage

  1. Use it in your code
from multilingual_pdf2text.pdf2text import PDF2Text
from multilingual_pdf2text.models.document_model.document import Document
import logging
logging.basicConfig(level=logging.INFO)

def main():
    # Create the document to extract from, along with its configuration
    pdf_document = Document(
        document_path='/Users/shahrukh/Desktop/multilingual-pdf2text/example/example.pdf',
        language='spa'
        )
    pdf2text = PDF2Text(document=pdf_document)
    content = pdf2text.extract()
    print(content)

if __name__ == "__main__":
    main()
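
The example above hard-codes an absolute path from the author's machine. A minimal sketch of a reusable wrapper follows. `extract_text` is a hypothetical helper, not part of the library, and its default language of `'eng'` is an assumption of this sketch. It uses only the `Document` and `PDF2Text` calls shown in the example and passes the result of `extract()` through unchanged, since the structure of that result is not described here.

```python
from pathlib import Path


def extract_text(pdf_path, language="eng"):
    """Validate the inputs, then run the extraction from the example above."""
    if not language:
        raise ValueError("language must be a non-empty Tesseract code, e.g. 'spa'")
    path = Path(pdf_path).expanduser()
    if not path.is_file():
        raise FileNotFoundError(f"PDF not found: {path}")

    # Imported lazily so that input validation still works when the
    # package (or Tesseract) has not been installed yet.
    from multilingual_pdf2text.pdf2text import PDF2Text
    from multilingual_pdf2text.models.document_model.document import Document

    document = Document(document_path=str(path), language=language)
    return PDF2Text(document=document).extract()
```

Failing early on a missing file or empty language code avoids starting OCR only to hit an error partway through.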

Tesseract supports the following languages:
Code Language

  • afr Afrikaans
  • amh Amharic
  • ara Arabic
  • asm Assamese
  • aze Azerbaijani
  • aze_cyrl Azerbaijani - Cyrillic
  • bel Belarusian
  • ben Bengali
  • bod Tibetan
  • bos Bosnian
  • bul Bulgarian
  • cat Catalan; Valencian
  • ceb Cebuano
  • ces Czech
  • chi_sim Chinese - Simplified
  • chi_tra Chinese - Traditional
  • chr Cherokee
  • cym Welsh
  • dan Danish
  • deu German
  • dzo Dzongkha
  • ell Greek, Modern (1453-)
  • eng English
  • enm English, Middle (1100-1500)
  • epo Esperanto
  • est Estonian
  • eus Basque
  • fas Persian
  • fin Finnish
  • fra French
  • frk German Fraktur
  • frm French, Middle (ca. 1400-1600)
  • gle Irish
  • glg Galician
  • grc Greek, Ancient (-1453)
  • guj Gujarati
  • hat Haitian; Haitian Creole
  • heb Hebrew
  • hin Hindi
  • hrv Croatian
  • hun Hungarian
  • iku Inuktitut
  • ind Indonesian
  • isl Icelandic
  • ita Italian
  • ita_old Italian - Old
  • jav Javanese
  • jpn Japanese
  • kan Kannada
  • kat Georgian
  • kat_old Georgian - Old
  • kaz Kazakh
  • khm Central Khmer
  • kir Kirghiz; Kyrgyz
  • kor Korean
  • kur Kurdish
  • lao Lao
  • lat Latin
  • lav Latvian
  • lit Lithuanian
  • mal Malayalam
  • mar Marathi
  • mkd Macedonian
  • mlt Maltese
  • msa Malay
  • mya Burmese
  • nep Nepali
  • nld Dutch; Flemish
  • nor Norwegian
  • ori Oriya
  • pan Panjabi; Punjabi
  • pol Polish
  • por Portuguese
  • pus Pushto; Pashto
  • ron Romanian; Moldavian; Moldovan
  • rus Russian
  • san Sanskrit
  • sin Sinhala; Sinhalese
  • slk Slovak
  • slv Slovenian
  • spa Spanish; Castilian
  • spa_old Spanish; Castilian - Old
  • sqi Albanian
  • srp Serbian
  • srp_latn Serbian - Latin
  • swa Swahili
  • swe Swedish
  • syr Syriac
  • tam Tamil
  • tel Telugu
  • tgk Tajik
  • tgl Tagalog
  • tha Thai
  • tir Tigrinya
  • tur Turkish
  • uig Uighur; Uyghur
  • ukr Ukrainian
  • urd Urdu
  • uzb Uzbek
  • uzb_cyrl Uzbek - Cyrillic
  • vie Vietnamese
  • yid Yiddish
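
Because the `language` argument takes one of the codes listed above, a small lookup can catch typos before OCR starts. The sketch below is illustrative only: `LANGUAGE_NAMES` is a hypothetical and deliberately partial mapping copied from the list, and `language_name` is not a function of the library.

```python
# A partial copy of the list above; extend it with the codes you need.
LANGUAGE_NAMES = {
    "ara": "Arabic",
    "chi_sim": "Chinese - Simplified",
    "deu": "German",
    "eng": "English",
    "fra": "French",
    "hin": "Hindi",
    "spa": "Spanish; Castilian",
    "urd": "Urdu",
}


def language_name(code: str) -> str:
    """Return the language name for a code, or raise on an unknown code."""
    normalized = code.strip().lower()
    try:
        return LANGUAGE_NAMES[normalized]
    except KeyError:
        raise ValueError(f"Unknown language code: {code!r}") from None


print(language_name("spa"))  # Spanish; Castilian
```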
Owner
Shahrukh Khan
CS Grad Student @ Saarland University