A modern pure-Python library for reading PDF files

Last update: Apr 06, 2022

pdf

The goal is a modern interface for handling PDF files that is self-consistent and follows typical Python syntax.

The library should be Python-only (hence no C extensions) but allow the backend to be changed, similar in concept to matplotlib backends and Keras backends.

The default backend could be PyPDF2.

Other possible backends are PyMuPDF (using MuPDF) and pikepdf (using QPDF).
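To make the backend idea concrete, here is a minimal sketch of one way a swappable backend could be structured. It is not this library's actual design: `Backend`, `register_backend`, `get_backend`, and `DummyBackend` are hypothetical names invented for illustration, and the dummy backend serves in-memory strings so the example runs without any PDF library installed.

```python
from abc import ABC, abstractmethod


class Backend(ABC):
    """Hypothetical interface every backend would implement."""

    @abstractmethod
    def page_count(self, path):
        """Return the number of pages in the file at `path`."""

    @abstractmethod
    def page_text(self, path, index):
        """Return the text of the zero-based page `index`."""


# Registry mapping a backend name to its class (assumption, for illustration).
_BACKENDS = {}


def register_backend(name):
    """Class decorator that makes a backend selectable by name."""
    def decorator(cls):
        _BACKENDS[name] = cls
        return cls
    return decorator


@register_backend("dummy")
class DummyBackend(Backend):
    """In-memory stand-in; a real backend would wrap a PDF library."""

    def __init__(self):
        self._pages = ["first page", "second page"]

    def page_count(self, path):
        return len(self._pages)

    def page_text(self, path, index):
        return self._pages[index]


def get_backend(name="dummy"):
    """Instantiate the backend registered under `name`."""
    try:
        return _BACKENDS[name]()
    except KeyError:
        available = sorted(_BACKENDS)
        raise ValueError(
            f"unknown backend {name!r}; available: {available}"
        ) from None
```

With a registry like this, the user-facing classes would only ever talk to the `Backend` interface, so switching the underlying PDF library would not change user code.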

WARNING: This library is UNSTABLE at the moment! Expect many changes!

Installation

pip install pdffile

Usage

Retrieve Metadata

>>> import pdf

>>> doc = pdf.PdfFile("001-trivial/minimal-document.pdf")
>>> len(doc)
1

>>> doc.metadata
Metadata(
    title=None,
    producer='pdfTeX-1.40.23',
    creator='TeX',
    creation_date=datetime.datetime(2022, 4, 3, 18, 5, 42),
    modification_date=datetime.datetime(2022, 4, 3, 18, 5, 42),
    other={
         '/CreationDate': "D:20220403180542+02'00'",
         '/ModDate': "D:20220403180542+02'00'",
         '/Trapped': '/False',
         '/PTEX.Fullbanner': 'This is pdfTeX, V...'})
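The raw `/CreationDate` and `/ModDate` entries use the PDF date format (`D:YYYYMMDDHHmmSS` followed by an optional UTC offset). The sketch below shows one way such a string could be turned into a `datetime`. It is an illustration, not the library's own parsing code, and `parse_pdf_date` is a hypothetical helper name. Unlike the output shown above, it keeps the UTC offset when one is present.

```python
import re
from datetime import datetime, timedelta, timezone

# Every field after the year is optional in the PDF date format.
_PDF_DATE = re.compile(
    r"^D:(?P<year>\d{4})"
    r"(?P<month>\d{2})?(?P<day>\d{2})?"
    r"(?P<hour>\d{2})?(?P<minute>\d{2})?(?P<second>\d{2})?"
    r"(?:(?P<sign>[+\-Z])"
    r"(?:(?P<tz_hour>\d{2})'?(?:(?P<tz_minute>\d{2})'?)?)?)?$"
)


def parse_pdf_date(value):
    """Parse a PDF date string such as "D:20220403180542+02'00'"."""
    match = _PDF_DATE.match(value)
    if match is None:
        raise ValueError(f"not a PDF date string: {value!r}")
    parts = match.groupdict()
    naive = datetime(
        int(parts["year"]),
        int(parts["month"] or 1),   # missing month/day default to 1
        int(parts["day"] or 1),
        int(parts["hour"] or 0),    # missing time fields default to 0
        int(parts["minute"] or 0),
        int(parts["second"] or 0),
    )
    sign = parts["sign"]
    if sign is None:
        return naive  # no timezone information in the string
    if sign == "Z":
        return naive.replace(tzinfo=timezone.utc)
    offset = timedelta(
        hours=int(parts["tz_hour"] or 0),
        minutes=int(parts["tz_minute"] or 0),
    )
    if sign == "-":
        offset = -offset
    return naive.replace(tzinfo=timezone(offset))
```

Calling `parse_pdf_date(doc.metadata.other['/CreationDate'])` on the document above would give a timezone-aware value; `.replace(tzinfo=None)` drops the offset if a naive `datetime` is preferred.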

Encrypted PDFs

If you have an encrypted PDF, pass the password when opening it:

doc = pdf.PdfFile(pdf_path, password=password)

All other operations then work exactly as they do for unencrypted files.

Get Outline

>>> import pdf
>>> doc = pdf.PdfFile(pdf_path, password=password)
>>> doc.outline
[
    Links(page=5, text='1 Header'),
    Links(page=5, text='1.1 A section'),
    Links(page=9, text='2 Foobar'),
    Links(page=108, text='References')
]
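As a small illustration of working with the outline, the sketch below renders entries as a plain-text table of contents with dot leaders. It only assumes that each entry exposes the `page` and `text` attributes shown above. The `Links` namedtuple here is a stand-in so the example runs on its own, and `format_outline` is a hypothetical helper, not part of the library.

```python
from collections import namedtuple

# Stand-in for the library's outline entries (assumption: same field names).
Links = namedtuple("Links", ["page", "text"])


def format_outline(outline, width=40):
    """Return one line per entry: title, dot leader, page number."""
    lines = []
    for entry in outline:
        page = str(entry.page)
        # Fill the space between title and page number, at least two dots.
        dots = "." * max(2, width - len(entry.text) - len(page) - 2)
        lines.append(f"{entry.text} {dots} {page}")
    return "\n".join(lines)


outline = [
    Links(page=5, text="1 Header"),
    Links(page=5, text="1.1 A section"),
    Links(page=9, text="2 Foobar"),
    Links(page=108, text="References"),
]
print(format_outline(outline))
```

With the real library, `format_outline(doc.outline)` should work unchanged as long as the entries keep those two attributes.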

Extract Text

>>> import pdf
>>> doc = pdf.PdfFile("001-trivial/minimal-document.pdf")
>>> doc[0]
<pdf.PdfPage object at 0x7f72d2b04100>
>>> doc[0].text
'Loremipsumdolorsitamet,consetetursadipscingelitr,seddiamnonumyeirmod\ntemporinviduntutlaboreetdoloremagnaaliquyamerat,seddiamvoluptua.Atvero\neosetaccusametjustoduodoloresetearebum.Stetclitakasdgubergren,noseataki-\nmatasanctusestLoremipsumdolorsitamet.Loremipsumdolorsitamet,consetetur\nsadipscingelitr,seddiamnonumyeirmodtemporinviduntutlaboreetdoloremagna\naliquyamerat,seddiamvoluptua.Atveroeosetaccusametjustoduodoloresetea\nrebum.Stetclitakasdgubergren,noseatakimatasanctusestLoremipsumdolorsit\namet.\n1\n'

Alternatively, you can use doc.text to get the text of all pages.
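Extracted text keeps the line breaks of the page layout, including words hyphenated at a line end (see `ataki-\nmata` in the output above). A minimal post-processing sketch that rejoins such words is shown below. `join_hyphenated_lines` is a hypothetical helper, not a library function, and it would also merge genuine hyphenated compounds that happen to break at a line end.

```python
import re

# A hyphen directly before a newline, between a letter and a lowercase letter.
_HYPHEN_BREAK = re.compile(r"(?<=[A-Za-z])-\n(?=[a-z])")


def join_hyphenated_lines(text):
    """Remove end-of-line hyphenation, leaving other line breaks intact."""
    return _HYPHEN_BREAK.sub("", text)
```

It could be applied to a single page (`join_hyphenated_lines(doc[0].text)`) or to the text of the whole document (`join_hyphenated_lines(doc.text)`).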
