PDFSanitizer - Renders possibly unsafe PDF files and outputs harmless PDF files

Overview

PDFSanitizer

Renders possibly malicious PDF files and outputs harmless PDF files

To do this, the PDF files are rendered and converted to images using PyMuPDF. The images are then saved to a new PDF file using img2pdf. This ensures no visual data is lost, but any scripts/external references/flash files are removed.

Instalation:

git clone https://github.com/lacioffi/PDFSanitizer
cd PDFSanitizer
pip install -r requirements.txt 

Usage:

pdfsanitizer.py 
    
    

   

This project uses the following libraries:

PyMuPDF - By Jorj X. McKie (@JorjMcKie)

img2pdf - By Johannes Schauer Marin Rodrigues ([email protected])

Special thanks to @CoolerVoid for helping with the sandboxing part <3

Security Considerations

This script will render possibly malicious PDFs to generate the sanitized file using MuPDF, which has had exploits in the past: https://www.cvedetails.com/vulnerability-list/vendor_id-10846/product_id-20840/Artifex-Mupdf.html Therefore, assume the machine this runs on will be pwned eventually.

To mitigate this risk, I recommend two techinques:

Running the script on a Sandbox

Firejail

This sandbox will restrict all network access, unnecessary syscalls and reading/writing from unexpected files/folders. I recommend this method.

Install it using:

sudo apt-get install firejail

The profile is already included in this repo, but it assumes you're running from ~/PDFSanitizer. Setup and use it by running the following commands:

cd ~
git clone https://github.com/lacioffi/PDFSanitizer
cd ~/PDFSanitizer/ 
firejail --profile=pdfsanitizer.profile python3 ~/PDFSanitizer/pdfsanitizer.py file.pdf ~/PDFSanitizer/out

Plase note that the input file must be located in ~/PDFSanitizer. The output folder can have any name, but it must already exist and also be inside ~/PDFSanitizer.

If you want to run PDFSanitizer from another folder, change the following line in "pdfsanitizer.profile":

whitelist ${HOME}/PDFSanitizer/

To wherever you're running the program from.

If you want the output folder or the input file to be outside of PDFSanitizer's folder, simply add a line in "pdfsanitizer.profile":

whitelist /full/path/to/output/folder
whitelist /full/path/to/file.pdf

Again, please note that the output folder must already exist.

CloudFlare Sandbox

This sandbox will restrict all network access and unnecessary syscalls, but will not restrict reading/writing to arbitrary files/folders.

Download and build it from here: https://github.com/cloudflare/sandbox Take the generated "libsandbox.so" file, put it in this PDFSanitizer's folder and run the following command:

LD_PRELOAD=./libsandbox.so SECCOMP_SYSCALL_ALLOW="read:write:lseek:close:openat:brk:stat:munmap:fstat:getdents64:ioctl:rt_sigaction:mmap:mprotect:pread64:lstat:dup:mremap:futex:getegid:getuid:getgid:geteuid:sigaltstack:rt_sigprocmask:access:uname:fcntl:getcwd:readlink:sysinfo:arch_prctl:gettid:set_tid_address:set_robust_list:prlimit64:getrandom:exit_group" python3 pdfsanitizer.py file.pdf ./output

Running the script on an isolated environment

For maximum security, run this script on an isolated, ephemeral instance or even a serverless environment. Block all network communications, maybe kill the instance after the job is done, and only allow reading from an input folder/bucket and writing to an output folder/bucket.

I didn't try this method, but I believe you can do it with some Cloud Majyks.

To-do

This method removes EVERYTHING from the PDF. It would be nice to at least keep the text copy-pasteable.

DietPDF aims at reducing PDF file size while not degrading quality nor losing metadata

DietPDF aims at reducing PDF file size while not degrading quality nor losing metadata

Frédéric BISSON 6 Jul 27, 2022
Python PDF Parser (Not actively maintained). Check out pdfminer.six.

PDFMiner PDFMiner is a text extraction tool for PDF documents. Warning: As of 2020, PDFMiner is not actively maintained. The code still works, but thi

Yusuke Shinyama 4.9k Jan 04, 2023
minipdf is a package for creating simple, single-page PDF documents.

minipdf minipdf is a package for creating simple, single-page PDF documents. Installation You can install the development version from GitHub with: #

mikefc 41 Dec 19, 2022
Trata PDF para torná-lo compatível com PDF/X e com impressoras em escala de cinza.

tratapdf Trata PDF para torná-lo compatível com PDF/X e com impressoras em escala de cinza. dependências icc-profiles ghostscript visualizador de PDF

1 Nov 30, 2021
A bulk pdf generator. This application can generate PDFs in bulk by using just one click.

A bulk html pdf generator. This application can generate PDFs in bulk by using just one click. Screenshots Requirements 🧱 Your system must have the f

Aman Nirala 3 Apr 23, 2022
Mipdfcompressor - 💕A simple pdf size compressing telegram robot

Pdf Compressor Telegram Bot A simple pdf size compressing telegram robot. Useful for digital documentation. Mandatory Variables API_HASH - Your A

Madhavan Mi 1 Feb 14, 2022
A backend for mdbook in Python for generating PDF based on Chrome DevTools Protocol.

mdbook-pdf A backend for mdbook written in Python for generating PDF based on Chrome DevTools Protocol. Python library dependency Usage Put mdbook-pdf

Hollow Man 49 Dec 27, 2022
Telegram bot that can do a lot of things related to PDF files.

Telegram PDF Bot A Telegram bot that can: Compress, crop, decrypt, encrypt, merge, preview, rename, rotate, scale and split PDF files Compare text dif

130 Dec 26, 2022
Svg2pdfgen - Svg To PDF gen with python

Svg2pdfgen - Svg To PDF gen with python

Robert Urbańczyk 3 May 30, 2022
Camelot is a Python library that makes it easy for anyone to extract tables from PDF files

Camelot: PDF Table Extraction for Humans Camelot is a Python library that makes it easy for anyone to extract tables from PDF files! Note: You can als

Atlan Technologies Pvt Ltd 3.3k Jan 06, 2023
This is PDF Merger Application Developed using Just Python

This is PDF Merger Application Developed using Just Python

Sandeep Kumar Reddy 2 Nov 18, 2021
CLI tool to generate pdf invoices written in python

invoicepy CLI invoice tool, store and print invoices as pdf. save companies and customers for later use. installation pip install invoicepy config co

Adam Wojtczak 9 Aug 01, 2022
Convert PDF to AudioBook and Audio Speech to PDF

In this Python project, we will build a GUI-based PDF to Audio and Audio to PDF converter using the Tkinter, OS, path, pyttsx3, SpeechRecognition, PyPDF4, and Pydub libraries and the messagebox modul

RISHABH MISHRA 1 Feb 13, 2022
Extract the table in the PDF,outputs the data similar to the json format

extract the table in the PDF,outputs the data similar to the json format

3 Nov 25, 2021
Simple python tool created for downloading PDF.

PDFdownloader Usage Open PDF in full-screen mode Run scan.exe Enter how many pages you want to scan Focus PDF After scanning is done, run merge.exe En

5 Oct 27, 2021
Convert given source code into .pdf with syntax highlighting and more features

Code2pdf 📠 Convert given source code into .pdf with syntax highlighting and more features Build Status Version Downloads Python Demo Installation Bui

Tushar Gautam 343 Jan 05, 2023
Compare-pdf - A Flask driven restful API for comparing two PDF files

COMPARE-PDF A Flask driven restful API for comparing two PDF files. Description

Karthikeyan JC 3 Mar 13, 2022
A bot for PDF for doing Many Things....

Telegram PDF Bot A Telegram bot that can: Compress, crop, decrypt, encrypt, merge, preview, rename, rotate, scale and split PDF files Compare text dif

Mr. Developer 60 Dec 27, 2022
this is simple program, that converts pdf file to png

author: a5892731 last update:2021-11-01 version: 1.1 resources: -https://pypi.org/project/pdf2image/ -https://github.com/oschwartz10612/poppler-window

1 Nov 01, 2021
Convert MD files to PDF automatically (with CSS) 📄🚀

MD2PDF Action Convert MD files to PDF automatically (with CSS)! Converts a pattern described set of markdown files and converts them to pdf whilst app

Will Fantom 1 Feb 09, 2022