[PyPDF2] Installing, reading and saving, accessing pages, extracting text, reading and writing metadata, encrypting and decrypting
2022-08-10 23:50:00 【冰冷的希望】
1. Install PyPDF2
pip install PyPDF2
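2. Open and save PDF files
PyPDF2 provides two classes, PdfReader and PdfWriter, for reading and writing respectively. Pass the path of a PDF file to the PdfReader constructor to read it. A PdfWriter holds the PDF content in memory; call its write() method with a file handle opened in binary mode to save it to disk.
from PyPDF2 import PdfReader, PdfWriter
reader = PdfReader("test.pdf")  # open a reader for an existing PDF file
writer = PdfWriter()  # create a writer for building a new PDF
writer.add_page(reader.pages[0])  # add the first page of the PDF to the writer
# save the PDF
with open("test2.pdf", "wb") as f:
    writer.write(f)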
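Use add_blank_page() to add a blank page (addBlankPage() is the older camelCase spelling of the same method). Note that if the PdfWriter is empty you must specify the width and height; if it already contains pages and you omit them, the new page takes the size of the last page. The mediabox attribute of a PageObject shows the page dimensions.
from PyPDF2 import PdfWriter
writer = PdfWriter()
writer.add_blank_page(612, 810)  # the writer is empty, so width and height are required
writer.add_blank_page()  # add another blank page with the size of the previous one
print(writer.pages[0].mediabox)  # show the page dimensions
with open("test2.pdf", "wb") as f:
    writer.write(f)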
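The width and height passed when adding a blank page, and the values reported by mediabox, are by default expressed in PDF points, where one point is 1/72 inch. As a rough sketch of the arithmetic (the helper names points_to_mm and mm_to_points are made up for this example and are not part of PyPDF2):

```python
POINTS_PER_INCH = 72.0  # default PDF user-space unit: 1 point = 1/72 inch
MM_PER_INCH = 25.4


def points_to_mm(points: float) -> float:
    """Convert a length in PDF points to millimetres."""
    return points / POINTS_PER_INCH * MM_PER_INCH


def mm_to_points(mm: float) -> float:
    """Convert a length in millimetres to PDF points."""
    return mm / MM_PER_INCH * POINTS_PER_INCH


# The 612 x 810 blank page used in this section, in millimetres
print(f"{points_to_mm(612):.1f} mm wide, {points_to_mm(810):.1f} mm high")

# An A4 page (210 mm x 297 mm) expressed in whole points
print(round(mm_to_points(210)), round(mm_to_points(297)))
```

The rounded A4 values could then be passed as the width and height of a new blank page.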
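3. Access pages
You can get a page by index or iterate over all pages; either way you get PageObject instances.
from PyPDF2 import PdfReader
reader = PdfReader("test.pdf")
# total number of pages
page_count = len(reader.pages)
# index access: get the first page (indices start at 0)
page = reader.pages[0]
# iterate over all pages
for page in reader.pages:
    print(reader.get_page_number(page))  # zero-based position of the page in the document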
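Index access is handy when only some pages are needed. The sketch below is one possible helper (copy_pages is a hypothetical name, not a PyPDF2 function). It only assumes that the reader exposes a pages sequence and that the writer has an add_page() method, as in the PyPDF2 examples in this article.

```python
from typing import Iterable


def copy_pages(reader, writer, indices: Iterable[int]) -> int:
    """Add the pages at the given zero-based indices to writer.

    Returns the number of pages copied. Raises IndexError for an index
    outside the document instead of silently skipping it.
    """
    page_count = len(reader.pages)
    copied = 0
    for index in indices:
        if not 0 <= index < page_count:
            raise IndexError(
                f"page index {index} out of range (0..{page_count - 1})"
            )
        writer.add_page(reader.pages[index])
        copied += 1
    return copied
```

With a PdfReader and a PdfWriter, calling copy_pages(reader, writer, range(0, 4, 2)) should add the first and third pages to the writer, provided the document has at least three pages.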
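4. Extract text from a PDF
PageObject has an extract_text() method that returns the page's text as a string. Note that, according to the official documentation, this is plain text extraction only: the layout of formulas and similar content is not guaranteed to be preserved.
from PyPDF2 import PdfReader
reader = PdfReader("test.pdf")
for index, page in enumerate(reader.pages, start=1):  # iterate over all pages, numbering from 1
    print(f"Text of page {index}:")
    print(page.extract_text())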
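To collect the text of a whole document rather than print it page by page, something like the following sketch could be used. The function name extract_all_text and its separator parameter are illustrative assumptions; the only thing taken from PyPDF2 is that each page in reader.pages offers extract_text().

```python
def extract_all_text(reader, separator: str = "\n") -> str:
    """Join the text of every page in reader into one string.

    A page whose extract_text() yields nothing (for instance a page
    without a text layer) contributes an empty string.
    """
    parts = []
    for page in reader.pages:
        text = page.extract_text() or ""  # guard against None or empty results
        parts.append(text.strip())
    return separator.join(parts)
```

Passing a PdfReader instance, for example extract_all_text(PdfReader("test.pdf")), should return the concatenated page texts, subject to the same layout caveats as extract_text() itself.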
5. Read and write metadata
A PDF can store metadata such as the title, author and modification time, and this metadata can also be changed.
from PyPDF2 import PdfReader, PdfWriter
reader = PdfReader("test.pdf")
# read the metadata
meta = reader.metadata  # a dict-like DocumentInformation object
# meta = reader.getDocumentInfo()  # older equivalent of reader.metadata
print(type(meta), len(meta), meta)  # {'/Producer': 'Adobe LiveCycle PDF Generator', '/ModDate': ...}
print(meta.author)  # author
print(meta.creator)  # creator
print(meta.producer)  # producer
print(meta.title)  # title
print(meta.subject)  # subject
print(meta.get("/ModDate"))  # read any other key
# update the metadata
writer = PdfWriter()
for page in reader.pages:
    writer.add_page(page)
writer.add_metadata(meta)  # copy the existing metadata; a plain dict is also accepted
writer.add_metadata({"/Author": "pan"})  # then override the author
with open("test2.pdf", "wb") as f:
    writer.write(f)
If a metadata field is missing, the corresponding property (for example meta.author) returns None.
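Since add_metadata() accepts a plain dict, one way to prepare the updated values is to merge the existing metadata and the changes into a new dict first. This is only a sketch: merge_metadata is a hypothetical helper, and it assumes the metadata values are simple strings and that the reader's metadata may be None when the PDF carries no document information.

```python
from typing import Mapping, Optional


def merge_metadata(existing: Optional[Mapping], updates: Mapping[str, str]) -> dict:
    """Return a plain dict combining existing metadata with updates.

    existing may be None (no metadata available). Keys in updates get
    a leading "/" if they lack one, so "Author" and "/Author" are
    treated the same.
    """
    merged = {}
    if existing is not None:
        for key, value in existing.items():
            merged[str(key)] = str(value)
    for key, value in updates.items():
        name = key if key.startswith("/") else "/" + key
        merged[name] = str(value)
    return merged


new_meta = merge_metadata({"/Title": "Report", "/Author": "someone"}, {"Author": "pan"})
print(new_meta)  # {'/Title': 'Report', '/Author': 'pan'}
```

The resulting dict could then be handed to writer.add_metadata() in place of the two separate calls.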
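6. Encrypt and decrypt
Encryption and decryption are simple: call encrypt() or decrypt() with the password.
from PyPDF2 import PdfReader, PdfWriter
writer = PdfWriter()
writer.add_blank_page(612, 810)
secret = "fLa5fpao%3paH"  # password
# encrypt
writer.encrypt(secret)
with open("test2.pdf", "wb") as f:
    writer.write(f)
reader = PdfReader("test2.pdf")
# check whether the file is encrypted
if reader.is_encrypted:
    # decrypt
    reader.decrypt(secret)
Note that decrypt() does not raise an error even when the password is wrong; the error only appears later, when you access or modify the PDF.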
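One way to catch a wrong password early is to look at what decrypt() returns. The sketch below assumes that decrypt() returns a falsy value (0) when the password matches neither the user nor the owner password; decrypt_or_fail is a hypothetical helper name, not part of PyPDF2.

```python
def decrypt_or_fail(reader, password: str) -> None:
    """Decrypt reader in place, raising ValueError on a wrong password.

    Assumption: reader.decrypt() returns a falsy value (0) when the
    password does not match.
    """
    if not reader.is_encrypted:
        return  # nothing to do for an unencrypted file
    result = reader.decrypt(password)
    if not result:
        raise ValueError("wrong password: the PDF could not be decrypted")
```

Calling decrypt_or_fail(reader, secret) right after opening the file would surface a bad password immediately instead of at the first page access.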