PyPDF
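Concepts
pypdf is mainly used for reading, merging, encrypting, and similar page-level operations; it cannot directly edit the text content of a page. For more advanced editing, consider generating a new PDF with reportlab, or using PyMuPDF (fitz), which is dual-licensed under the AGPL and a commercial license.
If the extracted text is garbled, the likely cause is a font-embedding problem or a special encoding used by the PDF. Try page.extract_text(extraction_mode="layout"), or switch to an OCR tool chain such as pytesseract + pdf2image.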
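The fallback described above can be sketched as a small helper. This is a minimal illustration, not part of pypdf: `looks_garbled`, its 0.3 threshold, and `extract_with_fallback` are hypothetical names and values chosen for this example, and the `extraction_mode="layout"` argument assumes a reasonably recent pypdf release. The heuristic only counts replacement and non-printable characters, so it will miss text that decodes into valid but meaningless characters.

```python
def looks_garbled(text: str, threshold: float = 0.3) -> bool:
    """Hypothetical heuristic: empty text, or text that is mostly
    replacement/non-printable characters, is treated as garbled."""
    stripped = text.strip()
    if not stripped:
        return True
    bad = sum(
        1
        for ch in stripped
        if ch == "\ufffd" or not (ch.isprintable() or ch.isspace())
    )
    return bad / len(stripped) > threshold


def extract_with_fallback(page) -> str:
    """Try the default extraction first, then layout mode if the result looks garbled."""
    text = page.extract_text() or ""
    if looks_garbled(text):
        layout_text = page.extract_text(extraction_mode="layout") or ""
        if not looks_garbled(layout_text):
            return layout_text
    # Either the default text was fine or layout mode did not help
    return text


def extract_document(path: str) -> str:
    """Extract text from every page of the PDF at `path`."""
    # Imported lazily so the helpers above can be used without pypdf installed
    from pypdf import PdfReader

    reader = PdfReader(path)
    return "\n".join(extract_with_fallback(page) for page in reader.pages)
```

Calling `extract_document("example.pdf")` returns the text of all pages joined by newlines. If both modes still produce unusable text, OCR is the remaining option.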
Installation
bash
pip install pypdf
Reading PDF file information
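python
from pypdf import PdfReader
# Open the PDF file
pdf = PdfReader("example.pdf")
# Get the page count
print(f"Total pages: {len(pdf.pages)}")
# Get document information (author, title, etc.); metadata may be None
meta = pdf.metadata
if meta is not None:
    print("Title:", meta.title)
    print("Author:", meta.author)
Extracting text content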
python
from pypdf import PdfReader
pdf = PdfReader("example.pdf")
# Concatenate the text of every page, one page per line break
text = ""
for page in pdf.pages:
    text += page.extract_text() + "\n"
print(text)
Merging multiple PDFs
python
from pypdf import PdfWriter, PdfReader
writer = PdfWriter()
# Add the pages of several PDFs in order
for filename in ["file1.pdf", "file2.pdf", "file3.pdf"]:
    reader = PdfReader(filename)
    for page in reader.pages:
        writer.add_page(page)
# Save the merged PDF
with open("merged.pdf", "wb") as output_file:
    writer.write(output_file)
Splitting a PDF
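python
from pypdf import PdfReader, PdfWriter
# Save each page as a separate file
reader = PdfReader("document.pdf")
for i, page in enumerate(reader.pages):
    # A fresh writer per page, so each output file holds exactly one page
    writer = PdfWriter()
    writer.add_page(page)
    with open(f"page_{i+1}.pdf", "wb") as f:
        writer.write(f)
Rotating pages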
python
from pypdf import PdfReader, PdfWriter
reader = PdfReader("input.pdf")
writer = PdfWriter()
for page in reader.pages:
    page.rotate(90)  # Rotate 90 degrees clockwise
    writer.add_page(page)
with open("rotated.pdf", "wb") as f:
    writer.write(f)
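Encryption, listed at the top as one of pypdf's core operations, follows the same reader/writer pattern as the examples above. The sketch below is one way to arrange it, not a definitive recipe: `copy_and_encrypt` and `encrypt_pdf` are hypothetical helper names, and the file names and password are placeholders. It relies on `PdfWriter.encrypt`, called here with its default algorithm.

```python
def copy_and_encrypt(reader, writer, user_password, owner_password=None):
    """Copy every page from `reader` into `writer`, then password-protect the writer."""
    for page in reader.pages:
        writer.add_page(page)
    # Encryption is applied when the writer is written out
    writer.encrypt(user_password, owner_password=owner_password)
    return writer


def encrypt_pdf(src_path: str, dst_path: str, user_password: str) -> None:
    """Write a password-protected copy of the PDF at `src_path` to `dst_path`."""
    # Imported lazily so copy_and_encrypt can be exercised without pypdf installed
    from pypdf import PdfReader, PdfWriter

    writer = copy_and_encrypt(PdfReader(src_path), PdfWriter(), user_password)
    with open(dst_path, "wb") as f:
        writer.write(f)
```

For example, `encrypt_pdf("input.pdf", "encrypted.pdf", "secret")` writes the protected copy. To read it back, open it with `PdfReader`, check `reader.is_encrypted`, and call `reader.decrypt("secret")` before accessing the pages.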