Automate PDF Processing with Python: Extract, Transform, and Generate Documents
PDFs are everywhere in business: invoices, contracts, reports, forms. Processing them by hand is slow and error-prone. Python libraries cover most of the work, from simple text extraction to full document generation. Here's how to automate your PDF workflows.
Common PDF Automation Use Cases
- Extract text and data from invoices
- Merge multiple PDFs into one
- Split large documents into sections
- Fill PDF forms programmatically
- Convert PDFs to other formats
- Add watermarks or stamps
- Generate reports as PDFs
- OCR scanned documents
Essential Python PDF Libraries
PyPDF2 / pypdf
Basic PDF manipulation (PyPDF2 is no longer developed; pypdf is its maintained successor):
- Merge and split PDFs
- Extract text (limited formatting)
- Rotate pages
- Add watermarks
- Encrypt/decrypt PDFs
pdfplumber
Better text and table extraction:
- Extract text with position info
- Parse tables accurately
- Handle complex layouts
- Good for invoices and statements
ReportLab
Generate PDFs from scratch:
- Create professional documents
- Add images, tables, charts
- Full layout control
- Template-based generation
PyMuPDF (fitz)
High-performance PDF handling:
- Fast processing
- Image extraction
- Text search and highlight
- Annotation handling
pytesseract
OCR for scanned documents:
- Extract text from images
- Works with scanned PDFs once their pages are converted to images
- Multiple language support
- Combine with pdf2image to turn PDF pages into images
Use Case 1: Invoice Data Extraction
Extract key data from invoice PDFs:
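import pdfplumber

def extract_invoice_data(pdf_path):
    """Pull common header fields and line items from the first page of an invoice."""
    with pdfplumber.open(pdf_path) as pdf:
        page = pdf.pages[0]
        # Image-only (scanned) pages may yield no text at all
        text = page.extract_text() or ""
        # Parse for common invoice fields; tune these patterns to your layouts.
        # find_pattern, extract_vendor_name and parse_line_items are helpers
        # you write yourself; they are not part of pdfplumber.
        data = {
            'invoice_number': find_pattern(text, r'Invoice[:#]?\s*(\w+)'),
            'date': find_pattern(text, r'Date[:]?\s*([\d/\-]+)'),
            'total': find_pattern(text, r'Total[:]?\s*\$?([\d,.]+)'),
            'vendor': extract_vendor_name(text)
        }
        # Extract line items from tables
        tables = page.extract_tables()
        data['line_items'] = parse_line_items(tables)
    return data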
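The function above calls three helpers it does not define. The sketch below is one possible version of them, not a definitive implementation. It assumes the vendor name is the first non-empty line on the page and that the first table holds the line items with a header row; both assumptions will be wrong for some layouts.

```python
import re

def find_pattern(text, pattern):
    """Return the first capture group of `pattern` in `text`, or None if absent."""
    match = re.search(pattern, text, flags=re.IGNORECASE)
    return match.group(1) if match else None

def extract_vendor_name(text):
    """Assumption: the vendor name is the first non-empty line of the page."""
    for line in text.splitlines():
        if line.strip():
            return line.strip()
    return None

def parse_line_items(tables):
    """Assumption: the first table holds the line items and row 0 is the header."""
    if not tables or len(tables[0]) < 2:
        return []
    header, *rows = tables[0]
    # Cells can be None when a table cell is empty
    keys = [(cell or "").strip().lower() for cell in header]
    return [dict(zip(keys, row)) for row in rows]
```

Each helper returns None or an empty list when nothing is found, so callers can tell "missing" apart from a parsed value.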
Use Case 2: Report Generation
Generate professional PDF reports:
from reportlab.lib.pagesizes import letter
from reportlab.platypus import SimpleDocTemplate, Table, Paragraph
from reportlab.lib.styles import getSampleStyleSheet

def generate_report(data, output_path):
    doc = SimpleDocTemplate(output_path, pagesize=letter)
    elements = []
    styles = getSampleStyleSheet()
    # Title
    elements.append(Paragraph("Monthly Sales Report", styles['Title']))
    # Data table
    table_data = [['Product', 'Units', 'Revenue']]
    for item in data:
        table_data.append([item['product'], item['units'], f"${item['revenue']}"])
    table = Table(table_data)
    elements.append(table)
    doc.build(elements)
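The report function expects a list of dicts with product, units, and revenue keys. As a rough sketch of how the table rows could be prepared before they reach ReportLab, the hypothetical helper below formats revenue with thousands separators and appends a totals row. It assumes units and revenue are numeric.

```python
def build_table_rows(data):
    """Turn sales records into rows for a ReportLab Table, ending with a totals row."""
    rows = [["Product", "Units", "Revenue"]]
    total_units = 0
    total_revenue = 0.0
    for item in data:
        rows.append([item["product"], str(item["units"]), f"${item['revenue']:,.2f}"])
        total_units += item["units"]
        total_revenue += item["revenue"]
    rows.append(["Total", str(total_units), f"${total_revenue:,.2f}"])
    return rows

# Sample input in the shape the report function iterates over
sample = [
    {"product": "Widget", "units": 120, "revenue": 2400.0},
    {"product": "Gadget", "units": 45, "revenue": 13500.5},
]
rows = build_table_rows(sample)
```

The resulting list can be passed directly to Table(...) in place of the table_data built inside the report function.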
Use Case 3: PDF Merge and Split
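import os
# pypdf is the maintained successor to PyPDF2; its PdfWriter also handles merging
from pypdf import PdfReader, PdfWriter

def merge_pdfs(pdf_list, output_path):
    """Concatenate the PDFs in pdf_list, in order, into output_path."""
    writer = PdfWriter()
    for pdf in pdf_list:
        writer.append(pdf)
    writer.write(output_path)
    writer.close()

def split_pdf(input_path, output_dir):
    """Write each page of input_path to its own file in output_dir."""
    os.makedirs(output_dir, exist_ok=True)
    reader = PdfReader(input_path)
    for i, page in enumerate(reader.pages):
        writer = PdfWriter()
        writer.add_page(page)
        with open(os.path.join(output_dir, f"page_{i+1}.pdf"), 'wb') as out:
            writer.write(out)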
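Splitting page by page is the simplest case. To split a large document into sections of several pages each, the page ranges can be computed first and then written out. This is a minimal sketch using pypdf (the maintained successor to PyPDF2); the function names and the section_N.pdf naming are illustrative assumptions.

```python
import os

def chunk_ranges(total_pages, pages_per_section):
    """Return (start, stop) page-index pairs covering total_pages; stop is exclusive."""
    if pages_per_section < 1:
        raise ValueError("pages_per_section must be at least 1")
    return [
        (start, min(start + pages_per_section, total_pages))
        for start in range(0, total_pages, pages_per_section)
    ]

def split_into_sections(input_path, output_dir, pages_per_section=10):
    """Write consecutive groups of pages to section_1.pdf, section_2.pdf, ..."""
    from pypdf import PdfReader, PdfWriter  # third-party; imported only when needed
    os.makedirs(output_dir, exist_ok=True)
    reader = PdfReader(input_path)
    paths = []
    ranges = chunk_ranges(len(reader.pages), pages_per_section)
    for number, (start, stop) in enumerate(ranges, start=1):
        writer = PdfWriter()
        for index in range(start, stop):
            writer.add_page(reader.pages[index])
        path = os.path.join(output_dir, f"section_{number}.pdf")
        with open(path, "wb") as out:
            writer.write(out)
        paths.append(path)
    return paths
```

Keeping the range arithmetic in its own function makes the off-by-one-prone part easy to check without touching any PDF files.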
Use Case 4: Form Filling
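from pypdf import PdfReader, PdfWriter  # pypdf is the maintained successor to PyPDF2

def fill_pdf_form(template_path, data, output_path):
    """Fill the form fields on the first page of template_path and save a copy."""
    reader = PdfReader(template_path)
    writer = PdfWriter()
    writer.append(reader)
    # Only fields on the first page are updated; repeat for other pages as needed
    writer.update_page_form_field_values(
        writer.pages[0],
        data  # {'field_name': 'value', ...}
    )
    with open(output_path, 'wb') as out:
        writer.write(out)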
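Form filling depends on the keys in data matching the field names stored in the PDF, and a key that matches no field may simply go unfilled. A small check before filling can catch typos. The sketch below reads field names with pypdf's PdfReader.get_fields(); the helper names are hypothetical.

```python
def unknown_fields(form_field_names, data):
    """Return the keys in data that match no field name in the form, sorted."""
    return sorted(set(data) - set(form_field_names))

def list_form_fields(template_path):
    """Return the field names found in a PDF form (empty list if it has none)."""
    from pypdf import PdfReader  # third-party; imported only when needed
    fields = PdfReader(template_path).get_fields() or {}
    return list(fields)
```

A typical use would be to call list_form_fields on the template once, then refuse to fill the form if unknown_fields returns anything.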
OCR for Scanned PDFs
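import pytesseract
from pdf2image import convert_from_path

# Requires the Tesseract and Poppler binaries to be installed on the system
def ocr_pdf(pdf_path):
    """Run OCR on every page of a scanned PDF and return the combined text."""
    # Convert PDF pages to images
    images = convert_from_path(pdf_path)
    # OCR each page image and join the results with a newline between pages
    return "\n".join(pytesseract.image_to_string(image) for image in images)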
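OCR is slow compared with direct text extraction, so one common pattern is to try normal extraction first and fall back to OCR only for pages that yield almost no text. The following is a sketch under that assumption; the 20-character threshold and both function names are illustrative, not a standard.

```python
def needs_ocr(text, min_chars=20):
    """Heuristic: treat a page as scanned if it yields almost no extractable text."""
    return len((text or "").strip()) < min_chars

def extract_text_with_fallback(pdf_path, min_chars=20):
    """Extract text page by page, running OCR only on pages that look scanned."""
    # Third-party libraries, imported only when the function is actually called
    import pdfplumber
    import pytesseract
    from pdf2image import convert_from_path

    pages = []
    with pdfplumber.open(pdf_path) as pdf:
        for number, page in enumerate(pdf.pages, start=1):
            text = page.extract_text() or ""
            if needs_ocr(text, min_chars):
                # Render just this page (pdf2image page numbers are 1-based)
                image = convert_from_path(pdf_path, first_page=number, last_page=number)[0]
                text = pytesseract.image_to_string(image)
            pages.append(text)
    return "\n".join(pages)
```

The threshold is a trade-off: too low and pages with a stray header skip OCR, too high and short but valid pages get re-processed needlessly.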
Integration with Automation Platforms
Connect PDF processing to n8n or Make.com:
- Create a Python API endpoint (Flask/FastAPI)
- Accept PDF upload or URL
- Process and return extracted data
- Call from automation workflow
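The steps above can be sketched as a small Flask app. This is one possible shape, not a production service: the /extract route, the "file" upload field, and the response format are all assumptions, and extractor stands for any function that accepts a file-like object and returns JSON-serializable data.

```python
import io

def handle_upload(pdf_bytes, extractor):
    """Run extractor on uploaded PDF bytes; return (response_body, http_status)."""
    # PDF files start with the "%PDF-" signature
    if not pdf_bytes.startswith(b"%PDF-"):
        return {"ok": False, "error": "not a PDF file"}, 400
    try:
        return {"ok": True, "data": extractor(io.BytesIO(pdf_bytes))}, 200
    except Exception as exc:  # PDFs vary widely; report the failure instead of crashing
        return {"ok": False, "error": str(exc)}, 422

def create_app(extractor):
    """Build a Flask app exposing POST /extract for automation platforms to call."""
    from flask import Flask, jsonify, request  # third-party; imported only when needed

    app = Flask(__name__)

    @app.post("/extract")
    def extract():
        upload = request.files.get("file")
        if upload is None:
            return jsonify({"ok": False, "error": "missing 'file' upload"}), 400
        body, status = handle_upload(upload.read(), extractor)
        return jsonify(body), status

    return app
```

An n8n or Make.com HTTP request step would then POST the PDF as multipart form data to /extract and read the JSON response. Keeping handle_upload separate from the web framework makes the processing logic testable on its own.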
Best Practices
- Handle errors: PDFs vary wildly in structure, so expect some files to fail parsing
- Validate outputs: Check that extracted values are plausible before using them
- Use the appropriate library: Match the tool to the task
- Consider OCR: Scanned PDFs contain images rather than text and need a different approach
- Test with variety: PDFs from different sources behave differently
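As an illustration of validating outputs, the sketch below checks an extracted invoice dict for a missing number, a non-numeric or non-positive total, and an unparseable date. The field names follow the invoice example earlier in the article; the accepted date formats are an assumption and should match the documents you actually receive.

```python
from datetime import datetime
from decimal import Decimal, InvalidOperation

def _parses(value, fmt):
    """Return True if value matches the strptime format fmt."""
    try:
        datetime.strptime(value, fmt)
        return True
    except ValueError:
        return False

def validate_invoice(data, date_formats=("%Y-%m-%d", "%m/%d/%Y", "%d/%m/%Y")):
    """Return a list of problems found in an extracted invoice dict (empty if none)."""
    problems = []
    if not data.get("invoice_number"):
        problems.append("missing invoice_number")
    # Strip thousands separators before parsing the amount
    raw_total = (data.get("total") or "").replace(",", "")
    try:
        if Decimal(raw_total) <= 0:
            problems.append("total is not positive")
    except InvalidOperation:
        problems.append(f"total is not a number: {data.get('total')!r}")
    raw_date = data.get("date") or ""
    if not any(_parses(raw_date, fmt) for fmt in date_formats):
        problems.append(f"unrecognised date: {raw_date!r}")
    return problems
```

Returning a list of problems rather than raising on the first one lets a workflow log every issue with a document and route it to manual review.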
Need Custom PDF Automation?
Our Python automation team builds custom PDF processing solutions for businesses—from invoice extraction to report generation.
Contact us to discuss your PDF automation needs.