From PDF Chaos to AI-Powered Organization: An Engineer’s Journey into Document Processing

How I built an intelligent document processing system using OCR (Optical Character Recognition), local LLMs, and Python – leveraging my code generation AI experience
The Problem: Drowning in a Digital Morass
Like many people, I had accumulated a large collection of PDF documents over the years – medical records, invoices, receipts, tax documents, insurance papers. They were scattered across folders with cryptic names, making it nearly impossible to find anything when I needed it. The thought of organizing them manually was daunting – it would take forever!
As a software engineer with some AI experience (mainly using ChatGPT for boilerplate code), I decided to explore whether I could use AI to solve this organizational problem. The eureka moment came when I realized I could simply ask an LLM to extract structured information from unstructured text – no complex programming required. Fortunately, my experience with AI-assisted coding translated surprisingly well to building document processing systems, though it required me to learn some new concepts around text processing and entity extraction.
The Journey: From Simple OCR to Intelligent Organization
Phase 1: Basic PDF to Text Conversion
I started with the fundamentals – converting PDFs to text. This required two main components:
PDF to Image Conversion:
import os
from pdf2image import convert_from_path
import pytesseract
from PIL import Image

def convert_from_pdf(source_folder, filename, output_folder):
    """Convert a PDF to per-page text and return an organized data structure."""
    output = []
    file_path = os.path.join(source_folder, filename)
    # Render each page at 300 DPI for better OCR accuracy
    pages = convert_from_path(file_path, dpi=300)
    for i, page_image in enumerate(pages):
        output_path = os.path.join(output_folder, f"{filename}_{i+1}.png")
        page_image.save(output_path, "PNG")
        # Extract text from the rendered page image using OCR
        text = pytesseract.image_to_string(Image.open(output_path))
        page_data = {
            'source_folder': source_folder,
            'filename': filename,
            'text': text,
            'page': i + 1,
            'image_path': output_path
        }
        output.append(page_data)
    return output
Key Learning: The quality of OCR depends heavily on image resolution. Rendering pages at 300 DPI made a significant difference in text extraction accuracy, and incorporating OCR itself was essentially push-button easy.
Phase 2: Enter Local LLMs – The Plot Thickens
This is where things got interesting. Instead of trying to write complex regex patterns – those complicated expressions that attempt to capture every possible variation of organization names, date formats, and monetary amounts – I discovered I could use a local Large Language Model (LLM) running on Ollama. The traditional approach would have required crafting patterns like
(Jan|Feb|Mar)\.?\s+\d{1,2},?\s+\d{4}
just to handle basic date variations, let alone the challenge of parsing company names with their many abbreviations and formatting quirks.
Why Local LLMs?
– Privacy: Documents never leave your machine
– Cost: No API fees for processing hundreds of documents
– Control: Can run offline and customize as needed
Setting up Ollama: First, visit ollama.com to learn more about the platform. Ollama dramatically lowers the barrier to using powerful LLMs by providing a simple, Docker-like interface for running models locally. This is particularly crucial when processing sensitive documents like medical records, financial statements, and personal correspondence – your data never leaves your machine, eliminating privacy concerns that would arise from sending documents to cloud-based AI services.
# Install Ollama
brew install ollama
# Start the service
brew services start ollama
# Pull a model (I used llama3.2)
ollama pull llama3.2
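Before wiring the model into the pipeline, it's worth verifying that the local server is actually reachable. A minimal sketch of such a check, assuming Ollama's standard /api/tags endpoint (which lists the models you have pulled); the function name here is mine:

import requests

def ollama_is_available(base_url: str = "http://localhost:11434") -> bool:
    """Return True if the local Ollama server responds and has at least one model pulled."""
    try:
        resp = requests.get(f"{base_url}/api/tags", timeout=5)
        resp.raise_for_status()
        return len(resp.json().get("models", [])) > 0
    except requests.RequestException:
        return False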
Phase 3: Building the AI Analyzer
With this insight in mind, I built the AI analyzer:
import requests
from typing import Dict, List

class OllamaAnalyzer:
    def __init__(self, model_name: str = "llama3.2"):
        self.model_name = model_name
        self.api_url = "http://localhost:11434/api/generate"

    def extract_entities(self, text: str) -> Dict[str, List[str]]:
        """Extract named entities with a focus on organizations."""
        prompt = f"""
        Analyze the following text and extract key information. Focus especially on identifying ALL organization names, including:
        - Company names
        - Medical facilities and departments
        - Government agencies
        - Financial institutions
        - Service providers

        Return the results in JSON format with these categories:
        - "organizations": List of ALL company/organization names
        - "dates": List of dates found
        - "amounts": List of monetary amounts
        - "keywords": List of important keywords

        Text to analyze:
        {text}

        Please return only valid JSON format.
        """
        response = self.generate_response(prompt)
        # Parse the model's JSON reply into a structured dictionary
        return self.parse_json_response(response)
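The class above leans on two helpers that aren't shown. Here is a minimal sketch of how they might look, assuming Ollama's /api/generate endpoint with streaming disabled and a model reply that may wrap the JSON in extra prose (json, re, and requests imported at module level):

    # Additional methods of OllamaAnalyzer (a sketch, not the project's exact implementation)
    def generate_response(self, prompt: str) -> str:
        """Send the prompt to the local Ollama server and return its raw text reply."""
        payload = {"model": self.model_name, "prompt": prompt, "stream": False}
        resp = requests.post(self.api_url, json=payload, timeout=120)
        resp.raise_for_status()
        return resp.json().get("response", "")

    def parse_json_response(self, response: str) -> Dict[str, List[str]]:
        """Extract the first JSON object from the reply, tolerating surrounding text."""
        match = re.search(r"\{.*\}", response, re.DOTALL)
        if not match:
            return {}
        try:
            return json.loads(match.group(0))
        except json.JSONDecodeError:
            return {}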
Key Learning: LLMs are surprisingly good at understanding context and extracting relevant information, even from poorly formatted OCR text.
Phase 4: The Organization Mapping Challenge
One of the biggest challenges was dealing with inconsistent organization names. The same company might appear as:
– “SecureTech”
– “SecureTech.com”
– “SocureToch.com” (thanks, OCR – you had one job!)
I solved this with a normalization system:
def normalize_organization_name(self, org_name: str) -> str:
    """Normalize organization names so similar variants group together."""
    org_mappings = {
        'securetech': ['securetech.com', 'securetechcom', 'securetoch.com', 'securetech security'],
        'metro health system': ['metro health', 'metro health medical group'],
        'first national bank': ['first national', 'first national bank na'],
        # ... more mappings
    }
    normalized = org_name.lower().strip()
    # Check for exact and partial matches against the canonical mapping
    for canonical, variations in org_mappings.items():
        if normalized == canonical or normalized in [v.lower() for v in variations]:
            return canonical.title()
    return org_name.strip()
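A quick illustration of the intended behavior, using hypothetical inputs and assuming the method lives on the analyzer class:

analyzer = OllamaAnalyzer()
print(analyzer.normalize_organization_name("SecureToch.com"))    # -> "Securetech"
print(analyzer.normalize_organization_name("Metro Health"))      # -> "Metro Health System"
print(analyzer.normalize_organization_name("Acme Widgets LLC"))  # -> unchanged: no mapping entry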
Phase 5: Smart Document Grouping
The next challenge was intelligently grouping documents. I wanted receipts and invoices from the same date to be grouped together:
from collections import defaultdict

def group_by_date_and_type(pages):
    """Group pages by date and document type to combine receipts with invoices."""
    date_groups = defaultdict(list)
    # Group by date first
    for page in pages:
        date_key = page['date_str']
        date_groups[date_key].append(page)
    # Within each date, try to group receipts with invoices
    final_groups = []
    for date_key, date_pages in date_groups.items():
        invoices = [p for p in date_pages if 'invoice' in p['document_type'].lower()]
        receipts = [p for p in date_pages if 'receipt' in p['document_type'].lower()]
        # If we have both invoices and receipts on the same date, group them
        if invoices and receipts:
            combined_group = invoices + receipts
            final_groups.append({
                'date': date_key,
                'type': 'Invoice+Receipt',
                'pages': combined_group
            })
        # Handle other cases...
    return final_groups
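With a couple of hypothetical page records from the analysis step, the grouping behaves like this:

pages = [
    {'date_str': '2023-05-14', 'document_type': 'Invoice', 'filename': 'scan_017.pdf'},
    {'date_str': '2023-05-14', 'document_type': 'Receipt', 'filename': 'scan_018.pdf'},
]
groups = group_by_date_and_type(pages)
# -> [{'date': '2023-05-14', 'type': 'Invoice+Receipt', 'pages': [<both records>]}]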
Phase 6: PDF Generation
The culminating step was generating organized PDF documents from the grouped pages. Using the ReportLab library, I created chronologically sorted PDFs for each organization:
import os
from datetime import datetime
from reportlab.lib.pagesizes import letter
from reportlab.platypus import SimpleDocTemplate, Image, Spacer
from reportlab.lib.units import inch

def create_organization_pdf(org_name, pages, output_dir):
    """Generate a PDF containing all pages for an organization."""
    filename = f"{org_name.replace(' ', '_')}_chronological.pdf"
    filepath = os.path.join(output_dir, filename)
    doc = SimpleDocTemplate(filepath, pagesize=letter)
    story = []
    # Sort pages chronologically; undated pages sort to the front
    sorted_pages = sorted(pages, key=lambda x: x.get('parsed_date', datetime.min))
    for page in sorted_pages:
        if os.path.exists(page['image_path']):
            # Add the scanned document image, scaled to fit a letter page
            img = Image(page['image_path'])
            img.drawHeight = 10*inch
            img.drawWidth = 7.5*inch
            story.append(img)
            story.append(Spacer(1, 0.2*inch))
    doc.build(story)
    return filepath
This approach preserves the original document appearance while organizing everything chronologically by organization. Each generated PDF becomes a comprehensive record of all interactions with that entity, making it trivial to find historical documents when needed.
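One detail the snippet glosses over is the parsed_date field used for sorting. A minimal sketch of turning the OCR-extracted date strings into datetime objects; the format list here is illustrative, not the project's actual one:

from datetime import datetime

def parse_document_date(date_str):
    """Try a handful of common date formats; return None if nothing matches."""
    for fmt in ("%Y-%m-%d", "%m/%d/%Y", "%B %d, %Y", "%b %d, %Y"):
        try:
            return datetime.strptime(date_str.strip(), fmt)
        except ValueError:
            continue
    return None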
Technical Architecture
The enhanced system consists of several interconnected components with improved AI integration:
┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ Source │ │ PDF2Image │ │ OCR │
│ PDFs │───▶│ Conversion │───▶│ (Text │
│ │ │ │ │ Extraction) │
└─────────────┘ └─────────────┘ └─────────────┘
│
▼
┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ Enhanced │ │ AI │ │ Enhanced │
│ LLM │───▶│ Powered │───▶│ Entity │
│ Analysis │ │Organization │ │ Extraction │
│ │ │Normalization│ │& Categories │
└─────────────┘ └─────────────┘ └─────────────┘
▲ │ │
│ ▼ ▼
┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ Document │ │ Smart │ │ Web │
│Classification│ │ Document │ │ Interface │
│ │ │ Grouping │ │ (Flask) │
└─────────────┘ └─────────────┘ └─────────────┘
│ │ │
└───────────────────┼───────────────────┘
▼
┌─────────────┐
│ Final │
│ PDFs │
│ + Search │
│ Interface │
└─────────────┘
Key Files:
– convert_source.py: PDF to text conversion with structured output
– ollama_analyzer.py: Enhanced LLM-based entity extraction with organization categorization, AI-powered normalization, and document classification
– organize_by_organization.py: Document grouping and PDF generation
– reprocess_organizations.py: Batch reprocessing with updated mappings
– web_interface.py: Flask-based web interface for document search and management
– web_interface_organizations.py: Organization-focused web interface
Enhanced Features in ollama_analyzer.py:
– Organization Categorization: Automatically categorizes organizations by type (healthcare, financial, government, service, utility, education, retail)
– AI-Powered Normalization: Uses LLM to standardize organization names instead of manual mapping rules
– Document Classification: Automatically identifies document types (invoice, receipt, medical record, etc.)
– DocumentProcessor Class: Comprehensive workflow management for batch processing
– Advanced Search: Keyword search with context extraction across all documents
– Connection Testing: Validates Ollama connectivity before processing
– Error Handling: Robust JSON parsing and graceful failure recovery
– Rate Limiting: Built-in delays to prevent API overload
– Comprehensive Reporting: Detailed summaries with categorized organization listings
Lessons Learned
1. Start Simple, Iterate Quickly
I began with basic OCR and gradually added AI components. This allowed me to understand each piece before adding complexity.
2. Local LLMs Are Surprisingly Capable
For document processing tasks, local models like Llama 3.2 perform well and offer significant advantages over cloud-based alternatives.
3. Data Quality Matters More Than Model Sophistication
Spending time on OCR quality (proper DPI, image preprocessing) had more impact than trying different LLM models. I used only the local Ollama llama3.2 model throughout this project, and it proved more than capable for the task. The document processing requirements were straightforward enough that model experimentation wasn’t necessary – the single model handled entity extraction, organization mapping suggestions, and text analysis with remarkable consistency. This reinforced that for many practical AI applications, focusing on clean input data and well-crafted prompts yields better results than chasing the latest model releases.
4. Error Handling Is Critical
Real-world documents are messy. Building robust error handling and fallback mechanisms was essential. This included handling network timeouts when communicating with the local LLM, gracefully managing malformed JSON responses, dealing with OCR failures on corrupted or low-quality images, and providing sensible defaults when document parsing fails. The system needed to continue processing other documents even when individual pages encountered errors, ensuring that one problematic document wouldn’t derail the entire batch processing operation.
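As a rough sketch of that idea, each document gets its own try/except so a single failure is logged and skipped rather than aborting the run (process_document here stands in for the real per-document pipeline):

import logging

logger = logging.getLogger(__name__)

def process_batch(documents, process_document):
    """Process each document independently; one failure never stops the batch."""
    results, failures = [], []
    for doc in documents:
        try:
            results.append(process_document(doc))
        except Exception as exc:
            logger.warning("Skipping %s: %s", doc, exc)
            failures.append((doc, str(exc)))
    return results, failures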
5. Organization Mapping Requires Domain Knowledge
The AI could extract organization names, but understanding that “SecureTech” and “SecureToch.com” are the same company required human insight encoded into mapping rules. However, this presents an intriguing opportunity for AI-assisted rule generation.
While I initially created mapping rules manually, I discovered that AI could actually help generate these mappings. By feeding the LLM examples of organization name variations, it could suggest potential matches:
def generate_mapping_suggestions(self, org_names: List[str]) -> Dict[str, List[str]]:
    """Use AI to suggest organization name mappings."""
    prompt = f"""
    Analyze these organization names and group similar ones that likely refer to the same entity.
    Consider variations like:
    - OCR errors (letter substitutions)
    - Different formats (.com, Inc, LLC variations)
    - Abbreviations vs full names
    - Spacing and punctuation differences

    Organization names: {org_names}

    Return JSON with canonical names as keys and variations as arrays.
    """
    response = self.generate_response(prompt)
    return self.parse_json_response(response)
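In practice, feeding it the raw names collected from a batch might look like this (the names and output are illustrative):

raw_names = ["SecureTech", "SecureToch.com", "St. Mary's", "Saint Mary's Hospital"]
suggestions = analyzer.generate_mapping_suggestions(raw_names)
# Possible result: {"SecureTech": ["SecureToch.com"], "Saint Mary's Hospital": ["St. Mary's"]}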
This hybrid approach proved remarkably effective – the AI could identify patterns I missed (like “St. Mary’s” and “Saint Mary’s Hospital”) while I provided domain-specific knowledge about which organizations were actually the same entity. The AI essentially became a sophisticated fuzzy matching engine that could understand context beyond simple string similarity.
The key insight is that AI excels at pattern recognition and can encode complex matching rules that would be tedious to write manually, but human oversight remains crucial for validating that the suggested mappings make business sense.
Results and Impact
The enhanced system successfully processes documents with significantly improved capabilities:
Processing Results:
– Comprehensive Entity Extraction: Names, dates, organizations, locations, amounts, and keywords
– Automatic Organization Categorization: Groups entities by type (healthcare, financial, government, etc.)
– AI-Powered Normalization: Standardizes organization names without manual mapping
– Document Classification: Automatically identifies document types (invoices, receipts, medical records, etc.)
– Advanced Search: Keyword search with context extraction across all documents
– Robust Error Handling: Graceful handling of OCR errors and malformed responses
Key Improvements Over Original:
– Scalability: No manual mapping rules required – AI handles normalization automatically
– Accuracy: Better organization name standardization and grouping
– Insights: Organization categorization provides better understanding of document types
– Reliability: Enhanced error handling and connection testing
– Performance: Rate limiting prevents API overload
Before: Hours of tedious work sorting through cryptically named PDF files
After: Finding any document in seconds with intelligent categorization and normalization – a complete transformation from chaos to order
For Fellow Engineers: Getting Started
If you want to build something similar, especially if you have experience with AI code generation:
- Start with the basics: Get PDF→Image→Text working first
- Install Ollama: It’s the easiest way to run local LLMs
- Apply your prompt engineering skills: If you’ve used AI for coding, you already understand how LLMs respond to different prompt styles
- Build incrementally: Add one AI component at a time (similar to iterating on code generation prompts)
- Focus on data quality: Clean inputs matter more than fancy models
The complete code is available in my repository, and I encourage you to adapt it for your own document processing needs.
What’s Next?
The system has evolved significantly, and several enhancements are already implemented or in progress:
Recently Implemented:
1. Enhanced Categorization: ✅ Completed – Organizations are now automatically categorized by type (healthcare, financial, government, etc.)
2. AI-Powered Normalization: ✅ Completed – Organization names are standardized using LLM analysis instead of manual rules
3. Document Classification: ✅ Completed – Automatic document type identification (invoice, receipt, medical record, etc.)
4. Web Interface: ✅ Implemented – web_interface.py and web_interface_organizations.py provide Flask-based UIs for document search and management
Future Improvements Being Considered:
1. Semantic Search: Using embeddings to find documents by content similarity, not just keyword matching
2. Advanced Analytics: Document trend analysis and relationship mapping between organizations
3. Batch Processing Optimization: Handling very large document collections more efficiently
4. Export Capabilities: Generate reports and summaries in various formats
5. Integration APIs: Connect with document management systems and cloud storage
Conclusion
What started as a frustrating pile of disorganized PDFs became an opportunity to explore practical AI applications beyond code generation. This project demonstrated that local LLMs, combined with traditional programming techniques, can solve real organizational problems without requiring cloud services or specialized ML expertise.
The journey from basic OCR to intelligent document processing revealed several key insights: data quality trumps model sophistication, local AI provides both privacy and cost benefits, and iterative development allows you to understand each component before adding complexity. Most importantly, the skills I’d developed using AI for code generation – prompt engineering, understanding LLM capabilities and limitations, and treating AI as a powerful API – translated directly to document processing.
The system now automatically categorizes hundreds of documents, normalizes organization names, and provides instant search capabilities. What once took hours of manual sorting now happens in minutes, transforming document chaos into organized, searchable archives. For any engineer dealing with similar document management challenges, the combination of OCR, local LLMs, and solid software engineering practices offers a practical path forward.
The code can be found here: https://github.com/klhammond99/pdf-order