2022 - 2023
Staff Engineer
AI Document/Table Extraction Pipeline
A production-grade extraction pipeline that converts messy PDFs into validated, structured data for finance teams.
Problem
Finance analysts were spending hours rekeying tables from inconsistent PDFs.
The existing OCR stack produced noisy outputs with no quality guarantees.
Approach
Paired layout-aware extraction with structured validation rules.
Built a review UI to capture corrections and retrain the model weekly.
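A minimal sketch of what one structured validation rule could look like, assuming a table whose line items should sum to a stated total. The names `parse_amount`, `validate_table`, and `ValidationIssue` are hypothetical illustrations, not the pipeline's actual code.

```python
from __future__ import annotations

from dataclasses import dataclass
from decimal import Decimal, InvalidOperation


@dataclass
class ValidationIssue:
    rule: str      # hypothetical rule identifier, e.g. "numeric" or "sum_check"
    message: str   # human-readable detail for the review UI


def parse_amount(text: str) -> Decimal | None:
    """Parse strings like '1,234.50' or '(200.00)'; return None if not numeric."""
    cleaned = text.strip().replace(",", "").replace("$", "")
    # Accounting notation: parentheses mark a negative amount.
    negative = cleaned.startswith("(") and cleaned.endswith(")")
    if negative:
        cleaned = cleaned[1:-1]
    try:
        value = Decimal(cleaned)
    except InvalidOperation:
        return None
    return -value if negative else value


def validate_table(line_items: list[str], stated_total: str) -> list[ValidationIssue]:
    """Check that every cell is numeric and that line items sum to the total."""
    issues: list[ValidationIssue] = []
    amounts = [parse_amount(item) for item in line_items]
    for index, amount in enumerate(amounts):
        if amount is None:
            issues.append(
                ValidationIssue("numeric", f"row {index}: not a number: {line_items[index]!r}")
            )
    total = parse_amount(stated_total)
    if total is None:
        issues.append(ValidationIssue("numeric", f"total is not a number: {stated_total!r}"))
    elif all(a is not None for a in amounts):
        computed = sum(amounts, Decimal(0))
        if computed != total:
            issues.append(
                ValidationIssue("sum_check", f"line items sum to {computed}, total says {total}")
            )
    return issues
```

An empty list means the table passed. Any issue can be surfaced in the review UI next to the offending cell. `Decimal` is used instead of `float` so that sums of currency values compare exactly.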
Architecture
Microservice pipeline for OCR, extraction, validation, and enrichment.
Vector search for template matching, enabling adaptive parsing strategies.
Impact
Delivered consistent data to downstream risk and compliance workflows.
Reduced onboarding time for new document types from weeks to days.
Stack
Python, FastAPI, OpenAI, Docker, Kubernetes, Snowflake, Prefect.
Learnings
Transparency in extraction confidence was as important as raw accuracy.
Human-in-the-loop design keeps the model reliable over time.
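As a rough illustration of how extraction confidence can feed a human-in-the-loop flow, the sketch below splits extracted fields into auto-accepted and review queues. The `ExtractedField` type, the `split_for_review` helper, and the 0.9 threshold are hypothetical, not taken from the pipeline.

```python
from dataclasses import dataclass


@dataclass
class ExtractedField:
    name: str
    value: str
    confidence: float  # assumed to be in the range 0.0 to 1.0


def split_for_review(
    fields: list[ExtractedField],
    threshold: float = 0.9,  # assumed cutoff, would be tuned per document type
) -> tuple[list[ExtractedField], list[ExtractedField]]:
    """Return (auto_accepted, needs_review) based on per-field confidence."""
    accepted = [f for f in fields if f.confidence >= threshold]
    needs_review = [f for f in fields if f.confidence < threshold]
    return accepted, needs_review


fields = [
    ExtractedField("invoice_total", "1,000.00", 0.97),
    ExtractedField("due_date", "2023-O1-15", 0.41),
]
accepted, needs_review = split_for_review(fields)
```

Showing the score next to each value lets reviewers focus on the low-confidence fields. The corrections they make are the kind of signal a weekly retraining job could consume.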
Impact
- Cut manual review time by 65% through confidence scoring and feedback loops.
- Achieved 92% extraction accuracy across 40+ document templates.