SIMONE PIPITONE

2022 - 2023

Staff Engineer

AI Document/Table Extraction Pipeline

A production-grade extraction pipeline that converts messy PDFs into validated, structured data for finance teams.

Python, FastAPI, Kubernetes, OpenAI, Snowflake

Problem

Finance analysts were spending hours rekeying tables from inconsistent PDFs.

The existing OCR stack produced noisy outputs with no quality guarantees.

Approach

Paired layout-aware extraction with structured validation rules.
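
As an illustration of what a structured validation rule can look like, here is a minimal sketch, not the production rule set. It assumes an extracted table arrives as a list of rows with an "amount" column plus a stated total; the names validate_table and ValidationIssue are hypothetical.

```python
from dataclasses import dataclass
from decimal import Decimal, InvalidOperation
from typing import Optional


@dataclass
class ValidationIssue:
    row: Optional[int]  # None means the issue applies to the whole table
    message: str


def parse_amount(raw: str) -> Decimal:
    """Parse an amount such as '1,200.50', ignoring thousands separators."""
    return Decimal(raw.replace(",", "").strip())


def validate_table(rows: list[dict[str, str]], total: str) -> list[ValidationIssue]:
    """Check that every amount parses and that the rows sum to the stated total."""
    issues: list[ValidationIssue] = []
    running = Decimal("0")
    for i, row in enumerate(rows):
        raw = row.get("amount", "")
        try:
            running += parse_amount(raw)
        except InvalidOperation:
            issues.append(ValidationIssue(i, f"unparseable amount: {raw!r}"))
    try:
        expected = parse_amount(total)
    except InvalidOperation:
        issues.append(ValidationIssue(None, f"unparseable total: {total!r}"))
        return issues
    if running != expected:
        issues.append(ValidationIssue(None, f"rows sum to {running}, total says {expected}"))
    return issues
```

Rules of this kind only flag problems; what happens to a flagged table (reject, retry, or send to review) is a separate decision.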

Built a review UI to capture corrections and retrain the model weekly.

Architecture

Microservice pipeline for OCR, extraction, validation, and enrichment.

Vector search for template matching, enabling adaptive parsing strategies.
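
The idea behind template matching can be sketched with plain cosine similarity. This is an assumption-laden illustration: the embedding vectors, the 0.8 threshold, and the function names are hypothetical, and a production system would typically query a vector index rather than loop over templates.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def match_template(doc_vec, templates, min_score=0.8):
    """Return (template_name, score) for the closest known template.

    The name is None when nothing clears min_score, which signals that a
    generic parsing strategy should be used instead of a template-specific one.
    """
    best_name, best_score = None, 0.0
    for name, vec in templates.items():
        score = cosine_similarity(doc_vec, vec)
        if score > best_score:
            best_name, best_score = name, score
    if best_score < min_score:
        return None, best_score
    return best_name, best_score
```

A document that matches a known template can then be parsed with that template's strategy, while an unmatched one falls back to a generic path.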

Impact

Consistent data delivery for downstream risk and compliance workflows.

Reduced onboarding time for new document types from weeks to days.

Stack

Python, FastAPI, OpenAI, Docker, Kubernetes, Snowflake, Prefect.

Learnings

Transparency in extraction confidence was as important as raw accuracy.

Human-in-the-loop review is what keeps the model reliable over time.

Impact

  • Cut manual review time by 65% through confidence scoring and feedback loops.
  • Achieved 92% extraction accuracy across 40+ document templates.
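
One way confidence scoring can reduce manual review is to send only low-confidence fields to a reviewer. The sketch below assumes each extracted field carries a confidence between 0 and 1; the ExtractedField type, the route_fields helper, and the 0.9 threshold are hypothetical, not the values used in the project.

```python
from dataclasses import dataclass


@dataclass
class ExtractedField:
    name: str
    value: str
    confidence: float  # assumed to be in the range 0.0 to 1.0


def route_fields(fields: list[ExtractedField], threshold: float = 0.9):
    """Split fields into auto-accepted ones and ones that need human review."""
    accepted = [f for f in fields if f.confidence >= threshold]
    needs_review = [f for f in fields if f.confidence < threshold]
    return accepted, needs_review
```

Corrections made on the needs_review side are the natural input for the feedback loop.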
