An "AI extraction" system is rarely just an LLM call. It's a pipeline — scrape, download, parse, structure, validate, store — where the LLM does one job (turn messy text into clean fields) and everything around it does the unsexy work that makes the LLM's output trustworthy. This topic covers how I build those.
An AI data extraction pipeline takes unstructured input — a webpage, a PDF, a scanned document, an email — and produces structured output: a row in a database, a record in a JSON file, a clean entry in a downstream system. The "AI" part is one stage in that pipeline (usually an LLM call with a JSON schema or a function-calling response format). Most of the engineering is everything else: getting the source material, parsing it into something the LLM can read, validating what comes back, and recovering when it doesn't.
This topic covers the full pipeline: scrape → download → parse (HTML, PDF, OCR) → LLM-extract → validate → store. The patterns work whether the volume is 50 documents a day or 50,000.
Pulling structured fields (vendor, totals, line items, dates) from PDFs that arrive as email attachments, vendor portal downloads, or scanned uploads.
Extracting clauses, parties, dates, dollar amounts, and references from long-form legal or compliance documents — feeding into review queues or compliance dashboards.
Pages where the data you want is there but isn't in a clean DOM structure — product specs scattered across different layouts, listing details written as free-form prose, etc.
Customer service emails, vendor confirmations, shipping notifications — extracting the order ID, the issue type, the requested action, the delivery date, and routing accordingly.
Public filings (customs records, regulatory disclosures, court documents) where the data is published as PDFs or unstructured pages and needs to become queryable.
Pulling structured fields out of news articles, research papers, or analyst reports for downstream analysis and aggregation.
The pipeline is structured the same way as my scraping systems: independent stages, each writing to a shared database, all idempotent and resumable. The stages are discover → fetch → download → parse → extract → validate → store → export. Skipping or re-running any single stage is a normal operation, not a recovery effort.
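A minimal sketch of that stage pattern, assuming a SQLite status table; the table name, columns, and handler signature are illustrative, not the production schema:

```python
import sqlite3

STAGES = ["discover", "fetch", "download", "parse", "extract",
          "validate", "store", "export"]

def init_db(conn: sqlite3.Connection) -> None:
    conn.execute(
        "CREATE TABLE IF NOT EXISTS documents ("
        "doc_id TEXT PRIMARY KEY, "
        "stage TEXT NOT NULL, "    # the stage this document is waiting on
        "status TEXT NOT NULL)"    # pending | failed | done
    )

def enqueue(conn: sqlite3.Connection, doc_id: str) -> None:
    # Discover writes new documents; INSERT OR IGNORE makes re-discovery idempotent.
    conn.execute("INSERT OR IGNORE INTO documents VALUES (?, 'fetch', 'pending')",
                 (doc_id,))

def run_stage(conn: sqlite3.Connection, stage: str, handler) -> None:
    # Only pick up documents waiting at this stage, so re-running a stage is a
    # no-op for anything already advanced -- that is what makes it resumable.
    rows = conn.execute(
        "SELECT doc_id FROM documents WHERE stage = ? AND status = 'pending'",
        (stage,)).fetchall()
    for (doc_id,) in rows:
        try:
            handler(doc_id)
            i = STAGES.index(stage)
            if i + 1 < len(STAGES):
                conn.execute("UPDATE documents SET stage = ? WHERE doc_id = ?",
                             (STAGES[i + 1], doc_id))
            else:
                conn.execute("UPDATE documents SET status = 'done' WHERE doc_id = ?",
                             (doc_id,))
        except Exception:
            # Loud failure: mark the document and keep the batch moving.
            conn.execute("UPDATE documents SET status = 'failed' WHERE doc_id = ?",
                         (doc_id,))
```

Because every stage reads its work from the shared table rather than from the previous stage's return value, skipping or re-running one stage never requires touching the others.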
Discover finds the documents (a directory listing, an inbox query, a search result). Fetch / download pulls them to local disk, gzipped, with a deterministic cache key so the same source is never downloaded twice. Parse turns them into clean text — HTML via BeautifulSoup, PDFs via pdfplumber or pymupdf, scanned images via tesseract with image preprocessing. Extract sends the cleaned text to the LLM with a strict JSON schema (function calling or structured outputs) — usually GPT-4o or Claude Sonnet, sometimes a local model for cost-sensitive volume work.
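The deterministic cache key for the fetch/download stage can be sketched like this; the injected `fetch` callable stands in for the real HTTP client (e.g. a `requests` call), so the caching logic itself is testable offline:

```python
import gzip
import hashlib
from pathlib import Path

def cache_key(url: str) -> str:
    # Deterministic: the same source URL always maps to the same file on disk,
    # so the same source is never downloaded twice.
    return hashlib.sha256(url.encode("utf-8")).hexdigest()

def fetch_cached(url: str, cache_dir: Path, fetch) -> bytes:
    path = cache_dir / (cache_key(url) + ".gz")
    if path.exists():
        return gzip.decompress(path.read_bytes())
    body = fetch(url)  # the only place a network request actually happens
    path.write_bytes(gzip.compress(body))
    return body
```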
Validate is the stage most pipelines skip and most pipelines need. Every LLM response is checked against the schema (Pydantic / JSON Schema), against a set of business rules (totals add up, dates are in range, foreign keys resolve), and against a confidence threshold. Failed records go to a review queue, not the main output. Store writes the final structured record to MySQL or SQLite. Export produces CSV / XLSX / JSONL for downstream consumers.
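The two-layer check, sketched here with stdlib only for brevity (the production schema layer uses Pydantic / JSON Schema); the invoice fields and rule are illustrative:

```python
from decimal import Decimal

def validate_invoice(raw: dict) -> tuple[bool, list[str]]:
    errors = []
    # Layer 1: schema -- required fields present (Pydantic handles this, plus
    # types, in the real pipeline).
    for field in ("vendor", "total", "line_items"):
        if field not in raw:
            errors.append(f"missing field: {field}")
    if errors:
        return False, errors
    # Layer 2: business rules -- the line items must sum to the header total.
    # Decimal rather than float: money arithmetic should be exact.
    total = Decimal(str(raw["total"]))
    items = sum(Decimal(str(li["amount"])) for li in raw["line_items"])
    if items != total:
        errors.append(f"line items sum to {items}, header total is {total}")
    return not errors, errors
```

A record that fails either layer carries its error list into the review queue, so a human sees *why* it was held back, not just that it was.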
Cost discipline matters at scale. Every LLM call is cached by content hash — re-running yesterday's pipeline against the same documents costs nothing in API fees. Token counts are logged per call so cost-per-record is visible from day one. For high-volume work, the pipeline routes "easy" documents (high parse confidence, short text) to a cheaper model and reserves the expensive model for hard cases.
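The caching and routing logic can be sketched as pure functions; the thresholds and model names here are illustrative placeholders, and `call` stands in for the real API client:

```python
import hashlib
import json

def content_key(model: str, prompt: str, text: str) -> str:
    # The key covers everything that can change the output; if any of these
    # differ, it is a genuinely new call and should miss the cache.
    payload = json.dumps([model, prompt, text], ensure_ascii=False)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

def cached_extract(cache: dict, model: str, prompt: str, text: str, call):
    key = content_key(model, prompt, text)
    if key not in cache:
        cache[key] = call(model, prompt, text)  # the only place API fees accrue
    return cache[key]

def pick_model(parse_confidence: float, n_tokens: int) -> str:
    # Routing rule from the text: easy documents (high parse confidence, short
    # text) go to the cheap model; hard cases get the expensive one.
    if parse_confidence >= 0.9 and n_tokens < 2000:
        return "cheap-model"
    return "expensive-model"
```

In practice the cache dict is backed by the same database as the pipeline state, so yesterday's responses survive a restart.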
Failure handling is loud. An LLM call that returns invalid JSON is retried once with a "your previous response wasn't valid JSON, here it is again" prompt. A second failure marks the document extract_failed with the raw response saved for manual review. The pipeline keeps moving.
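A sketch of the retry-once shape, with `call(text, bad_response)` standing in for the LLM call (on the retry, the invalid response is passed back so the prompt can include it); the names are illustrative:

```python
import json

def extract_with_retry(call, doc_text: str) -> dict:
    raw = call(doc_text, None)
    try:
        return {"status": "ok", "data": json.loads(raw)}
    except json.JSONDecodeError:
        pass
    # One retry, with the bad output echoed back into the prompt.
    retry_raw = call(doc_text, raw)
    try:
        return {"status": "ok", "data": json.loads(retry_raw)}
    except json.JSONDecodeError:
        # Second failure: mark extract_failed, keep the raw response for
        # manual review, and let the pipeline keep moving.
        return {"status": "extract_failed", "raw": retry_raw}
```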
OpenAI GPT-4o providing AI-generated chat suggestions inside a multi-session browser automation platform, with structured outcome capture and a training-data feedback loop.
Multi-stage pipeline turning raw HTML into a confidence-scored dataset — same architectural pattern as a full LLM extraction pipeline, with rule-based scoring instead of an LLM at the extract stage.
AI extraction work usually pairs with Python Web Scraping (the front half of the pipeline) and Automation Dashboards (the review queue and operator UI on top).
From single-document parsers to multi-stage scrape-download-extract pipelines processing thousands of files a day, I build AI extraction systems with proper validation, structured output, and recoverable failure modes.