The LLM Is the Easy Part

If you've used GPT-4o or Claude with structured outputs, you know the basic loop: hand the model some text and a JSON schema, get a structured object back. The hello-world version is genuinely magical and gives the impression that "AI extraction" is a one-call problem. It isn't.

Real document extraction pipelines spend maybe 5% of their code on the LLM call and 95% on everything else: getting the source documents (scraping, downloading, monitoring inboxes), turning them into clean text the model can read (HTML parsing, PDF extraction, OCR), validating the model's output against business rules, handling failures, caching aggressively to control costs, and storing the structured results in a queryable form. The LLM's accuracy is a function of all that surrounding work; bad input produces bad extraction, no matter how good the model is.

This article walks through the pipeline shape I use for production AI extraction work. Same architecture for invoices, customs filings, contracts, news articles, vendor confirmations — only the parser and the schema change.

The Pipeline Stages

Eight stages, each idempotent, each writing to a shared database:

  1. Discover — find the documents to process (a search result, a directory listing, an inbox query, a vendor portal).
  2. Fetch / download — pull each source artifact (HTML page, PDF, image) to local disk under a content-hashed filename.
  3. Parse — turn the raw artifact into clean text. HTML via BeautifulSoup, PDFs via pdfplumber or pymupdf, scanned images via tesseract.
  4. Extract — send the cleaned text to the LLM with a strict JSON schema. Get back a structured object.
  5. Validate — check the LLM output against the schema, business rules, and confidence thresholds.
  6. Store — write the validated record to MySQL (or wherever it needs to live).
  7. Review-queue — anything that failed validation goes here for human inspection.
  8. Export — produce the downstream output (CSV, API push, etc.).

The stages don't talk to each other directly. They read from and write to the same set of database tables. A run of "stage 4 + 5 only" against documents stage 3 already parsed is a normal operation, not a recovery effort. Re-running stage 4 with a new prompt, against last week's parsed text, costs nothing extra and gives you an apples-to-apples comparison of prompt iterations.
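
In code, each stage is just a function that reads its pending rows from the shared database and writes results back. A minimal sketch of the pattern for stage 3, using sqlite3 as a stand-in for MySQL; the documents and parsed_documents tables and their columns are illustrative, not a fixed schema:

import sqlite3
from typing import Callable

def run_parse_stage(conn: sqlite3.Connection, parse_fn: Callable[[str], str]) -> None:
    # Stage 3 in isolation: find fetched documents with no parsed text yet,
    # parse them, and write the text back. Rows that already have parsed text
    # are skipped, so re-running the stage is a cheap no-op.
    rows = conn.execute(
        "SELECT id, local_path FROM documents "
        "WHERE id NOT IN (SELECT document_id FROM parsed_documents)"
    ).fetchall()
    for doc_id, path in rows:
        text = parse_fn(path)  # whatever parser applies to this document type
        conn.execute(
            "INSERT INTO parsed_documents (document_id, text) VALUES (?, ?)",
            (doc_id, text),
        )
    conn.commit()

Stage 4 follows the same shape: select parsed rows with no extraction, call the LLM, insert the result.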

Stage 3: Parsing Matters More Than the LLM

The single biggest lever in extraction quality is what you feed the model. A clean, well-structured chunk of text produces clean output. A noisy chunk full of headers, footers, watermarks, and OCR garbage produces garbage no matter what model you use.

For PDFs, pdfplumber is my default for text-based files (invoices, statements, vendor docs that were generated digitally). It handles layout reasonably well, preserves table structure, and lets you crop to specific page regions when the layout is consistent. pymupdf is faster but less layout-aware; good for high-volume work on uniform documents.
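
A minimal pdfplumber sketch for text-based PDFs; the commented-out crop coordinates are made-up numbers you would tune per layout:

import pdfplumber

def parse_text_pdf(path: str) -> str:
    # Extract text page by page. For consistent layouts, cropping to a known
    # region (e.g. skipping a letterhead band) removes noise before the LLM sees it.
    parts = []
    with pdfplumber.open(path) as pdf:
        for page in pdf.pages:
            # page = page.crop((0, 80, page.width, page.height))  # optional region crop
            parts.append(page.extract_text() or "")
    return "\n\n".join(parts)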

For scanned PDFs and images, tesseract with image preprocessing (grayscale, deskew, denoise via OpenCV) gets you reasonable text on most legible scans. For low-quality scans, the cost of better OCR (commercial services like Azure Document Intelligence or Google Document AI) is usually worth it — sending bad OCR output to GPT-4o produces bad extraction and burns LLM tokens on noise.
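
A rough preprocessing pass before tesseract looks something like this; the denoising strength and thresholding choices are starting points rather than tuned values, and deskewing is omitted for brevity:

import cv2
import pytesseract

def ocr_scan(path: str) -> str:
    # Grayscale, denoise, and binarize the scan before handing it to tesseract.
    img = cv2.imread(path)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    denoised = cv2.fastNlMeansDenoising(gray, h=30)
    _, binarized = cv2.threshold(denoised, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    return pytesseract.image_to_string(binarized)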

For HTML, the parser strips boilerplate (nav, footer, ads, comment sections) before the text reaches the LLM. readability-lxml or trafilatura handle this well for article-shaped pages. For structured pages (product pages, listings), a per-site parser that targets the relevant sections directly is usually cleaner than relying on a generic content extractor.
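
For article-shaped HTML, the trafilatura call is essentially one line; the wrapper below just guards against pages where extraction fails:

import trafilatura

def parse_article_html(html: str) -> str:
    # Strip nav, ads, and comment sections; return an empty string rather
    # than passing raw HTML through when extraction fails.
    extracted = trafilatura.extract(html, include_comments=False, include_tables=True)
    return extracted or ""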

Whatever the parser, output a single string with semantic structure preserved (headings as Markdown #, tables as pipe-delimited rows, paragraphs separated by blank lines). This format costs slightly more tokens than raw text but improves extraction accuracy noticeably — the model uses the structural cues.
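
For the table part, a small helper that renders parsed rows (e.g. pdfplumber's extract_table output) as pipe-delimited lines is usually enough; this is a sketch, not a full Markdown renderer:

def table_to_pipes(rows: list[list[str | None]]) -> str:
    # Render a parsed table as pipe-delimited rows so the model can see
    # which values belong to which column.
    lines = []
    for i, row in enumerate(rows):
        cells = [(cell or "").strip() for cell in row]
        lines.append("| " + " | ".join(cells) + " |")
        if i == 0:
            lines.append("|" + "|".join(" --- " for _ in cells) + "|")  # header separator
    return "\n".join(lines)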

Stage 4: The LLM Call

Use structured outputs (OpenAI's response_format={"type": "json_schema", ...}, or Anthropic's tool-use pattern) and a Pydantic schema. Don't try to parse free-form prose responses; the validation overhead alone makes structured outputs the right choice.

from pydantic import BaseModel, Field
from openai import OpenAI

client = OpenAI()

class LineItem(BaseModel):
    description: str
    quantity: int
    unit_price_cents: int
    total_cents: int

class Invoice(BaseModel):
    vendor_name: str
    invoice_number: str
    issue_date: str = Field(description="ISO 8601 date")
    due_date: str | None = Field(description="ISO 8601 date; null if not present in document")
    total_cents: int = Field(description="Total in smallest currency unit")
    currency: str = Field(description="ISO 4217 code")
    line_items: list[LineItem]

def extract_invoice(text: str) -> Invoice:
    response = client.chat.completions.parse(
        model="gpt-4o-2024-11-20",
        messages=[
            {"role": "system", "content": "Extract invoice data. Use ISO formats. Currency in smallest unit (cents)."},
            {"role": "user", "content": text},
        ],
        response_format=Invoice,
    )
    return response.choices[0].message.parsed

Schema design tips that have paid off repeatedly:

  • Use the smallest currency unit. Always. The model handles "1499 cents" more reliably than "$14.99" and you avoid floating-point issues downstream.
  • Specify date format in the field description. "ISO 8601 date" is the magic string that gets you 2026-05-01 instead of "May 1, 2026".
  • Allow nullable fields explicitly. The model will hallucinate values for missing fields rather than admit absence; str | None with a clear description ("null if not present in document") cuts this dramatically.
  • Avoid free-text fields if you can. Anywhere you can replace a string field with an enum or a constrained set of values, do it. The model gets the categorical answer right; the model invents the prose answer. (A short sketch of the nullable and enum patterns follows this list.)
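
A sketch of the last two tips, using a hypothetical InvoiceHeader fragment; the enum values and field names are illustrative:

from enum import Enum
from pydantic import BaseModel, Field

class DocumentType(str, Enum):
    invoice = "invoice"
    credit_note = "credit_note"
    statement = "statement"

class InvoiceHeader(BaseModel):
    # A categorical field constrained to an enum, and an explicitly nullable
    # field with the "null if not present" instruction in its description.
    document_type: DocumentType
    po_number: str | None = Field(description="Purchase order number; null if not present in document")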

Stage 5: Validation Is Non-Negotiable

The LLM will get things wrong. Sometimes spectacularly — a hallucinated vendor name, a date from a different document, a total that doesn't match the line items. Validation is the layer that catches these before they hit the downstream system.

Three validation passes:

Schema validation — the model returned a parseable object that matches the Pydantic schema. Almost always passes when you're using structured outputs. When it doesn't, retry once with the original text and a "your previous response wasn't valid JSON" prompt.

Business-rule validation — line items sum to the total, dates fall in plausible ranges, foreign keys resolve to existing entities, currency codes are valid. Each rule is a small Python function returning a ValidationResult with a reason. Failures don't necessarily kill the record; they downgrade its confidence.
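
A minimal sketch of that rule pattern, reusing the Invoice model from the stage 4 example; the ValidationResult shape shown here is illustrative:

from dataclasses import dataclass

@dataclass
class ValidationResult:
    passed: bool
    reason: str = ""

def check_line_items_sum(invoice: Invoice) -> ValidationResult:
    # Business rule: line-item totals must add up to the header total.
    expected = sum(item.total_cents for item in invoice.line_items)
    if expected != invoice.total_cents:
        return ValidationResult(False, f"line items sum to {expected}, header total is {invoice.total_cents}")
    return ValidationResult(True)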

Cross-source validation — when the same field can be checked against another data source (invoice total against the matching purchase order, vendor name against the vendor master), do it. Disagreements get logged and routed to review.

Records that pass all three validations go to store. Records that fail any validation go to the review queue with the failure reason attached. Operators see exactly why a record needs attention and can correct it in one screen.
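
The routing itself is a single write against the shared database. A sketch, again with sqlite3 standing in for MySQL and illustrative table and column names; the status values match the review-queue table described in the next section:

import sqlite3

def route_record(conn: sqlite3.Connection, record_id: int, failures: list[str]) -> None:
    # Auto-approve clean records; queue anything with a validation failure,
    # keeping the reasons so the operator sees why it needs attention.
    if failures:
        conn.execute(
            "UPDATE extracted_records SET status = 'needs_review', review_reason = ? WHERE id = ?",
            ("; ".join(failures), record_id),
        )
    else:
        conn.execute(
            "UPDATE extracted_records SET status = 'auto_approved' WHERE id = ?",
            (record_id,),
        )
    conn.commit()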

Stage 7: The Review Queue

The review queue is what makes the rest of the system trustable. Without it, every borderline case becomes a choice between "ship bad data" and "drop the record." With it, borderline cases get the human review they deserve and the cleaner cases flow through automatically.

The queue is just a database table: each row references an extracted_records row and carries a status (auto_approved, needs_review, operator_corrected, operator_rejected) and a review_reason. The operator UI is a simple list of needs_review rows, side-by-side with the original document, with the extracted fields editable. Approve, correct, or reject. The decision gets written back to the record and the document moves out of the queue.

The operator's decisions are also training data. Periodically, a sample of corrections gets fed back into the prompt as few-shot examples, and the prompt's accuracy rises. This is a slow loop, not a real-time learning system, but over months it noticeably reduces the volume hitting the review queue.

Cost Discipline

LLM pricing makes naive extraction pipelines expensive at scale. A few practices keep costs predictable:

Cache by content hash. Every LLM call includes the model name, the prompt template version, and a hash of the input text in the cache key. Re-running yesterday's pipeline against the same documents costs zero in API fees.
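
The cache key can be as simple as this; the prompt-version string is whatever label you already use to track prompt iterations:

import hashlib

PROMPT_VERSION = "invoice-extract-v3"  # bump whenever the prompt template changes

def cache_key(model: str, text: str) -> str:
    # Key on everything that changes the answer: model, prompt version, input text.
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    return f"{model}:{PROMPT_VERSION}:{digest}"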

Tier the model. Easy cases (short documents, high parse confidence, simple schemas) go to the cheaper model. Hard cases get the expensive one. The router decides based on document length, layout complexity, and historical success rate. For invoice-style work, this typically routes 70-80% of documents to the cheap tier with no measurable accuracy loss.
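
The router can start as a couple of conditions; the thresholds and model names below are assumptions to tune per workload, not recommendations:

def pick_model(text: str, parse_confidence: float) -> str:
    # Short, cleanly parsed documents go to the cheap tier; everything else
    # gets the stronger model.
    if len(text) < 8_000 and parse_confidence >= 0.9:
        return "gpt-4o-mini"
    return "gpt-4o-2024-11-20"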

Log token counts per call. Cost-per-record is one of the few metrics the operations team will ask about. Make it queryable from day one rather than a post-hoc reconstruction.

Truncate aggressively. A 50-page contract doesn't need to go to the model in full. A pre-pass that identifies the relevant sections (using cheap embeddings or keyword search) and sends only those sections to the extractor cuts token usage by an order of magnitude on long documents.
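
The keyword version of that pre-pass is a few lines; the keyword list and context window are per-document-type choices:

def relevant_sections(text: str, keywords: list[str], window: int = 2) -> str:
    # Keep only paragraphs that mention a keyword, plus a few neighbours for
    # context, instead of sending the full document to the extractor.
    paragraphs = text.split("\n\n")
    keep: set[int] = set()
    for i, para in enumerate(paragraphs):
        lower = para.lower()
        if any(kw.lower() in lower for kw in keywords):
            keep.update(range(max(0, i - window), min(len(paragraphs), i + window + 1)))
    return "\n\n".join(paragraphs[i] for i in sorted(keep))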

Wrap-Up

An AI extraction pipeline is a scraping system with an LLM stage in the middle. The architecture I use for both is the same — independent stages, shared database, idempotent, resumable, observable. The LLM does one well-defined job; everything around it makes the LLM's job possible and its output trustworthy.

The biggest mistakes I see in this space are skipping the validation layer ("the model usually gets it right"), skipping the review queue ("we'll handle errors later"), and underinvesting in the parser ("we'll just dump the whole PDF in"). All three are tempting because they let you ship the demo faster. All three produce systems that quietly poison their downstream consumers with bad data.

Get the surrounding work right and the LLM will reward you. For the broader picture — operator dashboards, the scraping front-end, schema patterns — see the AI Data Extraction, Python Web Scraping, and Automation Dashboards hubs.

Dustin Holdiman — Founder, ThinkGenius

Software engineer focused on production scraping, browser automation, anti-bot infrastructure, AI extraction pipelines, and the dashboards that let businesses actually run them. Builds custom Python, Playwright, Kameleo, Undetectable, MySQL, and operations-tooling systems for companies that have outgrown off-the-shelf tools.

Need a Custom Automation System?

Need help building a production scraping, browser automation, or AI data extraction system? I build custom Python, Playwright, Kameleo, Undetectable, MySQL, and dashboard-based automation systems for businesses.