Python is the default language for serious scraping work, but the gap between a 50-line script and a scraper that runs reliably for months is enormous. This page covers the architecture, tooling, and operational discipline that close that gap.
Python web scraping in the production sense — not "open requests, parse HTML, print results." The systems I build collect data on a schedule, recover from blocks and failures without human intervention, cache aggressively so re-runs cost nothing, and produce clean, deduplicated, queryable output that downstream tools (dashboards, exports, AI pipelines) can actually consume.
The stack varies by site — sometimes plain HTTP with requests + BeautifulSoup, sometimes Playwright with a real browser, sometimes Kameleo when bot protection is aggressive — but the surrounding architecture is the same: a queue, a fetcher, a parser, a normalizer, a storage layer, and a recovery loop.
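As a sketch of how those pieces can hang together, the skeleton below wires a queue through fetch, parse, normalize, and store steps; every name in it is hypothetical, not taken from the production code.

```python
# Illustrative skeleton of the shared architecture; all names are
# hypothetical, not the production interfaces.
from collections import deque
from typing import Callable

def run_pipeline(
    seed_urls: list[str],
    fetch: Callable[[str], str],         # requests / Playwright / Kameleo behind one signature
    parse: Callable[[str], dict],        # site-specific extraction into a raw record
    normalize: Callable[[dict], dict],   # consistent field names, units, currencies
    save: Callable[[str, dict], None],   # storage layer owns dedup and upserts
    mark_failed: Callable[[str], None],  # recovery loop picks these up later
) -> None:
    """Queue -> fetch -> parse -> normalize -> store, one page at a time."""
    queue = deque(seed_urls)
    while queue:
        url = queue.popleft()
        try:
            save(url, normalize(parse(fetch(url))))
        except Exception:
            mark_failed(url)  # log and move on; one bad page never kills the run
```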
These systems cluster into a few recurring shapes:
Pricing and catalog tracking: watching competitor prices, listings, availability, or assortment changes across hundreds or thousands of pages, repeatedly, with structured diffs feeding into reporting.
Lead and directory scraping: walking public directories or marketplace pages to build a structured dataset for sales, recruiting, or research, with deduplication, enrichment, and clean exports.
Monitoring and change detection: polling product, shipment, or third-party portal pages on a schedule to detect changes (back-in-stock, status flips, missing data) and trigger downstream actions.
Public-data work: building pipelines around public datasets (customs filings, regulatory disclosures, government APIs) where the source publishes the data but doesn't make it convenient to query.
One-off bulk collection: one-time-but-large scrapes for studies, market sizing, or model training datasets, where reproducibility and a clean cache matter more than ongoing freshness.
Scraping is the front half of most "AI extraction" projects. Clean, deduplicated source material is what makes the LLM half actually work.
Every scraper I ship is structured as a pipeline of independent stages — typically discover → fetch → parse → enrich → score/transform → export — with a shared database (SQLite for small jobs, MySQL for ongoing operational systems) holding the state between stages. Each stage is idempotent and resumable. Crashes don't lose work. Parser bugs don't burn proxies. Re-running yesterday's pipeline against a fresh cache costs zero network requests.
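In rough sketch form, the resumable-stage pattern can look like the following, assuming a hypothetical pages table with a status column; the schema is illustrative, not the production one.

```python
# Hedged sketch of one resumable stage; table and column names are
# assumptions, not the production schema.
import sqlite3

db = sqlite3.connect("pipeline.db")
db.execute("""
    CREATE TABLE IF NOT EXISTS pages (
        url    TEXT PRIMARY KEY,
        status TEXT NOT NULL DEFAULT 'discovered',  -- discovered / fetched / parsed / exported
        html   TEXT,
        data   TEXT                                 -- JSON from the parse stage
    )
""")

def fetch_stage(fetch_one) -> None:
    """Only touches rows still in 'discovered'; done work is never repeated."""
    for (url,) in db.execute("SELECT url FROM pages WHERE status = 'discovered'").fetchall():
        html = fetch_one(url)  # checks the on-disk cache before the network
        db.execute("UPDATE pages SET html = ?, status = 'fetched' WHERE url = ?", (html, url))
        db.commit()            # commit per page: a crash loses at most one unit of work
```

Re-running fetch_stage after a crash simply picks up the rows that never made it to 'fetched', which is what makes the stages idempotent and resumable.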
The fetcher is the hardest part. The default approach is plain requests with politeness delays, retries on transient errors only, and an on-disk cache. When that gets blocked, the next layer is httpx with browser-like headers, then Playwright with a real headless browser, then Playwright + Kameleo with a real fingerprint and a proxy pool. The rule is "use the lightest tool that gets the data" — every step up in sophistication costs time, money, and complexity.
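A sketch of the two lightest rungs of that ladder, assuming a Blocked exception as the escalation signal; the Playwright and Kameleo tiers are left out here.

```python
# Sketch of the bottom of the escalation ladder; the trigger logic and
# names are assumptions, not the production code.
import time
import requests
import httpx

class Blocked(Exception):
    """A block signal (403/429/503) instead of content: try the next tier."""

BLOCK_CODES = {403, 429, 503}
BROWSER_HEADERS = {
    "User-Agent": ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                   "AppleWebKit/537.36 (KHTML, like Gecko) "
                   "Chrome/124.0.0.0 Safari/537.36"),
    "Accept-Language": "en-US,en;q=0.9",
}

def tier1_requests(url: str) -> str:
    """Plain requests: retry transient 5xx with backoff, never retry a block."""
    for attempt in range(3):
        resp = requests.get(url, timeout=30)
        if resp.status_code in BLOCK_CODES:
            raise Blocked(url)
        if resp.status_code >= 500:
            time.sleep(2 ** attempt)  # transient error: back off and retry
            continue
        resp.raise_for_status()
        return resp.text
    raise Blocked(url)

def tier2_httpx(url: str) -> str:
    """httpx with browser-like headers, the next rung up."""
    resp = httpx.get(url, headers=BROWSER_HEADERS, timeout=30, follow_redirects=True)
    if resp.status_code in BLOCK_CODES:
        raise Blocked(url)
    resp.raise_for_status()
    return resp.text

def fetch(url: str) -> str:
    for tier in (tier1_requests, tier2_httpx):
        try:
            return tier(url)
        except Blocked:
            continue  # each step up costs more, so exhaust the cheap tiers first
    raise Blocked(url)  # out of light tools: time for a real browser
```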
Storage is opinionated: SQLite in WAL mode for parallel-worker pipelines, MySQL when the data needs to be queried by a dashboard or another system, JSONL append streams when downstream consumers want to tail -f the output mid-run. Output formats are standardized: CSV for analysts, XLSX for spreadsheet folks, JSONL for engineers.
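Two of those habits in sketch form, with illustrative file names; the synchronous pragma is an assumption of the sketch, not a claim from the text.

```python
# Sketch only; paths and the synchronous setting are assumptions.
import json
import sqlite3

db = sqlite3.connect("scrape.db", timeout=30)  # timeout lets workers wait out write locks
db.execute("PRAGMA journal_mode=WAL")          # readers no longer block the writer
db.execute("PRAGMA synchronous=NORMAL")        # common WAL pairing: assumption, not from the text

def emit(record: dict, path: str = "out.jsonl") -> None:
    """Append one JSON object per line so a consumer can tail -f mid-run."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
```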
Recovery is built in from day one. Every fetch that gets blocked is logged with the response signal that triggered it (403, 429, 503, CAPTCHA marker), retried exactly once on a fresh proxy/profile, and marked error if the retry fails. No retry storms, no silent failures. The pipeline keeps moving.
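A sketch of that rule, assuming a fetcher that returns the page text plus a block signal, and a hypothetical proxy pool:

```python
# Hedged sketch of the retry-once rule; the fetcher interface, proxy pool,
# and table name are assumptions.
import logging

def fetch_with_recovery(url: str, fetch, proxy_pool, db) -> str | None:
    """Retry a blocked fetch exactly once on a fresh proxy, else mark it errored."""
    for attempt in (1, 2):
        proxy = proxy_pool.current() if attempt == 1 else proxy_pool.fresh()
        text, signal = fetch(url, proxy=proxy)  # signal: '403', '429', '503', 'captcha', or None
        if signal is None:
            return text
        logging.warning("blocked url=%s signal=%s attempt=%d", url, signal, attempt)
    db.execute("UPDATE pages SET status = 'error' WHERE url = ?", (url,))
    db.commit()  # exactly one retry, then park it: no retry storms, no silent failures
    return None
```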
Six-stage Python pipeline turning public customs filings into a confidence-scored CSV of likely tariff-exposed importers — Cloudflare-resilient, resumable, 6 parallel workers on one SQLite DB. Read the case study →
Production-grade purchasing engine combining real-time monitoring, multi-carrier tracking, email intelligence, AI-assisted chat, and warehouse capture — coordinated across dozens of servers. Read the case study →
Python scraping projects often combine with: Web Scraping & Data Collection, API Integration & Backend Development, and Custom Automation Development.
From small targeted scrapers to multi-stage pipelines processing thousands of pages a night, I build Python scraping systems designed to survive real-world conditions.