High-confidence U.S. importers written to analyst-ready deliverables.
Tariff-Exposure Data Pipeline — A Resumable, Cloudflare-Resilient Scraper
A six-stage Python pipeline — discover, fetch, parse, enrich, score, export — that turns public U.S. customs Bill-of-Lading data into a confidence-scored CSV of tariff-exposed importers, while respecting robots.txt, surviving Cloudflare bot protection, and resuming cleanly across multi-hour parallel runs.
This is an open-source proof-of-concept built by ThinkGenius. The data source, ImportYeti, is publicly accessible and the pipeline is designed around responsible-use constraints. Output is intended to support tariff-refund triage, not to substitute for legal or trade analysis.
- Python
- Playwright
- Kameleo
- SQLite (WAL)
- BeautifulSoup
- Rotating Proxies
- Multi-Process Workers
- CSV / XLSX / JSONL
- Responsible Scraping
Mean score across exported companies on a 0–5 rubric.
Cloudflare blocks detected, logged, recovered, and moved past.
Block-only rotation kept almost the entire pool cold.
Parallel browser workers against a single SQLite WAL database.
The Brief
Find likely tariff-exposed U.S. importers from public Bill-of-Lading data, then package the result as a clean, deduplicated file an analyst could triage immediately. The system needed to survive real browser defenses, avoid wasteful proxy churn, recover from long-run failures, and make every scoring decision traceable.
- Project Type
- Public-data scraping, normalization, and confidence scoring pipeline.
- Domain
- Trade data and tariff-exposure analysis for U.S. imports from tariff-affected regions.
- Source
- Public ImportYeti.com pages republishing U.S. customs Bill-of-Lading data.
- Architecture
- Six-stage pipeline backed by one SQLite database in WAL mode.
- Concurrency
- Kameleo + Playwright worker processes, each assigned a non-overlapping proxy slice.
- Deliverables
- CSV, XLSX, and streaming JSONL exports of confidence-scored company records.
The Problem
U.S. companies that import goods from China and other tariff-affected regions may be eligible for duty refunds, exclusions, or other tariff-relief programs — but identifying which companies are exposed is a manual research slog. The underlying evidence (Bills of Lading filed with U.S. Customs) is public, and ImportYeti republishes it in a navigable form, but the site is fronted by Cloudflare bot protection and structured for human browsing, not bulk analysis.
The goal: produce a clean, deduplicated, analyst-ready list of likely tariff-exposed importers, scored by how confident the system is that each company actually imports physical goods from a high-tariff country — without bypassing CAPTCHAs, ignoring robots.txt, or burning through proxy infrastructure.
The Solution
ThinkGenius built a six-stage Python pipeline — discover → fetch → parse → enrich → score → export — where every stage writes its results to a shared SQLite database before the next stage reads them. Each stage is independently runnable, idempotent, and resumable. The fetcher routes every request through a Kameleo anti-detect browser profile attached to Playwright, with a rotating pool of mobile or static dedicated proxies, and rotates only on observed blocks rather than proactively. The scorer assigns each candidate a 0–5 confidence score based on direct evidence of China shipments, supplier diversity, product-token relevance, and address verification. Anything scoring ≥ 4 is exported.
The Six-Stage Pipeline
The architecture separates network risk, parsing, relationship-building, scoring, and export into independent stations. That makes the system debuggable: a parser tweak never burns proxies, and a crashed worker does not poison the export.
Discover
Walks ~40 hand-curated product tag pages (industrial goods, hardware, packaging, automotive, etc.), then snowballs through supplier and company pages — every supplier reveals new buyers, every buyer reveals new suppliers. Bounded by configurable keyword, pagination, and page-count limits. Duplicates are silently rejected at the URL key.
Fetch
The hardest stage. Every page is loaded through a Kameleo browser profile (real fingerprint: canvas, WebGL, fonts, timezone, audio context) attached to Playwright, routed through a proxy. The disk cache (gzipped HTML) means parser changes never re-burn proxies and DB resets cost zero network requests.
Parse
Pure CPU work — no network, no risk. Two parsers extract company-page fields (name, address, shipment count, supplier list, country breakdown, recent BOL descriptions) and supplier-page fields (name, country, customer list). Customer/supplier graphs are stored as JSON blobs for join-free reconstruction downstream.
Enrich
Resolves URL references into actual relationships. Each company candidate is cross-linked with its parsed suppliers, the per-country breakdown is recomputed from crawled data (overriding the site's headline numbers when ours are more complete), and customer-side cross-references seed stub records for companies the next pass should fetch.
Score
Rule-based 0–5 confidence model. +2 for at least one China supplier, +1 each for >5 distinct foreign suppliers, physical-goods keywords in recent BOL descriptions, parseable U.S. address, and ≥ 25% supplier concentration in tariff-affected regions. Threshold defaults to 4. Average exported confidence on real runs: 4.88.
Export
Three deliverables: a deduplicated CSV (the headline output), an XLSX twin for spreadsheet analysts, and a streaming JSONL sidecar that's appended live as each record is scored — so a multi-hour run can be tail -f'd in real time without waiting for the final pass.
Surviving Cloudflare Without Bypassing It
ImportYeti is fronted by Cloudflare's bot protection. A plain Python HTTP client receives a 403 "Just a moment…" challenge every time. Rather than reach for CAPTCHA solvers or fingerprint-spoofing tooling, the fetcher detects the challenge, aborts cleanly, and logs the block — and the supported path forward is browser-based: Kameleo profiles attached to Playwright, presenting a real fingerprint Cloudflare treats as a normal user, routed through proxies that aren't on the project author's residential IP.
Two proxy pools are supported: rotating mobile carrier IPs (each with a rotate_url for forced refresh) and ~580 static dedicated Webshare datacenter IPs (no rotation URL, so they're put on a 24-hour cooldown after exhaustion). The scaling recipe currently uses the static pool — the larger pool absorbs blocks more gracefully, and Cloudflare doesn't appear to penalize Webshare IPs more than mobile when the browser fingerprint is convincing.
The Block-Only Rotation Rule
Proxy rotation is treated as a recovery action, not a superstition. The fetcher uses each IP until the site itself says the session is blocked, then recycles the browser profile and retries once on a fresh proxy.
Never Rotate Proactively
No pageview counters, no time-based churn. Each proxy is used until ImportYeti's ~25-pageview wall (or any other 403/429/503/CAPTCHA marker) actually fires. Only then does the fetcher rotate. This extracts every available view per IP instead of burning through the pool ~20% faster than necessary.
Rotate, Recycle, Retry Once
On a real block: mark the proxy exhausted (24-hour cooldown for static, rotate_url call for mobile), recycle the Kameleo profile to kill cookies and the fingerprint session, pick the next available proxy from the LRU pool, and retry the URL exactly once. Failed twice → marked error, pipeline moves on. No retry storms.
5 Proxies of 580 Consumed
A typical 6-worker run consumes only 5–10 proxies out of 580 — the rest stay cool for the next day. The validated 844-record run absorbed 89 blocks and never exhausted the pool. This is the difference between a fragile scraper and one that runs nightly without intervention.
Parallel Workers on a Single SQLite Database
The fetch stage is the slowest by far — browser startup, network latency, Cloudflare patience time. To keep wall-clock time reasonable, run-parallel shards the pending URL queue across N worker processes, each with its own Kameleo profile and its own non-overlapping slice of the proxy pool (assigned via the IMPORTYETI_PROXY_IDS env var, partitioned round-robin so workers never compete for the same outbound IP).
All workers write to the same SQLite database in WAL mode with a 10-second busy timeout. Six workers is the empirical sweet spot — eight pushed SQLite into occasional database is locked retries. Worker fetch loops only insert page records; relationship building and scoring happen in single-process enrich/score stages where contention is zero.
Resumability Superpowers
The pipeline is designed for overnight reality: crashes, parser changes, partial exports, and stale snapshots. Every stage can be restarted from the last durable database state.
Pick Up Exactly Where It Left Off
--resume re-runs the pipeline without redoing any work: pending URLs get fetched, fetched-but-unparsed pages get parsed, in-progress rows from a crashed worker get reset to pending and retried. Nothing is duplicated.
Stage-Level Reruns, Network-Free
python -m src.main parse reprocesses every cached HTML page after a parser tweak — zero network calls, zero proxy burn. Same for score and export: each stage is independently runnable against the existing DB.
The Cache Outlives the Database
Every successfully fetched page is gzipped to data/cache/. If the SQLite DB is wiped, a re-discover + load-cache pass rebuilds the database from disk without burning a single proxy. Debugging a parser edge case takes seconds, not minutes.
Streaming JSONL Sidecar
The CSV is rewritten from scratch on each export, so during overnight runs there's no partial visibility. The JSONL sidecar fixes that: every record is appended the moment it's scored. tail -f the file to watch results land in real time.
The live_export.sh Loop
A small companion script runs score && export on a configurable interval (default 5 minutes) against the live SQLite database while workers are still fetching. The CSV/XLSX stay reasonably fresh throughout a multi-hour run, then a final snapshot is taken when workers exit.
Offline Replay From Saved HTML
load-fixtures registers manually-saved HTML files (one per company / supplier / tag page) into the cache + DB without any network requests. The full parse → score → export chain runs against them — useful for testing, demos, and air-gapped reproduction.
Confidence Scoring
The scorer is deliberately conservative — better to under-export than ship false positives to a tariff analyst. The rubric:
- +2 — at least one supplier in China (or another high-tariff country)
- +1 — more than 5 distinct foreign suppliers
- +1 — at least one recent BOL description containing physical-goods keywords
- +1 — address present and parseable to a real U.S. state
- +1 — country breakdown shows ≥ 25% of suppliers in tariff-affected regions
Score ≥ 4 → written to final_companies and exported. Score < 4 → kept in the candidate table for inspection but never reaches the deliverable. Threshold is configurable via min_score_to_export.
What the Pipeline Deliberately Doesn't Do
Every "don't" is a conscious tradeoff against scope creep, fragility, or ethical risk:
- No CAPTCHA solvers. If Kameleo + a real fingerprint can't get through, the run aborts cleanly.
- No proactive proxy rotation. Burns IPs for no reason. Rotate only on observed blocks.
- No headless mode. Cloudflare's bot signal weights headless flags heavily. Kameleo runs headed; windows just sit minimized.
- No retry storms. Exactly one retry per blocked page on a fresh proxy. After that, the URL is marked
errorand the pipeline moves on. - No disallowed paths.
/api/*,/cdn-cgi/,/mx/,/us-export/,/partnerships/are hard-blocked./search?qis disallowed by robots.txt and is opt-in only. - No PII enrichment. Publicly listed business addresses and shipment metadata only. No people, no executives, no email harvesting.
- No multi-table writes from worker fetch loops. Workers only insert page records; everything relational happens in single-process stages where SQLite contention is zero.
Validated Outcomes
The proof-of-concept produced a practical analyst deliverable while keeping block recovery, proxy usage, and output quality observable throughout the run.
844 Records, 4.88 Confidence
The proof-of-concept run shipped 844 high-confidence U.S. importers in roughly an hour with 4 workers and default limits. Average confidence score across the export: 4.88 / 5. 89 Cloudflare blocks were absorbed and recovered cleanly during the run.
5 of 580 IPs Consumed
The block-only rotation rule meant only 5 proxies hit cooldown across the entire validation run. The remaining 575 stayed cold and ready for the next job. Sustainable cost-per-record at scale.
~10k Companies / Overnight
The current overnight runbook: 6 workers against the static proxy pool, all ~40 keywords, deep pagination, 25k company-page cap. Watcher loop reports progress every 2 minutes; the live JSONL grows record-by-record as the pipeline scores.
Output Schema
Each exported row carries the fields a tariff analyst actually needs to triage the candidate:
company_name, normalized_company_name, website, industry_or_product_type, product_description, source_country, supplier_name, import_indicator, confidence_score, confidence_reasoning, importyeti_url, source_keyword, date_scraped, notes, company_city, company_state, company_address, phone, email, contact_page_url
Three deliverables are produced from the same underlying data: importyeti_tariff_sample.csv (rebuilt from the DB on every export), importyeti_tariff_sample.xlsx (same content, Excel-formatted), and importyeti_tariff_sample.jsonl (append-only stream, one record per scored company, written live).
Why This Architecture Holds Up
Stages As Independent Stations
Each stage reads from one table and writes to the next. A slow or broken stage queues work in its predecessor's buffer instead of crashing the line. The whole pipeline behaves like an assembly line that can be paused, restarted, or extended without coupling.
One Source of Truth
SQLite holds discovered URLs, raw page records, parsed candidates, enriched relationships, final scored companies, and per-run errors. Every operational question — "why didn't this URL get fetched?", "how confident is this record?" — has a SQL answer.
Defensive Parsing
ImportYeti renders most data server-side, but the DOM isn't officially stable. Parsers are field-by-field tolerant: a missing block degrades the row's confidence score instead of throwing. Schema drift becomes a re-parse from cache, not an outage.
Responsible-Use Posture
The pipeline ships with a descriptive USER_AGENT that names the project and provides contact info, a default 5–7 second delay between requests (with jitter), persistent caching so re-runs don't re-burn the source, and an explicit allow/deny list against ImportYeti's robots.txt. It walks tag pages and individual company/supplier profiles only — no account creation, no credential reuse, no CAPTCHA solving, no calls to disallowed API paths. For sustained high-volume work, the recommended path is ImportYeti's official paid API or a licensed trade-data provider (Panjiva, Descartes Datamyne, S&P Global). The proof-of-concept exists to prove the modeling approach, not to scrape forever.
The Stack
- Python 3
- Playwright
- Kameleo (anti-detect browser)
- BeautifulSoup
- SQLite (WAL mode)
- Webshare static datacenter proxies
- Mobile rotating proxies (iproxy)
- Multi-process worker pool
- CSV / XLSX / JSONL exporters
- Click-style CLI subcommands
- Structured logging
- Disk-based gzipped HTML cache
Why It Matters
Most "scraping" projects fail not because the parser was wrong, but because the system around the parser couldn't handle real conditions: blocks, rate limits, parser drift, mid-run crashes, partial output, and proxy waste. This pipeline is built around those failure modes from the start — every stage is independent, every fetch is cached, every block is recoverable, every run is resumable, and the deliverable updates live so the operator can watch quality emerge instead of waiting hours to find out something went wrong. That's the difference between a one-shot scraper and a system you can put on a schedule.
Need a Resilient Data Pipeline or Scraping System Built?
ThinkGenius designs and builds production data pipelines — discovery, fetching, parsing, enrichment, scoring, and export — that survive real-world conditions and produce clean, analyst-ready output.