This is an open-source proof-of-concept built by ThinkGenius. The data source (ImportYeti) is publicly accessible and the pipeline is designed around responsible-use constraints. Output is intended to support tariff-refund triage, not to substitute for legal or trade analysis.

  • Python
  • Playwright
  • Kameleo
  • SQLite (WAL)
  • BeautifulSoup
  • Rotating Proxies
  • Multi-Process Workers
  • CSV / XLSX / JSONL
  • Responsible Scraping

Project Snapshot

  • Project type: Public-data scraping, normalization, and scoring pipeline
  • Domain: Trade data & tariff-exposure analysis (U.S. imports from China and other tariff-affected regions)
  • Source: Public ImportYeti.com pages (republished U.S. customs Bill-of-Lading data)
  • Architecture: Six-stage pipeline backed by a single SQLite database in WAL mode
  • Concurrency: Up to 6 parallel Kameleo + Playwright worker processes, each with its own proxy slice
  • Output: Deduplicated CSV / XLSX / streaming JSONL of confidence-scored U.S. importers
  • Validated run: 844 high-confidence records exported, average confidence 4.88 / 5, 89 blocks absorbed, 5 of 580 proxies consumed

The Problem

U.S. companies that import goods from China and other tariff-affected regions may be eligible for duty refunds, exclusions, or other tariff-relief programs — but identifying which companies are exposed is a manual research slog. The underlying evidence (Bills of Lading filed with U.S. Customs) is public, and ImportYeti republishes it in a navigable form, but the site is fronted by Cloudflare bot protection and structured for human browsing, not bulk analysis.

The goal: produce a clean, deduplicated, analyst-ready list of likely tariff-exposed importers, scored by how confident the system is that each company actually imports physical goods from a high-tariff country — without bypassing CAPTCHAs, ignoring robots.txt, or burning through proxy infrastructure.

The Solution

ThinkGenius built a six-stage Python pipeline — discover → fetch → parse → enrich → score → export — where every stage writes its results to a shared SQLite database before the next stage reads them. Each stage is independently runnable, idempotent, and resumable. The fetcher routes every request through a Kameleo anti-detect browser profile attached to Playwright, with a rotating pool of mobile or static dedicated proxies, and rotates only on observed blocks rather than proactively. The scorer assigns each candidate a 0–5 confidence score based on direct evidence of China shipments, supplier diversity, product-token relevance, and address verification. Anything scoring ≥ 4 is exported.
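As a rough sketch of that stage contract (table and status names here are illustrative, not the project's actual schema), each stage selects the rows its predecessor marked ready, does its work, and advances a status column; that is what makes every stage independently runnable and idempotent:

```python
import sqlite3

# Illustrative sketch of the stage contract: a stage consumes rows in
# its input status, runs its work function, and advances the status.
# Re-running a completed stage selects zero rows, so it is idempotent.

def run_stage(conn, from_status: str, to_status: str, work) -> int:
    rows = conn.execute(
        "SELECT url FROM pages WHERE status = ?", (from_status,)
    ).fetchall()
    for (url,) in rows:
        work(url)  # e.g. fetch, parse, enrich; a no-op here
        conn.execute(
            "UPDATE pages SET status = ? WHERE url = ?", (to_status, url)
        )
    conn.commit()
    return len(rows)  # 0 on a re-run: nothing left to advance
```

A crash mid-stage costs only the rows whose status had not yet been updated; the next run picks them up where the last one stopped.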

The Six-Stage Pipeline

Stage 1

Discover

Walks ~40 hand-curated product tag pages (industrial goods, hardware, packaging, automotive, etc.), then snowballs through supplier and company pages — every supplier reveals new buyers, every buyer reveals new suppliers. Bounded by configurable keyword, pagination, and page-count limits. Duplicates are silently rejected on the URL key.
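The URL-key dedup can be sketched with SQLite's INSERT OR IGNORE against a PRIMARY KEY url column (the schema below is an assumption):

```python
import sqlite3

# Sketch of URL-keyed dedup during snowball discovery: the PRIMARY KEY
# on url makes re-discovering a known page a silent no-op.

def make_url_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE urls (url TEXT PRIMARY KEY, "
        "kind TEXT, status TEXT DEFAULT 'pending')"
    )
    return conn

def enqueue(conn, url: str, kind: str) -> bool:
    # INSERT OR IGNORE rejects duplicates without raising, so the
    # snowball loop can re-submit URLs freely.
    cur = conn.execute(
        "INSERT OR IGNORE INTO urls (url, kind) VALUES (?, ?)", (url, kind)
    )
    conn.commit()
    return cur.rowcount == 1  # True only the first time the URL is seen
```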

Stage 2

Fetch

The hardest stage. Every page is loaded through a Kameleo browser profile (real fingerprint: canvas, WebGL, fonts, timezone, audio context) attached to Playwright, routed through a proxy. The disk cache (gzipped HTML) means parser changes never re-burn proxies and DB resets cost zero network requests.
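A minimal sketch of such a gzipped disk cache, with hashed filenames so any URL is a safe cache key (the directory layout and naming are assumptions):

```python
import gzip
import hashlib
from pathlib import Path

# Sketch of the gzipped HTML disk cache: a cache hit means parser
# reruns and DB rebuilds never touch the network or burn a proxy.

def cache_path(cache_dir: Path, url: str) -> Path:
    # Hash the URL so any character is safe as a filename.
    return cache_dir / (hashlib.sha256(url.encode()).hexdigest() + ".html.gz")

def cache_put(cache_dir: Path, url: str, html: str) -> None:
    cache_dir.mkdir(parents=True, exist_ok=True)
    cache_path(cache_dir, url).write_bytes(gzip.compress(html.encode("utf-8")))

def cache_get(cache_dir: Path, url: str):
    p = cache_path(cache_dir, url)
    if not p.exists():
        return None  # cache miss: the fetcher must go to the network
    return gzip.decompress(p.read_bytes()).decode("utf-8")
```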

Stage 3

Parse

Pure CPU work — no network, no risk. Two parsers extract company-page fields (name, address, shipment count, supplier list, country breakdown, recent BOL descriptions) and supplier-page fields (name, country, customer list). Customer/supplier graphs are stored as JSON blobs for join-free reconstruction downstream.
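Storing the graphs as JSON blobs might look like the following sketch, with illustrative table and column names; downstream stages deserialize the blob instead of joining across tables:

```python
import json
import sqlite3

# Sketch of join-free graph storage: each company row carries its
# supplier list as a JSON blob, deserialized on read.

def make_graph_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE companies (name TEXT PRIMARY KEY, suppliers_json TEXT)"
    )
    return conn

def save_company(conn, name: str, suppliers: list) -> None:
    conn.execute(
        "INSERT OR REPLACE INTO companies (name, suppliers_json) VALUES (?, ?)",
        (name, json.dumps(suppliers)),
    )
    conn.commit()

def load_suppliers(conn, name: str) -> list:
    row = conn.execute(
        "SELECT suppliers_json FROM companies WHERE name = ?", (name,)
    ).fetchone()
    return json.loads(row[0]) if row else []
```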

Stage 4

Enrich

Resolves URL references into actual relationships. Each company candidate is cross-linked with its parsed suppliers, the per-country breakdown is recomputed from crawled data (overriding the site's headline numbers when ours are more complete), and customer-side cross-references seed stub records for companies the next pass should fetch.
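Recomputing the per-country breakdown from crawled supplier records is straightforward; this sketch assumes each supplier is a dict with a country field:

```python
from collections import Counter

# Sketch of recomputing the per-country supplier breakdown from crawled
# records rather than trusting the site's headline numbers.

def country_breakdown(suppliers: list) -> dict:
    """suppliers: list of dicts like {"name": ..., "country": ...}.
    Returns {country: share} with shares summing to 1.0."""
    counts = Counter(s["country"] for s in suppliers if s.get("country"))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {country: n / total for country, n in counts.items()}
```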

Stage 5

Score

Rule-based 0–5 confidence model. +2 for at least one China supplier, +1 each for >5 distinct foreign suppliers, physical-goods keywords in recent BOL descriptions, parseable U.S. address, and ≥ 25% supplier concentration in tariff-affected regions. Threshold defaults to 4. Average exported confidence on real runs: 4.88.

Stage 6

Export

Three deliverables: a deduplicated CSV (the headline output), an XLSX twin for spreadsheet analysts, and a streaming JSONL sidecar that's appended live as each record is scored — so a multi-hour run can be tail -f'd in real time without waiting for the final pass.

Surviving Cloudflare Without Bypassing It

ImportYeti is fronted by Cloudflare's bot protection. A plain Python HTTP client receives a 403 "Just a moment…" challenge every time. Rather than reach for CAPTCHA solvers or fingerprint-spoofing tooling, the fetcher detects the challenge, aborts cleanly, and logs the block — and the supported path forward is browser-based: Kameleo profiles attached to Playwright, presenting a real fingerprint Cloudflare treats as a normal user, routed through proxies that aren't on the project author's residential IP.

Two proxy pools are supported: rotating mobile carrier IPs (each with a rotate_url for forced refresh) and ~580 static dedicated Webshare datacenter IPs (no rotation URL, so they're put on a 24-hour cooldown after exhaustion). The scaling recipe currently uses the static pool — the larger pool absorbs blocks more gracefully, and Cloudflare doesn't appear to penalize Webshare IPs more than mobile when the browser fingerprint is convincing.

The Block-Only Rotation Rule

Behavior

Never Rotate Proactively

No pageview counters, no time-based churn. Each proxy is used until ImportYeti's ~25-pageview wall (or any other 403/429/503/CAPTCHA marker) actually fires. Only then does the fetcher rotate. This extracts every available view from each IP; proactive rotation would burn through the pool roughly 20% faster than necessary.

Recovery

Rotate, Recycle, Retry Once

On a real block: mark the proxy exhausted (24-hour cooldown for static, rotate_url call for mobile), recycle the Kameleo profile to kill cookies and the fingerprint session, pick the next available proxy from the LRU pool, and retry the URL exactly once. Failed twice → marked error, pipeline moves on. No retry storms.
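A simplified sketch of the detect-rotate-retry-once loop, where fetch_fn stands in for the Kameleo + Playwright fetch and the proxy pool is reduced to a plain list (real bookkeeping would track cooldowns and rotate_url calls):

```python
# Sketch of block-only rotation with a single retry. A "real block" is
# a 403/429/503 status or Cloudflare's "Just a moment" challenge page.

BLOCK_STATUSES = {403, 429, 503}

def is_blocked(status: int, body: str) -> bool:
    return status in BLOCK_STATUSES or "Just a moment" in body

def fetch_with_rotation(url, proxies, fetch_fn):
    """proxies: list used as a simple LRU queue (head = next available).
    Returns (body, proxy) on success; raises after the single retry."""
    for _attempt in range(2):  # original try + exactly one retry
        proxy = proxies[0]
        status, body = fetch_fn(url, proxy)
        if not is_blocked(status, body):
            return body, proxy
        # Real block: retire this proxy and move to the next one.
        proxies.pop(0)
        if not proxies:
            break
    raise RuntimeError(f"blocked twice, marking error: {url}")
```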

Result

5 Proxies of 580 Consumed

A typical 6-worker run consumes only 5–10 proxies out of 580 — the rest stay cool for the next day. The validated 844-record run absorbed 89 blocks and never exhausted the pool. This is the difference between a fragile scraper and one that runs nightly without intervention.

Parallel Workers on a Single SQLite Database

The fetch stage is the slowest by far — browser startup, network latency, Cloudflare patience time. To keep wall-clock time reasonable, run-parallel shards the pending URL queue across N worker processes, each with its own Kameleo profile and its own non-overlapping slice of the proxy pool (assigned via the IMPORTYETI_PROXY_IDS env var, partitioned round-robin so workers never compete for the same outbound IP).
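The round-robin partition can be sketched in a few lines; the IMPORTYETI_PROXY_IDS value format shown is an assumption:

```python
# Sketch of the round-robin proxy partition: worker i takes every n-th
# proxy starting at offset i, so slices never overlap and together
# cover the whole pool.

def partition_proxies(proxy_ids: list, n_workers: int) -> list:
    return [proxy_ids[i::n_workers] for i in range(n_workers)]

def env_for_worker(proxy_ids, n_workers, worker_idx) -> dict:
    # One worker's slice, serialized for the child process environment.
    slice_ = partition_proxies(proxy_ids, n_workers)[worker_idx]
    return {"IMPORTYETI_PROXY_IDS": ",".join(str(p) for p in slice_)}
```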

All workers write to the same SQLite database in WAL mode with a 10-second busy timeout. Six workers is the empirical sweet spot — eight pushed SQLite into occasional "database is locked" retries. Worker fetch loops only insert page records; relationship building and scoring happen in single-process enrich/score stages where contention is zero.
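A sketch of the shared-connection settings, assuming the 10-second window is applied both as Python's connect timeout and as SQLite's busy_timeout:

```python
import sqlite3

# Sketch of the shared-database connection: WAL lets one writer proceed
# alongside readers, and the busy timeout makes a contended write wait
# up to 10 s instead of failing with "database is locked" immediately.

def open_shared_db(path: str) -> sqlite3.Connection:
    conn = sqlite3.connect(path, timeout=10.0)  # Python-side busy wait
    conn.execute("PRAGMA journal_mode=WAL")     # persistent on the file
    conn.execute("PRAGMA busy_timeout=10000")   # 10,000 ms, SQLite-side
    return conn
```

WAL mode is stored in the database file itself, so every worker that opens the same path inherits it.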

Resumability Superpowers

Resume

Pick Up Exactly Where It Left Off

--resume re-runs the pipeline without redoing any work: pending URLs get fetched, fetched-but-unparsed pages get parsed, in-progress rows from a crashed worker get reset to pending and retried. Nothing is duplicated.
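The crash-recovery part of --resume reduces to a single UPDATE; status names here are assumptions:

```python
import sqlite3

# Sketch of the --resume reset: rows a crashed worker left 'in_progress'
# go back to 'pending' so they are fetched again; 'done' rows are
# untouched, which is why nothing is duplicated.

def resume_reset(conn: sqlite3.Connection) -> int:
    cur = conn.execute(
        "UPDATE urls SET status = 'pending' WHERE status = 'in_progress'"
    )
    conn.commit()
    return cur.rowcount  # number of orphaned rows recovered
```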

Re-Parse

Stage-Level Reruns, Network-Free

python -m src.main parse reprocesses every cached HTML page after a parser tweak — zero network calls, zero proxy burn. Same for score and export: each stage is independently runnable against the existing DB.

Cache

The Cache Outlives the Database

Every successfully fetched page is gzipped to data/cache/. If the SQLite DB is wiped, a re-discover + load-cache pass rebuilds the database from disk without burning a single proxy. Debugging a parser edge case takes seconds, not minutes.

Live Output

Streaming JSONL Sidecar

The CSV is rewritten from scratch on each export, so during overnight runs there's no partial visibility. The JSONL sidecar fixes that: every record is appended the moment it's scored. tail -f the file to watch results land in real time.
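An append-only JSONL writer is small enough to sketch in full; flushing after each line is what makes tail -f useful mid-run:

```python
import json

# Sketch of the append-only JSONL sidecar: one JSON object per line,
# flushed immediately so tail -f sees each record as it is scored.

def append_jsonl(path: str, record: dict) -> None:
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
        f.flush()  # make the line visible to readers right away

def read_jsonl(path: str) -> list:
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```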

Live Snapshots

The live_export.sh Loop

A small companion script runs score && export on a configurable interval (default 5 minutes) against the live SQLite database while workers are still fetching. The CSV/XLSX stay reasonably fresh throughout a multi-hour run, then a final snapshot is taken when workers exit.

Fixtures

Offline Replay From Saved HTML

load-fixtures registers manually saved HTML files (one per company / supplier / tag page) into the cache + DB without any network requests. The full parse → score → export chain runs against them — useful for testing, demos, and air-gapped reproduction.

Confidence Scoring

The scorer is deliberately conservative — better to under-export than ship false positives to a tariff analyst. The rubric:

  • +2 — at least one supplier in China (or another high-tariff country)
  • +1 — more than 5 distinct foreign suppliers
  • +1 — at least one recent BOL description containing physical-goods keywords
  • +1 — address present and parseable to a real U.S. state
  • +1 — country breakdown shows ≥ 25% of suppliers in tariff-affected regions

Score ≥ 4 → written to final_companies and exported. Score < 4 → kept in the candidate table for inspection but never reaches the deliverable. Threshold is configurable via min_score_to_export.
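The rubric translates almost directly into code. One caveat: the listed bonuses sum to 6, so capping at 5 to match the stated 0–5 scale is an assumption here, as are the keyword list, the field names, and treating "United States" as the only domestic country:

```python
# Illustrative implementation of the scoring rubric. TARIFF_COUNTRIES
# and GOODS_KEYWORDS are placeholder lists, not the project's actual
# configuration.

TARIFF_COUNTRIES = {"China"}  # extend with other high-tariff regions
GOODS_KEYWORDS = {"steel", "machinery", "parts", "furniture", "tooling"}

def confidence_score(candidate: dict) -> int:
    score = 0
    suppliers = candidate.get("suppliers", [])
    foreign = {s["name"] for s in suppliers
               if s.get("country") and s["country"] != "United States"}
    tariff = [s for s in suppliers if s.get("country") in TARIFF_COUNTRIES]
    if tariff:
        score += 2                      # direct high-tariff-country supplier
    if len(foreign) > 5:
        score += 1                      # >5 distinct foreign suppliers
    bols = " ".join(candidate.get("bol_descriptions", [])).lower()
    if any(k in bols for k in GOODS_KEYWORDS):
        score += 1                      # physical-goods evidence in BOLs
    if candidate.get("us_state"):
        score += 1                      # parseable U.S. address
    if suppliers and len(tariff) / len(suppliers) >= 0.25:
        score += 1                      # >=25% tariff-region concentration
    return min(score, 5)                # cap assumption: clamp to the 0-5 scale
```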

What the Pipeline Deliberately Doesn't Do

Every "don't" is a conscious tradeoff against scope creep, fragility, or ethical risk:

  • No CAPTCHA solvers. If Kameleo + a real fingerprint can't get through, the run aborts cleanly.
  • No proactive proxy rotation. Burns IPs for no reason. Rotate only on observed blocks.
  • No headless mode. Cloudflare's bot signal weights headless flags heavily. Kameleo runs headed; windows just sit minimized.
  • No retry storms. Exactly one retry per blocked page on a fresh proxy. After that, the URL is marked error and the pipeline moves on.
  • No disallowed paths. /api/*, /cdn-cgi/, /mx/, /us-export/, /partnerships/ are hard-blocked. /search?q is disallowed by robots.txt and is opt-in only.
  • No PII enrichment. Publicly listed business addresses and shipment metadata only. No people, no executives, no email harvesting.
  • No multi-table writes from worker fetch loops. Workers only insert page records; everything relational happens in single-process stages where SQLite contention is zero.

Validated Outcomes

Validation Run

844 Records, 4.88 Confidence

The proof-of-concept run shipped 844 high-confidence U.S. importers in roughly an hour with 4 workers and default limits. Average confidence score across the export: 4.88 / 5. 89 Cloudflare blocks were absorbed and recovered cleanly during the run.

Proxy Efficiency

5 of 580 IPs Consumed

The block-only rotation rule meant only 5 proxies hit cooldown across the entire validation run. The remaining 575 stayed cold and ready for the next job. Sustainable cost-per-record at scale.

Scale Target

~10k Companies / Overnight

The current overnight runbook: 6 workers against the static proxy pool, all ~40 keywords, deep pagination, 25k company-page cap. Watcher loop reports progress every 2 minutes; the live JSONL grows record-by-record as the pipeline scores.

Output Schema

Each exported row carries the fields a tariff analyst actually needs to triage the candidate:

company_name, normalized_company_name, website, industry_or_product_type, product_description, source_country, supplier_name, import_indicator, confidence_score, confidence_reasoning, importyeti_url, source_keyword, date_scraped, notes, company_city, company_state, company_address, phone, email, contact_page_url

Three deliverables are produced from the same underlying data: importyeti_tariff_sample.csv (rebuilt from the DB on every export), importyeti_tariff_sample.xlsx (same content, Excel-formatted), and importyeti_tariff_sample.jsonl (append-only stream, one record per scored company, written live).

Why This Architecture Holds Up

Engineering

Stages As Independent Stations

Each stage reads from one table and writes to the next. A slow or broken stage queues work in its predecessor's buffer instead of crashing the line. The whole pipeline behaves like an assembly line that can be paused, restarted, or extended without coupling.

Engineering

One Source of Truth

SQLite holds discovered URLs, raw page records, parsed candidates, enriched relationships, final scored companies, and per-run errors. Every operational question — "why didn't this URL get fetched?", "how confident is this record?" — has a SQL answer.

Engineering

Defensive Parsing

ImportYeti renders most data server-side, but the DOM isn't officially stable. Parsers are field-by-field tolerant: a missing block degrades the row's confidence score instead of throwing. Schema drift becomes a re-parse from cache, not an outage.
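Field-by-field tolerance can be sketched with stdlib regexes standing in for the project's BeautifulSoup parsers: each extractor runs independently, and a miss yields None for that field rather than an exception, so one drifted DOM block degrades a single field, not the whole row. The patterns below are illustrative, not ImportYeti's actual markup:

```python
import re

# Hypothetical field extractors; a real parser would use BeautifulSoup
# selectors, but the tolerance pattern is the same: a missing block
# produces None, never a raised exception.
FIELD_PATTERNS = {
    "name": re.compile(r"<h1[^>]*>([^<]+)</h1>"),
    "shipments": re.compile(r'data-shipments="(\d+)"'),
    "address": re.compile(r'class="address"[^>]*>([^<]+)<'),
}

def parse_company(html: str) -> dict:
    row = {}
    for field, pattern in FIELD_PATTERNS.items():
        m = pattern.search(html)
        row[field] = m.group(1).strip() if m else None  # tolerate drift
    return row
```

Downstream, the scorer simply sees fewer signals for a row with None fields, which lowers its confidence instead of breaking the run.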

Responsible-Use Posture

The pipeline ships with a descriptive USER_AGENT that names the project and provides contact info, a default 5–7 second delay between requests (with jitter), persistent caching so re-runs don't re-burn the source, and an explicit allow/deny list against ImportYeti's robots.txt. It walks tag pages and individual company/supplier profiles only — no account creation, no credential reuse, no CAPTCHA solving, no calls to disallowed API paths. For sustained high-volume work, the recommended path is ImportYeti's official paid API or a licensed trade-data provider (Panjiva, Descartes Datamyne, S&P Global). The proof-of-concept exists to prove the modeling approach, not to scrape forever.

The Stack

  • Python 3
  • Playwright
  • Kameleo (anti-detect browser)
  • BeautifulSoup
  • SQLite (WAL mode)
  • Webshare static datacenter proxies
  • Mobile rotating proxies (iproxy)
  • Multi-process worker pool
  • CSV / XLSX / JSONL exporters
  • Click-style CLI subcommands
  • Structured logging
  • Disk-based gzipped HTML cache

Why It Matters

Most "scraping" projects fail not because the parser was wrong, but because the system around the parser couldn't handle real conditions: blocks, rate limits, parser drift, mid-run crashes, partial output, and proxy waste. This pipeline is built around those failure modes from the start — every stage is independent, every fetch is cached, every block is recoverable, every run is resumable, and the deliverable updates live so the operator can watch quality emerge instead of waiting hours to find out something went wrong. That's the difference between a one-shot scraper and a system you can put on a schedule.

Need a Resilient Data Pipeline or Scraping System Built?

ThinkGenius designs and builds production data pipelines — discovery, fetching, parsing, enrichment, scoring, and export — that survive real-world conditions and produce clean, analyst-ready output.