Detecting and Recovering Failed Scraping Sessions
A scraper that doesn't notice when it's failing is worse than one that crashes loudly. Here's how to build the detection and recovery layer that makes overnight runs trustworthy.
The Failure Mode Nobody Talks About
The failure that ruins scraping projects isn't a crash. It's the scraper that runs to "completion," reports success, writes 50,000 rows to the output file — and 80% of those rows contain garbage, because the site started serving challenge pages halfway through the run and the parser happily extracted empty strings from HTML that no longer contained the data.
Crashes are easy to handle. The exception is loud, the error is in the log, the operator sees it. Silent corruption is the real enemy. By the time you notice — usually when the analyst on the receiving end says "this data looks wrong" — the run is finished, the cache is full of challenge pages, and you don't know which subset of rows is real and which is noise.
This article is about building the detection and recovery layer that prevents that scenario. The components: precise failure classification, surgical retries, clean abort conditions, and an audit trail an operator can actually use.
Classify the Failure Precisely
Every fetch outcome falls into one of five buckets. The recovery logic is different for each, and conflating them is what produces retry storms or silent corruption.
1. Success. HTTP 200, expected content shape, no challenge markers. Move on. Cache the response.
2. Soft block. HTTP 200 but the page is a challenge or a "no results" decoy. Treat as a block, not a success. Don't cache. Retry on a fresh proxy + profile, then if still bad, mark the URL error.
3. Hard block. HTTP 403 / 429 / 503 with a known protection signature. Same recovery as a soft block — fresh identity, retry once, then error. Critically: don't cache the block response.
4. Transient error. Network timeout, DNS failure, TLS handshake failure, HTTP 5xx with no protection signature. Probably not your fault, probably not a block. Retry on the same proxy after a short backoff (10s, then 30s, then give up). Don't burn a proxy on a network blip.
5. Permanent error. HTTP 404, HTTP 410, malformed URL, navigation failure to an unrelated domain. The URL is broken; rotation won't fix it. Mark it error immediately, don't retry, don't burn proxies.
The classifier is a small function that takes the response (or exception) and returns one of these five enums. Every fetch goes through it. The worker dispatches on the result. There is no "default to retry" path — every outcome has a specific recovery rule, and unknown shapes are logged as a separate diagnostic event so you can refine the classifier.
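As a concrete sketch, the classifier can be a single function over an enum. The challenge markers below are placeholders — real ones come from observing the target site — and the exact status-code mapping is my reading of the five buckets above, not a definitive implementation:

```python
from enum import Enum, auto
from typing import Optional


class Outcome(Enum):
    SUCCESS = auto()
    SOFT_BLOCK = auto()
    HARD_BLOCK = auto()
    TRANSIENT = auto()
    PERMANENT = auto()


# Illustrative only — in practice these come from studying the target's
# challenge pages.
CHALLENGE_MARKERS = ("cf-challenge", "Checking your browser", "captcha")


def classify(status: Optional[int], body: str = "",
             exc: Optional[Exception] = None) -> Outcome:
    """Map one fetch result (or exception) to exactly one bucket."""
    if exc is not None:                       # timeout, DNS, TLS failure
        return Outcome.TRANSIENT
    if status in (404, 410):                  # the URL itself is broken
        return Outcome.PERMANENT
    if status in (403, 429):                  # protection-layer status codes
        return Outcome.HARD_BLOCK
    if status is not None and status >= 500:
        if any(m in body for m in CHALLENGE_MARKERS):
            return Outcome.HARD_BLOCK         # 5xx carrying a protection signature
        return Outcome.TRANSIENT              # plain server error
    if status == 200:
        if any(m in body for m in CHALLENGE_MARKERS):
            return Outcome.SOFT_BLOCK         # 200 but a challenge/decoy page
        return Outcome.SUCCESS
    # Unknown shape: in the real system, log a diagnostic event here so the
    # classifier can be refined, then treat conservatively as transient.
    return Outcome.TRANSIENT
```

The worker then dispatches on the returned enum, so every recovery rule lives in one place.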
The Single-Retry Rule
Every retry doubles your blast radius — twice the bandwidth, twice the proxy consumption, twice the time, twice the risk of confusing the protection system. The rule that holds across every scraping system I've built is: at most one retry on a fresh identity per failure.
Concretely:
- A hard or soft block triggers exactly one rotation + retry. Second block on a fresh proxy → mark the URL error. No third attempt.
- A transient error triggers up to two retries on the same proxy with backoff (10s, 30s). Third failure → mark the URL error. No proxy rotation for transient errors; rotating on a network blip is how you burn good proxies for nothing.
- A permanent error triggers zero retries.
The reason this rule matters: without a hard cap, the recovery logic itself becomes a retry storm. A flaky URL or an aggressive protection system that's blocking every proxy in your pool will, with unbounded retries, churn through your entire pool in minutes. With the single-retry rule, the worst case is "this URL fails twice, gets marked error, the worker moves on" — bounded, predictable, and visible in the audit trail.
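The dispatch loop that enforces these caps is small. In this sketch, fetch, classify, rotate_identity, and mark_error are stand-ins for the real worker, proxy pool, and database layer — assumptions, not the article's actual API:

```python
import time

# 10s, then 30s, then give up — the backoff schedule for transient errors.
TRANSIENT_BACKOFFS = (10, 30)


def process(url, fetch, classify, rotate_identity, mark_error,
            sleep=time.sleep):
    """One URL, bounded recovery: at most one retry on a fresh identity."""
    outcome = classify(fetch(url))
    if outcome == "success":
        return "done"
    if outcome == "permanent":
        mark_error(url)                   # broken URL: zero retries
        return "error"
    if outcome in ("soft_block", "hard_block"):
        rotate_identity()                 # exactly one fresh proxy + profile
        if classify(fetch(url)) == "success":
            return "done"
        mark_error(url)                   # second block on a fresh identity
        return "error"
    # Transient: up to two retries on the SAME proxy, no rotation.
    for delay in TRANSIENT_BACKOFFS:
        sleep(delay)
        if classify(fetch(url)) == "success":
            return "done"
    mark_error(url)
    return "error"
```

Whatever happens, the worst case is three fetches for one URL — bounded by construction, not by hope.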
Run-Wide Abort Conditions
One URL failing is normal. Half the URLs failing is a system-level problem — the site changed its bot protection, the entire proxy pool got flagged, your authentication expired. The worker shouldn't keep grinding when the surrounding system is broken; it should abort cleanly and surface the problem.
The conditions I check in every worker, evaluated on a sliding window:
- Block rate > 30% over the last 50 fetches → pause for 10 minutes, then resume. If still > 30% after the pause, abort.
- Same protection signal on 5+ consecutive fetches across 3+ different proxies → abort. The protection layer has identified your fingerprint, not your IP.
- Proxy pool depletion — fewer than 10% of pool proxies still healthy → abort. There's no point continuing if the next block has nothing to rotate to.
- Authentication failure on a logged-in workflow → abort immediately and log loudly. The session expired; continuing without re-auth will produce garbage.
"Abort" means the worker exits cleanly: marks any in-progress rows as pending (so the next run picks them up), writes a final summary row to the run log, and terminates with a non-zero exit code so the launcher knows. The other workers in the pool keep running — abort is per-worker, not per-pool, unless the launcher escalates.
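The first two abort conditions can be sketched as a small monitor the worker updates after every fetch. The class name and thresholds here are illustrative; the window size and 30% limit match the rules above:

```python
from collections import deque


class AbortMonitor:
    """Sliding-window abort checks, updated once per fetch."""

    def __init__(self, window=50, block_rate_limit=0.30):
        self.window = deque(maxlen=window)   # True = this fetch was blocked
        self.block_rate_limit = block_rate_limit
        self.current_signal = None
        self.signal_streak = 0
        self.signal_proxies = set()

    def record(self, classification, signal=None, proxy_id=None):
        self.window.append(classification in ("soft_block", "hard_block"))
        if signal is not None and signal == self.current_signal:
            self.signal_streak += 1
            self.signal_proxies.add(proxy_id)
        else:
            self.current_signal = signal
            self.signal_streak = 1 if signal else 0
            self.signal_proxies = {proxy_id} if signal else set()

    def block_rate_exceeded(self):
        # Only meaningful once the window is full — don't abort on the
        # first few fetches of a run.
        full = len(self.window) == self.window.maxlen
        return full and sum(self.window) / len(self.window) > self.block_rate_limit

    def fingerprint_flagged(self):
        # Same protection signal on 5+ consecutive fetches across 3+ proxies:
        # they've identified the fingerprint, not the IP.
        return self.signal_streak >= 5 and len(self.signal_proxies) >= 3
```

The pause-then-resume step and the pool-depletion and auth checks live in the worker loop around this, since they need the proxy pool and session state.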
The Audit Trail
The recovery logic above is only useful if the operator can reconstruct what happened after the fact. That requires a structured audit trail, not just text logs.
Two database tables carry the history.
fetch_log — one row per fetch attempt:
id, url_id, worker_id, proxy_id, profile_id,
attempted_at, http_status, classification,
duration_ms, error_message
blocks — one row per detected block:
id, fetch_log_id, signal_type, signal_value,
proxy_id, profile_id, body_snippet
With these two tables, every operator question is a SQL query away:
- What happened to URL 4729? → join urls to fetch_log by url_id.
- How often was each proxy blocked? → group blocks by proxy_id.
- What protection signals are showing up most this week? → group blocks by signal_type with a date filter.
- Did the failure rate spike at a specific time? → bucket fetch_log by hour with a classification != 'success' filter.
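As a worked example of the last question — failure rate by hour — here is the query against a minimal in-memory copy of fetch_log. The sample rows are made up for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fetch_log (attempted_at TEXT, classification TEXT)")
conn.executemany(
    "INSERT INTO fetch_log VALUES (?, ?)",
    [("2025-04-14T02:10:00", "success"),
     ("2025-04-14T02:40:00", "hard_block"),
     ("2025-04-14T03:05:00", "hard_block"),
     ("2025-04-14T03:20:00", "hard_block")],
)

# Bucket by hour: the first 13 characters of an ISO timestamp
# ("2025-04-14T02") identify the hour.
rows = conn.execute("""
    SELECT substr(attempted_at, 1, 13)        AS hour,
           COUNT(*)                           AS fetches,
           SUM(classification != 'success')   AS failures
    FROM fetch_log
    GROUP BY hour
    ORDER BY hour
""").fetchall()
# rows -> [('2025-04-14T02', 2, 1), ('2025-04-14T03', 2, 2)]
```

A spike in the failures column pinpointed to a single hour is usually the moment the protection system changed its mind about you.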
Disk usage is small — a few hundred bytes per fetch. For a system doing a million fetches a month, the log table is well under a gigabyte. Keep 90 days online, archive the rest. The cost of having this data is trivial; the cost of not having it shows up the first time a downstream consumer asks "what happened on April 14th?"
Disk Artifacts on Failure
For any URL marked error, capture three artifacts to disk before moving on:
- A full-page screenshot at the moment of failure.
- The rendered HTML.
- The last few network responses (URL, status, headers, first 4KB of body).
Filename pattern: errors/<url_id>_<timestamp>.{png,html,har}. The audit trail in the database links to these by URL ID. When the operator wants to know "what was the page actually showing when this URL failed?", they have the answer in seconds.
The cost is a few hundred KB per error. Cleanup runs weekly: anything older than 30 days, delete. The artifacts are debugging aids, not long-term storage.
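A minimal capture helper might look like this, assuming a Playwright Page and a worker that has been collecting recent responses as tuples. Note one simplification: the sketch writes the network responses as a plain-text log rather than a true HAR file, and the function name and tuple shape are mine, not the article's:

```python
import time
from pathlib import Path


def capture_failure_artifacts(page, url_id, recent_responses,
                              out_dir="errors"):
    """Save screenshot, rendered HTML, and recent responses for a failed URL.

    `page` is a Playwright Page; `recent_responses` is a list of
    (url, status, headers, body_prefix) tuples the worker has collected.
    """
    ts = int(time.time())
    base = Path(out_dir)
    base.mkdir(parents=True, exist_ok=True)

    # Full-page screenshot at the moment of failure.
    page.screenshot(path=str(base / f"{url_id}_{ts}.png"), full_page=True)

    # The rendered HTML the parser actually saw.
    (base / f"{url_id}_{ts}.html").write_text(page.content(), encoding="utf-8")

    # Last few responses: URL, status, headers, first 4KB of body.
    lines = []
    for resp_url, status, headers, body in recent_responses[-5:]:
        lines.append(f"{status} {resp_url}\n{headers}\n{body[:4096]}\n---")
    (base / f"{url_id}_{ts}.responses.txt").write_text(
        "\n".join(lines), encoding="utf-8")
```

The worker calls this in the error-marking path, right before moving to the next URL, so the artifacts always correspond to the final failed attempt.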
The Operator's Morning
The whole point of the architecture above is to make the operator's morning predictable. Every overnight run produces:
- A summary email or Slack message: rows fetched, success rate, top failure classes, total proxy consumption.
- A clean output file with no challenge-page garbage in it.
- A queryable audit trail for any URL the analyst questions.
- A specific list of URLs that failed, with screenshots one click away.
What it doesn't produce: a 200 MB log file no one can read, a phone call at 3am, or three days of retroactive cleanup because half the rows are wrong.
Wrap-Up
Detection and recovery are the unglamorous middle of every reliable scraping system. The interesting work is the parsing; the work that determines whether the system is trustworthy is the failure-handling layer. Get the classification right, hold the line on single-retry, build the audit trail, capture the artifacts. The result is a scraper you can put on a schedule and stop thinking about.
For the surrounding pieces — the worker pool, the queue, the proxy rotation logic this hooks into — see the Python Web Scraping and Browser Automation hubs.
Need a Custom Automation System?
Need help building a production scraping, browser automation, or AI data extraction system? I build custom Python, Playwright, Kameleo, Undetectable, MySQL, and dashboard-based automation systems for businesses.