Detecting Silent Pipeline Failures: A Practical Guide
The pipelines that bite hardest don't throw exceptions. They run green, write zero rows, and you find out three days later in a Slack message from finance.
The failure modes that scheduler alerts miss
Airflow, Dagster, Prefect, dbt Cloud, they all alert you when a task crashes. That's the easy case. The hard cases:
- Zero-row writes. The query ran, returned 0 rows, and overwrote the previous day's data. Status: green.
- Stuck schedulers. The job didn't fail. It didn't run. The trigger was disabled in last week's deploy.
- Partial loads. Half the source files copied. The downstream join produces a smaller fact table; nobody notices for a week.
- Schema drift. Upstream renamed
customer_idtocust_id. Your transformation now silently fills NULL. - Stale source. The pipeline ran. The source table hadn't been updated in 72 hours. Nothing complained.
Every one of these passes a scheduler health check. None of them throws an exception. All of them eventually show up as a wrong number on a dashboard.
The five checks that catch 80% of silent failures
1. Row-count freshness
For every table you care about, ask one question every hour: how many rows landed in the last N hours? If the answer is zero (or below a floor), something silent happened.
-- Rows landed in the last 24 hours
SELECT COUNT(*) AS rows_today
FROM orders
WHERE created_at >= CURRENT_TIMESTAMP() - INTERVAL 24 HOUR;
-- Alert if rows_today = 0 OR rows_today < historical_p52. Source-table freshness
Run a query against the SOURCE table's latest timestamp, not your transformed one. A pipeline can be on time and the warehouse can still be stale.
SELECT MAX(updated_at) AS source_latest
FROM source_db.raw_orders;
-- Alert if MAX(updated_at) is older than expected SLA3. Schema hash
Compute a hash of (column_name, data_type) for every critical table and persist it. When it changes, alert. Adding columns is usually safe. Renames, drops, and type changes break downstream.
4. NULL rate drift
For columns you care about, baseline the NULL rate over the last 30 days. When today's NULL rate is more than 3 standard deviations from the baseline, alert. This is the cheap version of anomaly detection and it catches the silent rename / silent join-key change.
SELECT
COUNT(*) FILTER (WHERE customer_id IS NULL) * 1.0 / COUNT(*) AS null_rate
FROM fct_orders
WHERE created_at >= CURRENT_DATE - INTERVAL 1 DAY;5. Business-metric drift
The dashboard's most important number, revenue, signups, active accounts, should have its own threshold check. Day-over-day change above ±30% (or whatever fits your business) is a flag. This catches the failures that bypass every other check because they corrupt the math, not the data shape.
Where to run these
Three places, in priority order:
- As post-job hooks in your orchestrator. dbt has
on-run-end. Airflow has callbacks. Run the freshness + row-count checks right after the job that should have populated the table. - On a separate schedule from the pipeline. If the pipeline failed to schedule entirely, hooks won't run. A scheduled cron, independent of your data infra, catches that.
- In a dedicated observability tool. Once you're running more than ~10 of these, the maintenance burden of in-orchestrator checks tips. A tool like Sparvi handles all five categories with auto-generated rules, plus anomaly detection on top.
The alert that catches everything: business-metric drift
If you only build one check, build this one. Pick the top 3 metrics your stakeholders look at daily. Compute them on a fixed schedule. Alert when they move more than you'd expect. Every silent failure that matters eventually shows up here, even when every other signal is green.
Don't alert on everything
The temptation is to wrap every table in every check. Alert fatigue kills observability faster than any silent failure does. Start with the tables that drive the dashboards executives look at. Earn the right to expand by being right when you do alert.
Try Sparvi for free
Sparvi monitors your warehouse for all five of these failure modes. Set up monitors with ML or threshold detection, segment them by source, region, or tenant when an aggregate would hide the problem, and route alerts to Slack. Connect Snowflake or BigQuery, get the first alert in 30 minutes. 14-day free trial, no credit card.
Start Free Trial