Engineering8 min read

BigQuery Freshness Checks: SQL Patterns That Actually Work

The freshness query in your runbook probably lies. Here are the patterns that don't.

By the Sparvi Team

The naive freshness check

Most teams start here:

SELECT MAX(updated_at) AS last_update
FROM `project.dataset.orders`;

This works until it doesn't. Three failure modes:

  1. updated_at is on the row, not the load. A backfill of historical data shows old timestamps. The table is fresh; the column says otherwise.
  2. The streaming insert has been writing rows with backdated updated_at for a week. The query always returns yesterday's value.
  3. The column is NULL for half the rows. MAX(NULL) returns NULL, your check passes silently.

Three patterns that actually work, in order of cost.

Pattern 1: INFORMATION_SCHEMA.PARTITIONS (cheap, partitioned tables only)

BigQuery's INFORMATION_SCHEMA.PARTITIONS exposes the load time of each partition. For partitioned tables this is the cheapest freshness check because it scans metadata, not data.

SELECT
  MAX(last_modified_time) AS table_last_modified
FROM `project.dataset.INFORMATION_SCHEMA.PARTITIONS`
WHERE table_name = 'orders';

This returns when BigQuery last wrote to any partition. It catches stuck schedulers and silent partial loads because if no partition was modified, the value is genuinely old.

Limitation: only works on partitioned tables. Non-partitioned tables have __UNPARTITIONED__ as a single pseudo-partition, which is fine but less useful.

Pattern 2: INFORMATION_SCHEMA.TABLES (universal)

Every table has a last_modified_time in INFORMATION_SCHEMA.TABLES. This is the timestamp BigQuery itself updates on any write.

SELECT
  TIMESTAMP_MILLIS(last_modified_time) AS table_last_modified
FROM `project.dataset.__TABLES__`
WHERE table_id = 'orders';

Note: this is the legacy __TABLES__ view, not INFORMATION_SCHEMA.TABLES. The legacy view exposes last_modified_time as a millisecond integer. INFORMATION_SCHEMA.TABLES doesn't expose this directly, counterintuitively, the legacy view is what you want.

When this is useful: any table, partitioned or not. Catches the truly silent failures because BigQuery only updates this timestamp on real writes.

Pattern 3: Row-count delta (catches business-level freshness)

The previous two patterns answer "did BigQuery write to this table." They don't answer "did your pipeline write the right amount." For that:

-- Rows landed in the last 24 hours by load_date partition
SELECT
  load_date,
  COUNT(*) AS rows_loaded
FROM `project.dataset.orders`
WHERE load_date >= CURRENT_DATE() - 7
GROUP BY load_date
ORDER BY load_date DESC;

Alert when today's row count is below some floor relative to the trailing 7-day average. This catches the failure where the pipeline ran, wrote 12 rows instead of the usual 12,000, and reported success.

If your tables aren't partitioned by load date, partition them. The cost of partition pruning savings will dwarf the cost of the partitioning operation, and freshness queries become trivial.

The composite check you want in production

Combine all three for the highest-stakes tables:

WITH metadata AS (
  SELECT TIMESTAMP_MILLIS(last_modified_time) AS table_modified
  FROM `project.dataset.__TABLES__`
  WHERE table_id = 'orders'
),
recent_rows AS (
  SELECT COUNT(*) AS row_count
  FROM `project.dataset.orders`
  WHERE load_date = CURRENT_DATE()
)
SELECT
  table_modified,
  row_count,
  CASE
    WHEN table_modified < CURRENT_TIMESTAMP() - INTERVAL 6 HOUR THEN 'STALE'
    WHEN row_count = 0 THEN 'EMPTY'
    WHEN row_count < 1000 THEN 'PARTIAL'
    ELSE 'OK'
  END AS freshness_status
FROM metadata, recent_rows;

Three potential failure modes, one query. Alert if freshness_status is anything other than OK.

Gotchas worth knowing

DATE vs TIMESTAMP comparison

BigQuery rejects DATE >= TIMESTAMP with a type-mismatch error. If your column is a DATE, use CURRENT_DATE() and DATE_SUB. If it's TIMESTAMP or DATETIME, use the matching CURRENT_* function. Mixing them silently doesn't work; you'll get a query error, not bad data.

Streaming tables and the buffer

Tables fed by the streaming insert API have a streaming buffer that doesn't show in INFORMATION_SCHEMA immediately. last_modified_time may lag by 1-3 minutes. Don't alert on freshness windows shorter than 10 minutes for streaming tables.

Views and materialized views

Views don't have a last_modified_time meaningful for data freshness, they reflect their definition, not their data. For views, check the freshness of the underlying tables. For materialized views, check the refresh timestamp via INFORMATION_SCHEMA.MATERIALIZED_VIEWS.

Skip writing these by hand

The patterns above work, but maintaining 100 of them across a warehouse is a part-time job. Sparvi auto-generates freshness rules for every table you connect, picks the right pattern based on the table's partitioning and column types, and alerts on drift from baselines. The DATE/TIMESTAMP gotcha mentioned above? Handled automatically.

Auto-generated freshness checks

Sparvi profiles your BigQuery tables and writes the right freshness pattern for each, partitioned, non-partitioned, view, materialized view, streaming. 14-day free trial, no credit card.

Start Free Trial