Definition
What is Data Lineage?
Data lineage is the complete lifecycle of data—tracking where it originated, how it's been transformed, and where it flows throughout your systems. It's essentially a map of your data's journey.
Data Lineage Explained
Your CEO asks: "Where does this revenue number come from?"
Without data lineage, answering means hours of detective work—tracing through dashboards, finding the underlying queries, tracking down the source tables, understanding the transformations.
With data lineage, you can instantly see: Revenue in the dashboard comes from the monthly_revenue table, which is calculated from orders joined with refunds, which are loaded from the Stripe API.
Data lineage is one of the five pillars of data observability. It answers questions like:
- Where does this data come from?
- What transformations have been applied?
- What downstream systems depend on this table?
- If I change this column, what breaks?
Types of Data Lineage
Table-Level Lineage
Shows relationships between tables:
- Table A feeds Table B
- Table B and Table C feed Table D
Table-level lineage is easier to implement but less precise. You know tables are connected but not exactly how.
Column-Level Lineage
Shows relationships between specific columns:
- Column
orders.amountfeedsrevenue.total_sales - Columns
users.first_nameandusers.last_nameconcatenate intocustomers.full_name
Column-level lineage is more powerful for impact analysis but harder to maintain accurately.
Business Lineage
Maps data flows in business terms:
- "Customer Lifetime Value" comes from "Purchase History" and "Support Interactions"
- "Monthly Recurring Revenue" is calculated from "Active Subscriptions"
Business lineage helps non-technical stakeholders understand data origins.
Why Data Lineage Matters
Impact Analysis
Before making changes, you need to understand the blast radius:
- Schema changes: If I rename this column, what transformations break?
- Table deprecation: If I delete this table, what dashboards stop working?
- Data fixes: If I correct this data, what downstream reports need refreshing?
Without lineage, changes are risky. Teams either move too slowly (checking everything manually) or too fast (breaking things they didn't know depended on the change).
Root Cause Analysis
When something's wrong, lineage helps trace back to the source:
- Wrong numbers: Follow lineage upstream to find where calculation went wrong
- Missing data: Trace back to find which source or transformation dropped records
- Stale data: Identify which upstream job failed or is delayed
Lineage turns "something's broken" into "this specific transformation has a bug."
Regulatory Compliance
Many regulations require proving data's origin:
- GDPR: Where does personal data flow? Can you delete it everywhere?
- SOX: How are financial figures calculated? What's the audit trail?
- HIPAA: Where does patient data go? Who has access?
Lineage provides the documentation auditors need.
Data Trust
Stakeholders trust data more when they understand its origin:
- Seeing the full lineage shows data isn't a black box
- Understanding transformations explains why numbers look the way they do
- Knowing the source systems validates data accuracy
Data Lineage Examples
Simple ETL Lineage
Source: Stripe API
↓
Raw Layer: raw_stripe_charges
↓ (filter: status = 'succeeded')
Staging: stg_payments
↓ (join with customers, aggregate by month)
Mart: monthly_revenue
↓
Dashboard: Executive Revenue ReportComplex Transformation Lineage
orders.amount ──────────────┐
├─→ revenue.gross_sales
orders.discount_amount ─────┘ │
│
refunds.amount ─────────────────────├─→ revenue.net_sales
│
subscriptions.mrr ──────────────────┴─→ revenue.total_revenueData Lineage vs Data Provenance
These terms are related but different:
| Aspect | Lineage | Provenance |
|---|---|---|
| Focus | Where data flows (the path) | What happened to data (the history) |
| Question | "Where does this come from?" | "What happened to this?" |
| Analogy | A map showing the route | A travel log with timestamps |
| Example | Orders → Revenue | Loaded at 6am, transformed at 7am |
Both are valuable. Lineage helps understand structure; provenance helps debug timing and history issues.
Implementing Data Lineage
Manual Documentation
The simplest approach: document lineage in spreadsheets or diagrams.
- Pros: No tooling required, captures business context
- Cons: Goes stale quickly, doesn't scale, error-prone
Manual documentation works for small, stable environments but becomes unmanageable as complexity grows.
Code Parsing
Extract lineage from transformation code (dbt, SQL, Spark):
-- dbt model with implicit lineage
-- File: models/marts/monthly_revenue.sql
SELECT
date_trunc('month', o.order_date) as month,
SUM(o.amount) as gross_revenue,
SUM(r.amount) as refunds,
SUM(o.amount) - COALESCE(SUM(r.amount), 0) as net_revenue
FROM {{ ref('stg_orders') }} o
LEFT JOIN {{ ref('stg_refunds') }} r ON o.order_id = r.order_id
GROUP BY 1
-- dbt automatically tracks that monthly_revenue
-- depends on stg_orders and stg_refundsTools like dbt build lineage graphs from ref() and source() functions.
Query Log Analysis
Analyze database query logs to discover actual data flows:
- See which tables are actually read together
- Discover undocumented dependencies
- Track query patterns over time
This catches lineage that code parsing misses—like ad-hoc queries and legacy pipelines.
Data Observability Platforms
Data observability tools combine multiple approaches:
- Parse transformation code automatically
- Analyze query logs for actual usage
- Visualize lineage graphs interactively
- Integrate lineage with anomaly detection and alerting
Data Lineage Best Practices
Start with Critical Paths
Don't try to map everything. Start with:
- Executive dashboards
- Financial reports
- Customer-facing data
- Compliance-sensitive data
Map these critical paths first, then expand coverage over time.
Automate Where Possible
Manual lineage goes stale. Use tools that automatically discover and update lineage as your pipelines change.
Include Business Context
Technical lineage ("table A feeds table B") is less useful without business context ("customer orders flow into monthly revenue reports").
Connect to Impact Analysis
Lineage is most valuable when connected to change management. Before making changes, automatically show what will be affected.
Data Lineage with Sparvi
Sparvi helps teams understand their data flows with automated lineage tracking. See where data comes from, understand dependencies, and make changes with confidence.
Learn More About SparviFrequently Asked Questions
What is data lineage?
Data lineage is the complete lifecycle of data, tracking where it originated, how it has been transformed, and where it flows within your systems. It provides a visual map showing data's journey from source to destination, including all transformations along the way.
Why is data lineage important?
Data lineage is important for impact analysis (understanding what breaks when something changes), root cause analysis (tracing problems back to their source), compliance (proving data's origin and transformations for auditors), and trust (knowing where numbers in reports actually come from).
What is the difference between data lineage and data provenance?
Data lineage shows the path data takes through systems (the "where"). Data provenance provides detailed history of what happened to data at each step (the "what" and "when"). Lineage is the map; provenance is the detailed travel log.
How do you implement data lineage?
Data lineage can be implemented through: manual documentation (spreadsheets and diagrams), parsing transformation code (dbt, SQL), query log analysis (tracking actual data flows), or dedicated data catalog and observability tools that automatically discover and visualize lineage.