Data Lineage vs Data Provenance
Introduction: The Trust Deficit
Imagine an executive presenting a critical financial dashboard to the board of directors. The dashboard shows that profits dropped by 15%. A board member asks a simple question: “Where exactly did this number come from?”
If the data team cannot definitively trace that single number backward through dozens of database tables, ETL pipelines, and raw CSV files to its absolute origin, the data is untrustworthy. In heavily regulated industries (like banking or healthcare), the inability to prove the origin of a metric can result in catastrophic legal penalties.
This requirement for absolute traceability is satisfied by two closely related, but distinct, Data Governance concepts: Data Lineage and Data Provenance.
Data Lineage: The Map of Movement
Data Lineage answers the question: “How did the data get here, and what happened to it along the way?”
Data Lineage focuses on the physical journey and transformation of data through the enterprise architecture. It is usually visualized as a complex, branching graph (a Directed Acyclic Graph, or DAG).
If you look at the lineage of a “Total Revenue” column in a dashboard, the graph will show:
- The Source: It originated in the Salesforce.Orders table.
- The Extraction: An Airflow job copied it to an Amazon S3 raw bucket.
- The Transformation (dbt): A dbt SQL script joined it with the Stripe.Payments table and filtered out refunded orders.
- The Destination: It was materialized into the Lakehouse.Gold_Revenue table.
- The Consumer: It is currently being read by 3 different Tableau dashboards.
Lineage is critical for Impact Analysis. If a data engineer wants to delete or change the Salesforce.Orders table, they look at the lineage graph to see exactly which downstream tables and Tableau dashboards will break if they make that change.
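Impact analysis is, at its core, a graph traversal. The sketch below models the "Total Revenue" example as a plain dictionary of upstream-to-downstream edges and walks it breadth-first. The S3 path, the dbt model name, and the three dashboard names are hypothetical placeholders, and a real catalog would store this graph for you rather than have you hand-code it.

```python
from collections import deque

# Hypothetical lineage edges (upstream -> downstream) for the example above.
# The S3 path, dbt model name, and dashboard names are made up for illustration.
LINEAGE = {
    "Salesforce.Orders": ["s3://raw/orders"],
    "s3://raw/orders": ["dbt.revenue_model"],
    "Stripe.Payments": ["dbt.revenue_model"],
    "dbt.revenue_model": ["Lakehouse.Gold_Revenue"],
    "Lakehouse.Gold_Revenue": ["tableau.dash_1", "tableau.dash_2", "tableau.dash_3"],
}


def downstream(graph, start):
    """Return every node reachable from `start` (breadth-first traversal)."""
    seen = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for child in graph.get(node, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


def upstream(graph, target):
    """Return every node that feeds `target`, by traversing reversed edges."""
    reverse = {}
    for parent, children in graph.items():
        for child in children:
            reverse.setdefault(child, []).append(parent)
    return downstream(reverse, target)


# Impact analysis: what breaks if Salesforce.Orders changes?
impacted = downstream(LINEAGE, "Salesforce.Orders")
dashboards = sorted(n for n in impacted if n.startswith("tableau."))
print(dashboards)

# Root-cause analysis: where did Gold_Revenue come from?
print(sorted(upstream(LINEAGE, "Lakehouse.Gold_Revenue")))
```

The same structure answers both directions of the question: traversing forward gives the blast radius of a change, and traversing the reversed edges gives the sources behind a number on a dashboard.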
Data Provenance: The Certificate of Authenticity
Data Provenance answers the question: “Who created this data, when, and under what authority?”
While Lineage focuses on the engineering pipeline (the How), Provenance focuses on the historical metadata and ownership (the Who and Why). It is the historical “chain of custody.”
If you look at the provenance of the same “Total Revenue” dataset, it will show:
- Authorship: The raw data was generated by the Salesforce API v2.1 on May 14, 2026, at 09:00 AM UTC.
- Ownership: The dataset is legally owned by “Jane Doe, VP of Sales.”
- Quality Metrics: When it was ingested, it passed 99.8% of its automated Data Quality checks.
- Security Classification: It contains PII (Personally Identifiable Information) and is governed by GDPR rules.
Provenance is critical for Regulatory Auditing. When regulators audit a machine learning model that denied a customer a loan, provenance provides the documentation showing who supplied the training data, when, and under what authority. That record gives auditors the evidence they need to assess whether the data was biased or unlawfully obtained.
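One way to picture a provenance entry is as an immutable record attached to the dataset. The sketch below captures the four attributes listed above in a frozen dataclass and serializes it for an audit trail. The class name, field names, and JSON layout are assumptions made for illustration, not the schema of any particular governance tool.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)  # frozen: fields cannot be reassigned after creation
class ProvenanceRecord:
    """Hypothetical provenance entry; field names are illustrative only."""
    dataset: str
    generated_by: str
    generated_at: datetime
    owner: str
    quality_pass_rate: float
    contains_pii: bool
    regulations: tuple


record = ProvenanceRecord(
    dataset="Total Revenue",
    generated_by="Salesforce API v2.1",
    generated_at=datetime(2026, 5, 14, 9, 0, tzinfo=timezone.utc),
    owner="Jane Doe, VP of Sales",
    quality_pass_rate=0.998,
    contains_pii=True,
    regulations=("GDPR",),
)


def to_audit_json(rec):
    """Serialize a record to JSON with a timezone-aware ISO 8601 timestamp."""
    payload = asdict(rec)
    payload["generated_at"] = rec.generated_at.isoformat()
    return json.dumps(payload, sort_keys=True)


print(to_audit_json(record))
```

Freezing the record reflects the chain-of-custody idea: a correction is recorded as a new entry rather than an overwrite of the old one.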
Implementing Lineage and Provenance
Historically, lineage and provenance were tracked manually in massive Excel spreadsheets, which quickly became outdated.
Today, much of this tracking can be automated using Active Metadata platforms (commercial tools like Alation and Collibra, or open-source tools like DataHub and Marquez). As data moves through the modern Lakehouse, the orchestration tools (Airflow) and transformation tools (dbt) can emit metadata events as jobs run. The governance platform listens to these events, redraws the lineage maps, and updates the provenance ledgers in near real time.
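The emit-and-listen pattern can be sketched in a few lines. In the toy model below, each job run produces an event that names its inputs and outputs, and a catalog object turns those events into lineage edges and an append-only ledger. The event shape, the `GovernanceCatalog` class, and the job names are all hypothetical; real platforms define their own event schemas and transports.

```python
from datetime import datetime, timezone


def make_event(job, inputs, outputs, event_time=None):
    """Build a hypothetical metadata event as a plain dict."""
    return {
        "job": job,
        "inputs": list(inputs),
        "outputs": list(outputs),
        "event_time": event_time or datetime.now(timezone.utc).isoformat(),
    }


class GovernanceCatalog:
    """Toy listener: derives lineage edges and a ledger from job events."""

    def __init__(self):
        self.edges = set()   # (upstream, downstream) pairs
        self.ledger = []     # append-only list of what ran and when

    def handle(self, event):
        # Every input of a job is upstream of every output of that job.
        for src in event["inputs"]:
            for dst in event["outputs"]:
                self.edges.add((src, dst))
        self.ledger.append({
            "job": event["job"],
            "outputs": tuple(event["outputs"]),
            "event_time": event["event_time"],
        })


catalog = GovernanceCatalog()
catalog.handle(make_event(
    "airflow.extract_orders", ["Salesforce.Orders"], ["s3://raw/orders"]))
catalog.handle(make_event(
    "dbt.revenue_model",
    ["s3://raw/orders", "Stripe.Payments"],
    ["Lakehouse.Gold_Revenue"]))

for edge in sorted(catalog.edges):
    print(edge)
```

Because the graph is derived from what the jobs report at run time, it stays current without anyone editing a diagram by hand.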
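Furthermore, Open Table Formats like Apache Iceberg provide a foundation for provenance: every commit to a table produces a new snapshot, so the table's metadata keeps a history of which data files made up the table at each point in time, for as long as old snapshots are retained rather than expired.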
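The snapshot idea can be illustrated with a toy model: each commit appends a new, immutable snapshot that lists the table's data files and points at its parent. This is only a conceptual sketch and not Iceberg's actual metadata format or API; the class names, the integer snapshot IDs, and the file names are invented for the example.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Snapshot:
    """Toy snapshot: the full set of data files after one commit."""
    snapshot_id: int
    parent_id: Optional[int]
    operation: str
    data_files: Tuple[str, ...]


class TableHistory:
    """Append-only snapshot log; earlier snapshots are never modified."""

    def __init__(self):
        self._snapshots = []

    def commit(self, operation, added=(), removed=()):
        parent = self._snapshots[-1] if self._snapshots else None
        current = set(parent.data_files) if parent else set()
        files = (current - set(removed)) | set(added)
        snap = Snapshot(
            snapshot_id=len(self._snapshots) + 1,
            parent_id=parent.snapshot_id if parent else None,
            operation=operation,
            data_files=tuple(sorted(files)),
        )
        self._snapshots.append(snap)
        return snap

    def files_as_of(self, snapshot_id):
        """Return the data files that made up the table at a given snapshot."""
        for snap in self._snapshots:
            if snap.snapshot_id == snapshot_id:
                return snap.data_files
        raise KeyError(snapshot_id)

    def log(self):
        return tuple(self._snapshots)


history = TableHistory()
s1 = history.commit("append", added=["orders-0001.parquet"])
s2 = history.commit("append", added=["orders-0002.parquet"])
s3 = history.commit("delete", removed=["orders-0001.parquet"])

for snap in history.log():
    print(snap.snapshot_id, snap.parent_id, snap.operation, snap.data_files)
```

Because a delete creates a new snapshot instead of rewriting an old one, an auditor can still ask what the table contained at snapshot 2, even after the file was removed in snapshot 3.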
Conclusion
Data Lineage and Data Provenance are the foundational pillars of enterprise data trust. Lineage provides the engineering map necessary to debug pipelines and prevent catastrophic breaking changes, while Provenance provides the historical chain of custody necessary for legal compliance and executive confidence. Together, they ensure that an organization’s data remains transparent, reliable, and auditable from its chaotic origin to the final executive dashboard.