Data Lineage & Discovery

Data Lineage is the map of where your data comes from, what happens to it, and where it goes. Data Discovery allows users to find the data they need in a complex organization.

🟢 Level 1: Foundations (The Map)

1. Why Lineage?

Impact Analysis: “If I change this column in the raw DB, what reports will break?”
Troubleshooting: “Why is this dashboard showing the wrong numbers?”
Compliance: Proving to auditors that sensitive data is handled correctly.
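The impact-analysis question above amounts to a graph traversal: start at the changed asset and follow edges downstream. The sketch below is only an illustration. The `LINEAGE` graph, its asset names, and the `downstream` helper are all hypothetical, and a real lineage tool would read this graph from collected metadata.

```python
from collections import deque

# Hypothetical lineage graph: each key feeds the assets listed as its value.
LINEAGE = {
    "raw_db.users.email": ["staging.users.email"],
    "staging.users.email": ["marts.customers.contact_email"],
    "marts.customers.contact_email": [
        "report.churn_dashboard",
        "report.weekly_signups",
    ],
    "raw_db.users.created_at": ["staging.users.created_at"],
    "staging.users.created_at": ["report.weekly_signups"],
}


def downstream(asset, graph):
    """Return every asset reachable from `asset`, in breadth-first order."""
    seen = set()
    order = []
    queue = deque([asset])
    while queue:
        current = queue.popleft()
        for child in graph.get(current, []):
            if child not in seen:
                seen.add(child)
                order.append(child)
                queue.append(child)
    return order


# "If I change this column in the raw DB, what reports will break?"
impacted = downstream("raw_db.users.email", LINEAGE)
reports = [a for a in impacted if a.startswith("report.")]
print(reports)
```

Troubleshooting is the same walk in the opposite direction: invert the edges and start from the dashboard that shows the wrong numbers.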

🟡 Level 2: Modern Lineage Tools

2. OpenLineage

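An open, vendor-neutral standard for collecting data lineage. It defines a common event format describing runs, jobs, and datasets, and integrations exist for many tools (Airflow, Spark, dbt) so that they emit these events as jobs execute.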
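To make the standard concrete, here is a rough sketch of what a lineage event looks like, built as a plain dictionary using only the standard library. The top-level field names follow the OpenLineage run event format as commonly documented. The namespaces, job name, dataset names, producer URL, and the exact spec version in `schemaURL` are assumptions for illustration, and `build_run_event` is a hypothetical helper, not part of any OpenLineage client.

```python
import json
import uuid
from datetime import datetime, timezone


def build_run_event(event_type, run_id, job_name, inputs, outputs):
    """Assemble a dict shaped like an OpenLineage run event (illustrative only)."""
    return {
        "eventType": event_type,  # e.g. START, COMPLETE, FAIL
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "run": {"runId": run_id},
        # Namespaces and names below are made-up example values.
        "job": {"namespace": "example_pipeline", "name": job_name},
        "inputs": [{"namespace": "example_warehouse", "name": n} for n in inputs],
        "outputs": [{"namespace": "example_warehouse", "name": n} for n in outputs],
        "producer": "https://example.com/hypothetical-producer",
        # Assumed spec version; check the version your backend expects.
        "schemaURL": "https://openlineage.io/spec/2-0-2/OpenLineage.json#/$defs/RunEvent",
    }


# A START and a COMPLETE event share one runId so a backend can pair them.
run_id = str(uuid.uuid4())
start = build_run_event("START", run_id, "load_users", ["raw.users"], ["staging.users"])
complete = build_run_event("COMPLETE", run_id, "load_users", ["raw.users"], ["staging.users"])

print(json.dumps(complete, indent=2))
```

In practice an integration builds and sends these events for you. The point here is only that lineage is carried as small JSON documents linking a job run to its input and output datasets.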

3. Data Catalogs

A data catalog is a central UI for browsing datasets along with their owners and descriptions.

Amundsen (originally from Lyft): focuses on search and social features.
DataHub (originally from LinkedIn): models metadata as a graph of entities and the relationships between them.
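At its core, discovery is a search over dataset metadata. The toy sketch below shows that idea with an in-memory list. `CatalogEntry`, `CATALOG`, and `search` are hypothetical names, and this is not how Amundsen or DataHub are implemented; both rely on dedicated search and metadata services.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    name: str
    owner: str
    description: str
    tags: list = field(default_factory=list)


# Made-up example entries.
CATALOG = [
    CatalogEntry(
        "marts.customers", "analytics-team",
        "One row per customer with contact details", ["pii", "gold"],
    ),
    CatalogEntry(
        "staging.orders", "data-platform",
        "Cleaned order events from the checkout service", ["silver"],
    ),
    CatalogEntry(
        "raw.users", "data-platform",
        "Unmodified user records from the application database", ["pii", "bronze"],
    ),
]


def search(catalog, keyword):
    """Case-insensitive substring match on name, description, and tags."""
    needle = keyword.lower()
    return [
        entry for entry in catalog
        if needle in entry.name.lower()
        or needle in entry.description.lower()
        or any(needle in tag.lower() for tag in entry.tags)
    ]


for entry in search(CATALOG, "pii"):
    print(f"{entry.name} (owner: {entry.owner})")
```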

🔴 Level 3: Column-Level Lineage

Column-level lineage is the most granular level of lineage. It traces the flow of a single field (e.g., user_email) through every join and aggregation in your Spark/SQL code.
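As a minimal sketch of tracing a single field, the example below records, for each output column, the columns it was derived from, then walks upstream to the original sources. The `COLUMN_LINEAGE` mapping and all column names are hypothetical, and the helper assumes the mapping has no cycles. Real tools derive this mapping by parsing SQL or inspecting Spark query plans.

```python
# Hypothetical mapping: output column -> columns it is derived from.
COLUMN_LINEAGE = {
    "staging.users.user_email": ["raw.users.email"],
    "marts.customers.user_email": ["staging.users.user_email"],
    "staging.orders.order_id": ["raw.orders.id"],
    "marts.customers.order_count": ["staging.orders.order_id"],
    "report.contacts.email_domain": ["marts.customers.user_email"],
}


def source_columns(column, lineage):
    """Walk upstream until reaching columns with no recorded parents.

    Assumes the lineage mapping is acyclic.
    """
    parents = lineage.get(column)
    if not parents:
        return {column}
    sources = set()
    for parent in parents:
        sources |= source_columns(parent, lineage)
    return sources


print(sorted(source_columns("report.contacts.email_domain", COLUMN_LINEAGE)))
print(sorted(source_columns("marts.customers.order_count", COLUMN_LINEAGE)))
```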

Use OpenLineage integrated with Airflow to capture lineage automatically as tasks run, for the operators the integration supports. This greatly reduces the need to update lineage documentation by hand as your pipeline evolves.
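One way to wire this up is through environment variables read by the Airflow OpenLineage provider. The sketch below only generates the settings. It assumes the `apache-airflow-providers-openlineage` package is installed and that a lineage backend is listening at the placeholder URL. The backend address and namespace value are made up, and the option names should be checked against the provider documentation for your Airflow version.

```python
import json

# Assumed transport settings: send events over HTTP to a lineage backend.
transport = {
    "type": "http",
    "url": "http://localhost:5000",  # placeholder backend address
    "endpoint": "api/v1/lineage",
}

# Environment variables for the Airflow OpenLineage provider (assumed names;
# verify against the provider docs for your version).
env = {
    "AIRFLOW__OPENLINEAGE__TRANSPORT": json.dumps(transport),
    "AIRFLOW__OPENLINEAGE__NAMESPACE": "example_airflow",  # hypothetical namespace
}

# Print shell export lines to paste into the scheduler/worker environment.
for key, value in env.items():
    print(f"export {key}='{value}'")
```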