πŸ“¦ DVC: The Definitive Deep Dive

DVC (Data Version Control) is built to make ML models reproducible. It answers the question β€œHow do I version 1 TB of data in Git?” with lightweight pointer files: Git tracks the pointers, while the data itself lives elsewhere.


🟒 Level 1: Git-like Data Management

DVC mirrors the Git workflow but for large files.

  • .dvc files: Small text files (tracked by Git) that contain the hash of the large data file.
  • Remote storage: The actual large file lives in a remote such as S3, Azure Blob Storage, or Google Cloud Storage, addressed by its hash.
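The pointer itself is tiny. A .dvc file is plain YAML, roughly of this shape (hash and size values here are hypothetical):

```yaml
outs:
- md5: d8e8fca2dc0f896fd7cb4cb0031ba249   # hypothetical hash
  size: 1048576
  path: raw_data.csv
```

Because only this small file is committed, the Git history stays fast no matter how large the data grows.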

1. Simple Workflow

# Track a large file
dvc add raw_data.csv

# This creates 'raw_data.csv.dvc'. Track this in Git!
git add raw_data.csv.dvc .gitignore
git commit -m "add raw data version 1"

# Push the actual data to S3
dvc push
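Conceptually, dvc add hashes the file and writes a small pointer record next to it. A minimal Python sketch of that idea (illustrative only, not DVC's actual implementation):

```python
import hashlib
import tempfile
from pathlib import Path

def make_pointer(data_path: Path) -> dict:
    """Hash a file and return a small pointer record, mimicking
    the role of a .dvc file (illustrative sketch, not DVC's code)."""
    md5 = hashlib.md5(data_path.read_bytes()).hexdigest()
    return {"path": data_path.name, "md5": md5, "size": data_path.stat().st_size}

# Demo: create a small stand-in for a "large file" and build its pointer
with tempfile.TemporaryDirectory() as tmp:
    data = Path(tmp) / "raw_data.csv"
    data.write_text("id,value\n1,42\n")
    pointer = make_pointer(data)
    print(pointer)
```

The pointer (not the data) is what gets committed; the content hash is also what lets dvc push store the file in the remote under a unique, deduplicated address.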

🟑 Level 2: DVC Pipelines (Reproducibility)

DVC isn’t just for storage; it also runs workflows. You define pipeline stages in a dvc.yaml file.

2. The Dependency Graph

A stage in DVC has:

  • deps: Input files; if any of them changes, the stage must re-run.
  • outs: Files produced by the stage, which DVC caches.
  • cmd: The shell command to execute (often a Python script).

For example:
stages:
  train:
    cmd: python src/train.py
    deps:
      - src/train.py
      - data/processed_data.csv
    outs:
      - model/model.pkl

When you run dvc repro, DVC compares the hashes of the deps against the last recorded run (stored in dvc.lock). If nothing changed, it skips the stage and restores the cached outs. This saves hours of compute time.
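The skip decision boils down to a hash comparison. A minimal sketch of the idea (illustrative, not DVC's internals):

```python
import hashlib

def file_hash(content: bytes) -> str:
    """Content hash, standing in for DVC's file hashing."""
    return hashlib.md5(content).hexdigest()

def needs_rerun(deps: dict, lock: dict) -> bool:
    """Re-run a stage iff any dependency's current hash differs
    from the hash recorded at the last successful run."""
    return any(file_hash(content) != lock.get(name)
               for name, content in deps.items())

# First run: record dependency hashes (the role of dvc.lock)
deps = {"src/train.py": b"print('train v1')"}
lock = {name: file_hash(c) for name, c in deps.items()}
print(needs_rerun(deps, lock))  # nothing changed, so the stage is skipped

# Edit a dependency: the hash no longer matches the lock entry
deps["src/train.py"] = b"print('train v2')"
print(needs_rerun(deps, lock))  # changed, so the stage re-runs
```

Because the check is content-based rather than timestamp-based, re-saving an identical file does not trigger a re-run.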


πŸ”΄ Level 3: Advanced Storage (Data Pipelines)

3. Data Registry

Just like a Model Registry, you can create a Data Registry repository that acts as a central hub for shared datasets.
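Consumers typically pull data from the registry with dvc import, which records where the data came from. The resulting .dvc file points back to the registry repository, roughly of this shape (URL, paths, and hashes here are hypothetical):

```yaml
md5: d8e8fca2dc0f896fd7cb4cb0031ba249   # hypothetical
frozen: true
deps:
- path: datasets/images
  repo:
    url: https://github.com/example/data-registry
outs:
- md5: d41d8cd98f00b204e9800998ecf8427e   # hypothetical
  path: images
```

This keeps dataset provenance in Git: every project that imports from the registry records exactly which dataset version it depends on.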

4. Cloud Integration & Security

DVC supports external outputs: an output can live directly on S3 (or another supported remote), so DVC tracks it without ever downloading it to your local machine.
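A stage with an external output might look like this in dvc.yaml (a sketch; the bucket name and script are hypothetical, and external outputs also require a matching external cache to be configured):

```yaml
stages:
  featurize:
    cmd: python src/featurize.py
    deps:
      - src/featurize.py
    outs:
      - s3://my-bucket/features/   # tracked in place, never pulled locally
```

This is useful when an output is too large to fit on a developer machine but still needs to participate in the dependency graph.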