# 📦 DVC: The Definitive Deep Dive
DVC (Data Version Control) is built to make ML models reproducible. It solves the problem of "How do I version 1 TB of data in Git?" by using lightweight pointers.
## 🟢 Level 1: Git-like Data Management
DVC mirrors the Git workflow but for large files.
- **`.dvc` files:** Small text files (tracked by Git) that contain the hash of the large data file.
- **Remote cache:** The actual 1 TB file is stored in S3, Azure Blob Storage, or Google Cloud Storage.
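For illustration, a `.dvc` pointer file is just a few lines of YAML; the hash and size values below are made up:

```yaml
outs:
- md5: 1a2b3c4d5e6f7a8b9c0d1e2f3a4b5c6d
  size: 1073741824
  path: raw_data.csv
```

Git versions this tiny file, while the 1 GB `raw_data.csv` it points to lives in the cache and remote storage.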
### 1. Simple Workflow
```shell
# Track a large file
dvc add raw_data.csv

# This creates 'raw_data.csv.dvc'. Track this in Git!
git add raw_data.csv.dvc .gitignore
git commit -m "add raw data version 1"

# Push the actual data to S3
dvc push
```

## 🟡 Level 2: DVC Pipelines (Reproducibility)
DVC isn't just for storage; it's for *workflows*. You define pipeline steps in a `dvc.yaml` file.
### 2. The Dependency Graph
A stage in DVC has:

- `deps`: Files that, when changed, trigger a re-run.
- `outs`: Files created by the stage.
- `cmd`: The command to run.
```yaml
stages:
  train:
    cmd: python src/train.py
    deps:
      - src/train.py
      - data/processed_data.csv
    outs:
      - model/model.pkl
```

When you run `dvc repro`, DVC checks whether any `deps` have changed. If not, it skips the stage and restores the cached `outs`. This saves hours of compute time.
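DVC's real implementation is more involved, but the core skip logic can be sketched in a few lines of Python (the `repro` function and `run_stage` callback here are hypothetical teaching names, not DVC's API):

```python
import hashlib
import os
import tempfile
from pathlib import Path

def md5_of(path: str) -> str:
    """Hash a file's contents, as DVC does for each dependency."""
    return hashlib.md5(Path(path).read_bytes()).hexdigest()

def repro(deps: list[str], lock: dict[str, str], run_stage) -> dict[str, str]:
    """Re-run the stage only if some dependency hash differs from the lock."""
    current = {d: md5_of(d) for d in deps}
    if current != lock:          # a dep changed (or the stage never ran)
        run_stage()              # the expensive work happens only here
    return current               # new lock values to store

# Tiny demo: a temp file stands in for a tracked dependency
tmp = tempfile.mkdtemp()
dep = os.path.join(tmp, "data.csv")
Path(dep).write_text("v1")

runs = []
lock = repro([dep], {}, lambda: runs.append("ran"))    # first run: executes
lock = repro([dep], lock, lambda: runs.append("ran"))  # unchanged: skipped
Path(dep).write_text("v2")
lock = repro([dep], lock, lambda: runs.append("ran"))  # changed: executes
print(runs)  # ['ran', 'ran']
```

The same idea, applied per stage across the whole dependency graph, is what lets `dvc repro` rebuild only what is stale.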
## 🔴 Level 3: Advanced Storage (Data Pipelines)
### 3. Data Registry
Just like a Model Registry, you can create a Data Registry repository that acts as a central hub for shared datasets.
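Consumers can then pull datasets straight out of that registry repo with `dvc get` (a one-off copy) or `dvc import` (a tracked link). The repo URL and dataset path below are placeholders:

```shell
# One-off download; no DVC metadata is kept locally
dvc get https://github.com/example/dataset-registry data/images

# Import: records the source repo and revision in a .dvc file,
# so `dvc update` can later pull a newer version
dvc import https://github.com/example/dataset-registry data/images
```

The import route is what makes a registry useful: downstream projects know exactly which revision of the dataset they depend on.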
### 4. Cloud Integration & Security
DVC supports *external outputs*: you can tell DVC that an output lives directly on S3, without ever downloading it to your local machine.
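As a sketch, an external output is declared by giving the full cloud URL in `outs`; the bucket and paths below are placeholders, and DVC additionally requires an external cache configured on the same storage for this to work:

```yaml
stages:
  process:
    cmd: python src/process.py
    outs:
      - s3://my-bucket/processed/data.parquet
```

This keeps huge intermediate artifacts entirely in the cloud while DVC still tracks their hashes for reproducibility.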