Introduction
Git is terrible at large files. DVC solves this: big files stay in object storage (S3, GCS, Azure, SSH, local drives) while small .dvc pointer files go into Git. Checkouts restore the exact data that matches your commit, so experiments are fully reproducible.
With over 14,000 GitHub stars, DVC is used by ML teams at every scale — from solo researchers to companies running hundreds of experiments per day.
What DVC Does
DVC handles three things: (1) versioning large datasets/models alongside Git, (2) building reproducible pipelines (dvc.yaml defines stages with inputs/outputs), and (3) tracking experiments with metrics and parameters. Checkouts sync both code and data to the right state.
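For the versioning piece, `dvc add` moves the large file into DVC's cache and leaves a small pointer file for Git to track. A sketch of what such a pointer looks like (digest and size values are illustrative, not real output):

```yaml
# data/train.csv.dvc — committed to Git; the CSV itself lives in the cache/remote
outs:
- md5: 22a1a2931c8370d3aeedd7183606fd7f   # illustrative content digest
  size: 14445097                          # illustrative size in bytes
  path: train.csv
```

On checkout, `dvc checkout` reads this digest and restores the matching blob, which is how code and data stay in sync per commit.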
Architecture Overview
```
Git Repo (small)               Remote Storage (large)
+---------------------+        +----------------------+
| train.py            |        | .../a1/b2c3d4...     |  <-- content-addressed
| dvc.yaml            |        | .../e5/f6g7h8...     |      blobs
| params.yaml         |        |                      |
| data/train.csv.dvc  | <----- |                      |
| models/m.pkl.dvc    |        |                      |
+---------------------+        +----------------------+
          ^                              ^
          | git pull                     | dvc pull
          v                              v
          [Developer workspace]
```

Self-Hosting & Configuration
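Remote storage is plain repo configuration, written by `dvc remote add` into `.dvc/config` (which is committed to Git). A minimal sketch, assuming a hypothetical S3 bucket named `mybucket`:

```ini
# .dvc/config — produced by: dvc remote add -d storage s3://mybucket/dvcstore
[core]
    remote = storage
['remote "storage"']
    url = s3://mybucket/dvcstore
```

After this, `dvc push` uploads cached blobs to the remote and `dvc pull` fetches the ones your current commit needs.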
```yaml
# dvc.yaml — reproducible pipeline
stages:
  prepare:
    cmd: python src/prepare.py
    deps: [src/prepare.py, data/raw.csv]
    outs: [data/prepared.csv]
  train:
    cmd: python src/train.py
    deps: [src/train.py, data/prepared.csv]
    params: [train.lr, train.epochs]
    outs: [models/model.pkl]
    metrics:
      - metrics.json:
          cache: false
  evaluate:
    cmd: python src/eval.py
    deps: [src/eval.py, models/model.pkl]
    metrics:
      - eval.json:
          cache: false
```

```
dvc repro
dvc exp run -S train.lr=0.01
dvc exp show
dvc plots show eval.json
```

Key Features
- Git-compatible — works alongside any Git workflow
- Storage backends — S3, GCS, Azure, SSH, HDFS, HTTP, local
- Pipelines — DAG of stages, deps, outs, params, metrics
- Experiment tracking — named runs, metric tables, cross-experiment diffs
- Plot comparison — diff metrics across commits/branches/experiments
- Studio UI — web dashboard for teams (cloud or self-hosted)
- CI/CD hooks — integrates with GitHub Actions, GitLab CI (CML)
- Deduplication — content-addressed storage, same file stored once
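The deduplication bullet follows from content addressing: a file's digest determines its cache path, so identical content collapses to a single stored object. A minimal Python sketch of the idea (the `cache_path` helper is hypothetical; DVC's real cache layout has more structure):

```python
import hashlib

def cache_path(data: bytes) -> str:
    """Map file content to a content-addressed cache path:
    first two hex chars of the digest become a directory,
    the remainder becomes the filename."""
    digest = hashlib.md5(data).hexdigest()
    return f"{digest[:2]}/{digest[2:]}"

# Identical content always maps to the same path, so it is stored once,
# no matter how many commits or filenames reference it.
assert cache_path(b"hello") == cache_path(b"hello")
```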
Comparison with Similar Tools
| Feature | DVC | Git LFS | MLflow | Pachyderm | W&B |
|---|---|---|---|---|---|
| Data versioning | Yes | Yes | Via artifacts | Yes (k8s) | Via artifacts |
| Pipelines | Yes | No | Yes | Yes | No |
| Experiment tracking | Yes | No | Yes (focus) | No | Yes (focus) |
| Git-native | Yes | Yes | No | No | No |
| Storage choice | Many | Git hosting | Many | k8s-centric | Managed |
| Best For | Git + reproducible ML | Any large files | Experiment logs | Kubernetes ML | Deep learning logs |
FAQ
Q: DVC vs MLflow — do I pick one?
A: DVC versions data and defines pipelines in Git. MLflow tracks training runs and models. Many teams use both.
Q: Will DVC slow down my Git repo?
A: No. Git only stores small .dvc pointer files. Actual data lives in remote storage and syncs via dvc push/pull.
Q: Is DVC enough for experiment tracking?
A: dvc exp is good for small-to-mid teams. For rich dashboards across many projects, pair with Studio, MLflow, or W&B.
Q: Does DVC work with GitHub-only workflows?
A: Yes. CML (Continuous Machine Learning) automates dvc repro + metric diffing in GitHub Actions PR comments.
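As a sketch of that workflow (action versions and step details are illustrative; check the CML docs for current syntax):

```yaml
# .github/workflows/cml.yaml — illustrative PR-automation sketch
name: train
on: push
jobs:
  train:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: iterative/setup-cml@v2
      - run: |
          pip install dvc
          dvc repro
          echo "## Metrics" > report.md
          dvc metrics diff main >> report.md
          cml comment create report.md
        env:
          REPO_TOKEN: ${{ secrets.GITHUB_TOKEN }}
```

The effect is that every push reruns the pipeline and posts a metric diff against `main` as a PR comment, so reviewers see model impact alongside code changes.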
Sources
- GitHub: https://github.com/iterative/dvc
- Docs: https://dvc.org
- Company: Iterative
- License: Apache-2.0