# DVC: Data Version Control for Machine Learning

> DVC brings Git-like versioning to datasets, models, and ML pipelines. Large files live in S3/GCS/Azure while lightweight metafiles are tracked in Git, giving you reproducible experiments and auditable model lineage.

## Quick Use

```bash
pip install "dvc[s3]"    # quotes keep shells such as zsh from expanding the brackets
git init && dvc init
dvc remote add -d storage s3://my-bucket/dvcstore
dvc add data/train.csv   # writes the pointer file data/train.csv.dvc
git add data/train.csv.dvc data/.gitignore .dvc/config
git commit -m "track training data"
dvc push                 # upload to remote storage
```

## Introduction

Git handles large files poorly. DVC solves this: big files stay in remote storage (S3, GCS, Azure, SSH, local drives) while small `.dvc` pointer files go into Git. After you check out a commit, `dvc checkout` restores the exact data that matches it, so experiments are reproducible.

With over 14,000 GitHub stars, DVC is used by ML teams at every scale, from solo researchers to companies running hundreds of experiments per day.

## What DVC Does

DVC handles three things: (1) versioning large datasets and models alongside Git, (2) building reproducible pipelines (`dvc.yaml` defines stages with inputs and outputs), and (3) tracking experiments with metrics and parameters. Checkouts sync both code and data to the right state.

## Architecture Overview

```
Git Repo (small)               Remote Storage (large)
+--------------------+         +----------------------+
| train.py           |         | .../a1/b2c3d4...     |  <-- content-addressed
| dvc.yaml           |         | .../e5/f6g7h8...     |      blobs
| params.yaml        |         |                      |
| data/train.csv.dvc | <-----  |                      |
| models/m.pkl.dvc   |         |                      |
+--------------------+         +----------------------+
          ^                               ^
          | git pull                      | dvc pull
          v                               v
                 [Developer workspace]
```

## Self-Hosting & Configuration

```yaml
# dvc.yaml: reproducible pipeline
stages:
  prepare:
    cmd: python src/prepare.py
    deps: [src/prepare.py, data/raw.csv]
    outs: [data/prepared.csv]
  train:
    cmd: python src/train.py
    deps: [src/train.py, data/prepared.csv]
    params: [train.lr, train.epochs]
    outs: [models/model.pkl]
    metrics:
      - metrics.json:
          cache: false
  evaluate:
    cmd: python src/eval.py
    deps: [src/eval.py, models/model.pkl]
    metrics:
      - eval.json:
          cache: false
```

```bash
dvc repro                        # run stages whose dependencies changed
dvc exp run -S train.lr=0.01     # run an experiment with an overridden parameter
dvc exp show                     # tabulate experiments, params, and metrics
dvc plots show eval.json
```

## Key Features

- **Git-compatible**: works alongside any Git workflow
- **Storage backends**: S3, GCS, Azure, SSH, HDFS, HTTP, local
- **Pipelines**: DAG of stages with deps, outs, params, and metrics
- **Experiment tracking**: named runs, metric tables, cross-experiment diffs
- **Plot comparison**: diff metrics across commits, branches, and experiments
- **Studio UI**: web dashboard for teams (cloud or self-hosted)
- **CI/CD hooks**: integrates with GitHub Actions and GitLab CI through CML
- **Deduplication**: content-addressed storage, so the same file is stored once

## Comparison with Similar Tools

| Feature | DVC | Git LFS | MLflow | Pachyderm | W&B |
|---|---|---|---|---|---|
| Data versioning | Yes | Yes | Via artifacts | Yes (k8s) | Via artifacts |
| Pipelines | Yes | No | Yes | Yes | No |
| Experiment tracking | Yes | No | Yes (focus) | No | Yes (focus) |
| Git-native | Yes | Yes | No | No | No |
| Storage choice | Many | Git hosting | Many | k8s-centric | Managed |
| Best For | Git + reproducible ML | Any large files | Experiment logs | Kubernetes ML | Deep learning logs |

## FAQ

**Q: DVC vs MLflow: do I pick one?**
A: DVC versions data and defines pipelines in Git. MLflow tracks training runs and models. Many teams use both.

**Q: Will DVC slow down my Git repo?**
A: No. Git only stores small `.dvc` pointer files. The actual data lives in remote storage and syncs via `dvc push` and `dvc pull`.

**Q: Is DVC enough for experiment tracking?**
A: `dvc exp` is good for small-to-mid teams. For rich dashboards across many projects, pair it with Studio, MLflow, or W&B.

**Q: Does DVC work with GitHub-only workflows?**
A: Yes. CML (Continuous Machine Learning) can run `dvc repro` in GitHub Actions and post metric diffs as PR comments.

## Sources

- GitHub: https://github.com/iterative/dvc
- Docs: https://dvc.org
- Company: Iterative
- License: Apache-2.0

---

Source: https://tokrepo.com/en/workflows/696434ff-37b5-11f1-9bc6-00163e2b0d79
Author: Script Depot
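
The content-addressed blobs in the Architecture Overview, and the deduplication listed under Key Features, can be illustrated with a minimal Python sketch. It assumes MD5 checksums and a two-character prefix directory, mirroring the `.../a1/b2c3d4...` paths in the diagram. The helpers `file_md5` and `store` are hypothetical names invented for this illustration; they are not part of DVC's API and do not reproduce its actual cache implementation.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def file_md5(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large files are never read into memory at once."""
    digest = hashlib.md5()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def store(path: Path, cache_dir: Path) -> Path:
    """Copy a file into a content-addressed cache and return the blob path.

    Hypothetical helper: the blob lives at <cache>/<first 2 hex chars>/<rest>,
    an assumed layout based on the diagram, not DVC's real code.
    """
    checksum = file_md5(path)
    blob = cache_dir / checksum[:2] / checksum[2:]
    if not blob.exists():  # identical content is stored only once
        blob.parent.mkdir(parents=True, exist_ok=True)
        shutil.copyfile(path, blob)
    return blob


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "train.csv").write_bytes(b"x,y\n1,2\n")
    (root / "copy.csv").write_bytes(b"x,y\n1,2\n")   # same content, different name
    (root / "other.csv").write_bytes(b"x,y\n3,4\n")  # different content

    cache = root / "cache"
    blobs = [store(root / name, cache) for name in ("train.csv", "copy.csv", "other.csv")]
    stored_files = [p for p in cache.rglob("*") if p.is_file()]
    print(len(stored_files))  # 2: the duplicate content maps to the same blob
```

Because the blob path depends only on file content, renaming or copying a dataset adds no new storage, and a pointer file only needs to record the checksum to locate its data.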