Skills · Apr 14, 2026 · 3 min read

DVC — Data Version Control for Machine Learning

DVC brings Git-like versioning to datasets, models, and ML pipelines. Large files live in S3/GCS/Azure while lightweight metafiles are tracked in Git — giving you reproducible experiments and auditable model lineage.

Script Depot · Community

Agent ready

Ready-to-run agent install

This asset can be installed after the agent chooses its runtime, checks the plan, and runs the matching command.

Native · 98/100 · Policy: allow

Agent surface: Any MCP/CLI agent
Kind: Skill
Install: Single
Trust: Established
Entrypoint: step-1.md

Direct install command

npx -y tokrepo@latest install 696434ff-37b5-11f1-9bc6-00163e2b0d79 --target codex

Run after dry-run confirms the install plan.

TL;DR

DVC versions datasets and ML models with Git-like commands while storing large files in cloud storage.

§01

What it is

DVC (Data Version Control) brings Git-like versioning to datasets, machine learning models, and ML pipelines. Large binary files are stored in remote storage (S3, GCS, Azure, SSH, HDFS) while lightweight pointer files (.dvc files) are tracked in Git. This gives you reproducible experiments, auditable model lineage, and the ability to switch between data versions with git checkout.

DVC targets ML engineers, data scientists, and teams that need to track data and model artifacts alongside code without bloating the Git repository.

§02

How it saves time or tokens

Without DVC, teams resort to naming files model_v2_final_FINAL.pkl, storing data on shared drives with no versioning, or using custom scripts to manage artifacts. DVC integrates with Git to provide dvc push, dvc pull, and dvc checkout commands that work like their Git equivalents for data. Reproducing an experiment from six months ago is a git checkout plus dvc checkout away.
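The "git checkout plus dvc checkout" step can be sketched as a small shell helper. This is only an illustration: `restore_snapshot` is a hypothetical function name, not a DVC command, and the tag `v1.0` in the usage comment is made up.

```shell
#!/bin/sh
# restore_snapshot is a hypothetical helper, not part of DVC.
# It restores code and data as they were at a given Git revision.
restore_snapshot() {
  rev="$1"
  # Check out the code and the .dvc pointer files recorded at that revision.
  git checkout "$rev" || return 1
  # Make the workspace data match those pointer files, using the local cache.
  dvc checkout
}

# Usage (hypothetical tag name):
#   restore_snapshot v1.0
#   dvc pull   # only needed when the data is not in the local cache yet
```

Keeping the two commands in one helper makes it harder to end up with code from one revision and data from another.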

§03

How to use

Install DVC: pip install "dvc[s3]" (or "dvc[gs]" for Google Cloud Storage, "dvc[azure]" for Azure). The quotes keep shells such as zsh from interpreting the brackets.
Initialize in a Git repo: git init && dvc init. Then add a default remote: dvc remote add -d storage s3://my-bucket/dvcstore.
Track data: dvc add data/train.csv, commit the generated .dvc file and .gitignore to Git, then upload the data: dvc push.

§04

Example

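# Install DVC with S3 support and initialize it in a Git repository
pip install "dvc[s3]"
git init && dvc init
git commit -m 'initialize DVC'

# Configure remote storage (-d makes it the default remote)
dvc remote add -d storage s3://my-bucket/dvcstore
git add .dvc/config
git commit -m 'configure DVC remote'

# Track a large dataset; DVC writes data/train.csv.dvc and data/.gitignore
dvc add data/train.csv
git add data/train.csv.dvc data/.gitignore
git commit -m 'track training data'

# Push data to remote storage
dvc push

# On another machine, clone the repo, enter it, and pull the data
git clone <repo>
cd <repo-dir>
dvc pull

# Create a reproducible pipeline stage, then run it
# (dvc stage add replaces the older dvc run, which recent DVC releases no longer provide)
dvc stage add -n train -d data/train.csv -d train.py -o model.pkl \
  python train.py
dvc repro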

§05

Related on TokRepo

Automation Tools -- ML pipeline and build tools
Coding Tools -- Developer productivity libraries

§06

Common pitfalls

Forgetting to run dvc push after dvc add. The data is tracked locally but not uploaded to remote storage, so teammates cannot pull it.
Not committing the .dvc file and updated .gitignore to Git. Without the pointer file in Git history, you lose the ability to reproduce that data version.
Using DVC without configuring a remote for team projects. Local-only DVC tracking works for solo use but breaks collaboration.
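One way to catch these pitfalls before sharing work is a small pre-flight check. The sketch below is an assumption about how a team might wire this up; `preflight` is a hypothetical helper name, and it only uses `dvc remote list`, `git status --short`, and `dvc status --cloud`.

```shell
#!/bin/sh
# preflight is a hypothetical helper, not part of DVC.
preflight() {
  # Pitfall 3: fail early when no DVC remote is configured.
  if [ -z "$(dvc remote list)" ]; then
    echo "No DVC remote configured; teammates cannot pull this data." >&2
    return 1
  fi
  # Pitfall 2: list uncommitted changes, including new .dvc files
  # and .gitignore updates that still need a git add and git commit.
  git status --short
  # Pitfall 1: compare the local cache with the default remote to show
  # data that has not been pushed yet.
  dvc status --cloud
}

# Usage, from the repository root:
#   preflight
# Optional: "dvc install" sets up Git hooks, including a pre-push hook
# that runs dvc push, so the upload is harder to forget.
```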

Frequently Asked Questions

How does DVC handle large files?

DVC copies or links large files into a local cache (`.dvc/cache/`) and records each one in a small pointer file (a `.dvc` file) that contains its content hash. The pointer file is committed to Git, while `dvc push` uploads the cached data to a configured remote (S3, GCS, Azure). `dvc pull` uses the hash in the pointer file to download the matching data from the remote.
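The idea of a hash-based pointer can be illustrated with a toy model in plain shell. This is not DVC's implementation: the `.cache` directory layout, the `.ptr` file name, and the `cache_path_for` helper are assumptions made for the sketch, and `md5sum` from GNU coreutils is assumed to be available.

```shell
#!/bin/sh
# Toy model of content-addressed storage; NOT how DVC is implemented.
workdir="$(mktemp -d)"
cd "$workdir" || exit 1
mkdir -p data
printf 'id,label\n1,cat\n2,dog\n' > data/train.csv

# Map a hash to a cache path: the first two characters become a directory.
cache_path_for() {
  printf '.cache/%s/%s' "$(printf '%s' "$1" | cut -c1-2)" "$(printf '%s' "$1" | cut -c3-)"
}

# 1. Hash the file contents; the hash becomes the file's identity.
hash="$(md5sum data/train.csv | cut -d' ' -f1)"

# 2. Copy the contents into the cache under a path derived from the hash.
cache_path="$(cache_path_for "$hash")"
mkdir -p "$(dirname "$cache_path")"
cp data/train.csv "$cache_path"

# 3. Write a small pointer file that records only the hash and the path.
cat > data/train.csv.ptr <<EOF
outs:
- md5: $hash
  path: train.csv
EOF

# 4. Simulate a fresh checkout: the data is gone, only the pointer remains.
rm data/train.csv

# 5. Restore the data by looking up the hash stored in the pointer file.
wanted="$(sed -n 's/^- md5: //p' data/train.csv.ptr)"
cp "$(cache_path_for "$wanted")" data/train.csv
cat data/train.csv
```

Because the pointer holds only a hash, it stays a few lines long no matter how large the data file is, which is what keeps the Git repository small.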

Can DVC define ML pipelines?

Yes. DVC pipelines are defined in a `dvc.yaml` file where each stage specifies dependencies (data files, scripts), outputs (models, metrics), and commands. Running `dvc repro` re-executes only the stages whose dependencies have changed, similar to Make.
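As a rough sketch, a `dvc.yaml` for a single stage might look like the file written below. It reuses the `train` stage, `data/train.csv`, `train.py`, and `model.pkl` names from the Example section; a real project would likely have more stages, and the snippet overwrites any existing `dvc.yaml` in the current directory.

```shell
#!/bin/sh
# Write a minimal single-stage pipeline definition.
# Run from the repository root; this overwrites an existing dvc.yaml.
cat > dvc.yaml <<'EOF'
stages:
  train:
    cmd: python train.py
    deps:
      - data/train.csv
      - train.py
    outs:
      - model.pkl
EOF

# Inside a DVC project you would then run:
#   dvc repro   # executes train only if a dependency changed since the last run
```

Editing `train.py` or replacing `data/train.csv` changes a dependency hash, so the next `dvc repro` reruns the stage; with no changes, the stage is skipped.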

Does DVC work with any Git hosting?

Yes. DVC uses Git for metadata (pointer files) and any Git host (GitHub, GitLab, Bitbucket) works. The data storage is separate and configured as a DVC remote. This means you can use GitHub for code and S3 for data without conflicts.

How does DVC compare to Git LFS?

Git LFS stores large files on an LFS server that is usually tied to your Git host, where storage and bandwidth quotas can become expensive. DVC stores data on any storage backend you control (S3, GCS, your own server), so size limits are those of the backend rather than of the Git host. DVC also adds pipeline tracking and experiment management, which Git LFS does not provide.

Can I use DVC for experiment tracking?

Yes. DVC Experiments lets you run and compare experiments with different parameters. Use `dvc exp run --set-param lr=0.01` to run variations, then `dvc exp show` to compare metrics across experiments in a table. Each experiment is recorded against the Git commit it started from, so results stay tied to a specific version of the code and data.
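A small parameter sweep could be scripted along these lines. `sweep_lr` is a hypothetical helper name, and the sketch assumes the project already has a `dvc.yaml` pipeline that reads an `lr` parameter from `params.yaml`.

```shell
#!/bin/sh
# sweep_lr is a hypothetical helper, not part of DVC.
# It runs one experiment per learning rate, then prints the comparison table.
sweep_lr() {
  for lr in "$@"; do
    # --set-param overrides the tracked parameter for this run only.
    dvc exp run --set-param "lr=$lr" || return 1
  done
  dvc exp show
}

# Usage:
#   sweep_lr 0.1 0.01 0.001
```

Stopping at the first failed run keeps a broken configuration from filling the table with partial results.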

