Scripts · April 14, 2026 · 1 min read

DVC — Data Version Control for Machine Learning

DVC brings Git-like versioning to datasets, models, and ML pipelines. Large files live in S3/GCS/Azure while lightweight metafiles are tracked in Git — giving you reproducible experiments and auditable model lineage.

Introduction

Git is terrible at large files. DVC solves this: big files stay in object storage (S3, GCS, Azure, SSH, local drives) while small .dvc pointer files go into Git. Checkouts restore the exact data that matches your commit, so experiments are fully reproducible.
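For context, a .dvc pointer file is just a few lines of YAML recording the tracked file's hash, size, and path. A sketch of what one looks like (the hash and size below are placeholders; DVC fills in real values on `dvc add`):

```yaml
# data/train.csv.dvc — the small pointer file Git tracks
outs:
- md5: 1a2b3c4d5e6f7a8b9c0d1e2f3a4b5c6d   # placeholder hash
  size: 104857600                          # placeholder size in bytes
  path: train.csv
```

The actual data lives in the cache/remote under a path derived from that hash, which is what makes checkouts exact.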

With over 14,000 GitHub stars, DVC is used by ML teams at every scale — from solo researchers to companies running hundreds of experiments per day.

What DVC Does

DVC handles three things: (1) versioning large datasets/models alongside Git, (2) building reproducible pipelines (dvc.yaml defines stages with inputs/outputs), and (3) tracking experiments with metrics and parameters. Checkouts sync both code and data to the right state.
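The stage ordering DVC derives from deps and outs can be sketched in a few lines of Python. Stage and file names below are illustrative (they mirror the dvc.yaml later in this article), and the standard library's `graphlib` stands in for DVC's own DAG logic:

```python
# Each stage declares deps (inputs) and outs (outputs); a stage must run
# after whichever stage produces one of its deps.
from graphlib import TopologicalSorter

stages = {
    "prepare":  {"deps": ["data/raw.csv"],      "outs": ["data/prepared.csv"]},
    "train":    {"deps": ["data/prepared.csv"], "outs": ["models/model.pkl"]},
    "evaluate": {"deps": ["models/model.pkl"],  "outs": ["eval.json"]},
}

# Map each output file to the stage that produces it.
producer = {out: name for name, s in stages.items() for out in s["outs"]}

# Stage-level edges: a stage depends on the producers of its deps.
graph = {
    name: {producer[d] for d in s["deps"] if d in producer}
    for name, s in stages.items()
}

order = list(TopologicalSorter(graph).static_order())
print(order)  # prepare runs before train, train before evaluate
```

`dvc repro` does the equivalent walk, but also skips any stage whose deps are unchanged.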

Architecture Overview

Git Repo (small)             Remote Storage (large)
+-------------------+        +----------------------+
| train.py          |        | .../a1/b2c3d4...     |  <-- content-addressed
| dvc.yaml          |        | .../e5/f6g7h8...     |      blobs
| params.yaml       |        |                      |
| data/train.csv.dvc|        |                      |
| models/m.pkl.dvc  |        |                      |
+-------------------+        +----------------------+
        ^                               ^
        | git pull                      | dvc pull
        v                               v
             [Developer workspace]

Configuration & Usage

# dvc.yaml — reproducible pipeline
stages:
  prepare:
    cmd: python src/prepare.py
    deps: [src/prepare.py, data/raw.csv]
    outs: [data/prepared.csv]

  train:
    cmd: python src/train.py
    deps: [src/train.py, data/prepared.csv]
    params: [train.lr, train.epochs]
    outs: [models/model.pkl]
    metrics:
      - metrics.json:
          cache: false

  evaluate:
    cmd: python src/eval.py
    deps: [src/eval.py, models/model.pkl]
    metrics:
      - eval.json:
          cache: false

# Typical workflow
dvc repro                        # run the pipeline, skipping unchanged stages
dvc exp run -S train.lr=0.01     # run an experiment with a param override
dvc exp show                     # compare experiments in a metrics table
dvc plots show eval.json         # render plots from the evaluation output
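What `dvc exp show` and `dvc metrics diff` report boils down to comparing metric files across runs. A rough sketch of that comparison, with made-up metric names and values:

```python
# Compare two flat metrics.json payloads (baseline vs. experiment) and
# report old/new/delta per metric, roughly as `dvc metrics diff` does.
import json

baseline = json.loads('{"accuracy": 0.91, "loss": 0.34}')
experiment = json.loads('{"accuracy": 0.93, "loss": 0.29}')

diff = {
    k: {"old": baseline.get(k), "new": v,
        "delta": round(v - baseline.get(k, 0), 4)}
    for k, v in experiment.items()
}
print(diff["accuracy"]["delta"])  # 0.02
```

DVC reads these files from Git revisions (commits, branches, experiments), so the same diff works across history, not just the workspace.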

Key Features

  • Git-compatible — works alongside any Git workflow
  • Storage backends — S3, GCS, Azure, SSH, HDFS, HTTP, local
  • Pipelines — DAG of stages, deps, outs, params, metrics
  • Experiment tracking — named runs, metric tables, cross-experiment diffs
  • Plot comparison — diff metrics across commits/branches/experiments
  • Studio UI — web dashboard for teams (cloud or self-hosted)
  • CI/CD hooks — integrates with GitHub Actions, GitLab CI (CML)
  • Deduplication — content-addressed storage, same file stored once
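The deduplication point can be illustrated with a tiny content-addressed cache in Python. The two-level hash-prefix layout mirrors the blob paths in the architecture diagram above, but this is a toy sketch, not DVC's implementation:

```python
# A file's cache path is derived from its content hash, so identical
# content is stored exactly once no matter how many datasets reference it.
import hashlib
import tempfile
from pathlib import Path

def cache_path(root: Path, data: bytes) -> Path:
    digest = hashlib.md5(data).hexdigest()
    return root / digest[:2] / digest[2:]   # e.g. root/a1/b2c3d4...

def store(root: Path, data: bytes) -> Path:
    path = cache_path(root, data)
    if not path.exists():                   # dedup: skip if already cached
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(data)
    return path

root = Path(tempfile.mkdtemp())
p1 = store(root, b"same bytes")
p2 = store(root, b"same bytes")             # second write hits the cache
print(p1 == p2)  # True: one blob for identical content
```

This is also why renaming or copying a tracked file costs nothing in remote storage: only the pointer changes.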

Comparison with Similar Tools

Feature              | DVC                   | Git LFS         | MLflow          | Pachyderm     | W&B
---------------------|-----------------------|-----------------|-----------------|---------------|-------------------
Data versioning      | Yes                   | Yes             | Via artifacts   | Yes (k8s)     | Via artifacts
Pipelines            | Yes                   | No              | Yes             | Yes           | No
Experiment tracking  | Yes                   | No              | Yes (focus)     | No            | Yes (focus)
Git-native           | Yes                   | Yes             | No              | No            | No
Storage choice       | Many                  | Git hosting     | Many            | k8s-centric   | Managed
Best for             | Git + reproducible ML | Any large files | Experiment logs | Kubernetes ML | Deep learning logs

FAQ

Q: DVC vs MLflow — do I pick one? A: DVC versions data and defines pipelines in Git. MLflow tracks training runs and models. Many teams use both.

Q: Will DVC slow down my Git repo? A: No. Git only stores small .dvc pointer files. Actual data lives in remote storage and syncs via dvc push/pull.

Q: Is DVC enough for experiment tracking? A: dvc exp is good for small-to-mid teams. For rich dashboards across many projects, pair with Studio, MLflow, or W&B.

Q: Does DVC work with GitHub-only workflows? A: Yes. CML (Continuous Machine Learning) automates dvc repro + metric diffing in GitHub Actions PR comments.
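A minimal GitHub Actions workflow along these lines might look like the sketch below. Action versions, the `main` baseline branch, and the report filename are assumptions; adapt them to your repository:

```yaml
# .github/workflows/train.yml — illustrative sketch, not a drop-in file
name: train
on: [push]
jobs:
  run:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: iterative/setup-dvc@v1
      - uses: iterative/setup-cml@v2
      - run: |
          dvc pull                              # fetch data for this commit
          dvc repro                             # rerun changed stages
          dvc metrics diff main --md >> report.md
          cml comment create report.md          # post metrics as a PR comment
        env:
          REPO_TOKEN: ${{ secrets.GITHUB_TOKEN }}
```

Remote-storage credentials (e.g. AWS keys for an S3 remote) would also need to be provided as secrets for `dvc pull` to work in CI.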
