Scripts · April 14, 2026 · 1 min read

DVC — Data Version Control for Machine Learning

DVC brings Git-like versioning to datasets, models, and ML pipelines. Large files live in S3/GCS/Azure while lightweight metafiles are tracked in Git — giving you reproducible experiments and auditable model lineage.

Introduction

Git is terrible at large files. DVC solves this: big files stay in object storage (S3, GCS, Azure, SSH, local drives) while small .dvc pointer files go into Git. Checkouts restore the exact data that matches your commit, so experiments are fully reproducible.
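For context, a .dvc pointer file is just a few lines of YAML recording the tracked file's hash, size, and path. A sketch of what one looks like (the hash and size below are placeholders; DVC fills in real values on `dvc add`):

```yaml
# data/train.csv.dvc — the small pointer file Git tracks
outs:
- md5: 1a2b3c4d5e6f7a8b9c0d1e2f3a4b5c6d   # placeholder hash
  size: 104857600                          # placeholder size in bytes
  path: train.csv
```

The actual data lives in the cache/remote under a path derived from that hash, which is what makes checkouts exact.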

With over 14,000 GitHub stars, DVC is used by ML teams at every scale — from solo researchers to companies running hundreds of experiments per day.

What DVC Does

DVC handles three things: (1) versioning large datasets/models alongside Git, (2) building reproducible pipelines (dvc.yaml defines stages with inputs/outputs), and (3) tracking experiments with metrics and parameters. Checkouts sync both code and data to the right state.
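The stage ordering DVC derives from deps and outs can be sketched in a few lines of Python. Stage and file names below are illustrative (they mirror the dvc.yaml later in this article), and the standard library's `graphlib` stands in for DVC's own DAG logic:

```python
# Each stage declares deps (inputs) and outs (outputs); a stage must run
# after whichever stage produces one of its deps.
from graphlib import TopologicalSorter

stages = {
    "prepare":  {"deps": ["data/raw.csv"],      "outs": ["data/prepared.csv"]},
    "train":    {"deps": ["data/prepared.csv"], "outs": ["models/model.pkl"]},
    "evaluate": {"deps": ["models/model.pkl"],  "outs": ["eval.json"]},
}

# Map each output file to the stage that produces it.
producer = {out: name for name, s in stages.items() for out in s["outs"]}

# Stage-level edges: a stage depends on the producers of its deps.
graph = {
    name: {producer[d] for d in s["deps"] if d in producer}
    for name, s in stages.items()
}

order = list(TopologicalSorter(graph).static_order())
print(order)  # prepare runs before train, train before evaluate
```

`dvc repro` does the equivalent walk, but also skips any stage whose deps are unchanged.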

Architecture Overview

Git Repo (small)             Remote Storage (large)
+-------------------+        +----------------------+
| train.py          |        | .../a1/b2c3d4...     |  <-- content-addressed
| dvc.yaml          |        | .../e5/f6g7h8...     |      blobs
| params.yaml       |        |                      |
| data/train.csv.dvc|        |                      |
| models/m.pkl.dvc  |        |                      |
+-------------------+        +----------------------+
        ^                               ^
        | git pull                      | dvc pull
        v                               v
             [Developer workspace]

Configuration & Usage

# dvc.yaml — reproducible pipeline
stages:
  prepare:
    cmd: python src/prepare.py
    deps: [src/prepare.py, data/raw.csv]
    outs: [data/prepared.csv]

  train:
    cmd: python src/train.py
    deps: [src/train.py, data/prepared.csv]
    params: [train.lr, train.epochs]
    outs: [models/model.pkl]
    metrics:
      - metrics.json:
          cache: false

  evaluate:
    cmd: python src/eval.py
    deps: [src/eval.py, models/model.pkl]
    metrics:
      - eval.json:
          cache: false

# Typical workflow
dvc repro                        # run the pipeline, skipping unchanged stages
dvc exp run -S train.lr=0.01     # run an experiment with a param override
dvc exp show                     # compare experiments in a metrics table
dvc plots show eval.json         # render plots from the evaluation output
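What `dvc exp show` and `dvc metrics diff` report boils down to comparing metric files across runs. A rough sketch of that comparison, with made-up metric names and values:

```python
# Compare two flat metrics.json payloads (baseline vs. experiment) and
# report old/new/delta per metric, roughly as `dvc metrics diff` does.
import json

baseline = json.loads('{"accuracy": 0.91, "loss": 0.34}')
experiment = json.loads('{"accuracy": 0.93, "loss": 0.29}')

diff = {
    k: {"old": baseline.get(k), "new": v,
        "delta": round(v - baseline.get(k, 0), 4)}
    for k, v in experiment.items()
}
print(diff["accuracy"]["delta"])  # 0.02
```

DVC reads these files from Git revisions (commits, branches, experiments), so the same diff works across history, not just the workspace.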

Key Features

  • Git-compatible — works alongside any Git workflow
  • Storage backends — S3, GCS, Azure, SSH, HDFS, HTTP, local
  • Pipelines — DAG of stages, deps, outs, params, metrics
  • Experiment tracking — named runs, metric tables, cross-experiment diffs
  • Plot comparison — diff metrics across commits/branches/experiments
  • Studio UI — web dashboard for teams (cloud or self-hosted)
  • CI/CD hooks — integrates with GitHub Actions, GitLab CI (CML)
  • Deduplication — content-addressed storage, same file stored once
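The deduplication point can be illustrated with a tiny content-addressed cache in Python. The two-level hash-prefix layout mirrors the blob paths in the architecture diagram above, but this is a toy sketch, not DVC's implementation:

```python
# A file's cache path is derived from its content hash, so identical
# content is stored exactly once no matter how many datasets reference it.
import hashlib
import tempfile
from pathlib import Path

def cache_path(root: Path, data: bytes) -> Path:
    digest = hashlib.md5(data).hexdigest()
    return root / digest[:2] / digest[2:]   # e.g. root/a1/b2c3d4...

def store(root: Path, data: bytes) -> Path:
    path = cache_path(root, data)
    if not path.exists():                   # dedup: skip if already cached
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(data)
    return path

root = Path(tempfile.mkdtemp())
p1 = store(root, b"same bytes")
p2 = store(root, b"same bytes")             # second write hits the cache
print(p1 == p2)  # True: one blob for identical content
```

This is also why renaming or copying a tracked file costs nothing in remote storage: only the pointer changes.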

Comparison with Similar Tools

Feature              | DVC                   | Git LFS         | MLflow          | Pachyderm     | W&B
---------------------|-----------------------|-----------------|-----------------|---------------|-------------------
Data versioning      | Yes                   | Yes             | Via artifacts   | Yes (k8s)     | Via artifacts
Pipelines            | Yes                   | No              | Yes             | Yes           | No
Experiment tracking  | Yes                   | No              | Yes (focus)     | No            | Yes (focus)
Git-native           | Yes                   | Yes             | No              | No            | No
Storage choice       | Many                  | Git hosting     | Many            | k8s-centric   | Managed
Best for             | Git + reproducible ML | Any large files | Experiment logs | Kubernetes ML | Deep learning logs

FAQ

Q: DVC vs MLflow — do I pick one? A: DVC versions data and defines pipelines in Git. MLflow tracks training runs and models. Many teams use both.

Q: Will DVC slow down my Git repo? A: No. Git only stores small .dvc pointer files. Actual data lives in remote storage and syncs via dvc push/pull.

Q: Is DVC enough for experiment tracking? A: dvc exp is good for small-to-mid teams. For rich dashboards across many projects, pair with Studio, MLflow, or W&B.

Q: Does DVC work with GitHub-only workflows? A: Yes. CML (Continuous Machine Learning) automates dvc repro + metric diffing in GitHub Actions PR comments.
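A minimal GitHub Actions workflow along these lines might look like the sketch below. Action versions, the `main` baseline branch, and the report filename are assumptions; adapt them to your repository:

```yaml
# .github/workflows/train.yml — illustrative sketch, not a drop-in file
name: train
on: [push]
jobs:
  run:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: iterative/setup-dvc@v1
      - uses: iterative/setup-cml@v2
      - run: |
          dvc pull                              # fetch data for this commit
          dvc repro                             # rerun changed stages
          dvc metrics diff main --md >> report.md
          cml comment create report.md          # post metrics as a PR comment
        env:
          REPO_TOKEN: ${{ secrets.GITHUB_TOKEN }}
```

Remote-storage credentials (e.g. AWS keys for an S3 remote) would also need to be provided as secrets for `dvc pull` to work in CI.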
