Skills · May 13, 2026 · 3 min read

Snorkel — Programmatic Data Labeling for Machine Learning

Snorkel is a framework for building and managing training datasets programmatically using labeling functions, data augmentation, and slicing, replacing manual annotation with scalable automated approaches.

Agent-ready

This asset can be read and installed directly by agents

TokRepo exposes a universal CLI command, an installation contract, JSON metadata, an adapter-specific plan, and the raw content to help agents assess fit, risk, and next actions.

Native · 98/100 · Policy: allow
Agent surface: Any MCP/CLI agent
Type: Skill
Installation: Single
Trust: Established
Entry point: Snorkel Overview
Universal CLI command: npx tokrepo install 73dc2715-4ea4-11f1-9bc6-00163e2b0d79

Introduction

Snorkel is a data-centric AI framework from Stanford that replaces hand-labeling with programmatic labeling functions. Users write simple Python functions that encode heuristics, patterns, or external knowledge sources, and Snorkel combines their noisy outputs into high-quality training labels using a generative model.

What Snorkel Does

  • Lets users write labeling functions as simple Python heuristics
  • Combines multiple noisy label sources into probabilistic training labels
  • Models labeling function accuracy and correlations automatically
  • Supports data augmentation through transformation functions
  • Provides slicing functions for fine-grained model analysis
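
Transformation and slicing functions are easiest to picture in code. The following is a plain-Python sketch of the two ideas, not Snorkel's own API: the names `swap_greeting`, `is_short`, and `augment`, the greeting vocabulary, and the use of bare strings as data points are all assumptions made for illustration.

```python
import random

GREETINGS = ["hello", "hi", "hey"]  # hypothetical vocabulary for the example


def swap_greeting(text, rng):
    """Transformation function: return a label-preserving variant of `text`,
    or None when the transformation does not apply."""
    words = text.split()
    for i, word in enumerate(words):
        if word.lower() in GREETINGS:
            choices = [g for g in GREETINGS if g != word.lower()]
            words[i] = rng.choice(choices)
            return " ".join(words)
    return None


def is_short(text):
    """Slicing function: flag data points that belong to the 'short text' slice."""
    return len(text.split()) < 4


def augment(dataset, tf, rng):
    """Keep every original example and append its transformed copy when one exists."""
    out = []
    for text, label in dataset:
        out.append((text, label))
        new_text = tf(text, rng)
        if new_text is not None:
            out.append((new_text, label))  # the label is carried over unchanged
    return out


dataset = [
    ("hello how are you today", 0),
    ("win a free prize now", 1),
    ("hi there", 0),
]
augmented = augment(dataset, swap_greeting, random.Random(0))
short_slice = [text for text, _ in augmented if is_short(text)]
```

The transformation adds training examples without new labels, while the slice picks out a subset whose model performance can then be reported separately.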

Architecture Overview

Snorkel operates in three stages. First, labeling functions produce a label matrix where each row is a data point and each column is a labeling function's vote. Second, a generative label model learns the accuracy and correlation structure of the labeling functions without ground truth, producing probabilistic labels. Third, these soft labels train a downstream discriminative model (any standard classifier) that generalizes beyond the labeling function coverage.
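
The first two stages can be sketched in plain Python. This is an illustration under stated assumptions, not Snorkel's implementation: the three heuristics and the spam/ham label scheme are hypothetical, and the aggregation step is a simple unweighted vote count standing in for the generative label model, which would instead weight each function by its estimated accuracy.

```python
ABSTAIN, HAM, SPAM = -1, 0, 1  # -1 marks "no vote"


def lf_has_link(text):
    return SPAM if "http" in text else ABSTAIN


def lf_mentions_prize(text):
    return SPAM if "prize" in text.lower() else ABSTAIN


def lf_is_brief(text):
    return HAM if len(text.split()) <= 3 else ABSTAIN


LFS = [lf_has_link, lf_mentions_prize, lf_is_brief]


def build_label_matrix(data, lfs):
    """Stage 1: one row per data point, one column per labeling function."""
    return [[lf(x) for lf in lfs] for x in data]


def soft_labels(label_matrix, cardinality=2):
    """Stage 2 stand-in: convert each row of votes into class probabilities
    by counting non-abstaining votes (all functions weighted equally)."""
    probs = []
    for row in label_matrix:
        votes = [v for v in row if v != ABSTAIN]
        if not votes:
            # No function voted: fall back to a uniform distribution.
            probs.append([1.0 / cardinality] * cardinality)
        else:
            probs.append([votes.count(k) / len(votes) for k in range(cardinality)])
    return probs


data = [
    "Claim your prize at http://example.com",
    "see you soon",
    "Prize draw",
    "The quarterly report is attached for review",
]
L = build_label_matrix(data, LFS)
Y_prob = soft_labels(L)
# Stage 3 would train any classifier on (data, Y_prob), so that it can also
# label examples like the last one, where every function abstained.
```

In this sketch the third example receives conflicting votes and the fourth receives none, so both end up with a 50/50 soft label; a learned label model would resolve the conflict according to which function it estimates to be more accurate.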

Self-Hosting & Configuration

  • Install via pip: pip install snorkel
  • Define labeling functions as decorated Python functions
  • Apply labeling functions to your dataset with the built-in applier
  • Train the label model to estimate function accuracies
  • Feed probabilistic labels to any downstream ML framework
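
The steps above can be sketched as follows. The Snorkel calls (`LabelingFunction`, `PandasLFApplier`, `LabelModel`) follow the open-source 0.9.x releases and should be checked against the installed version. The three heuristics, the `text` column name, the spam/ham scheme, and the helper `train_label_model` are assumptions made for illustration. The Snorkel imports are deferred into the helper so that the heuristics can be tried on their own; in typical Snorkel code the same functions would more often be declared with the `@labeling_function()` decorator.

```python
from types import SimpleNamespace

ABSTAIN, HAM, SPAM = -1, 0, 1


def lf_contains_link(x):
    """Vote SPAM when the text contains a URL, otherwise abstain."""
    return SPAM if "http" in x.text.lower() else ABSTAIN


def lf_mentions_prize(x):
    """Vote SPAM when the text mentions a prize, otherwise abstain."""
    return SPAM if "prize" in x.text.lower() else ABSTAIN


def lf_short_message(x):
    """Vote HAM for very short messages, otherwise abstain."""
    return HAM if len(x.text.split()) < 5 else ABSTAIN


def train_label_model(df, cardinality=2):
    """Apply the labeling functions to a DataFrame with a `text` column and
    return probabilistic labels. Requires `pip install snorkel`."""
    from snorkel.labeling import LabelingFunction, PandasLFApplier
    from snorkel.labeling.model import LabelModel

    lfs = [
        LabelingFunction(name="lf_contains_link", f=lf_contains_link),
        LabelingFunction(name="lf_mentions_prize", f=lf_mentions_prize),
        LabelingFunction(name="lf_short_message", f=lf_short_message),
    ]
    # Label matrix: one row per example, one column per labeling function.
    L_train = PandasLFApplier(lfs=lfs).apply(df=df)

    label_model = LabelModel(cardinality=cardinality, verbose=False)
    label_model.fit(L_train=L_train, n_epochs=500, seed=123)
    return label_model.predict_proba(L=L_train)


# The heuristics only need an object with a `.text` attribute, so they can be
# checked without Snorkel installed.
row = SimpleNamespace(text="Win a FREE prize at http://example.com")
votes = [lf(row) for lf in (lf_contains_link, lf_mentions_prize, lf_short_message)]
```

Calling `train_label_model(df)` on a pandas DataFrame should return one row of class probabilities per example, which can then be passed to a downstream classifier that accepts soft labels or thresholded into hard labels.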

Key Features

  • Replaces manual labeling with programmatic heuristics at scale
  • Learns labeling function quality without any ground-truth labels
  • Handles conflicting and overlapping label sources automatically
  • Integrates with pandas DataFrames and standard ML pipelines
  • Supports transformation functions for data augmentation
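
To make "conflicting and overlapping label sources" concrete, here is a small plain-Python sketch that measures them on a label matrix. The function name `lf_stats` and the example matrix are hypothetical; the three quantities follow the usual weak-supervision definitions, and Snorkel's `LFAnalysis` utility reports comparable per-function statistics.

```python
ABSTAIN = -1


def lf_stats(label_matrix):
    """For each labeling function (column), return the fraction of data points
    where it votes (coverage), votes alongside at least one other function
    (overlaps), and disagrees with at least one other vote (conflicts)."""
    n_rows = len(label_matrix)
    n_lfs = len(label_matrix[0])
    stats = []
    for j in range(n_lfs):
        covered = overlaps = conflicts = 0
        for row in label_matrix:
            if row[j] == ABSTAIN:
                continue
            covered += 1
            others = [v for i, v in enumerate(row) if i != j and v != ABSTAIN]
            if others:
                overlaps += 1
            if any(v != row[j] for v in others):
                conflicts += 1
        stats.append(
            {
                "coverage": covered / n_rows,
                "overlaps": overlaps / n_rows,
                "conflicts": conflicts / n_rows,
            }
        )
    return stats


# Rows are data points, columns are labeling functions, -1 means "abstain".
L = [
    [1, 1, -1],
    [-1, -1, 0],
    [-1, 1, 0],
    [-1, -1, -1],
]
stats = lf_stats(L)
```

A function with high coverage but many conflicts is a candidate for refinement, while one with low coverage and no conflicts may simply be narrow and precise.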

Comparison with Similar Tools

  • Label Studio — manual annotation UI; Snorkel automates labeling with code
  • Prodigy — active learning annotation tool; Snorkel uses heuristic functions instead of human feedback
  • Cleanlab — detects label errors in existing datasets; Snorkel generates labels from scratch
  • Argilla — collaborative data curation; Snorkel focuses on programmatic weak supervision

FAQ

Q: Do labeling functions need to be perfect? A: No. Snorkel is designed to work with noisy, incomplete labeling functions and automatically estimates their accuracy.

Q: How many labeling functions do I need? A: Even 3-5 functions can produce useful labels. More functions with diverse signals generally improve quality.

Q: Does Snorkel replace all manual labeling? A: Not entirely. It reduces the need for manual labels, but a small hand-labeled validation set is still recommended for evaluating the labeling functions and the downstream model.

Q: Can I use LLMs as labeling functions? A: Yes, wrapping an LLM call in a labeling function is a common and effective pattern.
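
One way to wrap an LLM call in a labeling function is sketched below, in plain Python rather than Snorkel's API. Everything named here is an assumption: `ask_llm` is a hypothetical callable that takes a prompt and returns the model's text reply, the prompt wording and sentiment labels are illustrative, and `fake_llm` is a stand-in so the sketch runs offline.

```python
ABSTAIN, NEGATIVE, POSITIVE = -1, 0, 1

PROMPT = "Answer with exactly one word, positive or negative.\nReview: {text}"


def make_llm_lf(ask_llm):
    """Build a labeling function around `ask_llm` (hypothetical client)."""
    cache = {}  # avoid paying for the same prompt twice

    def lf_llm_sentiment(text):
        if text not in cache:
            try:
                cache[text] = ask_llm(PROMPT.format(text=text)).strip().lower()
            except Exception:
                return ABSTAIN  # a failed call becomes an abstention, not a crash
        reply = cache[text]
        if reply.startswith("positive"):
            return POSITIVE
        if reply.startswith("negative"):
            return NEGATIVE
        return ABSTAIN  # unparseable replies count as "no vote"

    return lf_llm_sentiment


def fake_llm(prompt):
    """Offline stand-in for a real model client."""
    return "Positive" if "great" in prompt else "I am not sure."


lf = make_llm_lf(fake_llm)
votes = [lf("great product"), lf("arrived on Tuesday")]
```

Mapping unexpected replies and errors to an abstention keeps the LLM on the same footing as any other noisy source, so the label model can weigh it against the other functions instead of trusting it outright.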
