Configs · May 31, 2026 · 3 min read

LM Evaluation Harness — Few-Shot Language Model Benchmarking

A unified framework for evaluating language models across hundreds of benchmarks with reproducible few-shot testing.

Agent-ready

This asset can be installed once a runtime is chosen, the plan is verified, and the matching command is run.

Native · 98/100
Policy: allow
Agent surface: Any MCP/CLI agent
Type: Skill
Installation: Single
Trust: Established
Entry point: LM Eval Harness

Direct install command:
npx -y tokrepo@latest install 62c59e45-5cea-11f1-9bc6-00163e2b0d79 --target codex

Run this after confirming the plan in a dry run.

Introduction

LM Evaluation Harness by EleutherAI provides a standardized way to evaluate language models across hundreds of academic benchmarks. It has become the de facto evaluation framework used by research labs and the Open LLM Leaderboard to produce comparable, reproducible benchmark scores.

What LM Evaluation Harness Does

  • Evaluates language models on 200+ tasks including MMLU, HellaSwag, ARC, GSM8K, and TruthfulQA
  • Supports Hugging Face Transformers, GGUF, vLLM, OpenAI API, and custom model backends
  • Runs few-shot and zero-shot evaluations with configurable prompt templates
  • Produces standardized JSON results for comparison across models and checkpoints
  • Handles batched inference and multi-GPU parallelism for efficient evaluation
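
To illustrate how the JSON results can be compared across models or checkpoints, here is a minimal sketch. It assumes each results file has a top-level "results" object keyed by task name, with metric keys such as "acc,none"; check this against your own output files, since the layout can differ between harness versions. The function name compare_results and all numbers are hypothetical.

```python
import json


def compare_results(baseline: dict, candidate: dict, metric: str = "acc,none") -> dict:
    """Return per-task deltas (candidate minus baseline) for tasks present in both."""
    deltas = {}
    base_tasks = baseline.get("results", {})
    cand_tasks = candidate.get("results", {})
    for task in sorted(base_tasks.keys() & cand_tasks.keys()):
        base_value = base_tasks[task].get(metric)
        cand_value = cand_tasks[task].get(metric)
        if base_value is None or cand_value is None:
            continue  # metric not reported for this task
        deltas[task] = round(cand_value - base_value, 4)
    return deltas


# Made-up scores standing in for two results files loaded with json.load().
baseline = json.loads(
    '{"results": {"hellaswag": {"acc,none": 0.50}, "arc_easy": {"acc,none": 0.70}}}'
)
candidate = json.loads(
    '{"results": {"hellaswag": {"acc,none": 0.55}, "arc_easy": {"acc,none": 0.68}}}'
)

print(compare_results(baseline, candidate))
```

In practice the two dictionaries would come from files in the output directory rather than inline strings.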

Architecture Overview

The harness defines each benchmark as a YAML-based task configuration that specifies the dataset, prompt template, few-shot examples, and scoring metric. At runtime it loads the configured model backend, constructs evaluation prompts, batches inference requests, collects log-likelihoods or generated text, and scores outputs against reference answers. Results are aggregated per-task and exported as structured JSON.
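
The flow described above can be sketched in a few lines of plain Python. This is not the harness's code: the TASK dictionary, the toy_loglikelihood function, and the sample documents are all hypothetical stand-ins, chosen only to show how a prompt template, few-shot examples, log-likelihood scoring, and metric aggregation fit together.

```python
import math

# Hypothetical task config mirroring the fields a YAML task file would carry.
TASK = {
    "prompt_template": "Question: {question}\nAnswer:",
    "fewshot": [{"question": "2 + 2 = ?", "answer": "4"}],
    "metric": "acc",
}

# Hypothetical evaluation documents with multiple-choice answers.
DOCS = [
    {"question": "3 + 3 = ?", "choices": ["5", "6", "7"], "gold": 1},
    {"question": "1 + 1 = ?", "choices": ["2", "3", "4"], "gold": 0},
]


def build_prompt(task, doc):
    """Prepend the few-shot examples to the query, all rendered with one template."""
    shots = [
        task["prompt_template"].format(question=s["question"]) + " " + s["answer"]
        for s in task["fewshot"]
    ]
    query = task["prompt_template"].format(question=doc["question"])
    return "\n\n".join(shots + [query])


def toy_loglikelihood(context, continuation):
    """Stand-in for a model backend: prefers the correct sum for the last question."""
    last_question = context.rsplit("Question: ", 1)[1]
    left, right = last_question.split(" = ?")[0].split(" + ")
    expected = str(int(left) + int(right))
    return 0.0 if continuation.strip() == expected else math.log(0.01)


def evaluate(task, docs, loglikelihood):
    """Score each choice, pick the most likely one, and aggregate accuracy."""
    correct = 0
    for doc in docs:
        context = build_prompt(task, doc)
        scores = [loglikelihood(context, " " + choice) for choice in doc["choices"]]
        prediction = max(range(len(scores)), key=scores.__getitem__)
        correct += int(prediction == doc["gold"])
    return {task["metric"]: correct / len(docs)}


print(evaluate(TASK, DOCS, toy_loglikelihood))
```

Swapping toy_loglikelihood for a call into a real backend is the only part that would change in principle; batching and caching are left out here for brevity.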

Self-Hosting & Configuration

  • Install from PyPI with pip or clone the repository for development use
  • Configure model backends via --model and --model_args flags on the CLI
  • Define custom evaluation tasks in YAML without writing Python code
  • Set batch size, number of few-shot examples, and output directory via CLI flags
  • Run distributed evaluation across multiple GPUs with accelerate or vLLM tensor parallelism
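
As a rough illustration of the CLI flags mentioned above, the sketch below assembles an lm_eval command line from Python. It assumes the harness is installed (for example with pip install lm-eval) and exposes the lm_eval entry point; the model name, task list, and output directory are placeholder choices, and build_lm_eval_command is a hypothetical helper, not part of the harness.

```python
import shlex


def build_lm_eval_command(model, model_args, tasks, num_fewshot, batch_size, output_path):
    """Assemble an argv list for the lm_eval CLI without executing it."""
    joined_args = ",".join(f"{key}={value}" for key, value in model_args.items())
    return [
        "lm_eval",
        "--model", model,
        "--model_args", joined_args,
        "--tasks", ",".join(tasks),
        "--num_fewshot", str(num_fewshot),
        "--batch_size", str(batch_size),
        "--output_path", output_path,
    ]


command = build_lm_eval_command(
    model="hf",  # Hugging Face Transformers backend
    model_args={"pretrained": "EleutherAI/pythia-160m"},  # placeholder model
    tasks=["hellaswag", "arc_easy"],
    num_fewshot=5,
    batch_size=8,
    output_path="results",
)

print(shlex.join(command))

# To actually run the evaluation (requires the harness and a model download):
# import subprocess
# subprocess.run(command, check=True)
```

Keeping the command as a list avoids shell-quoting problems when model arguments contain commas or paths.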

Key Features

  • Powers the Hugging Face Open LLM Leaderboard, the community standard for model ranking
  • YAML task definitions allow non-programmers to add new benchmarks quickly
  • Caches model outputs for fast re-scoring when metrics change
  • Supports log-likelihood, multiple-choice, and generative evaluation modes
  • Extensible model API enables integration with any inference backend
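
The idea behind cached outputs and fast re-scoring can be sketched as follows. This is an in-memory illustration only; it does not reproduce the harness's own cache or its storage format. OutputCache, fake_generate, and both metrics are hypothetical names introduced for the example.

```python
import hashlib
import json


class OutputCache:
    """In-memory stand-in for a persistent cache of model responses."""

    def __init__(self):
        self._store = {}
        self.misses = 0  # counts how often the model actually had to run

    @staticmethod
    def _key(model_id, prompt):
        payload = json.dumps([model_id, prompt])
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def get_or_generate(self, model_id, prompt, generate):
        key = self._key(model_id, prompt)
        if key not in self._store:
            self.misses += 1
            self._store[key] = generate(prompt)
        return self._store[key]


def fake_generate(prompt):
    """Toy model: knows one capital and nothing else."""
    return "Paris" if "France" in prompt else "unknown"


def exact_match(output, reference):
    return float(output == reference)


def case_insensitive_match(output, reference):
    return float(output.lower() == reference.lower())


cache = OutputCache()
examples = [("Capital of France?", "paris"), ("Capital of Peru?", "Lima")]


def score(metric):
    """Fetch outputs through the cache, then average the metric over examples."""
    outputs = [cache.get_or_generate("toy-model", p, fake_generate) for p, _ in examples]
    return sum(metric(o, ref) for o, (_, ref) in zip(outputs, examples)) / len(examples)


first = score(exact_match)              # runs the toy model twice
second = score(case_insensitive_match)  # served entirely from the cache
print(first, second, cache.misses)
```

Changing the metric between the two calls alters the score but triggers no new model calls, which is the property that makes re-scoring cheap.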

Comparison with Similar Tools

  • Promptfoo: focuses on prompt regression testing for applications; LM Eval Harness targets academic benchmark evaluation
  • DeepEval: LLM testing with custom metrics for production apps; LM Eval Harness covers established research benchmarks
  • Ragas: specializes in RAG pipeline evaluation; LM Eval Harness evaluates base model capabilities
  • OpenCompass: evaluation suite that originated in China; LM Eval Harness has broader English benchmark coverage
  • Inspect AI: safety-focused evaluations; LM Eval Harness covers general capability benchmarks

FAQ

Q: Which models does it support? A: Any model accessible via Hugging Face Transformers, GGUF files, vLLM, or an OpenAI-compatible API endpoint.

Q: How long does a full evaluation take? A: Running MMLU on an 8B model with a single GPU takes roughly 30-60 minutes. Smaller benchmarks like HellaSwag finish in minutes.
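
A back-of-the-envelope estimate shows where a figure in that range can come from. The sketch assumes the MMLU test split has 14,042 four-choice questions and that each answer choice is scored as one log-likelihood request; the throughput of 20 requests per second is a purely hypothetical number, and real throughput depends on hardware, batch size, and few-shot prompt length.

```python
def estimate_minutes(num_questions, choices_per_question, requests_per_second):
    """Estimate wall-clock minutes when each choice is one scoring request."""
    total_requests = num_questions * choices_per_question
    return total_requests / requests_per_second / 60


# 14,042 questions x 4 choices at an assumed 20 requests per second.
mmlu_minutes = estimate_minutes(14_042, 4, 20.0)
print(f"{mmlu_minutes:.1f} minutes")
```

Halving or doubling the assumed throughput moves the estimate across most of the quoted range, so measure on your own setup before planning a large run.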

Q: Can I add my own benchmark? A: Yes. Create a YAML task config specifying the Hugging Face dataset, prompt template, and metric. No Python code is required for standard formats.
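
A custom task file might look like the one written out by the sketch below. The field names follow the harness's YAML task format as we understand it (task, dataset_path, output_type, doc_to_text, doc_to_choice, doc_to_target, metric_list), but verify them against the task guide for your installed version. The task name, dataset path, and column names (question, choices, answer) are hypothetical.

```python
import tempfile
from pathlib import Path

# Hypothetical multiple-choice task; dataset path and column names are placeholders.
TASK_YAML = """\
task: my_custom_qa
dataset_path: my-org/my-qa-dataset
output_type: multiple_choice
test_split: test
doc_to_text: "Question: {{question}}\\nAnswer:"
doc_to_choice: choices
doc_to_target: answer
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: true
"""

# Write the config into a scratch directory that the CLI could be pointed at.
task_dir = Path(tempfile.mkdtemp())
task_file = task_dir / "my_custom_qa.yaml"
task_file.write_text(TASK_YAML, encoding="utf-8")

print(task_file.read_text(encoding="utf-8"))
```

The directory containing the file can then be passed to the CLI (recent versions accept an --include_path option for this, to the best of our knowledge) and the task selected by name with --tasks.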

Q: Is it used for the Open LLM Leaderboard? A: Yes. The Hugging Face Open LLM Leaderboard uses LM Evaluation Harness as its evaluation backend.
