Configs · May 31, 2026 · 1 min read

LM Evaluation Harness — Few-Shot Language Model Benchmarking

A unified framework for evaluating language models across hundreds of benchmarks with reproducible few-shot testing.

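Agent-ready

This asset can be installed directly by an agent. The agent first selects the current runtime, checks the install plan, and then runs the matching command.

Native · 98/100 · Policy: allowed
Agent entry point: any MCP/CLI agent
Type: Skill
Install: Single
Trust level: Established
Entry: LM Eval Harness

Direct install command:
npx -y tokrepo@latest install 62c59e45-5cea-11f1-9bc6-00163e2b0d79 --target codex

Do a dry run to confirm the install plan before running this command.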

Introduction

LM Evaluation Harness by EleutherAI provides a standardized way to evaluate language models across hundreds of academic benchmarks. It has become the de facto evaluation framework used by research labs and the Open LLM Leaderboard to produce comparable, reproducible benchmark scores.

What LM Evaluation Harness Does

  • Evaluates language models on 200+ tasks including MMLU, HellaSwag, ARC, GSM8K, and TruthfulQA
  • Supports Hugging Face Transformers, GGUF, vLLM, OpenAI API, and custom model backends
  • Runs few-shot and zero-shot evaluations with configurable prompt templates
  • Produces standardized JSON results for comparison across models and checkpoints
  • Handles batched inference and multi-GPU parallelism for efficient evaluation
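
To illustrate the point about comparable JSON results, here is a minimal sketch of diffing two result files. It assumes a layout in which a top-level "results" object maps each task name to its metrics and the accuracy key is named "acc,none"; both are assumptions to check against the files your installed version writes. The scores are made-up placeholders, not real benchmark numbers.

```python
import json

# Made-up example payloads standing in for two exported result files.
# The "results" -> task -> "acc,none" layout is an assumption, not a spec.
BASELINE_JSON = '{"results": {"hellaswag": {"acc,none": 0.57}, "arc_easy": {"acc,none": 0.70}}}'
CANDIDATE_JSON = '{"results": {"hellaswag": {"acc,none": 0.62}, "arc_easy": {"acc,none": 0.68}}}'


def metric_deltas(baseline: dict, candidate: dict, metric: str = "acc,none") -> dict:
    """Return candidate-minus-baseline for every task present in both runs."""
    deltas = {}
    for task, base_metrics in baseline["results"].items():
        cand_metrics = candidate["results"].get(task)
        if cand_metrics is None or metric not in base_metrics or metric not in cand_metrics:
            continue  # skip tasks or metrics that only one run reports
        deltas[task] = round(cand_metrics[metric] - base_metrics[metric], 4)
    return deltas


deltas = metric_deltas(json.loads(BASELINE_JSON), json.loads(CANDIDATE_JSON))
for task, delta in sorted(deltas.items()):
    print(f"{task}: {delta:+.4f}")
```

In practice the two payloads would come from `json.load` on the files written to the output directory of each run.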

Architecture Overview

The harness defines each benchmark as a YAML-based task configuration that specifies the dataset, prompt template, few-shot examples, and scoring metric. At runtime it loads the configured model backend, constructs evaluation prompts, batches inference requests, collects log-likelihoods or generated text, and scores outputs against reference answers. Results are aggregated per-task and exported as structured JSON.
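
The pipeline described above can be sketched in a few lines. This is a simplified illustration, not the harness's own code: `build_fewshot_prompt`, `pick_choice`, and `toy_loglikelihood` are hypothetical names, and the "model" is a lookup table standing in for a real backend that would return log-likelihoods.

```python
import math
from typing import Callable, Sequence

# Hypothetical stand-in for a model: it "knows" two facts and nothing else.
FACTS = {"capital of France": "Paris", "2 + 2": "4"}


def toy_loglikelihood(context: str, continuation: str) -> float:
    """Return a high log-probability only for answers the toy model knows."""
    last_question = context.rsplit("Question:", 1)[-1]
    for key, answer in FACTS.items():
        if key in last_question and continuation.strip() == answer:
            return math.log(0.9)
    return math.log(0.05)


def build_fewshot_prompt(shots: Sequence[tuple[str, str]], question: str) -> str:
    """Concatenate solved examples, then the unanswered question."""
    blocks = [f"Question: {q}\nAnswer: {a}" for q, a in shots]
    blocks.append(f"Question: {question}\nAnswer:")
    return "\n\n".join(blocks)


def pick_choice(prompt: str, choices: Sequence[str],
                loglikelihood: Callable[[str, str], float]) -> int:
    """Score each candidate continuation and return the index of the best one."""
    scores = [loglikelihood(prompt, " " + choice) for choice in choices]
    return max(range(len(choices)), key=scores.__getitem__)


def evaluate(docs, shots, loglikelihood) -> float:
    """Aggregate per-document correctness into a task-level accuracy."""
    correct = 0
    for doc in docs:
        prompt = build_fewshot_prompt(shots, doc["question"])
        correct += int(pick_choice(prompt, doc["choices"], loglikelihood) == doc["gold"])
    return correct / len(docs)


shots = [("What is 1 + 1?", "2")]
docs = [
    {"question": "What is the capital of France?", "choices": ["Berlin", "Paris"], "gold": 1},
    {"question": "What is 2 + 2?", "choices": ["4", "5"], "gold": 0},
    {"question": "What is the capital of Peru?", "choices": ["Quito", "Lima"], "gold": 1},
]
accuracy = evaluate(docs, shots, toy_loglikelihood)
print(f"accuracy = {accuracy:.3f}")
```

The toy model has no fact about Peru, so both choices tie and the first one wins, which is why the third document is scored as wrong. Generative tasks would replace `pick_choice` with a text-generation call followed by a string-level metric.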

Self-Hosting & Configuration

  • Install from PyPI with pip or clone the repository for development use
  • Configure model backends via --model and --model_args flags on the CLI
  • Define custom evaluation tasks in YAML without writing Python code
  • Set batch size, number of few-shot examples, and output directory via CLI flags
  • Run distributed evaluation across multiple GPUs with accelerate or vLLM tensor parallelism
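
As a rough sketch of how these options combine, the helper below assembles a command line from Python. `build_lm_eval_command` is a hypothetical helper; the `lm_eval` entry point and the `--tasks`, `--num_fewshot`, `--batch_size`, and `--output_path` flags reflect the v0.4-era CLI as far as I know, so confirm them with `lm_eval --help` for your installed version. The model ID is only an example.

```python
import shlex


def build_lm_eval_command(model: str, model_args: dict, tasks: list,
                          num_fewshot: int = 0, batch_size: int = 8,
                          output_path: str = "results") -> list:
    """Assemble an argument list; flag names are assumptions to verify locally."""
    # --model_args takes comma-separated key=value pairs.
    args_str = ",".join(f"{key}={value}" for key, value in model_args.items())
    return [
        "lm_eval",
        "--model", model,
        "--model_args", args_str,
        "--tasks", ",".join(tasks),
        "--num_fewshot", str(num_fewshot),
        "--batch_size", str(batch_size),
        "--output_path", output_path,
    ]


cmd = build_lm_eval_command(
    model="hf",
    model_args={"pretrained": "EleutherAI/pythia-160m", "dtype": "float32"},
    tasks=["hellaswag", "arc_easy"],
    num_fewshot=5,
)
command_line = shlex.join(cmd)
print(command_line)
```

Passing the list to `subprocess.run(cmd, check=True)` would launch the evaluation. Building the command as a list rather than a string avoids shell-quoting problems when model arguments contain paths.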

Key Features

  • Powers the Hugging Face Open LLM Leaderboard, a widely followed public ranking of open models
  • YAML task definitions allow non-programmers to add new benchmarks quickly
  • Caches model outputs for fast re-scoring when metrics change
  • Supports log-likelihood, multiple-choice, and generative evaluation modes
  • Extensible model API enables integration with any inference backend
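
The idea behind output caching can be shown with a small sketch. This is not how the harness implements its cache; `OutputCache`, `fake_model`, and the two metrics are hypothetical names used only to show why stored outputs let a metric change without a second round of inference.

```python
import hashlib


class OutputCache:
    """Store one model output per prompt, keyed by a hash of the prompt text."""

    def __init__(self):
        self._store = {}
        self.model_calls = 0  # counts how often the model was actually invoked

    @staticmethod
    def _key(prompt: str) -> str:
        return hashlib.sha256(prompt.encode("utf-8")).hexdigest()

    def get_or_generate(self, prompt: str, generate) -> str:
        key = self._key(prompt)
        if key not in self._store:
            self.model_calls += 1
            self._store[key] = generate(prompt)
        return self._store[key]


def exact_match(pred: str, gold: str) -> float:
    return float(pred == gold)


def lenient_match(pred: str, gold: str) -> float:
    # Ignores surrounding whitespace and letter case.
    return float(pred.strip().lower() == gold.strip().lower())


def score(docs, cache, generate, metric) -> float:
    """Average a metric over all documents, reusing cached outputs when present."""
    total = sum(metric(cache.get_or_generate(d["prompt"], generate), d["gold"]) for d in docs)
    return total / len(docs)


def fake_model(prompt: str) -> str:
    """Hypothetical stand-in for an expensive generation call."""
    return " paris " if "France" in prompt else "4"


docs = [
    {"prompt": "Capital of France?", "gold": "Paris"},
    {"prompt": "2 + 2 =", "gold": "4"},
]
cache = OutputCache()
strict = score(docs, cache, fake_model, exact_match)
lenient = score(docs, cache, fake_model, lenient_match)
print(strict, lenient, cache.model_calls)
```

The second call to `score` uses a different metric but triggers no new model calls, because both outputs are already in the cache.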

Comparison with Similar Tools

  • Promptfoo — focuses on prompt regression testing for applications; LM Eval Harness targets academic benchmark evaluation
  • DeepEval — LLM testing with custom metrics for production apps; LM Eval Harness covers established research benchmarks
  • Ragas — specializes in RAG pipeline evaluation; LM Eval Harness evaluates base model capabilities
  • OpenCompass — Chinese-originated evaluation suite; LM Eval Harness has broader English benchmark coverage
  • Inspect AI — safety-focused evaluations; LM Eval Harness covers general capability benchmarks

FAQ

Q: Which models does it support? A: Any model accessible via Hugging Face Transformers, GGUF files, vLLM, or an OpenAI-compatible API endpoint.

Q: How long does a full evaluation take? A: It depends on the model, the hardware, and the task set. As a rough guide, running MMLU on an 8B model with a single GPU takes about 30 to 60 minutes, while smaller benchmarks such as HellaSwag finish in minutes.

Q: Can I add my own benchmark? A: Yes. Create a YAML task config specifying the Hugging Face dataset, prompt template, and metric. No Python code is required for standard formats.
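
As a hedged sketch of what such a config might look like, the snippet below writes a multiple-choice task definition to disk. The task name, dataset ID, and column names (`question`, `choices`, `answer`) are hypothetical, and the field names follow the YAML task format as I understand it, so check them against the task guide for your installed version.

```python
import tempfile
from pathlib import Path

# Illustrative task config; names marked "hypothetical" are placeholders.
TASK_YAML = """\
task: my_multiple_choice_task          # hypothetical task name
dataset_path: my-org/my-dataset        # hypothetical Hugging Face dataset ID
output_type: multiple_choice
test_split: test
doc_to_text: "Question: {{question}}\\nAnswer:"
doc_to_choice: "{{choices}}"
doc_to_target: "{{answer}}"
num_fewshot: 0
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: true
"""

# Write the config into a scratch directory so the example has no side effects.
task_dir = Path(tempfile.mkdtemp())
task_file = task_dir / "my_multiple_choice_task.yaml"
task_file.write_text(TASK_YAML, encoding="utf-8")
print(task_file.read_text(encoding="utf-8"))
```

To my knowledge, the CLI can pick up a directory of external task files through an `--include_path` option, after which the task is selected by name like any built-in task.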

Q: Is it used for the Open LLM Leaderboard? A: Yes. The Hugging Face Open LLM Leaderboard uses LM Evaluation Harness as its evaluation backend.
