Skills · May 2, 2026 · 1 min read

LM Evaluation Harness — Unified LLM Benchmarking Framework

EleutherAI's framework for reproducible evaluation of language models across hundreds of benchmarks, providing the standard evaluation backend used by the Open LLM Leaderboard and research papers.

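Agent Ready

This asset can be read and installed directly by agents.

TokRepo provides a generic CLI command, an install contract, metadata JSON, adapter-specific install plans, and links to the raw content, so an agent can assess fit, risk, and next steps.

Native · 98/100 · Policy: allowed
Agent entry point: any MCP/CLI agent
Type: Skill
Install: Single
Trust level: Established
Entry: LM Evaluation Harness
Generic CLI install command:
npx tokrepo install 0df4d2b1-45bd-11f1-9bc6-00163e2b0d79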

Introduction

The LM Evaluation Harness by EleutherAI provides a unified interface for evaluating generative language models across standardized benchmarks. It ensures reproducible evaluation with consistent prompting, scoring, and reporting, and serves as the evaluation backend for Hugging Face's Open LLM Leaderboard.

What LM Evaluation Harness Does

  • Evaluates language models on 400+ built-in benchmark tasks
  • Supports multiple model backends (Hugging Face, GGUF, vLLM, OpenAI API, etc.)
  • Handles few-shot prompting with configurable example selection
  • Computes metrics including accuracy, perplexity, exact match, and BLEU
  • Produces structured JSON output for automated comparison pipelines
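As a rough illustration of the comparison pipelines mentioned above, the sketch below diffs two result files. It assumes a top-level "results" object keyed by task name, with metric keys such as "acc,none"; that layout, the task name, and the scores are assumptions made for the example, so check them against the JSON your harness version actually writes.

```python
import json


def load_metrics(results_json: str) -> dict:
    """Flatten {"results": {task: {metric: value}}} into {(task, metric): value}."""
    data = json.loads(results_json)
    flat = {}
    for task, metrics in data.get("results", {}).items():
        for name, value in metrics.items():
            is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
            # Keep point estimates only; skip string aliases and stderr entries.
            if is_number and "stderr" not in name:
                flat[(task, name)] = float(value)
    return flat


def compare(baseline: dict, candidate: dict) -> dict:
    """Return candidate minus baseline for every metric present in both runs."""
    shared = baseline.keys() & candidate.keys()
    return {key: candidate[key] - baseline[key] for key in shared}


# Hypothetical payloads, invented for illustration only (not real scores).
run_a = '{"results": {"demo_task": {"alias": "demo_task", "acc,none": 0.5, "acc_stderr,none": 0.01}}}'
run_b = '{"results": {"demo_task": {"alias": "demo_task", "acc,none": 0.75, "acc_stderr,none": 0.01}}}'

deltas = compare(load_metrics(run_a), load_metrics(run_b))
for (task, metric), delta in sorted(deltas.items()):
    print(f"{task} {metric}: {delta:+.3f}")
```

In a real pipeline the two strings would be read from the files the harness writes, and the stderr entries dropped here could be used to judge whether a difference is meaningful.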

Architecture Overview

The harness separates model interfaces (LM classes) from task definitions (YAML configs specifying dataset, prompt template, and metric). During evaluation, it constructs prompts per task specification, batches requests to the model, collects log-probabilities or generated text, and scores against reference answers. A task registry discovers built-in tasks and user-defined YAML configs automatically. Caching avoids redundant computation across runs.
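The separation between model interface and task definition can be sketched in a few lines of Python. This is a toy illustration of the idea, not the harness's actual classes: the LM protocol, Task dataclass, LengthLM model, and evaluate function are all hypothetical names, and the "model" simply prefers shorter answers.

```python
from dataclasses import dataclass
from typing import Callable, Protocol


class LM(Protocol):
    """Model side: anything that can score a continuation given a context."""

    def loglikelihood(self, context: str, continuation: str) -> float: ...


@dataclass
class Task:
    """Task side: the data plus the functions that turn a document into a prompt."""

    name: str
    docs: list
    doc_to_text: Callable[[dict], str]
    doc_to_choices: Callable[[dict], list]
    doc_to_target: Callable[[dict], int]


class LengthLM:
    """Toy stand-in for a real model: shorter continuations score higher."""

    def loglikelihood(self, context: str, continuation: str) -> float:
        return -float(len(continuation))


def evaluate(model: LM, task: Task) -> float:
    """Score each answer choice, pick the most likely one, and return accuracy."""
    correct = 0
    for doc in task.docs:
        prompt = task.doc_to_text(doc)
        scores = [model.loglikelihood(prompt, " " + c) for c in task.doc_to_choices(doc)]
        prediction = max(range(len(scores)), key=scores.__getitem__)
        correct += int(prediction == task.doc_to_target(doc))
    return correct / len(task.docs)


toy_task = Task(
    name="toy_multiple_choice",
    docs=[
        {"q": "2 + 2 =", "choices": ["4", "five"], "answer": 0},
        {"q": "Capital of France?", "choices": ["Marseille", "Paris"], "answer": 1},
        {"q": "Largest planet?", "choices": ["Mars", "Jupiter"], "answer": 1},
    ],
    doc_to_text=lambda d: f"Question: {d['q']}\nAnswer:",
    doc_to_choices=lambda d: d["choices"],
    doc_to_target=lambda d: d["answer"],
)

accuracy = evaluate(LengthLM(), toy_task)
print(f"{toy_task.name}: acc={accuracy:.3f}")
```

Because evaluate only depends on the loglikelihood method, swapping in a different backend does not require touching the task, which is the point of the split described above.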

Self-Hosting & Configuration

  • Install from PyPI with pip install lm-eval, or clone the repository and run pip install -e . for development
  • Configure model backend and arguments via CLI or Python API
  • Add custom tasks by writing YAML configuration files
  • Use --batch_size auto for automatic GPU memory-aware batching
  • Output results to JSON, CSV, or push to Weights & Biases
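A typical command-line run combines the options above. The helper below is a small sketch that only assembles and prints the command rather than running it; build_lm_eval_command is a hypothetical name, and the model and task choices are examples, not recommendations.

```python
import shlex


def build_lm_eval_command(model, model_args, tasks, num_fewshot=None,
                          batch_size="auto", output_path=None):
    """Assemble an lm_eval argument list from Python values (nothing is executed)."""
    joined_args = ",".join(f"{key}={value}" for key, value in model_args.items())
    cmd = [
        "lm_eval",
        "--model", model,
        "--model_args", joined_args,
        "--tasks", ",".join(tasks),
        "--batch_size", str(batch_size),
    ]
    if num_fewshot is not None:
        cmd += ["--num_fewshot", str(num_fewshot)]
    if output_path is not None:
        cmd += ["--output_path", output_path]
    return cmd


# Example values only: a small Hugging Face model and two common tasks.
cmd = build_lm_eval_command(
    model="hf",
    model_args={"pretrained": "EleutherAI/pythia-160m"},
    tasks=["hellaswag", "arc_easy"],
    num_fewshot=5,
    output_path="results",
)
print(shlex.join(cmd))
```

The printed line can be pasted into a shell, or the list can be handed to subprocess.run once lm-eval is installed.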

Key Features

  • 400+ tasks covering reasoning, knowledge, coding, math, and language understanding
  • Backend-agnostic: evaluate local models, API endpoints, or quantized formats
  • Few-shot evaluation with deterministic example sampling
  • Group tasks into suites for composite benchmark scoring
  • Extensible YAML task format for custom evaluations
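Deterministic few-shot sampling comes down to drawing examples from a seeded random generator, so the same seed always yields the same prompt. The sketch below shows that idea with the standard library; sample_fewshot and build_prompt are hypothetical helpers, and the harness's own sampler may differ in detail.

```python
import random


def sample_fewshot(pool, k, seed):
    """Draw k examples with a private RNG so the global random state is untouched."""
    rng = random.Random(seed)
    return rng.sample(pool, k)


def build_prompt(examples, question):
    """Join the few-shot examples and the unanswered question into one prompt."""
    shots = [f"Q: {ex['q']}\nA: {ex['a']}" for ex in examples]
    return "\n\n".join(shots + [f"Q: {question}\nA:"])


# Toy example pool, invented for illustration.
pool = [{"q": f"{n} + {n} =", "a": str(2 * n)} for n in range(1, 11)]

first = sample_fewshot(pool, 3, seed=1234)
second = sample_fewshot(pool, 3, seed=1234)
prompt = build_prompt(first, "11 + 11 =")

print(first == second)  # same seed, same examples
print(prompt)
```

Using a dedicated random.Random instance rather than the module-level functions keeps the draw reproducible even when other code consumes random numbers in between.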

Comparison with Similar Tools

  • HELM (Stanford) — broader evaluation including fairness and toxicity; more complex setup
  • OpenCompass — Chinese-origin eval suite with similar scope; different task implementations
  • Promptfoo — focused on prompt testing and red-teaming rather than academic benchmarks
  • DeepEval — LLM testing with custom metrics; less standardized benchmark coverage
  • bigcode-evaluation-harness — specialized fork for code generation benchmarks

FAQ

Q: How do I evaluate a model served via API? A: Use the --model openai-completions or --model local-completions backend, and pass the endpoint as base_url in --model_args.

Q: Can I add my own benchmark? A: Yes. Write a YAML task config specifying the dataset (from Hugging Face or local), prompt template, and scoring metric.
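A minimal task config might look like the one embedded in the script below, which just writes the YAML to a temporary directory. The task name, data file, and field names in the prompt templates (question, answer) are hypothetical, and the config keys reflect my reading of the harness's task format, so verify them against the task guide for your installed version.

```python
import tempfile
from pathlib import Path

# Hypothetical task definition: names and file paths are placeholders.
TASK_YAML = """\
task: my_custom_qa
dataset_path: json
dataset_kwargs:
  data_files:
    test: my_custom_qa.jsonl
output_type: generate_until
test_split: test
doc_to_text: "Question: {{question}}\\nAnswer:"
doc_to_target: "{{answer}}"
generation_kwargs:
  until:
    - "\\n"
metric_list:
  - metric: exact_match
    aggregation: mean
    higher_is_better: true
"""

# Write the config into a scratch directory that can hold custom tasks.
task_dir = Path(tempfile.mkdtemp())
task_file = task_dir / "my_custom_qa.yaml"
task_file.write_text(TASK_YAML, encoding="utf-8")

print(f"Wrote {task_file}")
```

The directory holding the file can then be passed to the harness with --include_path, alongside --tasks my_custom_qa, so the task registry picks it up.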

Q: How long does a full evaluation take? A: Depends on model size and task count. MMLU on a 7B model takes roughly 30 minutes on a single A100.

Q: Is the scoring deterministic? A: Yes, given the same model weights, task version, and few-shot seed. The harness pins random seeds for reproducibility.
