LM Evaluation Harness — Unified LLM Benchmarking Framework

Introduction

The LM Evaluation Harness by EleutherAI provides a unified interface for evaluating generative language models across standardized benchmarks. It ensures reproducible evaluation with consistent prompting, scoring, and reporting, and serves as the evaluation backend for Hugging Face's Open LLM Leaderboard.

What LM Evaluation Harness Does

Evaluates language models on 400+ built-in benchmark tasks
Supports multiple model backends (Hugging Face, GGUF, vLLM, OpenAI API, etc.)
Handles few-shot prompting with configurable example selection
Computes metrics including accuracy, perplexity, exact match, and BLEU
Produces structured JSON output for automated comparison pipelines
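As an illustration of how the JSON output can feed a comparison pipeline, the sketch below diffs one metric between two runs. The two result excerpts are made-up numbers, and the "metric,filter" key style (for example "acc,none") is an assumption about the output layout that may differ between harness versions; metric_deltas is a hypothetical helper, not part of the harness.

```python
import json

# Hypothetical excerpts of two results files (made-up numbers).
# Real files contain many more fields; the "acc,none" key style is assumed.
RUN_A = json.loads(
    '{"results": {"hellaswag": {"acc,none": 0.50}, "arc_easy": {"acc,none": 0.61}}}'
)
RUN_B = json.loads('{"results": {"hellaswag": {"acc,none": 0.52}}}')


def metric_deltas(run_a: dict, run_b: dict, metric: str = "acc,none") -> dict:
    """Return {task: score_b - score_a} for tasks that report `metric` in both runs."""
    deltas = {}
    for task, scores in run_a["results"].items():
        other = run_b["results"].get(task)
        if other is None or metric not in scores or metric not in other:
            continue  # skip tasks that only one run evaluated
        deltas[task] = other[metric] - scores[metric]
    return deltas


deltas = metric_deltas(RUN_A, RUN_B)
for task, delta in sorted(deltas.items()):
    print(f"{task}: {delta:+.3f}")
```

In practice the two dictionaries would be loaded from the files written by each run rather than from inline strings.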

Architecture Overview

The harness separates model interfaces (LM classes) from task definitions (YAML configs specifying dataset, prompt template, and metric). During evaluation, it constructs prompts per task specification, batches requests to the model, collects log-probabilities or generated text, and scores against reference answers. A task registry discovers built-in tasks and user-defined YAML configs automatically. Caching avoids redundant computation across runs.
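The separation described above can be pictured with a toy model. This is a simplified illustration, not the harness's actual code: ToyLM, ToyTask, and evaluate are hypothetical names, the "model" scores continuations by length instead of real log-probabilities, and the cache is a plain dictionary.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class ToyTask:
    """Task definition: a prompt template plus documents with choices and a gold index."""
    name: str
    prompt_template: str
    docs: List[dict]

    def build_prompt(self, doc: dict) -> str:
        return self.prompt_template.format(**doc)


class ToyLM:
    """Stand-in for a model backend. A real backend would return model log-probabilities."""

    def __init__(self) -> None:
        self.calls = 0

    def loglikelihood(self, context: str, continuation: str) -> float:
        self.calls += 1
        return -float(len(continuation))  # toy rule: shorter continuation scores higher


def evaluate(lm: ToyLM, task: ToyTask, cache: Dict[Tuple[str, str], float]) -> dict:
    """Build prompts, score every choice (reusing cached scores), and compute accuracy."""
    correct = 0
    for doc in task.docs:
        prompt = task.build_prompt(doc)
        scores = []
        for choice in doc["choices"]:
            key = (prompt, choice)
            if key not in cache:
                cache[key] = lm.loglikelihood(prompt, choice)
            scores.append(cache[key])
        prediction = max(range(len(scores)), key=scores.__getitem__)
        correct += int(prediction == doc["answer"])
    return {"task": task.name, "acc": correct / len(task.docs)}


task = ToyTask(
    name="toy_qa",
    prompt_template="Question: {question}\nAnswer:",
    docs=[
        {"question": "2 + 2 = ?", "choices": [" 4", " 22"], "answer": 0},
        {"question": "Capital of France?", "choices": [" Paris", " Rome"], "answer": 0},
    ],
)
lm, cache = ToyLM(), {}
first = evaluate(lm, task, cache)
second = evaluate(lm, task, cache)  # served entirely from the cache
print(first, lm.calls)
```

The point of the design is that the task knows nothing about the model and the model knows nothing about the task; only the evaluation loop connects them.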

Self-Hosting & Configuration

Install from PyPI with pip install lm-eval, or clone the repository and install it in editable mode for development
Configure model backend and arguments via CLI or Python API
Add custom tasks by writing YAML configuration files
Use --batch_size auto to let the harness detect the largest batch size that fits in device memory
Output results to JSON, CSV, or push to Weights & Biases
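A typical invocation combining the options above might be assembled as follows. This is a sketch: build_lm_eval_command is a hypothetical helper, the model and task names are only examples, and the flag names reflect the project's README as I understand it, so confirm them with lm_eval --help for your installed version.

```python
import shlex
from typing import List, Optional


def build_lm_eval_command(
    pretrained: str,
    tasks: List[str],
    num_fewshot: Optional[int] = None,
    batch_size: str = "auto",
    output_path: Optional[str] = None,
) -> List[str]:
    """Assemble an lm_eval command line for the Hugging Face backend."""
    cmd = [
        "lm_eval",
        "--model", "hf",
        "--model_args", f"pretrained={pretrained}",
        "--tasks", ",".join(tasks),
        "--batch_size", str(batch_size),
    ]
    if num_fewshot is not None:
        cmd += ["--num_fewshot", str(num_fewshot)]
    if output_path is not None:
        cmd += ["--output_path", output_path]
    return cmd


command = build_lm_eval_command(
    "EleutherAI/pythia-160m",      # example model; substitute your own
    ["hellaswag", "arc_easy"],     # example tasks
    num_fewshot=5,
    output_path="results",
)
print(shlex.join(command))
# To actually run it: subprocess.run(command, check=True)
```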

Key Features

400+ tasks covering reasoning, knowledge, coding, math, and language understanding
Backend-agnostic: evaluate local models, API endpoints, or quantized formats
Few-shot evaluation with deterministic example sampling
Group tasks into suites for composite benchmark scoring
Extensible YAML task format for custom evaluations
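The idea behind deterministic few-shot sampling can be sketched in a few lines. This illustrates the general technique (a seeded random generator), not the harness's internal implementation; the example pool, the seed value, and both function names are hypothetical.

```python
import random
from typing import List, Tuple

# Hypothetical pool of (question, answer) pairs to draw few-shot examples from.
POOL: List[Tuple[str, str]] = [
    ("What is 1 + 1?", "2"),
    ("What is 2 + 3?", "5"),
    ("What is 4 + 4?", "8"),
    ("What is 7 + 2?", "9"),
]


def sample_fewshot(pool: List[Tuple[str, str]], k: int, seed: int) -> List[Tuple[str, str]]:
    """Draw k examples with a private, seeded RNG so repeated runs pick the same shots."""
    rng = random.Random(seed)
    return rng.sample(pool, k)


def build_prompt(examples: List[Tuple[str, str]], query: str) -> str:
    """Concatenate the sampled examples, then append the unanswered query."""
    shots = "\n\n".join(f"Q: {q}\nA: {a}" for q, a in examples)
    return f"{shots}\n\nQ: {query}\nA:"


shots = sample_fewshot(POOL, k=2, seed=1234)
print(build_prompt(shots, "What is 5 + 5?"))
```

Using a private random.Random instance rather than the global generator keeps the selection independent of any other random calls made during the run.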

Comparison with Similar Tools

HELM (Stanford) — broader evaluation including fairness and toxicity; more complex setup
OpenCompass — Chinese-origin eval suite with similar scope; different task implementations
Promptfoo — focused on prompt testing and red-teaming rather than academic benchmarks
DeepEval — LLM testing with custom metrics; less standardized benchmark coverage
bigcode-evaluation-harness — separate framework inspired by this harness, specialized for code generation benchmarks

FAQ

Q: How do I evaluate a model served via API? A: Use --model openai-completions for the OpenAI API, or --model local-completions for an OpenAI-compatible server you host, and pass the endpoint as base_url in --model_args.
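For an API-served model, the main difference is the content of the model arguments string. The sketch below formats it from a dictionary; format_model_args is a hypothetical helper, and the model name and URL are placeholders for whatever your server exposes.

```python
from typing import Dict, List


def format_model_args(args: Dict[str, str]) -> str:
    """Join key/value pairs into the comma-separated form passed to --model_args."""
    return ",".join(f"{key}={value}" for key, value in args.items())


api_args = format_model_args(
    {
        "model": "my-served-model",                          # placeholder model name
        "base_url": "http://localhost:8000/v1/completions",  # placeholder endpoint
    }
)

command: List[str] = [
    "lm_eval",
    "--model", "local-completions",
    "--model_args", api_args,
    "--tasks", "gsm8k",  # example task
]
print(" ".join(command))
```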

Q: Can I add my own benchmark? A: Yes. Write a YAML task config specifying the dataset (from Hugging Face or local), prompt template, and scoring metric.
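A minimal custom task might look like the YAML embedded below, written to disk from Python so the example stays self-contained. The task name, data file path, and column names (text, label) are hypothetical. The field names and the --include_path flag follow the project's task guide as I understand it, so check them against the documentation for your version.

```python
import tempfile
from pathlib import Path

# Hypothetical task: binary sentiment classification over a local JSONL file.
TASK_YAML = """\
task: my_sentiment_task
dataset_path: json
dataset_kwargs:
  data_files:
    test: my_data/test.jsonl
output_type: multiple_choice
test_split: test
doc_to_text: "Review: {{text}}\\nSentiment:"
doc_to_target: label
doc_to_choice: ["negative", "positive"]
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: true
"""

# Write the config into a scratch directory that the harness can be pointed at.
task_dir = Path(tempfile.mkdtemp())
task_file = task_dir / "my_sentiment_task.yaml"
task_file.write_text(TASK_YAML, encoding="utf-8")

command = [
    "lm_eval",
    "--model", "hf",
    "--model_args", "pretrained=EleutherAI/pythia-160m",  # example model
    "--tasks", "my_sentiment_task",
    "--include_path", str(task_dir),
]
print(" ".join(command))
```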

Q: How long does a full evaluation take? A: Depends on model size and task count. MMLU on a 7B model takes roughly 30 minutes on a single A100.

Q: Is the scoring deterministic? A: Yes, given the same model weights, task version, and few-shot seed. The harness pins random seeds for reproducibility.
