Introduction
The LM Evaluation Harness by EleutherAI provides a unified interface for evaluating generative language models across standardized benchmarks. It ensures reproducible evaluation with consistent prompting, scoring, and reporting, and serves as the evaluation backend for Hugging Face's Open LLM Leaderboard.
What LM Evaluation Harness Does
- Evaluates language models on 400+ built-in benchmark tasks
- Supports multiple model backends (Hugging Face, GGUF, vLLM, OpenAI API, etc.)
- Handles few-shot prompting with configurable example selection
- Computes metrics including accuracy, perplexity, exact match, and BLEU
- Produces structured JSON output for automated comparison pipelines
Architecture Overview
The harness separates model interfaces (LM classes) from task definitions (YAML configs specifying the dataset, prompt template, and metric). During evaluation, it constructs prompts according to each task's specification, batches requests to the model, collects log-probabilities or generated text, and scores the outputs against reference answers. A task registry automatically discovers built-in tasks and user-defined YAML configs. Caching avoids redundant computation across runs.
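A minimal CLI run shows how these pieces map onto the command line; this is a sketch with an illustrative model, task, and output path rather than recommended defaults:

```bash
# Sketch: the model, task, and output path below are illustrative choices.
# --model / --model_args select the LM interface (a Hugging Face model here),
# --tasks names a task resolved through the task registry, and the scored
# results are written as structured JSON under --output_path.
lm_eval \
  --model hf \
  --model_args pretrained=EleutherAI/pythia-160m \
  --tasks hellaswag \
  --num_fewshot 0 \
  --batch_size 8 \
  --output_path results/
```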
Self-Hosting & Configuration
- Install via pip: pip install lm-eval, or clone the repository for development
- Configure the model backend and its arguments via the CLI or the Python API
- Add custom tasks by writing YAML configuration files (see the sketch after this list)
- Use --batch_size auto for automatic, GPU memory-aware batching
- Output results to JSON, CSV, or push to Weights & Biases
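As a rough sketch of the custom-task workflow, assuming the harness's YAML task schema (field names such as dataset_path, output_type, and metric_list); the task name, dataset, and templates below are placeholders:

```bash
# Sketch: register a custom multiple-choice task from a YAML config.
# The task name, dataset, and templates are placeholders.
mkdir -p my_tasks
cat > my_tasks/my_benchmark.yaml <<'EOF'
task: my_benchmark
dataset_path: my-org/my-dataset      # Hugging Face dataset ID or local path
output_type: multiple_choice
doc_to_text: "{{question}}"          # prompt template rendered per example
doc_to_choice: "{{choices}}"         # candidate answers to score
doc_to_target: "{{answer}}"          # index of the correct choice
metric_list:
  - metric: acc
EOF

# Point the harness at the directory holding the new config and run the task.
lm_eval \
  --model hf \
  --model_args pretrained=EleutherAI/pythia-160m \
  --tasks my_benchmark \
  --include_path my_tasks \
  --output_path results/
```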
Key Features
- 400+ tasks covering reasoning, knowledge, coding, math, and language understanding
- Backend-agnostic: evaluate local models, API endpoints, or quantized formats
- Few-shot evaluation with deterministic example sampling
- Group tasks into suites for composite benchmark scoring (see the example after this list)
- Extensible YAML task format for custom evaluations
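The task-grouping feature above works the same way from the CLI: passing a group name runs all of its subtasks and reports per-subtask scores plus a group aggregate. A sketch with an illustrative model (mmlu is a built-in task group):

```bash
# Sketch: evaluating a task group; model choice and shot count are illustrative.
lm_eval \
  --model hf \
  --model_args pretrained=EleutherAI/pythia-160m \
  --tasks mmlu \
  --num_fewshot 5 \
  --batch_size auto \
  --output_path results/
```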
Comparison with Similar Tools
- HELM (Stanford) — broader evaluation including fairness and toxicity; more complex setup
- OpenCompass — Chinese-origin eval suite with similar scope; different task implementations
- Promptfoo — focused on prompt testing and red-teaming rather than academic benchmarks
- DeepEval — LLM testing with custom metrics; less standardized benchmark coverage
- bigcode-evaluation-harness — specialized fork for code generation benchmarks
FAQ
Q: How do I evaluate a model served via API?
A: Use the --model openai-completions or --model local-completions backend with the appropriate base URL.
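For instance, against a locally served OpenAI-compatible completions endpoint (the URL, served model name, and task below are placeholders; the exact model_args accepted can vary by harness version):

```bash
# Sketch: evaluate a model behind an OpenAI-compatible completions API.
# base_url and model are placeholders for your local server.
lm_eval \
  --model local-completions \
  --model_args model=my-served-model,base_url=http://localhost:8000/v1/completions \
  --tasks gsm8k
```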
Q: Can I add my own benchmark?
A: Yes. Write a YAML task config specifying the dataset (from Hugging Face or local), prompt template, and scoring metric.
Q: How long does a full evaluation take?
A: Depends on model size and task count. MMLU on a 7B model takes roughly 30 minutes on a single A100.
Q: Is the scoring deterministic?
A: Yes, given the same model weights, task version, and few-shot seed. The harness pins random seeds for reproducibility.