# LM Evaluation Harness — Few-Shot Language Model Benchmarking

> A unified framework for evaluating language models across hundreds of benchmarks with reproducible few-shot testing.

## Quick Use
```bash
pip install lm-eval
lm_eval --model hf --model_args pretrained=meta-llama/Meta-Llama-3-8B --tasks hellaswag --num_fewshot 5
```

## Introduction
LM Evaluation Harness by EleutherAI provides a standardized way to evaluate language models across hundreds of academic benchmarks. It has become the de facto evaluation framework used by research labs and the Open LLM Leaderboard to produce comparable, reproducible benchmark scores.

## What LM Evaluation Harness Does
- Evaluates language models on 200+ tasks including MMLU, HellaSwag, ARC, GSM8K, and TruthfulQA
- Supports Hugging Face Transformers, GGUF, vLLM, OpenAI API, and custom model backends
- Runs few-shot and zero-shot evaluations with configurable prompt templates
- Produces standardized JSON results for comparison across models and checkpoints
- Handles batched inference and multi-GPU parallelism for efficient evaluation
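
Because results are plain JSON, comparing two runs takes only a few lines of Python. The sketch below assumes a layout with a top-level `results` object keyed by task name and metric keys such as `acc,none`. That layout and every score shown are illustrative placeholders, so check the files your harness version actually writes.

```python
import json

# Two hypothetical result payloads. The scores are made-up placeholders and
# the "results" -> task -> metric layout is an assumption.
run_a = json.loads(
    '{"results": {"hellaswag": {"acc,none": 0.50}, "arc_easy": {"acc,none": 0.75}}}'
)
run_b = json.loads(
    '{"results": {"hellaswag": {"acc,none": 0.25}, "arc_easy": {"acc,none": 0.75}}}'
)


def compare_runs(a, b, metric="acc,none"):
    """Return {task: score_a - score_b} for tasks present in both runs."""
    deltas = {}
    for task, metrics in a["results"].items():
        other = b["results"].get(task)
        if other is None or metric not in metrics or metric not in other:
            continue  # skip tasks or metrics missing from either run
        deltas[task] = metrics[metric] - other[metric]
    return deltas


deltas = compare_runs(run_a, run_b)
for task, delta in sorted(deltas.items()):
    print(f"{task}: {delta:+.2f}")
```

In practice you would replace the inline payloads with `json.load` calls on the files written to your output directory.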

## Architecture Overview
The harness defines each benchmark as a YAML-based task configuration that specifies the dataset, prompt template, few-shot examples, and scoring metric. At runtime it loads the configured model backend, constructs evaluation prompts, batches inference requests, collects log-likelihoods or generated text, and scores outputs against reference answers. Results are aggregated per-task and exported as structured JSON.
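
The flow described above can be sketched in miniature. This is not the harness's implementation: `build_prompt`, `toy_loglikelihood`, and `evaluate` are hypothetical names, and the toy scorer stands in for a real model backend that would return the log-probability of each answer choice given the prompt.

```python
import math


def build_prompt(fewshot_examples, question):
    """Join solved few-shot examples, then append the unanswered question."""
    shots = [f"Question: {q}\nAnswer: {a}" for q, a in fewshot_examples]
    return "\n\n".join(shots + [f"Question: {question}\nAnswer:"])


def toy_loglikelihood(context, continuation):
    """Hypothetical stand-in for a model backend.

    A real backend would return log P(continuation | context); this toy
    version only rewards continuations whose words appear in the context.
    """
    context_words = set(context.lower().split())
    hits = sum(word in context_words for word in continuation.lower().split())
    return math.log(1 + hits)


def evaluate(docs, fewshot_examples, loglikelihood=toy_loglikelihood):
    """Multiple-choice accuracy: pick the highest-scoring choice per doc."""
    correct = 0
    for doc in docs:
        prompt = build_prompt(fewshot_examples, doc["question"])
        scores = [loglikelihood(prompt, " " + choice) for choice in doc["choices"]]
        predicted = scores.index(max(scores))
        correct += predicted == doc["answer"]
    return {"acc": correct / len(docs)}


fewshot = [("What color is the sky?", "blue"), ("What color is a lime?", "green")]
docs = [
    {"question": "What color is the ocean?", "choices": ["blue", "red"], "answer": 0},
    {"question": "What color is grass?", "choices": ["purple", "green"], "answer": 1},
]
result = evaluate(docs, fewshot)
print(result)
```

Swapping `toy_loglikelihood` for a function that queries a real model is the only change needed to turn the sketch into a genuine, if slow, evaluator.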

## Self-Hosting & Configuration
- Install from PyPI with pip or clone the repository for development use
- Configure model backends via --model and --model_args flags on the CLI
- Define custom evaluation tasks in YAML without writing Python code
- Set batch size, number of few-shot examples, and output directory via CLI flags
- Run distributed evaluation across multiple GPUs with accelerate or vLLM tensor parallelism
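
When runs are scripted, it can help to assemble the CLI invocation programmatically. The helper below is a hypothetical convenience wrapper, not part of the harness. It only builds the argument list from flags the CLI documents (`--model`, `--model_args`, `--tasks`, `--num_fewshot`, `--batch_size`, `--output_path`). The model name and output directory are example values.

```python
import shlex


def build_lm_eval_command(model, model_args, tasks, num_fewshot=None,
                          batch_size=None, output_path=None):
    """Assemble an lm_eval command line as a list of arguments."""
    cmd = ["lm_eval", "--model", model]
    # --model_args takes comma-separated key=value pairs.
    cmd += ["--model_args", ",".join(f"{k}={v}" for k, v in model_args.items())]
    cmd += ["--tasks", ",".join(tasks)]
    if num_fewshot is not None:
        cmd += ["--num_fewshot", str(num_fewshot)]
    if batch_size is not None:
        cmd += ["--batch_size", str(batch_size)]
    if output_path is not None:
        cmd += ["--output_path", output_path]
    return cmd


cmd = build_lm_eval_command(
    model="hf",
    model_args={"pretrained": "EleutherAI/pythia-160m"},  # example model
    tasks=["hellaswag", "arc_easy"],
    num_fewshot=5,
    batch_size=8,
    output_path="results/pythia-160m",  # hypothetical output directory
)
command_line = shlex.join(cmd)
print(command_line)
```

Pass `cmd` to `subprocess.run(cmd, check=True)` to execute it. For multi-GPU runs, the project README describes launching through `accelerate launch -m lm_eval` or, with the vLLM backend, adding `tensor_parallel_size` to the model arguments. Verify the exact options against the version you have installed.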

## Key Features
- Powers the Hugging Face Open LLM Leaderboard, a widely used public ranking of open models
- YAML task definitions allow non-programmers to add new benchmarks quickly
- Caches model outputs for fast re-scoring when metrics change
- Supports log-likelihood, multiple-choice, and generative evaluation modes
- Extensible model API enables integration with any inference backend
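
To give a feel for what plugging in a backend involves, here is a simplified, self-contained sketch. The `Backend` class is a hypothetical stand-in, not the harness's real base class. Its method names mirror the log-likelihood and generation request types, but the signatures are simplified assumptions; the real interface is defined in the harness's model API documentation.

```python
from abc import ABC, abstractmethod


class Backend(ABC):
    """Hypothetical stand-in for a harness model interface."""

    @abstractmethod
    def loglikelihood(self, requests):
        """requests: list of (context, continuation) -> list of floats."""

    @abstractmethod
    def generate_until(self, requests):
        """requests: list of (context, stop_strings) -> list of strings."""


class EchoBackend(Backend):
    """Toy backend: shorter continuations score higher; generation echoes."""

    def loglikelihood(self, requests):
        return [-float(len(continuation)) for _context, continuation in requests]

    def generate_until(self, requests):
        outputs = []
        for context, stop_strings in requests:
            words = context.split()
            text = words[-1] if words else ""
            for stop in stop_strings:
                text = text.split(stop)[0]  # truncate at each stop string
            outputs.append(text)
        return outputs


backend = EchoBackend()
scores = backend.loglikelihood([("2 + 2 =", " 4"), ("2 + 2 =", " five")])
generated = backend.generate_until([("Repeat the word: hello.", ["."])])
print(scores, generated)
```

A real backend would replace the toy logic with calls to an inference engine and return one result per request, in request order.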

## Comparison with Similar Tools
- **Promptfoo** — focuses on prompt regression testing for applications; LM Eval Harness targets academic benchmark evaluation
- **DeepEval** — LLM testing with custom metrics for production apps; LM Eval Harness covers established research benchmarks
- **Ragas** — specializes in RAG pipeline evaluation; LM Eval Harness evaluates base model capabilities
- **OpenCompass** — Chinese-originated evaluation suite; LM Eval Harness has broader English benchmark coverage
- **Inspect AI** — safety-focused evaluations; LM Eval Harness covers general capability benchmarks

## FAQ
**Q: Which models does it support?**
A: Any model accessible via Hugging Face Transformers, GGUF files, vLLM, or an OpenAI-compatible API endpoint.

**Q: How long does a full evaluation take?**
A: It depends on the GPU, batch size, and task. As a rough guide, running MMLU on an 8B model with a single GPU takes about 30-60 minutes, while smaller benchmarks like HellaSwag finish in minutes.

**Q: Can I add my own benchmark?**
A: Yes. Create a YAML task config specifying the Hugging Face dataset, prompt template, and metric. No Python code is required for standard formats.
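
As an illustration, the script below writes a minimal multiple-choice task config to disk. The task name, dataset path, and column names (`question`, `choices`, `answer`) are all hypothetical. The field names follow the harness's YAML task format as commonly documented, but treat the exact schema as an assumption and confirm it against the task guide for your version.

```python
from pathlib import Path

# All names below (task name, dataset path, column names) are hypothetical.
# The raw string keeps "\n" as a YAML escape instead of a Python newline.
TASK_YAML = r"""task: my_custom_qa
dataset_path: my-org/my-qa-dataset
output_type: multiple_choice
test_split: test
doc_to_text: "Question: {{question}}\nAnswer:"
doc_to_choice: choices
doc_to_target: answer
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: true
"""

task_dir = Path("my_tasks")
task_dir.mkdir(exist_ok=True)
task_file = task_dir / "my_custom_qa.yaml"
task_file.write_text(TASK_YAML, encoding="utf-8")
print(f"wrote {task_file}")
```

The task can then be selected with `--tasks my_custom_qa` once the CLI is pointed at the directory with `--include_path my_tasks`.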

**Q: Is it used for the Open LLM Leaderboard?**
A: Yes. The Hugging Face Open LLM Leaderboard uses LM Evaluation Harness as its evaluation backend.

## Sources
- https://github.com/EleutherAI/lm-evaluation-harness
- https://www.eleuther.ai/projects/large-language-model-evaluation

---
Source: https://tokrepo.com/en/workflows/asset-62c59e45
Author: AI Open Source