LM Evaluation Harness — Few-Shot Language Model Benchmarking

Introduction

LM Evaluation Harness by EleutherAI provides a standardized way to evaluate language models across hundreds of academic benchmarks. It has become the de facto evaluation framework used by research labs and the Open LLM Leaderboard to produce comparable, reproducible benchmark scores.

What LM Evaluation Harness Does

Evaluates language models on 200+ tasks including MMLU, HellaSwag, ARC, GSM8K, and TruthfulQA
Supports Hugging Face Transformers, GGUF, vLLM, OpenAI API, and custom model backends
Runs few-shot and zero-shot evaluations with configurable prompt templates
Produces standardized JSON results for comparison across models and checkpoints
Handles batched inference and multi-GPU parallelism for efficient evaluation
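
To make the few-shot versus zero-shot distinction concrete, here is a minimal sketch of how a k-shot prompt might be assembled. The helpers `format_example` and `build_prompt` and the Question/Answer template are illustrative assumptions, not the harness's own templates or API.

```python
from typing import Dict, List

# Hypothetical helpers for illustration only; the harness builds prompts
# from its task configs, not from these functions.


def format_example(doc: Dict[str, str], include_answer: bool) -> str:
    """Render one document with an assumed Question/Answer template."""
    text = f"Question: {doc['question']}\nAnswer:"
    if include_answer:
        text += f" {doc['answer']}"
    return text


def build_prompt(
    fewshot_docs: List[Dict[str, str]],
    query: Dict[str, str],
    num_fewshot: int,
) -> str:
    """Prepend num_fewshot solved examples to the unsolved query."""
    shots = [format_example(d, include_answer=True) for d in fewshot_docs[:num_fewshot]]
    return "\n\n".join(shots + [format_example(query, include_answer=False)])


train = [
    {"question": "What is 2 + 2?", "answer": "4"},
    {"question": "What is 3 + 5?", "answer": "8"},
]
query = {"question": "What is 7 + 1?", "answer": "8"}

zero_shot = build_prompt(train, query, num_fewshot=0)  # query only
two_shot = build_prompt(train, query, num_fewshot=2)   # two solved examples, then the query
print(two_shot)
```

With `num_fewshot=0` the prompt contains only the query, which is the zero-shot case; larger values prepend that many solved examples.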

Architecture Overview

The harness defines each benchmark as a YAML-based task configuration that specifies the dataset, prompt template, few-shot examples, and scoring metric. At runtime it loads the configured model backend, constructs evaluation prompts, batches inference requests, collects log-likelihoods or generated text, and scores outputs against reference answers. Results are aggregated per-task and exported as structured JSON.
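
The scoring step for a multiple-choice task can be sketched as follows: each candidate answer is scored by its log-likelihood given the prompt, the highest-scoring candidate is taken as the prediction, and accuracy is aggregated per task. This is a simplified illustration, not the harness's code; `toy_loglikelihood` and its score table are hypothetical stand-ins for a real model backend.

```python
import json
from typing import Callable, Dict, List

# A model backend reduced to one function: log P(continuation | context).
LogLikelihoodFn = Callable[[str, str], float]

# Hypothetical fixed scores standing in for real model outputs.
TOY_SCORES = {"Paris": -0.5, "Rome": -2.0, "blue": -0.7, "green": -0.3}


def toy_loglikelihood(context: str, continuation: str) -> float:
    """Look up a canned score; a real backend would run the model here."""
    return TOY_SCORES[continuation.strip()]


def pick_choice(context: str, choices: List[str], loglikelihood: LogLikelihoodFn) -> int:
    """Return the index of the choice with the highest log-likelihood."""
    scores = [loglikelihood(context, " " + choice) for choice in choices]
    return max(range(len(choices)), key=scores.__getitem__)


def accuracy(docs: List[Dict], loglikelihood: LogLikelihoodFn) -> float:
    """Fraction of documents where the predicted choice matches the gold index."""
    correct = sum(
        pick_choice(d["context"], d["choices"], loglikelihood) == d["gold"] for d in docs
    )
    return correct / len(docs)


docs = [
    {"context": "The capital of France is", "choices": ["Paris", "Rome"], "gold": 0},
    {"context": "The clear daytime sky is", "choices": ["blue", "green"], "gold": 0},
]

# Aggregate per task and export as structured JSON.
results = {"toy_task": {"acc": accuracy(docs, toy_loglikelihood)}}
print(json.dumps(results))
```

The toy scorer deliberately gets the second question wrong, so the aggregated accuracy is 0.5.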

Self-Hosting & Configuration

Install from PyPI with pip or clone the repository for development use
Configure model backends via --model and --model_args flags on the CLI
Define custom evaluation tasks in YAML without writing Python code
Set batch size, number of few-shot examples, and output directory via CLI flags
Run distributed evaluation across multiple GPUs with accelerate or vLLM tensor parallelism
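
As a rough illustration of the CLI configuration described above, the sketch below assembles an `lm_eval` command line in Python without running it. Only `--model` and `--model_args` are named in this page; the remaining flags (`--tasks`, `--num_fewshot`, `--batch_size`, `--output_path`) are assumed from recent releases of the harness and are worth confirming with `lm_eval --help`. The model name and output directory are example placeholders.

```python
import shlex
from typing import Dict, List


def build_lm_eval_command(
    model: str,
    model_args: Dict[str, str],
    tasks: List[str],
    num_fewshot: int,
    batch_size: int,
    output_path: str,
) -> List[str]:
    """Assemble an lm_eval argv list; nothing is executed here."""
    return [
        "lm_eval",
        "--model", model,
        # --model_args is passed as comma-separated key=value pairs.
        "--model_args", ",".join(f"{k}={v}" for k, v in model_args.items()),
        "--tasks", ",".join(tasks),
        "--num_fewshot", str(num_fewshot),
        "--batch_size", str(batch_size),
        "--output_path", output_path,
    ]


cmd = build_lm_eval_command(
    model="hf",
    model_args={"pretrained": "EleutherAI/pythia-160m"},  # example model
    tasks=["hellaswag", "arc_easy"],
    num_fewshot=5,
    batch_size=8,
    output_path="results/pythia-160m",  # example directory
)
print(shlex.join(cmd))
```

Passing `cmd` to `subprocess.run(cmd, check=True)` would launch the evaluation, provided the harness is installed in the active environment.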

Key Features

Serves as the evaluation backend for the Hugging Face Open LLM Leaderboard, a widely used public model ranking
YAML task definitions allow non-programmers to add new benchmarks quickly
Caches model outputs for fast re-scoring when metrics change
Supports log-likelihood, multiple-choice, and generative evaluation modes
Extensible model API enables integration with any inference backend
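
The idea behind caching model outputs for re-scoring can be sketched like this: responses are stored under a hash of the request, so a second metric can be computed without calling the model again. This is an in-memory illustration under stated assumptions, not the harness's cache implementation; `CachedModel` and both metric functions are hypothetical.

```python
import hashlib
import json
from typing import Callable, Dict


class CachedModel:
    """Wrap a text-generation callable and memoize outputs by request hash."""

    def __init__(self, generate: Callable[[str], str]) -> None:
        self._generate = generate
        self._cache: Dict[str, str] = {}
        self.misses = 0  # number of real model calls made

    def _key(self, prompt: str) -> str:
        payload = json.dumps({"prompt": prompt}, sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def __call__(self, prompt: str) -> str:
        key = self._key(prompt)
        if key not in self._cache:
            self.misses += 1
            self._cache[key] = self._generate(prompt)
        return self._cache[key]


def exact_match(pred: str, ref: str) -> float:
    return float(pred.strip() == ref.strip())


def prefix_match(pred: str, ref: str) -> float:
    return float(pred.strip().startswith(ref.strip()))


# A canned "model" standing in for an expensive backend.
model = CachedModel(lambda prompt: "4 apples" if "2 + 2" in prompt else "unknown")
prompts = ["What is 2 + 2?", "What is 9 * 9?"]
refs = ["4", "81"]

# First metric triggers the model calls; the second reuses cached outputs.
em = sum(exact_match(model(p), r) for p, r in zip(prompts, refs)) / len(prompts)
pm = sum(prefix_match(model(p), r) for p, r in zip(prompts, refs)) / len(prompts)
print(em, pm, model.misses)
```

Both metrics are computed over the same two prompts, yet the wrapped model is called only twice in total.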

Comparison with Similar Tools

Promptfoo — focuses on prompt regression testing for applications; LM Eval Harness targets academic benchmark evaluation
DeepEval — LLM testing with custom metrics for production apps; LM Eval Harness covers established research benchmarks
Ragas — specializes in RAG pipeline evaluation; LM Eval Harness evaluates base model capabilities
OpenCompass — Chinese-originated evaluation suite; LM Eval Harness has broader English benchmark coverage
Inspect AI — safety-focused evaluations; LM Eval Harness covers general capability benchmarks

FAQ

Q: Which models does it support? A: Any model accessible via Hugging Face Transformers, GGUF files, vLLM, or an OpenAI-compatible API endpoint.

Q: How long does a full evaluation take? A: Running MMLU on an 8B model with a single GPU takes roughly 30-60 minutes. Smaller benchmarks like HellaSwag finish in minutes.

Q: Can I add my own benchmark? A: Yes. Create a YAML task config specifying the Hugging Face dataset, prompt template, and metric. No Python code is required for standard formats.
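
For a sense of what such a task config contains, the sketch below builds one as a Python dict and serializes it as JSON, which most YAML parsers also accept. The field names are written from memory of the harness's task guide and should be checked against the current documentation; the task name, dataset path, and column names are hypothetical placeholders.

```python
import json

# Field names are assumptions to verify against the harness docs;
# the dataset path and column names are hypothetical.
task_config = {
    "task": "my_custom_qa",
    "dataset_path": "my-org/my-qa-dataset",
    "output_type": "multiple_choice",
    "test_split": "test",
    "doc_to_text": "Question: {{question}}\nAnswer:",  # prompt template
    "doc_to_choice": "choices",  # column holding the answer options
    "doc_to_target": "label",    # column holding the gold index
    "metric_list": [
        {"metric": "acc", "aggregation": "mean", "higher_is_better": True},
    ],
}

# Fields this sketch expects to be present (not an official required list).
EXPECTED_FIELDS = (
    "task", "dataset_path", "output_type",
    "doc_to_text", "doc_to_target", "metric_list",
)
missing = [name for name in EXPECTED_FIELDS if name not in task_config]
if missing:
    raise ValueError(f"task config is missing fields: {missing}")

# JSON text can be saved with a .yaml extension and read by most YAML parsers.
yaml_text = json.dumps(task_config, indent=2)
print(yaml_text)
```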

Q: Is it used for the Open LLM Leaderboard? A: Yes. The Hugging Face Open LLM Leaderboard uses LM Evaluation Harness as its evaluation backend.
