# Margin Eval — Local Evals for CLI Coding Agents

> Margin Eval is an eval runtime that benchmarks CLI coding agents and records accuracy, token usage, runtime, and traces in a reproducible format.

## Quick Use

1. Install / run:

```bash
curl -fsSL https://raw.githubusercontent.com/Margin-Lab/evals/main/scripts/install.sh | bash
```

2. Start / smoke test:

```bash
margin --version && margin check
```

3. Verify:
   - Run a `--dry-run` eval from the README; confirm a run directory is created and that the trace bundle is reproducible across two executions (a scripted version of this check appears at the end of this page).

## Intro

Margin Eval is an open-source eval runtime: it benchmarks CLI coding agents such as Claude Code, Codex, and Gemini CLI, recording accuracy, token usage, runtime, and execution traces so runs can be reproduced and audited for regressions.

- **Best for:** teams comparing CLI agents (Claude Code / Codex / Gemini CLI) with one unified harness and trace format
- **Works with:** Docker plus a provider API key or OAuth; runs local suites from Git repos and saves run bundles
- **Setup time:** 20 minutes

## Practical Notes

- Setup time is roughly 20 minutes (install + `margin check` + one dry-run).
- Two measurable checks: `margin --version` works, and a run bundle is produced under your output folder.
- GitHub stars and forks (verified): see Source & Thanks.

Margin Eval is strongest when you standardize what counts as success for tool-using agent runs:

- Use a shared suite repo for scenarios and fixtures.
- Keep agent configs in version control, so changes go through review.
- Compare agents side by side using the same suites and eval configs, so conclusions are reproducible.

If you run multiple providers, treat auth as part of the harness: keep keys out of logs, and make dry-run part of every developer's setup.

### FAQ

**Q: Why evaluate locally instead of only in CI?**
A: Local evals shorten the iteration loop: you can reproduce and debug a failure immediately, before pushing.

**Q: Do I need Docker?**
A: The README lists Docker as a prerequisite for the quickstart.

**Q: What should I store long-term?**
A: Store the run bundle/traces and a small summary so regressions can be audited later.

## Source & Thanks

> Source: https://github.com/Margin-Lab/evals
> License: AGPL-3.0
> GitHub stars: 59 · forks: 1

---

Source: https://tokrepo.com/en/workflows/margin-eval-local-evals-for-cli-coding-agents
Author: Script Depot
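
## Appendix: Scripted Reproducibility Check

The verification step in Quick Use asks you to confirm that two executions produce a reproducible trace bundle. Below is a minimal sketch of how that comparison could be scripted with standard tools only. The directory names (`runs/run-a`, `runs/run-b`), the assumption that bundles should hash identically byte-for-byte, and the output file `bundle.diff` are illustrative assumptions, not part of Margin Eval's documented layout or CLI.

```bash
#!/usr/bin/env bash
# Compare two run bundles produced by separate dry-run executions.
# All paths are hypothetical examples; point them at your actual output folders.
set -euo pipefail

RUN_A="${1:-runs/run-a}"
RUN_B="${2:-runs/run-b}"

# Hash every file in a bundle, keyed by its path relative to the bundle root,
# so the comparison ignores where the bundles live on disk.
hash_bundle() {
  local dir="$1"
  (cd "$dir" && find . -type f -print0 | sort -z | xargs -0 sha256sum)
}

if diff <(hash_bundle "$RUN_A") <(hash_bundle "$RUN_B") > bundle.diff; then
  echo "Run bundles match: traces are byte-for-byte identical."
else
  echo "Run bundles differ; see bundle.diff."
fi
```

If the bundles contain fields that legitimately vary between runs (timestamps, run IDs), a byte-level diff will flag them; in that case exclude those files from the hash or normalize them before comparing, and treat only the remaining differences as reproducibility failures.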