# Margin Eval — Local Evals for CLI Coding Agents

> Margin Eval is an eval runtime that benchmarks CLI coding agents and records accuracy, token usage, runtime, and traces in a reproducible format.

## Quick Use

1. Install / run:

```bash
curl -fsSL https://raw.githubusercontent.com/Margin-Lab/evals/main/scripts/install.sh | bash
```

2. Start / smoke test:

```bash
margin --version && margin check
```

3. Verify:
   - Run a `--dry-run` eval from the README; confirm a run directory is created and that the trace bundle is reproducible across two executions (a scripted version of this check appears at the end of this page).

## Intro

Margin Eval is an open-source eval runtime: it benchmarks CLI coding agents such as Claude Code, Codex, and Gemini CLI, recording accuracy, token usage, runtime, and execution traces so runs can be reproduced and audited for regressions.

- **Best for:** teams comparing CLI agents (Claude Code / Codex / Gemini CLI) with one unified harness and trace format
- **Works with:** Docker plus a provider API key or OAuth; runs local suites from Git repos and saves run bundles
- **Setup time:** 20 minutes

## Practical Notes

- Setup time is roughly 20 minutes (install + `margin check` + one dry-run).
- Two measurable checks: `margin --version` works, and a run bundle is produced under your output folder.
- GitHub stars and forks (verified): see Source & Thanks.

Margin Eval is strongest when you standardize what counts as success for tool-using agent runs:

- Use a shared suite repo for scenarios and fixtures.
- Keep agent configs in version control, so changes go through review.
- Compare agents side by side using the same suites and eval configs, so conclusions are reproducible.

If you run multiple providers, treat auth as part of the harness: keep keys out of logs, and make dry-run part of every developer's setup.

### FAQ

**Q: Why evaluate locally instead of only in CI?**
A: Local evals shorten the iteration loop: you can reproduce and debug a failure immediately, before pushing.

**Q: Do I need Docker?**
A: The README lists Docker as a prerequisite for the quickstart.

**Q: What should I store long-term?**
A: Store the run bundle/traces and a small summary so regressions can be audited later.

## Source & Thanks

> Source: https://github.com/Margin-Lab/evals
> License: AGPL-3.0
> GitHub stars: 59 · forks: 1

---

Source: https://tokrepo.com/en/workflows/margin-eval-local-evals-for-cli-coding-agents
Author: Script Depot
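
## Appendix: Scripted Reproducibility Check

The verification step in Quick Use asks you to confirm that two executions produce a reproducible trace bundle. Below is a minimal sketch of how that comparison could be scripted with standard tools only. The directory names (`runs/run-a`, `runs/run-b`), the assumption that bundles should hash identically byte-for-byte, and the output file `bundle.diff` are illustrative assumptions, not part of Margin Eval's documented layout or CLI.

```bash
#!/usr/bin/env bash
# Compare two run bundles produced by separate dry-run executions.
# All paths are hypothetical examples; point them at your actual output folders.
set -euo pipefail

RUN_A="${1:-runs/run-a}"
RUN_B="${2:-runs/run-b}"

# Hash every file in a bundle, keyed by its path relative to the bundle root,
# so the comparison ignores where the bundles live on disk.
hash_bundle() {
  local dir="$1"
  (cd "$dir" && find . -type f -print0 | sort -z | xargs -0 sha256sum)
}

if diff <(hash_bundle "$RUN_A") <(hash_bundle "$RUN_B") > bundle.diff; then
  echo "Run bundles match: traces are byte-for-byte identical."
else
  echo "Run bundles differ; see bundle.diff."
fi
```

If the bundles contain fields that legitimately vary between runs (timestamps, run IDs), a byte-level diff will flag them; in that case exclude those files from the hash or normalize them before comparing, and treat only the remaining differences as reproducibility failures.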