CLI Tools · May 11, 2026 · 2 min read

Margin Eval — Local Evals for CLI Coding Agents

Margin Eval is an eval runtime that benchmarks CLI coding agents and records accuracy, token usage, runtime, and traces in a reproducible format.

Agent ready

This asset can be read and installed directly by agents

TokRepo exposes a universal CLI command, install contract, metadata JSON, adapter-aware plan, and raw content links so agents can judge fit, risk, and next actions.

Stage only · 29/100

  • Agent surface: Any MCP/CLI agent
  • Kind: CLI Tool
  • Install: Single
  • Trust: Established
  • Entrypoint: README.md

Universal CLI install command:

npx tokrepo install f4905383-abe8-46fb-8c5c-2cdcdb45b141

Intro

  • Best for: Teams comparing CLI agents (Claude Code/Codex/Gemini CLI) with one unified harness and trace format
  • Works with: Docker + a provider API key or OAuth; runs local suites from Git repos and saves run bundles
  • Setup time: 20 minutes

Practical Notes

  • Setup time ~20 minutes (install + margin check + one dry-run)
  • Two measurable checks: margin --version works, and a run bundle is produced under your output folder
  • GitHub stars + forks (verified): see Source & Thanks
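
The two checks above can be scripted. This is a minimal sketch, assuming the margin binary is on your PATH and that each run bundle shows up as a non-empty subdirectory of your output folder. The "runs" path and both helper names are hypothetical and are not part of Margin Eval.

```python
import shutil
import subprocess
from pathlib import Path


def cli_responds(command: str, flag: str = "--version") -> bool:
    """Return True if `command flag` exits with status 0."""
    if shutil.which(command) is None:
        return False  # binary is not on PATH
    result = subprocess.run([command, flag], capture_output=True, text=True)
    return result.returncode == 0


def bundles_in(output_dir: str) -> list[str]:
    """List non-empty subdirectories of output_dir (assumed: one per run bundle)."""
    root = Path(output_dir)
    if not root.is_dir():
        return []
    return sorted(p.name for p in root.iterdir() if p.is_dir() and any(p.iterdir()))


if __name__ == "__main__":
    print("margin --version ok:", cli_responds("margin"))
    print("bundles found:", bundles_in("runs"))  # "runs" is a placeholder path
```

If your bundles are written as single archive files instead of directories, the filter in bundles_in would need to change accordingly.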

Margin Eval is strongest when you standardize “what counts as success” for tool-using agent runs:

  • Use a shared suite repo for scenarios and fixtures.
  • Keep agent configs in version control (so changes are reviewed).
  • Compare agents side-by-side using the same suites and eval configs.
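
A side-by-side comparison can be as simple as a table built from one summary per agent. The sketch below assumes you have already pulled accuracy, token usage, and runtime out of each run bundle into plain records. The RunSummary fields and the sample numbers are illustrative placeholders, not Margin Eval's bundle schema or real results.

```python
from dataclasses import dataclass


@dataclass
class RunSummary:
    agent: str        # hypothetical label for an agent config
    suite: str
    accuracy: float   # fraction of scenarios passed, 0.0 to 1.0
    tokens: int
    runtime_s: float


def comparison_table(runs: list[RunSummary], suite: str) -> str:
    """Render runs from one suite as a text table, best accuracy first."""
    rows = sorted((r for r in runs if r.suite == suite),
                  key=lambda r: (-r.accuracy, r.tokens))
    lines = [f"{'agent':<12}{'accuracy':>10}{'tokens':>10}{'runtime_s':>11}"]
    for r in rows:
        lines.append(
            f"{r.agent:<12}{r.accuracy:>10.2%}{r.tokens:>10}{r.runtime_s:>11.1f}"
        )
    return "\n".join(lines)


if __name__ == "__main__":
    # Placeholder values for illustration only.
    sample = [
        RunSummary("agent-a", "smoke", 0.80, 12000, 95.0),
        RunSummary("agent-b", "smoke", 0.90, 15000, 120.5),
        RunSummary("agent-a", "other", 1.00, 500, 3.0),
    ]
    print(comparison_table(sample, "smoke"))
```

Filtering by suite before sorting keeps the comparison honest: rows only appear together when they ran the same scenarios.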

If you run multiple providers, treat auth as part of the harness: keep keys out of logs, and make sure dry-run is part of every developer’s setup.
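
One way to keep keys out of logs is to scrub known secret values before a line is written. This is a minimal sketch, assuming secrets are supplied through environment variables. The names in SECRET_ENV_VARS are examples to adapt, and the helper is not a feature of Margin Eval itself.

```python
import os

# Example variable names; replace with whatever your providers actually use.
SECRET_ENV_VARS = ("ANTHROPIC_API_KEY", "OPENAI_API_KEY", "GEMINI_API_KEY")


def redact(text: str, env=None) -> str:
    """Replace the value of each configured secret found in text with a marker."""
    env = os.environ if env is None else env
    for name in SECRET_ENV_VARS:
        value = env.get(name)
        if value:  # skip unset or empty variables
            text = text.replace(value, f"<{name} redacted>")
    return text
```

Passing captured agent output through redact before it reaches a log file or a shared trace covers the common case of a key echoed in an error message. It does not catch secrets that arrive by other routes, such as OAuth tokens stored on disk.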

FAQ

Q: Why evaluate locally instead of only in CI?
A: Local evals shorten iteration loops. You can reproduce a failure immediately before pushing.

Q: Do I need Docker?
A: The README lists Docker as a prerequisite for the quickstart.

Q: What should I store long-term?
A: Store the run bundle/traces and a small summary so regressions can be audited later.
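
For long-term storage, a small summary file kept next to the bundle makes later audits easier. The sketch below hashes every file in a bundle directory and writes the result as JSON. The directory layout, the output file name, and the note field are assumptions for illustration, since the bundle format is defined by Margin Eval and is not described here.

```python
import hashlib
import json
from pathlib import Path


def summarize_bundle(bundle_dir: str, note: str = "") -> dict:
    """Build a manifest of SHA-256 digests for every file under bundle_dir."""
    root = Path(bundle_dir)
    files = {
        p.relative_to(root).as_posix(): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }
    return {"bundle": root.name, "note": note, "files": files}


def write_summary(bundle_dir: str, out_path: str, note: str = "") -> None:
    """Write the manifest as JSON; keep out_path outside the bundle directory."""
    summary = summarize_bundle(bundle_dir, note)
    Path(out_path).write_text(json.dumps(summary, indent=2, sort_keys=True))
```

Re-running summarize_bundle on an archived bundle and comparing digests shows whether the stored traces are still the ones the original result came from.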

Source & Thanks

  • Source: https://github.com/Margin-Lab/evals
  • License: AGPL-3.0
  • GitHub stars: 59 · forks: 1
