CLI Tools · May 11, 2026 · 2 min read

Margin Eval — Local Evals for CLI Coding Agents

Margin Eval is an eval runtime that benchmarks CLI coding agents and records accuracy, token usage, runtime, and traces in a reproducible format.

Agent-ready

This asset can be read and installed directly by agents.

TokRepo exposes a universal CLI command, an install contract, JSON metadata, a per-adapter plan, and raw content so agents can evaluate compatibility, risk, and next steps.

Stage only · 29/100
Agent surface: Any MCP/CLI agent
Type: CLI Tool
Install: Single
Trust: Established
Entry point: README.md
Universal CLI command:
npx tokrepo install f4905383-abe8-46fb-8c5c-2cdcdb45b141
Introduction


  • Best for: Teams comparing CLI agents (Claude Code/Codex/Gemini CLI) with one unified harness and trace format
  • Works with: Docker + a provider API key or OAuth; runs local suites from Git repos and saves run bundles
  • Setup time: 20 minutes

Practical Notes

  • Setup time ~20 minutes (install + margin check + one dry-run)
  • Two measurable checks: margin --version works, and a run bundle is produced under your output folder
  • GitHub stars + forks (verified): see Source & Thanks
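The two measurable checks above can be wrapped in a small pre-flight script. This is a sketch: `preflight` is an illustrative helper name, the output-folder path is an assumption, and only `margin --version` comes from this page's notes.

```shell
set -euo pipefail

preflight() {
  local cli="$1" out_dir="$2"
  # Check 1: the CLI is installed and responds to --version.
  command -v "$cli" >/dev/null || { echo "$cli not found on PATH"; return 1; }
  "$cli" --version
  # Check 2: at least one run bundle exists under the output folder.
  [ -n "$(ls -A "$out_dir" 2>/dev/null)" ] || { echo "no run bundle in $out_dir"; return 1; }
  echo "preflight ok"
}

# After a dry-run, something like: preflight margin ./margin-runs
```

If both checks pass, the install is in a known-good state before you start comparing agents.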

Margin Eval is strongest when you standardize “what counts as success” for tool-using agent runs:

  • Use a shared suite repo for scenarios and fixtures.
  • Keep agent configs in version control (so changes are reviewed).
  • Compare agents side-by-side using the same suites and eval configs.

If you run multiple providers, treat auth as part of the harness: keep keys out of logs, and make sure dry-run is part of every developer’s setup.
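One way to keep keys out of logs is to pass them only through the environment and print a redacted fingerprint. This is a minimal sketch; `PROVIDER_API_KEY` is a hypothetical variable name, not something Margin Eval requires.

```shell
set -euo pipefail

# Demo fallback value; in real use, export PROVIDER_API_KEY from your
# shell profile or CI secret store, and never commit it.
PROVIDER_API_KEY="${PROVIDER_API_KEY:-sk-demo-1234}"

# Log a short redacted fingerprint instead of the secret itself.
redacted="${PROVIDER_API_KEY:0:4}****"
echo "using provider key ${redacted}"
```

The same pattern works per provider: one environment variable each, with only the fingerprint ever reaching a log line.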

FAQ

Q: Why evaluate locally instead of only in CI? A: Local evals shorten iteration loops. You can reproduce a failure immediately before pushing.

Q: Do I need Docker? A: The README lists Docker as a prerequisite for the quickstart.

Q: What should I store long-term? A: Store the run bundle/traces and a small summary so regressions can be audited later.
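A minimal sketch of that long-term storage, assuming a `runs/` directory; the bundle layout and summary fields are illustrative, not Margin Eval's actual format.

```shell
set -euo pipefail

# Hypothetical run bundle directory; real bundles land under your output folder.
run_dir="runs/example-run"
mkdir -p "$run_dir"

# A small machine-readable summary stored next to the full traces.
echo '{"suite":"demo","pass_rate":0.9}' > "$run_dir/summary.json"

# Archive the whole bundle so regressions can be audited later.
tar -czf "${run_dir}.tar.gz" "$run_dir"
echo "archived ${run_dir}.tar.gz"
```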


Source & Thanks

Source: https://github.com/Margin-Lab/evals · License: AGPL-3.0 · GitHub stars: 59 · forks: 1

