CLI Tools · May 11, 2026 · 2 min read

Margin Eval — Local Evals for CLI Coding Agents

Margin Eval is an eval runtime that benchmarks CLI coding agents and records accuracy, token usage, runtime, and traces in a reproducible format.

Agent-ready

This asset can be read and installed directly by agents.

TokRepo exposes a universal CLI command, an install contract, JSON metadata, an adapter-specific plan, and the raw content to help agents judge fit, risk, and next actions.

Stage only · 29/100
Agent surface
Any MCP/CLI agent
Type
CLI Tool
Installation
Single
Trust
Established
Entry point
README.md
Universal CLI command
npx tokrepo install f4905383-abe8-46fb-8c5c-2cdcdb45b141
Introduction

  • Best for: Teams comparing CLI agents (Claude Code/Codex/Gemini CLI) with one unified harness and trace format
  • Works with: Docker + a provider API key or OAuth; runs local suites from Git repos and saves run bundles
  • Setup time: 20 minutes

Practical Notes

  • Setup time ~20 minutes (install + margin check + one dry-run)
  • Two measurable checks: margin --version works, and a run bundle is produced under your output folder
  • GitHub stars + forks (verified): see Source & Thanks

Margin Eval is strongest when you standardize “what counts as success” for tool-using agent runs:

  • Use a shared suite repo for scenarios and fixtures.
  • Keep agent configs in version control (so changes are reviewed).
  • Compare agents side-by-side using the same suites and eval configs.
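A side-by-side comparison can be as simple as rendering one row per agent from each run's summary. The metric names below (`accuracy`, `tokens`, `runtime_s`) are illustrative assumptions, not Margin Eval's actual schema — substitute whatever your run bundles record.

```python
def comparison_table(results: dict[str, dict]) -> str:
    """Render per-agent metrics side by side.

    Metric names (accuracy, tokens, runtime_s) are illustrative --
    use whatever fields your run bundles actually record.
    """
    cols = ["accuracy", "tokens", "runtime_s"]
    header = f"{'agent':<14}" + "".join(f"{c:>12}" for c in cols)
    rows = [
        f"{agent:<14}" + "".join(f"{metrics.get(c, '-'):>12}" for c in cols)
        for agent, metrics in sorted(results.items())
    ]
    return "\n".join([header] + rows)

print(comparison_table({
    "claude-code": {"accuracy": 0.82, "tokens": 51200, "runtime_s": 340},
    "codex":       {"accuracy": 0.79, "tokens": 47800, "runtime_s": 295},
}))
```

Because both agents ran the same suites and eval configs, the rows are directly comparable.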

If you run multiple providers, treat auth as part of the harness: keep keys out of logs, and make sure dry-run is part of every developer’s setup.
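Keeping keys out of logs can be enforced with a redaction pass before anything is written or stored. A minimal sketch — the key patterns below are illustrative, not exhaustive; extend them for the providers you actually use:

```python
import re

# Illustrative key shapes only -- extend for your providers.
_PREFIX_KEY = re.compile(r"sk-[A-Za-z0-9_-]{10,}")          # OpenAI-style secrets
_NAMED_KEY = re.compile(r"(?i)(api[_-]?key\s*[=:]\s*)\S+")  # KEY=value in env dumps

def redact(line: str) -> str:
    """Mask likely API keys before a log line is written or stored."""
    line = _PREFIX_KEY.sub("[REDACTED]", line)
    line = _NAMED_KEY.sub(r"\1[REDACTED]", line)
    return line
```

Wiring this into the harness's log sink means a leaked key in a trace is masked by default rather than caught in review.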

FAQ

Q: Why evaluate locally instead of only in CI?
A: Local evals shorten iteration loops: you can reproduce a failure immediately, before pushing.

Q: Do I need Docker?
A: The README lists Docker as a prerequisite for the quickstart.

Q: What should I store long-term?
A: Store the run bundle/traces and a small summary so regressions can be audited later.


Source & Thanks

Source: https://github.com/Margin-Lab/evals · License: AGPL-3.0 · GitHub stars: 59 · forks: 1
