Knowledge · May 11, 2026 · 2 min read

SWE-bench — Benchmark for Coding Agents

Evaluate coding agents on real GitHub issues with SWE-bench, including a harness to run and score patch predictions. Compare models and tool stacks.

Agent-ready

This asset can be read and installed directly by agents.

TokRepo exposes a universal CLI command, an install contract, JSON metadata, an adapter-specific plan, and raw content so agents can assess compatibility, risk, and next steps.

Native · 96/100 · Policy: allow
Agent surface: Any MCP/CLI agent
Type: Knowledge
Install: Single
Trust: Established
Entry point: README.md
Universal CLI command: npx tokrepo install 7fd5858d-76a8-4679-80d1-ee1191ad2977
Introduction

Evaluate coding agents on real GitHub issues with SWE-bench, including a harness to run and score patch predictions. Compare models and tool stacks.

  • Best for: Teams benchmarking AI coding agents with reproducible datasets and harness-driven scoring
  • Works with: Python, Docker-based evaluation runs, dataset inputs + predictions JSON, optional Modal execution (see the sketch after this list)
  • Setup time: 30 minutes
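
To make the "Works with" items concrete, here is a minimal sketch of the two inputs a scoring run needs: a dataset (identified by name) and a predictions JSON file produced by your agent. The field names follow the prediction format described in the SWE-bench README; the harness invocation in the comment is indicative and its flags may differ between releases.

```python
import json

# Minimal predictions file: one entry per SWE-bench task the agent attempted.
# Field names follow the prediction format described in the SWE-bench README.
predictions = [
    {
        "instance_id": "astropy__astropy-12907",   # task identifier from the dataset
        "model_name_or_path": "my-agent-v1",       # label used in result reports
        "model_patch": "diff --git a/...",         # unified diff produced by the agent
    },
]

with open("predictions.json", "w") as f:
    json.dump(predictions, f, indent=2)

# Predictions are then scored with the Dockerized harness, roughly:
#   python -m swebench.harness.run_evaluation \
#       --dataset_name princeton-nlp/SWE-bench_Lite \
#       --predictions_path predictions.json \
#       --max_workers 8 --run_id baseline-run
```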

Quantitative Notes

  • Setup time ~30 minutes (install + Docker + first harness run)
  • GitHub stars + forks (verified): see Source & Thanks
  • Resource note from README: ~120GB free disk, 8 CPU cores recommended for full runs

Practical Notes

Use SWE-bench as your north-star eval: define a baseline agent (model + tools), run SWE-bench Lite for fast iteration, and only run larger suites when you’re confident. Record versions (model, agent code, tool prompts) so improvements are auditable and repeatable.
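
A minimal sketch of that workflow, assuming the dataset is loaded from the Hugging Face Hub (it is published as princeton-nlp/SWE-bench_Lite) and that a small run manifest is kept next to each predictions file; the manifest fields are illustrative, not part of SWE-bench itself:

```python
import json
from datasets import load_dataset  # pip install datasets

# SWE-bench Lite on the Hugging Face Hub; the evaluation split is "test".
lite = load_dataset("princeton-nlp/SWE-bench_Lite", split="test")
print(f"{len(lite)} tasks, e.g. {lite[0]['instance_id']}")

# Illustrative run manifest so later score changes are attributable to a
# specific model / agent / prompt combination (field names are hypothetical).
manifest = {
    "model": "my-model-2026-05",
    "agent_commit": "abc1234",
    "tool_prompts_version": "v3",
    "dataset": "princeton-nlp/SWE-bench_Lite",
}
with open("run_manifest.json", "w") as f:
    json.dump(manifest, f, indent=2)
```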

Safety note: harden your evaluation environment by isolating Docker, pinning dependencies, and never running untrusted code outside a sandbox.
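
As a generic illustration of the kind of restrictions meant here (not the harness's own container configuration), an untrusted command can be confined to a container with no network, capped resources, and dropped capabilities:

```python
import subprocess

# Generic locked-down container run: no network, capped memory/CPU, no extra
# Linux capabilities. Image and command are placeholders.
subprocess.run(
    [
        "docker", "run", "--rm",
        "--network", "none",   # no outbound network for untrusted code
        "--memory", "4g",      # cap memory
        "--cpus", "2",         # cap CPU
        "--cap-drop", "ALL",   # drop Linux capabilities
        "python:3.11-slim",
        "python", "-c", "print('sandboxed run')",
    ],
    check=True,
)
```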

FAQ

Q: Is it only a dataset? A: No. SWE-bench includes a dataset plus a harness to run and score predictions reproducibly.

Q: Can I use it for regression tests? A: Yes. Freeze a subset of tasks and run the harness periodically or on key changes.
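
A sketch of that idea: pin a frozen set of instance_ids under version control and score only those predictions on key changes (the IDs below are examples; substitute your own subset):

```python
import json

# Frozen regression subset: pinned once, kept under version control.
FROZEN_IDS = {
    "astropy__astropy-12907",
    "django__django-11099",
}

with open("predictions.json") as f:
    predictions = json.load(f)

subset = [p for p in predictions if p["instance_id"] in FROZEN_IDS]

with open("predictions_regression.json", "w") as f:
    json.dump(subset, f, indent=2)

# Score predictions_regression.json with the harness on each key change.
```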

Q: Why does it need so much disk? A: Evaluations often build/run many repos in Docker; logs and images add up quickly.
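
If disk usage keeps climbing between runs, pruning leftover containers and dangling images recovers most of the space; a minimal sketch using standard Docker maintenance commands (run with care, as this removes stopped containers and unreferenced images):

```python
import subprocess

# Reclaim space left over from evaluation runs.
subprocess.run(["docker", "container", "prune", "-f"], check=True)
subprocess.run(["docker", "image", "prune", "-f"], check=True)
```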


🙏

Source & Thanks

  • GitHub: https://github.com/SWE-bench/SWE-bench
  • Owner avatar: https://avatars.githubusercontent.com/u/139597579?v=4
  • License (SPDX): MIT
  • GitHub stars (verified via api.github.com/repos/SWE-bench/SWE-bench): 4,900
  • GitHub forks (verified via api.github.com/repos/SWE-bench/SWE-bench): 856
