Knowledge · May 11, 2026 · 2 min read

SWE-bench — Benchmark for Coding Agents

Evaluate coding agents on real GitHub issues with SWE-bench, including a harness to run and score patch predictions. Compare models and tool stacks.

Agent-ready

This asset can be read and installed directly by agents.

TokRepo exposes a universal CLI command, an install contract, JSON metadata, an adapter-specific plan, and the raw content to help agents judge fit, risk, and next actions.

Native · 96/100 · Policy: allow
Agent surface: Any MCP/CLI agent
Type: Knowledge
Installation: Single
Trust: Established
Entry point: README.md
Universal CLI command:
npx tokrepo install 7fd5858d-76a8-4679-80d1-ee1191ad2977
Introduction


  • Best for: Teams benchmarking AI coding agents with reproducible datasets and harness-driven scoring
  • Works with: Python, Docker-based evaluation runs, dataset inputs + predictions JSON, optional Modal execution
  • Setup time: 30 minutes
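The harness scores a predictions JSON file with one entry per task instance. A minimal sketch of producing one (the instance ID and patch below are placeholders, not real SWE-bench entries):

```python
import json

# Each prediction pairs a task instance with the patch the agent produced.
# Field names follow the predictions format the SWE-bench harness consumes.
predictions = [
    {
        "instance_id": "example__repo-1234",  # placeholder, not a real task ID
        "model_name_or_path": "my-agent-v1",  # free-form label for this run
        "model_patch": "diff --git a/foo.py b/foo.py\n...",  # agent-generated diff
    }
]

# Write the file the harness will read and score.
with open("predictions.json", "w") as f:
    json.dump(predictions, f, indent=2)
```

Point the harness at this file (plus the dataset) to build the task containers, apply each patch, and run the tests.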

Quantitative Notes

  • Setup time ~30 minutes (install + Docker + first harness run)
  • GitHub stars + forks (verified): see Source & Thanks
  • Resource note from README: ~120GB free disk, 8 CPU cores recommended for full runs

Practical Notes

Use SWE-bench as your north-star eval: define a baseline agent (model + tools), run SWE-bench Lite for fast iteration, and only run larger suites when you’re confident. Record versions (model, agent code, tool prompts) so improvements are auditable and repeatable.
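One lightweight way to make runs auditable is to snapshot the versions alongside each predictions file. A sketch under assumed names (the function, file name, and fields are illustrative, not part of SWE-bench):

```python
import datetime
import json
import subprocess

def record_run_metadata(path="run_metadata.json",
                        model="my-model",
                        agent_version="0.1.0"):
    """Snapshot what is needed to reproduce an eval run: model label,
    agent version, agent git commit, and a UTC timestamp."""
    try:
        commit = subprocess.check_output(
            ["git", "rev-parse", "HEAD"], text=True).strip()
    except Exception:
        commit = "unknown"  # not inside a git checkout
    meta = {
        "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "model": model,
        "agent_version": agent_version,
        "agent_commit": commit,
    }
    with open(path, "w") as f:
        json.dump(meta, f, indent=2)
    return meta
```

Call this once before each harness run; diffing two metadata files then tells you exactly what changed between scores.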

Safety note: harden your evaluation environment by isolating Docker containers, pinning dependencies, and never running untrusted code outside a sandbox.

FAQ

Q: Is it only a dataset? A: No. SWE-bench includes a dataset plus a harness to run and score predictions reproducibly.

Q: Can I use it for regression tests? A: Yes. Freeze a subset of tasks and run the harness periodically or on key changes.

Q: Why does it need so much disk? A: Evaluations often build/run many repos in Docker; logs and images add up quickly.


Source & Thanks

GitHub: https://github.com/SWE-bench/SWE-bench
License (SPDX): MIT
GitHub stars (verified via api.github.com/repos/SWE-bench/SWE-bench): 4,900
GitHub forks (verified via api.github.com/repos/SWE-bench/SWE-bench): 856
