Knowledge · May 11, 2026 · 2 min read

SWE-bench — Benchmark for Coding Agents

Evaluate coding agents on real GitHub issues with SWE-bench, including a harness to run and score patch predictions. Compare models and tool stacks.

Agent ready

This asset can be read and installed directly by agents

TokRepo exposes a universal CLI command, install contract, metadata JSON, adapter-aware plan, and raw content links so agents can judge fit, risk, and next actions.

Native · 96/100 · Policy: allow

Agent surface: Any MCP/CLI agent
Kind: Knowledge
Install: Single
Trust: Established
Entrypoint: README.md

Universal CLI install command:
npx tokrepo install 7fd5858d-76a8-4679-80d1-ee1191ad2977

Intro

  • Best for: Teams benchmarking AI coding agents with reproducible datasets and harness-driven scoring
  • Works with: Python, Docker-based evaluation runs, dataset inputs + predictions JSON, optional Modal execution
  • Setup time: 30 minutes
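
The "predictions JSON" input mentioned above can be sketched as follows. This is a minimal illustration, not the harness's own code: the three field names (`instance_id`, `model_name_or_path`, `model_patch`) follow our reading of the SWE-bench README and should be verified against the version you install. The helper names, the instance ID, and the model name are hypothetical.

```python
import json
from pathlib import Path


def write_predictions(predictions, path, model_name):
    """Write one JSON object per line (JSONL), one line per task instance.

    `predictions` maps an instance ID to the patch (unified diff text)
    that the agent produced for that task.
    """
    path = Path(path)
    with path.open("w", encoding="utf-8") as fh:
        for instance_id, patch in predictions.items():
            record = {
                # Assumed field names; check the README of your installed version.
                "instance_id": instance_id,
                "model_name_or_path": model_name,
                "model_patch": patch,
            }
            fh.write(json.dumps(record) + "\n")
    return path


def read_predictions(path):
    """Read the JSONL file back into a list of dicts, skipping blank lines."""
    with Path(path).open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]
```

Keeping the writer and reader together makes it easy to sanity-check a predictions file before spending time on a Docker-based evaluation run.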

Quantitative Notes

  • Setup time ~30 minutes (install + Docker + first harness run)
  • GitHub stars + forks (verified): see Source & Thanks
  • Resource note from README: ~120GB free disk, 8 CPU cores recommended for full runs
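
Before a full run, the resource note above can be turned into a quick preflight check. The sketch below uses only the Python standard library; the thresholds mirror the figures quoted from the README, and the function name `preflight` and its override parameters are hypothetical.

```python
import os
import shutil

MIN_FREE_GB = 120    # free disk quoted in the resource note
MIN_CPU_CORES = 8    # recommended core count from the same note


def preflight(path=".", free_bytes=None, cpu_count=None):
    """Return a list of warnings; an empty list means the host looks adequate.

    `free_bytes` and `cpu_count` can be passed explicitly (useful for tests);
    otherwise they are read from the current machine.
    """
    if free_bytes is None:
        free_bytes = shutil.disk_usage(path).free
    if cpu_count is None:
        cpu_count = os.cpu_count() or 1

    warnings = []
    free_gb = free_bytes / 1e9
    if free_gb < MIN_FREE_GB:
        warnings.append(f"only {free_gb:.0f} GB free, {MIN_FREE_GB} GB recommended")
    if cpu_count < MIN_CPU_CORES:
        warnings.append(f"only {cpu_count} CPU cores, {MIN_CPU_CORES} recommended")
    return warnings
```

Point `path` at the filesystem Docker stores its data on, since that is where images and build layers accumulate.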

Practical Notes

Treat SWE-bench as your primary evaluation target. Define a baseline agent (model + tools), iterate on SWE-bench Lite for fast feedback, and run the larger suites only once results on Lite are stable. Record the version of everything that affects the outcome (model, agent code, tool prompts) so improvements are auditable and repeatable.
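
One way to record versions is a small run manifest stored next to each set of results. The sketch below is an assumption about how a team might do this, not part of SWE-bench: `RunManifest`, `prompt_digest`, and `manifest_id` are hypothetical names, and the fields simply mirror the items listed above.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class RunManifest:
    model: str                # model identifier used by the agent
    agent_commit: str         # commit hash of the agent code
    tool_prompt_sha256: str   # digest of the tool prompts, see prompt_digest()
    dataset: str              # which suite was evaluated, e.g. the Lite subset


def prompt_digest(prompt_text):
    """Hash the prompt text so any edit to it shows up as a new digest."""
    return hashlib.sha256(prompt_text.encode("utf-8")).hexdigest()


def manifest_id(manifest):
    """Derive a short, stable ID from the manifest contents.

    Keys are sorted before hashing so the ID depends only on the values.
    """
    payload = json.dumps(asdict(manifest), sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:12]
```

Two runs with the same manifest ID were produced by the same configuration, so a score change between them points at nondeterminism rather than a code or prompt change.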

Safety note: harden your evaluation environment. Isolate Docker, pin dependencies, and never run untrusted code outside a sandbox.

FAQ

Q: Is it only a dataset? A: No. SWE-bench includes a dataset plus a harness to run and score predictions reproducibly.

Q: Can I use it for regression tests? A: Yes. Freeze a subset of tasks and run the harness periodically or on key changes.

Q: Why does it need so much disk? A: Evaluations often build/run many repos in Docker; logs and images add up quickly.
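
For the regression-test use case in the FAQ, a frozen subset should be chosen once and stay the same across runs. The sketch below picks tasks by hashing each instance ID with a fixed salt, so the selection does not depend on input order or on a random seed. The function name, the salt, and the example IDs are hypothetical.

```python
import hashlib


def frozen_subset(instance_ids, size, salt="regression-v1"):
    """Return a deterministic subset of `size` instance IDs.

    Each ID is ranked by the SHA-256 of "salt:id"; the lowest-ranked IDs are
    kept. Changing the salt yields a different, equally stable subset.
    """
    def rank(instance_id):
        return hashlib.sha256(f"{salt}:{instance_id}".encode("utf-8")).hexdigest()

    chosen = sorted(set(instance_ids), key=rank)[:size]
    return sorted(chosen)  # sorted output keeps diffs of the frozen list readable


# Placeholder IDs for illustration; real runs would use IDs from the dataset.
example_ids = [f"example__repo-{i}" for i in range(10)]
subset = frozen_subset(example_ids, 3)
```

Commit the resulting list alongside your agent code so that every regression run scores exactly the same tasks.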


Source & Thanks

GitHub: https://github.com/SWE-bench/SWE-bench
Owner avatar: https://avatars.githubusercontent.com/u/139597579?v=4
License (SPDX): MIT
GitHub stars (verified via api.github.com/repos/SWE-bench/SWE-bench): 4,900
GitHub forks (verified via api.github.com/repos/SWE-bench/SWE-bench): 856
