Knowledge · May 11, 2026 · 2 min read

SWE-bench — Benchmark for Coding Agents

Evaluate coding agents on real GitHub issues with SWE-bench, including a harness to run and score patch predictions. Compare models and tool stacks.

Agent Toolkit · Community

Agent ready

Ready-to-run agent install

To install this asset, the agent chooses its runtime, checks the install plan, and runs the matching command.

Native · 96/100 · Policy: allow

Agent surface: Any MCP/CLI agent
Kind: Knowledge
Install: Single
Trust: Established
Entrypoint: Asset

Direct install command

npx -y tokrepo@latest install 7fd5858d-76a8-4679-80d1-ee1191ad2977 --target codex

Run this command only after a dry run confirms the install plan.

Intro

Evaluate coding agents on real GitHub issues with SWE-bench, including a harness to run and score patch predictions. Compare models and tool stacks.

Best for: Teams benchmarking AI coding agents with reproducible datasets and harness-driven scoring
Works with: Python, Docker-based evaluation runs, dataset inputs + predictions JSON, optional Modal execution
Setup time: 30 minutes
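
The "predictions JSON" input mentioned above can be sketched as follows. This is a minimal illustration, assuming the three-field record shape (`instance_id`, `model_name_or_path`, `model_patch`) described in the SWE-bench README; verify the field names against the harness version you install. The helper names, the instance ID, and the model name are hypothetical placeholders.

```python
import json
from pathlib import Path


def make_prediction(instance_id: str, model_name: str, patch: str) -> dict:
    """Build one prediction record.

    Field names are assumed from the SWE-bench README; confirm them
    against the harness version you actually run.
    """
    return {
        "instance_id": instance_id,
        "model_name_or_path": model_name,
        "model_patch": patch,
    }


def write_predictions(records: list[dict], path: Path) -> None:
    """Write records as JSON Lines, one prediction per line."""
    with path.open("w", encoding="utf-8") as fh:
        for rec in records:
            fh.write(json.dumps(rec) + "\n")
```

A caller would collect one record per task (for example `make_prediction("example__repo-1", "my-agent-v0", diff_text)`, with a placeholder ID) and pass the resulting file path to the harness.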

Quantitative Notes

Setup time ~30 minutes (install + Docker + first harness run)
GitHub stars + forks (verified): see Source & Thanks
Resource note from README: ~120GB free disk, 8 CPU cores recommended for full runs
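
Before a full run, the resource note above can be turned into a quick preflight check. The sketch below is not part of SWE-bench; the function name and thresholds are illustrative, and it assumes the path you pass sits on the same filesystem Docker uses for images, which may not hold on your machine.

```python
import os
import shutil

# Thresholds taken from the resource note above; adjust for smaller subsets.
MIN_FREE_GB = 120
MIN_CPU_CORES = 8


def preflight(path: str = ".",
              min_free_gb: float = MIN_FREE_GB,
              min_cores: int = MIN_CPU_CORES) -> list[str]:
    """Return human-readable warnings; an empty list means both checks passed."""
    warnings = []

    # Free space on the filesystem that contains `path`, in decimal gigabytes.
    free_gb = shutil.disk_usage(path).free / 1e9
    if free_gb < min_free_gb:
        warnings.append(f"only {free_gb:.0f} GB free, want {min_free_gb} GB")

    # os.cpu_count() can return None on some platforms; treat that as 0.
    cores = os.cpu_count() or 0
    if cores < min_cores:
        warnings.append(f"only {cores} CPU cores, want {min_cores}")

    return warnings
```

Calling `preflight("/var/lib/docker")` (or wherever your Docker data lives) and printing the returned warnings is usually enough to catch an undersized machine before images start building.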

Practical Notes

Use SWE-bench as your north-star eval: define a baseline agent (model + tools), iterate quickly on SWE-bench Lite, and run the larger suites only once you are confident in the setup. Record versions (model, agent code, tool prompts) so improvements are auditable and repeatable.
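
One way to record versions is a small run manifest saved next to each set of results. The following is a sketch under stated assumptions: the `RunManifest` fields and helper names are hypothetical and not defined by SWE-bench, and hashing the tool prompt is just one convenient way to detect prompt changes.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class RunManifest:
    """Hypothetical record of what produced a given set of scores."""
    model: str               # model identifier as you name it internally
    agent_commit: str        # git commit of the agent code
    tool_prompt_sha256: str  # digest of the tool/system prompt text
    dataset: str             # which suite or subset was evaluated


def prompt_digest(prompt_text: str) -> str:
    """Hash the prompt so any edit shows up as a different digest."""
    return hashlib.sha256(prompt_text.encode("utf-8")).hexdigest()


def manifest_json(manifest: RunManifest) -> str:
    """Serialize with sorted keys so manifests diff cleanly between runs."""
    return json.dumps(asdict(manifest), sort_keys=True, indent=2)
```

Writing `manifest_json(...)` to a file alongside the harness output makes it possible to tell later which change (model, agent code, or prompt) moved a score.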

Safety note: Harden your evaluation environment by isolating Docker, pinning dependencies, and never running untrusted code outside a sandbox.

FAQ

Q: Is it only a dataset? A: No. SWE-bench includes a dataset plus a harness to run and score predictions reproducibly.

Q: Can I use it for regression tests? A: Yes. Freeze a subset of tasks and run the harness periodically or on key changes.

Q: Why does it need so much disk? A: Evaluations often build/run many repos in Docker; logs and images add up quickly.
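
For the regression-test use case above, the frozen subset should be chosen deterministically so every run scores the same tasks. A minimal sketch, assuming you already have the list of task instance IDs; the function name and salt value are hypothetical, and the salted-hash ordering is an illustrative choice rather than anything SWE-bench prescribes.

```python
import hashlib
from typing import Iterable


def frozen_subset(instance_ids: Iterable[str], k: int,
                  salt: str = "regression-v1") -> list[str]:
    """Pick k IDs deterministically by ranking them on a salted hash.

    The result depends only on the set of IDs, k, and the salt, not on
    input order, so it stays stable across machines and runs.
    """
    def rank(instance_id: str) -> str:
        return hashlib.sha256(f"{salt}:{instance_id}".encode("utf-8")).hexdigest()

    chosen = sorted(set(instance_ids), key=rank)[:k]
    return sorted(chosen)  # alphabetical order for readable diffs
```

Commit the returned list to version control and change the salt only when you deliberately want a new subset; otherwise scores from different dates are not comparable.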


Source & Thanks

GitHub: https://github.com/SWE-bench/SWE-bench
Owner avatar: https://avatars.githubusercontent.com/u/139597579?v=4
License (SPDX): MIT
GitHub stars (verified via api.github.com/repos/SWE-bench/SWE-bench): 4,900
GitHub forks (verified via api.github.com/repos/SWE-bench/SWE-bench): 856


