Practical Notes
Use SWE-bench as your north-star eval: define a baseline agent (model + tools), iterate quickly on SWE-bench Lite, and run the larger suites only once you're confident. Record versions (model, agent code, tool prompts) so improvements are auditable and repeatable.
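As a concrete starting point, here is a minimal sketch of an auditable run: write a version manifest, then invoke the official harness on SWE-bench Lite. The model tag, commit hash, prompt version, and run ID are hypothetical placeholders, and harness flag names can shift between releases, so check `python -m swebench.harness.run_evaluation --help` for your version.

```python
import json
import subprocess
from datetime import datetime, timezone

RUN_ID = "baseline-agent-001"  # hypothetical run identifier

# Pin everything that could explain a score change.
manifest = {
    "run_id": RUN_ID,
    "model": "my-model-v1",        # hypothetical model tag
    "agent_commit": "abc1234",     # hypothetical agent code revision
    "prompt_version": "tools-v2",  # hypothetical tool-prompt version
    "timestamp": datetime.now(timezone.utc).isoformat(),
}
with open(f"manifest-{RUN_ID}.json", "w") as f:
    json.dump(manifest, f, indent=2)

# Evaluate on SWE-bench Lite for fast iteration.
subprocess.run(
    [
        "python", "-m", "swebench.harness.run_evaluation",
        "--dataset_name", "princeton-nlp/SWE-bench_Lite",
        "--predictions_path", "predictions.jsonl",
        "--run_id", RUN_ID,
        "--max_workers", "4",
    ],
    check=True,
)
```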
Safety note: harden your evaluation environment. Isolate Docker containers from the network, pin dependencies, and never run untrusted model-generated code outside a sandbox.
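A hedged sketch of what "isolated" can mean in practice, using stock Docker CLI flags; the image name and test entrypoint are hypothetical, and limits should be tuned to your hardware.

```python
import subprocess

subprocess.run(
    [
        "docker", "run", "--rm",
        "--network", "none",    # no network access from inside the sandbox
        "--cap-drop", "ALL",    # drop all Linux capabilities
        "--pids-limit", "512",  # cap process count (fork-bomb guard)
        "--memory", "4g",       # memory ceiling
        "--cpus", "2",          # CPU ceiling
        "swebench-task:latest",           # hypothetical per-task image
        "bash", "-lc", "./run_tests.sh",  # hypothetical test entrypoint
    ],
    check=True,
)
```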
FAQ
Q: Is it only a dataset? A: No. SWE-bench includes a dataset plus a harness to run and score predictions reproducibly.
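To poke at the dataset half on its own, the tasks are published on Hugging Face. A minimal sketch, assuming the `datasets` library and the published `princeton-nlp/SWE-bench_Lite` dataset ID (field and split names may vary by release):

```python
from datasets import load_dataset

# Load the Lite test split: one row per task instance.
lite = load_dataset("princeton-nlp/SWE-bench_Lite", split="test")
print(len(lite), "task instances")
print(lite[0]["instance_id"])              # unique repo/issue identifier
print(lite[0]["problem_statement"][:200])  # issue text the agent sees
```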
Q: Can I use it for regression tests? A: Yes. Freeze a subset of tasks and run the harness periodically or on key changes.
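One way to do the freezing, as a sketch: keep a checked-in list of instance IDs and filter your predictions file down to it before handing it to the harness. The file names here are hypothetical.

```python
import json

# Frozen regression subset: one instance_id per line, checked into the repo.
with open("frozen_ids.txt") as f:
    frozen = {line.strip() for line in f if line.strip()}

# Keep only predictions for frozen tasks; the harness consumes the
# filtered JSONL file exactly like the full one.
with open("predictions.jsonl") as src, \
        open("predictions.regression.jsonl", "w") as dst:
    for line in src:
        pred = json.loads(line)
        if pred["instance_id"] in frozen:
            dst.write(line)
```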
Q: Why does it need so much disk? A: The harness builds and runs Docker images for many repositories, and images, containers, and logs add up quickly.
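When disk pressure bites, stock Docker commands reclaim most of it. A small sketch; be careful with prune commands on shared machines.

```python
import subprocess

# Show what Docker is holding (images, containers, build cache).
subprocess.run(["docker", "system", "df"], check=True)

# Remove dangling images left over from per-task builds.
subprocess.run(["docker", "image", "prune", "-f"], check=True)
```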