What is SWE-bench — Benchmark for Coding Agents?

Evaluate coding agents on real GitHub issues with SWE-bench, including a harness to run and score patch predictions. Compare models and tool stacks.

Is SWE-bench — Benchmark for Coding Agents free to use?

Yes. SWE-bench — Benchmark for Coding Agents is freely available on TokRepo. Check the Source & Thanks section on the asset page for the specific open-source license.

How do I install SWE-bench — Benchmark for Coding Agents?

Visit the asset page on TokRepo and click "Copy for agent" to get the installation instructions. Most assets can be installed with a single command.

SWE-bench — Benchmark for Coding Agents

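Introduction

Use SWE-bench to evaluate coding agents on real GitHub issues. It provides a reproducible harness that runs patch predictions and scores them, which makes it well suited to comparing end-to-end performance and regressions across models, prompts, and tool stacks.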
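To make "patch predictions" concrete, here is a minimal sketch of writing a predictions file as JSON Lines, with one record per task holding `instance_id`, `model_name_or_path`, and `model_patch`. The helper name `write_predictions`, the instance ID, and the diff are hypothetical placeholders; check the SWE-bench README for the exact format your harness version expects.

```python
import json
from pathlib import Path


def write_predictions(predictions, path, model_name):
    """Write one JSON object per line, keyed the way the harness reads them."""
    path = Path(path)
    with path.open("w", encoding="utf-8") as f:
        for instance_id, patch in predictions.items():
            record = {
                "instance_id": instance_id,        # task identifier from the dataset
                "model_name_or_path": model_name,  # label for the agent being scored
                "model_patch": patch,              # unified diff produced by the agent
            }
            f.write(json.dumps(record) + "\n")
    return path


# Hypothetical example: the instance ID and the diff are placeholders.
example = {
    "example__repo-1234": (
        "diff --git a/pkg/mod.py b/pkg/mod.py\n"
        "--- a/pkg/mod.py\n"
        "+++ b/pkg/mod.py\n"
        "@@ -1 +1 @@\n"
        "-x = 1\n"
        "+x = 2\n"
    ),
}
out = write_predictions(example, "predictions.jsonl", "my-baseline-agent")
print(f"wrote {out}")
```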

Best for: teams that want to evaluate AI coding agents with a reproducible dataset and harness-based scoring
Works with: Python, Docker-based evaluation workflows, dataset + predictions files, optional Modal execution
Setup time: 30 minutes
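As a rough illustration of how the Python, dataset, and predictions-file pieces fit together, the sketch below assembles a harness invocation. It assumes the `swebench` package is installed and that the `run_evaluation` module accepts the flags shown, as documented in the project's README at the time of writing. The helper `build_eval_command` and the run ID are hypothetical, so confirm the flags against the version you install.

```python
import shlex
import sys


def build_eval_command(predictions_path, run_id,
                       dataset_name="princeton-nlp/SWE-bench_Lite",
                       max_workers=4):
    """Return the argument list for one harness run (nothing is executed here)."""
    return [
        sys.executable, "-m", "swebench.harness.run_evaluation",
        "--dataset_name", dataset_name,
        "--predictions_path", str(predictions_path),
        "--max_workers", str(max_workers),
        "--run_id", run_id,
    ]


cmd = build_eval_command("predictions.jsonl", run_id="baseline-v1")
print(shlex.join(cmd))

# To actually run it (requires swebench installed and Docker running):
#   import subprocess
#   subprocess.run(cmd, check=True)
```

Building the command as a list keeps quoting safe and makes it easy to log exactly what was run.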

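Key numbers

About 30 minutes to a first successful run (install + Docker + first harness run)
GitHub stars + forks (verified): see the Source & Thanks section
README resource recommendation: about 120 GB of free disk space and 8 CPU cores (for a full evaluation)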
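A quick preflight check against the resource figures above might look like the following sketch. The thresholds mirror the 120 GB and 8-core recommendation; the function name `preflight` is hypothetical, and the check inspects the given path only, so point it at the disk that holds your Docker data if that differs from the working directory.

```python
import os
import shutil

MIN_FREE_GB = 120   # recommended free disk for a full evaluation
MIN_CPU_CORES = 8   # recommended CPU cores for a full evaluation


def preflight(path=".", min_free_gb=MIN_FREE_GB, min_cores=MIN_CPU_CORES):
    """Return a list of human-readable problems; an empty list means all clear."""
    free_gb = shutil.disk_usage(path).free / 1e9
    cores = os.cpu_count() or 0  # cpu_count() may return None
    problems = []
    if free_gb < min_free_gb:
        problems.append(
            f"only {free_gb:.0f} GB free at {path!r}, want {min_free_gb} GB"
        )
    if cores < min_cores:
        problems.append(f"only {cores} CPU cores, want {min_cores}")
    return problems


for problem in preflight():
    print("warning:", problem)
```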

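Practical tips

Treat SWE-bench as your north-star evaluation: define a baseline agent (model + tools), iterate quickly on SWE-bench Lite first, and run the larger suites only once you are confident in the results. Record version information (model, agent code, tool prompts) so that improvements are auditable and reproducible.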
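One way to record that version information is a small manifest saved next to each run. This is a sketch under assumptions: the field names, the helper `build_manifest`, and the example values are all hypothetical, and hashing the tool prompt is just one convenient way to detect prompt changes without storing the full text.

```python
import hashlib
import json
from datetime import datetime, timezone


def build_manifest(model, agent_commit, tool_prompt, dataset_name):
    """Capture what produced a run so a score can be traced back later."""
    prompt_hash = hashlib.sha256(tool_prompt.encode("utf-8")).hexdigest()
    return {
        "model": model,
        "agent_commit": agent_commit,
        "tool_prompt_sha256": prompt_hash,
        "dataset_name": dataset_name,
        "created_at": datetime.now(timezone.utc).isoformat(),
    }


manifest = build_manifest(
    model="example-model-v1",                     # hypothetical model label
    agent_commit="0123abc",                       # hypothetical git commit of the agent code
    tool_prompt="You are a coding agent...",      # hypothetical prompt text
    dataset_name="princeton-nlp/SWE-bench_Lite",
)
print(json.dumps(manifest, indent=2))
```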

Security note: Harden the evaluation environment. Isolate Docker, pin dependency versions, and avoid running untrusted code outside a sandbox.

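FAQ

Q: Is it just a dataset? A: No. SWE-bench includes both the dataset and a reproducible harness for running and scoring predictions.

Q: Can it be used for regression testing? A: Yes. Freeze a subset of tasks and run the harness on a schedule or after significant changes.

Q: Why does it need so much disk space? A: The evaluation builds and runs many repositories in Docker, so images and logs grow quickly.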
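The regression-testing answer above can be sketched as two small helpers: one that freezes a stable subset of task IDs, and one that compares which tasks two runs resolved. Both function names are hypothetical, and the resolved-ID lists are assumed to come from each run's harness report. Hash-based selection is a design choice made here so the subset stays the same regardless of input order.

```python
import hashlib


def freeze_subset(instance_ids, fraction=0.2):
    """Keep an ID when its hash falls below the cutoff, so selection is stable."""
    cutoff = int(fraction * 2**32)
    kept = []
    for iid in sorted(instance_ids):
        digest = hashlib.sha256(iid.encode("utf-8")).digest()
        if int.from_bytes(digest[:4], "big") < cutoff:
            kept.append(iid)
    return kept


def compare_runs(baseline_resolved, candidate_resolved):
    """Report tasks that stopped passing and tasks that started passing."""
    baseline, candidate = set(baseline_resolved), set(candidate_resolved)
    return {
        "regressed": sorted(baseline - candidate),
        "newly_resolved": sorted(candidate - baseline),
    }


# Hypothetical task IDs, for illustration only.
diff = compare_runs(["task-a", "task-b"], ["task-b", "task-c"])
print(diff)  # {'regressed': ['task-a'], 'newly_resolved': ['task-c']}
```

Because each ID is kept or dropped independently of the others, adding new tasks to the pool later does not change which existing tasks are in the frozen subset.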

Related assets

Self-Evolving Agents Survey — Lifelong Systems

MemoryGraph — Graph MCP Memory Server (pipx)

Weave — Trace and Debug LLM Apps

Awesome-Memory-for-Agents — Paper List + Taxonomy