# SWE-bench — Benchmark for Coding Agents

> Evaluate coding agents on real GitHub issues with SWE-bench, including a harness to run and score patch predictions. Compare models and tool stacks.

## Quick Use

1. Install:

```bash
pip install -e .
```

2. Run:

```bash
python -m swebench.harness.run_evaluation --help
```

3. Verify:
   - Run one small evaluation (e.g., SWE-bench Lite) and confirm you get an `evaluation_results` output folder. A fuller invocation is sketched below.
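The flags and file layout below are assumptions based on the harness CLI at the time of writing; confirm them with `--help` for your version, and note that the instance ID and agent name are placeholders. A minimal sketch of the predictions format the harness scores (a JSON list with `instance_id`, `model_name_or_path`, and `model_patch` holding a unified diff) and a run against SWE-bench Lite:

```bash
# Minimal sketch: write one prediction and score it against SWE-bench Lite.
# Flag names reflect the harness at the time of writing -- verify with:
#   python -m swebench.harness.run_evaluation --help

# predictions.json: one entry per task; model_patch holds the agent's diff.
# The instance ID and agent name below are illustrative placeholders.
cat > predictions.json <<'EOF'
[
  {
    "instance_id": "sympy__sympy-20590",
    "model_name_or_path": "my-agent-v1",
    "model_patch": "diff --git a/sympy/core/_print_helpers.py b/sympy/core/_print_helpers.py\n..."
  }
]
EOF

# Build/run the task images in Docker and score the predictions
# (8 workers to match the README's recommended 8 CPU cores).
python -m swebench.harness.run_evaluation \
    --dataset_name princeton-nlp/SWE-bench_Lite \
    --predictions_path predictions.json \
    --max_workers 8 \
    --run_id smoke-test
```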
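For the regression workflow described under Practical Notes and the FAQ below, a frozen subset can be pinned by instance ID. A minimal sketch, assuming the harness accepts `--instance_ids` (verify the flag with `--help`; the IDs are illustrative):

```bash
# Sketch: re-run a frozen task subset on key changes.
# Assumes --instance_ids is supported; the IDs below are illustrative.
REGRESSION_IDS="sympy__sympy-20590 django__django-11039"

# Tag the run with the current commit so results stay auditable.
python -m swebench.harness.run_evaluation \
    --dataset_name princeton-nlp/SWE-bench_Lite \
    --predictions_path predictions.json \
    --instance_ids $REGRESSION_IDS \
    --run_id "regression-$(git rev-parse --short HEAD)"
```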
---

## Intro

Evaluate coding agents on real GitHub issues with SWE-bench, including a harness to run and score patch predictions. Compare models and tool stacks.

- **Best for:** Teams benchmarking AI coding agents with reproducible datasets and harness-driven scoring
- **Works with:** Python, Docker-based evaluation runs, dataset inputs + predictions JSON, optional Modal execution
- **Setup time:** 30 minutes

### Quantitative Notes

- Setup time ~30 minutes (install + Docker + first harness run)
- GitHub stars + forks (verified): see Source & Thanks
- Resource note from README: ~120GB free disk, 8 CPU cores recommended for full runs

---

## Practical Notes

Use SWE-bench as your north-star eval: define a baseline agent (model + tools), run SWE-bench Lite for fast iteration, and only run the larger suites once results on Lite are stable. Record versions (model, agent code, tool prompts) so improvements are auditable and repeatable.

**Safety note:** Harden your evaluation environment: isolate Docker, pin dependencies, and never run untrusted code outside a sandbox.

### FAQ

**Q: Is it only a dataset?**
A: No. SWE-bench includes a dataset plus a harness to run and score predictions reproducibly.

**Q: Can I use it for regression tests?**
A: Yes. Freeze a subset of tasks and run the harness periodically or on key changes (see the pinned-run sketch after Quick Use).

**Q: Why does it need so much disk?**
A: Evaluations build and run many repository images in Docker; images and logs add up quickly.

---

## Source & Thanks

> GitHub: https://github.com/SWE-bench/SWE-bench
> Owner avatar: https://avatars.githubusercontent.com/u/139597579?v=4
> License (SPDX): MIT
> GitHub stars (verified via `api.github.com/repos/SWE-bench/SWE-bench`): 4,900
> GitHub forks (verified via `api.github.com/repos/SWE-bench/SWE-bench`): 856

---

Source: https://tokrepo.com/en/workflows/swe-bench-benchmark-for-coding-agents
Author: Agent Toolkit