# SWE-bench — Benchmark for Coding Agents

> Evaluate coding agents on real GitHub issues with SWE-bench, including a harness to run and score patch predictions. Compare models and tool stacks.

## Quick Use

1. Install:

```bash
pip install -e .
```

2. Run:

```bash
python -m swebench.harness.run_evaluation --help
```

3. Verify:
   - Run one small evaluation (e.g., SWE-bench Lite) and confirm you get an `evaluation_results` output folder. A fuller invocation is sketched below.
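The flags and file layout below are assumptions based on the harness CLI at the time of writing; confirm them with `--help` for your version, and note that the instance ID and agent name are placeholders. A minimal sketch of the predictions format the harness scores (a JSON list with `instance_id`, `model_name_or_path`, and `model_patch` holding a unified diff) and a run against SWE-bench Lite:

```bash
# Minimal sketch: write one prediction and score it against SWE-bench Lite.
# Flag names reflect the harness at the time of writing -- verify with:
#   python -m swebench.harness.run_evaluation --help

# predictions.json: one entry per task; model_patch holds the agent's diff.
# The instance ID and agent name below are illustrative placeholders.
cat > predictions.json <<'EOF'
[
  {
    "instance_id": "sympy__sympy-20590",
    "model_name_or_path": "my-agent-v1",
    "model_patch": "diff --git a/sympy/core/_print_helpers.py b/sympy/core/_print_helpers.py\n..."
  }
]
EOF

# Build/run the task images in Docker and score the predictions
# (8 workers to match the README's recommended 8 CPU cores).
python -m swebench.harness.run_evaluation \
    --dataset_name princeton-nlp/SWE-bench_Lite \
    --predictions_path predictions.json \
    --max_workers 8 \
    --run_id smoke-test
```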
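For the regression workflow described under Practical Notes and the FAQ below, a frozen subset can be pinned by instance ID. A minimal sketch, assuming the harness accepts `--instance_ids` (verify the flag with `--help`; the IDs are illustrative):

```bash
# Sketch: re-run a frozen task subset on key changes.
# Assumes --instance_ids is supported; the IDs below are illustrative.
REGRESSION_IDS="sympy__sympy-20590 django__django-11039"

# Tag the run with the current commit so results stay auditable.
python -m swebench.harness.run_evaluation \
    --dataset_name princeton-nlp/SWE-bench_Lite \
    --predictions_path predictions.json \
    --instance_ids $REGRESSION_IDS \
    --run_id "regression-$(git rev-parse --short HEAD)"
```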
---

## Intro

Evaluate coding agents on real GitHub issues with SWE-bench, including a harness to run and score patch predictions. Compare models and tool stacks.

- **Best for:** Teams benchmarking AI coding agents with reproducible datasets and harness-driven scoring
- **Works with:** Python, Docker-based evaluation runs, dataset inputs + predictions JSON, optional Modal execution
- **Setup time:** 30 minutes

### Quantitative Notes

- Setup time ~30 minutes (install + Docker + first harness run)
- GitHub stars + forks (verified): see Source & Thanks
- Resource note from README: ~120GB free disk, 8 CPU cores recommended for full runs

---

## Practical Notes

Use SWE-bench as your north-star eval: define a baseline agent (model + tools), run SWE-bench Lite for fast iteration, and only run the larger suites once results on Lite are stable. Record versions (model, agent code, tool prompts) so improvements are auditable and repeatable.

**Safety note:** Harden your evaluation environment: isolate Docker, pin dependencies, and never run untrusted code outside a sandbox.

### FAQ

**Q: Is it only a dataset?**
A: No. SWE-bench includes a dataset plus a harness to run and score predictions reproducibly.

**Q: Can I use it for regression tests?**
A: Yes. Freeze a subset of tasks and run the harness periodically or on key changes (see the pinned-run sketch after Quick Use).

**Q: Why does it need so much disk?**
A: Evaluations build and run many repository images in Docker; images and logs add up quickly.

---

## Source & Thanks

> GitHub: https://github.com/SWE-bench/SWE-bench
> Owner avatar: https://avatars.githubusercontent.com/u/139597579?v=4
> License (SPDX): MIT
> GitHub stars (verified via `api.github.com/repos/SWE-bench/SWE-bench`): 4,900
> GitHub forks (verified via `api.github.com/repos/SWE-bench/SWE-bench`): 856

---

Source: https://tokrepo.com/en/workflows/swe-bench-benchmark-for-coding-agents
Author: Agent Toolkit