Knowledge · May 11, 2026 · 2 min read

SWE-bench — Benchmark for Coding Agents

Evaluate coding agents on real GitHub issues with SWE-bench, including a harness to run and score patch predictions. Compare models and tool stacks.

Agent-ready

This asset can be read and installed directly by agents.

TokRepo exposes a universal CLI command, an install contract, JSON metadata, an adapter-specific plan, and the raw content to help agents judge fit, risk, and next actions.

Native · 96/100 · Policy: allow
Agent surface: Any MCP/CLI agent
Type: Knowledge
Installation: Single
Trust: Established
Entry point: README.md
Universal CLI command:
npx tokrepo install 7fd5858d-76a8-4679-80d1-ee1191ad2977
Introduction


  • Best for: Teams benchmarking AI coding agents with reproducible datasets and harness-driven scoring
  • Works with: Python, Docker-based evaluation runs, dataset inputs + predictions JSON, optional Modal execution
  • Setup time: 30 minutes
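The harness scores a predictions JSON file with one entry per task instance. A minimal sketch of producing one (the instance ID and patch below are placeholders, not real SWE-bench entries):

```python
import json

# Each prediction pairs a task instance with the patch the agent produced.
# Field names follow the predictions format the SWE-bench harness consumes.
predictions = [
    {
        "instance_id": "example__repo-1234",  # placeholder, not a real task ID
        "model_name_or_path": "my-agent-v1",  # free-form label for this run
        "model_patch": "diff --git a/foo.py b/foo.py\n...",  # agent-generated diff
    }
]

# Write the file the harness will read and score.
with open("predictions.json", "w") as f:
    json.dump(predictions, f, indent=2)
```

Point the harness at this file (plus the dataset) to build the task containers, apply each patch, and run the tests.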

Quantitative Notes

  • Setup time ~30 minutes (install + Docker + first harness run)
  • GitHub stars + forks (verified): see Source & Thanks
  • Resource note from README: ~120GB free disk, 8 CPU cores recommended for full runs

Practical Notes

Use SWE-bench as your north-star eval: define a baseline agent (model + tools), run SWE-bench Lite for fast iteration, and only run larger suites when you’re confident. Record versions (model, agent code, tool prompts) so improvements are auditable and repeatable.
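One lightweight way to make runs auditable is to snapshot the versions alongside each predictions file. A sketch under assumed names (the function, file name, and fields are illustrative, not part of SWE-bench):

```python
import datetime
import json
import subprocess

def record_run_metadata(path="run_metadata.json",
                        model="my-model",
                        agent_version="0.1.0"):
    """Snapshot what is needed to reproduce an eval run: model label,
    agent version, agent git commit, and a UTC timestamp."""
    try:
        commit = subprocess.check_output(
            ["git", "rev-parse", "HEAD"], text=True).strip()
    except Exception:
        commit = "unknown"  # not inside a git checkout
    meta = {
        "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "model": model,
        "agent_version": agent_version,
        "agent_commit": commit,
    }
    with open(path, "w") as f:
        json.dump(meta, f, indent=2)
    return meta
```

Call this once before each harness run; diffing two metadata files then tells you exactly what changed between scores.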

Safety note: harden your evaluation environment by isolating Docker containers, pinning dependencies, and never running untrusted code outside a sandbox.

FAQ

Q: Is it only a dataset? A: No. SWE-bench includes a dataset plus a harness to run and score predictions reproducibly.

Q: Can I use it for regression tests? A: Yes. Freeze a subset of tasks and run the harness periodically or on key changes.

Q: Why does it need so much disk? A: Evaluations often build/run many repos in Docker; logs and images add up quickly.


Source & Thanks

GitHub: https://github.com/SWE-bench/SWE-bench
License (SPDX): MIT
GitHub stars (verified via api.github.com/repos/SWE-bench/SWE-bench): 4,900
GitHub forks (verified via api.github.com/repos/SWE-bench/SWE-bench): 856
