### Main

- Use it as a sourcing index: jump from the list to primary papers/repos, then build your own benchmark set.
- Extract evaluation dimensions: turn repeated criteria into a checklist for your harness (context, tools, memory, safety).
- Keep a local notes file: for each referenced harness, record setup time, supported tools, and failure modes (see the sketch after this list).
- Prefer primary citations: when copying claims into docs, link to the original repo/paper, not a secondary summary.
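A minimal sketch of such a notes record, assuming a hand-maintained JSON file; field names like `setup_minutes` and `failure_modes` are illustrative choices, not from the survey:

```python
import json
from dataclasses import dataclass, field, asdict

@dataclass
class HarnessNote:
    """One record per referenced harness; fields mirror the checklist above."""
    name: str
    repo_url: str
    setup_minutes: int                      # install-to-first-run time
    supported_tools: list[str] = field(default_factory=list)
    eval_dimensions: dict[str, str] = field(default_factory=dict)  # context/tools/memory/safety notes
    failure_modes: list[str] = field(default_factory=list)

def save_notes(notes: list[HarnessNote], path: str = "harness_notes.json") -> None:
    """Persist the notes file so entries stay diffable under version control."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump([asdict(n) for n in notes], f, indent=2)
```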
### README (excerpt)
⭐ This repo is actively maintained. If you find it useful, please star the repo to stay updated and help others find it.
The agent execution harness — not the model — is the primary determinant of agent reliability at scale.
This survey formalizes the harness as a first-class architectural object H = (E, T, C, S, L, V), surveys 110+ papers, blogs and reports across 23 systems, and maps 9 open technical challenges.
📄 Read the Paper
🌐 Preprints Version (v3)
✉️ Corrections & suggestions: gloriamenng@gmail.com (Qianyu Meng); wangyanan@mail.dlut.edu.cn (Yanan Wang); chenliyi@xiaohongshu.com (Liyi Chen)
If you find this survey useful, please cite:
```bibtex
@article{meng2026agentharness,
  title  = {Agent Harness for Large Language Model Agents: A Survey},
  author = {Meng, Qianyu and Wang, Yanan and Chen, Liyi and Wu, Wei and
            Li, Yihang and Jiang, Wenyuan and Wang, Qimeng and
            Lu, Chengqiang and Gao, Yan and Wu, Yi and Hu, Yao},
  year   = {2026},
  doi    = {10.20944/preprints202604.0428.v3},
  url    = {https://www.preprints.org/manuscript/202604.0428/v3},
}
```
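For orientation, the tuple H = (E, T, C, S, L, V) from the excerpt can be mirrored as a typed container. A minimal sketch: the component expansions below are an assumed reading, not the paper's definitions, so check the survey for the authoritative ones:

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass(frozen=True)
class Harness:
    """H = (E, T, C, S, L, V) as a typed container.

    ASSUMPTION: the expansions below are one illustrative reading of the
    tuple, not the paper's definitions; consult the survey for those.
    """
    environment: Any                       # E: where the agent acts
    tools: dict[str, Callable[..., Any]]   # T: callable tool registry
    context: dict[str, Any]                # C: prompt/context assembly state
    state: dict[str, Any]                  # S: persistent memory/state
    logger: Callable[[str], None]          # L: trace/telemetry sink
    verifier: Callable[[Any], bool]        # V: output/safety checks
```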
### Source-backed notes
- The repo is CC-BY-4.0 licensed (verified via GitHub API).
- GitHub API verification confirms the repo URL and recent push date.
- README functions as a curated survey/reading map (content is primarily links and structure).
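The verification above can be reproduced with one unauthenticated GitHub REST call; a sketch, where `OWNER/REPO` is a placeholder since these notes don't pin the exact slug:

```python
import json
from urllib.request import Request, urlopen

# OWNER/REPO is a placeholder -- substitute the survey repo's actual slug.
url = "https://api.github.com/repos/OWNER/REPO"
req = Request(url, headers={"Accept": "application/vnd.github+json"})

with urlopen(req) as resp:
    repo = json.load(resp)

# license.spdx_id should read "CC-BY-4.0"; pushed_at shows the last push.
print(repo["license"]["spdx_id"], repo["pushed_at"], repo["html_url"])
```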
### FAQ
- **Is it an implementation?** No; it's primarily a survey/awesome list to help you find harness tools and papers.
- **Can I reuse content?** Yes; the license is CC-BY-4.0, so attribute appropriately when reusing text.
- **How do I turn it into action?** Pick 3–5 harnesses, run the same questions/tasks, and record results as your baseline benchmark (see the sketch below).
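A minimal sketch of that baseline loop; the per-harness adapters and the CSV schema are illustrative, and you would wrap each harness's real entry point behind the `Callable[[str], str]` signature:

```python
import csv
import time
from typing import Callable

# Hypothetical per-harness adapters: each takes a task prompt, returns the output.
HARNESSES: dict[str, Callable[[str], str]] = {
    # "harness_a": run_with_harness_a,
    # "harness_b": run_with_harness_b,
}

TASKS = [
    "Summarize the README of repo X.",
    "List the tools this harness can call.",
]

def run_baseline(out_path: str = "baseline.csv") -> None:
    """Run every task through every harness and record latency plus output."""
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["harness", "task", "seconds", "output"])
        for name, run_task in HARNESSES.items():
            for task in TASKS:
                start = time.perf_counter()
                output = run_task(task)
                writer.writerow([name, task, time.perf_counter() - start, output])
```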