Main
Treat it as a security lab: run experiments in an isolated environment and record the exact dependency set used.
Use it to build test cases: trojan knowledge scenarios can become unit/regression tests for your retrieval + tool pipeline.
Map the attack surface: separate poisoning in static docs vs retrieval corpora vs tool outputs so mitigations are targeted.
Export results as artifacts: logs, prompts, and configs are as important as code when reproducing agent-security claims.
README (excerpt)
[ICML 2026] CKA-Agent: Bypassing LLM Guardrails via Harmless Prompt Weaving and Adaptive Tree Search
🛡️ Defense towards CKA
TurnGate is a response-aware defense mechanism designed to detect and mitigate hidden malicious intent in multi-turn dialogue systems. Defending state-of-the-art multi-turn malicious attacks like CKA-Agent, achieving great defense performance while avoiding overrefusal.
🔥 Latest Results on Frontier Models (Dec 2025)
CKA-Agent demonstrates consistent high attack success rates against the latest frontier models, including GPT-5.2, Gemini-3.0-Pro, and Claude-Haiku-4.5. The results are summarized below:
Source-backed notes
- README shows
uv pip installcommands for installing experiment dependencies. - Repo is AGPL-3.0 licensed (verified via GitHub API).
- The repository positions itself as a reproducible implementation for agent-security research (per README wording).
FAQ
- Is this meant for production use?: It’s primarily research code; use it to evaluate and harden your own systems.
- How do I install dependencies?: Follow the README
uv pip install ...instructions and keep versions pinned for reproducibility. - What license applies?: AGPL-3.0 (verified via GitHub license metadata).
| Model | HarmBench | StrongREJECT | ||||||
|---|---|---|---|---|---|---|---|---|
| FS ↑ | PS ↑ | V ↓ | R ↓ | FS ↑ | PS ↑ | V ↓ | R ↓ | |
| 🟢 GPT-5.2 | 0.889 | 0.079 | 0.024 | 0.008 | 0.932 | 0.056 | 0.006 | 0.006 |
| 🟣 Gemini-3.0-Pro | 0.881 | 0.087 | ||||||