[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"pack-detail-agent-eval-benchmark-fr":3,"seo:pack:agent-eval-benchmark:fr":98},{"code":4,"message":5,"data":6},200,"Operation successful",{"pack":7},{"slug":8,"icon":9,"tone":10,"status":11,"status_label":12,"title":13,"description":14,"items":15,"install_cmd":97},"agent-eval-benchmark","📏","#0F766E","new","New · this week","Agent Evaluation + Benchmark Stack","Ten picks for the ML\u002FLLM engineer who evaluates agent quality: offline test sets (DeepEval, Promptfoo, Ragas), benchmark runners (LM Eval Harness, SWE-bench), agent evals in CI, LLM-as-judge on traces (Phoenix Evals, Langfuse), safety checks (Giskard), and a CI gate. Stop shipping agents on gut feeling.",[16,28,37,44,51,59,66,74,81,88],{"id":17,"uuid":18,"slug":19,"title":20,"description":21,"author_name":22,"view_count":23,"vote_count":24,"lang_type":25,"type":26,"type_label":27},292,"a4d57f88-3711-4032-8ad5-f2040ae03178","deepeval-llm-testing-framework-30-metrics-a4d57f88","DeepEval — LLM Testing Framework with 30+ Metrics","DeepEval is a pytest-like testing framework for LLM apps with 30+ metrics. 14.4K+ GitHub stars. RAG, agent, multimodal evaluation. Runs locally. MIT.","Script Depot",249,0,"en","skill","Skill",{"id":29,"uuid":30,"slug":31,"title":32,"description":33,"author_name":22,"view_count":34,"vote_count":24,"lang_type":25,"type":35,"type_label":36},293,"42c43368-a482-4fad-b23d-d80e0530377b","promptfoo-test-red-team-llm-apps-42c43368","Promptfoo — Test & Red-Team LLM Apps","Promptfoo is a CLI for evaluating prompts, comparing models, and red-teaming AI apps. 18.9K+ GitHub stars. Side-by-side comparison, vulnerability scanning, CI\u002FCD.
MIT.",116,"prompt","Prompt",{"id":38,"uuid":39,"slug":40,"title":41,"description":42,"author_name":22,"view_count":43,"vote_count":24,"lang_type":25,"type":26,"type_label":27},291,"2c856b4d-64e5-46b2-9bbd-a7ce9f7a7296","ragas-evaluate-rag-llm-applications-2c856b4d","Ragas — Evaluate RAG & LLM Applications","Ragas evaluates LLM applications with objective metrics, test data generation, and data-driven insights. 13.2K+ GitHub stars. RAG evaluation, auto test generation. Apache 2.0.",148,{"id":45,"uuid":46,"slug":47,"title":48,"description":49,"author_name":22,"view_count":50,"vote_count":24,"lang_type":25,"type":26,"type_label":27},2502,"0df4d2b1-45bd-11f1-9bc6-00163e2b0d79","lm-evaluation-harness-unified-llm-benchmarking-framework-0df4d2b1","LM Evaluation Harness — Unified LLM Benchmarking Framework","EleutherAI's framework for reproducible evaluation of language models across hundreds of benchmarks, providing the standard evaluation backend used by the Open LLM Leaderboard and research papers.",132,{"id":52,"uuid":53,"slug":54,"title":55,"description":56,"author_name":57,"view_count":58,"vote_count":24,"lang_type":25,"type":26,"type_label":27},3111,"7fd5858d-76a8-4679-80d1-ee1191ad2977","swe-bench-benchmark-for-coding-agents","SWE-bench — Benchmark for Coding Agents","Evaluate coding agents on real GitHub issues with SWE-bench, including a harness to run and score patch predictions. 
Compare models and tool stacks.","Agent Toolkit",129,{"id":60,"uuid":61,"slug":62,"title":63,"description":64,"author_name":57,"view_count":65,"vote_count":24,"lang_type":25,"type":26,"type_label":27},3153,"73cd67c3-9db6-48ed-8a31-c082f618168e","agent-evaluation-test-virtual-agents-in-ci","Agent Evaluation — Test Virtual Agents in CI","Agent Evaluation is a Python framework that runs repeatable, scored tests for virtual agents, so teams can catch regressions automatically in CI.",85,{"id":67,"uuid":68,"slug":69,"title":70,"description":71,"author_name":72,"view_count":73,"vote_count":24,"lang_type":25,"type":26,"type_label":27},2842,"91b1b2a3-8be3-42c3-9366-c71fe29ed30d","phoenix-evals-llm-as-judge-library-with-built-in-templates","Phoenix Evals — LLM-as-Judge Library with Built-in Templates","Phoenix Evals runs LLM-as-judge on traces or datasets. Pre-built templates: hallucination, relevance, toxicity, QA. Outputs scored DataFrames.","Arize AI",84,{"id":75,"uuid":76,"slug":77,"title":78,"description":79,"author_name":57,"view_count":80,"vote_count":24,"lang_type":25,"type":26,"type_label":27},3314,"4bc8615f-82d2-5ecf-8842-720c8188357d","langfuse-python-sdk-trace-llm-apps","Langfuse Python SDK — Trace LLM Apps","Langfuse Python SDK adds tracing and observability to any LLM app via decorators or low-level calls, so you can track latency, cost, and prompts.",76,{"id":82,"uuid":83,"slug":84,"title":85,"description":86,"author_name":57,"view_count":87,"vote_count":24,"lang_type":25,"type":26,"type_label":27},3276,"08f46b1e-5a82-59dc-916a-cb2f0ae17a63","giskard-checks-evals-and-safety-tests-for-llm-agents","Giskard Checks — Evals and Safety Tests for LLM Agents","Giskard Checks gives Python teams a modular eval layer for agent regressions, groundedness, and policy conformance with scenario-based 
tests.",83,{"id":89,"uuid":90,"slug":91,"title":92,"description":93,"author_name":22,"view_count":94,"vote_count":24,"lang_type":25,"type":95,"type_label":96},3100,"1eecb87d-ec62-4982-828d-18dd9a031695","promptfoo-action-run-prompt-evals-in-github-ci","promptfoo-action — Run Prompt Evals in GitHub CI","Add promptfoo-action to GitHub Actions to run prompt\u002Fagent evals on PRs or pushes, cache results, and comment a before\u002Fafter report for safer iteration.",80,"script","Script","tokrepo install pack\u002Fagent-eval-benchmark",{"pageType":99,"pageKey":8,"locale":25,"title":100,"metaDescription":101,"h1":102,"tldr":103,"bodyMarkdown":104,"faq":105,"schema":121,"internalLinks":127,"citations":140,"wordCount":153,"generatedAt":154},"pack","Agent Evaluation + Benchmark Stack — 10 Picks to Measure Agent Quality","DeepEval, Promptfoo, Ragas, LM Evaluation Harness, SWE-bench, Agent Evaluation, Phoenix Evals, Langfuse, Giskard Checks, promptfoo-action — the stack an ML\u002FLLM engineer uses to evaluate agent quality. From offline test sets through to the CI regression gate, in install order.","Agent Evaluation + Benchmark Stack — Measure Agent Quality Before You Ship","Ten picks ordered by the actual eval pipeline an ML\u002FLLM engineer builds for an agent: write a test set, run it offline, score against a published benchmark, instrument traces, judge them with an LLM, add safety checks, and gate every PR in CI. The lesson, learned the hard way: agents that look great in a demo regress silently, and only a measured loop catches it.","## What's in this pack\n\nThis is the stack you build when the agent demo wowed the room, a few users started filing weird complaints, and you realized you have no idea whether last week's prompt edit helped or quietly regressed tool-call accuracy by 12%. Every pick here exists for one reason: **turn agent quality from a vibe into a number you can watch in CI**.\n\nThis pack is **deliberately agent-specific**.
The sister pack `ml-engineer-rag-eval` covers retrieval\u002FRAG infrastructure (chunking, embedding servers, vector stores). This pack assumes that infra exists and asks the next question: *how do you measure whether the agent on top of it is getting better or worse?* The answers — test sets, benchmarks, trace eval, regression gates — are different tools with different shapes.\n\nFour layers run through the picks:\n\n- **Offline test sets** — DeepEval, Promptfoo, Ragas. Hand-curated cases with expected outputs or rubrics. The eval set is your ground truth.\n- **Benchmarks** — LM Evaluation Harness, SWE-bench. Published, comparable, useful for go\u002Fno-go decisions on model swaps.\n- **Trace-based eval** — Phoenix Evals, Langfuse. Score real production traces with an LLM judge or rule engine, sample what users actually hit.\n- **Regression in CI** — Agent Evaluation, Giskard Checks, promptfoo-action. Block PRs that drop scores or trip a safety rule.\n\n## Install in this order (test set → offline runner → benchmark → trace eval → regression CI)\n\n1. **DeepEval** — start here. Pytest-style assertions over LLM outputs with 30+ built-in metrics (faithfulness, hallucination, answer relevancy, contextual recall, G-Eval custom rubrics, tool-correctness, task-completion). Write your first 30 agent cases in a single file, run `deepeval test run`, watch them fail, and you suddenly have a baseline that didn't exist five minutes ago.\n2. **Promptfoo** — declarative YAML test suites that diff outputs side-by-side across prompts, models, and provider configs. Better than DeepEval when you want to A\u002FB two prompts visually, run a red-team sweep with adversarial inputs, or share a results matrix with non-engineers. Use both: DeepEval for the assertion library, Promptfoo for the comparison and red-team surface.\n3. **Ragas** — if the agent retrieves anything (docs, memory, tool results), retrieval quality is a hidden variable. 
Ragas computes faithfulness, answer relevancy, context precision, and context recall over retrieval-augmented outputs. Run it on the retrieval steps inside the agent loop, not just on a standalone RAG pipeline.\n4. **LM Evaluation Harness** — the standard offline benchmark runner. 60+ academic benchmarks (MMLU, GSM8K, HellaSwag, BBH, HumanEval) under one CLI. Use it for model-swap go\u002Fno-go decisions: if you're considering moving from Sonnet to Haiku, harness numbers are the cheapest first filter before you spend on bespoke agent evals.\n5. **SWE-bench** — the most-cited benchmark for coding agents specifically. Real GitHub issues, real test suites, pass\u002Ffail on whether the agent's patch makes them green. Run SWE-bench Lite for a quick signal (300 instances, hours on a single machine) and the full set when you change harness or model. Agent benchmark numbers from a vendor's blog post are marketing; SWE-bench numbers from your own runner are evidence.\n6. **Agent Evaluation — Test Virtual Agents in CI** — purpose-built for testing virtual agents in CI. Defines test plans, runs them against your agent, scores tool calls and final answers, and reports back in a format CI can gate on. The piece between \"I have a test set\" and \"my PR is blocked when scores drop.\"\n7. **Phoenix Evals — LLM-as-Judge** — once production traffic is real, hand-labeled test sets stop scaling. Phoenix Evals is the LLM-as-judge library with templates for hallucination, toxicity, relevance, summarization quality, and code generation correctness. Sample 100 production traces a night, score them, alert when a metric drifts.\n8. **Langfuse Python SDK** — instrument the agent app so traces are captured and scored. Langfuse decorators wrap LLM calls, tool calls, and full agent runs; scores from DeepEval, Phoenix Evals, or a custom judge attach back to the trace. Now your offline test set scores and your online sampled scores live in the same dashboard.\n9. 
**Giskard Checks — Evals and Safety Tests for LLM Agents** — safety, robustness, and red-team checks. Catches prompt injection, off-topic prompts, brand-violating outputs, and consistency failures across paraphrases. Different from accuracy eval: you're checking what the agent *won't* do, not what it does well. Run it as part of the CI suite, not after launch.\n10. **promptfoo-action** — the regression gate. GitHub Action that runs your Promptfoo eval set on every PR, blocks merge if a metric drops below threshold, posts a diff comment with which cases regressed. This is the line that turns \"we run evals sometimes\" into \"every PR is measured.\" Without a CI gate the eval loop quietly rots in three months.\n\n## How they fit together (agent eval pipeline)\n\n```\n┌─────────────────────────────────────────────────────────────┐\n│  OFFLINE — TEST SET                                         │\n│   DeepEval (assertions + metrics)                           │\n│   Promptfoo (YAML, side-by-side, red-team)                  │\n│   Ragas (retrieval inside the agent loop)                   │\n└─────────────────────────────────────────────────────────────┘\n                          │\n                          ▼\n┌─────────────────────────────────────────────────────────────┐\n│  OFFLINE — BENCHMARK                                        │\n│   LM Evaluation Harness (MMLU, BBH, HumanEval ...)          
│\n│   SWE-bench (coding agents, real GitHub issues)             │\n└─────────────────────────────────────────────────────────────┘\n                          │\n                          ▼\n┌─────────────────────────────────────────────────────────────┐\n│  ONLINE — TRACE + JUDGE                                     │\n│   Langfuse SDK  ──►  captures every agent run               │\n│        │                                                    │\n│        ▼                                                    │\n│   Phoenix Evals (LLM-as-judge over sampled traces)          │\n│        │                                                    │\n│        ▼                                                    │\n│   metric drift alerts                                       │\n└─────────────────────────────────────────────────────────────┘\n                          │\n                          ▼\n┌─────────────────────────────────────────────────────────────┐\n│  REGRESSION GATE — CI                                       │\n│   Agent Evaluation (virtual-agent CI tests)                 │\n│   Giskard Checks (safety + red-team)                        │\n│   promptfoo-action (block PR on score drop)                 │\n└─────────────────────────────────────────────────────────────┘\n```\n\nThe split matters. Offline test sets are slow, expensive, and live on your laptop; benchmarks are comparable across teams and let you make model-swap decisions; trace eval samples what users actually hit; the CI gate is the only thing that keeps the rest from rotting.\n\n## Tradeoffs you'll hit\n\n- **DeepEval vs Promptfoo vs Ragas** — overlapping eval libraries, different shapes. DeepEval is Python-native and reads like Pytest, best when your team already lives in Python. Promptfoo is YAML-first and renders a side-by-side comparison UI, best for A\u002FB prompt sweeps and red-team runs. Ragas is retrieval-specific and adds metrics the other two don't compute well. 
Most production agent teams end up using DeepEval and Promptfoo together — assertion library plus comparison surface — and reach for Ragas when the agent does retrieval.\n- **Public benchmark vs custom eval set** — public benchmarks (LM Eval Harness, SWE-bench) are comparable across teams and cheap to run; custom eval sets reflect your actual users and your actual failure modes. Both are needed. Public benchmark is the first filter on a model swap; custom set is the only thing that catches the specific bug your agent has on your specific tool schema.\n- **LLM-as-judge vs rule-based eval** — LLM-as-judge (Phoenix Evals) scales to nuanced metrics like hallucination and helpfulness but inherits the judge model's biases and costs tokens per case. Rule-based eval (regex, JSON-schema, exact-match) is free and deterministic but only catches the obvious failures. Use rule-based for structural correctness (\"did the tool call validate?\") and LLM-as-judge for semantic quality (\"was the answer faithful to retrieved context?\").\n- **Langfuse vs Phoenix vs Arize** — overlapping observability stack. Langfuse is the OSS leader for prompt + trace management with strong self-hosting. Phoenix is the OSS evaluation framework with built-in LLM-as-judge templates. Arize is the commercial parent of Phoenix with enterprise features. Most teams self-host Langfuse for tracing and run Phoenix Evals as the judge library against those traces.\n- **SWE-bench Lite vs full SWE-bench** — Lite (300 instances) runs in hours and is enough for a weekly signal. Full (2,294 instances) costs real compute. Run Lite on every model swap; run full only when a Lite delta looks promising or before a major release.\n\n## Common pitfalls\n\n- **No test set, only logs** — the most common failure mode. \"We'll just look at production logs\" works until traffic grows, then no one looks. Hand-curate 30–50 cases in week one. Update them every time a user complaint reveals a new failure mode. 
By month three you have 200–500 cases and a real signal.\n- **Eval set has only success cases** — half the eval set must be hard, ambiguous, or adversarial. If your agent passes 100% of cases, the cases aren't useful. Include known prompt-injection attempts, off-topic queries, contradicting tool results, and brittle paraphrases of common queries.\n- **Evaluating only the final answer** — agents fail at tool selection, argument shape, retry behaviour, and multi-turn state, not just the final string. Score tool-correctness and trajectory adherence, not just answer-correctness. Both DeepEval and Agent Evaluation expose tool-level metrics.\n- **Judge model is the same as the agent model** — using GPT-4 to judge GPT-4 inherits its biases and misses certain failure classes. Use a different family of model as the judge when budget allows. Even rotating between two judges and looking at disagreement is informative.\n- **Drift in the eval set itself** — the test set is a software artifact. Version it. Review it quarterly. Retire cases the agent has trivially mastered. Add cases that reflect this quarter's failure modes. An eval set that doesn't change in six months is an eval set that's now lying to you.\n- **No CI gate** — the loop that doesn't block a merge is the loop that rots. Wire up promptfoo-action or an Agent Evaluation CI step that fails the build when a metric drops. The threshold doesn't have to be tight on day one; the existence of the gate is what matters.\n- **Benchmark numbers from a vendor blog** — agent benchmark scores from the vendor that shipped the model are marketing copy until you reproduce them with your own harness. Run SWE-bench Lite yourself before you believe any agent benchmark claim.",[106,109,112,115,118],{"q":107,"a":108},"How is this pack different from the existing `ml-engineer-rag-eval` pack on TokRepo?","ml-engineer-rag-eval covers the RAG infrastructure layer: chunking, embedding servers, vector stores, retrieval frameworks, rerankers. 
It assumes you're building retrieval and need a stack. This pack assumes the agent infra already exists and asks the next question: how do you measure whether the agent is getting better or worse? Different audience, near-zero overlapping workflow IDs (Phoenix Evals 2842 is the eval-library, not the full Phoenix observability tool in the RAG pack). Pair them: build the RAG layer with the first pack, measure the agent on top of it with this one.",{"q":110,"a":111},"Why both DeepEval and Promptfoo? Aren't they redundant?","They overlap but serve different shapes. DeepEval is Python-native and reads like Pytest — you write `assert_test(case, metrics=[FaithfulnessMetric()])` and it slots into your existing test suite. Promptfoo is YAML-first and renders a comparison matrix UI — best for A\u002FB sweeping two prompts across five models or running a red-team adversarial set. Most production teams end up using both: DeepEval as the assertion library, Promptfoo as the side-by-side comparison surface and the CI entry point. If you can only pick one, start with DeepEval if you live in Python, Promptfoo if your team does more YAML and config than code.",{"q":113,"a":114},"Do I really need SWE-bench if I have a custom eval set?","If your agent isn't a coding agent, skip SWE-bench. If it is, SWE-bench is the only widely-cited public benchmark with real GitHub issues and real test suites, and the score is *comparable* across vendors. Run SWE-bench Lite (300 instances, runnable in hours) before every model swap as a cheap first filter. Your custom eval set catches your specific bugs; SWE-bench tells you whether the model class is even in the right ballpark for coding work. Both are needed.",{"q":116,"a":117},"Should I judge with the same model the agent uses?","Avoid it when possible. Same-family LLM-as-judge inherits the agent model's biases — it tends to rate its own outputs higher and miss the same failure modes the agent has. 
Use a different model family for judging (e.g., Anthropic for OpenAI agents or vice versa). When budget is tight, rotate between two judges and treat disagreement as a signal to add the case to a human-review queue. Phoenix Evals lets you swap the judge model with one config line.",{"q":119,"a":120},"What's the smallest viable eval set to start with?","Thirty hand-curated cases with expected outputs (or rubrics for open-ended tasks) is enough to be useful in week one. Include 10 happy-path cases, 10 edge cases the agent has historically struggled with, and 10 adversarial cases (off-topic, prompt-injection, brittle paraphrases). Wire it into DeepEval or Promptfoo, run it on every prompt change, and wire promptfoo-action to gate PRs. By month three you'll be at 200–500 cases and your real signal will come from the trace eval (Phoenix Evals + Langfuse) sampling production traffic — but the hand-curated set never goes away.",{"@context":122,"@type":123,"name":124,"description":125,"numberOfItems":126,"inLanguage":25},"https:\u002F\u002Fschema.org","ItemList","Agent Evaluation + Benchmark Stack","Ten picks ordered by the actual eval pipeline an ML\u002FLLM engineer builds for an agent: offline test sets (DeepEval, Promptfoo, Ragas), benchmark runners (LM Evaluation Harness, SWE-bench), agent-specific CI evals, trace-based LLM-as-judge (Phoenix Evals, Langfuse), safety checks (Giskard), and a CI regression gate.",10,[128,132,136],{"url":129,"anchor":130,"reason":131},"\u002Fen\u002Ftopics\u002Fml-engineer-rag-eval","ML Engineer's RAG + Eval Stack","Sister pack: the RAG infrastructure layer (chunking, embeddings, vector stores) that sits under the agent — pair this pack on top of it",{"url":133,"anchor":134,"reason":135},"\u002Fen\u002Ftopics\u002Fllm-observability","LLM Observability pack","Broader observability coverage — Langfuse, AgentOps, LangSmith — when the trace layer in this pack needs to scale to multi-team 
production",{"url":137,"anchor":138,"reason":139},"\u002Fen\u002Ftopics\u002Fpr-review-automation","PR Review Automation pack","Companion CI pack: pairs naturally with the promptfoo-action regression gate to make every PR a measurable change",[141,145,149],{"claim":142,"source_name":143,"source_url":144},"DeepEval ships 30+ built-in metrics including faithfulness, answer relevancy, hallucination, G-Eval custom rubrics, and tool-correctness for agent testing","DeepEval documentation","https:\u002F\u002Fgithub.com\u002Fconfident-ai\u002Fdeepeval",{"claim":146,"source_name":147,"source_url":148},"SWE-bench evaluates coding agents on real GitHub issues with their actual test suites, with SWE-bench Lite providing a 300-instance subset for faster iteration","SWE-bench project","https:\u002F\u002Fwww.swebench.com\u002F",{"claim":150,"source_name":151,"source_url":152},"Promptfoo provides side-by-side comparison across prompts, models, and providers, plus a GitHub Action (promptfoo-action) for running evals in CI on every PR","promptfoo documentation","https:\u002F\u002Fwww.promptfoo.dev\u002F",1450,"2026-05-22T00:00:00Z"]