[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"pack-detail-llm-eval-guardrails-en":3,"seo:pack:llm-eval-guardrails:en":61},{"code":4,"message":5,"data":6},200,"Operation successful",{"pack":7},{"slug":8,"icon":9,"tone":10,"status":11,"status_label":12,"title":13,"description":14,"items":15,"install_cmd":60},"llm-eval-guardrails","⚖️","#B45309","stable","Stable","LLM Eval & Guardrails","DeepEval, Promptfoo, Ragas, Opik, Guardrails AI — score every prompt change before it ships and catch regressions early.",[16,28,38,45,53],{"id":17,"uuid":18,"slug":19,"title":20,"description":21,"author_name":22,"view_count":23,"vote_count":24,"lang_type":25,"type":26,"type_label":27},292,"a4d57f88-3711-4032-8ad5-f2040ae03178","deepeval-llm-testing-framework-30-metrics-a4d57f88","DeepEval — LLM Testing Framework with 30+ Metrics","DeepEval is a pytest-like testing framework for LLM apps with 30+ metrics. 14.4K+ GitHub stars. RAG, agent, multimodal evaluation. Runs locally. MIT.","Script Depot",353,0,"en","skill","Skill",{"id":29,"uuid":30,"slug":31,"title":32,"description":33,"author_name":34,"view_count":35,"vote_count":24,"lang_type":25,"type":36,"type_label":37},618,"288cfb9f-58ef-4890-a0f7-f698ada3447e","promptfoo-llm-eval-red-team-testing-framework-288cfb9f","Promptfoo — LLM Eval & Red-Team Testing Framework","Open-source framework for evaluating and red-teaming LLM applications. Test prompts across models, detect jailbreaks, measure quality, and catch regressions. 5,000+ GitHub stars.","Agent Toolkit",240,"prompt","Prompt",{"id":39,"uuid":40,"slug":41,"title":42,"description":43,"author_name":22,"view_count":44,"vote_count":24,"lang_type":25,"type":26,"type_label":27},291,"2c856b4d-64e5-46b2-9bbd-a7ce9f7a7296","ragas-evaluate-rag-llm-applications-2c856b4d","Ragas — Evaluate RAG & LLM Applications","Ragas evaluates LLM applications with objective metrics, test data generation, and data-driven insights. 13.2K+ GitHub stars. RAG evaluation, auto test generation. 
Apache 2.0.",242,{"id":46,"uuid":47,"slug":48,"title":49,"description":50,"author_name":51,"view_count":52,"vote_count":24,"lang_type":25,"type":26,"type_label":27},443,"a543eba5-fe14-46f3-9aa5-96a5a23b72d0","opik-debug-evaluate-monitor-llm-apps-a543eba5","Opik — Debug, Evaluate & Monitor LLM Apps","Trace LLM calls, run automated evaluations, and monitor RAG and agent quality in production. By Comet. 18K+ GitHub stars.","AI Open Source",293,{"id":54,"uuid":55,"slug":56,"title":57,"description":58,"author_name":34,"view_count":59,"vote_count":24,"lang_type":25,"type":26,"type_label":27},773,"ffbad589-cd32-4eca-9518-fdcf9167ca21","guardrails-ai-validate-llm-outputs-production-ffbad589","Guardrails AI — Validate LLM Outputs in Production","Add validation and guardrails to any LLM output. Guardrails AI checks for hallucination, toxicity, PII leakage, and format compliance with 50+ built-in validators.",327,"tokrepo install pack\u002Fllm-eval-guardrails",{"pageType":62,"pageKey":8,"locale":25,"title":63,"metaDescription":64,"h1":13,"tldr":65,"bodyMarkdown":66,"faq":67,"schema":83,"internalLinks":92,"citations":105,"wordCount":118,"generatedAt":119},"pack","LLM Eval & Guardrails: DeepEval, Promptfoo, Ragas, Opik","Open-source LLM eval pack: DeepEval, Promptfoo, Ragas, Opik, Guardrails AI. Score prompts before deploy, constrain outputs at runtime. Install via TokRepo.","Five open-source tools that turn prompt iteration from vibes into measured engineering: offline eval, RAG-specific scoring, observability, and runtime output constraints.","## What's in this pack\n\nThis pack assembles the **five open-source tools** every team converges on once their LLM features ship to real users and \"the model got worse this week\" stops being a tolerable answer. 
The tools split into two halves: pre-deploy evaluation (score every prompt change) and runtime guardrails (constrain what the model actually outputs).\n\n| # | Asset | Phase | Best at |\n|---|---|---|---|\n| 1 | DeepEval | Pre-deploy | Pytest-style unit tests for LLM outputs (G-Eval, Faithfulness, Hallucination metrics) |\n| 2 | Promptfoo | Pre-deploy | A\u002FB prompt comparisons and red-team scans across models |\n| 3 | Ragas | Pre-deploy | RAG-specific metrics: context precision, faithfulness, answer relevancy |\n| 4 | Opik | Observability | Production tracing, eval scores per request, dataset curation |\n| 5 | Guardrails AI | Runtime | Validate output schema and policies, with retry and reasking |\n\nThe split matters. Pre-deploy eval catches the regression *before* customers see it. Runtime guardrails catch the regression you didn't predict. You need both — eval alone misses adversarial inputs you didn't sample, guardrails alone don't tell you which prompt change caused the drift.\n\n## Why eval is now table-stakes\n\nThree forcing functions made eval the difference between teams that ship and teams that stall:\n\n- **Model upgrades happen on the vendor's schedule.** When Anthropic releases Sonnet 4.7, your prompt that worked on 4.6 may behave subtly differently. Without an eval suite, you discover this from a customer support ticket. With Promptfoo, you run `promptfoo eval -c promptfooconfig.yaml --providers anthropic:claude-4.7,anthropic:claude-4.6` and see the diff in 30 seconds.\n- **Prompts have no compile errors.** A typo in code throws an exception. A typo in a prompt produces plausible but worse output that ships. Eval is the compile step prompts never had.\n- **RAG quality decays silently.** A new doc that gets retrieved but isn't actually relevant lowers answer quality without raising any error. 
Ragas gives you context precision and faithfulness scores per query, so you spot decay before it accumulates.\n\n## Install in one command\n\n```bash\n# Install the entire pack into the current project\ntokrepo install pack\u002Fllm-eval-guardrails\n\n# Or pick individual assets\ntokrepo install promptfoo\ntokrepo install ragas\n```\n\nThe TokRepo CLI sets up an `evals\u002F` directory with example test cases, a `promptfooconfig.yaml`, a Ragas notebook seeded with your retriever, and a Guardrails AI rail file template. CI snippets gate merges on eval-suite pass rate.\n\n## Common pitfalls\n\n- **LLM-as-judge without grounding.** DeepEval and Ragas use a judge model to score answers, but if the judge is the same model as the system under test, you get optimistic scores. Use a different model family as judge, or pin a stronger model (e.g. judge with Claude when scoring GPT outputs).\n- **Eval suites with 5 cases.** Five hand-picked examples don't cover the long tail. Aim for 50-200 cases derived from real production logs (Opik makes this easy: sample bad outputs, label them, promote to eval set).\n- **Treating Guardrails as a magic filter.** Guardrails enforces *structure* (valid JSON, profanity-free, schema-conformant); it doesn't catch factually wrong but well-formatted answers. Pair it with a Ragas faithfulness check.\n- **Ignoring the cost of eval runs.** Eval suites can hit 5-10x your normal LLM bill if you re-score the full suite every night. Cache embeddings, sample your eval set per run, or use cheaper models for the judge step.\n- **No eval for non-text outputs.** If your agent emits tool calls, eval the *tool call shape* with structured assertions, not just the final text. Promptfoo supports this via custom `transform` and `assert` hooks.\n\n## A typical week with this stack\n\nMonday morning the on-call engineer reviews Opik traces from the weekend, samples 30 outputs that scored low on faithfulness, and promotes the worst 10 into the DeepEval test set. 
Wednesday a product manager asks whether switching the customer-support agent from GPT-4o to Claude Sonnet would save money without quality loss — the team writes a Promptfoo config, runs `promptfoo eval` over 200 cases against both models, and answers in 20 minutes with a side-by-side table. Friday before deploy, the CI pipeline runs the full DeepEval + Ragas suite; one regression on the new prompt blocks merge until fixed.\n\nThroughout, Guardrails sits inline in production, rejecting outputs that fail JSON schema validation and surfacing reask rate to a Grafana dashboard. When reask rate spikes above 3%, an alert fires and the team knows the upstream prompt drifted before any customer notices.\n\n## When this pack alone isn't enough\n\nFor full **production observability** beyond Opik (latency percentiles, cost tracking per user, model-routing analytics), look at LangSmith or Arize Phoenix — neither is in the pack because they're more orchestration than eval. For **safety classifiers** (jailbreak detection, prompt injection scoring), add Llama Guard or NVIDIA NeMo Guardrails — Guardrails AI focuses on output validation, not adversarial input detection. And if your eval needs **human-in-the-loop annotation** at scale, Argilla or Label Studio plug into the Opik dataset format.",[68,71,74,77,80],{"q":69,"a":70},"Is this pack free to run?","All five tools are open source under permissive licenses (Apache 2.0 or MIT). Compute costs are the variable: every eval call hits an LLM, so a suite of 200 cases × 4 prompt variants × 2 models = 1600 LLM calls per run. Cache aggressively, sample for nightly runs, full-run only on release. Ragas and DeepEval support LLM-as-judge with cheap models (Haiku, gpt-4o-mini) to keep judge cost low.",{"q":72,"a":73},"How does this compare to LangSmith or Braintrust?","LangSmith and Braintrust are managed platforms with eval, observability, and dataset curation in one UI. 
The pack here gives you 80% of the features for $0 and full self-host. Trade-off: you wire the components yourself (Promptfoo for eval, Opik for traces, Guardrails for runtime) instead of getting one dashboard. Pick managed if your team would otherwise not run eval at all; pick this pack if engineering effort is cheaper than seat fees.",{"q":75,"a":76},"Will this work with Claude Code or Cursor?","Yes. Claude Code can author Promptfoo configs and DeepEval test cases from your feature spec — give it the spec plus a few example prompts, and it generates `evals\u002Ftest_*.py` and `promptfooconfig.yaml`. The TokRepo asset pages include subagent prompts that wire this into a `prompt-eval` slash command. Cursor uses the same flow via custom rules.",{"q":78,"a":79},"What's the difference between Promptfoo and DeepEval?","Promptfoo is config-driven (YAML) and excels at A\u002FB comparisons across providers\u002Fmodels — perfect for the question 'should we switch from GPT to Claude?'. DeepEval is code-driven (pytest) and excels at unit-test-style assertions on individual prompts — perfect for 'this answer must mention X and not contain Y'. Most teams run both: Promptfoo for model selection, DeepEval for prompt regression.",{"q":81,"a":82},"Operational gotcha when adding Guardrails AI?","Guardrails reasking can multiply your latency and cost — every failed validation triggers another LLM call to fix the output. 
Set a max-retry of 1-2, monitor reask rate in Opik (if reask rate >5% your prompt itself is wrong, not the output), and prefer structured output mode (JSON schema) over reasking when the underlying model supports it (Claude, GPT-4o, Gemini all do).",{"@context":84,"@type":85,"name":13,"description":86,"numberOfItems":87,"publisher":88},"https:\u002F\u002Fschema.org","CollectionPage","Open-source pack for scoring prompt changes and constraining LLM outputs: DeepEval, Promptfoo, Ragas, Opik, Guardrails AI.",5,{"@type":89,"name":90,"url":91},"Organization","TokRepo","https:\u002F\u002Ftokrepo.com",[93,97,101],{"url":94,"anchor":95,"reason":96},"\u002Fen\u002Fpacks\u002Frag-pipelines","RAG Pipelines","evaluate retrieval quality alongside generation",{"url":98,"anchor":99,"reason":100},"\u002Fen\u002Fpacks\u002Fprompt-engineering-toolkit","Prompt Engineering Toolkit","the prompts you score with these evaluators",{"url":102,"anchor":103,"reason":104},"\u002Fen\u002Ftools\u002Fclaude-code","Claude Code","agent that can author Promptfoo configs from feature specs",[106,110,114],{"claim":107,"source_name":108,"source_url":109},"Promptfoo is an open-source CLI for testing and evaluating LLM apps with model comparisons and assertions","promptfoo\u002Fpromptfoo","https:\u002F\u002Fgithub.com\u002Fpromptfoo\u002Fpromptfoo",{"claim":111,"source_name":112,"source_url":113},"Ragas provides metrics like faithfulness and answer relevancy for evaluating RAG pipelines","explodinggradients\u002Fragas","https:\u002F\u002Fgithub.com\u002Fexplodinggradients\u002Fragas",{"claim":115,"source_name":116,"source_url":117},"Guardrails AI defines validation rules to constrain LLM outputs to expected formats and policies","guardrails-ai\u002Fguardrails","https:\u002F\u002Fgithub.com\u002Fguardrails-ai\u002Fguardrails",828,"2026-05-02T15:00:00Z"]