LLM Eval & Guardrails
DeepEval, Promptfoo, Ragas, Opik, Guardrails AI — score every prompt change before it ships and catch regressions early.
What's in this pack
This pack assembles the five open-source tools every team converges on once their LLM features ship to real users and "the model got worse this week" stops being a tolerable answer. The tools split into two halves: pre-deploy evaluation (score every prompt change) and runtime guardrails (constrain what the model actually outputs).
| # | Asset | Phase | Best at |
|---|---|---|---|
| 1 | DeepEval | Pre-deploy | Pytest-style unit tests for LLM outputs (G-Eval, Faithfulness, Hallucination metrics) |
| 2 | Promptfoo | Pre-deploy | A/B prompt comparisons and red-team scans across models |
| 3 | Ragas | Pre-deploy | RAG-specific metrics: context precision, faithfulness, answer relevancy |
| 4 | Opik | Observability | Production tracing, eval scores per request, dataset curation |
| 5 | Guardrails AI | Runtime | Validate output schema and policies, with retry and reasking |
The split matters. Pre-deploy eval catches the regression before customers see it. Runtime guardrails catch the regression you didn't predict. You need both — eval alone misses adversarial inputs you didn't sample, guardrails alone don't tell you which prompt change caused the drift.
Why eval is now table-stakes
Three forcing functions made eval the difference between teams that ship and teams that stall:
- Model upgrades happen on the vendor's schedule. When Anthropic releases Sonnet 4.7, your prompt that worked on 4.6 may behave subtly differently. Without an eval suite, you discover this from a customer support ticket. With Promptfoo, you run `promptfoo eval -c promptfooconfig.yaml --providers anthropic:claude-4.7,anthropic:claude-4.6` and see the diff in 30 seconds (a config sketch follows this list).
- Prompts have no compile errors. A typo in code throws an exception. A typo in a prompt produces plausible but worse output that ships. Eval is the compile step prompts never had.
- RAG quality decays silently. A new doc that gets retrieved but isn't actually relevant lowers answer quality without raising any error. Ragas gives you context precision and faithfulness scores per query, so you spot decay before it accumulates.
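As a sketch of that Promptfoo workflow, a minimal `promptfooconfig.yaml` might look like the following. The provider IDs, prompt, and assertions are placeholders rather than the pack's shipped defaults; substitute the model strings your accounts actually expose.

```yaml
# promptfooconfig.yaml — compare one prompt across two providers (illustrative IDs)
prompts:
  - "You are a support agent. Answer concisely:\n\n{{question}}"

providers:
  - openai:gpt-4o-mini                            # placeholder model IDs; use what you run today
  - anthropic:messages:claude-3-5-sonnet-latest

tests:
  - vars:
      question: "How do I cancel my subscription?"
    assert:
      - type: contains
        value: "cancel"
      - type: llm-rubric
        value: "Gives accurate cancellation steps without inventing policy details"
```

Run it with `promptfoo eval -c promptfooconfig.yaml`, then `promptfoo view` for the side-by-side table in a browser.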
Install in one command
```bash
# Install the entire pack into the current project
tokrepo install pack/llm-eval-guardrails

# Or pick individual assets
tokrepo install promptfoo
tokrepo install ragas
```
The TokRepo CLI sets up an `evals/` directory with example test cases, a `promptfooconfig.yaml`, a Ragas notebook seeded with your retriever, and a Guardrails AI rail file template. CI snippets gate merges on eval-suite pass rate.
Common pitfalls
- LLM-as-judge without grounding. DeepEval and Ragas use a judge model to score answers, but if the judge is the same model as the system under test, you get optimistic scores. Use a different model family as judge, or pin a stronger model (e.g. judge with Claude when scoring GPT outputs); see the sketch after this list.
- Eval suites with 5 cases. Five hand-picked examples don't cover the long tail. Aim for 50-200 cases derived from real production logs (Opik makes this easy — sample bad outputs, label them, promote to eval set).
- Treating Guardrails as a magic filter. Guardrails enforces structure (valid JSON, profanity-free, schema-conformant) — it doesn't catch factually wrong but well-formatted answers. Pair it with a Ragas faithfulness check.
- Forgetting what eval runs cost. Eval suites can hit 5-10x your normal LLM bill if you re-score every case nightly. Cache embeddings, sample your eval set per run, or use cheaper models for the judge step.
- No eval for non-text outputs. If your agent emits tool calls, eval the tool call shape with structured assertions, not just the final text. Promptfoo supports this via custom `transform` and `assert` hooks.
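A minimal sketch of the judge-pinning point above, assuming DeepEval's `GEval` metric; `generate_answer` is a stand-in for your own system under test, and the criteria string and threshold are illustrative. A non-OpenAI judge (say, a Claude model) requires wrapping it in a custom `DeepEvalBaseLLM` rather than passing a model name string.

```python
# test_refund_prompt.py — DeepEval sketch with the judge pinned to a different
# model family than the system under test (illustrative names and thresholds).
from deepeval import assert_test
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams


def generate_answer(question: str) -> str:
    """Placeholder for the (e.g. Claude-backed) agent under test."""
    return "We offer refunds within 30 days of purchase."


def test_refund_window_answer():
    question = "What is your refund window?"
    case = LLMTestCase(
        input=question,
        actual_output=generate_answer(question),
    )
    judge = GEval(
        name="Refund accuracy",
        criteria="The answer states the 30-day refund window and does not invent exceptions.",
        evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
        model="gpt-4o",   # judge from a different family than the agent being scored
        threshold=0.7,
    )
    assert_test(case, [judge])
```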
A typical week with this stack
Monday morning the on-call engineer reviews Opik traces from the weekend, samples 30 outputs that scored low on faithfulness, and promotes the worst 10 into the DeepEval test set. Wednesday a product manager asks whether switching the customer-support agent from GPT-4o to Claude Sonnet would save money without quality loss — the team writes a Promptfoo config, runs promptfoo eval over 200 cases against both models, and answers in 20 minutes with a side-by-side table. Friday before deploy, the CI pipeline runs the full DeepEval + Ragas suite; one regression on the new prompt blocks merge until fixed.
Throughout, Guardrails sits inline in production, rejecting outputs that fail JSON schema validation and surfacing reask rate to a Grafana dashboard. When reask rate spikes above 3%, an alert fires and the team knows the upstream prompt drifted before any customer notices.
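A minimal sketch of that inline validation step, assuming the guardrails-ai Pydantic integration; the schema and hard-coded output are placeholders, and recent releases name the constructor `Guard.for_pydantic` (older ones use `Guard.from_pydantic`).

```python
# guard_support_reply.py — validate the agent's JSON output against a schema
# before it reaches the customer (illustrative schema and sample output).
from pydantic import BaseModel, Field
from guardrails import Guard


class SupportReply(BaseModel):
    intent: str = Field(description="one of: billing, bug_report, other")
    answer: str
    escalate: bool


guard = Guard.for_pydantic(SupportReply)

# Raw model output from the production agent (hard-coded here for illustration).
raw = '{"intent": "billing", "answer": "Refunds take 5-7 business days.", "escalate": false}'

# parse() validates the string against the schema; pass llm_api=... and num_reasks=1
# if you want Guardrails to reask the model on failure instead of just reporting it.
outcome = guard.parse(raw)

if outcome.validation_passed:
    reply = outcome.validated_output            # dict conforming to SupportReply
else:
    print("guard failed:", outcome.error)       # count this toward the reask/failure-rate metric
```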
When this pack alone isn't enough
For full production observability beyond Opik (latency percentiles, cost tracking per user, model-routing analytics), look at LangSmith or Arize Phoenix — neither is in the pack because they're broader platforms rather than focused eval tools. For safety classifiers (jailbreak detection, prompt injection scoring), add Llama Guard or NVIDIA NeMo Guardrails — Guardrails AI focuses on output validation, not adversarial input detection. And if your eval needs human-in-the-loop annotation at scale, Argilla or Label Studio plug into the Opik dataset format.
Frequently asked questions
Is this pack free to run?
All five tools are open source under permissive licenses (Apache 2.0 or MIT). Compute costs are the variable: every eval call hits an LLM, so a suite of 200 cases × 4 prompt variants × 2 models = 1600 LLM calls per run. Cache aggressively, sample for nightly runs, and save full runs for releases. Ragas and DeepEval support LLM-as-judge with cheap models (Haiku, gpt-4o-mini) to keep judge cost low.
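For the judge-cost point, a hedged sketch of a Ragas run using the 0.1.x-style API; the sample row is made up, and column names or the `llm=` override may differ in newer Ragas releases.

```python
# ragas_smoke.py — score a tiny RAG sample with a cheap judge model (illustrative data).
from datasets import Dataset
from langchain_openai import ChatOpenAI
from ragas import evaluate
from ragas.llms import LangchainLLMWrapper
from ragas.metrics import context_precision, faithfulness

eval_ds = Dataset.from_dict({
    "question":     ["How do I reset my password?"],
    "answer":       ["Go to Settings, then Security, then Reset password."],
    "contexts":     [["To reset a password, open Settings > Security > Reset password."]],
    "ground_truth": ["Settings > Security > Reset password."],
})

# Pin a cheap judge so nightly runs stay affordable; omit llm= to use the Ragas default.
cheap_judge = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o-mini"))

result = evaluate(eval_ds, metrics=[faithfulness, context_precision], llm=cheap_judge)
print(result)   # per-metric scores, e.g. {'faithfulness': ..., 'context_precision': ...}
```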
How does this compare to LangSmith or Braintrust?
LangSmith and Braintrust are managed platforms with eval, observability, and dataset curation in one UI. The pack here gives you 80% of the features for $0 and full self-host. Trade-off: you wire the components yourself (Promptfoo for eval, Opik for traces, Guardrails for runtime) instead of getting one dashboard. Pick managed if your team would otherwise not run eval at all; pick this pack if engineering effort is cheaper than seat fees.
Will this work with Claude Code or Cursor?
Yes. Claude Code can author Promptfoo configs and DeepEval test cases from your feature spec — give it the spec plus a few example prompts, and it generates `evals/test_*.py` and `promptfooconfig.yaml`. The TokRepo asset pages include subagent prompts that wire this into a prompt-eval slash command. Cursor uses the same flow via custom rules.
What's the difference between Promptfoo and DeepEval?
Promptfoo is config-driven (YAML) and excels at A/B comparisons across providers/models — perfect for the question 'should we switch from GPT to Claude?'. DeepEval is code-driven (pytest) and excels at unit-test-style assertions on individual prompts — perfect for 'this answer must mention X and not contain Y'. Most teams run both: Promptfoo for model selection, DeepEval for prompt regression.
Any operational gotchas when adding Guardrails AI?
Guardrails reasking can multiply your latency and cost — every failed validation triggers another LLM call to fix the output. Set a max-retry of 1-2, monitor reask rate in Opik (if reask rate >5% your prompt itself is wrong, not the output), and prefer structured output mode (JSON schema) over reasking when the underlying model supports it (Claude, GPT-4o, Gemini all do).
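For the structured-output route, one hedged sketch using the OpenAI SDK's JSON-schema response format; the model name and schema are placeholders, and Claude and Gemini expose their own equivalents, so adapt accordingly.

```python
# structured_reply.py — ask the model for schema-conformant JSON up front,
# so reasking is rarely needed (illustrative model name and schema).
from openai import OpenAI

client = OpenAI()

resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Categorize and answer: my invoice is wrong."}],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "support_reply",
            "strict": True,
            "schema": {
                "type": "object",
                "properties": {
                    "intent": {"type": "string"},
                    "answer": {"type": "string"},
                    "escalate": {"type": "boolean"},
                },
                "required": ["intent", "answer", "escalate"],
                "additionalProperties": False,
            },
        },
    },
)

print(resp.choices[0].message.content)   # JSON string matching the schema
```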