TOKREPO · ARSENAL
New · this week

LLM Eval & Guardrails

DeepEval, Promptfoo, Ragas, Opik, Guardrails AI — score every prompt change before it ships and catch regressions early.

5 assets

What's in this pack

This pack assembles the five open-source tools every team converges on once their LLM features ship to real users and "the model got worse this week" stops being a tolerable answer. The tools split into two halves: pre-deploy evaluation (score every prompt change) and runtime guardrails (constrain what the model actually outputs).

#  Asset          Phase          Best at
1  DeepEval       Pre-deploy     Pytest-style unit tests for LLM outputs (G-Eval, Faithfulness, Hallucination metrics)
2  Promptfoo      Pre-deploy     A/B prompt comparisons and red-team scans across models
3  Ragas          Pre-deploy     RAG-specific metrics: context precision, faithfulness, answer relevancy
4  Opik           Observability  Production tracing, eval scores per request, dataset curation
5  Guardrails AI  Runtime        Validate output schema and policies, with retry and reasking

The split matters. Pre-deploy eval catches the regression before customers see it. Runtime guardrails catch the regression you didn't predict. You need both — eval alone misses adversarial inputs you didn't sample, guardrails alone don't tell you which prompt change caused the drift.

Why eval is now table-stakes

Three forcing functions made eval the difference between teams that ship and teams that stall:

  • Model upgrades happen on the vendor's schedule. When Anthropic releases Sonnet 4.7, your prompt that worked on 4.6 may behave subtly differently. Without an eval suite, you discover this from a customer support ticket. With Promptfoo, you run promptfoo eval -c promptfooconfig.yaml --providers anthropic:claude-4.7,anthropic:claude-4.6 and see the diff in 30 seconds.
  • Prompts have no compile errors. A typo in code throws an exception. A typo in a prompt produces plausible but worse output that ships. Eval is the compile step prompts never had.
  • RAG quality decays silently. A new doc that gets retrieved but isn't actually relevant lowers answer quality without raising any error. Ragas gives you context precision and faithfulness scores per query, so you spot decay before it accumulates.
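To make the last point concrete, here is a hand-rolled sketch of the context-precision idea: what fraction of retrieved chunks are actually relevant to the question. Ragas computes this with an LLM judge; in this illustration the judge is a keyword-overlap stub (`stub_judge` is hypothetical, not part of Ragas) you would replace with a real model call.

```python
def stub_judge(question: str, chunk: str) -> bool:
    """Hypothetical judge: relevant iff the chunk shares a keyword.

    In Ragas this decision is made by an LLM judge, not keyword overlap.
    """
    q_terms = set(question.lower().split())
    return bool(q_terms & set(chunk.lower().split()))

def context_precision(question: str, retrieved: list[str]) -> float:
    """Fraction of retrieved chunks the judge marks relevant."""
    if not retrieved:
        return 0.0
    relevant = sum(stub_judge(question, c) for c in retrieved)
    return relevant / len(retrieved)

chunks = [
    "Refund requests are processed within 14 days.",  # relevant
    "Our office is closed on public holidays.",       # not relevant
]
score = context_precision("how long do refund requests take", chunks)
print(f"context precision: {score:.2f}")  # -> context precision: 0.50
```

A new, irrelevant doc sneaking into retrieval drags this score down per query, which is exactly the silent decay the bullet above describes.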

Install in one command

# Install the entire pack into the current project
tokrepo install pack/llm-eval-guardrails

# Or pick individual assets
tokrepo install promptfoo
tokrepo install ragas

The TokRepo CLI sets up an evals/ directory with example test cases, a promptfooconfig.yaml, a Ragas notebook seeded with your retriever, and a Guardrails AI rail file template. CI snippets gate merges on eval-suite pass rate.

Common pitfalls

  • LLM-as-judge without grounding. DeepEval and Ragas use a judge model to score answers, but if the judge is the same model as the system under test, you get optimistic scores. Use a different model family as judge, or pin a stronger model (e.g. judge with Claude when scoring GPT outputs).
  • Eval suites with 5 cases. Five hand-picked examples don't cover the long tail. Aim for 50-200 cases derived from real production logs (Opik makes this easy — sample bad outputs, label them, promote to eval set).
  • Treating Guardrails as a magic filter. Guardrails enforces structure (valid JSON, profanity-free, schema-conformant) — it doesn't catch factually wrong but well-formatted answers. Pair it with a Ragas faithfulness check.
  • Underestimating eval cost. Eval suites can hit 5-10x your normal LLM bill if you re-score everything nightly. Cache embeddings, sample your eval set per run, or use cheaper models for the judge step.
  • No eval for non-text outputs. If your agent emits tool calls, eval the tool call shape with structured assertions, not just the final text. Promptfoo supports this via custom transform and assert hooks.
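The last pitfall, structured assertions on tool calls, can be sketched in plain Python. This is not Promptfoo's assert-hook API, just an illustration of the kind of check such a hook would run (the `REQUIRED` schema and field names are hypothetical):

```python
import json

# Structural assertion on an agent's tool call, separate from any
# text-quality metric: the output must be JSON with the right fields.
REQUIRED = {"tool": str, "arguments": dict}

def check_tool_call(raw: str) -> list[str]:
    """Return a list of structural errors; empty means the call passes."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError:
        return ["output is not valid JSON"]
    errors = []
    for field, typ in REQUIRED.items():
        if field not in call:
            errors.append(f"missing field: {field}")
        elif not isinstance(call[field], typ):
            errors.append(f"wrong type for {field}")
    return errors

good = '{"tool": "lookup_order", "arguments": {"order_id": "A17"}}'
bad  = '{"tool": "lookup_order"}'
print(check_tool_call(good))  # []
print(check_tool_call(bad))   # ['missing field: arguments']
```

The point is that a tool call with a plausible-looking final answer can still be structurally broken, and only a shape check like this catches it.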

A typical week with this stack

Monday morning the on-call engineer reviews Opik traces from the weekend, samples 30 outputs that scored low on faithfulness, and promotes the worst 10 into the DeepEval test set. Wednesday a product manager asks whether switching the customer-support agent from GPT-4o to Claude Sonnet would save money without quality loss — the team writes a Promptfoo config, runs promptfoo eval over 200 cases against both models, and answers in 20 minutes with a side-by-side table. Friday before deploy, the CI pipeline runs the full DeepEval + Ragas suite; one regression on the new prompt blocks merge until fixed.

Throughout, Guardrails sits inline in production, rejecting outputs that fail JSON schema validation and surfacing reask rate to a Grafana dashboard. When reask rate spikes above 3%, an alert fires and the team knows the upstream prompt drifted before any customer notices.

When this pack alone isn't enough

For full production observability beyond Opik (latency percentiles, cost tracking per user, model-routing analytics), look at LangSmith or Arize Phoenix — neither is in the pack because they're more orchestration than eval. For safety classifiers (jailbreak detection, prompt injection scoring), add Llama Guard or NVIDIA NeMo Guardrails — Guardrails AI focuses on output validation, not adversarial input detection. And if your eval needs human-in-the-loop annotation at scale, Argilla or Label Studio plug into the Opik dataset format.

INSTALL · ONE COMMAND
$ tokrepo install pack/llm-eval-guardrails
hand it to your agent — or paste it in your terminal
What's inside

5 assets in this pack

Script#01
DeepEval — LLM Testing Framework with 30+ Metrics

DeepEval is a pytest-like testing framework for LLM apps with 30+ metrics. 14.4K+ GitHub stars. RAG, agent, multimodal evaluation. Runs locally. MIT.

by Script Depot·142 views
$ tokrepo install deepeval-llm-testing-framework-30-metrics-a4d57f88
Script#02
Promptfoo — LLM Eval & Red-Team Testing Framework

Open-source framework for evaluating and red-teaming LLM applications. Test prompts across models, detect jailbreaks, measure quality, and catch regressions. 5,000+ GitHub stars.

by Agent Toolkit·90 views
$ tokrepo install promptfoo-llm-eval-red-team-testing-framework-288cfb9f
Script#03
Ragas — Evaluate RAG & LLM Applications

Ragas evaluates LLM applications with objective metrics, test data generation, and data-driven insights. 13.2K+ GitHub stars. RAG evaluation, auto test generation. Apache 2.0.

by Script Depot·62 views
$ tokrepo install ragas-evaluate-rag-llm-applications-2c856b4d
Config#04
Opik — Debug, Evaluate & Monitor LLM Apps

Trace LLM calls, run automated evaluations, and monitor RAG and agent quality in production. By Comet. 18K+ GitHub stars.

by AI Open Source·120 views
$ tokrepo install opik-debug-evaluate-monitor-llm-apps-a543eba5
Agent#05
Guardrails AI — Validate LLM Outputs in Production

Add validation and guardrails to any LLM output. Guardrails AI checks for hallucination, toxicity, PII leakage, and format compliance with 50+ built-in validators.

by Agent Toolkit·115 views
$ tokrepo install guardrails-ai-validate-llm-outputs-production-ffbad589
FAQ

Frequently asked questions

Is this pack free to run?

All five tools are open source under permissive licenses (Apache 2.0 or MIT). Compute costs are the variable: every eval call hits an LLM, so a suite of 200 cases × 4 prompt variants × 2 models = 1600 LLM calls per run. Cache aggressively, sample for nightly runs, and run the full suite only on release. Ragas and DeepEval support LLM-as-judge with cheap models (Haiku, gpt-4o-mini) to keep judge cost low.
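The two cost levers, caching repeated (model, prompt) pairs and sampling the suite for nightly runs, can be sketched as follows; the call counts mirror the arithmetic above, and the helper names are illustrative, not from any of the tools:

```python
import hashlib
import random

cache: dict[str, str] = {}

def cache_key(model: str, prompt: str) -> str:
    return hashlib.sha256(f"{model}\x00{prompt}".encode()).hexdigest()

def cached_call(model: str, prompt: str, llm) -> str:
    key = cache_key(model, prompt)
    if key not in cache:          # pay for the LLM call only once
        cache[key] = llm(model, prompt)
    return cache[key]

def nightly_sample(cases: list[str], fraction: float, seed: int) -> list[str]:
    """Deterministic subsample so nightly runs are comparable day to day."""
    k = max(1, int(len(cases) * fraction))
    return random.Random(seed).sample(cases, k)

cases = [f"case-{i}" for i in range(200)]
full_run = len(cases) * 4 * 2                          # 1600 calls
nightly = len(nightly_sample(cases, 0.1, seed=42)) * 4 * 2
print(full_run, nightly)  # -> 1600 160
```

A fixed seed keeps the nightly subset stable, so a score drop between nights means the prompt or model changed, not the sample.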

How does this compare to LangSmith or Braintrust?

LangSmith and Braintrust are managed platforms with eval, observability, and dataset curation in one UI. The pack here gives you 80% of the features for $0 and full self-host. Trade-off: you wire the components yourself (Promptfoo for eval, Opik for traces, Guardrails for runtime) instead of getting one dashboard. Pick managed if your team would otherwise not run eval at all; pick this pack if engineering effort is cheaper than seat fees.

Will this work with Claude Code or Cursor?

Yes. Claude Code can author Promptfoo configs and DeepEval test cases from your feature spec — give it the spec plus a few example prompts, and it generates evals/test_*.py and promptfooconfig.yaml. The TokRepo asset pages include subagent prompts that wire this into a prompt-eval slash command. Cursor uses the same flow via custom rules.

What's the difference between Promptfoo and DeepEval?

Promptfoo is config-driven (YAML) and excels at A/B comparisons across providers/models — perfect for the question 'should we switch from GPT to Claude?'. DeepEval is code-driven (pytest) and excels at unit-test-style assertions on individual prompts — perfect for 'this answer must mention X and not contain Y'. Most teams run both: Promptfoo for model selection, DeepEval for prompt regression.

Operational gotcha when adding Guardrails AI?

Guardrails reasking can multiply your latency and cost — every failed validation triggers another LLM call to fix the output. Set a max-retry of 1-2, monitor reask rate in Opik (if reask rate >5% your prompt itself is wrong, not the output), and prefer structured output mode (JSON schema) over reasking when the underlying model supports it (Claude, GPT-4o, Gemini all do).

MORE FROM THE ARSENAL

12 packs · 80+ hand-picked assets

Browse every curated bundle on the home page

Back to all packs