DeepEval — LLM Testing Framework with 30+ Metrics
DeepEval is a pytest-like testing framework for LLM apps with 30+ metrics. 14.4K+ GitHub stars. RAG, agent, multimodal evaluation. Runs locally. MIT.
Agent-ready installation
This asset can be installed after choosing the runtime, reviewing the plan, and running the corresponding command.
npx -y tokrepo@latest install a4d57f88-3711-4032-8ad5-f2040ae03178 --target codex
Run after confirming the plan with a dry run.
What it is
DeepEval is an open-source testing framework designed specifically for LLM applications. It works like pytest but adds 30+ evaluation metrics tailored to AI outputs, including answer relevancy, faithfulness, contextual precision, hallucination detection, and task completion scoring.
The framework targets ML engineers and backend developers building RAG pipelines, AI agents, or any application that needs automated quality checks on LLM outputs.
How it saves time or tokens
Manual evaluation of LLM outputs is slow and inconsistent. DeepEval automates the process with quantitative metrics, catching regressions in CI/CD before they reach production. The test runner executes locally on your machine, so no separate hosted evaluation service is required. Note that many metrics use an LLM as a judge: unless you configure a local judge model, test data is sent to that model's provider and its usage is billed as usual.
How to use
- Install DeepEval via pip: pip install -U deepeval
- Create a test file with test cases defining input, expected output, and retrieval context.
- Run tests with deepeval test run test_llm.py -- results show pass/fail per metric.
Example
from deepeval import assert_test
from deepeval.test_case import LLMTestCase
from deepeval.metrics import AnswerRelevancyMetric, FaithfulnessMetric


def test_rag_pipeline():
    # One test case: the prompt, the model's answer, and the retrieved context.
    test_case = LLMTestCase(
        input='What is DeepEval?',
        actual_output='DeepEval is an LLM testing framework.',
        retrieval_context=['DeepEval provides 30+ metrics for LLM evaluation.']
    )
    # Each metric is checked against its own threshold.
    relevancy = AnswerRelevancyMetric(threshold=0.7)
    faithfulness = FaithfulnessMetric(threshold=0.8)
    # Fails the test if any metric fails.
    assert_test(test_case, [relevancy, faithfulness])
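To make the per-metric pass/fail behavior concrete, the sketch below imitates threshold gating in plain Python. It does not call DeepEval: MetricResult and gate are hypothetical names, the scores are made up for illustration, and it assumes the convention that a higher-is-better metric (such as answer relevancy or faithfulness) passes when its score is at or above its threshold.

```python
from dataclasses import dataclass


@dataclass
class MetricResult:
    """Hypothetical stand-in for one metric's outcome (not a DeepEval class)."""
    name: str
    score: float      # assumed to lie in [0, 1], higher is better
    threshold: float  # minimum score required to pass

    @property
    def passed(self) -> bool:
        return self.score >= self.threshold


def gate(results: list[MetricResult]) -> list[str]:
    """Return the names of metrics that fell below their threshold."""
    return [r.name for r in results if not r.passed]


# Illustrative scores only; real scores come from running the metrics.
results = [
    MetricResult("answer_relevancy", score=0.82, threshold=0.7),
    MetricResult("faithfulness", score=0.75, threshold=0.8),
]
failed = gate(results)
print(failed)  # ['faithfulness']
```

Under this reading, a single metric below its threshold is enough to fail the whole test case, which is why one overly strict threshold can turn an otherwise healthy run red.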
Related on TokRepo
- AI Tools for Testing -- compare AI-powered testing and evaluation tools
- AI Tools for RAG -- explore retrieval-augmented generation frameworks and engines
Common pitfalls
- Setting metric thresholds too high initially causes false failures. Start with 0.5-0.7 and tighten as your pipeline matures.
- DeepEval scores LLM outputs but does not validate your application's deterministic logic. Pair it with unit tests for deterministic code paths.
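- The retrieval_context field is required for RAG metrics like faithfulness. Without it, those metrics cannot be computed; depending on the DeepEval version and run options, the test errors out or the metric is skipped.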
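One way to avoid surprises with missing context is a pre-flight check before the metrics run. The sketch below uses a hypothetical Case dataclass that mirrors the three fields from the example test case (input, actual_output, retrieval_context). It is not DeepEval's LLMTestCase, and missing_context is an illustrative helper rather than part of the library.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Case:
    """Hypothetical stand-in with the same field names as the example test case."""
    input: str
    actual_output: str
    retrieval_context: Optional[list[str]] = None


def missing_context(cases: list[Case]) -> list[str]:
    """Return the inputs of cases whose retrieval_context is absent or empty."""
    return [c.input for c in cases if not c.retrieval_context]


cases = [
    Case(
        "What is DeepEval?",
        "An LLM testing framework.",
        retrieval_context=["DeepEval provides 30+ metrics for LLM evaluation."],
    ),
    Case("Is it open source?", "Yes."),  # retrieval context forgotten
]

bad = missing_context(cases)
if bad:
    print(f"{len(bad)} case(s) lack retrieval_context: {bad}")
```

Running a check like this at the top of a test module makes an incomplete test case visible immediately, instead of leaving it to the metric's own error handling.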
Frequently asked questions
Manual evaluation is subjective and does not scale. DeepEval quantifies output quality with reproducible metrics, runs in CI/CD, and catches regressions automatically. It replaces spreadsheet-based reviews with pytest-style assertions.
Yes. DeepEval integrates with OpenAI, Anthropic, LangChain, LlamaIndex, and CrewAI. You can evaluate outputs from any model by passing the actual_output to test cases regardless of which LLM generated it.
Yes. DeepEval is pytest-compatible, so it runs in any CI system that supports Python testing -- GitHub Actions, GitLab CI, Jenkins, CircleCI. Use deepeval test run in your pipeline script.
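For pipelines that prefer a Python entry point over a raw shell step, a thin wrapper can invoke the same command and pass its exit code back to the CI system. This is only a sketch: build_command and run_suite are hypothetical helper names, and it assumes deepeval is installed and on the PATH of the CI runner.

```python
import subprocess


def build_command(test_file: str) -> list[str]:
    """Build the CLI invocation: deepeval test run <test_file>."""
    return ["deepeval", "test", "run", test_file]


def run_suite(test_file: str) -> int:
    """Run the suite and return the process exit code.

    check=False keeps a failing suite from raising, so the caller
    decides what to do with a non-zero code.
    """
    completed = subprocess.run(build_command(test_file), check=False)
    return completed.returncode
```

A pipeline script could then end with sys.exit(run_suite("test_llm.py")) so that the CI job is marked failed whenever the command exits with a non-zero code.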
DeepEval offers answer relevancy, faithfulness, contextual precision, contextual recall, and hallucination metrics. These measure whether the LLM answer stays grounded in the retrieved documents.
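DeepEval is open source under MIT license. The framework itself runs locally on your machine at no cost; metrics that rely on an LLM judge incur that model's usage costs unless you point them at a local model. An optional hosted dashboard (Confident AI) is available for teams that want centralized reporting.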
Source and acknowledgments
Created by Confident AI. Licensed under MIT. confident-ai/deepeval — 14,400+ GitHub stars