Skills · March 31, 2026 · 1 min read

DeepEval — LLM Testing Framework with 30+ Metrics

DeepEval is a pytest-like testing framework for LLM apps with 30+ metrics. 14.4K+ GitHub stars. RAG, agent, multimodal evaluation. Runs locally. MIT.

Script Depot · Community

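Agent ready

Agents can install this asset directly

This asset is installable. The agent first selects the current runtime, reviews the install plan, and then runs the matching command.

Native · 98/100 · Policy: allow

Agent entry point: any MCP/CLI agent

Type: Skill

Install: Single

Trust level: Established

Entry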

DeepEval — LLM Testing Framework with 30+ Metrics

Direct install command

npx -y tokrepo@latest install a4d57f88-3711-4032-8ad5-f2040ae03178 --target codex

Do a dry run first to confirm the install plan, then run this command.

TL;DR

DeepEval provides 30+ evaluation metrics for LLM apps in a pytest-compatible framework.

§01

What it is

DeepEval is an open-source testing framework designed specifically for LLM applications. It works like pytest but adds 30+ evaluation metrics tailored to AI outputs, including answer relevancy, faithfulness, contextual precision, hallucination detection, and task completion scoring.

The framework targets ML engineers and backend developers building RAG pipelines, AI agents, or any application that needs automated quality checks on LLM outputs.

§02

How it saves time or tokens

Manual evaluation of LLM outputs is slow and inconsistent. DeepEval automates the process with quantitative metrics, catching regressions in CI/CD before they reach production. The framework itself runs locally on your machine. Most metrics, however, score outputs with an LLM judge (an OpenAI model by default), so test data stays in your environment and incurs no API cost only if you configure a local evaluation model.

§03

How to use

Install DeepEval via pip: pip install -U deepeval.
Create a test file with test cases defining the input, the actual output of your application, and (for RAG metrics) the retrieval context.
Run tests with deepeval test run test_llm.py -- results show pass/fail per metric.
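
The run step above can be wrapped in a small Python launcher, which is handy when a build script is already written in Python. This is a minimal sketch: the helper names build_command and run_suite are hypothetical, and it assumes the deepeval CLI is on PATH and exits non-zero when a test fails, as pytest does.

```python
import shlex
import subprocess


def build_command(test_file: str) -> list[str]:
    """Return the argv list for `deepeval test run <test_file>`."""
    return ["deepeval", "test", "run", test_file]


def run_suite(test_file: str) -> int:
    """Run the suite and return the CLI's exit code.

    Assumption: a non-zero code means at least one test failed.
    """
    cmd = build_command(test_file)
    print("running:", shlex.join(cmd))
    return subprocess.run(cmd, check=False).returncode
```

Calling sys.exit(run_suite("test_llm.py")) from a build script would then propagate the result to the caller.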

§04

Example

from deepeval import assert_test
from deepeval.test_case import LLMTestCase
from deepeval.metrics import AnswerRelevancyMetric, FaithfulnessMetric

def test_rag_pipeline():
    # One test case: the user input, what the app answered, and what it retrieved.
    test_case = LLMTestCase(
        input='What is DeepEval?',
        actual_output='DeepEval is an LLM testing framework.',
        retrieval_context=['DeepEval provides 30+ metrics for LLM evaluation.']
    )
    # Each metric passes only if its score reaches the given threshold.
    relevancy = AnswerRelevancyMetric(threshold=0.7)
    faithfulness = FaithfulnessMetric(threshold=0.8)
    # Fails the test if any metric falls below its threshold.
    assert_test(test_case, [relevancy, faithfulness])
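
To make the pass/fail rule in the example concrete, here is a simplified model of threshold gating in plain Python. It is only a sketch and not DeepEval's implementation: MetricResult and failed_metrics are hypothetical names, the scores are made up for illustration, and it assumes metrics where a higher score is better.

```python
from dataclasses import dataclass


@dataclass
class MetricResult:
    """Hypothetical stand-in for one metric's outcome (not a DeepEval class)."""
    name: str
    score: float      # 0.0 to 1.0, higher is better
    threshold: float  # minimum score required to pass

    @property
    def passed(self) -> bool:
        return self.score >= self.threshold


def failed_metrics(results: list[MetricResult]) -> list[str]:
    """Names of the metrics that scored below their threshold."""
    return [r.name for r in results if not r.passed]


# Illustrative scores only; real scores come from the metrics themselves.
results = [
    MetricResult("answer_relevancy", score=0.91, threshold=0.7),
    MetricResult("faithfulness", score=0.75, threshold=0.8),
]
failures = failed_metrics(results)
print(failures)  # ['faithfulness']
```

In this model the test case fails as a whole because one metric missed its threshold, even though the other passed comfortably.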

§05

Related on TokRepo

AI Tools for Testing -- compare AI-powered testing and evaluation tools
AI Tools for RAG -- explore retrieval-augmented generation frameworks and engines

§06

Common pitfalls

Setting metric thresholds too high initially causes false failures. Start with 0.5-0.7 and tighten as your pipeline matures.
DeepEval scores LLM outputs but does not validate your deterministic application logic. Pair it with unit tests for deterministic code paths.
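The retrieval_context field is required for RAG metrics like faithfulness. Depending on version and configuration, omitting it either raises a missing-parameter error or skips the metric, so do not assume the check ran.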
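
Two of the pitfalls above can be guarded against before any metric runs. The sketch below is plain Python with hypothetical helper names (missing_context, starting_threshold) and an illustrative set of metric labels; it is not part of DeepEval.

```python
from typing import Optional, Sequence

# Illustrative labels for metrics that need retrieval context (an assumption).
RAG_METRICS = {"faithfulness", "contextual_precision", "contextual_recall"}


def missing_context(
    metric_names: Sequence[str],
    retrieval_context: Optional[list[str]],
) -> list[str]:
    """Return the RAG metrics that cannot be scored without retrieval context."""
    if retrieval_context:
        return []
    return sorted(m for m in metric_names if m in RAG_METRICS)


def starting_threshold(
    baseline_scores: Sequence[float],
    low: float = 0.5,
    high: float = 0.7,
) -> float:
    """Clamp the worst baseline score into the 0.5-0.7 starting band."""
    return min(high, max(low, min(baseline_scores)))


print(missing_context(["answer_relevancy", "faithfulness"], None))  # ['faithfulness']
print(starting_threshold([0.62, 0.81, 0.74]))  # 0.62
```

Failing fast on a non-empty missing_context result makes a forgotten retrieval_context visible, and the clamped threshold gives a starting point that can be tightened as the pipeline matures.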

FAQ

How does DeepEval compare to manual LLM evaluation?

Manual evaluation is subjective and does not scale. DeepEval quantifies output quality with reproducible metrics, runs in CI/CD, and catches regressions automatically. It replaces spreadsheet-based reviews with pytest-style assertions.

Does DeepEval support multi-model evaluation?

Yes. DeepEval integrates with OpenAI, Anthropic, LangChain, LlamaIndex, and CrewAI. You can evaluate outputs from any model by passing the actual_output to test cases regardless of which LLM generated it.

Can DeepEval run in CI/CD pipelines?

Yes. DeepEval is pytest-compatible, so it runs in any CI system that supports Python testing -- GitHub Actions, GitLab CI, Jenkins, CircleCI. Use deepeval test run in your pipeline script.

What RAG-specific metrics does DeepEval provide?

DeepEval offers answer relevancy, faithfulness, contextual precision, contextual recall, and hallucination metrics. These measure whether the LLM answer stays grounded in the retrieved documents.
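
As a rough intuition for how a groundedness score such as faithfulness can be read, the sketch below computes the fraction of claims in an answer that are supported by the retrieved context. It is a simplified illustration and not DeepEval's implementation: the function name supported_fraction is hypothetical, and the verdicts are hand-labeled here, whereas a real metric would obtain them from a judge model.

```python
def supported_fraction(verdicts: list[bool]) -> float:
    """Fraction of claims judged as supported by the retrieved context."""
    if not verdicts:
        raise ValueError("at least one claim is required")
    return sum(verdicts) / len(verdicts)


# Hand-labeled verdicts for four claims extracted from one answer (illustrative).
verdicts = [True, True, False, True]
score = supported_fraction(verdicts)
print(score)         # 0.75
print(score >= 0.8)  # False
```

With a threshold of 0.8, as in the earlier example, a single unsupported claim out of four is enough to fail the check.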

Is DeepEval free to use?

DeepEval is open source under MIT license. The framework runs locally on your machine at no cost, although metrics that use a hosted LLM judge incur that provider's API charges. An optional hosted dashboard (Confident AI) is available for teams that want centralized reporting.


Source and acknowledgments

Created by Confident AI. Licensed under MIT. confident-ai/deepeval — 14,400+ GitHub stars

