Skills · March 31, 2026 · 1 min read

Ragas — Evaluate RAG & LLM Applications

Ragas evaluates LLM applications with objective metrics, test data generation, and data-driven insights. 13.2K+ GitHub stars. RAG evaluation, auto test generation. Apache 2.0.

Script Depot · Community

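Agent ready

Agents can install this directly

This asset is installable. The agent first selects the current runtime, reviews the install plan, then runs the matching command.

Native · 98/100 · Policy: allowed

Agent entry point: any MCP/CLI agent
Type: Skill
Install: Single
Trust level: Established
Entry: Ragas — Evaluate RAG & LLM Applications

Direct install command

npx -y tokrepo@latest install 2c856b4d-64e5-46b2-9bbd-a7ce9f7a7296 --target codex

Run a dry run first to confirm the install plan, then run this command.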

TL;DR

Ragas evaluates RAG pipelines with metrics like faithfulness, relevancy, and context precision, plus auto test data generation.

§01

What it is

Ragas is a Python framework for evaluating Retrieval-Augmented Generation (RAG) pipelines and LLM applications. It provides objective metrics such as faithfulness, answer relevancy, context precision, and context recall. Ragas also generates synthetic test datasets so you can evaluate your RAG pipeline without manually curating question-answer pairs.

Ragas targets ML engineers and developers building production RAG systems who need quantitative evaluation beyond manual spot-checking. If you run a retrieval pipeline feeding context to an LLM, Ragas tells you how well the system performs across measurable dimensions.

§02

How it saves time or tokens

Manually evaluating RAG outputs requires reading hundreds of responses and judging quality by hand. Ragas automates this with LLM-as-judge metrics that score each response on multiple axes. The synthetic test data generator creates diverse question-answer pairs from your documents, eliminating the hours spent crafting test sets. This page provides the pip install command and an evaluation script that is ready to run.

§03

How to use

Install Ragas:

pip install ragas

Prepare your evaluation dataset with questions, ground truth answers, retrieved contexts, and LLM-generated answers.
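The preparation step can be sketched in plain Python before any Ragas call. This is a minimal sketch, assuming the four column names used in the example on this page; `records`, `REQUIRED`, and `to_columns` are hypothetical names introduced for illustration and are not part of Ragas.

```python
# Column names follow the example on this page; `records` and
# `to_columns` are hypothetical names used only for illustration.
REQUIRED = ("question", "answer", "contexts", "ground_truth")


def to_columns(records: list[dict]) -> dict[str, list]:
    """Turn per-sample records into a column-oriented dict,
    checking each record for the required fields on the way."""
    columns = {name: [] for name in REQUIRED}
    for i, record in enumerate(records):
        missing = [name for name in REQUIRED if name not in record]
        if missing:
            raise KeyError(f"record {i} is missing {missing}")
        if not isinstance(record["contexts"], list):
            raise TypeError(f"record {i}: 'contexts' must be a list of strings")
        for name in REQUIRED:
            columns[name].append(record[name])
    return columns


records = [
    {
        "question": "What is RAG?",
        "answer": "RAG combines retrieval with generation...",
        "contexts": ["Retrieval-augmented generation..."],
        "ground_truth": "RAG is a technique that...",
    }
]
columns = to_columns(records)
print(columns["contexts"])
```

The resulting dict has the same shape as the one passed to `Dataset.from_dict` in the evaluation script, so a malformed record fails here instead of partway through an evaluation run.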

Run the evaluation:

from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision

# One entry per sample in every column; 'contexts' holds a list of
# retrieved passages for each question.
eval_dataset = Dataset.from_dict({
    'question': ['What is RAG?'],
    'answer': ['RAG combines retrieval with generation...'],
    'contexts': [['Retrieval-augmented generation...']],
    'ground_truth': ['RAG is a technique that...']
})

# The metrics are scored by an LLM judge, so a provider API key must be set.
result = evaluate(eval_dataset, metrics=[faithfulness, answer_relevancy, context_precision])
print(result)

§04

Example

# Generate synthetic test data from your documents.
# Written against the Ragas 0.1-style API; newer releases may use different signatures.
from langchain_community.document_loaders import DirectoryLoader
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from ragas.testset import TestsetGenerator

# Load your documents
loader = DirectoryLoader('./docs/', glob='**/*.md')
documents = loader.load()

# The generator needs an embedding model in addition to the two LLMs
generator = TestsetGenerator.from_langchain(
    generator_llm=ChatOpenAI(model='gpt-4o'),
    critic_llm=ChatOpenAI(model='gpt-4o'),
    embeddings=OpenAIEmbeddings()
)

testset = generator.generate_with_langchain_docs(
    documents, test_size=20
)
print(testset.to_pandas())

§05

Related on TokRepo

RAG tools -- Other retrieval-augmented generation frameworks and utilities
Testing tools -- Automated testing and quality assurance for AI applications

§06

Common pitfalls

Running evaluation without setting an OpenAI API key. Ragas uses LLM-as-judge by default, which requires an LLM provider configured via environment variables.
Using too few test samples leads to unreliable metric scores. Aim for at least 50 question-answer pairs for statistically meaningful results.
Confusing context precision with context recall. Precision measures how much of the retrieved context is relevant; recall measures how much relevant context was retrieved.
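The precision/recall distinction in the last pitfall is easier to see with numbers. The sketch below uses the plain set-based definitions that match the sentence above; Ragas computes its own versions of these metrics with an LLM judge, so treat this as an illustration of the concept only. `precision_recall` and the chunk names are hypothetical.

```python
def precision_recall(retrieved: set[str], relevant: set[str]) -> tuple[float, float]:
    """Set-based precision and recall for one query.

    precision = relevant chunks retrieved / all chunks retrieved
    recall    = relevant chunks retrieved / all relevant chunks
    """
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical chunk IDs: 4 chunks retrieved, 3 chunks actually relevant.
retrieved = {"chunk_a", "chunk_b", "chunk_c", "chunk_d"}
relevant = {"chunk_a", "chunk_b", "chunk_e"}

precision, recall = precision_recall(retrieved, relevant)
print(precision, round(recall, 2))  # prints: 0.5 0.67
```

Two of the four retrieved chunks are relevant (precision 0.5), and two of the three relevant chunks were found (recall about 0.67). Retrieving more chunks tends to raise recall while lowering precision.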

FAQ

What metrics does Ragas provide?

Ragas offers faithfulness (does the answer stick to the context), answer relevancy (is the answer relevant to the question), context precision (how much retrieved context is relevant), and context recall (how much relevant information was retrieved). Additional metrics cover aspect critique and harmfulness.
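To make "does the answer stick to the context" concrete, here is a minimal sketch of the ratio behind a faithfulness-style score: the share of claims in an answer that the retrieved context supports. In Ragas an LLM judge extracts the claims and decides each verdict; here the verdicts are hand-labeled, and `faithfulness_score` is a hypothetical helper, not a Ragas function.

```python
def faithfulness_score(verdicts: list[bool]) -> float:
    """Fraction of answer claims supported by the retrieved context.

    `verdicts` holds one boolean per claim (True = supported).
    Hypothetical helper for illustration only.
    """
    if not verdicts:
        raise ValueError("need at least one claim")
    return sum(verdicts) / len(verdicts)


# Hand-labeled example: 3 claims in the answer, 2 backed by the context.
score = faithfulness_score([True, True, False])
print(round(score, 2))  # prints: 0.67
```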

Does Ragas require an LLM API key?

Yes, by default. Ragas uses an LLM as a judge to compute metrics like faithfulness and relevancy. You need an OpenAI API key or a compatible provider set via environment variables. Some metrics can run without an LLM, but the core metrics require one.
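A small pre-flight check can catch a missing key before an evaluation run starts. This sketch assumes the OpenAI default and checks the `OPENAI_API_KEY` environment variable; `missing_env_vars` is a hypothetical helper, and other providers use different variable names.

```python
import os


def missing_env_vars(required: tuple[str, ...] = ("OPENAI_API_KEY",)) -> list[str]:
    """Return the names of required environment variables that are unset or empty."""
    return [name for name in required if not os.environ.get(name)]


missing = missing_env_vars()
if missing:
    print("Set these before running the evaluation:", ", ".join(missing))
else:
    print("LLM provider credentials found.")
```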

Can I use Ragas with non-OpenAI models?

Yes. Ragas supports other LLM providers through LangChain integrations. You can use Anthropic, Google, or local models via Ollama. Pass the appropriate LangChain LLM object through the llm argument of evaluate().

How does synthetic test data generation work?

Ragas reads your source documents and uses an LLM to generate diverse question-answer pairs with varying difficulty levels. It creates simple, multi-context, and reasoning questions. This eliminates the need for manual test set curation.
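As a rough illustration of how a test set divides across the three question types, the sketch below splits a target size by a weight per type. The weights are made-up values chosen for this example, not Ragas defaults, and `split_test_size` is a hypothetical helper rather than part of the library.

```python
def split_test_size(test_size: int, distribution: dict[str, float]) -> dict[str, int]:
    """Split a total question count across question types by weight."""
    if abs(sum(distribution.values()) - 1.0) > 1e-9:
        raise ValueError("distribution weights must sum to 1")
    # Round each share down, then give the leftover to the largest-weight type.
    counts = {name: int(test_size * weight) for name, weight in distribution.items()}
    remainder = test_size - sum(counts.values())
    largest = max(distribution, key=distribution.get)
    counts[largest] += remainder
    return counts


# Illustrative weights only (an assumption, not a Ragas default).
counts = split_test_size(20, {"simple": 0.5, "multi_context": 0.25, "reasoning": 0.25})
print(counts)  # prints: {'simple': 10, 'multi_context': 5, 'reasoning': 5}
```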

How many test samples do I need for reliable evaluation?

At least 50 question-answer pairs are recommended for statistically meaningful results. For production evaluation, 100-200 samples across different document sections provide more robust insights into pipeline performance.
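The effect of sample count on reliability can be checked with a quick calculation. This sketch uses made-up per-sample scores and a normal-approximation 95% interval, which is an assumption made here for illustration and not something Ragas reports; `mean_with_margin` is a hypothetical helper.

```python
import math
import statistics


def mean_with_margin(scores: list[float], z: float = 1.96) -> tuple[float, float]:
    """Mean score and half-width of an approximate 95% confidence interval."""
    if len(scores) < 2:
        raise ValueError("need at least two scores")
    mean = statistics.fmean(scores)
    margin = z * statistics.stdev(scores) / math.sqrt(len(scores))
    return mean, margin


# Made-up per-sample scores with the same spread at each sample size.
pattern = [0.6, 0.7, 0.8, 0.9, 1.0]
for n in (10, 50, 200):
    scores = pattern * (n // len(pattern))
    mean, margin = mean_with_margin(scores)
    print(f"n={n:3d}  mean={mean:.2f}  +/- {margin:.3f}")
```

The mean stays the same while the margin shrinks roughly with the square root of the sample count, which is why a score from 10 samples is much less trustworthy than one from 50 or 200.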

Sources (3)

Ragas GitHub: Ragas provides objective metrics for RAG evaluation including faithfulness and r…
Ragas Documentation: Ragas supports synthetic test data generation from source documents
Ragas Metrics Docs: LLM-as-judge evaluation methodology

Source and credits

Created by Exploding Gradients. Apache 2.0. explodinggradients/ragas — 13,200+ GitHub stars

Related assets

Opik — Debug, Evaluate & Monitor LLM Apps

TruLens — Evaluate and Track LLM Apps

Haystack — AI Orchestration for Search & RAG

RAGFlow — Deep Document Understanding RAG Engine