Ragas — Evaluate RAG & LLM Applications
Ragas evaluates LLM applications with objective metrics, test data generation, and data-driven insights. 13.2K+ GitHub stars. RAG evaluation, auto test generation. Apache 2.0.
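Agent-ready install
This asset is installable. The agent first selects the current runtime and reviews the install plan, then runs the matching command.
npx -y tokrepo@latest install 2c856b4d-64e5-46b2-9bbd-a7ce9f7a7296 --target codex
Do a dry run first to confirm the install plan, then run this command.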
What it is
Ragas is a Python framework for evaluating Retrieval-Augmented Generation (RAG) pipelines and LLM applications. It provides objective metrics such as faithfulness, answer relevancy, context precision, and context recall. Ragas also generates synthetic test datasets so you can evaluate your RAG pipeline without manually curating question-answer pairs.
Ragas targets ML engineers and developers building production RAG systems who need quantitative evaluation beyond manual spot-checking. If you run a retrieval pipeline feeding context to an LLM, Ragas tells you how well the system performs across measurable dimensions.
How it saves time or tokens
Manually evaluating RAG outputs means reading hundreds of responses and judging quality by hand. Ragas automates this with LLM-as-judge metrics that score each response on multiple axes. The synthetic test data generator creates diverse question-answer pairs from your documents, removing the hours otherwise spent crafting test sets. The steps below give the pip install command and a ready-to-run evaluation script.
How to use
- Install Ragas:
pip install ragas
- Prepare your evaluation dataset with questions, ground-truth answers, retrieved contexts, and LLM-generated answers, one entry of each per test question.
- Run the evaluation:
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision

# One row per question; 'contexts' holds a list of retrieved passages for each row.
eval_dataset = Dataset.from_dict({
    'question': ['What is RAG?'],
    'answer': ['RAG combines retrieval with generation...'],
    'contexts': [['Retrieval-augmented generation...']],
    'ground_truth': ['RAG is a technique that...'],
})

# Each metric calls the configured judge LLM, so a provider API key must be set.
result = evaluate(eval_dataset, metrics=[faithfulness, answer_relevancy, context_precision])
print(result)
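Step 2 is easy to get subtly wrong, so it can help to check the data before any LLM calls are made. The following is a minimal sketch, not part of Ragas: `check_eval_columns` is a hypothetical helper that assumes the four column names used in the example above and only verifies that the columns exist, have equal lengths, and that each `contexts` entry is a list of strings.

```python
REQUIRED_COLUMNS = ("question", "answer", "contexts", "ground_truth")


def check_eval_columns(data):
    """Return a list of problems found in a column-oriented evaluation dict.

    Hypothetical helper for illustration; it is not a Ragas API.
    """
    problems = []
    missing = [c for c in REQUIRED_COLUMNS if c not in data]
    if missing:
        problems.append("missing columns: " + ", ".join(missing))
        return problems

    # Every column must describe the same number of test rows.
    lengths = {c: len(data[c]) for c in REQUIRED_COLUMNS}
    if len(set(lengths.values())) != 1:
        problems.append(f"column lengths differ: {lengths}")

    # Each row's contexts must be a list of strings, not one bare string.
    for i, ctx in enumerate(data["contexts"]):
        if not isinstance(ctx, list) or not all(isinstance(c, str) for c in ctx):
            problems.append(f"row {i}: contexts must be a list of strings")
    return problems


sample = {
    "question": ["What is RAG?"],
    "answer": ["RAG combines retrieval with generation..."],
    "contexts": [["Retrieval-augmented generation..."]],
    "ground_truth": ["RAG is a technique that..."],
}
print(check_eval_columns(sample))  # prints []
```

An empty list means the dict has the expected shape and can be handed to `Dataset.from_dict`; anything else is worth fixing first, since a failed evaluation run may already have spent judge-LLM tokens.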
Example
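# Generate synthetic test data from your documents.
# Note: this follows the Ragas 0.1-style API; newer releases changed these signatures.
from langchain_community.document_loaders import DirectoryLoader
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from ragas.testset import TestsetGenerator

# Assumption: the 0.1-style from_langchain also expects an embeddings model.
generator = TestsetGenerator.from_langchain(
    generator_llm=ChatOpenAI(model='gpt-4o'),
    critic_llm=ChatOpenAI(model='gpt-4o'),
    embeddings=OpenAIEmbeddings(),
)

# Load your documents (recursively picks up every Markdown file under ./docs/)
loader = DirectoryLoader('./docs/', glob='**/*.md')
documents = loader.load()

# Generate 20 question-answer pairs from the loaded documents
testset = generator.generate_with_langchain_docs(
    documents, test_size=20
)
print(testset.to_pandas())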
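A generated test set has questions and reference answers but no pipeline outputs yet, so it still has to be run through your RAG system before evaluation. The sketch below shows one way to do that under stated assumptions: each generated row is assumed to expose `question` and `ground_truth` fields, and `rag_pipeline` is a hypothetical callable standing in for your own system that returns an answer plus the retrieved passages. Neither name comes from Ragas.

```python
def build_eval_columns(testset_rows, rag_pipeline):
    """Run each generated question through a pipeline and collect eval columns.

    testset_rows: iterable of dicts with 'question' and 'ground_truth' keys
                  (assumed field names).
    rag_pipeline: hypothetical callable, question -> (answer, retrieved_contexts).
    """
    columns = {"question": [], "answer": [], "contexts": [], "ground_truth": []}
    for row in testset_rows:
        answer, retrieved = rag_pipeline(row["question"])
        columns["question"].append(row["question"])
        columns["answer"].append(answer)
        columns["contexts"].append(list(retrieved))
        columns["ground_truth"].append(row["ground_truth"])
    return columns


def fake_pipeline(question):
    # Stand-in for a real retriever + generator, used only to make this runnable.
    return f"stub answer to: {question}", ["stub context"]


rows = [{"question": "What is RAG?", "ground_truth": "RAG is a technique that..."}]
columns = build_eval_columns(rows, fake_pipeline)
print(columns["answer"])  # prints ['stub answer to: What is RAG?']
```

The resulting dict uses the same four column names as the evaluation example earlier, so it can be passed to `Dataset.from_dict` once `fake_pipeline` is replaced with a real pipeline.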
Related on TokRepo
- RAG tools -- Other retrieval-augmented generation frameworks and utilities
- Testing tools -- Automated testing and quality assurance for AI applications
Common pitfalls
- Running evaluation without setting an OpenAI API key. Ragas uses LLM-as-judge by default, which requires an LLM provider configured via environment variables.
- Using too few test samples leads to unreliable metric scores. Aim for at least 50 question-answer pairs for statistically meaningful results.
- Confusing context precision with context recall. Precision measures how much of the retrieved context is relevant; recall measures how much relevant context was retrieved.
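The precision/recall distinction in the last pitfall can be made concrete with a toy calculation. This is a simplified, set-based illustration only: Ragas computes its context metrics with an LLM judge and its exact definitions may differ, and the chunk IDs and the `context_precision_recall` function are hypothetical.

```python
def context_precision_recall(retrieved, relevant):
    """Set-based precision and recall over chunk IDs (illustration only)."""
    retrieved_set, relevant_set = set(retrieved), set(relevant)
    hits = retrieved_set & relevant_set
    # Precision: share of what was retrieved that is actually relevant.
    precision = len(hits) / len(retrieved_set) if retrieved_set else 0.0
    # Recall: share of what is relevant that was actually retrieved.
    recall = len(hits) / len(relevant_set) if relevant_set else 0.0
    return precision, recall


retrieved = ["c1", "c2", "c3", "c4"]  # what the retriever returned
relevant = ["c1", "c2", "c5"]         # what an ideal retriever would return
precision, recall = context_precision_recall(retrieved, relevant)
print(f"precision={precision:.2f} recall={recall:.2f}")  # precision=0.50 recall=0.67
```

Here half of the retrieved chunks are useful (precision 0.50), while one relevant chunk was missed (recall 0.67). Retrieving more chunks tends to raise recall and lower precision, which is why the two scores should be read together.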
FAQ
What metrics does Ragas provide?
Ragas offers faithfulness (does the answer stick to the context), answer relevancy (is the answer relevant to the question), context precision (how much retrieved context is relevant), and context recall (how much relevant information was retrieved). Additional metrics cover aspect critiques such as harmfulness.
Does Ragas need an LLM to run?
Yes, by default. Ragas uses an LLM as a judge to compute metrics like faithfulness and relevancy. You need an OpenAI API key or a compatible provider set via environment variables. Some metrics can run without an LLM, but the core metrics require one.
Can I use Ragas with models other than OpenAI?
Yes. Ragas supports any LLM provider through LangChain integrations. You can use Anthropic, Google, or local models via Ollama. Pass the appropriate LangChain LLM object when initializing the evaluator.
How does synthetic test generation work?
Ragas reads your source documents and uses an LLM to generate diverse question-answer pairs with varying difficulty levels. It creates simple, multi-context, and reasoning questions. This eliminates the need for manual test set curation.
How many test samples do I need?
At least 50 question-answer pairs are recommended for statistically meaningful results. For production evaluation, 100-200 samples across different document sections provide more robust insights into pipeline performance.
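A rough calculation shows why sample count matters. The sketch below estimates the 95% confidence-interval half-width of a mean metric score using a normal approximation. It assumes per-sample scores lie in [0, 1], so the standard deviation is at most 0.5 (the worst case used as the default here); `ci_half_width` is a hypothetical helper, and real score distributions will usually be tighter.

```python
import math


def ci_half_width(n, std=0.5, z=1.96):
    """Approximate 95% CI half-width for a mean of n scores.

    std=0.5 is the worst case for scores bounded in [0, 1];
    z=1.96 is the normal quantile for a 95% interval.
    """
    return z * std / math.sqrt(n)


for n in (10, 50, 200):
    print(f"n={n:>3}  mean score is known to within about +/-{ci_half_width(n):.3f}")
```

Under these assumptions the uncertainty is about 0.31 with 10 samples, 0.14 with 50, and 0.07 with 200, so small differences between two pipeline variants are hard to trust on a small test set.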
Source and credits
Created by Exploding Gradients. Apache 2.0. explodinggradients/ragas — 13,200+ GitHub stars
Related assets
Opik — Debug, Evaluate & Monitor LLM Apps
Trace LLM calls, run automated evaluations, and monitor RAG and agent quality in production. By Comet. 18K+ GitHub stars.
TruLens — Evaluate and Track LLM Apps
Instrument LLM apps and run systematic evals for RAG quality and regressions to find failure modes fast. Combine tracing and scorecards in one workflow.
Haystack — AI Orchestration for Search & RAG
Open-source AI orchestration framework by deepset. Build production RAG pipelines, semantic search, and agent workflows with modular components. 25K+ GitHub stars.
RAGFlow — Deep Document Understanding RAG Engine
Open-source RAG engine with deep document understanding. Parses complex PDFs, tables, images. Agent-powered Q&A with citations. Multi-model. 77K+ stars.