Scripts · Mar 31, 2026 · 2 min read

Ragas — Evaluate RAG & LLM Applications

Ragas evaluates LLM applications with objective metrics, test data generation, and data-driven insights. 13.2K+ GitHub stars. RAG evaluation, auto test generation. Apache 2.0.

TL;DR
Ragas evaluates RAG pipelines with metrics like faithfulness, relevancy, and context precision, plus auto test data generation.
§01

What it is

Ragas is a Python framework for evaluating Retrieval-Augmented Generation (RAG) pipelines and LLM applications. It provides objective metrics such as faithfulness, answer relevancy, context precision, and context recall. Ragas also generates synthetic test datasets so you can evaluate your RAG pipeline without manually curating question-answer pairs.

Ragas targets ML engineers and developers building production RAG systems who need quantitative evaluation beyond manual spot-checking. If you run a retrieval pipeline feeding context to an LLM, Ragas tells you how well the system performs across measurable dimensions.

§02

How it saves time or tokens

Manually evaluating RAG outputs means reading hundreds of responses and judging quality by hand. Ragas automates this with LLM-as-judge metrics that score each response on multiple axes, and its synthetic test data generator creates diverse question-answer pairs from your documents, eliminating the hours spent crafting test sets. The pip install command and evaluation script in this workflow are ready to run.

§03

How to use

  1. Install Ragas:
pip install ragas
  2. Prepare your evaluation dataset with questions, ground truth answers, retrieved contexts, and LLM-generated answers.
  3. Run the evaluation:
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision
from datasets import Dataset

eval_dataset = Dataset.from_dict({
    'question': ['What is RAG?'],
    'answer': ['RAG combines retrieval with generation...'],
    'contexts': [['Retrieval-augmented generation...']],
    'ground_truth': ['RAG is a technique that...']
})

result = evaluate(eval_dataset, metrics=[faithfulness, answer_relevancy, context_precision])
print(result)
§04

Example

# Generate synthetic test data from your documents
from ragas.testset import TestsetGenerator
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

generator = TestsetGenerator.from_langchain(
    generator_llm=ChatOpenAI(model='gpt-4o'),
    critic_llm=ChatOpenAI(model='gpt-4o'),
    # the 0.1-style API also expects an embeddings model
    embeddings=OpenAIEmbeddings()
)

# Load your documents
from langchain_community.document_loaders import DirectoryLoader
loader = DirectoryLoader('./docs/', glob='**/*.md')
documents = loader.load()

testset = generator.generate_with_langchain_docs(
    documents, test_size=20
)
print(testset.to_pandas())
§05

Related on TokRepo

  • RAG tools -- Other retrieval-augmented generation frameworks and utilities
  • Testing tools -- Automated testing and quality assurance for AI applications
§06

Common pitfalls

  • Running evaluation without setting an OpenAI API key. Ragas uses LLM-as-judge by default, which requires an LLM provider configured via environment variables.
  • Using too few test samples leads to unreliable metric scores. Aim for at least 50 question-answer pairs for statistically meaningful results.
  • Confusing context precision with context recall. Precision measures how much of the retrieved context is relevant; recall measures how much relevant context was retrieved.
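
The distinction can be illustrated with a toy set-based sketch (hypothetical chunk names; the real Ragas metrics are LLM-judged rather than computed over sets):

```python
# Toy illustration of context precision vs. context recall.
retrieved = {"chunk_a", "chunk_b", "chunk_c"}  # what the retriever returned
relevant = {"chunk_a", "chunk_d"}              # what answering actually required

# Precision: of what was retrieved, how much was relevant?
precision = len(retrieved & relevant) / len(retrieved)
# Recall: of what was relevant, how much was retrieved?
recall = len(retrieved & relevant) / len(relevant)

print(f"precision={precision:.2f}, recall={recall:.2f}")  # precision=0.33, recall=0.50
```

Low precision means the retriever pads the context with noise; low recall means it misses information the answer needs.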

Frequently Asked Questions

What metrics does Ragas provide?

Ragas offers faithfulness (does the answer stick to the context), answer relevancy (is the answer relevant to the question), context precision (how much retrieved context is relevant), and context recall (how much relevant information was retrieved). Additional metrics cover aspect critique and harmfulness.
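
As a rough intuition for the faithfulness methodology, here is a simplified sketch in which a naive substring check stands in for the LLM judge (the real implementation decomposes the answer into statements and asks an LLM whether each is supported by the context):

```python
def toy_faithfulness(answer_claims, context):
    """Fraction of answer claims supported by the context.
    A naive substring match stands in for the LLM judge used by Ragas."""
    supported = sum(1 for claim in answer_claims if claim.lower() in context.lower())
    return supported / len(answer_claims)

context = "RAG retrieves documents and feeds them to a generator."
claims = ["RAG retrieves documents", "RAG was invented in 1990"]
print(toy_faithfulness(claims, context))  # 0.5 -- one of two claims is grounded
```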

Does Ragas require an LLM API key?

Yes, by default. Ragas uses an LLM as a judge to compute metrics like faithfulness and relevancy. You need an OpenAI API key or a compatible provider set via environment variables. Some metrics can run without an LLM, but the core metrics require one.
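
For example (placeholder key; the default judge reads it from the environment before evaluate() runs):

```shell
# Set before running the evaluation script
export OPENAI_API_KEY="sk-..."
```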

Can I use Ragas with non-OpenAI models?

Yes. Ragas supports any LLM provider through LangChain integrations. You can use Anthropic, Google, or local models via Ollama. Pass the appropriate LangChain LLM object when initializing the evaluator.

How does synthetic test data generation work?

Ragas reads your source documents and uses an LLM to generate diverse question-answer pairs with varying difficulty levels. It creates simple, multi-context, and reasoning questions. This eliminates the need for manual test set curation.

How many test samples do I need for reliable evaluation?

At least 50 question-answer pairs are recommended for statistically meaningful results. For production evaluation, 100-200 samples across different document sections provide more robust insights into pipeline performance.
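
A back-of-envelope standard-error calculation (the score spread here is illustrative, not a figure from the Ragas docs) shows why small test sets give noisy means:

```python
import math

sd = 0.3  # assumed per-sample metric score spread, purely illustrative
for n in (10, 50, 200):
    se = sd / math.sqrt(n)  # standard error of the mean metric score
    print(f"n={n:>3}: mean pinned down to roughly ±{2 * se:.2f}")
```

At 10 samples the mean score is only known to within roughly ±0.19, which can swamp the difference between two pipeline variants; at 200 it tightens to about ±0.04.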

Citations (3)
  • Ragas GitHub — Ragas provides objective metrics for RAG evaluation including faithfulness and r…
  • Ragas Documentation — Ragas supports synthetic test data generation from source documents
  • Ragas Metrics Docs — LLM-as-judge evaluation methodology
🙏

Source & Thanks

Created by Exploding Gradients. Apache 2.0. explodinggradients/ragas — 13,200+ GitHub stars
