TOKREPO · AI Arsenal

Stable

RAG + Eval Stack for ML Engineers

Ten production picks for the ML engineer shipping a real RAG system: chunking, an embedding server, vector DBs (pgvector + Qdrant), retrieval frameworks, a reranker, eval, drift monitoring, and tracing. No eval, no progress.

10 resources

About this pack

What's in this pack

This is the stack you build when the demo RAG worked, the stakeholders got excited, and now you have to put it in front of real users without it hallucinating into a lawsuit. Every pick here is production-grade, actively maintained, and represents a layer of the pipeline an ML engineer will absolutely have to own — not a glue-code library that papers over the hard parts.

This pack is deliberately different from the existing rag-pipelines pack on TokRepo. That one is an end-user RAG framework round-up (Quivr, RAGFlow, GraphRAG, Kotaemon — useful if you want a finished app). This one is the infra layer underneath: the components you wire together when no off-the-shelf RAG framework gives you the latency, control, or eval rigor your team needs.

The through-line is the same painful realization most ML teams hit around week 3 of shipping RAG: the demo metric is irrelevant; the only number that matters is faithfulness + answer-relevance on a real eval set, measured before and after every change. Half the picks here exist to make that loop fast.

Install in this order (chunking → embeddings → vector store → retrieval → rerank → drift monitoring → tracing + eval)

Unstructured — document ETL. Start here because garbage in still equals garbage out. Unstructured handles PDFs with tables, scanned forms, HTML, .docx, .pptx, .eml. It returns clean chunks with element-level metadata (Title, NarrativeText, Table), which becomes filter and rerank signal downstream.
Text Embeddings Inference (Hugging Face) — your embedding server. Self-hosted, low-latency, batched, supports BGE / E5 / GTE / Jina / Nomic out of the box. Run it on one GPU and have every downstream service POST to it. Don't call OpenAI's embedding API from 12 microservices.
Sentence Transformers — the model library behind most of the embeddings worth running. You'll use it for offline batch embedding, training your own domain-tuned model, and benchmarking BGE-large vs E5 vs nomic-embed on your corpus (which is the only benchmark that matters).
pgvector — vector store option A. If you already have Postgres, the cheapest correct answer for under ~50M vectors is pgvector with HNSW. One database, one backup story, transactional inserts, joins to your existing metadata tables. Don't add a separate vector DB until pgvector actually breaks for you.
Qdrant — vector store option B. When pgvector stops scaling (filtered queries over 100M+ vectors, hybrid search at low latency, dynamic schema), Qdrant is the open-source upgrade path. Rust core, payload-filtering at index time, sharded clusters, Apache 2.0.
Haystack — production RAG and agent framework. Pipeline-graph abstraction, every component swappable, async-native. This is what you reach for when LangChain feels like it's fighting you and you want explicit DAGs you can trace and test.
LlamaIndex — data framework for LLM apps. Strong at the ingestion and retrieval side: 150+ data loaders, query engines that compose (router → sub-question → response synth), and LlamaParse for hard PDFs. Pair it with Haystack or use it solo.
Cohere Rerank — the cheapest +10–20 point jump in retrieval quality (NDCG@5) you will ever ship. Retrieve the top 50 with a bi-encoder, then rerank to the top 5 with a cross-encoder. Almost every production RAG team that started without a reranker added one within a quarter.
Embedding Drift Monitoring — retrieval regression runbook. When the same query returns different docs two months later because your embedding model was silently re-quantized or the document distribution shifted, you need a drift dashboard. This is the production runbook for catching it.
Arize Phoenix — open-source AI observability + evaluation. OpenInference-compatible tracing for every LLM call and retrieval step, plus an evaluation framework that runs LLM-as-judge against your test set on every commit. The tracing + eval loop is non-negotiable; Phoenix is the open way to do both in one tool.
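The element metadata from the first pick is what makes heading-aware chunking possible. Below is a minimal sketch of that idea, assuming elements arrive as (type, text) pairs shaped like the Title / NarrativeText / Table categories mentioned above. The `Element` tuple, `chunk_by_heading`, and the whitespace-based token count are hypothetical stand-ins for illustration, not Unstructured's API.

```python
from typing import NamedTuple


class Element(NamedTuple):
    """Hypothetical stand-in for one parsed document element."""
    type: str  # e.g. "Title", "NarrativeText", "Table"
    text: str


def chunk_by_heading(elements: list[Element], max_tokens: int = 1200) -> list[dict]:
    """Group elements into chunks that start at each Title element.

    A chunk is also closed early when adding the next element would push it
    past max_tokens (tokens are approximated by whitespace-separated words).
    """
    chunks: list[dict] = []
    current: list[Element] = []
    heading = ""

    def flush() -> None:
        # Emit the pending elements as one chunk, keeping the element types
        # as metadata so they can serve as filter or rerank signal later.
        if current:
            chunks.append({
                "heading": heading,
                "text": "\n".join(e.text for e in current),
                "types": sorted({e.type for e in current}),
            })
            current.clear()

    for el in elements:
        size = sum(len(e.text.split()) for e in current)
        if el.type == "Title":
            flush()            # close the previous section first
            heading = el.text  # then start a new one under this heading
        elif size + len(el.text.split()) > max_tokens:
            flush()
        current.append(el)
    flush()
    return chunks


if __name__ == "__main__":
    parsed = [
        Element("Title", "Install"),
        Element("NarrativeText", "Run the installer."),
        Element("Title", "Usage"),
        Element("Table", "flag | meaning"),
        Element("NarrativeText", "See the table for flags."),
    ]
    for chunk in chunk_by_heading(parsed):
        print(chunk["heading"], chunk["types"])
```

Whether 1,200 tokens is the right ceiling is exactly the kind of question to settle on your eval set rather than by default.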

How they fit together (production RAG pipeline)

┌─────────────────────────────────────────────────────────────┐
│  INGESTION                                                  │
│   Unstructured  ──►  chunks + element metadata              │
│        │                                                    │
│        ▼                                                    │
│  EMBEDDINGS                                                 │
│   Text Embeddings Inference (server)                        │
│        ▲           (model = Sentence Transformers / BGE)    │
│        │                                                    │
│        ▼                                                    │
│  VECTOR STORE                                               │
│   pgvector (≤50M)   OR   Qdrant (>50M, hybrid, filtered)    │
└─────────────────────────────────────────────────────────────┘
                          │
                          ▼
┌─────────────────────────────────────────────────────────────┐
│  QUERY-TIME                                                 │
│   LlamaIndex / Haystack  ──►  retrieve top-50               │
│        │                                                    │
│        ▼                                                    │
│   Cohere Rerank  ──►  top-5 cross-encoder rerank            │
│        │                                                    │
│        ▼                                                    │
│   LLM (your generator) ──► answer                           │
└─────────────────────────────────────────────────────────────┘
                          │
                          ▼
┌─────────────────────────────────────────────────────────────┐
│  OBSERVABILITY                                              │
│   Arize Phoenix tracing  ◄── every span (retrieve, rerank)  │
│   Phoenix evals (LLM-as-judge)  ◄── runs on eval set per PR │
│   Embedding Drift Monitoring   ◄── nightly cron, alerts     │
└─────────────────────────────────────────────────────────────┘

The split is deliberate: ingestion is a batch job (Unstructured → TEI → vector store), query-time is a hot path (vector store → LlamaIndex/Haystack → Rerank → LLM), and observability wraps everything. Without that third box, you cannot tell whether last week's prompt tweak helped or quietly regressed faithfulness by 8%.
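The nightly drift job in the observability box can be sketched with nothing more than golden queries and two ranked result lists per query. This is a rough illustration, assuming results are lists of document ids. The function names and the `min_recall` / `max_rank_delta` thresholds are hypothetical and would need tuning against your own history.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant doc ids that appear in the top-k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def mean_rank_delta(baseline: list[str], current: list[str]) -> float:
    """Average absolute rank change for the docs in the baseline ranking.

    A doc that dropped out of the current results is treated as ranked
    just past the end of the current list.
    """
    if not baseline:
        return 0.0
    position = {doc: i for i, doc in enumerate(current)}
    missing = len(current)
    total = sum(abs(i - position.get(doc, missing)) for i, doc in enumerate(baseline))
    return total / len(baseline)


def drift_report(golden: dict[str, set[str]],
                 baseline: dict[str, list[str]],
                 current: dict[str, list[str]],
                 k: int = 5,
                 min_recall: float = 0.8,
                 max_rank_delta: float = 2.0) -> dict:
    """Compare tonight's results with a stored baseline over the golden queries."""
    recalls = [recall_at_k(current[q], rel, k) for q, rel in golden.items()]
    deltas = [mean_rank_delta(baseline[q][:k], current[q][:k]) for q in golden]
    recall = sum(recalls) / len(recalls)
    delta = sum(deltas) / len(deltas)
    return {
        "recall_at_k": recall,
        "rank_delta": delta,
        # Rollback gate: trip on lost recall or on heavy reshuffling.
        "rollback": recall < min_recall or delta > max_rank_delta,
    }


if __name__ == "__main__":
    golden = {"how do I rotate an API key": {"doc_keys", "doc_security"}}
    baseline = {"how do I rotate an API key": ["doc_keys", "doc_security", "doc_faq"]}
    tonight = {"how do I rotate an API key": ["doc_billing", "doc_faq", "doc_keys"]}
    print(drift_report(golden, baseline, tonight))
```

Storing the baseline rankings alongside the embedding model version makes it clear whether a tripped gate came from a model change or from a shift in the documents.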

Tradeoffs you'll hit

pgvector vs Pinecone vs Qdrant — Postgres pgvector wins on operational simplicity (one DB, one backup, joins to existing metadata) and is genuinely fine to roughly 10–50M vectors with HNSW. Pinecone wins on "I don't want to run infra" and elastic scale, but costs add up fast and lock you in. Qdrant wins when you need filtered hybrid search at large scale and want to self-host. Default to pgvector. Switch to Qdrant when filtered latency degrades. Reach for Pinecone only when ops capacity is the binding constraint.
OpenAI embeddings vs OSS (BGE / E5 / Nomic) — OpenAI text-embedding-3-large is strong on general English and trivially easy. OSS embeddings via Text Embeddings Inference cost ~10x less at volume, run offline, and let you fine-tune on your domain. The decision usually comes down to: do you have an eval set good enough to A/B them? If yes, OSS often wins. If no, start with OpenAI and build the eval set.
Haystack vs LlamaIndex vs LangChain — Haystack: explicit pipeline graphs, easier to test, slightly more verbose. LlamaIndex: stronger on ingestion + retrieval composition, weaker abstractions for full agent loops. LangChain: maximum surface area and fastest prototyping, but most production teams eventually refactor away from it. Most mature stacks end up with LlamaIndex for ingestion and retrieval, plus Haystack or plain Python for orchestration.
Reranker latency — Cohere Rerank adds 100–250ms. Almost always worth it. If you can't afford that, run a smaller open-source reranker (BGE-reranker-base) on your own GPU.

Common pitfalls

Chunking too aggressively — 512-token chunks with a 50-token overlap are the default, and that default is usually wrong. For Q&A over technical docs, larger semantic chunks (1000–1500 tokens, split on heading boundaries with Unstructured's element metadata) consistently outperform. Measure on your eval set, don't guess.
No eval set = no progress — the most common failure mode. Without 50–200 hand-labelled query/expected-context pairs, every change is vibes-based. Build the eval set in week one. Update it whenever a real user reports a bad answer. This is the single highest-ROI engineering investment in any RAG project.
Embedding model change without re-indexing — silently swapping text-embedding-ada-002 for text-embedding-3-small makes your old vectors meaningless: vectors from different models live in different spaces, so new query embeddings cannot be compared against the old index. Version your embeddings. Re-index when you swap.
Single retrieval strategy — pure dense retrieval misses lexical queries (product SKUs, error codes, version numbers). Add BM25/keyword as a parallel retriever and merge. Both LlamaIndex and Haystack support this with little extra code.
No reranker — bi-encoders are fast and lossy. A cross-encoder rerank over the top-50 candidates is the single most reliable quality lift on the entire RAG stack. Skipping it because "it costs latency" is a false economy when faithfulness is the bottleneck.
Tracing as an afterthought — installing Phoenix after you have a quality problem is 5x harder than installing it on day one. Put it in before the first stakeholder demo.
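For the "add BM25 as a parallel retriever and merge" advice, one common merge method is reciprocal rank fusion (RRF). The text above does not say which merge to use, so treat RRF as an assumption here. The sketch takes ranked lists of document ids from any number of retrievers; the function name and the sample ids are hypothetical.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of doc ids into one ranking.

    Each doc scores 1 / (k + rank) per list it appears in (rank starts at 1),
    so a doc ranked well by several retrievers rises to the top. k = 60 is
    the constant commonly used with RRF; ties are broken by doc id so the
    output is deterministic.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda d: (-scores[d], d))


if __name__ == "__main__":
    # Dense retrieval ranks the exact-SKU doc last; keyword search ranks it first.
    dense = ["doc_intro", "doc_faq", "doc_sku_123"]
    keyword = ["doc_sku_123", "doc_changelog"]
    print(reciprocal_rank_fusion([dense, keyword]))
```

Because RRF only looks at ranks, it avoids having to normalize cosine similarities against BM25 scores, which live on different scales.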

INSTALL · ONE COMMAND

$ tokrepo install pack/ml-engineer-rag-eval

hand it to your agent, or paste it into your terminal

What's inside

10 ready-to-install resources

MCP#01

Unstructured — Document ETL for LLM Pipelines

Extract clean data from PDFs, DOCX, HTML, images, and emails for RAG and LLM ingestion. 14K+ GitHub stars.

by MCP Hub·388 views

$ tokrepo install unstructured-document-etl-llm-pipelines-c2ba9909

Skill#02

Text Embeddings Inference — High-Performance Embedding Server by Hugging Face

A blazing-fast inference server for text embedding and reranking models. TEI serves any Sentence Transformers or cross-encoder model with optimized Rust and CUDA kernels, token-based dynamic batching, and an OpenAI-compatible API.

by Hugging Face·367 views

$ tokrepo install text-embeddings-inference-high-performance-embedding-server-19c58bfa

Skill#03

Sentence Transformers — State-of-the-Art Embeddings

Sentence Transformers computes text embeddings for semantic search, similarity, and reranking. 18.5K+ GitHub stars. 15,000+ pre-trained models, dense/sparse/reranker, multi-lingual. Apache 2.0.

by Script Depot·229 views

$ tokrepo install sentence-transformers-state-art-embeddings-596096ff

Skill#04

pgvector — Vector Similarity Search Inside PostgreSQL

A PostgreSQL extension that adds a native `vector` type, HNSW and IVFFlat indexes, and distance operators so semantic search, RAG and recommendation workloads can reuse the same database as the rest of the app.

by Script Depot·315 views

$ tokrepo install pgvector-vector-similarity-search-inside-postgresql-121fb0d5

Skill#05

Qdrant — High-Performance Vector Database

Vector database and search engine for AI applications. Handles billion-scale similarity search with filtering, sparse vectors, and multi-tenancy. Rust-powered. 30K+ stars.

by AI Open Source·330 views

$ tokrepo install qdrant-high-performance-vector-database-1566710d

Skill#06

Haystack — Production RAG & Agent Framework

Build composable AI pipelines for RAG, agents, and search. Model-agnostic, production-ready, by deepset. 18K+ stars.

by Skill Factory·262 views

$ tokrepo install haystack-production-rag-agent-framework-2126f372

Skill#07

LlamaIndex — Data Framework for LLM Applications

Connect your data to large language models. The leading framework for RAG, document indexing, knowledge graphs, and structured data extraction.

by Script Depot·311 views

$ tokrepo install llamaindex-data-framework-llm-applications-1bd234e2

Skill#08

Cohere Rerank — Boost RAG Accuracy with Rerank-3

Cohere Rerank scores candidates against a query using a cross-encoder. Drop into any RAG to boost top-1 hit rate by 30-50% over vector search alone.

by Cohere·268 views

$ tokrepo install cohere-rerank-boost-rag-accuracy-with-rerank-3

Skill#09

Embedding Drift Monitoring — Retrieval Regression Runbook

Embedding drift monitoring runbook for RAG and agent search. Uses golden queries, recall@K, rank delta, and rollback gates.

by henuwangkai·221 views

$ tokrepo install embedding-drift-monitoring-retrieval-regression-runbook-ea696ee5

Skill#10

Arize Phoenix — Open Source AI Observability and Evaluation

Arize Phoenix is an open-source platform for monitoring, evaluating, and debugging AI applications, providing tracing, experiment tracking, and automated evaluation for LLM and ML pipelines.

by Script Depot·245 views

$ tokrepo install arize-phoenix-open-source-ai-observability-evaluation-41cdac3f

Frequently asked questions

How is this pack different from the existing `rag-pipelines` pack on TokRepo?

rag-pipelines is a framework round-up — Quivr, RAGFlow, GraphRAG, Kotaemon, Verba — the picks you reach for when you want a finished RAG app to deploy. This pack is the infra layer underneath: chunking (Unstructured), an embedding server you run yourself (Text Embeddings Inference), a vector store you operate (pgvector / Qdrant), a reranker, drift monitoring, and an observability layer. Different audience, zero overlapping workflow IDs. Pair them: pick a framework from rag-pipelines, then come here for the components when you need to take it past the demo.

Do I really need both pgvector AND Qdrant?

No, pick one. The pack lists both because the answer genuinely depends on scale and existing infra. If you already run Postgres and have fewer than ~50M vectors, pgvector with HNSW indexes is the correct answer and adding a second DB is gratuitous complexity. If you need filtered hybrid search over hundreds of millions of vectors at low latency, Qdrant earns its operational cost. Start pgvector, switch only when a benchmark on your real workload shows it can't keep up.

Why no LangChain in this pack?

LangChain is fine for prototyping, and most teams have it in their first RAG repo. The pack reflects the picks that mature ML teams actually keep in production — Haystack's explicit pipelines and LlamaIndex's retrieval composition consistently win on testability and maintainability once a project crosses about 6 months. Use whatever gets you to a working prototype. If you already ship with LangChain and it works, leave it; if you're greenfield in 2026, the picks here will age better.

Is a reranker really worth the latency?

Yes, almost always. A bi-encoder vector search returns top-50 candidates in about 20ms but is lossy because it compares precomputed independent representations. A cross-encoder reranker re-scores each query/doc pair jointly and routinely lifts NDCG@5 by 10–20 points. Cohere Rerank adds ~100–250ms, which makes it the cheapest quality jump on the entire stack. If that latency is a deal-breaker, run a small open reranker (BGE-reranker-base) on your own GPU and target under 50ms. Either way, ship a reranker.
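The difference between the two stages is easier to see in a toy version. In the sketch below, `embed` is a bag-of-words stand-in for a bi-encoder (each text is vectorized independently) and `phrase_score` is a stand-in for a cross-encoder (it sees the query and the document together). Both are hypothetical simplifications, not real models. Only the two-stage shape carries over to production.

```python
import math
from collections import Counter
from typing import Callable


def embed(text: str) -> Counter:
    """Toy bi-encoder stand-in: a bag-of-words vector, computed per text in isolation."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query: str, index: dict[str, Counter], top_k: int) -> list[str]:
    """Stage 1: rank every doc by similarity between precomputed vectors."""
    q = embed(query)
    return sorted(index, key=lambda d: (-cosine(q, index[d]), d))[:top_k]


def phrase_score(query: str, doc: str) -> float:
    """Toy cross-encoder stand-in: counts query bigrams found verbatim in the doc."""
    q = query.lower().split()
    text = " " + " ".join(doc.lower().split()) + " "
    return float(sum(f" {a} {b} " in text for a, b in zip(q, q[1:])))


def rerank(query: str, candidates: list[str], docs: dict[str, str],
           score: Callable[[str, str], float], top_n: int) -> list[str]:
    """Stage 2: re-score each (query, doc) pair jointly and keep the best top_n."""
    return sorted(candidates, key=lambda d: (-score(query, docs[d]), d))[:top_n]


if __name__ == "__main__":
    docs = {
        "a": "link reset password",            # right words, wrong order
        "b": "click the password reset link",  # contains the actual phrase
        "c": "billing invoice export",
    }
    index = {doc_id: embed(text) for doc_id, text in docs.items()}
    query = "password reset link"
    candidates = retrieve(query, index, top_k=2)
    print(candidates)
    print(rerank(query, candidates, docs, phrase_score, top_n=1))
```

Stage 1 cannot tell documents "a" and "b" apart by word order, so it ranks "a" first. Stage 2 promotes "b" because it scores the pair jointly. In production the same shape holds with top-50 in and top-5 out.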

What's the smallest viable eval set I can start with?

Fifty queries with hand-written ideal contexts and acceptable answers is enough to measure faithfulness, answer-relevance, and context-precision meaningfully. Build it in week one before you tune anything else. Then bring in tools like Ragas, Phoenix Evals, or DeepEval to automate the scoring loop. Every real user complaint becomes one more eval case. By month three you'll have 200–500 cases and every PR runs them in CI — that's the loop that actually drives RAG quality.
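Before bringing in a scoring tool, the retrieval half of that loop can be sketched in plain Python. The example below assumes each eval case stores the hand-labelled context ids for a query. `EvalCase`, `evaluate`, `gate`, and the 0.02 tolerance are hypothetical names and values, and the metrics are simplified set-based versions rather than the definitions used by the tools named above. Faithfulness and answer-relevance need an LLM judge and are out of scope for this sketch.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    """One hand-labelled example: a query and the chunk ids that should be retrieved."""
    query: str
    expected_context_ids: set[str]


def context_precision(retrieved: list[str], expected: set[str]) -> float:
    """Share of retrieved chunks that are in the expected context."""
    if not retrieved:
        return 0.0
    return sum(doc in expected for doc in retrieved) / len(retrieved)


def context_recall(retrieved: list[str], expected: set[str]) -> float:
    """Share of the expected context that was actually retrieved."""
    if not expected:
        return 0.0
    return len(set(retrieved) & expected) / len(expected)


def evaluate(cases: list[EvalCase],
             retrieve: Callable[[str, int], list[str]],
             k: int = 5) -> dict[str, float]:
    """Average both metrics over the eval set for one retriever configuration."""
    precision = recall = 0.0
    for case in cases:
        retrieved = retrieve(case.query, k)
        precision += context_precision(retrieved, case.expected_context_ids)
        recall += context_recall(retrieved, case.expected_context_ids)
    n = len(cases)
    return {"context_precision": precision / n, "context_recall": recall / n}


def gate(baseline: dict[str, float], candidate: dict[str, float],
         tolerance: float = 0.02) -> list[str]:
    """Return the metrics where the candidate regressed by more than the tolerance."""
    return [m for m in baseline if baseline[m] - candidate.get(m, 0.0) > tolerance]


if __name__ == "__main__":
    cases = [EvalCase("q1", {"a", "b"}), EvalCase("q2", {"c"})]
    canned = {"q1": ["a", "x"], "q2": ["c", "y"]}
    scores = evaluate(cases, lambda q, k: canned[q][:k])
    print(scores)
    print(gate({"context_precision": 0.5, "context_recall": 0.9}, scores))
```

In CI, a non-empty list from `gate` would fail the PR, which is what turns the eval set into the before-and-after measurement described above.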
