RAG + Eval Stack for ML Engineers
Ten production picks for the ML engineer shipping a real RAG system: chunking, embedding server, vector DB (pgvector + Qdrant), retrieval frameworks, reranker, eval, drift monitoring, tracing. No eval, no progress.
What's in this pack
This is the stack you build when the demo RAG worked, the stakeholders got excited, and now you have to put it in front of real users without it hallucinating into a lawsuit. Every pick here is production-grade, actively maintained, and represents a layer of the pipeline an ML engineer will absolutely have to own — not a glue-code library that papers over the hard parts.
This pack is deliberately different from the existing rag-pipelines pack on TokRepo. That one is an end-user RAG framework round-up (Quivr, RAGFlow, GraphRAG, Kotaemon — useful if you want a finished app). This one is the infra layer underneath: the components you wire together when no off-the-shelf RAG framework gives you the latency, control, or eval rigor your team needs.
The through-line is the same painful realization most ML teams hit around week 3 of shipping RAG: the demo metric is irrelevant; the only number that matters is faithfulness + answer-relevance on a real eval set, measured before and after every change. Half the picks here exist to make that loop fast.
Install in this order (chunking → embeddings → vector store → retrieval → eval → trace)
- Unstructured — document ETL. Start here because garbage in still equals garbage out. Unstructured handles PDFs with tables, scanned forms, HTML, .docx, .pptx, .eml. It returns clean chunks with element-level metadata (`Title`, `NarrativeText`, `Table`), which becomes filter and rerank signal downstream.
- Text Embeddings Inference (Hugging Face) — your embedding server. Self-hosted, low-latency, batched, supports BGE / E5 / GTE / Jina / Nomic out of the box. Run it on one GPU, every downstream service POSTs to it. Don't call OpenAI's embedding API from 12 microservices.
- Sentence Transformers — the model library behind most of the embeddings worth running. You'll use it for offline batch embedding, training your own domain-tuned model, and benchmarking BGE-large vs E5 vs nomic-embed on your corpus (which is the only benchmark that matters).
- pgvector — vector store option A. If you already have Postgres, the cheapest correct answer for under ~50M vectors is pgvector with HNSW. One database, one backup story, transactional inserts, joins to your existing metadata tables. Don't add a separate vector DB until pgvector actually breaks for you.
- Qdrant — vector store option B. When pgvector stops scaling (filtered queries over 100M+ vectors, hybrid search at low latency, dynamic schema), Qdrant is the open-source upgrade path. Rust core, payload-filtering at index time, sharded clusters, Apache 2.0 license.
- Haystack — production RAG and agent framework. Pipeline-graph abstraction, every component swappable, async-native. This is what you reach for when LangChain feels like it's fighting you and you want explicit DAGs you can trace and test.
- LlamaIndex — data framework for LLM apps. Strong at the ingestion and retrieval side: 150+ data loaders, query engines that compose (router → sub-question → response synth), and `LlamaParse` for hard PDFs. Pair it with Haystack or use it solo.
- Cohere Rerank — the cheapest +10–20 point jump in retrieval quality you will ever ship. Retrieve top-50 with bi-encoder, rerank to top-5 with a cross-encoder. Almost every production RAG team that started without a reranker added one within a quarter.
- Embedding Drift Monitoring — retrieval regression runbook. When the same query returns different docs two months later because your embedding model was silently re-quantized or the document distribution shifted, you need a drift dashboard. This is the production runbook for catching it.
- Arize Phoenix — open-source AI observability + evaluation. OpenInference-compatible tracing for every LLM call and retrieval step, plus an evaluation framework that runs LLM-as-judge against your test set on every commit. The tracing + eval loop is non-negotiable; Phoenix is the open way to do both in one tool.
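The first step in that order, element-aware chunking, can be sketched in plain Python. This is a minimal illustration, not Unstructured's API: it assumes each parsed element has already been turned into a dict with hypothetical `category` and `text` keys (mirroring the `Title` / `NarrativeText` / `Table` categories mentioned above), and it uses a crude whitespace token count where a real pipeline would use the embedding model's tokenizer.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    heading: str
    text: str
    categories: list[str] = field(default_factory=list)


def approx_tokens(text: str) -> int:
    # Crude stand-in for a tokenizer: counts whitespace-separated words.
    return len(text.split())


def chunk_by_heading(elements: list[dict], max_tokens: int = 1200) -> list[Chunk]:
    """Group elements under their nearest Title; split when the budget is exceeded."""
    chunks: list[Chunk] = []
    heading = ""
    parts: list[str] = []
    cats: list[str] = []

    def flush() -> None:
        nonlocal parts, cats
        if parts:
            chunks.append(Chunk(heading, "\n".join(parts), cats))
        parts, cats = [], []

    for el in elements:
        if el["category"] == "Title":
            flush()               # close the chunk that belongs to the previous heading
            heading = el["text"]
            continue
        current = approx_tokens("\n".join(parts))
        if parts and current + approx_tokens(el["text"]) > max_tokens:
            flush()               # budget exceeded: start a new chunk, same heading
        parts.append(el["text"])
        cats.append(el["category"])
    flush()
    return chunks
```

Keeping the heading and the element categories on each chunk is what lets you filter ("only chunks containing a `Table`") or feed the heading to the reranker later.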
How they fit together (production RAG pipeline)
┌─────────────────────────────────────────────────────────────┐
│  INGESTION                                                  │
│  Unstructured ──► chunks + element metadata                 │
│        │                                                    │
│        ▼                                                    │
│  EMBEDDINGS                                                 │
│  Text Embeddings Inference (server)                         │
│        ▲  (model = Sentence Transformers / BGE)             │
│        │                                                    │
│        ▼                                                    │
│  VECTOR STORE                                               │
│  pgvector (≤50M)  OR  Qdrant (>50M, hybrid, filtered)       │
└─────────────────────────────────────────────────────────────┘
                               │
                               ▼
┌─────────────────────────────────────────────────────────────┐
│  QUERY-TIME                                                 │
│  LlamaIndex / Haystack ──► retrieve top-50                  │
│        │                                                    │
│        ▼                                                    │
│  Cohere Rerank ──► top-5 cross-encoder rerank               │
│        │                                                    │
│        ▼                                                    │
│  LLM (your generator) ──► answer                            │
└─────────────────────────────────────────────────────────────┘
                               │
                               ▼
┌─────────────────────────────────────────────────────────────┐
│  OBSERVABILITY                                              │
│  Arize Phoenix tracing ◄── every span (retrieve, rerank)    │
│  Phoenix evals (LLM-as-judge) ◄── runs on eval set per PR   │
│  Embedding Drift Monitoring ◄── nightly cron, alerts        │
└─────────────────────────────────────────────────────────────┘
The split is deliberate: ingestion is a batch job (Unstructured → TEI → vector store), query-time is a hot path (vector store → LlamaIndex/Haystack → Rerank → LLM), and observability wraps everything. Without that third box, you cannot tell whether last week's prompt tweak helped or quietly regressed faithfulness by 8%.
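To make the hot path concrete, here is a toy sketch of the query-time box with a span recorded per step. It is a stand-in under stated assumptions, not any framework's API: `vectors` and `texts` are in-memory dicts instead of a vector store, and `rerank_score` and `generate` are hypothetical callables you would back with a cross-encoder and your LLM. In a real stack the spans would go to a tracer such as Phoenix rather than a list.

```python
import math
import time


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def answer_query(query, query_vec, vectors, texts, rerank_score, generate,
                 top_n=50, top_k=5):
    """Retrieve top_n by cosine similarity, rerank to top_k, then generate."""
    spans: list[dict] = []

    def traced(name, fn):
        # Record wall-clock time per stage so regressions are attributable.
        start = time.perf_counter()
        result = fn()
        spans.append({"name": name, "ms": (time.perf_counter() - start) * 1000})
        return result

    candidates = traced("retrieve", lambda: sorted(
        vectors, key=lambda d: cosine(query_vec, vectors[d]), reverse=True)[:top_n])
    top_docs = traced("rerank", lambda: sorted(
        candidates, key=lambda d: rerank_score(query, texts[d]), reverse=True)[:top_k])
    answer = traced("generate", lambda: generate(query, [texts[d] for d in top_docs]))
    return answer, top_docs, spans
```

Returning the spans alongside the answer keeps the sketch testable; the point is that every stage is timed and named from day one.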
Tradeoffs you'll hit
- pgvector vs Pinecone vs Qdrant — Postgres pgvector wins on operational simplicity (one DB, one backup, joins to existing metadata) and is genuinely fine to roughly 10–50M vectors with HNSW. Pinecone wins on "I don't want to run infra" and elastic scale, but costs add up fast and lock you in. Qdrant wins when you need filtered hybrid search at large scale and want to self-host. Default to pgvector. Switch to Qdrant when filtered latency degrades. Reach for Pinecone only when ops capacity is the binding constraint.
- OpenAI embeddings vs OSS (BGE / E5 / Nomic) — OpenAI `text-embedding-3-large` is strong on general English and trivially easy. OSS embeddings via Text Embeddings Inference cost ~10x less at volume, run offline, and let you fine-tune on your domain. The decision usually comes down to: do you have an eval set good enough to A/B them? If yes, OSS often wins. If no, start with OpenAI and build the eval set.
- Haystack vs LlamaIndex vs LangChain — Haystack: explicit pipeline graphs, easier to test, slightly more verbose. LlamaIndex: stronger on ingestion + retrieval composition, weaker abstractions for full agent loops. LangChain: maximum surface area, fastest prototyping, but many production teams eventually refactor away from it. Most mature stacks end up with LlamaIndex for ingestion and retrieval, plus either Haystack or plain Python for orchestration.
- Reranker latency — Cohere Rerank adds 100–250ms. Almost always worth it. If you can't afford that, run a smaller open-source reranker (BGE-reranker-base) on your own GPU.
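For the "default to pgvector" path, the moving parts are small enough to show in full. The sketch below holds the SQL as Python strings so it runs without a database. The table name `chunks`, the column names, and the 768 dimension are assumptions for illustration; the `%(name)s` placeholders assume a psycopg-style driver. The `hnsw` index type, `vector_cosine_ops`, and the `<=>` cosine-distance operator are pgvector's own.

```python
def to_vector_literal(embedding: list[float]) -> str:
    """Format a Python list as pgvector's text input, e.g. '[0.1,0.2,0.3]'."""
    return "[" + ",".join(repr(float(x)) for x in embedding) + "]"


# Hypothetical schema: adjust names and the dimension to your embedding model.
SCHEMA_SQL = """
CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE IF NOT EXISTS chunks (
    id bigserial PRIMARY KEY,
    doc_id text NOT NULL,
    content text NOT NULL,
    embedding vector(768)
);
CREATE INDEX IF NOT EXISTS chunks_embedding_hnsw
    ON chunks USING hnsw (embedding vector_cosine_ops);
"""

# Top-N candidates by cosine distance; pass q=to_vector_literal(query_embedding).
SEARCH_SQL = """
SELECT id, doc_id, content
FROM chunks
ORDER BY embedding <=> %(q)s::vector
LIMIT %(limit)s;
"""
```

Because it is ordinary SQL, adding a `WHERE` clause or a join to your existing metadata tables is the same query you already know how to write, which is most of the operational-simplicity argument.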
Common pitfalls
- Chunking too aggressively — 512-token chunks with 50-token overlap is the default and it's usually wrong. For Q&A over technical docs, larger semantic chunks (1000–1500 tokens, split on heading boundaries with Unstructured's element metadata) consistently outperform. Measure on your eval set, don't guess.
- No eval set = no progress — the most common failure mode. Without 50–200 hand-labelled query/expected-context pairs, every change is vibes-based. Build the eval set in week one. Update it whenever a real user reports a bad answer. This is the single highest-ROI engineering investment in any RAG project.
- Embedding model change without re-indexing — silently swapping `text-embedding-ada-002` for `text-embedding-3-small` leaves your stored vectors in a different space from your new query vectors, so similarity scores between them are meaningless. Version your embeddings. Re-index when you swap.
- Single retrieval strategy — pure dense retrieval misses lexical queries (product SKUs, error codes, version numbers). Add BM25/keyword as a parallel retriever and merge. Both LlamaIndex and Haystack support this in a few lines.
- No reranker — bi-encoders are fast and lossy. A cross-encoder rerank over the top-50 candidates is the single most reliable quality lift on the entire RAG stack. Skipping it because "it costs latency" is a false economy when faithfulness is the bottleneck.
- Tracing as an afterthought — installing Phoenix after you have a quality problem is 5x harder than installing it on day one. Put it in before the first stakeholder demo.
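On the "add BM25 as a parallel retriever and merge" pitfall: the text above does not say how to merge. One common choice is Reciprocal Rank Fusion (RRF), sketched here as an illustration rather than as what LlamaIndex or Haystack do internally. It only needs the ranked ID lists from each retriever, so score scales never have to be reconciled. The constant `k = 60` is the conventional default and is an assumption you should tune on your eval set.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked ID lists: each list contributes 1 / (k + rank) per document."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for deterministic output.
    return sorted(scores, key=lambda d: (-scores[d], d))


dense_hits = ["a", "b", "c"]   # hypothetical IDs from the vector store
bm25_hits = ["c", "a", "d"]    # hypothetical IDs from a keyword index
merged = reciprocal_rank_fusion([dense_hits, bm25_hits])
```

Documents that both retrievers agree on float to the top, while a SKU or error code found only by BM25 still makes the candidate list that goes to the reranker.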
Frequently asked questions
How is this pack different from the existing `rag-pipelines` pack on TokRepo?
rag-pipelines is a framework round-up — Quivr, RAGFlow, GraphRAG, Kotaemon, Verba — the picks you reach for when you want a finished RAG app to deploy. This pack is the infra layer underneath: chunking (Unstructured), an embedding server you run yourself (Text Embeddings Inference), a vector store you operate (pgvector / Qdrant), a reranker, drift monitoring, and an observability layer. Different audience, zero overlapping workflow IDs. Pair them: pick a framework from rag-pipelines, then come here for the components when you need to take it past the demo.
Do I really need both pgvector AND Qdrant?
No, pick one. The pack lists both because the answer genuinely depends on scale and existing infra. If you already run Postgres and have fewer than ~50M vectors, pgvector with HNSW indexes is the correct answer and adding a second DB is gratuitous complexity. If you need filtered hybrid search over hundreds of millions of vectors at low latency, Qdrant earns its operational cost. Start pgvector, switch only when a benchmark on your real workload shows it can't keep up.
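"A benchmark on your real workload" can start as small as this. The sketch times a search callable over a list of real queries and reports latency percentiles. `search_fn` is a hypothetical wrapper around whatever store you are testing, and the warm-up count is an arbitrary assumption.

```python
import statistics
import time


def latency_percentiles(search_fn, queries: list[str], warmup: int = 3) -> dict[str, float]:
    """Return p50/p95/p99 latency in milliseconds for search_fn over queries."""
    if len(queries) < 2:
        raise ValueError("need at least two queries to compute percentiles")
    for q in queries[:warmup]:
        search_fn(q)  # warm caches and connections before measuring
    samples = []
    for q in queries:
        start = time.perf_counter()
        search_fn(q)
        samples.append((time.perf_counter() - start) * 1000)
    # 99 cut points; index i holds the (i + 1)th percentile.
    cuts = statistics.quantiles(samples, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}
```

Run it with the filters your users actually apply, since filtered latency is usually what degrades first, and compare the p95 against your budget before deciding to migrate.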
Why no LangChain in this pack?
LangChain is fine for prototyping, and most teams have it in their first RAG repo. The pack reflects the picks that mature ML teams actually keep in production — Haystack's explicit pipelines and LlamaIndex's retrieval composition consistently win on testability and maintainability once a project crosses about 6 months. Use whatever gets you to a working prototype. If you already ship with LangChain and it works, leave it; if you're greenfield in 2026, the picks here will age better.
Is a reranker really worth the latency?
Yes, almost always. A bi-encoder vector search returns top-50 candidates in roughly 20ms but is lossy because it compares precomputed independent representations. A cross-encoder reranker re-scores each query/doc pair jointly and routinely lifts NDCG@5 by 10–20 points. Cohere Rerank adds ~100–250ms, which is the cheapest quality jump on the entire stack. If 100ms is a deal-breaker, run a small open reranker (BGE-reranker-base) on your own GPU to cut out the network round trip, and measure the latency against your budget. Either way, ship a reranker.
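Since the claim is stated in NDCG@5, it helps to see how small the metric is. A minimal sketch, assuming you have graded relevance labels for every candidate in the returned list (the ideal ordering is computed from those same labels, which is a simplification when relevant documents were never retrieved):

```python
import math


def dcg_at_k(relevances: list[float], k: int) -> float:
    # Position i (0-based) is discounted by log2(i + 2): rank 1 gets full weight.
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances: list[float], k: int = 5) -> float:
    """DCG of the given order divided by DCG of the best possible order."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


# Hypothetical labels for one query: 1 = relevant, 0 = not, in returned order.
before_rerank = [0, 1, 0, 0, 1, 0]
after_rerank = [1, 1, 0, 0, 0, 0]
gain = ndcg_at_k(after_rerank) - ndcg_at_k(before_rerank)
```

Average this over the eval set with and without the reranker and you have the number to weigh against the added latency.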
What's the smallest viable eval set I can start with?
Fifty queries with hand-written ideal contexts and acceptable answers is enough to measure faithfulness, answer-relevance, and context-precision meaningfully. Build it in week one before you tune anything else. Then bring in tools like Ragas, Phoenix Evals, or DeepEval to automate the scoring loop. Every real user complaint becomes one more eval case. By month three you'll have 200–500 cases and every PR runs them in CI — that's the loop that actually drives RAG quality.
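A starter harness for that loop fits in a few dozen lines. The sketch below covers only the retrieval side, with simple set-based precision and recall over expected context IDs. These are deliberately simplified and are not the Ragas or Phoenix definitions of context-precision; faithfulness and answer-relevance need an LLM judge and are out of scope here. `EvalCase` and `retrieve_fn` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalCase:
    query: str
    expected_context_ids: frozenset[str]


def context_precision(retrieved: list[str], expected: frozenset[str]) -> float:
    # Share of retrieved chunks that a human marked as relevant.
    if not retrieved:
        return 0.0
    return sum(1 for d in retrieved if d in expected) / len(retrieved)


def context_recall(retrieved: list[str], expected: frozenset[str]) -> float:
    # Share of the hand-labelled relevant chunks that were actually retrieved.
    if not expected:
        return 0.0
    return len(set(retrieved) & expected) / len(expected)


def run_eval(cases: list[EvalCase], retrieve_fn, k: int = 5) -> dict[str, float]:
    """Average precision and recall at k across the eval set."""
    if not cases:
        raise ValueError("eval set is empty")
    precisions, recalls = [], []
    for case in cases:
        retrieved = retrieve_fn(case.query)[:k]
        precisions.append(context_precision(retrieved, case.expected_context_ids))
        recalls.append(context_recall(retrieved, case.expected_context_ids))
    n = len(cases)
    return {"precision": sum(precisions) / n, "recall": sum(recalls) / n}
```

Store the cases as a versioned file next to the code, run `run_eval` in CI, and fail the PR when either number drops below the last accepted baseline.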