[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"pack-detail-ml-engineer-rag-eval-en":3,"seo:pack:ml-engineer-rag-eval:en":101},{"code":4,"message":5,"data":6},200,"Operation successful",{"pack":7},{"slug":8,"icon":9,"tone":10,"status":11,"status_label":12,"title":13,"description":14,"items":15,"install_cmd":100},"ml-engineer-rag-eval","🧠","#8B5CF6","new","New · this week","ML Engineer's RAG + Eval Stack","Ten production picks for the ML engineer shipping a real RAG app: chunking, embedding server, vector DB (pgvector + Qdrant), retrieval frameworks, reranker, eval, drift monitoring, tracing. No eval = no progress.",[16,28,38,46,53,61,69,76,84,93],{"id":17,"uuid":18,"slug":19,"title":20,"description":21,"author_name":22,"view_count":23,"vote_count":24,"lang_type":25,"type":26,"type_label":27},439,"c2ba9909-f624-414f-8aeb-fbd95c50766e","unstructured-document-etl-llm-pipelines-c2ba9909","Unstructured — Document ETL for LLM Pipelines","Extract clean data from PDFs, DOCX, HTML, images, and emails for RAG and LLM ingestion. 14K+ GitHub stars.","MCP Hub",212,0,"en","mcp","MCP",{"id":29,"uuid":30,"slug":31,"title":32,"description":33,"author_name":34,"view_count":35,"vote_count":24,"lang_type":25,"type":36,"type_label":37},2513,"19c58bfa-45e0-11f1-9bc6-00163e2b0d79","text-embeddings-inference-high-performance-embedding-server-19c58bfa","Text Embeddings Inference — High-Performance Embedding Server by Hugging Face","A blazing-fast inference server for text embedding and reranking models. 
TEI serves any Sentence Transformers or cross-encoder model with optimized Rust and CUDA kernels, token-based dynamic batching, and an OpenAI-compatible API.","Hugging Face",134,"skill","Skill",{"id":39,"uuid":40,"slug":41,"title":42,"description":43,"author_name":44,"view_count":45,"vote_count":24,"lang_type":25,"type":36,"type_label":37},286,"596096ff-e0fb-41bd-a964-03817dafce9d","sentence-transformers-state-art-embeddings-596096ff","Sentence Transformers — State-of-the-Art Embeddings","Sentence Transformers computes text embeddings for semantic search, similarity, and reranking. 18.5K+ GitHub stars. 15,000+ pre-trained models, dense\u002Fsparse\u002Freranker, multi-lingual. Apache 2.0.","Script Depot",123,{"id":47,"uuid":48,"slug":49,"title":50,"description":51,"author_name":44,"view_count":52,"vote_count":24,"lang_type":25,"type":36,"type_label":37},1459,"121fb0d5-3920-11f1-9bc6-00163e2b0d79","pgvector-vector-similarity-search-inside-postgresql-121fb0d5","pgvector — Vector Similarity Search Inside PostgreSQL","A PostgreSQL extension that adds a native `vector` type, HNSW and IVFFlat indexes, and distance operators so semantic search, RAG and recommendation workloads can reuse the same database as the rest of the app.",164,{"id":54,"uuid":55,"slug":56,"title":57,"description":58,"author_name":59,"view_count":60,"vote_count":24,"lang_type":25,"type":36,"type_label":37},215,"1566710d-f5ed-46da-af8c-757475a10420","qdrant-high-performance-vector-database-1566710d","Qdrant — High-Performance Vector Database","Vector database and search engine for AI applications. Handles billion-scale similarity search with filtering, sparse vectors, and multi-tenancy. Rust-powered. 
30K+ stars.","AI Open Source",187,{"id":62,"uuid":63,"slug":64,"title":65,"description":66,"author_name":67,"view_count":68,"vote_count":24,"lang_type":25,"type":36,"type_label":37},407,"2126f372-519e-45bd-8817-69d70e061bb0","haystack-production-rag-agent-framework-2126f372","Haystack — Production RAG & Agent Framework","Build composable AI pipelines for RAG, agents, and search. Model-agnostic, production-ready, by deepset. 18K+ stars.","Skill Factory",147,{"id":70,"uuid":71,"slug":72,"title":73,"description":74,"author_name":44,"view_count":75,"vote_count":24,"lang_type":25,"type":36,"type_label":37},157,"1bd234e2-5c10-459f-91f4-00675625103b","llamaindex-data-framework-llm-applications-1bd234e2","LlamaIndex — Data Framework for LLM Applications","Connect your data to large language models. The leading framework for RAG, document indexing, knowledge graphs, and structured data extraction.",177,{"id":77,"uuid":78,"slug":79,"title":80,"description":81,"author_name":82,"view_count":83,"vote_count":24,"lang_type":25,"type":36,"type_label":37},2824,"bf323939-d2b6-4426-aa9f-9325666e7eaa","cohere-rerank-boost-rag-accuracy-with-rerank-3","Cohere Rerank — Boost RAG Accuracy with Rerank-3","Cohere Rerank scores candidates against a query using a cross-encoder. Drop into any RAG to boost top-1 hit rate by 30-50% over vector search alone.","Cohere",92,{"id":85,"uuid":86,"slug":87,"title":88,"description":89,"author_name":90,"view_count":91,"vote_count":24,"lang_type":92,"type":36,"type_label":37},4260,"ea696ee5-0736-48e3-a789-f5a026223bd0","embedding-drift-monitoring-retrieval-regression-runbook-ea696ee5","Embedding Drift Monitoring — Retrieval Regression Runbook","Embedding drift monitoring runbook for RAG and agent search. 
Uses golden queries, recall@K, rank delta, and rollback gates.","henuwangkai",40,"",{"id":94,"uuid":95,"slug":96,"title":97,"description":98,"author_name":44,"view_count":99,"vote_count":24,"lang_type":25,"type":36,"type_label":37},3576,"41cdac3f-4ea4-11f1-9bc6-00163e2b0d79","arize-phoenix-open-source-ai-observability-evaluation-41cdac3f","Arize Phoenix — Open Source AI Observability and Evaluation","Arize Phoenix is an open-source platform for monitoring, evaluating, and debugging AI applications, providing tracing, experiment tracking, and automated evaluation for LLM and ML pipelines.",55,"tokrepo install pack\u002Fml-engineer-rag-eval",{"pageType":102,"pageKey":8,"locale":25,"title":103,"metaDescription":104,"h1":105,"tldr":106,"bodyMarkdown":107,"faq":108,"schema":124,"internalLinks":129,"citations":142,"wordCount":155,"generatedAt":156},"pack","ML Engineer's RAG + Eval Stack — 10 Production Picks for Real LLM Apps","Unstructured, Text Embeddings Inference, Sentence Transformers, pgvector, Qdrant, Haystack, LlamaIndex, Cohere Rerank, Embedding Drift Monitoring, Arize Phoenix — the stack an ML engineer actually ships RAG with. Chunking through tracing, in install order.","ML Engineer's RAG + Eval Stack — Production Picks for Real LLM Apps","Ten production-tested picks ordered by the actual pipeline an ML engineer builds: chunking and ingestion, embedding server, embedding model, vector store (pgvector or Qdrant), retrieval framework, reranker, eval harness, drift monitoring, tracing. The lesson, learned the hard way: no eval, no progress.","## What's in this pack\n\nThis is the stack you build when the demo RAG worked, the stakeholders got excited, and now you have to put it in front of real users without it hallucinating into a lawsuit. 
Every pick here is **production-grade**, **actively maintained**, and represents a layer of the pipeline an ML engineer will absolutely have to own — not a glue-code library that papers over the hard parts.\n\nThis pack is **deliberately different** from the existing `rag-pipelines` pack on TokRepo. That one is an end-user RAG framework round-up (Quivr, RAGFlow, GraphRAG, Kotaemon — useful if you want a finished app). This one is the **infra layer underneath**: the components you wire together when no off-the-shelf RAG framework gives you the latency, control, or eval rigor your team needs.\n\nThe through-line is the same painful realization most ML teams hit around week 3 of shipping RAG: **the demo metric is irrelevant; the only number that matters is faithfulness + answer-relevance on a real eval set, measured before and after every change.** Half the picks here exist to make that loop fast.\n\n## Install in this order (chunking → embeddings → vector store → retrieval → eval → trace)\n\n1. **Unstructured** — document ETL. Start here because garbage in still equals garbage out. Unstructured handles PDFs with tables, scanned forms, HTML, .docx, .pptx, .eml. It returns clean chunks with element-level metadata (`Title`, `NarrativeText`, `Table`), which becomes filter and rerank signal downstream.\n2. **Text Embeddings Inference** (Hugging Face) — your embedding server. Self-hosted, low-latency, batched, supports BGE \u002F E5 \u002F GTE \u002F Jina \u002F Nomic out of the box. Run it on one GPU, every downstream service POSTs to it. Don't call OpenAI's embedding API from 12 microservices.\n3. **Sentence Transformers** — the model library behind most of the embeddings worth running. You'll use it for offline batch embedding, training your own domain-tuned model, and benchmarking BGE-large vs E5 vs nomic-embed on *your* corpus (which is the only benchmark that matters).\n4. **pgvector** — vector store option A. 
If you already have Postgres, the cheapest correct answer for under ~50M vectors is pgvector with HNSW. One database, one backup story, transactional inserts, joins to your existing metadata tables. Don't add a separate vector DB until pgvector actually breaks for you.\n5. **Qdrant** — vector store option B. When pgvector stops scaling (filtered queries over 100M+ vectors, hybrid search at low latency, dynamic schema), Qdrant is the open-source upgrade path. Rust core, payload-filtering at index time, sharded clusters, Apache 2.0.\n6. **Haystack** — production RAG and agent framework. Pipeline-graph abstraction, every component swappable, async-native. This is what you reach for when LangChain feels like it's fighting you and you want explicit DAGs you can trace and test.\n7. **LlamaIndex** — data framework for LLM apps. Strong at the *ingestion* and *retrieval* side: 150+ data loaders, query engines that compose (router → sub-question → response synth), and `LlamaParse` for hard PDFs. Pair it with Haystack or use it solo.\n8. **Cohere Rerank** — the cheapest +10–20 point jump in retrieval quality you will ever ship. Retrieve top-50 with bi-encoder, rerank to top-5 with a cross-encoder. Almost every production RAG team that started without a reranker added one within a quarter.\n9. **Embedding Drift Monitoring** — retrieval regression runbook. When the same query returns different docs two months later because your embedding model was silently re-quantized or the document distribution shifted, you need a drift dashboard. This is the production runbook for catching it.\n10. **Arize Phoenix** — open-source AI observability + evaluation. OpenInference-compatible tracing for every LLM call and retrieval step, plus an evaluation framework that runs LLM-as-judge against your test set on every commit. 
The tracing + eval loop is non-negotiable; Phoenix is the open way to do both in one tool.\n\n## How they fit together (production RAG pipeline)\n\n```\n┌─────────────────────────────────────────────────────────────┐\n│  INGESTION                                                  │\n│   Unstructured  ──►  chunks + element metadata              │\n│        │                                                    │\n│        ▼                                                    │\n│  EMBEDDINGS                                                 │\n│   Text Embeddings Inference (server)                        │\n│        ▲           (model = Sentence Transformers \u002F BGE)    │\n│        │                                                    │\n│        ▼                                                    │\n│  VECTOR STORE                                               │\n│   pgvector (≤50M)   OR   Qdrant (>50M, hybrid, filtered)    │\n└─────────────────────────────────────────────────────────────┘\n                          │\n                          ▼\n┌─────────────────────────────────────────────────────────────┐\n│  QUERY-TIME                                                 │\n│   LlamaIndex \u002F Haystack  ──►  retrieve top-50               │\n│        │                                                    │\n│        ▼                                                    │\n│   Cohere Rerank  ──►  top-5 cross-encoder rerank            │\n│        │                                                    │\n│        ▼                                                    │\n│   LLM (your generator) ──► answer                           │\n└─────────────────────────────────────────────────────────────┘\n                          │\n                          ▼\n┌─────────────────────────────────────────────────────────────┐\n│  OBSERVABILITY                                              │\n│   Arize Phoenix tracing  ◄── every span (retrieve, rerank)  │\n│   Phoenix evals 
(LLM-as-judge)  ◄── runs on eval set per PR │\n│   Embedding Drift Monitoring   ◄── nightly cron, alerts     │\n└─────────────────────────────────────────────────────────────┘\n```\n\nThe split is deliberate: ingestion is a batch job (Unstructured → TEI → vector store), query-time is a hot path (vector store → LlamaIndex\u002FHaystack → Rerank → LLM), and observability wraps everything. Without that third box, you cannot tell whether last week's prompt tweak helped or quietly regressed faithfulness by 8%.\n\n## Tradeoffs you'll hit\n\n- **pgvector vs Pinecone vs Qdrant** — Postgres pgvector wins on operational simplicity (one DB, one backup, joins to existing metadata) and is genuinely fine to roughly 10–50M vectors with HNSW. Pinecone wins on \"I don't want to run infra\" and elastic scale, but costs add up fast and lock you in. Qdrant wins when you need filtered hybrid search at large scale and want to self-host. Default to pgvector. Switch to Qdrant when filtered latency degrades. Reach for Pinecone only when ops capacity is the binding constraint.\n- **OpenAI embeddings vs OSS (BGE \u002F E5 \u002F Nomic)** — OpenAI `text-embedding-3-large` is strong on general English and trivially easy. OSS embeddings via Text Embeddings Inference cost ~10x less at volume, run offline, and let you fine-tune on your domain. The decision usually comes down to: do you have an eval set good enough to A\u002FB them? If yes, OSS often wins. If no, start with OpenAI and build the eval set.\n- **Haystack vs LlamaIndex vs LangChain** — Haystack: explicit pipeline graphs, easier to test, slightly more verbose. LlamaIndex: stronger on ingestion + retrieval composition, weaker abstractions for full agent loops. LangChain: maximum surface area, fastest prototyping, most production teams eventually refactor away from it. 
Most mature stacks end up with **LlamaIndex for ingestion + retrieval** + **Haystack or plain Python for orchestration**.\n- **Reranker latency** — Cohere Rerank adds 100–250ms. Almost always worth it. If you can't afford that, run a smaller open-source reranker (BGE-reranker-base) on your own GPU.\n\n## Common pitfalls\n\n- **Chunking too aggressively** — 512-token chunks with 50-token overlap is the default and it's usually wrong. For Q&A over technical docs, larger semantic chunks (1000–1500 tokens, split on heading boundaries with Unstructured's element metadata) consistently outperform. Measure on your eval set, don't guess.\n- **No eval set = no progress** — the most common failure mode. Without 50–200 hand-labelled query\u002Fexpected-context pairs, every change is vibes-based. Build the eval set in week one. Update it whenever a real user reports a bad answer. This is the single highest-ROI engineering investment in any RAG project.\n- **Embedding model change without re-indexing** — silently swapping `text-embedding-ada-002` for `text-embedding-3-small` makes your old vectors meaningless. Version your embeddings. Re-index when you swap.\n- **Single retrieval strategy** — pure dense retrieval misses lexical queries (product SKUs, error codes, version numbers). Add BM25\u002Fkeyword as a parallel retriever and merge. Both LlamaIndex and Haystack support this with only a few extra lines of pipeline code.\n- **No reranker** — bi-encoders are fast and lossy. A cross-encoder rerank over the top-50 candidates is the single most reliable quality lift on the entire RAG stack. Skipping it because \"it costs latency\" is a false economy when faithfulness is the bottleneck.\n- **Tracing as an afterthought** — installing Phoenix after you have a quality problem is 5x harder than installing it on day one. 
Put it in before the first stakeholder demo.",[109,112,115,118,121],{"q":110,"a":111},"How is this pack different from the existing `rag-pipelines` pack on TokRepo?","rag-pipelines is a framework round-up — Quivr, RAGFlow, GraphRAG, Kotaemon, Verba — the picks you reach for when you want a finished RAG app to deploy. This pack is the infra layer underneath: chunking (Unstructured), an embedding server you run yourself (Text Embeddings Inference), a vector store you operate (pgvector \u002F Qdrant), a reranker, drift monitoring, and an observability layer. Different audience, zero overlapping workflow IDs. Pair them: pick a framework from rag-pipelines, then come here for the components when you need to take it past the demo.",{"q":113,"a":114},"Do I really need both pgvector AND Qdrant?","No, pick one. The pack lists both because the answer genuinely depends on scale and existing infra. If you already run Postgres and have fewer than ~50M vectors, pgvector with HNSW indexes is the correct answer and adding a second DB is gratuitous complexity. If you need filtered hybrid search over hundreds of millions of vectors at low latency, Qdrant earns its operational cost. Start pgvector, switch only when a benchmark on your real workload shows it can't keep up.",{"q":116,"a":117},"Why no LangChain in this pack?","LangChain is fine for prototyping, and most teams have it in their first RAG repo. The pack reflects the picks that mature ML teams actually keep in production — Haystack's explicit pipelines and LlamaIndex's retrieval composition consistently win on testability and maintainability once a project crosses about 6 months. Use whatever gets you to a working prototype. If you already ship with LangChain and it works, leave it; if you're greenfield in 2026, the picks here will age better.",{"q":119,"a":120},"Is a reranker really worth the latency?","Yes, almost always. 
A bi-encoder vector search returns top-50 candidates in roughly 20ms but is lossy because it compares precomputed independent representations. A cross-encoder reranker re-scores each query\u002Fdoc pair jointly and routinely lifts NDCG@5 by 10–20 points. Cohere Rerank adds ~100–250ms, which is the cheapest quality jump on the entire stack. If 100ms is a deal-breaker, run a small open reranker (BGE-reranker-base) on your own GPU and aim to stay under 50ms. Either way, ship a reranker.",{"q":122,"a":123},"What's the smallest viable eval set I can start with?","Fifty queries with hand-written ideal contexts and acceptable answers is enough to measure faithfulness, answer-relevance, and context-precision meaningfully. Build it in week one before you tune anything else. Then bring in tools like Ragas, Phoenix Evals, or DeepEval to automate the scoring loop. Every real user complaint becomes one more eval case. By month three you'll have 200–500 cases and every PR runs them in CI — that's the loop that actually drives RAG quality.",{"@context":125,"@type":126,"name":13,"description":127,"numberOfItems":128,"inLanguage":25},"https:\u002F\u002Fschema.org","ItemList","Ten production-grade open-source picks for ML engineers shipping real RAG pipelines: chunking, embedding server, vector store, retrieval framework, reranker, drift monitoring, evaluation, and tracing.",10,[130,134,138],{"url":131,"anchor":132,"reason":133},"\u002Fen\u002Ftopics\u002Frag-pipelines","RAG Pipelines pack — finished frameworks","Companion pack: end-user RAG apps you can deploy directly, while this pack is the infra layer underneath",{"url":135,"anchor":136,"reason":137},"\u002Fen\u002Ftopics\u002Fllm-observability","LLM Observability pack","Broader observability coverage beyond Phoenix — Langfuse, AgentOps, LangSmith for production LLM apps",{"url":139,"anchor":140,"reason":141},"\u002Fen\u002Ftopics\u002Fvector-db-showdown","Vector DB Showdown pack","Deeper comparison across Chroma, Weaviate, Pinecone, Milvus when the 
pgvector vs Qdrant choice in this pack isn't enough",[143,147,151],{"claim":144,"source_name":145,"source_url":146},"Cross-encoder reranking lifts retrieval NDCG significantly over bi-encoder-only retrieval","Cohere Rerank documentation","https:\u002F\u002Fcohere.com\u002Frerank",{"claim":148,"source_name":149,"source_url":150},"pgvector supports HNSW indexes for fast approximate nearest-neighbour search inside Postgres","pgvector GitHub","https:\u002F\u002Fgithub.com\u002Fpgvector\u002Fpgvector",{"claim":152,"source_name":153,"source_url":154},"Hugging Face Text Embeddings Inference is a high-performance embedding server supporting BGE, E5, GTE, Jina and Nomic models","Text Embeddings Inference repo","https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference",1310,"2026-05-22T00:00:00Z"]