# RAG Best Practices — Production Pipeline Guide 2026

> Comprehensive guide to building production RAG pipelines. Covers chunking strategies, embedding models, vector databases, retrieval techniques, evaluation, and common pitfalls, with code examples.

## Quick Use

The fastest path to a production RAG pipeline:

```python
# 1. Parse documents (Docling converts individual files,
#    so glob the folder and use convert_all)
from pathlib import Path
from docling.document_converter import DocumentConverter

results = DocumentConverter().convert_all(Path("knowledge_base/").glob("*.pdf"))
texts = [r.document.export_to_markdown() for r in results]

# 2. Chunk intelligently
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=50)
chunks = splitter.create_documents(texts)

# 3. Embed and store
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Qdrant

vectorstore = Qdrant.from_documents(
    chunks,
    embedding=OpenAIEmbeddings(),
    location=":memory:",
    collection_name="kb",
)

# 4. Retrieve and generate
retriever = vectorstore.as_retriever(search_kwargs={"k": 5})
context = retriever.invoke("How does authentication work?")
```

---

## Intro

Retrieval-Augmented Generation (RAG) is the dominant architecture for building AI applications that need access to private data — from customer support bots to internal knowledge bases. This guide covers every stage of a production RAG pipeline with code examples, benchmarks, and lessons learned from real deployments.

**Best for**: developers building their first RAG system or optimizing an existing one.

**Works with**: any LLM, any vector database.

---

## Pipeline Stages

### 1. Document Parsing

| Tool | Best For | Accuracy |
|------|----------|----------|
| Docling | PDFs with tables/figures | Highest |
| Unstructured | Multi-format (15+ types) | High |
| PyPDF | Simple PDFs | Medium |
| BeautifulSoup | HTML/web pages | High |
| MarkItDown | Office docs → Markdown | High |

**Rule**: Use Docling for complex PDFs, Unstructured for everything else.
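The table above lists BeautifulSoup for HTML. As a dependency-free illustration of what HTML parsing does in a RAG pipeline, here is a simplified sketch using only Python's standard-library `html.parser` (the class, sample page, and output are illustrative; real pages warrant BeautifulSoup's robustness):

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Only keep text that is outside skipped tags and non-blank
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def html_to_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(parser.parts)


page = (
    "<html><head><style>p{color:red}</style></head>"
    "<body><h1>Auth</h1><p>Use OAuth 2.0.</p></body></html>"
)
print(html_to_text(page))  # Auth Use OAuth 2.0.
```

The extracted text would then flow into the chunking stage like any other parsed document.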
### 2. Chunking Strategies

| Strategy | When to Use | Chunk Size |
|----------|-------------|------------|
| Fixed-size | Simple documents | 512-1024 tokens |
| Recursive | Code and structured text | 512 tokens |
| Semantic | Dense technical content | Variable |
| Document-level | Short documents (<1K tokens) | Full document |
| Sentence-based | FAQ and Q&A content | 3-5 sentences |

**Best practice**: Start with recursive chunking at 512 tokens with 50-token overlap.

```python
splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=50,
    separators=["\n\n", "\n", ". ", " ", ""],
)
```

### 3. Embedding Models

| Model | Dimensions | Quality | Speed | Cost |
|-------|-----------|---------|-------|------|
| OpenAI text-embedding-3-large | 3072 | Best | Fast | $0.13/M |
| OpenAI text-embedding-3-small | 1536 | Great | Fastest | $0.02/M |
| Cohere embed-v3 | 1024 | Great | Fast | $0.10/M |
| BGE-large-en | 1024 | Good | Medium | Free (local) |
| all-MiniLM-L6 | 384 | OK | Fastest | Free (local) |

**Best practice**: Use `text-embedding-3-small` for most cases. Switch to `text-embedding-3-large` only if retrieval quality is critical.

### 4. Vector Database Selection

| Database | Hosted | Self-hosted | Best For |
|----------|--------|-------------|----------|
| Qdrant | Yes | Yes | General purpose, filtering |
| Pinecone | Yes | No | Managed, zero ops |
| Turbopuffer | Yes | No | Serverless, auto-scale |
| ChromaDB | No | Yes | Prototyping, local dev |
| pgvector | No | Yes | Already using PostgreSQL |
| Weaviate | Yes | Yes | Multi-modal, GraphQL |
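To build intuition for the core operation every database in this table provides, here is a dependency-free sketch of brute-force cosine-similarity search over a few toy vectors (the document IDs and 3-dimensional vectors are made up for illustration; production systems add approximate-nearest-neighbor indexes, metadata filtering, and persistence):

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)


def search(query_vec, index, k=2):
    """index: list of (doc_id, vector) pairs. Returns top-k (doc_id, score)."""
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in index]
    return sorted(scored, key=lambda t: t[1], reverse=True)[:k]


# Toy 3-dimensional "embeddings" (real ones have hundreds of dimensions)
index = [
    ("auth-guide", [0.9, 0.1, 0.0]),
    ("billing-faq", [0.1, 0.9, 0.2]),
    ("deploy-docs", [0.0, 0.2, 0.9]),
]
print(search([1.0, 0.0, 0.1], index, k=2))  # "auth-guide" ranks first
```

A hosted database performs the same ranking, but over millions of vectors with sub-linear index lookups instead of a full scan.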
### 5. Retrieval Techniques

| Technique | Improvement | Complexity |
|-----------|-------------|------------|
| Hybrid search (keyword + semantic) | +15-25% | Low |
| Reranking (Cohere, BGE) | +10-20% | Low |
| Query expansion | +5-15% | Medium |
| Parent document retrieval | +10-20% | Medium |
| HyDE (hypothetical doc embedding) | +5-15% | Medium |
| Multi-query retrieval | +10-15% | Low |

**Best practice**: Always use hybrid search + reranking. It is the highest-ROI improvement.

### 6. Evaluation

```python
# Use RAGAS for automated evaluation.
# test_dataset: a dataset with question, answer, and contexts columns
# built from your own QA pairs.
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision

results = evaluate(
    dataset=test_dataset,
    metrics=[faithfulness, answer_relevancy, context_precision],
)
```

### Common Pitfalls

| Pitfall | Solution |
|---------|----------|
| Chunks too large | Reduce to 512 tokens |
| No overlap between chunks | Add 50-100 token overlap |
| Wrong embedding model | Match model to your language |
| No reranking | Add Cohere reranker (+15% accuracy) |
| Ignoring metadata | Filter by date, source, type |
| No evaluation | Use RAGAS or promptfoo |

### FAQ

**Q: What is RAG?**
A: Retrieval-Augmented Generation is an architecture where an LLM retrieves relevant documents from a knowledge base before generating a response, combining the LLM's reasoning with your private data.

**Q: What chunk size should I use?**
A: Start with 512 tokens and 50-token overlap. Adjust based on your document type and retrieval quality metrics.

**Q: Do I need a vector database for RAG?**
A: For production, yes: use Qdrant, Pinecone, or pgvector. For prototyping, ChromaDB (in-memory) works.
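The "hybrid search + reranking" best practice from section 5 requires fusing the keyword and semantic result lists into one ranking. One common way to do this is Reciprocal Rank Fusion; a dependency-free sketch (the doc IDs are illustrative, and `k=60` is the constant conventionally used in the RRF literature):

```python
def reciprocal_rank_fusion(rankings, k=60):
    """rankings: a list of ranked doc-id lists (e.g., one from BM25 keyword
    search, one from vector search). Each document earns 1/(k + rank) per
    list it appears in; documents ranked highly by both lists win."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


keyword_hits = ["doc-a", "doc-b", "doc-c"]   # e.g., BM25 order
semantic_hits = ["doc-b", "doc-d", "doc-a"]  # e.g., vector-search order
print(reciprocal_rank_fusion([keyword_hits, semantic_hits]))
# ['doc-b', 'doc-a', 'doc-d', 'doc-c']
```

Because `doc-b` ranks near the top of both lists, it wins the fused ranking; the fused top-k is then typically passed to a reranker (Cohere, BGE) for the final ordering.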
---

## Source & Thanks

> Compiled from production RAG deployments, research papers, and community benchmarks.
>
> Related assets on TokRepo: [Docling](https://tokrepo.com), [Qdrant MCP](https://tokrepo.com), [Haystack](https://tokrepo.com), [Turbopuffer MCP](https://tokrepo.com)

---

Source: https://tokrepo.com/en/workflows/7ded33e8-464c-4c8f-b3de-6dcf14c0eaf4
Author: Prompt Lab