# RAG Best Practices — Production Pipeline Guide 2026

> Comprehensive guide to building production RAG pipelines. Covers chunking strategies, embedding models, vector databases, retrieval techniques, evaluation, and common pitfalls with code examples.

## Quick Use

The fastest path to a production RAG pipeline:

```python
from pathlib import Path

# 1. Parse documents (convert() handles one file per call; PDFs are assumed here)
from docling.document_converter import DocumentConverter
from langchain_core.documents import Document

converter = DocumentConverter()
docs = [
    Document(
        page_content=converter.convert(path).document.export_to_markdown(),
        metadata={"source": str(path)},
    )
    for path in Path("knowledge_base").glob("*.pdf")
]

# 2. Chunk intelligently (from_tiktoken_encoder counts sizes in tokens, not characters)
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(chunk_size=512, chunk_overlap=50)
chunks = splitter.split_documents(docs)

# 3. Embed and store (in-memory Qdrant; OpenAIEmbeddings reads OPENAI_API_KEY)
from langchain_community.vectorstores import Qdrant
from langchain_openai import OpenAIEmbeddings

vectorstore = Qdrant.from_documents(
    chunks, embedding=OpenAIEmbeddings(), location=":memory:", collection_name="kb"
)

# 4. Retrieve the context to pass to the LLM
retriever = vectorstore.as_retriever(search_kwargs={"k": 5})
context = retriever.invoke("How does authentication work?")
```

---

## Intro

Retrieval-Augmented Generation (RAG) is the dominant architecture for building AI applications that need access to private data, from customer support bots to internal knowledge bases. This guide covers every stage of a production RAG pipeline with code examples, benchmarks, and lessons learned from real deployments.

Best for developers building their first RAG system or optimizing an existing one. Works with: any LLM, any vector database.

---

## Pipeline Stages

### 1. Document Parsing

| Tool | Best For | Accuracy |
|------|----------|----------|
| Docling | PDF with tables/figures | Highest |
| Unstructured | Multi-format (15+ types) | High |
| PyPDF | Simple PDFs | Medium |
| BeautifulSoup | HTML/web pages | High |
| MarkItDown | Office docs → Markdown | High |

**Rule**: Use Docling for complex PDFs, Unstructured for everything else.

### 2. Chunking Strategies

| Strategy | When to Use | Chunk Size |
|----------|-------------|------------|
| Fixed-size | Simple documents | 512-1024 tokens |
| Recursive | Code and structured text | 512 tokens |
| Semantic | Dense technical content | Variable |
| Document-level | Short documents (<1K tokens) | Full document |
| Sentence-based | FAQ and Q&A content | 3-5 sentences |

**Best practice**: Start with recursive chunking at 512 tokens with 50-token overlap.

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

# from_tiktoken_encoder measures chunk_size and chunk_overlap in tokens;
# the plain constructor would count characters instead.
splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    chunk_size=512,
    chunk_overlap=50,
    separators=["\n\n", "\n", ". ", " ", ""],  # paragraphs, lines, sentences, words
)
```

### 3. Embedding Models

| Model | Dimensions | Quality | Speed | Cost (per 1M tokens) |
|-------|-----------|---------|-------|------|
| OpenAI text-embedding-3-large | 3072 | Best | Fast | $0.13 |
| OpenAI text-embedding-3-small | 1536 | Great | Fastest | $0.02 |
| Cohere embed-v3 | 1024 | Great | Fast | $0.10 |
| BGE-large-en | 1024 | Good | Medium | Free (local) |
| all-MiniLM-L6-v2 | 384 | OK | Fastest | Free (local) |

**Best practice**: Use `text-embedding-3-small` for most cases. Switch to `text-embedding-3-large` only if retrieval quality is critical.

### 4. Vector Database Selection

| Database | Hosted | Self-hosted | Best For |
|----------|--------|-------------|----------|
| Qdrant | Yes | Yes | General purpose, filtering |
| Pinecone | Yes | No | Managed, zero ops |
| Turbopuffer | Yes | No | Serverless, auto-scale |
| ChromaDB | No | Yes | Prototyping, local dev |
| pgvector | No | Yes | Already using PostgreSQL |
| Weaviate | Yes | Yes | Multi-modal, GraphQL |

### 5. Retrieval Techniques

| Technique | Improvement | Complexity |
|-----------|-------------|------------|
| Hybrid search (keyword + semantic) | +15-25% | Low |
| Reranking (Cohere, BGE) | +10-20% | Low |
| Query expansion | +5-15% | Medium |
| Parent document retrieval | +10-20% | Medium |
| HyDE (hypothetical doc embedding) | +5-15% | Medium |
| Multi-query retrieval | +10-15% | Low |

**Best practice**: Always use hybrid search + reranking. It is the highest-ROI improvement.

### 6. Evaluation

```python
# Use RAGAS for automated evaluation
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision

# test_dataset is assumed to be prepared beforehand: questions, generated
# answers, retrieved contexts, and reference answers from your pipeline.
results = evaluate(
    dataset=test_dataset,
    metrics=[faithfulness, answer_relevancy, context_precision],
)
```

### Common Pitfalls

| Pitfall | Solution |
|---------|----------|
| Chunks too large | Reduce to 512 tokens |
| No overlap between chunks | Add 50-100 token overlap |
| Wrong embedding model | Match model to your language |
| No reranking | Add Cohere reranker (+15% accuracy) |
| Ignoring metadata | Filter by date, source, type |
| No evaluation | Use RAGAS or promptfoo |

### FAQ

**Q: What is RAG?**
A: Retrieval-Augmented Generation is an architecture where an LLM retrieves relevant documents from a knowledge base before generating a response, combining the LLM's reasoning with your private data.

**Q: What chunk size should I use?**
A: Start with 512 tokens and 50-token overlap. Adjust based on your document type and retrieval quality metrics.

**Q: Do I need a vector database for RAG?**
A: For production, yes: use Qdrant, Pinecone, or pgvector. For prototyping, in-memory ChromaDB works.

---

## Source & Thanks

> Compiled from production RAG deployments, research papers, and community benchmarks.
>
> Related assets on TokRepo: [Docling](https://tokrepo.com), [Qdrant MCP](https://tokrepo.com), [Haystack](https://tokrepo.com), [Turbopuffer MCP](https://tokrepo.com)

---

Source: https://tokrepo.com/en/workflows/rag-best-practices-production-pipeline-guide-2026-7ded33e8

Author: Prompt Lab
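## Appendix: Hybrid Search Fusion Sketch

The hybrid search recommendation under Retrieval Techniques leaves open how keyword and semantic results get merged. One common option, which this guide does not itself prescribe, is reciprocal rank fusion (RRF). Below is a minimal sketch in plain Python, assuming each retriever returns document IDs ordered best-first. The function name `reciprocal_rank_fusion`, the sample document IDs, and the constant `k=60` (a commonly used default) are illustrative assumptions, not part of any library named above.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several best-first ID lists into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is an assumed, commonly used default.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for deterministic output.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical results from a keyword (BM25) retriever and a vector retriever.
keyword_hits = ["doc_a", "doc_b", "doc_c"]
semantic_hits = ["doc_c", "doc_a", "doc_d"]

fused = reciprocal_rank_fusion([keyword_hits, semantic_hits])
print(fused)
```

Documents that rank well in both lists (`doc_a`, `doc_c`) rise above those found by only one retriever. In a pipeline like the one described here, the fused list would then go to a reranker, and only the top few documents would be passed to the LLM as context.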