Pipeline Stages
1. Document Parsing
| Tool | Best For | Accuracy |
|---|---|---|
| Docling | PDF with tables/figures | Highest |
| Unstructured | Multi-format (15+ types) | High |
| PyPDF | Simple PDFs | Medium |
| BeautifulSoup | HTML/web pages | High |
| Markitdown | Office docs → Markdown | High |
Rule: Use Docling for complex PDFs, Unstructured for everything else.
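The rule above amounts to a dispatch on file type. A minimal routing sketch (tool names are just strings here; each parser has its own API, and the extension map is an illustrative assumption, not a library feature):

```python
from pathlib import Path

# Map file extensions to the parser suggested by the table above.
# Routing sketch only -- each tool has its own invocation API.
PARSER_BY_EXT = {
    ".pdf": "docling",        # complex PDFs with tables/figures
    ".html": "beautifulsoup",
    ".htm": "beautifulsoup",
    ".docx": "markitdown",    # Office docs -> Markdown
    ".pptx": "markitdown",
}

def pick_parser(path: str) -> str:
    """Return the parser name for a file, defaulting to Unstructured."""
    return PARSER_BY_EXT.get(Path(path).suffix.lower(), "unstructured")
```

Anything the map does not cover falls through to Unstructured, matching the rule above.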
2. Chunking Strategies
| Strategy | When to Use | Chunk Size |
|---|---|---|
| Fixed-size | Simple documents | 512-1024 tokens |
| Recursive | Code and structured text | 512 tokens |
| Semantic | Dense technical content | Variable |
| Document-level | Short documents (<1K tokens) | Full document |
| Sentence-based | FAQ and Q&A content | 3-5 sentences |
Best practice: Start with recursive chunking at 512 tokens with 50-token overlap.
```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=50,
    separators=["\n\n", "\n", ". ", " ", ""]
)
```
3. Embedding Models
| Model | Dimensions | Quality | Speed | Cost |
|---|---|---|---|---|
| OpenAI text-embedding-3-large | 3072 | Best | Fast | $0.13/M |
| OpenAI text-embedding-3-small | 1536 | Great | Fastest | $0.02/M |
| Cohere embed-v3 | 1024 | Great | Fast | $0.10/M |
| BGE-large-en | 1024 | Good | Medium | Free (local) |
| all-MiniLM-L6 | 384 | OK | Fastest | Free (local) |
Best practice: Use text-embedding-3-small for most cases. Switch to large only if retrieval quality is critical.
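Whichever model you pick, retrieval reduces to comparing vectors. A minimal cosine-similarity ranking over precomputed embeddings (the 3-dimensional vectors below are toy stand-ins, not real model output):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Toy "embeddings" standing in for real model output.
query = [0.9, 0.1, 0.0]
chunks = {"chunk_a": [1.0, 0.0, 0.0], "chunk_b": [0.0, 1.0, 0.0]}

# Rank chunks by similarity to the query, most similar first.
ranked = sorted(chunks, key=lambda c: cosine(query, chunks[c]), reverse=True)
```

With real models you would embed query and chunks through the same model; mixing models breaks the similarity comparison.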
4. Vector Database Selection
| Database | Hosted | Self-hosted | Best For |
|---|---|---|---|
| Qdrant | Yes | Yes | General purpose, filtering |
| Pinecone | Yes | No | Managed, zero ops |
| Turbopuffer | Yes | No | Serverless, auto-scale |
| ChromaDB | No | Yes | Prototyping, local dev |
| pgvector | No | Yes | Already using PostgreSQL |
| Weaviate | Yes | Yes | Multi-modal, GraphQL |
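Whatever database you choose, the core operation is the same: top-k nearest-neighbor search, often combined with metadata filtering. A toy brute-force in-memory version for intuition (production databases use approximate indexes such as HNSW instead of scanning every record):

```python
import math

def search(store, query_vec, k=2, source=None):
    """Brute-force top-k cosine search with an optional metadata filter."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (math.sqrt(sum(x * x for x in a)) *
                      math.sqrt(sum(y * y for y in b)))
    # Metadata filter first, then rank the survivors by similarity.
    candidates = [r for r in store if source is None or r["source"] == source]
    return sorted(candidates, key=lambda r: cos(query_vec, r["vec"]),
                  reverse=True)[:k]

store = [
    {"id": 1, "vec": [1.0, 0.0], "source": "docs"},
    {"id": 2, "vec": [0.0, 1.0], "source": "docs"},
    {"id": 3, "vec": [0.9, 0.1], "source": "wiki"},
]
hits = search(store, [1.0, 0.1], k=1, source="docs")
```

Note that the filter runs before ranking; this is why the "filtering" support called out for Qdrant in the table matters for multi-tenant or time-scoped corpora.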
5. Retrieval Techniques
| Technique | Improvement | Complexity |
|---|---|---|
| Hybrid search (keyword + semantic) | +15-25% | Low |
| Reranking (Cohere, BGE) | +10-20% | Low |
| Query expansion | +5-15% | Medium |
| Parent document retrieval | +10-20% | Medium |
| HyDE (hypothetical doc embedding) | +5-15% | Medium |
| Multi-query retrieval | +10-15% | Low |
Best practice: Always use hybrid search + reranking. It is the highest-ROI improvement.
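One common way to combine the keyword and semantic result lists is reciprocal rank fusion (RRF), which needs only the two rankings, not comparable scores. A minimal sketch (k=60 is the conventional smoothing constant):

```python
def rrf(keyword_ranked, semantic_ranked, k=60):
    """Fuse two ranked lists of doc IDs via reciprocal rank fusion."""
    scores = {}
    for ranking in (keyword_ranked, semantic_ranked):
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes 1/(k + rank) to a document's score.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# d2 appears near the top of both lists, so it wins the fused ranking.
fused = rrf(["d1", "d2", "d3"], ["d2", "d4", "d1"])
```

The fused list would then typically go to a reranker (e.g. Cohere or a BGE cross-encoder) for the final ordering.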
6. Evaluation
```python
# Use RAGAS for automated evaluation
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision

results = evaluate(
    dataset=test_dataset,
    metrics=[faithfulness, answer_relevancy, context_precision]
)
```
Common Pitfalls
| Pitfall | Solution |
|---|---|
| Chunks too large | Reduce to 512 tokens |
| No overlap between chunks | Add 50-100 token overlap |
| Wrong embedding model | Match model to your language |
| No reranking | Add Cohere reranker (+15% accuracy) |
| Ignoring metadata | Filter by date, source, type |
| No evaluation | Use RAGAS or promptfoo |
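The first two pitfalls are easy to check mechanically. A sketch of a fixed-size chunker with overlap, using a plain token list as a stand-in for a real tokenizer such as tiktoken:

```python
def chunk_tokens(tokens, size=512, overlap=50):
    """Split a token list into overlapping fixed-size windows."""
    step = size - overlap  # advance less than `size` so windows overlap
    return [tokens[i:i + size]
            for i in range(0, max(len(tokens) - overlap, 1), step)]

# 1200 dummy tokens -> three chunks, each sharing 50 tokens
# with its neighbor so no sentence is cut off without context.
chunks = chunk_tokens(list(range(1200)), size=512, overlap=50)
```

The shared 50-token window means a sentence split at a chunk boundary still appears whole in one of the two adjacent chunks.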
FAQ
Q: What is RAG? A: Retrieval-Augmented Generation is an architecture where an LLM retrieves relevant documents from a knowledge base before generating a response, combining the LLM's reasoning with your private data.
Q: What chunk size should I use? A: Start with 512 tokens and 50-token overlap. Adjust based on your document type and retrieval quality metrics.
Q: Do I need a vector database for RAG? A: For prototyping, ChromaDB (in-memory) works fine. For production, yes: use Qdrant, Pinecone, or pgvector.