Introduction
LLMWare is an open-source Python framework for building enterprise RAG (retrieval-augmented generation) pipelines. It provides an integrated stack covering document parsing, embedding, vector storage, and inference, built around small, specialized models that run locally and do not require a GPU.
What LLMWare Does
- Parses PDFs, Office documents, HTML, and text into structured chunks for retrieval
- Generates embeddings and stores them in a supported vector database, including Milvus, FAISS, Qdrant, Pinecone, and Postgres/pgvector
- Ships a catalog of 50+ small specialized GGUF and ONNX models for targeted tasks
- Runs function-calling models locally for summarization, extraction, classification, and Q&A
- Provides a library abstraction that connects parsing, retrieval, and generation into cohesive pipelines
Architecture Overview
LLMWare organizes work around a Library object that ingests documents, chunks them, and stores the chunks and their metadata in a document store (MongoDB, SQLite, or Postgres). Embeddings are generated for the chunks and pushed to a vector database for similarity search. At query time, the retrieved context is passed to a model from the built-in catalog or to an external API. The SLIM model series (small language models under 3B parameters) handles structured extraction tasks efficiently on CPU.
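The flow described above (chunk, store, embed, search, assemble context) can be illustrated with a toy sketch in plain Python. This is not LLMWare's API: every function name here (chunk_text, embed, build_index, retrieve, build_prompt) is hypothetical, a dictionary stands in for the document store, and a bag-of-words counter stands in for a real embedding model.

```python
import math
import re
from collections import Counter


def chunk_text(text, max_words=12):
    """Split text into chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def embed(text):
    """Toy embedding: lowercase word counts (stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def build_index(documents):
    """Chunk each document, record chunk metadata, and embed every chunk."""
    store, vectors = {}, {}
    for doc_name, text in documents.items():
        for n, chunk in enumerate(chunk_text(text)):
            chunk_id = f"{doc_name}#{n}"
            store[chunk_id] = {"doc": doc_name, "text": chunk}  # document store role
            vectors[chunk_id] = embed(chunk)                     # vector database role
    return store, vectors


def retrieve(query, store, vectors, top_k=2):
    """Return the top_k stored chunks most similar to the query."""
    q = embed(query)
    ranked = sorted(vectors, key=lambda cid: cosine(q, vectors[cid]), reverse=True)
    return [store[cid] for cid in ranked[:top_k]]


def build_prompt(query, hits):
    """Assemble retrieved chunks into the context that would be sent to a model."""
    context = "\n".join(f"[{h['doc']}] {h['text']}" for h in hits)
    return f"Context:\n{context}\n\nQuestion: {query}"


documents = {
    "contract.txt": "The agreement starts on 1 March. The supplier must deliver "
                    "within 30 days. Late delivery incurs a penalty of 2 percent per week.",
    "handbook.txt": "Employees accrue vacation monthly. Expense reports are due "
                    "by the fifth day of each month.",
}
question = "What is the penalty for late delivery?"
store, vectors = build_index(documents)
hits = retrieve(question, store, vectors, top_k=1)
prompt = build_prompt(question, hits)
print(prompt)
```

In a real deployment the counter would be replaced by an embedding model and the two dictionaries by the configured document store and vector database; the sketch only shows how the pieces hand data to each other.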
Self-Hosting & Configuration
- Install with pip install llmware on Python 3.9+
- Choose a document store backend: SQLite (default), MongoDB, or PostgreSQL
- Select a vector database: FAISS (local default), Milvus, Qdrant, Pinecone, or pgvector
- Download models on first use from Hugging Face Hub via the ModelCatalog
- Configure API-based models (OpenAI, Anthropic, Google) via environment variables for hybrid deployments
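The choices listed above can be captured in a small planning helper before any LLMWare code is written. The following is a sketch under stated assumptions, not part of LLMWare: DeploymentPlan and plan_from_env are hypothetical names, the backend identifiers simply mirror the options in the list, and the environment variable names are assumed conventions that should be checked against the LLMWare documentation.

```python
import os
from dataclasses import dataclass

# Backend names mirror the options listed above; the identifiers are illustrative.
DOCUMENT_STORES = {"sqlite", "mongodb", "postgres"}
VECTOR_DBS = {"faiss", "milvus", "qdrant", "pinecone", "pgvector"}

# Assumed environment variable names, not confirmed LLMWare settings.
API_KEY_VARS = {
    "openai": "OPENAI_API_KEY",
    "anthropic": "ANTHROPIC_API_KEY",
    "google": "GOOGLE_API_KEY",
}


@dataclass(frozen=True)
class DeploymentPlan:
    document_store: str = "sqlite"  # default document store per the list above
    vector_db: str = "faiss"        # local default vector database per the list above
    api_providers: tuple = ()       # empty means local-only operation

    def __post_init__(self):
        # Reject anything outside the options named in the list.
        if self.document_store not in DOCUMENT_STORES:
            raise ValueError(f"unknown document store: {self.document_store}")
        if self.vector_db not in VECTOR_DBS:
            raise ValueError(f"unknown vector database: {self.vector_db}")
        unknown = set(self.api_providers) - set(API_KEY_VARS)
        if unknown:
            raise ValueError(f"unknown API providers: {sorted(unknown)}")

    @property
    def is_local_only(self):
        """True when no external API provider is configured."""
        return not self.api_providers


def plan_from_env(env=None, document_store="sqlite", vector_db="faiss"):
    """Build a plan, enabling each API provider whose key variable is set."""
    env = os.environ if env is None else env
    providers = tuple(sorted(p for p, var in API_KEY_VARS.items() if env.get(var)))
    return DeploymentPlan(document_store, vector_db, providers)


# Example: a hybrid deployment with one API key present and Qdrant for vectors.
plan = plan_from_env(env={"ANTHROPIC_API_KEY": "example-key"}, vector_db="qdrant")
print(plan, "local only:", plan.is_local_only)
```

Passing the environment as a plain dictionary keeps the helper easy to test; calling plan_from_env() with no arguments reads the real process environment instead.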
Key Features
- Small specialized models (SLIM series) that run on CPU without GPU infrastructure
- End-to-end pipeline covering ingestion, parsing, embedding, retrieval, and generation
- Multi-format document parsing including scanned PDFs with OCR support
- Model catalog with 50+ pre-configured models for different tasks and hardware profiles
- Enterprise-friendly with support for air-gapped deployments and local-only operation
Comparison with Similar Tools
- LangChain — General-purpose LLM orchestration; LLMWare focuses on RAG with built-in models and parsing
- LlamaIndex — Specialized in data indexing and retrieval; LLMWare bundles its own small models
- Haystack — Pipeline-based NLP framework; LLMWare emphasizes CPU-friendly small models
- Unstructured — Document parsing library; LLMWare integrates parsing with retrieval and inference
- txtai — Embeddings and RAG; LLMWare provides a broader enterprise pipeline abstraction
FAQ
Q: Do I need a GPU to run LLMWare? A: No. The SLIM model series and GGUF models are designed to run on CPU. GPU acceleration is optional.
Q: What document formats does it support? A: PDF, DOCX, PPTX, XLSX, HTML, CSV, TXT, and JSON. Scanned PDFs are handled via integrated OCR.
Q: Can I use external LLM APIs instead of local models? A: Yes. LLMWare supports OpenAI, Anthropic, Google, and other API providers alongside local models.
Q: How does it compare to using LangChain with a vector store? A: LLMWare provides a more opinionated, integrated stack with built-in small models, reducing the need to assemble components separately.