Key Features
- Zero dependencies: Pure C/C++ with no external libraries required
- Broad hardware support: x86/ARM/RISC-V CPUs, plus Apple Metal, CUDA, HIP, Vulkan, and SYCL GPU backends
- Multi-bit quantization: 1.5- to 8-bit formats to trade quality for speed and memory
- 50+ model architectures: LLaMA, Mistral, Qwen, Gemma, Phi, multimodal models
- OpenAI-compatible server: serve local models as a drop-in replacement for the OpenAI API
- CPU+GPU hybrid inference: offload part of a model's layers to GPU memory and keep the rest in system RAM
- GGUF format: Standard model format used across the ecosystem
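The OpenAI-compatible server can be exercised from any OpenAI client. As a minimal sketch, the stdlib-only snippet below posts a chat completion to a llama-server assumed to be running locally on its default port 8080 (e.g. started with llama-server -m model.gguf); the URL and the placeholder "local" model name are assumptions, not part of the original text.

```python
# Sketch: query a local llama-server through its OpenAI-compatible
# /v1/chat/completions endpoint using only the Python standard library.
# Assumes a server is already running on localhost:8080.
import json
import urllib.request
import urllib.error

def make_payload(prompt):
    # "model" is required by the OpenAI schema; llama-server serves
    # whatever model it was started with, so the name is a placeholder.
    return {
        "model": "local",
        "messages": [{"role": "user", "content": prompt}],
    }

def chat(prompt, url="http://localhost:8080/v1/chat/completions"):
    data = json.dumps(make_payload(prompt)).encode("utf-8")
    req = urllib.request.Request(
        url, data=data, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

if __name__ == "__main__":
    try:
        print(chat("Say hello in one word."))
    except (urllib.error.URLError, OSError):
        print("no llama-server running on localhost:8080")
```

Because the endpoint follows the OpenAI wire format, the official OpenAI SDKs also work by pointing their base URL at the local server.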
FAQ
Q: What is llama.cpp?
A: llama.cpp is an open-source C/C++ LLM inference engine with 100K+ GitHub stars. It has zero dependencies, runs on commodity hardware (CPU, Apple Silicon, NVIDIA, AMD), and supports 50+ model architectures with 1.5- to 8-bit quantization. MIT licensed.
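The 8-bit end of the quantization range mentioned above can be illustrated in a few lines. This is a simplified Python model in the spirit of llama.cpp's Q8_0 format (blocks of values sharing one scale plus signed 8-bit integers), not the actual C implementation; the block contents and function names are illustrative.

```python
# Sketch of block-wise 8-bit quantization: each block of weights stores
# one float scale (absmax / 127) plus signed 8-bit integers, so memory
# drops to roughly a quarter of float32 at a small accuracy cost.

def quantize_q8(block):
    """Quantize a block of floats to (scale, list of int8 values)."""
    amax = max(abs(x) for x in block)
    scale = amax / 127.0 if amax > 0 else 1.0
    qs = [max(-127, min(127, round(x / scale))) for x in block]
    return scale, qs

def dequantize_q8(scale, qs):
    """Recover approximate floats from the quantized block."""
    return [q * scale for q in qs]

weights = [0.8, -1.2, 0.05, 0.0, 2.0, -0.33, 1.5, -2.0]  # toy block
scale, qs = quantize_q8(weights)
restored = dequantize_q8(scale, qs)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(f"scale={scale:.4f}, max abs error={max_err:.4f}")
```

Lower-bit formats push the same idea further with smaller integers and extra per-block metadata, which is where the speed/quality tradeoff comes from.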
Q: How do I install llama.cpp?
A: On macOS or Linux, install with brew install llama.cpp, or build from source with cmake -B build && cmake --build build. Then download GGUF models from Hugging Face.