llama.cpp — The C++ Engine Under Ollama, LM Studio, and Most Local LLMs
llama.cpp is Georgi Gerganov’s MIT-licensed C/C++ implementation of LLM inference, the engine most local LLM tools build on. It supports CPU, CUDA, ROCm, Metal, and Vulkan backends, with aggressive quantization (2-8 bit) to fit almost any hardware.
Why llama.cpp
When Georgi Gerganov released llama.cpp in March 2023, it made running Llama on a MacBook a 20-line change instead of a research project. The implementation priorities — zero dependencies, aggressive quantization (2-8 bit), and maximum portability — turned out to define the local LLM category. Ollama, LM Studio, and LocalAI all wrap llama.cpp. If you install local LLM software in 2026, you’re probably running llama.cpp whether you know it or not.
Running llama.cpp directly gets you maximum control and minimum surface area. No runtime manager, no HTTP server by default (though one ships), no Python dependency. The trade: every convenience (model management, chat UIs, OpenAI compatibility) is work you either do yourself or delegate to a higher-level tool.
Reach for llama.cpp directly when: you need absolute control over quantization/flags, you’re deploying to unusual hardware (Raspberry Pi, older CPUs, mobile), or you’re building a product on top of llama.cpp’s C API. Otherwise, let Ollama or LM Studio wrap it for you.
Quick Start — Build, Download, Run
-hf downloads GGUF quantizations directly from Hugging Face (no manual conversion). llama-server exposes an OpenAI-compatible /v1/chat/completions endpoint plus a minimal web UI at / for interactive use. Quantization suffixes: Q4_K_M is a safe default; Q5_K_M for higher quality; Q8_0 for near-lossless but larger.
# 1. Clone and build (Metal on macOS, CUDA on NVIDIA, CPU everywhere)
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
# macOS Metal
cmake -B build -DGGML_METAL=ON
# NVIDIA CUDA
# cmake -B build -DGGML_CUDA=ON
# AMD ROCm
# cmake -B build -DGGML_HIP=ON
# CPU-only
# cmake -B build
cmake --build build --config Release -j
# 2. Grab a GGUF model from Hugging Face
# (e.g., Qwen 2.5 7B Instruct Q4_K_M)
./build/bin/llama-cli -hf Qwen/Qwen2.5-7B-Instruct-GGUF:Q4_K_M \
-p "Three facts about llama.cpp:" -n 200
# 3. OpenAI-compatible server
./build/bin/llama-server -hf Qwen/Qwen2.5-7B-Instruct-GGUF:Q4_K_M \
--host 0.0.0.0 --port 8080 -c 8192
# 4. Call it like OpenAI
python - <<'PY'
from openai import OpenAI
c = OpenAI(base_url="http://localhost:8080/v1", api_key="llama-cpp")
print(c.chat.completions.create(
model="qwen",
messages=[{"role":"user","content":"One-sentence llama.cpp fact."}],
).choices[0].message.content)
PY
Key Features
Zero Python dependency
Pure C++. No virtualenv, no pip conflicts, no CUDA-Python version matrix. Single binary distribution.
Multi-backend support
CPU (AVX2/AVX-512), CUDA (NVIDIA), ROCm (AMD), Metal (Apple), Vulkan (generic), SYCL (Intel), MUSA (Moore Threads). Compile for your hardware; layers that don’t fit on the accelerator fall back to the CPU backend.
Aggressive quantization
Q2/Q3/Q4/Q5/Q6/Q8 plus K-quants and I-quants. Shrinks 70B models to ~40GB, small enough for a single GPU or a 64GB-RAM machine. Quality/size curves are well-documented.
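As a rule of thumb, weight memory is roughly parameter count × bits per weight ÷ 8, plus overhead for the KV cache and activations. A minimal sketch (the bits-per-weight figures are approximations; real K-quant files mix block types, so treat these as estimates, not exact file sizes):

```python
def approx_weight_gb(n_params_b: float, bits_per_weight: float) -> float:
    """Rough GGUF weight size: parameters (billions) x bits per weight / 8, in GB."""
    return n_params_b * bits_per_weight / 8

# Approximate effective bits per weight for common quant types (assumed values).
BPW = {"Q4_K_M": 4.8, "Q5_K_M": 5.7, "Q8_0": 8.5, "F16": 16.0}

for quant, bpw in BPW.items():
    print(f"70B @ {quant}: ~{approx_weight_gb(70, bpw):.0f} GB")
```

This reproduces the ~40GB figure above for 70B at Q4_K_M, and shows why a 7B Q4 model fits comfortably in 6GB of RAM.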
GGUF format
Self-describing model format with metadata (architecture, prompt template, tokenizer). It replaced the older GGML format and is supported across the local LLM ecosystem.
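The GGUF header itself is simple to inspect: a little-endian `GGUF` magic, a uint32 version, then uint64 tensor and metadata-KV counts. A sketch using only the stdlib (header layout per the GGUF spec; the sample bytes here are synthetic, not a real model):

```python
import struct

def parse_gguf_header(data: bytes) -> dict:
    """Unpack the fixed 24-byte GGUF header: magic, version, tensor count, KV count."""
    magic, version, n_tensors, n_kv = struct.unpack_from("<4sIQQ", data, 0)
    if magic != b"GGUF":
        raise ValueError("not a GGUF file")
    return {"version": version, "tensors": n_tensors, "metadata_kv": n_kv}

# Synthetic header for illustration; a real file continues with metadata key/values.
sample = struct.pack("<4sIQQ", b"GGUF", 3, 291, 24)
print(parse_gguf_header(sample))
```

To inspect a real download, pass the first 24 bytes of the file: `parse_gguf_header(open(path, "rb").read(24))`.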
llama-server binary
OpenAI-compatible HTTP server shipped in the repo. Supports chat completions, completions, embeddings, streaming, cancellation, slot management, and a basic web UI.
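Because the endpoint follows the OpenAI chat-completions schema, any HTTP client works, not just the official SDK. A sketch of the JSON body `POST /v1/chat/completions` accepts (field values are illustrative; `stream` switches the response to server-sent events):

```python
import json

# Request body for POST http://localhost:8080/v1/chat/completions
body = {
    "model": "qwen",  # llama-server serves whatever model it loaded; this is informational
    "messages": [{"role": "user", "content": "One-sentence llama.cpp fact."}],
    "temperature": 0.7,
    "max_tokens": 128,
    "stream": True,  # SSE token chunks instead of one final JSON object
}
payload = json.dumps(body)
print(payload)
```

Send it with `curl`, `urllib.request`, or any HTTP library; with `"stream": True` you read `data:` lines until `data: [DONE]`.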
Embeddings + speech + vision
Ships with llama-embedding, llama-cli, llama-bench, and experimental multimodal support (LLaVA, MiniCPM-V). One codebase, many modes.
Comparison
| | Audience | Convenience | Control | Performance |
|---|---|---|---|---|
| llama.cpp (this tool) | Developers, integrators | Low (bring your own UX) | Maximum | Excellent on any hardware |
| Ollama | Developers + semi-technical | High | Medium | Good (llama.cpp backend) |
| LM Studio | End users | Very high (GUI) | Low-medium | Good (llama.cpp / MLX) |
| vLLM | Production ops | Medium-low | High for GPU | Best on GPU |
Use Cases
01. Embedded or unusual hardware
Raspberry Pi, older laptops, mobile, single-board computers. llama.cpp runs where Python + CUDA-based engines can’t.
02. Building products on top of the C API
If you’re embedding LLM inference in a desktop app, game, or CLI, llama.cpp’s C API is the cleanest path. Also exposes Python (llama-cpp-python) and Rust bindings.
03. Precision control over quantization
When you want to experiment with specific K-quants or custom imatrix files, working directly with llama.cpp gives control that higher-level tools hide.
Pricing & License
llama.cpp: MIT open source. Free.
Hardware cost: scales with the model size and quantization you pick. 7B at Q4 runs on 6GB RAM; 70B at Q4 needs ~40GB RAM or VRAM.
Operational cost: you own the deployment. For "just make it work", Ollama is easier. For "I need exactly this configuration", llama.cpp gives full control.
Related Assets on TokRepo
Pal MCP Server — Multi-Model AI Gateway for Claude Code
MCP server that lets Claude Code use Gemini, OpenAI, Grok, and Ollama as a unified AI dev team. Features model routing, CLI-to-CLI bridge, and conversation continuity across 7+ providers.
Ollama Model Library — Best AI Models for Local Use
Curated guide to the best models available on Ollama for coding, chat, and reasoning. Compare Llama, Mistral, Gemma, Phi, and Qwen models for local AI development.
Replicate — Run AI Models via Simple API Calls
Cloud platform to run open-source AI models with a simple API. Replicate hosts Llama, Stable Diffusion, Whisper, and thousands of models — no GPU setup or Docker required.
Open WebUI — Self-Hosted AI Chat Platform
Feature-rich, offline-capable AI interface for Ollama, OpenAI, and local LLMs. Built-in RAG, voice, model builder. 130K+ stars.
Frequently Asked Questions
llama.cpp vs Ollama — which do I need?
If you ask, Ollama. Ollama wraps llama.cpp with model management, OpenAI API polish, and cross-platform installers. Reach for raw llama.cpp when you need control Ollama doesn’t expose — custom quantization flags, unusual hardware, or embedding inference inside another product.
Is llama.cpp only for Llama models?
Despite the name, no. It runs Llama, Qwen, Mistral, Gemma, DeepSeek, Phi, Command-R, Yi, and dozens more — any architecture supported in the GGUF format. The community adds new architectures regularly.
How does performance compare to GPU-only engines?
On GPU, vLLM/TensorRT-LLM win on raw throughput at scale. For single-stream performance on a single machine, llama.cpp with CUDA is competitive. On Apple Silicon, llama.cpp Metal is close to MLX for most models; MLX pulls ahead on some architectures.
What about llama-cpp-python?
Python bindings to the C library. Useful when embedding inference in Python apps without spinning up an HTTP server. Maintained separately; lags the main repo by days to weeks.
Can llama.cpp fine-tune?
LoRA-style fine-tuning via llama.cpp’s finetune binary exists but is limited compared to PyTorch-based tools. For serious fine-tuning use Axolotl, Unsloth, or MLX-LM; llama.cpp is inference-first.
Is the GGUF format going away?
Not in the foreseeable future. Active and widely supported. Legacy GGML is deprecated — use GGUF for any new model conversion.