llama.cpp — The C++ Engine Under Ollama, LM Studio, and Most Local LLMs
llama.cpp is Georgi Gerganov’s MIT-licensed C/C++ implementation of LLM inference, the engine most local LLM tools build on. It supports CPU, CUDA, ROCm, Metal, and Vulkan backends, with aggressive quantization (2-8 bit) to fit almost any hardware.
Why llama.cpp
When Georgi Gerganov released llama.cpp in March 2023, it made running Llama on a MacBook a 20-line change instead of a research project. The implementation priorities — zero dependencies, aggressive quantization (2-8 bit), and maximum portability — turned out to define the local LLM category. Ollama, LM Studio, and LocalAI all wrap llama.cpp. If you install local LLM software in 2026, you’re probably running llama.cpp whether you know it or not.
Running llama.cpp directly gets you maximum control and minimum surface area. No runtime manager, no HTTP server by default (though one ships), no Python dependency. The trade: every convenience (model management, chat UIs, OpenAI compatibility) is work you either do yourself or delegate to a higher-level tool.
Reach for llama.cpp directly when: you need absolute control over quantization/flags, you’re deploying to unusual hardware (Raspberry Pi, older CPUs, mobile), or you’re building a product on top of llama.cpp’s C API. Otherwise, let Ollama or LM Studio wrap it for you.
Quick Start — Build, Download, Run
-hf downloads GGUF quantizations directly from Hugging Face (no manual conversion). llama-server exposes an OpenAI-compatible /v1/chat/completions endpoint plus a minimal web UI at / for interactive use. Quantization suffixes: Q4_K_M is a safe default; Q5_K_M for higher quality; Q8_0 for near-lossless but larger.
# 1. Clone and build (Metal on macOS, CUDA on NVIDIA, CPU everywhere)
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
# macOS Metal
cmake -B build -DGGML_METAL=ON
# NVIDIA CUDA
# cmake -B build -DGGML_CUDA=ON
# AMD ROCm
# cmake -B build -DGGML_HIP=ON
# CPU-only
# cmake -B build
cmake --build build --config Release -j
# 2. Grab a GGUF model from Hugging Face
# (e.g., Qwen 2.5 7B Instruct Q4_K_M)
./build/bin/llama-cli -hf Qwen/Qwen2.5-7B-Instruct-GGUF:Q4_K_M \
-p "Three facts about llama.cpp:" -n 200
# 3. OpenAI-compatible server
./build/bin/llama-server -hf Qwen/Qwen2.5-7B-Instruct-GGUF:Q4_K_M \
--host 0.0.0.0 --port 8080 -c 8192
# 4. Call it like OpenAI
python - <<'PY'
from openai import OpenAI
c = OpenAI(base_url="http://localhost:8080/v1", api_key="llama-cpp")
print(c.chat.completions.create(
model="qwen",
messages=[{"role":"user","content":"One-sentence llama.cpp fact."}],
).choices[0].message.content)
PY
Key Features
Zero Python dependency
Pure C++. No virtualenv, no pip conflicts, no CUDA-Python version matrix. Single binary distribution.
Multi-backend support
CPU (AVX2/AVX-512), CUDA (NVIDIA), ROCm (AMD), Metal (Apple), Vulkan (generic), SYCL (Intel), MUSA (Moore Threads). Compile for your hardware; layers that don’t fit on the accelerator fall back to the CPU backend.
Aggressive quantization
Q2/Q3/Q4/Q5/Q6/Q8 plus K-quants and I-quants. Shrinks 70B models to ~40GB, small enough for a single GPU or a 64GB-RAM machine. Quality/size curves are well-documented.
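As a rule of thumb, weight memory is roughly parameter count × bits per weight ÷ 8, plus overhead for the KV cache and activations. A minimal sketch (the bits-per-weight figures are approximations; real K-quant files mix block types, so treat these as estimates, not exact file sizes):

```python
def approx_weight_gb(n_params_b: float, bits_per_weight: float) -> float:
    """Rough GGUF weight size: parameters (billions) x bits per weight / 8, in GB."""
    return n_params_b * bits_per_weight / 8

# Approximate effective bits per weight for common quant types (assumed values).
BPW = {"Q4_K_M": 4.8, "Q5_K_M": 5.7, "Q8_0": 8.5, "F16": 16.0}

for quant, bpw in BPW.items():
    print(f"70B @ {quant}: ~{approx_weight_gb(70, bpw):.0f} GB")
```

This reproduces the ~40GB figure above for 70B at Q4_K_M, and shows why a 7B Q4 model fits comfortably in 6GB of RAM.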
GGUF format
Self-describing model format with metadata (architecture, prompt template, tokenizer). It replaced the older GGML format and is supported across the local LLM ecosystem.
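The GGUF header itself is simple to inspect: a little-endian `GGUF` magic, a uint32 version, then uint64 tensor and metadata-KV counts. A sketch using only the stdlib (header layout per the GGUF spec; the sample bytes here are synthetic, not a real model):

```python
import struct

def parse_gguf_header(data: bytes) -> dict:
    """Unpack the fixed 24-byte GGUF header: magic, version, tensor count, KV count."""
    magic, version, n_tensors, n_kv = struct.unpack_from("<4sIQQ", data, 0)
    if magic != b"GGUF":
        raise ValueError("not a GGUF file")
    return {"version": version, "tensors": n_tensors, "metadata_kv": n_kv}

# Synthetic header for illustration; a real file continues with metadata key/values.
sample = struct.pack("<4sIQQ", b"GGUF", 3, 291, 24)
print(parse_gguf_header(sample))
```

To inspect a real download, pass the first 24 bytes of the file: `parse_gguf_header(open(path, "rb").read(24))`.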
llama-server binary
OpenAI-compatible HTTP server shipped in the repo. Supports chat completions, completions, embeddings, streaming, cancellation, slot management, and a basic web UI.
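Because the endpoint follows the OpenAI chat-completions schema, any HTTP client works, not just the official SDK. A sketch of the JSON body `POST /v1/chat/completions` accepts (field values are illustrative; `stream` switches the response to server-sent events):

```python
import json

# Request body for POST http://localhost:8080/v1/chat/completions
body = {
    "model": "qwen",  # llama-server serves whatever model it loaded; this is informational
    "messages": [{"role": "user", "content": "One-sentence llama.cpp fact."}],
    "temperature": 0.7,
    "max_tokens": 128,
    "stream": True,  # SSE token chunks instead of one final JSON object
}
payload = json.dumps(body)
print(payload)
```

Send it with `curl`, `urllib.request`, or any HTTP library; with `"stream": True` you read `data:` lines until `data: [DONE]`.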
Embeddings + speech + vision
Ships with llama-embedding, llama-cli, llama-bench, and experimental multimodal support (LLaVA, MiniCPM-V). One codebase, many modes.
Comparison
| | Audience | Convenience | Control | Performance |
|---|---|---|---|---|
| llama.cpp (this tool) | Developers, integrators | Low (bring your own UX) | Maximum | Excellent on any hardware |
| Ollama | Developers + semi-technical | High | Medium | Good (llama.cpp backend) |
| LM Studio | End users | Very high (GUI) | Low-medium | Good (llama.cpp / MLX) |
| vLLM | Production ops | Medium-low | High for GPU | Best on GPU |
Use Cases
01. Embedded or unusual hardware
Raspberry Pi, older laptops, mobile, single-board computers. llama.cpp runs where Python + CUDA-based engines can’t.
02. Building products on top of the C API
If you’re embedding LLM inference in a desktop app, game, or CLI, llama.cpp’s C API is the cleanest path. Also exposes Python (llama-cpp-python) and Rust bindings.
03. Precision control over quantization
When you want to experiment with specific K-quants or custom imatrix files, working directly with llama.cpp gives control that higher-level tools hide.
Pricing & License
llama.cpp: MIT open source. Free.
Hardware cost: scales with the model size and quantization you pick. 7B at Q4 runs on 6GB RAM; 70B at Q4 needs ~40GB RAM or VRAM.
Operational cost: you own the deployment. For "just make it work", Ollama is easier. For "I need exactly this configuration", llama.cpp gives full control.
Related Assets on TokRepo
Pal MCP Server — Multi-Model AI Gateway for Claude Code
MCP server that lets Claude Code use Gemini, OpenAI, Grok, and Ollama as a unified AI dev team. Features model routing, CLI-to-CLI bridge, and conversation continuity across 7+ providers.
Ollama Model Library — Best AI Models for Local Use
Curated guide to the best models available on Ollama for coding, chat, and reasoning. Compare Llama, Mistral, Gemma, Phi, and Qwen models for local AI development.
Replicate — Run AI Models via Simple API Calls
Cloud platform to run open-source AI models with a simple API. Replicate hosts Llama, Stable Diffusion, Whisper, and thousands of models — no GPU setup or Docker required.
Open WebUI — Self-Hosted AI Chat Platform
Feature-rich, offline-capable AI interface for Ollama, OpenAI, and local LLMs. Built-in RAG, voice, model builder. 130K+ stars.
Frequently Asked Questions
llama.cpp vs Ollama — which do I need?
If you ask, Ollama. Ollama wraps llama.cpp with model management, OpenAI API polish, and cross-platform installers. Reach for raw llama.cpp when you need control Ollama doesn’t expose — custom quantization flags, unusual hardware, or embedding inference inside another product.
Is llama.cpp only for Llama models?
Despite the name, no. It runs Llama, Qwen, Mistral, Gemma, DeepSeek, Phi, Command-R, Yi, and dozens more — any architecture supported in the GGUF format. The community adds new architectures regularly.
How does performance compare to GPU-only engines?
On GPU, vLLM/TensorRT-LLM win on raw throughput at scale. For single-stream performance on a single machine, llama.cpp with CUDA is competitive. On Apple Silicon, llama.cpp Metal is close to MLX for most models; MLX pulls ahead on some architectures.
What about llama-cpp-python?
Python bindings to the C library. Useful when embedding inference in Python apps without spinning up an HTTP server. Maintained separately; lags the main repo by days to weeks.
Can llama.cpp fine-tune?
LoRA-style fine-tuning via llama.cpp’s finetune binary exists but is limited compared to PyTorch-based tools. For serious fine-tuning use Axolotl, Unsloth, or MLX-LM; llama.cpp is inference-first.
Is the GGUF format going away?
Not in the foreseeable future. Active and widely supported. Legacy GGML is deprecated — use GGUF for any new model conversion.