Scripts · Mar 31, 2026 · 2 min read

llama.cpp — Run LLMs Locally in Pure C/C++

llama.cpp is a C/C++ LLM inference engine with 100K+ GitHub stars. It runs on CPU, Apple Silicon, and NVIDIA and AMD GPUs, offers 1.5- to 8-bit quantization, has no dependencies, and supports 50+ model architectures. MIT licensed.

TokRepo Featured · Community
Quick Use

Use it first, then decide how deep to go

This block gives both you and your coding agent the commands to start with: what to install, what to download, and how to run it.

# Install via Homebrew (macOS/Linux)
brew install llama.cpp

# Or build from source
git clone https://github.com/ggml-org/llama.cpp && cd llama.cpp
cmake -B build && cmake --build build --config Release

# Download a model and run
./build/bin/llama-cli -m model.gguf -p "Hello, I am" -n 128

# Or run as OpenAI-compatible server
./build/bin/llama-server -m model.gguf --port 8080
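Once llama-server is up, any OpenAI-compatible client can talk to it. A minimal sketch with curl, assuming the server started above is listening on localhost:8080 (the "model" field is a placeholder; llama-server serves whichever model it was launched with):

```shell
# Send a chat completion request to the local llama-server
# (OpenAI-compatible /v1/chat/completions endpoint)
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "local",
    "messages": [{"role": "user", "content": "Say hello in one sentence."}],
    "max_tokens": 64
  }'
```

Because the API shape matches OpenAI's, existing SDKs work by pointing their base URL at the local server.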

Intro

llama.cpp is a plain C/C++ implementation of LLM inference with zero dependencies, enabling efficient model execution across diverse hardware. With 100,000+ GitHub stars and MIT license, it is the most popular local LLM inference engine. llama.cpp supports Apple Silicon (Metal), NVIDIA (CUDA), AMD (HIP), Intel, Vulkan, and CPU inference. It provides 1.5-8 bit quantization for faster inference with smaller models, supports 50+ model architectures (LLaMA, Mistral, Qwen, Gemma, Phi, and more), and includes an OpenAI-compatible API server.
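Quantization is done offline with the llama-quantize tool that the CMake build produces alongside llama-cli. A sketch, assuming a full-precision GGUF file named model-f16.gguf (filename is illustrative):

```shell
# Quantize a full-precision GGUF down to 4-bit (Q4_K_M); the last
# argument selects the quantization type.
./build/bin/llama-quantize model-f16.gguf model-Q4_K_M.gguf Q4_K_M

# Run the quantized model exactly like the original
./build/bin/llama-cli -m model-Q4_K_M.gguf -p "Hello, I am" -n 128
```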

Best for: Developers running LLMs locally on any hardware without cloud dependencies
Works with: Claude Code, OpenAI Codex, Cursor, Gemini CLI, Windsurf
Hardware: CPU, Apple Silicon, NVIDIA, AMD, Intel, Vulkan, RISC-V


Key Features

  • Zero dependencies: Pure C/C++ with no external libraries required
  • Universal hardware: CPU, Apple Metal, CUDA, HIP, Vulkan, SYCL, RISC-V
  • Multi-bit quantization: 1.5 to 8-bit for speed/quality tradeoff
  • 50+ model architectures: LLaMA, Mistral, Qwen, Gemma, Phi, multimodal models
  • OpenAI-compatible server: Drop-in replacement for local inference
  • CPU+GPU hybrid: Split models across CPU and GPU memory
  • GGUF format: Standard model format used across the ecosystem
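Back-of-the-envelope arithmetic shows why the quantization feature matters: weight memory scales linearly with bits per weight. A rough sketch (illustrative only; it ignores the KV cache and runtime overhead):

```shell
# Rough weight-memory estimate: params * bits-per-weight / 8 bytes.
# Not an official llama.cpp formula, just illustrative arithmetic.
params=7000000000      # a 7B-parameter model
bits=4                 # e.g. a 4-bit quantization
bytes=$(( params * bits / 8 ))
gib=$(( bytes / 1024 / 1024 / 1024 ))
echo "~${gib} GiB for weights"   # plus KV cache and overhead
```

At 4 bits a 7B model's weights fit in roughly 3.5 GB, versus ~14 GB at 16-bit, which is the difference between fitting in a laptop's memory or not.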

FAQ

Q: What is llama.cpp? A: llama.cpp is a C/C++ LLM inference engine with 100K+ stars. Zero dependencies, runs on any hardware (CPU, Apple Silicon, NVIDIA, AMD), supports 50+ model architectures with 1.5-8 bit quantization. MIT licensed.

Q: How do I install llama.cpp? A: brew install llama.cpp on macOS/Linux, or build from source with cmake -B build && cmake --build build. Download GGUF models from Hugging Face.
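Recent llama-cli builds can also fetch GGUF models from Hugging Face directly via the -hf flag, so no manual download step is needed (the repo name below is an example; browse Hugging Face for available GGUF repos):

```shell
# Download (and cache) a GGUF model from Hugging Face, then chat with it.
# -hf takes <user>/<repo>[:quant]; the repo shown is illustrative.
./build/bin/llama-cli -hf ggml-org/gemma-3-1b-it-GGUF
```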


Source & Thanks

Created by Georgi Gerganov. Licensed under MIT. ggml-org/llama.cpp — 100,000+ GitHub stars.
