# llama.cpp — Run LLMs Locally in Pure C/C++

> llama.cpp is a C/C++ LLM inference engine with 100K+ GitHub stars. It runs on CPU, Apple Silicon, and NVIDIA/AMD GPUs, offers 1.5- to 8-bit quantization with no external dependencies, and supports 50+ model architectures. MIT licensed.

## Quick Use

```bash
# Install via Homebrew (macOS/Linux)
brew install llama.cpp

# Or build from source
git clone https://github.com/ggml-org/llama.cpp && cd llama.cpp
cmake -B build && cmake --build build --config Release

# Download a model and run
./build/bin/llama-cli -m model.gguf -p "Hello, I am" -n 128

# Or run as an OpenAI-compatible server
./build/bin/llama-server -m model.gguf --port 8080
```

---

## Intro

llama.cpp is a plain C/C++ implementation of LLM inference with zero dependencies, enabling efficient model execution across diverse hardware. With 100,000+ GitHub stars and an MIT license, it is the most popular local LLM inference engine.

llama.cpp supports Apple Silicon (Metal), NVIDIA (CUDA), AMD (HIP), Intel, Vulkan, and CPU inference. It provides 1.5- to 8-bit quantization for faster inference with smaller models, supports 50+ model architectures (LLaMA, Mistral, Qwen, Gemma, Phi, and more), and includes an OpenAI-compatible API server.
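Once `llama-server` is running, any OpenAI-style client can talk to it over HTTP. A minimal sketch with `curl`, assuming the server started above is listening on port 8080 (the `"model"` value is a placeholder; llama-server answers for whatever model it was launched with):

```shell
# Request body for the OpenAI-compatible chat completions endpoint.
# "local" is a placeholder model name, not a real registry entry.
REQUEST='{
  "model": "local",
  "messages": [{"role": "user", "content": "Hello, who are you?"}],
  "max_tokens": 64
}'

# POST it to the local server (assumes llama-server on port 8080).
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d "$REQUEST"
```

Because the endpoint follows the OpenAI schema, existing SDKs work too: point their base URL at `http://localhost:8080/v1` and use any placeholder API key.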
**Best for**: Developers running LLMs locally on any hardware without cloud dependencies
**Works with**: Claude Code, OpenAI Codex, Cursor, Gemini CLI, Windsurf
**Hardware**: CPU, Apple Silicon, NVIDIA, AMD, Intel, Vulkan, RISC-V

---

## Key Features

- **Zero dependencies**: Pure C/C++ with no external libraries required
- **Universal hardware**: CPU, Apple Metal, CUDA, HIP, Vulkan, SYCL, RISC-V
- **Multi-bit quantization**: 1.5- to 8-bit for a speed/quality tradeoff
- **50+ model architectures**: LLaMA, Mistral, Qwen, Gemma, Phi, multimodal models
- **OpenAI-compatible server**: Drop-in replacement for local inference
- **CPU+GPU hybrid**: Split models across CPU and GPU memory
- **GGUF format**: Standard model format used across the ecosystem

---

## FAQ

**Q: What is llama.cpp?**
A: llama.cpp is a C/C++ LLM inference engine with 100K+ stars. Zero dependencies, runs on any hardware (CPU, Apple Silicon, NVIDIA, AMD), supports 50+ model architectures with 1.5- to 8-bit quantization. MIT licensed.

**Q: How do I install llama.cpp?**
A: `brew install llama.cpp` on macOS/Linux, or build from source with `cmake -B build && cmake --build build`. Download GGUF models from Hugging Face.

---

## Source & Thanks

> Created by [Georgi Gerganov](https://github.com/ggml-org). Licensed under MIT.
> [ggml-org/llama.cpp](https://github.com/ggml-org/llama.cpp) — 100,000+ GitHub stars

---

Source: https://tokrepo.com/en/workflows/b2e0b71d-4b40-45c0-9609-bc5e2abe7c0f
Author: TokRepo Picks
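The quantization tradeoff listed above is easy to quantify: model weight memory scales roughly linearly with bit width. A back-of-the-envelope sketch for a 7B-parameter model (approximate only; real GGUF files add metadata and keep some layers at higher precision):

```shell
# Approximate weight memory for a 7B-parameter model at several bit widths.
# bytes = parameters * bits / 8; integer GB shown, so 3.5 GB prints as 3.
params=7000000000
for bits in 16 8 4; do
  bytes=$(( params * bits / 8 ))
  echo "${bits}-bit: $(( bytes / 1000000000 )) GB"
done
# → 16-bit: 14 GB, 8-bit: 7 GB, 4-bit: 3 GB
```

This is why a 4-bit 7B model fits on consumer GPUs with 8 GB of VRAM while the FP16 original does not, and why the CPU+GPU hybrid mode exists for models that still exceed GPU memory.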