Esta página se muestra en inglés. Una traducción al español está en curso.

SkillsMar 31, 2026·2 min de lectura

llama.cpp — Run LLMs Locally in Pure C/C++

llama.cpp is a C/C++ LLM inference engine with 100K+ GitHub stars. Runs on CPU, Apple Silicon, NVIDIA, AMD GPUs. 1.5-8 bit quantization, no dependencies, supports 50+ model architectures. MIT licensed

Script Depot · Community

Listo para agents

Instalación lista para agent

Este activo puede instalarse después de elegir el runtime, revisar el plan y ejecutar el comando correspondiente.

Native · 98/100Política: permitir

Superficie agent

Cualquier agent MCP/CLI

Tipo

Skill

Instalación

Single

Confianza

Confianza: Established

Entrada

llama.cpp — Run LLMs Locally in Pure C/C++

Comando de instalación directa

npx -y tokrepo@latest install b2e0b71d-4b40-45c0-9609-bc5e2abe7c0f --target codex

Ejecutar después de confirmar el plan con dry-run.

TL;DR

llama.cpp runs large language models locally on CPU and GPU with aggressive quantization and zero external dependencies.

§01

What it is

llama.cpp is an open-source LLM inference engine written in C/C++. It runs large language models locally on consumer hardware including CPUs, Apple Silicon, NVIDIA GPUs, and AMD GPUs. The project supports 1.5-8 bit quantization, which compresses models small enough to run on laptops and desktops without dedicated AI hardware.

llama.cpp is designed for developers, researchers, and privacy-conscious users who want to run LLMs locally without sending data to cloud APIs.

§02

How it saves time or tokens

Cloud API calls cost money per token and introduce network latency. llama.cpp lets you run inference locally at zero marginal cost after the initial model download. For development, testing, and experimentation, this means unlimited generations without worrying about API bills. Quantization (GGUF format) compresses a 70B parameter model to run on a machine with 32GB RAM, making local inference practical on standard hardware.

§03

How to use

Install llama.cpp:

# Homebrew (macOS/Linux)
brew install llama.cpp

# Or build from source
git clone https://github.com/ggml-org/llama.cpp && cd llama.cpp
cmake -B build && cmake --build build --config Release

Download a GGUF model file from Hugging Face or convert one yourself.

Run inference:

llama-cli -m model.gguf -p 'Explain the observer pattern in two sentences' -n 128

Or start a local OpenAI-compatible API server:

llama-server -m model.gguf --port 8080

§04

Example

Using the OpenAI-compatible server with a Python client:

from openai import OpenAI

client = OpenAI(
    base_url='http://localhost:8080/v1',
    api_key='not-needed',
)

response = client.chat.completions.create(
    model='local-model',
    messages=[{'role': 'user', 'content': 'Write a haiku about compilers'}],
)
print(response.choices[0].message.content)

This works with any OpenAI-compatible client library, making it a drop-in replacement for cloud APIs during development.

§05

Related on TokRepo

Local LLM tools — Compare local inference frameworks
llama.cpp deep dive — Detailed llama.cpp guide and configurations

§06

Common pitfalls

Downloading a model that is too large for your RAM. A Q4_K_M quantized 70B model needs approximately 40GB RAM. Check model card memory requirements before downloading.
Not enabling GPU offloading when a GPU is available. Use -ngl 99 to offload all layers to GPU for significantly faster inference.
Using the wrong GGUF file for your hardware. Q4_K_M is a good default balance of quality and speed. Q2_K is smaller but noticeably lower quality. Q8_0 is highest quality but needs more RAM.
Starting with an overly complex configuration instead of defaults. Begin with the minimal setup, verify it works, then customize incrementally. This approach catches configuration errors early and keeps troubleshooting straightforward.

Preguntas frecuentes

What is GGUF format?+

GGUF (GPT-Generated Unified Format) is the model file format used by llama.cpp. It stores quantized model weights, tokenizer data, and metadata in a single file. GGUF replaced the older GGML format and supports more model architectures.

Which models can llama.cpp run?+

llama.cpp supports 50+ model architectures including Llama, Mistral, Phi, Gemma, Qwen, StarCoder, and many more. Any model that has been converted to GGUF format can be loaded. Hugging Face hosts thousands of pre-quantized GGUF models.

How does quantization affect model quality?+

Quantization reduces precision from 16-bit to 4-8 bits, which compresses model size by 2-4x. Quality loss is minimal at Q4_K_M and above. Below Q4, outputs may become noticeably less coherent. The tradeoff is between memory usage and generation quality.

Can llama.cpp use multiple GPUs?+

Yes. llama.cpp supports splitting model layers across multiple GPUs using the -ngl flag with layer assignments. This enables running larger models that do not fit in a single GPU's VRAM.

Is llama.cpp fast enough for production use?+

llama.cpp is suitable for production workloads with moderate concurrency. The server mode handles multiple concurrent requests. For high-throughput production serving, consider vLLM or TGI, which are optimized for batch processing and GPU utilization.

Referencias (3)

llama.cpp GitHub— llama.cpp is a C/C++ LLM inference engine
GGUF Specification— GGUF model format specification
llama.cpp Quantization Guide— Quantization methods and their quality tradeoffs

Relacionados en TokRepo

Local LLM tools llama.cpp guide Coding tools

🙏

Fuente y agradecimientos

Created by Georgi Gerganov. Licensed under MIT. ggml-org/llama.cpp — 100,000+ GitHub stars

Discusión

Inicia sesión para unirte a la discusión.

Aún no hay comentarios. Sé el primero en compartir tus ideas.

Activos relacionados

whisper.cpp — Local Speech-to-Text in Pure C/C++

High-performance port of OpenAI Whisper in C/C++. No Python, no GPU required. Runs on CPU, Apple Silicon, CUDA, and even Raspberry Pi. Real-time transcription.

代码Skills

Script Depot

Ollama — Run LLMs Locally

Run large language models locally on your machine. Supports Llama 3, Mistral, Gemma, Phi, and dozens more. One-command install, OpenAI-compatible API.

Skills

Script Depot

LocalAI — Run Any AI Model Locally, No GPU

LocalAI is an open-source AI engine running LLMs, vision, voice, and image models locally. 44.6K+ GitHub stars. OpenAI/Anthropic-compatible API, 35+ backends, MCP, agents. MIT licensed.

Skills

AI Open Source

Jan — Run AI Models Locally on Your Desktop

Open-source desktop app to run LLMs offline. Jan supports Llama, Mistral, and Gemma models with one-click download, OpenAI-compatible API, and full privacy.

Skills

Skill Factory