Cette page est affichée en anglais. Une traduction française est en cours.
SkillsMay 4, 2026·3 min de lecture

KoboldCpp — Single-File Local LLM Inference Engine

KoboldCpp is a self-contained local LLM inference engine that runs GGUF models with GPU acceleration on consumer hardware, providing an OpenAI-compatible API and built-in web UI without requiring Python or complex setup.

Prêt pour agents

Cet actif peut être lu et installé directement par les agents

TokRepo expose une commande CLI universelle, un contrat d'installation, le metadata JSON, un plan selon l'adaptateur et le contenu raw pour aider les agents à juger l'adaptation, le risque et les prochaines actions.

Native · 98/100Policy : autoriser
Surface agent
Tout agent MCP/CLI
Type
Skill
Installation
Single
Confiance
Confiance : Established
Point d'entrée
KoboldCpp LLM Engine
Commande CLI universelle
npx tokrepo install f0ec1009-4771-11f1-9bc6-00163e2b0d79

Introduction

KoboldCpp is a local LLM inference engine that runs GGUF-format models with a single executable file. It supports CPU, CUDA, Vulkan, and Metal acceleration, provides both a built-in chat UI and an OpenAI-compatible API, and requires no Python environment or package management—download, point at a model, and run.

What KoboldCpp Does

  • Runs any GGUF-format language model locally with CPU or GPU acceleration
  • Provides an OpenAI-compatible API endpoint for integration with other tools
  • Includes a built-in web UI for chat, story writing, and instruct-mode interactions
  • Supports context sizes up to 128K tokens with flash attention and quantized KV cache
  • Offers model layer splitting across CPU and GPU for partial offloading

Architecture Overview

KoboldCpp is a C/C++ application built on top of llama.cpp's inference backend, extended with a ConcurrentLib wrapper for multi-request handling. It compiles to a single binary embedding an HTTP server (based on CivetWeb), the llama.cpp GGML runtime, and a bundled web UI. GPU backends are selected at compile time or via runtime flags.

Self-Hosting & Configuration

  • Download a pre-built binary—no installation or dependencies required
  • Launch with --model pointing to any GGUF file and optional --gpulayers for GPU offload
  • Configure context size (--contextsize), batch size, and threading via CLI flags
  • Use --launch to auto-open the web UI in your default browser
  • Expose as an API server behind a reverse proxy for multi-user access

Key Features

  • True single-file deployment: one binary, no Python, no pip, no Docker required
  • Multi-backend GPU support: CUDA, Vulkan, CLBlast, Metal, and CPU fallback
  • OpenAI-compatible API for drop-in use with existing tools and libraries
  • Streaming text generation with configurable samplers (temperature, top-k, top-p, mirostat)
  • Smart context management with automatic prompt caching for faster re-generation

Comparison with Similar Tools

  • llama.cpp server — minimal API; KoboldCpp adds a full web UI and advanced sampling options
  • Ollama — easier model management; KoboldCpp offers finer control over inference parameters
  • LM Studio — proprietary GUI; KoboldCpp is fully open-source and scriptable
  • vLLM — production multi-GPU serving; KoboldCpp targets single-machine consumer hardware
  • llamafile — similar single-file concept; KoboldCpp has a richer UI and more sampler options

FAQ

Q: What model formats does KoboldCpp support? A: GGUF format exclusively. Convert other formats using llama.cpp's conversion tools.

Q: Can I run it without a GPU? A: Yes. CPU-only mode works but is slower. Partial GPU offload (--gpulayers) improves speed.

Q: Is the API compatible with OpenAI client libraries? A: Yes, the /v1/chat/completions and /v1/completions endpoints follow the OpenAI spec.

Q: How much RAM do I need? A: Depends on model size and quantization. A 7B Q4 model needs about 4-6 GB; a 70B Q4 needs 35-40 GB.

Sources

Fil de discussion

Connectez-vous pour rejoindre la discussion.
Aucun commentaire pour l'instant. Soyez le premier à partager votre avis.

Actifs similaires