# KoboldCpp — Single-File Local LLM Inference Engine

> KoboldCpp is a self-contained local LLM inference engine that runs GGUF models with GPU acceleration on consumer hardware, providing an OpenAI-compatible API and a built-in web UI without requiring Python or complex setup.

## Quick Use

```bash
# Download the single executable for your platform from GitHub Releases
# Windows: koboldcpp.exe, Linux: koboldcpp-linux-x64
./koboldcpp --model ./models/llama-3-8b.Q4_K_M.gguf --port 5001
# Web UI at http://localhost:5001
# OpenAI-compatible API at http://localhost:5001/v1
```

## Introduction

KoboldCpp is a local LLM inference engine that runs GGUF-format models from a single executable file. It supports CPU, CUDA, Vulkan, and Metal acceleration, provides both a built-in chat UI and an OpenAI-compatible API, and requires no Python environment or package management: download the binary, point it at a model, and run.

## What KoboldCpp Does

- Runs any GGUF-format language model locally with CPU or GPU acceleration
- Provides an OpenAI-compatible API endpoint for integration with other tools
- Includes a built-in web UI for chat, story writing, and instruct-mode interactions
- Supports context sizes up to 128K tokens with flash attention and quantized KV cache
- Offers model layer splitting across CPU and GPU for partial offloading

## Architecture Overview

KoboldCpp is a C/C++ application built on top of llama.cpp's inference backend, extended with a ConcurrentLib wrapper for multi-request handling. It compiles to a single binary embedding an HTTP server (based on CivetWeb), the llama.cpp GGML runtime, and a bundled web UI. GPU backends are selected at compile time or via runtime flags.

## Self-Hosting & Configuration

- Download a pre-built binary; no installation or dependencies required
- Launch with `--model` pointing to any GGUF file and an optional `--gpulayers` for GPU offload (see the launch sketch after the FAQ below)
- Configure context size (`--contextsize`), batch size, and threading via CLI flags
- Use `--launch` to auto-open the web UI in your default browser
- Expose it as an API server behind a reverse proxy for multi-user access

## Key Features

- True single-file deployment: one binary, no Python, no pip, no Docker required
- Multi-backend GPU support: CUDA, Vulkan, CLBlast, Metal, and CPU fallback
- OpenAI-compatible API for drop-in use with existing tools and libraries
- Streaming text generation with configurable samplers (temperature, top-k, top-p, mirostat)
- Smart context management with automatic prompt caching for faster re-generation

## Comparison with Similar Tools

- **llama.cpp server** — minimal API; KoboldCpp adds a full web UI and advanced sampling options
- **Ollama** — easier model management; KoboldCpp offers finer control over inference parameters
- **LM Studio** — proprietary GUI; KoboldCpp is fully open-source and scriptable
- **vLLM** — production multi-GPU serving; KoboldCpp targets single-machine consumer hardware
- **llamafile** — similar single-file concept; KoboldCpp has a richer UI and more sampler options

## FAQ

**Q: What model formats does KoboldCpp support?**
A: GGUF format exclusively. Convert other formats using llama.cpp's conversion tools.

**Q: Can I run it without a GPU?**
A: Yes. CPU-only mode works but is slower. Partial GPU offload (`--gpulayers`) improves speed.

**Q: Is the API compatible with OpenAI client libraries?**
A: Yes, the `/v1/chat/completions` and `/v1/completions` endpoints follow the OpenAI spec.
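
Tying that answer to the configuration flags above, here is a minimal sketch of launching a model with partial GPU offload and then querying the OpenAI-compatible endpoint with curl. The offload count, context size, and JSON payload values are illustrative assumptions rather than required settings, and the `model` field is a placeholder name for a single-model server:

```bash
# Launch with partial GPU offload and a larger context window
# (values are illustrative; tune --gpulayers to your VRAM)
./koboldcpp --model ./models/llama-3-8b.Q4_K_M.gguf \
  --gpulayers 32 --contextsize 8192 --port 5001

# Once the server is up, send an OpenAI-style chat request
curl http://localhost:5001/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "local",
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        "max_tokens": 64,
        "temperature": 0.7
      }'
```

The same request shape works from any OpenAI client library by pointing its base URL at `http://localhost:5001/v1`.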
**Q: How much RAM do I need?**
A: It depends on model size and quantization. A 7B Q4 model needs about 4-6 GB; a 70B Q4 needs 35-40 GB.
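
As a rough sanity check on those figures, weight memory can be estimated as parameters × bits-per-weight / 8. The ~4.5 bits per parameter used below is an assumed approximation for Q4-style quantization, and the KV cache plus runtime overhead add more on top:

```bash
# Back-of-envelope weight-memory estimate for a Q4-quantized model.
# Assumes ~4.5 bits per parameter (approximate for Q4-style quants);
# KV cache and runtime overhead are not included.
estimate_gb() {
  local params_b=$1                       # parameter count in billions
  echo "scale=1; $params_b * 4.5 / 8" | bc
}
estimate_gb 7    # -> 3.9  (consistent with the 4-6 GB figure once overhead is added)
estimate_gb 70   # -> 39.3 (consistent with the 35-40 GB figure)
```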