# Shimmy: Python-Free Rust Inference Server for Local LLMs

> Shimmy is a single-binary Rust inference server that serves GGUF and SafeTensors models via an OpenAI-compatible API, with hot model swapping and auto-discovery.

## Install

Save the content below to `.claude/skills/` or append to your `CLAUDE.md`:

## Quick Use

```bash
# Download the single binary
curl -fsSL https://github.com/Michael-A-Kuykendall/shimmy/releases/latest/download/shimmy-linux-x86_64 -o shimmy
chmod +x shimmy

# Start serving a model
./shimmy serve --model ./models/my-model.gguf

# Query via OpenAI-compatible API (send the body as JSON, not form data)
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"default","messages":[{"role":"user","content":"Hello"}]}'
```

## Introduction

Shimmy eliminates the Python dependency chain from local LLM serving. It loads GGUF and SafeTensors model files directly, exposes an OpenAI-compatible HTTP API, and supports swapping models at runtime without restarting the server, all from a single Rust binary.

## What Shimmy Does

- Serves large language models locally via an OpenAI-compatible REST API
- Loads GGUF and SafeTensors formats without requiring Python, PyTorch, or pip
- Supports hot model swapping: load, unload, or switch models via API without downtime
- Auto-discovers models in a configured directory and makes them available immediately
- Ships as a single static binary with no external dependencies

## Architecture Overview

Shimmy is written in Rust and uses the llama.cpp and candle libraries for inference. The HTTP server is built on Axum and exposes the standard `/v1/chat/completions` and `/v1/completions` endpoints. Model management runs in a dedicated thread that handles loading, unloading, and memory allocation. Quantized GGUF models run on CPU; GPU acceleration is available via CUDA and Metal backends.

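
Because the server speaks the OpenAI chat completions protocol over plain HTTP, any client that can POST JSON can talk to it. The following is a minimal sketch using only the Python standard library. It assumes the address `http://localhost:8080` and the model name `default` from the Quick Use example, and it assumes the response follows the usual OpenAI shape (`choices[0].message.content`). The function names `build_chat_request`, `extract_reply`, and `chat` are hypothetical helpers for illustration, not part of Shimmy.

```python
import json
import urllib.request

# Assumed from the Quick Use example; adjust to your own setup.
BASE_URL = "http://localhost:8080"


def build_chat_request(prompt, model="default", base_url=BASE_URL):
    """Build (but do not send) a POST request for /v1/chat/completions."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_reply(response_body):
    """Pull the assistant text out of an OpenAI-style chat completion body.

    Assumes the response has the shape choices[0].message.content.
    """
    data = json.loads(response_body)
    return data["choices"][0]["message"]["content"]


def chat(prompt, model="default", base_url=BASE_URL, timeout=60):
    """Send one user message and return the assistant's reply text."""
    request = build_chat_request(prompt, model, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return extract_reply(response.read())
```

With a server already running, `print(chat("Hello"))` would send one message and print the reply. Keeping request construction and response parsing in separate functions makes both easy to check without a live server.
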
## Self-Hosting & Configuration

- Download a prebuilt binary from GitHub releases; no build tools required
- Place model files in a directory and point Shimmy at it with `--model-dir`
- Configure listen address, port, and concurrency via command-line flags or environment variables
- GPU acceleration enabled automatically when CUDA or Metal is detected
- LoRA adapter loading supported for fine-tuned model variants

## Key Features

- Zero-dependency single binary: no Python, no pip, no conda
- Hot model swap without server restart
- OpenAI API-compatible endpoints for drop-in integration
- Automatic model discovery from a watched directory
- CPU and GPU inference with quantization support

## Comparison with Similar Tools

- **Ollama**: Go-based with its own model format; Shimmy uses standard GGUF/SafeTensors directly
- **llama.cpp server**: C++ with manual setup; Shimmy wraps it in a polished Rust binary with hot swap
- **vLLM**: Python-based, optimized for throughput; Shimmy targets simplicity and zero dependencies
- **LocalAI**: Go-based with broad format support; Shimmy focuses on minimal footprint and fast startup

## FAQ

**Q: What hardware do I need?**
A: CPU inference works on any modern x86_64 or ARM machine. GPU acceleration requires CUDA or Apple Metal.

**Q: Can I serve multiple models simultaneously?**
A: Yes. Shimmy can load multiple models and route requests based on the model name in the API call.

**Q: Is the API fully OpenAI-compatible?**
A: It implements the chat completions and completions endpoints. Embeddings and other endpoints are planned.

**Q: Does it support streaming responses?**
A: Yes. Server-sent events streaming is supported on the chat completions endpoint.

## Sources

- https://github.com/Michael-A-Kuykendall/shimmy

---

Source: https://tokrepo.com/en/workflows/shimmy-python-free-rust-inference-server-local-llms-269ce92b

Author: Script Depot