# Shimmy — Python-Free Rust Inference Server for Local LLMs

> Shimmy is a single-binary Rust inference server that serves GGUF and SafeTensors models via an OpenAI-compatible API, with hot model swapping and auto-discovery.

## Install

Save the content below to `.claude/skills/` or append to your `CLAUDE.md`:


## Quick Use
```bash
# Download the single binary
curl -fsSL https://github.com/Michael-A-Kuykendall/shimmy/releases/latest/download/shimmy-linux-x86_64 -o shimmy
chmod +x shimmy
# Start serving a model
./shimmy serve --model ./models/my-model.gguf
# Query via OpenAI-compatible API
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"default","messages":[{"role":"user","content":"Hello"}]}'
```

## Introduction
Shimmy eliminates the Python dependency chain from local LLM serving. It loads GGUF and SafeTensors model files directly, exposes an OpenAI-compatible HTTP API, and supports swapping models at runtime without restarting the server — all from a single Rust binary.

## What Shimmy Does
- Serves large language models locally via an OpenAI-compatible REST API
- Loads GGUF and SafeTensors formats without requiring Python, PyTorch, or pip
- Supports hot model swapping — load, unload, or switch models via API without downtime
- Auto-discovers models in a configured directory and makes them available immediately
- Ships as a single static binary with no external dependencies

## Architecture Overview
Shimmy is written in Rust and uses the llama.cpp and candle libraries for inference. The HTTP server is built on Axum and exposes the standard `/v1/chat/completions` and `/v1/completions` endpoints. Model management runs in a dedicated thread that handles loading, unloading, and memory allocation. Quantized GGUF models run on CPU; GPU acceleration is available via CUDA and Metal backends.
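
The Quick Use example covers `/v1/chat/completions`; as a rough sketch of the other endpoint, the helper below builds a request body for `/v1/completions`. The field names (`model`, `prompt`, `max_tokens`) follow the OpenAI API convention, and it is an assumption that Shimmy honors `max_tokens`. `completion_body` is a hypothetical helper, not part of Shimmy, and the host and port are taken from the Quick Use example.

```bash
# Hypothetical helper; not part of Shimmy. Arguments are inserted verbatim,
# so they must not contain double quotes or backslashes.
# Usage: completion_body MODEL PROMPT MAX_TOKENS
completion_body() {
  printf '{"model":"%s","prompt":"%s","max_tokens":%d}' "$1" "$2" "$3"
}

# Example (assumes the server from Quick Use is running):
#   curl http://localhost:8080/v1/completions \
#     -H "Content-Type: application/json" \
#     -d "$(completion_body default 'Once upon a time' 64)"
```

Unlike the chat endpoint, which takes a list of role-tagged messages, this body carries a single raw prompt string.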

## Self-Hosting & Configuration
- Download a prebuilt binary from GitHub releases — no build tools required
- Place model files in a directory and point Shimmy at it with `--model-dir`
- Configure listen address, port, and concurrency via command-line flags or environment variables
- GPU acceleration is enabled automatically when CUDA or Metal is detected
- LoRA adapter loading is supported for fine-tuned model variants
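
Before pointing Shimmy at a directory, it can help to check which model files the directory actually holds. The sketch below is an illustrative pre-flight check, not Shimmy's own discovery logic: `list_model_files` is a hypothetical helper that lists GGUF and SafeTensors files found directly inside a directory. The launch command in the comment assumes `--model-dir` is accepted by the `serve` subcommand.

```bash
# Illustrative pre-flight check; this is NOT Shimmy's own discovery logic.
# Prints the .gguf and .safetensors files found directly inside a directory.
list_model_files() {
  local dir="$1" f
  for f in "$dir"/*.gguf "$dir"/*.safetensors; do
    # An unmatched glob stays literal, so skip anything that is not a file.
    [ -f "$f" ] && basename "$f"
  done
  return 0
}

# Example (assumes --model-dir works with the serve subcommand):
#   list_model_files ./models
#   ./shimmy serve --model-dir ./models
```

The check does not descend into subdirectories; whether Shimmy itself scans recursively is not stated here.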

## Key Features
- Zero-dependency single binary — no Python, no pip, no conda
- Hot model swap without server restart
- OpenAI API-compatible endpoints for drop-in integration
- Automatic model discovery from a watched directory
- CPU and GPU inference with quantization support
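
For drop-in integration, one common approach is to redirect an existing OpenAI client through environment variables. The official OpenAI SDKs read `OPENAI_BASE_URL` and `OPENAI_API_KEY`; the sketch below assumes Shimmy listens on `localhost:8080` as in the Quick Use example, and that a placeholder key is enough for a local server, which is not confirmed here.

```bash
# Assumes Shimmy listens on localhost:8080, as in the Quick Use example.
export OPENAI_BASE_URL="http://localhost:8080/v1"
# The SDKs expect a key to be set; a placeholder value is assumed to suffice locally.
export OPENAI_API_KEY="placeholder"
```

Clients started from the same shell should then send their requests to the local server instead of the hosted API.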

## Comparison with Similar Tools
- **Ollama**: Go-based with its own model registry and packaging; Shimmy loads standard GGUF/SafeTensors files directly
- **llama.cpp server**: C++ with manual setup; Shimmy wraps llama.cpp in a single Rust binary and adds hot swap
- **vLLM**: Python-based, optimized for throughput; Shimmy targets simplicity and zero dependencies
- **LocalAI**: Go-based with broad format support; Shimmy focuses on minimal footprint and fast startup

## FAQ
**Q: What hardware do I need?**
A: CPU inference works on any modern x86_64 or ARM machine. GPU acceleration requires CUDA or Apple Metal.

**Q: Can I serve multiple models simultaneously?**
A: Yes. Shimmy can load multiple models and route requests based on the model name in the API call.
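
As a minimal sketch of name-based routing, the helper below builds chat request bodies that differ only in the `model` field. `chat_body` is a hypothetical helper, and `model-a` and `model-b` are placeholder names; how Shimmy derives model names from file names is not covered here.

```bash
# Hypothetical helper; not part of Shimmy. Arguments are inserted verbatim,
# so they must not contain double quotes or backslashes.
# Usage: chat_body MODEL USER_MESSAGE
chat_body() {
  printf '{"model":"%s","messages":[{"role":"user","content":"%s"}]}' "$1" "$2"
}

# Example: send the same prompt to two loaded models (placeholder names).
#   for m in model-a model-b; do
#     curl -s http://localhost:8080/v1/chat/completions \
#       -H "Content-Type: application/json" \
#       -d "$(chat_body "$m" 'Hello')"
#   done
```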

**Q: Is the API fully OpenAI-compatible?**
A: It implements the chat completions and completions endpoints. Embeddings and other endpoints are planned.

**Q: Does it support streaming responses?**
A: Yes. Server-sent events streaming is supported on the chat completions endpoint.
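
A streamed response can be consumed from the shell roughly as follows. This sketch assumes the OpenAI streaming convention: the request sets `"stream": true`, each event arrives as a `data: {...}` line, and the stream ends with `data: [DONE]`. `sse_payloads` is a hypothetical helper, not part of Shimmy.

```bash
# Hypothetical helper: reads an SSE stream on stdin and prints each JSON
# payload, stopping at the OpenAI-style "[DONE]" sentinel.
sse_payloads() {
  local line cr
  cr="$(printf '\r')"
  while IFS= read -r line || [ -n "$line" ]; do
    line="${line%"$cr"}"   # tolerate CRLF line endings
    case "$line" in
      "data: [DONE]") break ;;
      "data: "*)      printf '%s\n' "${line#data: }" ;;
    esac
  done
}

# Example (-N disables curl's output buffering so chunks print as they arrive):
#   curl -sN http://localhost:8080/v1/chat/completions \
#     -H "Content-Type: application/json" \
#     -d '{"model":"default","stream":true,"messages":[{"role":"user","content":"Hello"}]}' \
#     | sse_payloads
```

Blank separator lines and any non-`data:` fields are skipped, so the output is one JSON chunk per line.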

## Sources
- https://github.com/Michael-A-Kuykendall/shimmy

---
Source: https://tokrepo.com/en/workflows/shimmy-python-free-rust-inference-server-local-llms-269ce92b
Author: Script Depot