
Shimmy — Python-Free Rust Inference Server for Local LLMs

Shimmy is a single-binary Rust inference server that serves GGUF and SafeTensors models via an OpenAI-compatible API, with hot model swapping and auto-discovery.

Introduction

Shimmy eliminates the Python dependency chain from local LLM serving. It loads GGUF and SafeTensors model files directly, exposes an OpenAI-compatible HTTP API, and supports swapping models at runtime without restarting the server — all from a single Rust binary.
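
To make that concrete, the sketch below sends a chat completion request to a local Shimmy instance from Rust using the reqwest and serde_json crates. The listen address, port, and model name are placeholders rather than documented defaults; substitute whatever your instance actually uses.

    // Minimal chat-completions client sketch. Address, port, and model name
    // are assumptions; replace them with your own Shimmy configuration.
    use serde_json::json;

    fn main() -> Result<(), Box<dyn std::error::Error>> {
        let client = reqwest::blocking::Client::new();

        // Standard OpenAI-style request body; Shimmy routes on the "model" field.
        let body = json!({
            "model": "llama-3.2-1b-instruct",   // hypothetical model name
            "messages": [
                {"role": "user", "content": "Say hello in one sentence."}
            ]
        });

        let resp: serde_json::Value = client
            .post("http://localhost:11435/v1/chat/completions") // assumed address/port
            .json(&body)
            .send()?
            .json()?;

        // Print the assistant's reply from the first choice.
        println!("{}", resp["choices"][0]["message"]["content"]);
        Ok(())
    }

Because the request and response shapes follow the OpenAI convention, existing OpenAI client libraries can usually be pointed at Shimmy by changing only the base URL.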

What Shimmy Does

  • Serves large language models locally via an OpenAI-compatible REST API
  • Loads GGUF and SafeTensors formats without requiring Python, PyTorch, or pip
  • Supports hot model swapping — load, unload, or switch models via API without downtime
  • Auto-discovers models in a configured directory and makes them available immediately
  • Ships as a single static binary with no external dependencies

Architecture Overview

Shimmy is written in Rust and uses the llama.cpp and candle libraries for inference. The HTTP server is built on Axum and exposes the standard /v1/chat/completions and /v1/completions endpoints. Model management runs in a dedicated thread that handles loading, unloading, and memory allocation. Quantized GGUF models run on CPU; GPU acceleration is available via CUDA and Metal backends.
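
As a rough illustration of that layering (a simplified sketch, not Shimmy's actual source), an Axum application exposing the chat completions route might look like the following; the handler body, response fields, and bind address are hypothetical.

    // Illustrative Axum sketch of the HTTP layer described above.
    // This is not Shimmy's real code; handler logic and the port are hypothetical.
    use axum::{routing::post, Json, Router};
    use serde_json::{json, Value};

    async fn chat_completions(Json(req): Json<Value>) -> Json<Value> {
        // A real handler would resolve the model named in req["model"], hand the
        // prompt to the inference backend (llama.cpp or candle), and return an
        // OpenAI-shaped response (or stream it as server-sent events).
        Json(json!({
            "object": "chat.completion",
            "model": req["model"],
            "choices": [{
                "index": 0,
                "message": {"role": "assistant", "content": "(generated text)"},
                "finish_reason": "stop"
            }]
        }))
    }

    #[tokio::main]
    async fn main() {
        let app = Router::new().route("/v1/chat/completions", post(chat_completions));
        let listener = tokio::net::TcpListener::bind("127.0.0.1:11435").await.unwrap();
        axum::serve(listener, app).await.unwrap();
    }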

Self-Hosting & Configuration

  • Download a prebuilt binary from GitHub releases — no build tools required
  • Place model files in a directory and point Shimmy at it with --model-dir
  • Configure listen address, port, and concurrency via command-line flags or environment variables
  • GPU acceleration enabled automatically when CUDA or Metal is detected
  • LoRA adapter loading supported for fine-tuned model variants

Key Features

  • Zero-dependency single binary — no Python, no pip, no conda
  • Hot model swap without server restart
  • OpenAI API-compatible endpoints for drop-in integration
  • Automatic model discovery from a watched directory
  • CPU and GPU inference with quantization support

Comparison with Similar Tools

  • Ollama — Go-based with its own model format; Shimmy uses standard GGUF/SafeTensors directly
  • llama.cpp server — C++ with manual setup; Shimmy wraps it in a polished Rust binary with hot swap
  • vLLM — Python-based, optimized for throughput; Shimmy targets simplicity and zero dependencies
  • LocalAI — Go-based with broad format support; Shimmy focuses on minimal footprint and fast startup

FAQ

Q: What hardware do I need? A: CPU inference works on any modern x86_64 or ARM machine. GPU acceleration requires CUDA or Apple Metal.

Q: Can I serve multiple models simultaneously? A: Yes. Shimmy can load multiple models and route requests based on the model name in the API call.

Q: Is the API fully OpenAI-compatible? A: It implements the chat completions and completions endpoints. Embeddings and other endpoints are planned.

Q: Does it support streaming responses? A: Yes. Server-sent events streaming is supported on the chat completions endpoint.
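
To show what the streaming mode looks like on the wire, the sketch below sets "stream": true and reads the server-sent data: lines as they arrive, following the usual OpenAI chunk format. As before, the address, port, and model name are assumptions.

    // Streaming sketch: consume SSE "data:" lines from the chat completions
    // endpoint. Address, port, and model name are assumptions.
    use std::io::{BufRead, BufReader};

    fn main() -> Result<(), Box<dyn std::error::Error>> {
        let client = reqwest::blocking::Client::new();
        let body = serde_json::json!({
            "model": "llama-3.2-1b-instruct",   // hypothetical model name
            "stream": true,
            "messages": [{"role": "user", "content": "Count to five."}]
        });

        let resp = client
            .post("http://localhost:11435/v1/chat/completions") // assumed address/port
            .json(&body)
            .send()?;

        // In the OpenAI convention each event is a "data: {json}" line,
        // terminated by "data: [DONE]".
        for line in BufReader::new(resp).lines() {
            let line = line?;
            if let Some(payload) = line.strip_prefix("data: ") {
                if payload == "[DONE]" {
                    break;
                }
                let chunk: serde_json::Value = serde_json::from_str(payload)?;
                if let Some(token) = chunk["choices"][0]["delta"]["content"].as_str() {
                    print!("{token}");
                }
            }
        }
        println!();
        Ok(())
    }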
