Skills2026年5月1日·1 分钟阅读

Shimmy — Python-Free Rust Inference Server for Local LLMs

Shimmy is a single-binary Rust inference server that serves GGUF and SafeTensors models via an OpenAI-compatible API, with hot model swapping and auto-discovery.

Agent 就绪

这个资产可以被 Agent 直接读取和安装

TokRepo 同时提供通用 CLI 命令、安装契约、metadata JSON、按适配器生成的安装计划和原始内容链接,方便 Agent 判断适配度、风险和下一步动作。

Needs Confirmation · 64/100策略:需确认
Agent 入口
任意 MCP/CLI Agent
类型
Skill
安装
Single
信任
信任等级:Established
入口
Shimmy Overview
通用 CLI 安装命令
npx tokrepo install 269ce92b-4558-11f1-9bc6-00163e2b0d79

Introduction

Shimmy eliminates the Python dependency chain from local LLM serving. It loads GGUF and SafeTensors model files directly, exposes an OpenAI-compatible HTTP API, and supports swapping models at runtime without restarting the server — all from a single Rust binary.

What Shimmy Does

  • Serves large language models locally via an OpenAI-compatible REST API
  • Loads GGUF and SafeTensors formats without requiring Python, PyTorch, or pip
  • Supports hot model swapping — load, unload, or switch models via API without downtime
  • Auto-discovers models in a configured directory and makes them available immediately
  • Ships as a single static binary with no external dependencies

Architecture Overview

Shimmy is written in Rust and uses the llama.cpp and candle libraries for inference. The HTTP server is built on Axum and exposes the standard /v1/chat/completions and /v1/completions endpoints. Model management runs in a dedicated thread that handles loading, unloading, and memory allocation. Quantized GGUF models run on CPU; GPU acceleration is available via CUDA and Metal backends.

Self-Hosting & Configuration

  • Download a prebuilt binary from GitHub releases — no build tools required
  • Place model files in a directory and point Shimmy at it with --model-dir
  • Configure listen address, port, and concurrency via command-line flags or environment variables
  • GPU acceleration enabled automatically when CUDA or Metal is detected
  • LoRA adapter loading supported for fine-tuned model variants

Key Features

  • Zero-dependency single binary — no Python, no pip, no conda
  • Hot model swap without server restart
  • OpenAI API-compatible endpoints for drop-in integration
  • Automatic model discovery from a watched directory
  • CPU and GPU inference with quantization support

Comparison with Similar Tools

  • Ollama — Go-based with its own model format; Shimmy uses standard GGUF/SafeTensors directly
  • llama.cpp server — C++ with manual setup; Shimmy wraps it in a polished Rust binary with hot swap
  • vLLM — Python-based, optimized for throughput; Shimmy targets simplicity and zero dependencies
  • LocalAI — Go-based with broad format support; Shimmy focuses on minimal footprint and fast startup

FAQ

Q: What hardware do I need? A: CPU inference works on any modern x86_64 or ARM machine. GPU acceleration requires CUDA or Apple Metal.

Q: Can I serve multiple models simultaneously? A: Yes. Shimmy can load multiple models and route requests based on the model name in the API call.

Q: Is the API fully OpenAI-compatible? A: It implements the chat completions and completions endpoints. Embeddings and other endpoints are planned.

Q: Does it support streaming responses? A: Yes. Server-sent events streaming is supported on the chat completions endpoint.

Sources

讨论

登录后参与讨论。
还没有评论,来写第一条吧。

相关资产