# mistral.rs — Blazingly Fast LLM Inference in Rust

> mistral.rs is a cross-platform LLM inference engine written in Rust that supports 40+ model families including text, vision, and speech. It provides OpenAI-compatible APIs, quantization, PagedAttention, and both Rust and Python SDKs.

## Quick Use

```bash
# Install from crates.io
cargo install mistralrs-server

# Or run with Docker (CUDA)
docker run -p 1234:1234 ghcr.io/ericlbuehler/mistral.rs:latest \
  --port 1234 --isq Q4K plain -m meta-llama/Meta-Llama-3-8B-Instruct

# Query the OpenAI-compatible API
curl http://localhost:1234/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "default", "messages": [{"role": "user", "content": "Hello"}]}'
```

## Introduction

mistral.rs is a fast, flexible LLM inference engine created by Eric Buehler. Written in Rust for performance and memory safety, it supports a broad range of model architectures out of the box and provides an OpenAI-compatible HTTP server, making it a drop-in replacement for local inference in AI application stacks.

## What mistral.rs Does

- Serves LLM inference through OpenAI-compatible Chat Completions and Responses APIs
- Supports text, vision, audio, speech, image generation, and embedding models
- Applies in-situ quantization (ISQ) to compress models on the fly without pre-quantization
- Implements PagedAttention and FlashAttention for efficient memory management
- Provides both a Rust crate and a Python package for programmatic use (a short Rust sketch follows the Architecture Overview below)

## Architecture Overview

mistral.rs is built in Rust on the Candle tensor library for GPU and CPU computation. The server handles concurrent requests with an async runtime, dispatching them to a model pipeline that manages tokenization, the KV cache, and sampling. PagedAttention enables efficient memory sharing between sequences, while ISQ converts model weights to lower precision at load time. The modular architecture supports adding new model families through a trait-based plugin system.
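## Rust SDK Example

The same pipeline is available programmatically through the `mistralrs` crate. The sketch below follows the builder-style pattern from the project's examples, loading a Hugging Face model with ISQ and sending a chat request; treat the exact type and method names (`TextModelBuilder`, `TextMessages`, `send_chat_request`) as version-dependent rather than a fixed contract, and check docs.rs for the current API.

```rust
use anyhow::Result;
// Assumes the mistralrs crate's builder API; names may differ between versions.
use mistralrs::{IsqType, TextMessageRole, TextMessages, TextModelBuilder};

#[tokio::main]
async fn main() -> Result<()> {
    // Load the model and quantize it in place to 4-bit (ISQ), mirroring
    // the `--isq Q4K` flag in the Docker example above.
    let model = TextModelBuilder::new("meta-llama/Meta-Llama-3-8B-Instruct")
        .with_isq(IsqType::Q4K)
        .with_logging()
        .build()
        .await?;

    // Build a chat transcript the same way the HTTP API does.
    let messages = TextMessages::new()
        .add_message(TextMessageRole::System, "You are a concise assistant.")
        .add_message(TextMessageRole::User, "Explain PagedAttention in one sentence.");

    // Run inference and print the assistant's reply.
    let response = model.send_chat_request(messages).await?;
    println!("{}", response.choices[0].message.content.as_deref().unwrap_or(""));
    Ok(())
}
```

This assumes `mistralrs`, `tokio` (with the `macros` and `rt-multi-thread` features), and `anyhow` in `Cargo.toml`.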
## Self-Hosting & Configuration - Start the server with `mistralrs-server` specifying model path, quantization, and port - ISQ quantization types include Q4_0, Q4_K, Q8_0, and more, applied at model load time - LoRA and QLoRA adapters can be loaded alongside the base model for fine-tuned behavior - Environment variables control CUDA device selection, thread count, and logging verbosity - Docker images are provided for CUDA, Metal, and CPU-only deployments ## Key Features - In-situ quantization compresses models at load time without requiring pre-quantized checkpoints - Supports 40+ model families: Llama, Mistral, Gemma, Phi, Qwen, and many more - Tool calling and structured JSON output for agentic workflows - MCP (Model Context Protocol) server built in for integration with AI assistants - Speculative decoding for faster token generation with draft models ## Comparison with Similar Tools - **llama.cpp** — C/C++ inference engine with GGUF format; broader hardware support but C-based API - **vLLM** — Python-based high-throughput serving; better for large-scale deployments but heavier runtime - **Ollama** — User-friendly local LLM runner; easier setup but less control over inference parameters - **SGLang** — Python serving framework with RadixAttention; focused on structured generation - **Candle** — The underlying Rust ML framework; mistral.rs builds on Candle and adds serving infrastructure ## FAQ **Q: What is in-situ quantization (ISQ)?** A: ISQ converts model weights to a lower-precision format (like Q4 or Q8) when the model is loaded, without needing a separate quantization step. This lets you run any Hugging Face model with reduced memory instantly. **Q: Does mistral.rs support Apple Silicon?** A: Yes. mistral.rs supports Metal acceleration on Apple Silicon Macs, as well as CUDA on NVIDIA GPUs and CPU-only mode. **Q: Can I use mistral.rs as a drop-in replacement for OpenAI?** A: Yes. The HTTP server exposes `/v1/chat/completions` and other endpoints compatible with the OpenAI API format, so existing client code works without changes. **Q: How does performance compare to llama.cpp?** A: Performance varies by model and hardware. mistral.rs offers competitive throughput with additional features like PagedAttention and built-in ISQ. For pure token-per-second speed, both engines are in the same tier. ## Sources - https://github.com/EricLBuehler/mistral.rs - https://docs.rs/mistralrs --- Source: https://tokrepo.com/en/workflows/asset-20408a97 Author: AI Open Source