Configs · May 15, 2026 · 3 min read

mistral.rs — Blazingly Fast LLM Inference in Rust

mistral.rs is a cross-platform LLM inference engine written in Rust that supports 40+ model families including text, vision, and speech. It provides OpenAI-compatible APIs, quantization, PagedAttention, and both Rust and Python SDKs.

Agent ready

This asset can be read and installed directly by agents

TokRepo exposes a universal CLI command, install contract, metadata JSON, adapter-aware plan, and raw content links so agents can judge fit, risk, and next actions.

Stage only · 17/100
Agent surface: Any MCP/CLI agent
Kind: Script
Install: Stage only
Trust: Established
Entrypoint: mistral.rs
Universal CLI install command:
npx tokrepo install 20408a97-5017-11f1-9bc6-00163e2b0d79

Introduction

mistral.rs is a fast, flexible LLM inference engine created by Eric Buehler. Written in Rust for performance and memory safety, it supports a broad range of model architectures out of the box and provides an OpenAI-compatible HTTP server, making it a drop-in replacement for local inference in AI application stacks.
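
Because the server speaks the OpenAI wire format, any HTTP client can talk to it. Here is a minimal sketch in Rust using the reqwest and serde_json crates; the port and model name are illustrative assumptions, not mistral.rs defaults.

// Minimal chat completion request against a local mistral.rs server.
// Assumes the server is listening on port 1234; requires reqwest with the
// "blocking" and "json" features plus serde_json.
use serde_json::json;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let body = json!({
        "model": "local-model",
        "messages": [
            { "role": "user", "content": "Summarize PagedAttention in one sentence." }
        ],
        "max_tokens": 128
    });

    let resp: serde_json::Value = reqwest::blocking::Client::new()
        .post("http://localhost:1234/v1/chat/completions")
        .json(&body)
        .send()?
        .json()?;

    println!("{}", resp["choices"][0]["message"]["content"]);
    Ok(())
}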

What mistral.rs Does

  • Serves LLM inference through an OpenAI-compatible Chat Completions and Responses API
  • Supports text, vision, audio, speech, image generation, and embedding models
  • Applies in-situ quantization (ISQ) to compress models on the fly without pre-quantization
  • Implements PagedAttention and FlashAttention for efficient memory management
  • Provides both a Rust crate and Python package for programmatic use
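
For programmatic use from Rust, the mistralrs crate exposes a builder-style API. The sketch below is an assumption-laden illustration: the exact type and method names (TextModelBuilder, IsqType, TextMessages, send_chat_request) vary between crate versions, so treat it as a shape rather than a verbatim recipe and check the crate docs.

// Illustrative sketch of the mistralrs crate's builder-style API; the type and
// method names here are assumptions and may differ between versions.
use mistralrs::{IsqType, TextMessageRole, TextMessages, TextModelBuilder};

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    // Load a Hugging Face model and quantize it at load time via ISQ.
    let model = TextModelBuilder::new("microsoft/Phi-3.5-mini-instruct")
        .with_isq(IsqType::Q4K)
        .build()
        .await?;

    let messages = TextMessages::new()
        .add_message(TextMessageRole::User, "Hello from the Rust SDK!");

    let response = model.send_chat_request(messages).await?;
    println!("{:?}", response.choices[0].message.content);
    Ok(())
}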

Architecture Overview

mistral.rs is built in Rust using the Candle tensor library for GPU and CPU computation. The server handles concurrent requests with an async runtime, dispatching them to a model pipeline that manages tokenization, KV cache, and sampling. PagedAttention allows efficient memory sharing between sequences, while ISQ converts model weights to lower precision at load time. The modular architecture supports adding new model families through a trait-based plugin system.
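
To make the PagedAttention idea concrete, here is a toy sketch of a block table that maps each sequence's logical KV-cache blocks to fixed-size physical blocks. This is a conceptual illustration only, not mistral.rs's actual implementation; all names are invented for the example.

// Toy PagedAttention-style block table: KV-cache memory is handed out in
// fixed-size pages, so a sequence only consumes the whole blocks it needs.
use std::collections::HashMap;

const BLOCK_SIZE: usize = 16; // tokens per physical KV-cache block

struct BlockTable {
    // sequence id -> indices of the physical blocks backing that sequence
    tables: HashMap<u64, Vec<usize>>,
    free_blocks: Vec<usize>,
}

impl BlockTable {
    fn new(total_blocks: usize) -> Self {
        Self { tables: HashMap::new(), free_blocks: (0..total_blocks).collect() }
    }

    // Ensure `seq_id` has enough physical blocks for `num_tokens` tokens,
    // allocating a new page only when a block boundary is crossed.
    fn reserve(&mut self, seq_id: u64, num_tokens: usize) -> Option<()> {
        let needed = num_tokens.div_ceil(BLOCK_SIZE);
        let table = self.tables.entry(seq_id).or_default();
        while table.len() < needed {
            table.push(self.free_blocks.pop()?);
        }
        Some(())
    }
}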

Self-Hosting & Configuration

  • Start the server with mistralrs-server, specifying model path, quantization, and port (a sample invocation follows this list)
  • ISQ quantization types include Q4_0, Q4_K, Q8_0, and more, applied at model load time
  • LoRA and QLoRA adapters can be loaded alongside the base model for fine-tuned behavior
  • Environment variables control CUDA device selection, thread count, and logging verbosity
  • Docker images are provided for CUDA, Metal, and CPU-only deployments
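
As a concrete illustration, a launch that applies Q4_K ISQ to a Hugging Face text model might look like the line below. The flag spellings are assumptions based on recent releases and can change between versions, so confirm with mistralrs-server --help before relying on them.

mistralrs-server --port 1234 --isq Q4K plain -m microsoft/Phi-3.5-mini-instruct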

Key Features

  • In-situ quantization compresses models at load time without requiring pre-quantized checkpoints
  • Supports 40+ model families: Llama, Mistral, Gemma, Phi, Qwen, and many more
  • Tool calling and structured JSON output for agentic workflows (see the request sketch after this list)
  • MCP (Model Context Protocol) server built in for integration with AI assistants
  • Speculative decoding for faster token generation with draft models
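
Because the server follows the OpenAI Chat Completions schema, tools are declared in the familiar "tools" array. A request-body sketch follows; the model name and the get_weather function are purely illustrative.

// Sketch of an OpenAI-style tool-calling request body; the function shown is
// invented for illustration. POST this JSON to /v1/chat/completions and the
// model may respond with a tool_calls entry instead of plain text.
use serde_json::json;

fn main() {
    let body = json!({
        "model": "local-model",
        "messages": [{ "role": "user", "content": "What's the weather in Paris?" }],
        "tools": [{
            "type": "function",
            "function": {
                "name": "get_weather",
                "description": "Look up current weather for a city",
                "parameters": {
                    "type": "object",
                    "properties": { "city": { "type": "string" } },
                    "required": ["city"]
                }
            }
        }],
        "tool_choice": "auto"
    });
    println!("{}", serde_json::to_string_pretty(&body).unwrap());
}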

Comparison with Similar Tools

  • llama.cpp — C/C++ inference engine with GGUF format; broader hardware support but C-based API
  • vLLM — Python-based high-throughput serving; better for large-scale deployments but heavier runtime
  • Ollama — User-friendly local LLM runner; easier setup but less control over inference parameters
  • SGLang — Python serving framework with RadixAttention; focused on structured generation
  • Candle — The underlying Rust ML framework; mistral.rs builds on Candle and adds serving infrastructure

FAQ

Q: What is in-situ quantization (ISQ)? A: ISQ converts model weights to a lower-precision format (such as Q4 or Q8) as the model is loaded, without a separate quantization step. This lets you run supported Hugging Face models with a much smaller memory footprint and no pre-quantized checkpoint.
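
As a rough back-of-the-envelope (assuming ~2 bytes per weight at FP16 and roughly 4.5 bits per weight for a Q4_K-style format, and ignoring activations and KV cache), the savings for a 7B-parameter model look like this:

// Rough weight-memory estimate for a 7B-parameter model; quantization
// overheads, activations, and KV cache are deliberately ignored.
fn main() {
    let params = 7.0e9_f64;
    let fp16_gb = params * 2.0 / 1e9;        // ~14 GB at 16-bit weights
    let q4k_gb = params * (4.5 / 8.0) / 1e9; // ~3.9 GB at ~4.5 bits per weight
    println!("FP16: {fp16_gb:.1} GB, Q4_K-style: {q4k_gb:.1} GB");
}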

Q: Does mistral.rs support Apple Silicon? A: Yes. mistral.rs supports Metal acceleration on Apple Silicon Macs, as well as CUDA on NVIDIA GPUs and CPU-only mode.

Q: Can I use mistral.rs as a drop-in replacement for OpenAI? A: Yes. The HTTP server exposes /v1/chat/completions and other endpoints compatible with the OpenAI API format, so existing client code works without changes.

Q: How does performance compare to llama.cpp? A: Performance varies by model and hardware. mistral.rs offers competitive throughput with additional features like PagedAttention and built-in ISQ. For pure token-per-second speed, both engines are in the same tier.
