Apr 14, 2026 · 3 min read

Candle — Minimalist Machine Learning Framework for Rust

Candle is a Rust-native ML framework focused on inference performance, small binaries, and serverless deployment. It runs Llama, Whisper, Stable Diffusion, and other PyTorch models in pure Rust — no Python required.

AI Open Source · Community

Introduction

Candle is Hugging Face's answer to "what if PyTorch were Rust-native?" It's a minimalist ML framework written entirely in Rust, designed for production inference: small binaries, low memory use, easy WASM/serverless deployment, and fast startup, all advantages over Python-based stacks.

With over 20,000 GitHub stars, Candle ships reference implementations of Llama, Mistral, Qwen, Whisper, Stable Diffusion, BERT, and dozens of other models. It supports CUDA, Metal, MKL, and CPU backends.

What Candle Does

Candle provides PyTorch-like tensors and nn modules in Rust. The candle-transformers crate has reference implementations of popular architectures. Models load from safetensors files (the same format Hugging Face uses), so you can take any HF model checkpoint and run it from a Rust binary with no Python dependencies.

Architecture Overview

Rust app
      |
candle-core         (Tensor, autograd, devices)
candle-nn           (Linear, LayerNorm, Embedding, ...)
candle-transformers (model architectures: Llama, Qwen, Whisper, ...)
      |
Backend choice:
  CPU (MKL / accelerate / pure Rust)
  CUDA (NVIDIA)
  Metal (Apple Silicon)
  WebGPU (browser)
      |
Model weights: safetensors / GGUF
      |
Deployment:
  Standalone binary (small)
  WASM module (browser)
  Serverless (Lambda, Cloudflare Workers)

Self-Hosting & Configuration

// Run Llama-style inference from Rust
use candle_core::{Device, Tensor};
use candle_transformers::models::llama::{Llama, Config, Cache};
use hf_hub::api::sync::Api;

fn main() -> anyhow::Result<()> {
    let device = Device::cuda_if_available(0)?;
    let api = Api::new()?;
    let repo = api.model("meta-llama/Llama-3.2-1B".into());
    let weights = repo.get("model.safetensors")?;
    // ... load Config + tokenizer, build Llama, generate tokens
    Ok(())
}
# Cargo.toml — choose a backend feature
[dependencies]
candle-core = { version = "0.6", features = ["cuda"] }
candle-nn = { version = "0.6", features = ["cuda"] }
candle-transformers = "0.6"
# Or features = ["metal"] for Apple Silicon
# Or features = ["accelerate"] for macOS Accelerate framework

Key Features

  • Pure Rust — no Python, no PyTorch — easy to embed in any Rust binary
  • Multi-backend — CPU, CUDA, Metal, MKL, accelerate
  • PyTorch-like API — Tensor, nn modules, autograd familiar to PyTorch users
  • Reference models — Llama, Mistral, Qwen, Whisper, SD, BERT, ViT
  • safetensors / GGUF support — load existing HF weights or quantized models
  • WASM / WebGPU — run models in browsers
  • Serverless friendly — small binaries, fast cold start
  • First-party HF integration — pull models via hf-hub crate

Comparison with Similar Tools

Feature         | Candle                   | tch-rs                  | Burn                    | ONNX Runtime             | llama.cpp
Language        | Rust (native)            | Rust (libtorch FFI)     | Rust (native)           | C++ + bindings           | C/C++
Python required | No                       | No                      | No                      | No                       | No
Backends        | CPU/CUDA/Metal           | CUDA/Metal via libtorch | CPU/CUDA/Metal/WebGPU   | Many                     | CPU/CUDA/Metal
Training        | Yes                      | Yes                     | Yes                     | No                       | No
Model breadth   | Many (HF)                | Any PyTorch model       | Growing                 | ONNX zoo                 | Llama-family
Best for        | Rust-native AI inference | PyTorch from Rust       | Pure Rust deep learning | Cross-platform inference | Local LLMs

FAQ

Q: Candle vs llama.cpp? A: llama.cpp is a focused C++ implementation of Llama-family models. Candle is a general Rust ML framework — broader model support, training capability, and Rust ecosystem integration. llama.cpp wins for pure CPU inference of supported models.

Q: Why pick Candle over PyTorch? A: Smaller deployment footprint, Rust-native (no Python runtime), ideal for Lambda/Workers/embedded. PyTorch wins for training and research; Candle wins for production inference in Rust ecosystems.

Q: Does it support training? A: Yes, basic training (autograd, optimizers, common modules). For large-scale distributed training, PyTorch + DeepSpeed is still more mature.

Q: What about model conversion? A: Candle reads safetensors directly. For PyTorch checkpoints, convert them first with the safetensors Python library. GGUF (quantized) is supported for Llama-family models.
