## Key Features

### Universal Hardware Support
| Platform | Backend |
|---|---|
| NVIDIA GPU | CUDA |
| Apple Silicon | Metal |
| AMD GPU | Vulkan/ROCm |
| Browsers | WebGPU |
| iOS | Metal + Core ML |
| Android | Vulkan + OpenCL |
| CPU | x86 / ARM |
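The Browsers row relies on WebGPU being available. A minimal feature check, written against the standard `navigator.gpu` entry point; the helper name `hasWebGPU` and the injectable parameter are our own sketch, not part of any MLC-LLM API:

```typescript
// WebGPU feature check. `navigator.gpu` is the standard WebGPU entry
// point and is undefined in environments without WebGPU support.
// The navigator-like parameter is injectable so the check is testable
// outside a browser.
function hasWebGPU(nav: { gpu?: unknown } = (globalThis as any).navigator ?? {}): boolean {
  return nav.gpu !== undefined;
}
```

In a real page you would call `hasWebGPU()` with no argument and fall back to a server-side endpoint when it returns `false`.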
### WebLLM (Browser)

Run LLMs entirely in the browser with WebGPU — no server needed:
```typescript
import { CreateMLCEngine } from "@mlc-ai/web-llm";

const engine = await CreateMLCEngine("Llama-3-8B-Instruct-q4f16_1-MLC");
const reply = await engine.chat.completions.create({
  messages: [{ role: "user", content: "Hello!" }],
});
```

### Quantization
4-bit and 3-bit quantization for running large models on consumer hardware.
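To see why this matters, here is a back-of-envelope weight-memory estimate. The 8B parameter count and the weights-only accounting are simplifying assumptions; real runtime memory also includes the KV cache, activations, and quantization metadata:

```typescript
// Rough memory needed to hold model weights at a given precision.
// bits * params / 8 gives bytes; divide by 1024^3 for GiB.
function weightMemoryGiB(params: number, bitsPerWeight: number): number {
  return (params * bitsPerWeight) / 8 / 1024 ** 3;
}

const PARAMS_8B = 8e9; // e.g. Llama-3-8B

console.log(weightMemoryGiB(PARAMS_8B, 16).toFixed(1)); // "14.9" (fp16)
console.log(weightMemoryGiB(PARAMS_8B, 4).toFixed(1));  // "3.7"  (4-bit)
console.log(weightMemoryGiB(PARAMS_8B, 3).toFixed(1));  // "2.8"  (3-bit)
```

The drop from ~15 GiB to under 4 GiB is what makes an 8B model feasible on phones and consumer GPUs.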
### OpenAI-Compatible API

A REST server exposes OpenAI-compatible endpoints, so it works as a drop-in replacement for the OpenAI API.
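A sketch of calling such a server from TypeScript. The base URL, port, and model name are assumptions; they depend on how you launched the server locally:

```typescript
// Assumed local server address; adjust to your own configuration.
const BASE_URL = "http://127.0.0.1:8000/v1";

// Build a request body in the OpenAI chat-completions JSON shape.
function buildChatRequest(model: string, prompt: string) {
  return {
    model,
    messages: [{ role: "user" as const, content: prompt }],
    stream: false,
  };
}

// POST to the OpenAI-compatible /chat/completions endpoint and
// return the assistant's reply text.
async function chat(prompt: string): Promise<string> {
  const res = await fetch(`${BASE_URL}/chat/completions`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(
      buildChatRequest("Llama-3-8B-Instruct-q4f16_1-MLC", prompt)
    ),
  });
  const data = await res.json();
  return data.choices[0].message.content;
}
```

Because the request and response shapes match the OpenAI API, existing OpenAI client libraries can also be pointed at the local base URL instead of hand-rolling `fetch` calls.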
### Native Performance
ML compilation optimizes models for each specific hardware target, achieving near-optimal throughput.
## FAQ
**Q: What is MLC-LLM?**
A: A universal LLM deployment engine that compiles models for native performance on any hardware: phones, browsers, GPUs, and CPUs. It has 22K+ GitHub stars and is Apache 2.0 licensed.
**Q: Can I run Llama 3 on my iPhone?**
A: Yes. MLC-LLM compiles quantized Llama 3 builds for iOS with Metal acceleration, and an iOS app is available.