Introduction
WebLLM brings LLM inference to the browser using WebGPU acceleration. It compiles models via Apache TVM into GPU compute shaders that execute through the browser's WebGPU API. No server is required, and no data leaves the device.
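A minimal sketch of the basic flow. The model ID and prompt below are illustrative; any ID from the prebuilt model list works:

```ts
import { CreateMLCEngine } from "@mlc-ai/web-llm";

// Downloads the model (or loads it from the browser cache) and compiles
// its WebGPU shaders. The model ID is one of the prebuilt variants.
const engine = await CreateMLCEngine("Llama-3.1-8B-Instruct-q4f16_1-MLC");

// OpenAI-style chat completion, executed entirely on the local GPU.
const reply = await engine.chat.completions.create({
  messages: [{ role: "user", content: "Explain WebGPU in one sentence." }],
});
console.log(reply.choices[0].message.content);
```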
What WebLLM Does
- Runs quantized LLMs (Llama, Mistral, Phi, Qwen, Gemma) in the browser at near-native speed
- Uses WebGPU for GPU-accelerated inference without plugins or extensions
- Provides an OpenAI-compatible chat completions API in JavaScript
- Supports streaming responses, JSON mode, and function calling (see the streaming sketch after this list)
- Caches model weights in the browser for fast subsequent loads
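Streaming uses the same `create` call with `stream: true`, which returns an async iterable of OpenAI-style chunks. A minimal sketch, reusing the `engine` created above (the prompt is illustrative):

```ts
const chunks = await engine.chat.completions.create({
  messages: [{ role: "user", content: "Write a haiku about GPUs." }],
  stream: true,
});

// Each chunk carries an incremental delta, mirroring OpenAI's streaming format.
let text = "";
for await (const chunk of chunks) {
  text += chunk.choices[0]?.delta?.content ?? "";
}
console.log(text);
```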
Architecture Overview
WebLLM uses MLC-LLM's compilation pipeline to convert models into TVM-optimized WebGPU shaders. The runtime loads model weights into GPU memory via the WebGPU API and executes transformer layers as compute shader dispatches. A JavaScript wrapper exposes the OpenAI-compatible API.
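Weight loading is observable through the engine's init-progress callback, which reports download/cache progress and shader compilation as the engine comes up. A short sketch (model ID as before):

```ts
import { CreateMLCEngine } from "@mlc-ai/web-llm";

const engine = await CreateMLCEngine("Llama-3.1-8B-Instruct-q4f16_1-MLC", {
  // Fires repeatedly while weights stream into GPU memory and shaders compile.
  initProgressCallback: (report) => {
    console.log(`${Math.round(report.progress * 100)}% - ${report.text}`);
  },
});
```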
Self-Hosting & Configuration
- Install via npm: @mlc-ai/web-llm
- No server needed; the library runs entirely client-side
- Pre-compiled model variants available for different quantization levels
- Configure max generation length, temperature, and system prompts via the API (see the sketch after this list)
- Works in Chrome, Edge, and other browsers with WebGPU support
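Generation parameters follow the OpenAI request shape, and the system prompt is simply the first message. A sketch reusing the `engine` from earlier; the values are illustrative, not recommendations:

```ts
const reply = await engine.chat.completions.create({
  messages: [
    { role: "system", content: "You are a terse assistant." },
    { role: "user", content: "Summarize WebLLM." },
  ],
  temperature: 0.7, // sampling temperature
  max_tokens: 256,  // cap on generated tokens
});
console.log(reply.choices[0].message.content);
```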
Key Features
- Full privacy: all computation happens on the user's device
- OpenAI-compatible API makes it a drop-in replacement for cloud calls
- Supports 4-bit and 3-bit quantized models to fit in consumer GPU VRAM
- Service Worker mode enables background LLM processing
- Web Worker support keeps the main thread responsive during generation (see the sketch after this list)
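A minimal sketch of the Web Worker setup, using the handler/proxy pair the library exports. File names are illustrative:

```ts
// worker.ts — runs the engine off the main thread
import { WebWorkerMLCEngineHandler } from "@mlc-ai/web-llm";

const handler = new WebWorkerMLCEngineHandler();
self.onmessage = (msg: MessageEvent) => handler.onmessage(msg);
```

```ts
// main.ts — same chat API, transparently proxied to the worker
import { CreateWebWorkerMLCEngine } from "@mlc-ai/web-llm";

const engine = await CreateWebWorkerMLCEngine(
  new Worker(new URL("./worker.ts", import.meta.url), { type: "module" }),
  "Llama-3.1-8B-Instruct-q4f16_1-MLC",
);
```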
Comparison with Similar Tools
- Ollama — desktop-native local LLM runner; WebLLM runs in the browser with no install
- llama.cpp (WASM) — CPU-based WASM port; WebLLM uses WebGPU for GPU acceleration
- Transformers.js — Hugging Face's general-purpose browser ML library; WebLLM focuses specifically on LLM chat with a WebGPU-first runtime
- MLC-LLM — the parent project covering native platforms; WebLLM targets the browser specifically
- llamafile — single-file LLM executables; WebLLM needs no executable download, only model weights cached by the browser
FAQ
Q: Which browsers support WebLLM? A: Chrome 113+, Edge 113+, and other Chromium-based browsers with WebGPU enabled. Firefox support is experimental.
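To check support at runtime, a simple feature test using the standard WebGPU API (not WebLLM-specific) can gate engine initialization:

```ts
// navigator.gpu is only defined in browsers with WebGPU enabled.
if (!("gpu" in navigator)) {
  throw new Error("WebGPU is not supported in this browser.");
}
const adapter = await navigator.gpu.requestAdapter();
if (!adapter) {
  throw new Error("No suitable GPU adapter found.");
}
```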
Q: What models can I run? A: Pre-compiled models include Llama 3, Mistral, Phi-3, Qwen, Gemma, and others in various quantization levels.
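The library exports a registry of prebuilt models, which gives the exact model IDs accepted by `CreateMLCEngine`. A short sketch:

```ts
import { prebuiltAppConfig } from "@mlc-ai/web-llm";

// Each entry includes the model_id string to pass to CreateMLCEngine.
for (const model of prebuiltAppConfig.model_list) {
  console.log(model.model_id);
}
```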
Q: How much VRAM do I need? A: A 4-bit quantized 8B model needs roughly 4-5 GB of GPU memory. Smaller 3B models run on integrated GPUs.
Q: Can I fine-tune models with WebLLM? A: No. WebLLM is inference-only. Fine-tune models offline and compile them for WebLLM using the MLC-LLM toolchain.