Introduction
WebLLM is an inference engine that runs large language models entirely inside the browser using WebGPU. It eliminates the need for server-side inference, keeping all data on the user's device while delivering near-native GPU performance through the MLC-LLM compilation stack.
What WebLLM Does
- Runs LLMs like Llama, Mistral, Phi, and Gemma directly in the browser via WebGPU
- Provides an OpenAI-compatible chat completions API for drop-in usage
- Supports structured JSON output and function calling
- Handles model caching in the browser for faster subsequent loads
- Enables streaming token generation with real-time UI updates
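The streaming behavior listed above can be sketched as a small consumer loop. This is a minimal illustration, assuming the stream yields OpenAI-style chunks with a `choices[0].delta.content` field; the `ChunkLike` type and the `collectStream` helper are hypothetical names introduced here, not part of WebLLM.

```typescript
// Minimal structural type for an OpenAI-style streaming chunk.
// Assumption for illustration only; not WebLLM's exported type name.
interface ChunkLike {
  choices: Array<{ delta: { content?: string | null } }>;
}

// Consume a stream of chunks, calling onToken for each text fragment
// (for example to append it to the page) and returning the full reply.
async function collectStream(
  stream: AsyncIterable<ChunkLike>,
  onToken: (text: string) => void = () => {},
): Promise<string> {
  let reply = "";
  for await (const chunk of stream) {
    // Some chunks may carry no choices or no text; skip those.
    const text = chunk.choices[0]?.delta?.content ?? "";
    if (text.length === 0) continue;
    reply += text;
    onToken(text);
  }
  return reply;
}
```

With WebLLM, the stream would typically come from a chat completions call made with `stream: true`, and `onToken` would update the UI as fragments arrive.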
Architecture Overview
Models are compiled ahead of time by the MLC-LLM toolchain, which builds on Apache TVM, into WebGPU kernels packaged with a lightweight WASM runtime. At load time, the browser downloads this compiled model library together with the quantized model weights. The engine manages GPU memory, the KV cache, and tokenization locally. It exposes an API modeled on OpenAI's chat completions interface, so existing code can switch to local inference with minimal changes.
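One way to picture the "minimal changes" claim is to write application code against the shared request and response shape instead of a specific client. The sketch below assumes only the OpenAI-style `chat.completions.create` shape; `ChatMessage`, `ChatBackend`, and `ask` are hypothetical names for illustration, and a real cloud client or WebLLM engine may need a thin adapter to satisfy this narrowed type.

```typescript
// Structural types covering only the fields this sketch uses.
// They mirror the OpenAI chat completions shape; the names are illustrative.
interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

interface ChatBackend {
  chat: {
    completions: {
      create(request: {
        messages: ChatMessage[];
        temperature?: number;
        max_tokens?: number;
      }): Promise<{ choices: Array<{ message: { content: string | null } }> }>;
    };
  };
}

// Ask a single question. The caller decides whether `backend` is a cloud
// client or an in-browser engine; this function does not need to know.
async function ask(backend: ChatBackend, question: string): Promise<string> {
  const response = await backend.chat.completions.create({
    messages: [
      { role: "system", content: "You are a concise assistant." },
      { role: "user", content: question },
    ],
    temperature: 0.7,
    max_tokens: 256,
  });
  // Fall back to an empty string if the backend returns no choice or no text.
  return response.choices[0]?.message.content ?? "";
}
```

Because `ask` depends only on the shape of the call, switching from cloud to local inference becomes a matter of which object is passed in.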
Self-Hosting & Configuration
- Install via npm or use the CDN bundle for quick prototyping
- Requires a WebGPU-capable browser (Chrome 113+, Edge 113+, or Firefox Nightly)
- Models are downloaded once and cached in browser storage (the Cache API by default, with IndexedDB available as an option) for offline reuse
- Configure model choice, temperature, and max tokens through the engine options
- No backend server, API keys, or cloud dependencies needed
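Since the browser requirement above is a hard one, it can help to check for WebGPU before starting a large model download. The following is a minimal sketch using the standard `navigator.gpu.requestAdapter()` call; `GpuNavigatorLike`, `WebGpuStatus`, and `checkWebGPU` are hypothetical names, and the navigator object is passed in as a parameter so the check can also run outside a browser.

```typescript
// Narrow, hypothetical view of `navigator` limited to what this check needs.
interface GpuNavigatorLike {
  gpu?: { requestAdapter(): Promise<object | null> };
}

type WebGpuStatus = "ok" | "no-webgpu-api" | "no-adapter";

// Report whether WebGPU looks usable before committing to a model download.
async function checkWebGPU(nav: GpuNavigatorLike): Promise<WebGpuStatus> {
  // Browsers without WebGPU do not define `navigator.gpu` at all.
  if (!nav.gpu) return "no-webgpu-api";
  // The API can exist while no suitable GPU adapter is available,
  // in which case requestAdapter() resolves to null.
  const adapter = await nav.gpu.requestAdapter();
  return adapter ? "ok" : "no-adapter";
}
```

In a page, the browser's `navigator` would be passed in (a cast may be needed depending on the installed DOM typings). If the result is "ok", the engine can be created, for example with WebLLM's `CreateMLCEngine` and a model ID from its pre-built list; otherwise the app can show a fallback message instead of downloading weights.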
Key Features
- Full privacy: all inference happens on-device with zero data leaving the browser
- OpenAI-compatible API makes migration from cloud to local seamless
- Streaming support for responsive chat interfaces
- Pre-built model library covering popular open-weight LLMs
- Works on any platform with WebGPU support including laptops and tablets
Comparison with Similar Tools
- Ollama: a native binary for local inference, whereas WebLLM runs in the browser with nothing to install
- llama.cpp (WASM): CPU-bound when compiled to WASM, whereas WebLLM is GPU-accelerated through WebGPU
- Transformers.js: ONNX-based browser inference, whereas WebLLM runs TVM-compiled GPU kernels
- LM Studio: a desktop app with its own UI, whereas WebLLM is an embeddable library for web developers
FAQ
Q: Which browsers support WebLLM? A: Chrome and Edge 113+ have stable WebGPU support. Firefox Nightly also works. Safari support is emerging.
Q: How large are the model downloads? A: Quantized models range from about 1 GB to 4 GB, depending on the model's parameter count and quantization level. They are cached after the first download.
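The size range above follows from simple arithmetic: parameter count times bits per weight, divided by eight to get bytes. The sketch below is a rough estimate under that assumption only; `estimateWeightsGB` is a hypothetical helper, and the 4-bit figure in the examples is an assumed quantization level, not one stated here.

```typescript
// Rough size of the weights alone: parameters x bits per weight / 8 bytes.
// It ignores quantization metadata, any tensors kept at higher precision,
// and the tokenizer and WASM files, so treat the result as a lower bound.
function estimateWeightsGB(paramsBillions: number, bitsPerWeight: number): number {
  const bytes = paramsBillions * 1e9 * (bitsPerWeight / 8);
  return bytes / 1e9; // decimal gigabytes
}

console.log(estimateWeightsGB(8, 4)); // 4
console.log(estimateWeightsGB(3, 4)); // 1.5
```

Under this estimate, a 2B to 8B parameter model at 4 bits per weight lands in the 1 GB to 4 GB range quoted above, while the same 8B model at 16 bits would be around 16 GB.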
Q: Can I use my own fine-tuned model? A: Yes. You can compile custom models using the MLC-LLM toolchain and load them into WebLLM.
Q: Is it fast enough for real-time chat? A: On modern GPUs, WebLLM achieves 30-80 tokens per second for 7-8B parameter models, which is sufficient for interactive chat.