# WebLLM: In-Browser LLM Inference Engine

> Run large language models entirely in the browser with WebGPU acceleration and no server required.

## Install

```bash
npm install @mlc-ai/web-llm
```

## Quick Use

```javascript
import { CreateMLCEngine } from "@mlc-ai/web-llm";

// Downloads (or loads from cache) the model weights and initializes WebGPU.
const engine = await CreateMLCEngine("Llama-3.1-8B-Instruct-q4f32_1-MLC");

// OpenAI-style chat completion request.
const reply = await engine.chat.completions.create({
  messages: [{ role: "user", content: "Hello!" }]
});
console.log(reply.choices[0].message.content);
```

## Introduction

WebLLM brings large language model inference directly into the browser, using WebGPU for hardware acceleration. It removes the need for server-side computation, enabling private, offline-capable AI chat and text generation in any modern browser that supports WebGPU.

## What WebLLM Does

- Runs LLMs (Llama, Mistral, Phi, Gemma, Qwen) entirely client-side in the browser
- Leverages WebGPU for near-native GPU performance without plugins or extensions
- Provides an OpenAI-compatible chat completions API for drop-in integration
- Supports streaming responses, JSON mode, and function calling
- Caches model weights in browser storage so subsequent loads skip the download

## Architecture Overview

WebLLM compiles models through Apache TVM's machine learning compiler stack into WebGPU-optimized shaders. At runtime, a lightweight JavaScript engine loads quantized model weights into GPU memory via the WebGPU API, executes transformer attention and feed-forward layers as compute shaders, and exposes an OpenAI-compatible interface. Web worker and service worker modes run inference in the background without blocking the main UI thread.

## Self-Hosting & Configuration

- Install via npm or load from a CDN script tag in any web project
- Choose from dozens of pre-quantized models hosted on Hugging Face
- Configure quantization level (q4f16_1, q4f32_1) to balance quality vs. VRAM usage
- Set context window size and generation parameters (temperature, top-p) per request
- Use the service worker engine variant for multi-tab or PWA deployments

## Key Features

- Zero server dependency: prompts and responses never leave the user's device
- OpenAI-compatible API makes migration from cloud LLMs straightforward
- Supports structured output via JSON mode and grammar-guided decoding
- Pre-built model library covers instruction-tuned and code-generation models
- Works on Chrome, Edge, and any browser with WebGPU support

## Comparison with Similar Tools

- **Ollama**: runs models locally on desktop but requires a native binary; WebLLM runs purely in-browser
- **llama.cpp (WASM)**: compiles to WebAssembly with CPU-only execution; WebLLM uses WebGPU for GPU acceleration
- **Transformers.js**: targets smaller encoder models via ONNX Runtime; WebLLM handles full-size decoder LLMs
- **LM Studio**: desktop GUI application requiring installation; WebLLM needs only a web page
- **PrivateGPT**: server-side Python stack; WebLLM is a client-side JavaScript library

## FAQ

**Q: What browsers support WebLLM?**
A: Any browser with WebGPU enabled, including Chrome 113+, Edge 113+, and recent builds of Firefox and Safari.

**Q: How much VRAM do I need?**
A: A 4-bit quantized 8B model requires roughly 4-5 GB of GPU memory. Smaller 1-3B models work with 2 GB.

**Q: Can I fine-tune or add my own models?**
A: WebLLM consumes pre-compiled model libraries. You can compile custom models using the MLC-LLM toolchain and host them for WebLLM to load.

**Q: Does it work on mobile devices?**
A: WebGPU support on mobile is still emerging. Android Chrome has partial support; iOS Safari support is in development.

## Sources

- https://github.com/mlc-ai/web-llm
- https://webllm.mlc.ai/

---

Source: https://tokrepo.com/en/workflows/asset-4a347063
Author: Script Depot
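
The streaming support mentioned under "What WebLLM Does" can be sketched as follows. This is a minimal illustration, not code from the WebLLM project: `deltaText`, `collectStream`, and `fakeStream` are hypothetical helper names, and the chunk shape (`choices[0].delta.content`) is assumed to follow the OpenAI streaming format that the API mirrors. The fake stream lets the helper be tried without a GPU.

```javascript
// Extract the text fragment from one OpenAI-style streaming chunk.
// Chunks that carry no content (for example the final one) yield "".
function deltaText(chunk) {
  return chunk?.choices?.[0]?.delta?.content ?? "";
}

// Consume an async iterable of chunks, call onToken for each non-empty
// fragment, and resolve to the full concatenated reply.
async function collectStream(chunks, onToken = () => {}) {
  let text = "";
  for await (const chunk of chunks) {
    const piece = deltaText(chunk);
    if (piece) {
      onToken(piece);
      text += piece;
    }
  }
  return text;
}

// Stand-in stream with made-up data, used only to exercise the helper.
async function* fakeStream() {
  yield { choices: [{ delta: { content: "Hello" } }] };
  yield { choices: [{ delta: { content: ", world" } }] };
  yield { choices: [{ delta: {} }] };
}

// With a real engine (assuming the `engine` object from Quick Use):
//   const chunks = await engine.chat.completions.create({
//     messages: [{ role: "user", content: "Hello!" }],
//     stream: true,
//   });
//   const reply = await collectStream(chunks, (t) => console.log(t));
```

Keeping the chunk handling in a small pure function makes it possible to update a UI element token by token through `onToken` while still getting the full reply at the end.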