# WebLLM: High-Performance In-Browser LLM Inference Engine

> Run large language models directly in your browser with WebGPU acceleration. No server required, full privacy, powered by Apache TVM.

## Quick Use

```bash
npm install @mlc-ai/web-llm

# In your JavaScript/TypeScript code:
# import { CreateMLCEngine } from "@mlc-ai/web-llm";
# const engine = await CreateMLCEngine("Llama-3.1-8B-Instruct-q4f32_1-MLC");
# const reply = await engine.chat.completions.create({ messages: [{ role: "user", content: "Hello" }] });
```

## Introduction

WebLLM is an inference engine that runs large language models entirely inside the browser using WebGPU. It eliminates the need for server-side inference, keeping all data on the user's device while delivering near-native GPU performance through the MLC-LLM compilation stack.

## What WebLLM Does

- Runs LLMs like Llama, Mistral, Phi, and Gemma directly in the browser via WebGPU
- Provides an OpenAI-compatible chat completions API for drop-in usage
- Supports structured JSON output and function calling
- Handles model caching in the browser for faster subsequent loads
- Enables streaming token generation with real-time UI updates

## Architecture Overview

WebLLM compiles models through Apache TVM into WebGPU shaders. At runtime, it downloads pre-compiled model weights and a lightweight WASM runtime into the browser. The engine manages GPU memory, KV-cache, and tokenization locally. It exposes an API surface compatible with the OpenAI SDK so existing code can switch to local inference with minimal changes.
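
Because the API follows the OpenAI chat completions shape, streaming can be consumed as an async iterable of chunks, each carrying a partial `delta`. The sketch below is one way to collect those deltas and update a UI as tokens arrive. It is illustrative only: `streamReply` and `createFakeEngine` are hypothetical helper names, not part of WebLLM, and the engine is passed in as a parameter so the helper can be exercised without a GPU or a model download.

```javascript
// Sketch: stream a reply from any engine exposing an OpenAI-style
// chat.completions.create(). With WebLLM, `engine` would come from
// CreateMLCEngine(...); here it is injected so the helper stays testable.
async function streamReply(engine, messages, onToken = () => {}) {
  const chunks = await engine.chat.completions.create({
    messages,
    stream: true, // ask for incremental chunks instead of one final message
  });

  let reply = "";
  for await (const chunk of chunks) {
    // Each chunk carries a partial "delta"; content may be absent on some chunks.
    const piece = chunk.choices[0]?.delta?.content ?? "";
    if (piece) {
      reply += piece;
      onToken(piece); // e.g. append to the DOM for a live-typing effect
    }
  }
  return reply;
}

// Hypothetical stand-in engine (NOT part of WebLLM) that yields canned chunks.
function createFakeEngine(pieces) {
  return {
    chat: {
      completions: {
        async create({ stream }) {
          if (!stream) throw new Error("fake engine only supports stream: true");
          return (async function* () {
            for (const content of pieces) {
              yield { choices: [{ delta: { content } }] };
            }
            yield { choices: [{ delta: {} }] }; // final chunk with no content
          })();
        },
      },
    },
  };
}
```

In a real page, the engine returned by `CreateMLCEngine("Llama-3.1-8B-Instruct-q4f32_1-MLC")` would be passed as the first argument, with `onToken` appending each piece to the chat transcript element.
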
## Self-Hosting & Configuration

- Install via npm or use the CDN bundle for quick prototyping
- Requires a WebGPU-capable browser (Chrome 113+, Edge 113+, or Firefox Nightly)
- Models are downloaded once and cached in browser storage (the Cache API by default, with IndexedDB available as an option) for offline reuse
- Configure model choice, temperature, and max tokens through the engine options
- No backend server, API keys, or cloud dependencies needed

## Key Features

- Full privacy: all inference happens on-device with zero data leaving the browser
- OpenAI-compatible API makes migration from cloud to local seamless
- Streaming support for responsive chat interfaces
- Pre-built model library covering popular open-weight LLMs
- Works on any platform with WebGPU support, including laptops and tablets

## Comparison with Similar Tools

- **Ollama**: a native binary for local inference, whereas WebLLM runs in the browser with no install
- **llama.cpp (WASM)**: CPU-bound WASM execution, whereas WebLLM is GPU-accelerated through WebGPU
- **Transformers.js**: ONNX-based browser inference, whereas WebLLM uses TVM-compiled GPU kernels
- **LM Studio**: a desktop app with its own UI, whereas WebLLM is an embeddable library for web developers

## FAQ

**Q: Which browsers support WebLLM?**
A: Chrome and Edge 113+ have stable WebGPU support. Firefox Nightly also works. Safari support is emerging.

**Q: How large are the model downloads?**
A: Quantized models range from 1-4 GB depending on the model and quantization level. They are cached after the first download.

**Q: Can I use my own fine-tuned model?**
A: Yes. You can compile custom models using the MLC-LLM toolchain and load them into WebLLM.

**Q: Is it fast enough for real-time chat?**
A: On modern GPUs, WebLLM achieves 30-80 tokens per second for 7-8B parameter models, which is sufficient for interactive chat.

## Sources

- https://github.com/mlc-ai/web-llm
- https://webllm.mlc.ai/

---

Source: https://tokrepo.com/en/workflows/asset-9a2c4722
Author: AI Open Source