# WebLLM — High-Performance In-Browser LLM Inference

> A JavaScript library that runs large language models directly in the browser using WebGPU, enabling private on-device AI without a server.

## Install

Install the library from npm: `npm install @mlc-ai/web-llm`.

## Quick Use

```javascript
import { CreateMLCEngine } from "@mlc-ai/web-llm";

const engine = await CreateMLCEngine("Llama-3.1-8B-Instruct-q4f16_1-MLC");
const reply = await engine.chat.completions.create({
  messages: [{ role: "user", content: "Hello!" }]
});
console.log(reply.choices[0].message.content);
```

## Introduction

WebLLM brings LLM inference to the browser using WebGPU acceleration. It compiles models via Apache TVM into a format that runs natively on the GPU through the browser's WebGPU API. No server is required, and no data leaves the device.

## What WebLLM Does

- Runs quantized LLMs (Llama, Mistral, Phi, Qwen, Gemma) in the browser at near-native speed
- Uses WebGPU for GPU-accelerated inference without plugins or extensions
- Provides an OpenAI-compatible chat completions API in JavaScript
- Supports streaming responses, JSON mode, and function calling (see the sketches at the end of this page)
- Caches model weights in the browser so subsequent loads skip the download

## Architecture Overview

WebLLM uses MLC-LLM's compilation pipeline to convert models into TVM-optimized WebGPU shaders. The runtime loads model weights into GPU memory via the WebGPU API and executes transformer layers as compute shader dispatches. A JavaScript wrapper exposes the OpenAI-compatible API.

## Self-Hosting & Configuration

- Install via npm: `@mlc-ai/web-llm`
- No server needed; the library runs entirely client-side
- Pre-compiled model variants are available at different quantization levels
- Configure maximum generation length, temperature, and system prompts via the API
- Works in Chrome, Edge, and other browsers with WebGPU support

## Key Features

- Full privacy: all computation happens on the user's device
- OpenAI-compatible API makes it a drop-in replacement for cloud calls
- Supports 4-bit and 3-bit quantized models to fit in consumer GPU VRAM
- Service Worker mode enables background LLM processing
- Web Worker support keeps the main thread responsive during generation (see the worker sketch at the end of this page)

## Comparison with Similar Tools

- **Ollama** — desktop-native local LLM runner; WebLLM runs in the browser with no install
- **llama.cpp (WASM)** — CPU-based WASM port; WebLLM uses WebGPU for GPU acceleration
- **Transformers.js** — Hugging Face's browser ML library; WebLLM focuses on LLM chat with better GPU utilization
- **MLC-LLM** — the parent project covering native platforms; WebLLM targets the browser specifically
- **llamafile** — single-file LLM executables; WebLLM requires no download beyond browser caching

## FAQ

**Q: Which browsers support WebLLM?**
A: Chrome 113+, Edge 113+, and other Chromium-based browsers with WebGPU enabled. Firefox support is experimental.

**Q: What models can I run?**
A: Pre-compiled models include Llama 3, Mistral, Phi-3, Qwen, Gemma, and others at various quantization levels.

**Q: How much VRAM do I need?**
A: A 4-bit quantized 8B model needs roughly 4-5 GB of GPU memory. Smaller 3B models run on integrated GPUs.

**Q: Can I fine-tune models with WebLLM?**
A: No. WebLLM is inference-only. Fine-tune models offline and compile them for WebLLM using the MLC-LLM toolchain.

## Sources

- https://github.com/mlc-ai/web-llm
- https://webllm.mlc.ai/
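## Sketch: Checking WebGPU Support

The FAQ above notes that WebLLM needs a WebGPU-capable browser. A minimal feature check before loading a model lets a page fall back gracefully; the sketch below uses the standard `navigator.gpu` API and is illustrative rather than code taken from WebLLM itself.

```javascript
// Illustrative sketch: detect WebGPU before attempting to load a model.
// navigator.gpu is the standard WebGPU entry point; it is undefined in
// browsers without WebGPU support.
async function hasWebGPU() {
  if (!("gpu" in navigator)) return false;
  // requestAdapter() resolves to null when no usable GPU adapter exists.
  const adapter = await navigator.gpu.requestAdapter();
  return adapter !== null;
}

if (await hasWebGPU()) {
  console.log("WebGPU available; safe to load a WebLLM model.");
} else {
  console.log("No WebGPU; show a fallback UI or point users to a supported browser.");
}
```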
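## Sketch: Streaming a Response

Streaming is listed under "What WebLLM Does". Because the API mirrors OpenAI's chat completions, a streamed request returns an async iterable of delta chunks. This is a sketch of that pattern; the `stream` option and chunk shape follow the OpenAI convention and should be checked against the WebLLM version you install.

```javascript
import { CreateMLCEngine } from "@mlc-ai/web-llm";

// Load a pre-compiled model (model ID from WebLLM's prebuilt list, as in Quick Use).
const engine = await CreateMLCEngine("Llama-3.1-8B-Instruct-q4f16_1-MLC");

// stream: true is assumed to return an async iterable of OpenAI-style chunks.
const chunks = await engine.chat.completions.create({
  messages: [{ role: "user", content: "Explain WebGPU in two sentences." }],
  stream: true,
});

let text = "";
for await (const chunk of chunks) {
  // Each chunk carries an incremental delta, as in the OpenAI streaming format.
  text += chunk.choices[0]?.delta?.content ?? "";
}
console.log(text);
```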
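## Sketch: Running in a Web Worker

Key Features mentions Web Worker support for keeping the main thread responsive. The sketch below follows the worker pattern from the project's documentation as I recall it; the export names `CreateWebWorkerMLCEngine` and `WebWorkerMLCEngineHandler` have changed across releases, so treat them as assumptions and confirm against the current docs.

```javascript
// worker.js: runs inside the Web Worker and hosts the actual engine.
import { WebWorkerMLCEngineHandler } from "@mlc-ai/web-llm";

const handler = new WebWorkerMLCEngineHandler();
self.onmessage = (msg) => handler.onmessage(msg);
```

```javascript
// main.js: the page talks to the worker through a proxy engine.
import { CreateWebWorkerMLCEngine } from "@mlc-ai/web-llm";

const engine = await CreateWebWorkerMLCEngine(
  new Worker(new URL("./worker.js", import.meta.url), { type: "module" }),
  "Llama-3.1-8B-Instruct-q4f16_1-MLC"
);

// Same OpenAI-compatible surface as the main-thread engine.
const reply = await engine.chat.completions.create({
  messages: [{ role: "user", content: "Hello from a worker!" }],
});
console.log(reply.choices[0].message.content);
```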
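## Sketch: Generation Settings and JSON Mode

Self-Hosting & Configuration says generation length, temperature, and system prompts are set through the API, and JSON mode is listed among the features. The sketch below combines these; `initProgressCallback` and `response_format` are taken from the library's OpenAI-style options as I understand them, so verify the names against the installed version.

```javascript
import { CreateMLCEngine } from "@mlc-ai/web-llm";

// Assumed option: initProgressCallback reports download/compile progress,
// useful for a loading bar on the first visit before weights are cached.
const engine = await CreateMLCEngine("Llama-3.1-8B-Instruct-q4f16_1-MLC", {
  initProgressCallback: (report) => console.log(report.text),
});

// OpenAI-style generation settings plus JSON mode (assumed option names).
const reply = await engine.chat.completions.create({
  messages: [
    { role: "system", content: "Reply only with JSON." },
    { role: "user", content: "List three WebGPU-capable browsers as a JSON array." },
  ],
  response_format: { type: "json_object" },
  temperature: 0,
  max_tokens: 256,
});

console.log(JSON.parse(reply.choices[0].message.content));
```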