
WebLLM — High-Performance In-Browser LLM Inference

A JavaScript library that runs large language models directly in the browser using WebGPU, enabling private on-device AI without a server.

Introduction

WebLLM brings LLM inference to the browser using WebGPU acceleration. It compiles models via Apache TVM into a format that runs natively on the GPU through the browser's WebGPU API. No server, no data leaves the device.

What WebLLM Does

  • Runs quantized LLMs (Llama, Mistral, Phi, Qwen, Gemma) in the browser at near-native speed
  • Uses WebGPU for GPU-accelerated inference without plugins or extensions
  • Provides an OpenAI-compatible chat completions API in JavaScript
  • Supports streaming responses, JSON mode, and function calling
  • Caches model weights in the browser for instant subsequent loads
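Putting the points above together, a minimal usage sketch looks like the following. It assumes the package's documented CreateMLCEngine entry point and uses Llama-3-8B-Instruct-q4f32_1-MLC as an example model ID; check the WebLLM model list for the identifiers shipped with the version you install.

```ts
// Minimal sketch: load a pre-compiled model and run one chat completion.
import { CreateMLCEngine } from "@mlc-ai/web-llm";

async function main() {
  // First call downloads the quantized weights; later calls hit the browser cache.
  const engine = await CreateMLCEngine("Llama-3-8B-Instruct-q4f32_1-MLC", {
    initProgressCallback: (report) => console.log(report.text),
  });

  // OpenAI-style request, executed entirely on the local GPU.
  const reply = await engine.chat.completions.create({
    messages: [
      { role: "system", content: "You are a concise assistant." },
      { role: "user", content: "Explain WebGPU in one sentence." },
    ],
  });

  console.log(reply.choices[0].message.content);
}

main();
```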

Architecture Overview

WebLLM uses MLC-LLM's compilation pipeline to convert models into TVM-optimized WebGPU shaders. The runtime loads model weights into GPU memory via the WebGPU API and executes transformer layers as compute shader dispatches. A JavaScript wrapper exposes the OpenAI-compatible API.
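Because every step in this pipeline depends on the browser exposing WebGPU, it is worth feature-detecting it before fetching hundreds of megabytes of weights. The check below uses only the standard navigator.gpu API, nothing WebLLM-specific.

```ts
// Feature-detect WebGPU before touching any model weights.
// (In TypeScript, navigator.gpu types come from @webgpu/types; the `any`
// cast keeps this sketch self-contained.)
async function hasWebGPU(): Promise<boolean> {
  const gpu = (navigator as any).gpu;
  if (!gpu) return false;                      // browser does not expose WebGPU
  const adapter = await gpu.requestAdapter();  // null if no usable GPU is found
  return adapter !== null;
}

hasWebGPU().then((ok) =>
  console.log(ok ? "WebGPU available, WebLLM can run here" : "No WebGPU, fall back to a server"),
);
```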

Self-Hosting & Configuration

  • Install via npm: @mlc-ai/web-llm
  • No server needed; the library runs entirely client-side
  • Pre-compiled model variants available for different quantization levels
  • Configure max generation length, temperature, and system prompts via the API
  • Works in Chrome, Edge, and other browsers with WebGPU support
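The generation settings listed above are passed per request through the OpenAI-compatible schema. The sketch below also shows the streaming mode mentioned earlier; the parameter names (temperature, max_tokens, stream) follow the OpenAI chat-completions convention, and the model ID is again just an example to verify against the WebLLM model list.

```ts
// Sketch: per-request configuration plus streaming, via the OpenAI-style API.
import { CreateMLCEngine } from "@mlc-ai/web-llm";

const engine = await CreateMLCEngine("Phi-3-mini-4k-instruct-q4f16_1-MLC"); // example model ID

const stream = await engine.chat.completions.create({
  stream: true,       // yield tokens as they are generated
  temperature: 0.7,   // sampling temperature
  max_tokens: 256,    // cap on generated tokens
  messages: [
    { role: "system", content: "Answer in at most two sentences." }, // system prompt
    { role: "user", content: "What does 4-bit quantization change?" },
  ],
});

let answer = "";
for await (const chunk of stream) {
  answer += chunk.choices[0]?.delta?.content ?? ""; // append each streamed delta
}
console.log(answer);
```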

Key Features

  • Full privacy: all computation happens on the user's device
  • OpenAI-compatible API makes it a drop-in replacement for cloud calls
  • Supports 4-bit and 3-bit quantized models to fit in consumer GPU VRAM
  • Service Worker mode enables background LLM processing
  • Web Worker support keeps the main thread responsive during generation
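As a rough sketch of the Web Worker mode, the worker file hosts the engine and the page talks to it through a proxy that exposes the same chat-completions interface. The export names (WebWorkerMLCEngineHandler, CreateWebWorkerMLCEngine) follow WebLLM's documented worker API; verify them against the version you install.

```ts
// worker.ts — the engine lives here, off the main thread.
import { WebWorkerMLCEngineHandler } from "@mlc-ai/web-llm";

const handler = new WebWorkerMLCEngineHandler();
self.onmessage = (msg: MessageEvent) => handler.onmessage(msg);
```

On the main thread, the proxy engine forwards requests to the worker over postMessage, so long generations never block UI rendering:

```ts
// main.ts — a lightweight proxy; the call site is unchanged.
import { CreateWebWorkerMLCEngine } from "@mlc-ai/web-llm";

const engine = await CreateWebWorkerMLCEngine(
  new Worker(new URL("./worker.ts", import.meta.url), { type: "module" }),
  "Llama-3-8B-Instruct-q4f32_1-MLC", // example model ID
);

const reply = await engine.chat.completions.create({
  messages: [{ role: "user", content: "Hello from the main thread" }],
});
console.log(reply.choices[0].message.content);
```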

Comparison with Similar Tools

  • Ollama — desktop-native local LLM runner; WebLLM runs in the browser with no install
  • llama.cpp (WASM) — CPU-based WASM port; WebLLM uses WebGPU for GPU acceleration
  • Transformers.js — Hugging Face's general-purpose browser ML library; WebLLM focuses specifically on LLM chat and compiles models into WebGPU kernels aimed at higher GPU throughput
  • MLC-LLM — the parent project covering native platforms; WebLLM targets the browser specifically
  • llamafile — single-file LLM executables that run natively; WebLLM needs no install at all, though model weights are downloaded once and then served from the browser cache

FAQ

Q: Which browsers support WebLLM? A: Chrome 113+, Edge 113+, and other Chromium-based browsers with WebGPU enabled. Firefox support is experimental.

Q: What models can I run? A: Pre-compiled models include Llama 3, Mistral, Phi-3, Qwen, Gemma, and others in various quantization levels.

Q: How much VRAM do I need? A: At 4-bit quantization the weights take roughly half a byte per parameter, so an 8B model needs about 4 GB for weights plus KV-cache and activation overhead, roughly 4-5 GB of GPU memory in total. Smaller 3B models run on integrated GPUs.

Q: Can I fine-tune models with WebLLM? A: No. WebLLM is inference-only. Fine-tune models offline and compile them for WebLLM using the MLC-LLM toolchain.
