Scripts · May 31, 2026 · 3 min read

WebLLM — In-Browser LLM Inference Engine

Run large language models entirely in the browser with WebGPU acceleration and no server required.

Direct install command
npx -y tokrepo@latest install 4a347063-5cea-11f1-9bc6-00163e2b0d79 --target codex

Run this after confirming the plan with a dry run.

Introduction

WebLLM brings large language model inference directly into the browser using WebGPU for hardware acceleration. Because no server-side computation is involved, prompts stay on the user's device, and once the model weights have been downloaded, chat and text generation can run offline in any modern browser that supports WebGPU.

What WebLLM Does

  • Runs LLMs (Llama, Mistral, Phi, Gemma, Qwen) entirely client-side in the browser
  • Leverages WebGPU for near-native GPU performance without plugins or extensions
  • Provides an OpenAI-compatible chat completions API for drop-in integration
  • Supports streaming responses, JSON mode, and function calling
  • Caches model weights in browser storage so that later loads skip the download
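
The OpenAI-compatible streaming interface listed above can be sketched as follows. This is a minimal illustration rather than WebLLM's own code: the `StreamingChatEngine` interface is a hand-written assumption that mirrors the OpenAI-style `chat.completions.create` shape, and `streamReply` and `onToken` are hypothetical names. Typing against a small interface also means the function can be exercised with a stub, without a GPU.

```typescript
// Minimal structural types for the OpenAI-style surface described above.
// Assumption: written for this sketch, not copied from WebLLM's type definitions.
interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

interface StreamChunk {
  choices: Array<{ delta: { content?: string | null } }>;
}

interface StreamingChatEngine {
  chat: {
    completions: {
      create(request: {
        messages: ChatMessage[];
        stream: true;
      }): Promise<AsyncIterable<StreamChunk>>;
    };
  };
}

// Sends one user prompt and concatenates the streamed deltas into a full reply.
// `onToken` is a hypothetical hook for updating the UI as text arrives.
async function streamReply(
  engine: StreamingChatEngine,
  prompt: string,
  onToken: (token: string) => void = () => {},
): Promise<string> {
  const chunks = await engine.chat.completions.create({
    messages: [{ role: "user", content: prompt }],
    stream: true,
  });
  let reply = "";
  for await (const chunk of chunks) {
    // Some chunks carry no text (for example the final one), so default to "".
    const token = chunk.choices[0]?.delta.content ?? "";
    if (token) {
      reply += token;
      onToken(token);
    }
  }
  return reply;
}
```

In a real page the engine would typically come from `CreateMLCEngine` in the `@mlc-ai/web-llm` package; the exact request and chunk types should be checked against the current WebLLM documentation.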

Architecture Overview

Models are compiled ahead of time with the MLC-LLM toolchain, which builds on Apache TVM's machine learning compiler stack, into WebGPU-optimized shaders. At runtime, a lightweight JavaScript engine loads quantized model weights into GPU memory via the WebGPU API, executes transformer attention and feed-forward layers as compute shaders, and exposes an OpenAI-compatible interface. Worker-based engine variants, including a service worker mode, run inference in the background so it does not block the main UI thread.

Self-Hosting & Configuration

  • Install via npm or load from a CDN script tag in any web project
  • Choose from dozens of pre-quantized models hosted on Hugging Face
  • Choose a quantization level (for example q4f16 or q4f32 variants) by selecting the matching model build, balancing quality against VRAM usage
  • Set the context window size when loading a model, and set generation parameters (temperature, top-p) per request
  • Use the service worker engine variant for multi-tab or PWA deployments
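
The configuration choices above can be sketched as a few small helpers. This is an illustration under stated assumptions: the `<base name>-<quantization>-MLC` ID convention and the `q4f16_1` / `q4f32_1` suffixes reflect how prebuilt models are commonly named, not every base name exists in both variants, and all function names and default values here are hypothetical. The fallback rule relies on f16 builds needing the WebGPU `shader-f16` feature.

```typescript
// Quantization variants from the list above, in the suffix form used in model IDs.
type Quantization = "q4f16_1" | "q4f32_1";

// Assumed ID convention: "<base name>-<quantization>-MLC".
function buildModelId(baseName: string, quant: Quantization): string {
  return `${baseName}-${quant}-MLC`;
}

// Prefer the smaller f16 build when the adapter advertises "shader-f16";
// otherwise fall back to f32, which costs more VRAM.
// A Set<string> or a WebGPU adapter's `features` object both satisfy this parameter.
function chooseQuantization(features: { has(name: string): boolean }): Quantization {
  return features.has("shader-f16") ? "q4f16_1" : "q4f32_1";
}

interface GenerationParams {
  temperature: number;
  top_p: number;
  max_tokens: number;
}

// Merges per-request overrides onto illustrative defaults and clamps them
// to the ranges used by OpenAI-style APIs (temperature 0..2, top_p 0..1).
function buildGenerationParams(overrides: Partial<GenerationParams> = {}): GenerationParams {
  const merged = { temperature: 0.7, top_p: 0.95, max_tokens: 256, ...overrides };
  return {
    temperature: Math.min(Math.max(merged.temperature, 0), 2),
    top_p: Math.min(Math.max(merged.top_p, 0), 1),
    max_tokens: Math.max(1, Math.floor(merged.max_tokens)),
  };
}
```

The resulting model ID would be passed to the engine at load time, while the generation parameters would be spread into each chat completion request.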

Key Features

  • No inference server is involved, so prompts and responses never leave the user's device
  • OpenAI-compatible API makes migration from cloud LLMs straightforward
  • Supports structured output via JSON mode and grammar-guided decoding
  • Pre-built model library covers instruction-tuned and code-generation models
  • Works on Chrome, Edge, and any browser with WebGPU support
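
Structured output via JSON mode can be sketched like this. The `JsonModeEngine` interface is an assumption that mirrors the OpenAI-style request shape (`response_format: { type: "json_object" }`), and `askForJson` and `isValid` are hypothetical names. JSON mode constrains the syntax of the reply, not its fields, so the sketch still validates the parsed value.

```typescript
// Assumed OpenAI-style surface for a non-streaming JSON-mode request.
interface JsonModeEngine {
  chat: {
    completions: {
      create(request: {
        messages: Array<{ role: "system" | "user"; content: string }>;
        response_format: { type: "json_object" };
      }): Promise<{ choices: Array<{ message: { content: string | null } }> }>;
    };
  };
}

// Requests a JSON object, parses it, and checks it with a caller-supplied type guard.
async function askForJson<T>(
  engine: JsonModeEngine,
  instruction: string,
  isValid: (value: unknown) => value is T,
): Promise<T> {
  const response = await engine.chat.completions.create({
    messages: [
      { role: "system", content: "Reply with a single JSON object." },
      { role: "user", content: instruction },
    ],
    response_format: { type: "json_object" },
  });
  const text = response.choices[0]?.message.content ?? "";
  const parsed: unknown = JSON.parse(text); // throws SyntaxError on malformed output
  if (!isValid(parsed)) {
    throw new Error("Model output did not match the expected shape");
  }
  return parsed;
}
```

Keeping the validation in a type guard means the caller gets a typed value back and a clear error when the model returns well-formed JSON with the wrong fields.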

Comparison with Similar Tools

  • Ollama: runs models locally on desktop but requires a native binary; WebLLM runs purely in-browser
  • llama.cpp (WASM): compiles to WebAssembly with CPU-only execution; WebLLM uses WebGPU for GPU acceleration
  • Transformers.js: runs a broad range of models via ONNX Runtime and is most often used for smaller ones; WebLLM focuses on full-size decoder LLMs
  • LM Studio: desktop GUI application requiring installation; WebLLM needs only a web page
  • PrivateGPT: server-side Python stack; WebLLM is a client-side JavaScript library

FAQ

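Q: What browsers support WebLLM? A: Any browser with WebGPU enabled. Chrome and Edge have shipped it on desktop since version 113; Firefox and Safari support arrived later and varies by version and platform, so check at runtime rather than relying on the browser name.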
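
A runtime check is usually more reliable than matching browser versions. The sketch below uses the standard WebGPU entry points (`navigator.gpu` and `requestAdapter()`); the `NavigatorLike` interface, the status strings, and the function name `checkWebGPU` are assumptions made for this example so that it can run outside a browser.

```typescript
// Narrow structural view of the parts of `navigator` this check needs.
interface NavigatorLike {
  gpu?: { requestAdapter(): Promise<object | null> };
}

type WebGPUStatus = "ready" | "no-webgpu-api" | "no-adapter";

// Distinguishes "browser has no WebGPU API" from "API present but no usable GPU".
async function checkWebGPU(nav: NavigatorLike): Promise<WebGPUStatus> {
  if (!nav.gpu) return "no-webgpu-api";
  const adapter = await nav.gpu.requestAdapter();
  return adapter ? "ready" : "no-adapter";
}
```

In a page this would be called as `checkWebGPU(navigator as NavigatorLike)` before attempting to load a model, so that unsupported browsers get a clear message instead of a failed download.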

Q: How much VRAM do I need? A: A 4-bit quantized 8B model requires roughly 4-5 GB of GPU memory. Smaller 1-3B models work with 2 GB.
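
The 4-5 GB figure can be sanity-checked with simple arithmetic: weights take roughly parameters × bits ÷ 8 bytes, and the key/value cache grows with context length. The sketch below is a rough estimate only. The layer and head counts are assumptions describing a Llama-3-8B-style model with a 16-bit cache, and real usage also includes quantization scales, activations, and runtime overhead.

```typescript
const BYTES_PER_GB = 1e9;

// Memory for the quantized weights alone.
function weightMemoryGB(parameterCount: number, bitsPerWeight: number): number {
  return (parameterCount * bitsPerWeight) / 8 / BYTES_PER_GB;
}

interface KvShape {
  layers: number;        // transformer layers
  kvHeads: number;       // key/value heads per layer
  headDim: number;       // dimension of each head
  bytesPerValue: number; // 2 for a 16-bit cache
}

// Key/value cache: one key and one value vector per layer, head, and token.
function kvCacheGB(shape: KvShape, contextTokens: number): number {
  const bytesPerToken =
    2 * shape.layers * shape.kvHeads * shape.headDim * shape.bytesPerValue;
  return (bytesPerToken * contextTokens) / BYTES_PER_GB;
}

// Assumed shape: 8B parameters at 4 bits, 32 layers, 8 KV heads, head size 128.
const weightsGB = weightMemoryGB(8e9, 4); // 4
const cacheGB = kvCacheGB(
  { layers: 32, kvHeads: 8, headDim: 128, bytesPerValue: 2 },
  4096,
); // about 0.54
const totalGB = weightsGB + cacheGB; // about 4.5, before runtime overhead
```

Under these assumptions the estimate lands inside the 4-5 GB range quoted above, and it shows why a shorter context window or a smaller model reduces the requirement.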

Q: Can I fine-tune or add my own models? A: WebLLM consumes pre-compiled model libraries. You can compile custom models using the MLC-LLM toolchain and host them for WebLLM to load.
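
Registering a self-hosted model might look like the sketch below. The field names (`model`, `model_id`, `model_lib`) and the `model_list` wrapper follow WebLLM's model record format as commonly documented, but they should be verified against the current release. The helper name `buildCustomAppConfig` is hypothetical, and the URLs in the usage note are placeholders.

```typescript
// One entry per custom model: converted weights plus the compiled model library.
interface CustomModelRecord {
  model: string;     // URL of the converted weight files
  model_id: string;  // ID used when asking the engine to load this model
  model_lib: string; // URL of the compiled .wasm model library
}

// Validates the records and wraps them in the list shape the engine config expects.
function buildCustomAppConfig(
  records: CustomModelRecord[],
): { model_list: CustomModelRecord[] } {
  const seenIds = new Set<string>();
  for (const record of records) {
    if (seenIds.has(record.model_id)) {
      throw new Error(`Duplicate model_id: ${record.model_id}`);
    }
    if (!record.model_lib.endsWith(".wasm")) {
      throw new Error(`model_lib should point to a .wasm file: ${record.model_lib}`);
    }
    seenIds.add(record.model_id);
  }
  return { model_list: records };
}
```

A call such as `buildCustomAppConfig([{ model: "https://example.com/my-model/", model_id: "my-model-q4f16_1", model_lib: "https://example.com/my-model.wasm" }])` produces an object that would be supplied as the `appConfig` option when creating the engine.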

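Q: Does it work on mobile devices? A: WebGPU support on mobile is less mature than on desktop. Chrome on Android supports it on recent devices, and Safari on iOS has added it more recently. Limited device memory usually restricts mobile use to the smaller models.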
