Configs · Jul 3, 2026 · 3 min read

WebLLM — High-Performance In-Browser LLM Inference Engine

Run large language models directly in your browser with WebGPU acceleration. No server required, full privacy, powered by Apache TVM.

Agent-ready

Agent installation ready

This asset can be installed after choosing the runtime, reviewing the plan, and running the matching command.

Native · 98/100 · Policy: allow
Agent surface: Any MCP/CLI agent
Type: Skill
Installation: Single
Trust: Established
Entry point: WebLLM
Direct install command:
npx -y tokrepo@latest install 9a2c4722-771d-11f1-9bc6-00163e2b0d79 --target codex

Run after confirming the plan in dry-run.

Introduction

WebLLM is an inference engine that runs large language models entirely inside the browser using WebGPU. It eliminates the need for server-side inference, keeping all data on the user's device while delivering near-native GPU performance through the MLC-LLM compilation stack.

What WebLLM Does

  • Runs LLMs like Llama, Mistral, Phi, and Gemma directly in the browser via WebGPU
  • Provides an OpenAI-compatible chat completions API for drop-in usage
  • Supports structured JSON output and function calling
  • Handles model caching in the browser for faster subsequent loads
  • Enables streaming token generation with real-time UI updates
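
The OpenAI-compatible request shape and the streaming behaviour listed above can be sketched in TypeScript. This is a minimal illustration, not WebLLM's own code: the `ChatMessage`, `ChatRequest`, and `StreamChunk` types are simplified assumptions modelled on the OpenAI chat completions format, and `buildRequest` and `appendDelta` are hypothetical helpers. In a real page the chunks would arrive one at a time from the engine's streaming call rather than from an array.

```typescript
// Simplified, assumed shapes modelled on the OpenAI chat completions format.
type Role = "system" | "user" | "assistant";

interface ChatMessage {
  role: Role;
  content: string;
}

interface ChatRequest {
  messages: ChatMessage[];
  stream: boolean;
  temperature?: number;
  max_tokens?: number;
}

// One streamed chunk: each choice carries a partial "delta" of the reply.
interface StreamChunk {
  choices: { delta: { content?: string | null } }[];
}

// Hypothetical helper: build a streaming request from a system prompt and user text.
function buildRequest(systemPrompt: string, userText: string): ChatRequest {
  return {
    messages: [
      { role: "system", content: systemPrompt },
      { role: "user", content: userText },
    ],
    stream: true,
    temperature: 0.7,
    max_tokens: 256,
  };
}

// Hypothetical helper: fold one chunk into the text currently shown in the UI.
function appendDelta(current: string, chunk: StreamChunk): string {
  const first = chunk.choices[0];
  const piece = first && first.delta.content ? first.delta.content : "";
  return current + piece;
}

// Stand-in for the chunks a streaming call would produce.
const chunks: StreamChunk[] = [
  { choices: [{ delta: { content: "Hello" } }] },
  { choices: [{ delta: { content: ", world" } }] },
  { choices: [{ delta: {} }] }, // a final chunk may carry no text
];

let shown = "";
for (const chunk of chunks) {
  shown = appendDelta(shown, chunk); // re-render the UI after every chunk
}
console.log(shown);
```

Keeping the fold step as a pure function makes the UI update easy to test without a GPU or a downloaded model.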

Architecture Overview

WebLLM models are compiled ahead of time through MLC-LLM and Apache TVM into WebGPU kernels, packaged as a WASM model library. At runtime, the browser downloads the quantized model weights and that library. The engine manages GPU memory, the KV-cache, and tokenization locally. It exposes an OpenAI-style chat completions API, so existing code written against the OpenAI SDK can switch to local inference with minimal changes.
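
To give a sense of why the engine has to manage GPU memory and the KV-cache carefully, the sketch below estimates KV-cache size with the standard transformer formula: two vectors (key and value) per KV head, per layer, per token. It is an illustration only and does not describe WebLLM's actual memory layout. The `AttentionShape` type, the function names, and the example dimensions (32 layers, 8 KV heads, head dimension 128, 16-bit values, 4096-token context) are assumptions chosen to resemble an 8B-class model.

```typescript
// Assumed model dimensions; read the real values from the model's own config.
interface AttentionShape {
  layers: number;        // transformer blocks
  kvHeads: number;       // key/value heads (fewer than query heads under grouped-query attention)
  headDim: number;       // dimension of each head
  bytesPerValue: number; // 2 for 16-bit floats
}

// Bytes of KV-cache per token: one key and one value vector per KV head per layer.
function kvBytesPerToken(shape: AttentionShape): number {
  return 2 * shape.layers * shape.kvHeads * shape.headDim * shape.bytesPerValue;
}

// Total KV-cache bytes when the whole context window is filled.
function kvCacheBytes(shape: AttentionShape, contextTokens: number): number {
  return kvBytesPerToken(shape) * contextTokens;
}

const exampleShape: AttentionShape = { layers: 32, kvHeads: 8, headDim: 128, bytesPerValue: 2 };
const perToken = kvBytesPerToken(exampleShape);       // 131072 bytes (128 KiB)
const fullContext = kvCacheBytes(exampleShape, 4096); // 536870912 bytes (512 MiB)
console.log(perToken, fullContext / (1024 * 1024));
```

Under these assumptions a full 4096-token context needs about 512 MiB on top of the model weights, which is why context length is worth tuning on memory-constrained devices.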

Self-Hosting & Configuration

  • Install via npm or use the CDN bundle for quick prototyping
  • Requires a WebGPU-capable browser (Chrome 113+, Edge 113+, or Firefox Nightly)
  • Models are downloaded once and cached in browser storage (the Cache API by default, with IndexedDB as an option) for offline reuse
  • Configure model choice, temperature, and max tokens through the engine options
  • No backend server, API keys, or cloud dependencies needed
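
Because WebGPU support is a hard requirement, a page would normally feature-detect it before downloading a multi-gigabyte model. The sketch below checks for the standard `navigator.gpu` entry point on a navigator-like object. `hasWebGpu` and `supportMessage` are hypothetical helper names, and plain objects stand in for the browser's `navigator` so the example runs outside a browser.

```typescript
// Feature-detect WebGPU on a navigator-like object.
// In a page, pass the browser's `navigator`; WebGPU-capable browsers expose `navigator.gpu`.
function hasWebGpu(nav: object): boolean {
  return "gpu" in nav && (nav as { gpu?: unknown }).gpu != null;
}

// Hypothetical helper: turn the check into a message for the user.
function supportMessage(nav: object): string {
  return hasWebGpu(nav)
    ? "WebGPU available: local inference can be attempted."
    : "WebGPU not available: try a recent version of Chrome or Edge.";
}

// Plain objects stand in for `navigator` here.
console.log(supportMessage({ gpu: {} }));
console.log(supportMessage({}));
```

The presence of `navigator.gpu` is necessary but not sufficient: `navigator.gpu.requestAdapter()` can still resolve to `null` on unsupported hardware, so a production check would await that call as well.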

Key Features

  • Full privacy: all inference happens on-device with zero data leaving the browser
  • OpenAI-compatible API makes migration from cloud to local seamless
  • Streaming support for responsive chat interfaces
  • Pre-built model library covering popular open-weight LLMs
  • Works on any platform with WebGPU support including laptops and tablets

Comparison with Similar Tools

  • Ollama: native binary for local inference; WebLLM runs in the browser with no install
  • llama.cpp (WASM): CPU-bound WASM execution; WebLLM is GPU-accelerated through WebGPU
  • Transformers.js: ONNX-based browser inference; WebLLM uses TVM-compiled GPU kernels
  • LM Studio: desktop app with its own UI; WebLLM is an embeddable library for web developers

FAQ

Q: Which browsers support WebLLM? A: Chrome and Edge 113+ have stable WebGPU support. Firefox Nightly also works. Safari support is emerging.

Q: How large are the model downloads? A: Quantized models range from 1-4 GB depending on the model and quantization level. They are cached after the first download.
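
The 1-4 GB range is consistent with simple arithmetic: weight size is roughly the parameter count times the bits per weight, divided by 8. The sketch below is a back-of-the-envelope estimate, not a figure taken from WebLLM. `estimateWeightBytes` and `toGigabytes` are hypothetical helpers, the parameter counts are round illustrative numbers, and real files are somewhat larger because of quantization scales and tensors kept at higher precision.

```typescript
// Rough download-size estimate: parameters × bits per weight / 8 bits per byte.
// Ignores quantization scales, metadata, and any tensors kept at higher precision.
function estimateWeightBytes(parameterCount: number, bitsPerWeight: number): number {
  return (parameterCount * bitsPerWeight) / 8;
}

function toGigabytes(bytes: number): number {
  return bytes / 1e9;
}

// Round, hypothetical parameter counts at 4 bits per weight.
const smallModelGb = toGigabytes(estimateWeightBytes(2e9, 4)); // 1 GB
const largeModelGb = toGigabytes(estimateWeightBytes(8e9, 4)); // 4 GB
console.log(smallModelGb, largeModelGb);
```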

Q: Can I use my own fine-tuned model? A: Yes. You can compile custom models using the MLC-LLM toolchain and load them into WebLLM.

Q: Is it fast enough for real-time chat? A: On modern GPUs, WebLLM can reach roughly 30-80 tokens per second for 7-8B parameter models, depending on the GPU and quantization level, which is sufficient for interactive chat.
