Scripts · May 31, 2026 · 3 min read

WebLLM — In-Browser LLM Inference Engine

Run large language models entirely in the browser with WebGPU acceleration and no server required.

Agent-ready installation

This asset can be installed after choosing the runtime, reviewing the plan, and running the corresponding command.

Native · 98/100 · Policy: allow
Agent surface: Any MCP/CLI agent
Type: Skill
Installation: Single
Trust: Established
Entry: WebLLM
Direct install command:
npx -y tokrepo@latest install 4a347063-5cea-11f1-9bc6-00163e2b0d79 --target codex

Run this after confirming the plan with a dry run.

Introduction

WebLLM brings large language model inference directly into the browser using WebGPU for hardware acceleration. It eliminates the need for server-side computation, enabling fully private, offline-capable AI chat and text generation on any modern browser that supports WebGPU.

What WebLLM Does

  • Runs LLMs (Llama, Mistral, Phi, Gemma, Qwen) entirely client-side in the browser
  • Leverages WebGPU for near-native GPU performance without plugins or extensions
  • Provides an OpenAI-compatible chat completions API for drop-in integration
  • Supports streaming responses, JSON mode, and function calling
  • Caches model weights in browser storage so later loads skip the download
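
The streaming, OpenAI-style interface listed above can be sketched as follows. This is a minimal illustration, not WebLLM's own code: the `ChatEngine` interface is a simplified structural type written for this example, and `makeFakeEngine` is a hypothetical stand-in so the sketch runs without a GPU or a model download. In a real page the engine would come from `CreateMLCEngine` in `@mlc-ai/web-llm`, and its actual types are richer than the ones assumed here.

```typescript
// Simplified shapes modeled on the OpenAI streaming format (assumed for this sketch).
interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}
interface ChatChunk {
  choices: { delta: { content?: string | null } }[];
}
interface ChatEngine {
  chat: {
    completions: {
      create(req: { messages: ChatMessage[]; stream: true }): Promise<AsyncIterable<ChatChunk>>;
    };
  };
}

// Requests a streamed completion and concatenates the text deltas as they arrive.
async function streamReply(
  engine: ChatEngine,
  messages: ChatMessage[],
  onToken: (piece: string) => void = () => {},
): Promise<string> {
  const chunks = await engine.chat.completions.create({ messages, stream: true });
  let reply = "";
  for await (const chunk of chunks) {
    const piece = chunk.choices[0]?.delta.content ?? "";
    if (piece) {
      reply += piece;
      onToken(piece); // e.g. append to the page while generation continues
    }
  }
  return reply;
}

// Hypothetical fake engine that yields fixed pieces instead of running a model.
function makeFakeEngine(pieces: string[]): ChatEngine {
  return {
    chat: {
      completions: {
        async create() {
          async function* generate(): AsyncGenerator<ChatChunk> {
            for (const p of pieces) yield { choices: [{ delta: { content: p } }] };
          }
          return generate();
        },
      },
    },
  };
}
```

Passing the engine in as a parameter keeps the UI code independent of how the model is loaded, which also makes it easy to swap in a cloud client that speaks the same API.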

Architecture Overview

WebLLM compiles models through Apache TVM's machine learning compiler stack into WebGPU-optimized shaders. At runtime, a lightweight JavaScript engine loads quantized model weights into GPU memory via the WebGPU API, executes transformer attention and feed-forward layers as compute shaders, and exposes an OpenAI-compatible interface. Web worker and service worker modes run inference off the main thread so that generation does not block the UI.

Self-Hosting & Configuration

  • Install via npm or load from a CDN script tag in any web project
  • Choose from dozens of pre-quantized models hosted on Hugging Face
  • Choose a quantization variant (for example q4f16_1 or q4f32_1) to balance quality vs. VRAM usage
  • Set context window size and generation parameters (temperature, top-p) per request
  • Use the service worker engine variant for multi-tab or PWA deployments
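
The configuration choices above can be sketched as plain helper functions. Everything here is illustrative: the `<model>-<quantization>-MLC` naming pattern is an assumption based on how the prebuilt model IDs commonly look, the rule of picking an f16 variant only when the device reports the WebGPU `shader-f16` feature is also an assumption, and the default generation values are arbitrary. The list shipped with the installed package remains the authority on which model IDs exist.

```typescript
type Quantization = "q4f16_1" | "q4f32_1";

interface GenerationOptions {
  temperature: number;
  top_p: number;
  max_tokens: number;
}

// Assumed naming pattern for prebuilt models: "<model>-<quantization>-MLC".
function modelIdFor(baseModel: string, quant: Quantization): string {
  return `${baseModel}-${quant}-MLC`;
}

// Assumption: f16 variants use less memory but need the "shader-f16" feature,
// so fall back to the f32 variant when the feature is unavailable.
function pickQuantization(supportsShaderF16: boolean): Quantization {
  return supportsShaderF16 ? "q4f16_1" : "q4f32_1";
}

// Builds an OpenAI-style request body; per-request overrides win over defaults.
function buildRequest(prompt: string, overrides: Partial<GenerationOptions> = {}) {
  const defaults: GenerationOptions = { temperature: 0.7, top_p: 0.95, max_tokens: 256 };
  return {
    messages: [{ role: "user" as const, content: prompt }],
    ...defaults,
    ...overrides,
  };
}

const exampleModelId = modelIdFor("Llama-3.1-8B-Instruct", pickQuantization(false));
const exampleRequest = buildRequest("Summarize WebGPU in one sentence.", { temperature: 0 });
```

Keeping generation parameters in the request rather than in engine setup means one loaded model can serve both deterministic and creative prompts.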

Key Features

  • Zero server dependency means complete data privacy for end users
  • OpenAI-compatible API makes migration from cloud LLMs straightforward
  • Supports structured output via JSON mode and grammar-guided decoding
  • Pre-built model library covers instruction-tuned and code-generation models
  • Works on Chrome, Edge, and any browser with WebGPU support
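
For the structured-output feature above, a hedged sketch of the client side: the request sets the OpenAI-style `response_format: { type: "json_object" }` field, and the reply is still validated before use, since JSON mode promises well-formed JSON rather than a particular shape. The city/population schema and the function names are hypothetical, chosen only for illustration.

```typescript
interface JsonModeRequest {
  messages: { role: "system" | "user"; content: string }[];
  response_format: { type: "json_object" };
}

// Builds a JSON-mode request; the system prompt states the (hypothetical) schema.
function buildJsonRequest(question: string): JsonModeRequest {
  return {
    messages: [
      { role: "system", content: 'Reply with JSON: {"city": string, "population": number}' },
      { role: "user", content: question },
    ],
    response_format: { type: "json_object" },
  };
}

type ParseResult<T> = { ok: true; value: T } | { ok: false; error: string };

// Parses and validates the model's reply instead of trusting it blindly.
function parseCityReply(raw: string): ParseResult<{ city: string; population: number }> {
  let data: unknown;
  try {
    data = JSON.parse(raw);
  } catch {
    return { ok: false, error: "reply is not valid JSON" };
  }
  if (typeof data !== "object" || data === null) {
    return { ok: false, error: "reply is not a JSON object" };
  }
  const record = data as Record<string, unknown>;
  const city = record["city"];
  const population = record["population"];
  if (typeof city !== "string" || typeof population !== "number") {
    return { ok: false, error: "reply is missing city or population" };
  }
  return { ok: true, value: { city, population } };
}
```

Grammar-guided decoding can narrow the output further, but a validation step like this is still cheap insurance against schema drift.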

Comparison with Similar Tools

  • Ollama — runs models locally on desktop but requires a native binary; WebLLM runs purely in-browser
  • llama.cpp (WASM) — compiles to WebAssembly with CPU-only execution; WebLLM uses WebGPU for GPU acceleration
  • Transformers.js — targets smaller encoder models via ONNX Runtime; WebLLM handles full-size decoder LLMs
  • LM Studio — desktop GUI application requiring installation; WebLLM needs only a web page
  • PrivateGPT — server-side Python stack; WebLLM is a client-side JavaScript library

FAQ

Q: What browsers support WebLLM? A: Any browser with WebGPU enabled, including Chrome 113+, Edge 113+, and recent builds of Firefox and Safari.
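
A page can check for WebGPU before trying to load a model. The sketch below assumes the standard `navigator.gpu.requestAdapter()` entry point and the `shader-f16` feature name from the WebGPU specification. `NavigatorLike` is a reduced type defined only for this example so that the function can be exercised outside a browser.

```typescript
// Reduced views of the WebGPU objects this check needs (defined for this sketch).
interface GpuAdapterLike {
  features: { has(name: string): boolean };
}
interface NavigatorLike {
  gpu?: { requestAdapter(): Promise<GpuAdapterLike | null> };
}

interface WebGpuStatus {
  supported: boolean;
  shaderF16: boolean;
  reason?: string;
}

// Reports whether WebGPU is usable and whether f16 shaders are available.
async function checkWebGpu(nav: NavigatorLike): Promise<WebGpuStatus> {
  if (!nav.gpu) {
    return { supported: false, shaderF16: false, reason: "navigator.gpu is not available" };
  }
  const adapter = await nav.gpu.requestAdapter();
  if (!adapter) {
    return { supported: false, shaderF16: false, reason: "no suitable GPU adapter found" };
  }
  return { supported: true, shaderF16: adapter.features.has("shader-f16") };
}
```

In a browser this would be called as `checkWebGpu(navigator as NavigatorLike)`, with the result deciding whether to show a fallback message instead of starting a multi-gigabyte download.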

Q: How much VRAM do I need? A: A 4-bit quantized 8B model requires roughly 4-5 GB of GPU memory. Smaller 1-3B models work with 2 GB.
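
The figures in this answer follow from simple arithmetic: weight memory is roughly parameter count times bits per weight, plus some runtime overhead. The sketch below makes that explicit. The 0.5 GB default overhead (KV cache, activations, quantization scales) is an assumed placeholder, not a measured value, and real usage grows with context length.

```typescript
// Memory for the weights alone, in decimal gigabytes.
function estimateWeightGB(paramsBillions: number, bitsPerWeight: number): number {
  const bytes = paramsBillions * 1e9 * (bitsPerWeight / 8);
  return bytes / 1e9;
}

// Weights plus an assumed flat overhead for KV cache and activations.
function estimateTotalGB(paramsBillions: number, bitsPerWeight: number, overheadGB = 0.5): number {
  return estimateWeightGB(paramsBillions, bitsPerWeight) + overheadGB;
}

const eightBillionAt4Bit = estimateTotalGB(8, 4); // 4 GB of weights + 0.5 GB assumed overhead
const threeBillionAt4Bit = estimateTotalGB(3, 4); // 1.5 GB of weights + 0.5 GB assumed overhead
```

Under these assumptions an 8B model at 4 bits lands at about 4.5 GB and a 3B model at about 2 GB, consistent with the ranges quoted above.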

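Q: Can I fine-tune or add my own models? A: WebLLM is an inference engine and does not fine-tune models. It loads pre-compiled model libraries, so to add your own model (including one fine-tuned elsewhere), compile it with the MLC-LLM toolchain and host the resulting files for WebLLM to load.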
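
Registering a self-hosted model might look like the sketch below. It assumes the model record uses the fields `model` (URL of the weights), `model_id`, and `model_lib` (URL of the compiled WebGPU library); check these against the `ModelRecord` type in the installed `@mlc-ai/web-llm` version, since the field names have changed between releases. The URL layout and the `-webgpu.wasm` file name are hypothetical.

```typescript
// Assumed shape of one entry in the app config's model list.
interface CustomModelRecord {
  model: string;     // where the quantized weights are hosted
  model_id: string;  // the ID passed when creating the engine
  model_lib: string; // compiled model library produced by the MLC-LLM toolchain
}

// Builds an app config for one custom model under a hypothetical hosting layout.
function customAppConfig(baseUrl: string, modelId: string): { model_list: CustomModelRecord[] } {
  const root = baseUrl.replace(/\/+$/, ""); // drop trailing slashes
  return {
    model_list: [
      {
        model: `${root}/${modelId}`,
        model_id: modelId,
        model_lib: `${root}/${modelId}/${modelId}-webgpu.wasm`,
      },
    ],
  };
}

const myAppConfig = customAppConfig("https://example.com/models/", "my-model-q4f16_1-MLC");
```

The resulting object would be passed as the `appConfig` option when the engine is created, alongside the matching `model_id`.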

Q: Does it work on mobile devices? A: WebGPU support on mobile is still emerging. Android Chrome has partial support; iOS Safari support is in development.
