Configs · Jul 3, 2026 · 3 min read

WebLLM — High-Performance In-Browser LLM Inference Engine

Run large language models directly in your browser with WebGPU acceleration. No server required, full privacy, powered by Apache TVM.

Agent-ready installation

This asset can be installed after choosing the runtime, reviewing the plan, and running the corresponding command.

Native · 98/100 · Policy: allow
Agent surface: Any MCP/CLI agent
Type: Skill
Installation: Single
Trust: Established
Entry: WebLLM
Direct install command:
npx -y tokrepo@latest install 9a2c4722-771d-11f1-9bc6-00163e2b0d79 --target codex

Run after confirming the plan with a dry run.

Introduction

WebLLM is an inference engine that runs large language models entirely inside the browser using WebGPU. It eliminates the need for server-side inference, keeping all data on the user's device while delivering near-native GPU performance through the MLC-LLM compilation stack.

What WebLLM Does

  • Runs LLMs like Llama, Mistral, Phi, and Gemma directly in the browser via WebGPU
  • Provides an OpenAI-compatible chat completions API for drop-in usage
  • Supports structured JSON output and function calling
  • Handles model caching in the browser for faster subsequent loads
  • Enables streaming token generation with real-time UI updates
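The streaming and OpenAI-style API points above can be sketched as follows. This is a minimal illustration, not the library's own code: the `StreamingEngine`, `ChatMessage`, and `ChatChunk` interfaces and the `streamReply` helper are hypothetical names that describe only the slice of the chat completions shape this example touches. An engine created with `CreateMLCEngine` from `@mlc-ai/web-llm` is assumed to expose a compatible `chat.completions.create` method.

```typescript
// Hypothetical structural types: only the part of the OpenAI-style
// streaming shape that this sketch relies on.
interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

interface ChatChunk {
  choices: { delta: { content?: string | null } }[];
}

interface StreamingEngine {
  chat: {
    completions: {
      create(request: {
        messages: ChatMessage[];
        stream: true;
      }): Promise<AsyncIterable<ChatChunk>>;
    };
  };
}

// Streams one reply, passing each text fragment to onToken as it arrives,
// and resolves to the concatenated reply text.
async function streamReply(
  engine: StreamingEngine,
  messages: ChatMessage[],
  onToken: (text: string) => void,
): Promise<string> {
  const chunks = await engine.chat.completions.create({ messages, stream: true });
  let reply = "";
  for await (const chunk of chunks) {
    // Some chunks carry no text (for example the final one), so default to "".
    const text = chunk.choices[0]?.delta.content ?? "";
    if (text) {
      onToken(text);
      reply += text;
    }
  }
  return reply;
}
```

In a page, `onToken` would typically append the fragment to a DOM node so the reply appears as it is generated. Taking the engine as a parameter keeps the helper independent of how the engine was created and makes it easy to exercise with a stub.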

Architecture Overview

WebLLM relies on models compiled ahead of time by MLC-LLM, which uses Apache TVM to generate WebGPU kernels packaged in a WebAssembly model library. At runtime, the browser downloads that library along with the quantized model weights. The engine manages GPU memory, the KV cache, and tokenization locally. Its API is modeled on the OpenAI chat completions API, so existing code can switch to local inference with minimal changes.

Self-Hosting & Configuration

  • Install via npm or use the CDN bundle for quick prototyping
  • Requires a WebGPU-capable browser (Chrome 113+, Edge 113+, or Firefox Nightly)
  • Models are downloaded once and cached in browser storage for offline reuse (the Cache API by default in recent releases, with IndexedDB available as an option)
  • Configure model choice, temperature, and max tokens through the engine options
  • No backend server, API keys, or cloud dependencies needed
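The configuration step in the list above might look like the sketch below. The `LocalLlmConfig` type, `DEFAULT_CONFIG`, and `resolveConfig` are hypothetical helpers for this page, not part of WebLLM. The default model ID is assumed to match an entry in the library's prebuilt model list and should be checked against the installed version, and the 0 to 2 temperature range is borrowed from the OpenAI API convention.

```typescript
// Hypothetical app-level settings; not a WebLLM type.
interface LocalLlmConfig {
  modelId: string;
  temperature: number;
  maxTokens: number;
}

const DEFAULT_CONFIG: LocalLlmConfig = {
  // Assumed to exist in the prebuilt model list; verify before relying on it.
  modelId: "Llama-3.1-8B-Instruct-q4f32_1-MLC",
  temperature: 0.7,
  maxTokens: 256,
};

// Merges user overrides with defaults and rejects values that would be
// meaningless for sampling or generation length.
function resolveConfig(overrides: Partial<LocalLlmConfig> = {}): LocalLlmConfig {
  const config = { ...DEFAULT_CONFIG, ...overrides };
  if (config.modelId.trim() === "") {
    throw new Error("modelId must not be empty");
  }
  if (!Number.isFinite(config.temperature) || config.temperature < 0 || config.temperature > 2) {
    throw new RangeError("temperature must be between 0 and 2");
  }
  if (!Number.isInteger(config.maxTokens) || config.maxTokens < 1) {
    throw new RangeError("maxTokens must be a positive integer");
  }
  return config;
}

// Intended use with WebLLM (not executed here):
//   const config = resolveConfig({ temperature: 0.2 });
//   const engine = await CreateMLCEngine(config.modelId);
//   await engine.chat.completions.create({
//     messages,
//     temperature: config.temperature,
//     max_tokens: config.maxTokens,
//   });
```

Validating once, before the engine is created, avoids starting a multi-gigabyte model download only to fail on a bad sampling parameter afterwards.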

Key Features

  • Full privacy: inference runs on-device, so prompts and responses never leave the browser
  • OpenAI-style API keeps code changes small when moving from cloud to local inference
  • Streaming support for responsive chat interfaces
  • Pre-built model library covering popular open-weight LLMs
  • Works on any platform with WebGPU support including laptops and tablets

Comparison with Similar Tools

  • Ollama: a native binary for local inference, whereas WebLLM runs in the browser with nothing to install
  • llama.cpp (WASM): CPU-bound WebAssembly execution, whereas WebLLM is GPU-accelerated through WebGPU
  • Transformers.js: ONNX-based browser inference, whereas WebLLM runs TVM-compiled GPU kernels
  • LM Studio: a desktop app with its own UI, whereas WebLLM is an embeddable library for web developers

FAQ

Q: Which browsers support WebLLM? A: Chrome and Edge 113+ have stable WebGPU support. Firefox Nightly also works. Safari support is emerging.
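Because support varies by browser, a page may want to check for WebGPU before loading a model. The sketch below is one way to do that. `NavigatorLike`, `GpuLike`, and `hasUsableWebGPU` are hypothetical names; the only platform call assumed is the standard `navigator.gpu.requestAdapter()`, which resolves to `null` when no suitable adapter is available.

```typescript
// Minimal structural view of the browser objects this check needs, so the
// function can be exercised outside a browser with a stub.
interface GpuLike {
  requestAdapter(): Promise<unknown>;
}

interface NavigatorLike {
  gpu?: GpuLike;
}

// Returns true only if the WebGPU entry point exists AND an adapter can
// actually be obtained. The presence of `gpu` alone is not sufficient.
async function hasUsableWebGPU(nav: NavigatorLike): Promise<boolean> {
  if (!nav.gpu) {
    return false;
  }
  try {
    const adapter = await nav.gpu.requestAdapter();
    return adapter !== null && adapter !== undefined;
  } catch {
    // Treat a rejected adapter request as "not usable" rather than crashing.
    return false;
  }
}

// In a browser: hasUsableWebGPU(navigator as unknown as NavigatorLike)
```

A page can use the result to show a fallback message, or route the request to a server, instead of failing partway through a model download.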

Q: How large are the model downloads? A: Quantized models range from 1-4 GB depending on the model and quantization level. They are cached after the first download.
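The size range above follows roughly from parameter count times bits per weight. The sketch below is a back-of-the-envelope estimate only; `estimateWeightBytes` and `toGigabytes` are hypothetical helpers, and real downloads are assumed to be somewhat larger because quantized formats also store scaling factors and other metadata.

```typescript
// Rough lower bound on weight size: every parameter stored at the same
// bit width, with no quantization metadata counted.
function estimateWeightBytes(parameterCount: number, bitsPerWeight: number): number {
  if (parameterCount <= 0 || bitsPerWeight <= 0) {
    throw new RangeError("parameterCount and bitsPerWeight must be positive");
  }
  return (parameterCount * bitsPerWeight) / 8;
}

// Decimal gigabytes (1 GB = 1e9 bytes).
function toGigabytes(bytes: number): number {
  return bytes / 1e9;
}

// An 8-billion-parameter model at 4 bits per weight:
console.log(toGigabytes(estimateWeightBytes(8e9, 4))); // 4
// A 3-billion-parameter model at 4 bits per weight:
console.log(toGigabytes(estimateWeightBytes(3e9, 4))); // 1.5
```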

Q: Can I use my own fine-tuned model? A: Yes. You can compile custom models using the MLC-LLM toolchain and load them into WebLLM.

Q: Is it fast enough for real-time chat? A: On modern GPUs, WebLLM achieves 30-80 tokens per second for 7-8B parameter models, which is sufficient for interactive chat.
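To put those rates in terms of waiting time, the arithmetic can be sketched as below. `secondsToGenerate` is a hypothetical helper, and the 240-token reply length is an arbitrary illustration rather than a measured figure. The estimate covers token generation only and ignores model loading and prompt processing.

```typescript
// Time to generate a reply of a given length at a steady decode rate.
function secondsToGenerate(tokenCount: number, tokensPerSecond: number): number {
  if (tokenCount < 0 || tokensPerSecond <= 0) {
    throw new RangeError("tokenCount must be >= 0 and tokensPerSecond must be > 0");
  }
  return tokenCount / tokensPerSecond;
}

// A 240-token reply at the low and high ends of the quoted range:
console.log(secondsToGenerate(240, 30)); // 8
console.log(secondsToGenerate(240, 80)); // 3
```

With streaming enabled, the first words appear almost immediately, so the total time matters less than whether the rate keeps ahead of the reader.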
