Configs · Jul 3, 2026 · 3 min read

WebLLM — High-Performance In-Browser LLM Inference Engine

Run large language models directly in your browser with WebGPU acceleration. No server required, full privacy, powered by Apache TVM.

Agent ready

Ready-to-run agent install

This asset can be installed after the agent chooses its runtime, checks the plan, and runs the matching command.

Native · 98/100
Policy: allow
Agent surface: Any MCP/CLI agent
Kind: Skill
Install: Single
Trust: Established
Entrypoint: WebLLM
Direct install command
npx -y tokrepo@latest install 9a2c4722-771d-11f1-9bc6-00163e2b0d79 --target codex

Run after dry-run confirms the install plan.

Introduction

WebLLM is an inference engine that runs large language models entirely inside the browser using WebGPU. It eliminates the need for server-side inference, keeping all data on the user's device while delivering near-native GPU performance through the MLC-LLM compilation stack.

What WebLLM Does

  • Runs LLMs like Llama, Mistral, Phi, and Gemma directly in the browser via WebGPU
  • Provides an OpenAI-compatible chat completions API for drop-in usage
  • Supports structured JSON output and function calling
  • Handles model caching in the browser for faster subsequent loads
  • Enables streaming token generation with real-time UI updates
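
As a rough illustration of the streaming and OpenAI-style API points above, the sketch below consumes a streamed reply fragment by fragment. The interfaces are local, hypothetical types that only mirror the OpenAI chat completions shape; they are not imported from WebLLM, and the `temperature` and `max_tokens` values are arbitrary placeholders.

```typescript
// Local structural types mirroring the OpenAI-style streaming shape.
// These are assumptions for this sketch, not WebLLM's own type definitions.
interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

interface ChatChunk {
  choices: Array<{ delta: { content?: string | null } }>;
}

interface StreamRequest {
  messages: ChatMessage[];
  stream: true;
  temperature?: number;
  max_tokens?: number;
}

interface ChatEngineLike {
  chat: {
    completions: {
      create(request: StreamRequest): Promise<AsyncIterable<ChatChunk>>;
    };
  };
}

// Streams one reply, calling onToken for each non-empty text fragment as it
// arrives, and resolves with the full concatenated text.
async function streamReply(
  engine: ChatEngineLike,
  prompt: string,
  onToken: (fragment: string) => void,
): Promise<string> {
  const chunks = await engine.chat.completions.create({
    messages: [{ role: "user", content: prompt }],
    stream: true,
    temperature: 0.7, // placeholder value
    max_tokens: 256, // placeholder value
  });

  let reply = "";
  for await (const chunk of chunks) {
    // A chunk may carry no choices or an empty delta; treat both as no text.
    const fragment = chunk.choices[0]?.delta.content ?? "";
    if (fragment.length > 0) {
      reply += fragment;
      onToken(fragment);
    }
  }
  return reply;
}
```

In a page, `onToken` would typically append the fragment to a DOM node, which gives the real-time UI updates described above. Any engine object whose `chat.completions.create` matches this shape should fit the `engine` parameter.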

Architecture Overview

WebLLM models are compiled ahead of time with MLC-LLM and Apache TVM into WebGPU kernels, packaged as a WebAssembly model library. At runtime, the browser downloads the quantized model weights and this library. The engine manages GPU memory, the KV cache, and tokenization locally. It exposes an API modeled on OpenAI's chat completions interface, so existing code can switch to local inference with minimal changes.

Self-Hosting & Configuration

  • Install via npm or use the CDN bundle for quick prototyping
  • Requires a WebGPU-capable browser (Chrome 113+, Edge 113+, or Firefox Nightly)
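  • Models are downloaded once and cached in browser storage (the Cache API by default, with IndexedDB as an option) for offline reuse
  • Choose the model when creating the engine; sampling parameters such as temperature and max tokens can be set per request, as with the OpenAI API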
  • No backend server, API keys, or cloud dependencies needed
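
Because WebGPU is a hard requirement, a page may want to check for it before starting a multi-gigabyte model download. The sketch below is one way to do that, assuming only the standard `navigator.gpu.requestAdapter()` call; the `GpuCapableNavigator` type and the status strings are hypothetical names introduced for this example.

```typescript
// Minimal structural view of the part of `navigator` this check touches.
// Declared locally so the sketch does not depend on WebGPU type packages.
interface GpuCapableNavigator {
  gpu?: { requestAdapter(): Promise<object | null> };
}

type WebGpuStatus = "ok" | "no-api" | "no-adapter";

// Reports whether WebGPU looks usable:
//  - "no-api":     the browser does not expose navigator.gpu at all
//  - "no-adapter": the API exists but no suitable GPU adapter was returned
//  - "ok":         an adapter was obtained
async function checkWebGpu(nav: GpuCapableNavigator): Promise<WebGpuStatus> {
  if (!nav.gpu) {
    return "no-api";
  }
  const adapter = await nav.gpu.requestAdapter();
  return adapter ? "ok" : "no-adapter";
}
```

In a browser this would be called as `checkWebGpu(navigator as GpuCapableNavigator)`, with anything other than `"ok"` leading to a fallback message instead of a model load.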

Key Features

  • Full privacy: all inference happens on-device with zero data leaving the browser
  • OpenAI-compatible API keeps migration from cloud to local inference to a small code change
  • Streaming support for responsive chat interfaces
  • Pre-built model library covering popular open-weight LLMs
  • Works on any platform with WebGPU support including laptops and tablets

Comparison with Similar Tools

  • Ollama: a native binary for local inference, whereas WebLLM runs in the browser with nothing to install
  • llama.cpp (WASM): CPU-bound WebAssembly, whereas WebLLM is GPU-accelerated through WebGPU
  • Transformers.js: ONNX-based browser inference, whereas WebLLM uses TVM-compiled GPU kernels
  • LM Studio: a desktop app with its own UI, whereas WebLLM is an embeddable library for web developers

FAQ

Q: Which browsers support WebLLM? A: Chrome and Edge 113+ have stable WebGPU support. Firefox Nightly also works. Safari support is emerging.

Q: How large are the model downloads? A: Quantized models range from 1-4 GB depending on the model and quantization level. They are cached after the first download.
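
As a back-of-the-envelope check on those sizes, weight storage is roughly the parameter count times the bits per weight. The helper below is a hypothetical sketch of that arithmetic; it ignores quantization scales, tokenizer files, and the model library, so real downloads will be somewhat larger.

```typescript
// Rough size of quantized weights in gigabytes (1 GB = 1e9 bytes).
// Lower-bound estimate only: quantization metadata is not included.
function estimateWeightGigabytes(
  parameterCount: number,
  bitsPerWeight: number,
): number {
  const bytes = (parameterCount * bitsPerWeight) / 8;
  return bytes / 1e9;
}

// An 8B-parameter model at 4 bits per weight:
const eightB = estimateWeightGigabytes(8e9, 4); // 4
// A 2B-parameter model at 4 bits per weight:
const twoB = estimateWeightGigabytes(2e9, 4); // 1
```

Under these assumptions, 4-bit models between about 2B and 8B parameters land in the 1-4 GB range quoted above.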

Q: Can I use my own fine-tuned model? A: Yes. You can compile custom models using the MLC-LLM toolchain and load them into WebLLM.

Q: Is it fast enough for real-time chat? A: On modern GPUs, WebLLM achieves 30-80 tokens per second for 7-8B parameter models, which is sufficient for interactive chat.
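
To put that throughput range in terms of waiting time, the sketch below converts a decode rate into the time needed to produce a reply of a given length. The function name and the 300-token reply length are illustrative assumptions, and the estimate covers token generation only, not model loading or prompt processing.

```typescript
// Seconds needed to generate `tokenCount` tokens at a steady decode rate.
function secondsToGenerate(tokenCount: number, tokensPerSecond: number): number {
  if (tokensPerSecond <= 0) {
    throw new RangeError("tokensPerSecond must be positive");
  }
  return tokenCount / tokensPerSecond;
}

// A 300-token reply at the two ends of the quoted range:
const slowest = secondsToGenerate(300, 30); // 10 seconds
const fastest = secondsToGenerate(300, 80); // 3.75 seconds
```

With streaming enabled, the first tokens appear well before the full reply finishes, so perceived latency is lower than these totals.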
