WebLLM — High-Performance In-Browser LLM Inference Engine

Introduction

WebLLM is an inference engine that runs large language models entirely inside the browser using WebGPU. It eliminates the need for server-side inference, keeping all data on the user's device while delivering near-native GPU performance through the MLC-LLM compilation stack.

What WebLLM Does

Runs LLMs like Llama, Mistral, Phi, and Gemma directly in the browser via WebGPU
Provides an OpenAI-compatible chat completions API for drop-in usage
Supports structured JSON output and function calling
Handles model caching in the browser for faster subsequent loads
Enables streaming token generation with real-time UI updates
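The streaming behavior in the list above can be sketched as follows. This is a minimal illustration, not WebLLM's own code: the `StreamingEngine`, `StreamChunk`, and `streamReply` names are hypothetical, and the types are written by hand to mirror the OpenAI-style streaming shape. An engine returned by `CreateMLCEngine` from `@mlc-ai/web-llm` is assumed to fit this shape.

```typescript
// Hand-written structural types for the parts of an OpenAI-style streaming API
// used in this sketch. They are assumptions, not imports from a library.
interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

interface StreamChunk {
  choices: { delta: { content?: string | null } }[];
}

interface StreamingEngine {
  chat: {
    completions: {
      create(request: {
        messages: ChatMessage[];
        stream: true;
      }): Promise<AsyncIterable<StreamChunk>>;
    };
  };
}

// Streams one reply, calls onToken for every non-empty text fragment,
// and resolves with the concatenated reply.
async function streamReply(
  engine: StreamingEngine,
  prompt: string,
  onToken: (fragment: string) => void,
): Promise<string> {
  const chunks = await engine.chat.completions.create({
    messages: [{ role: "user", content: prompt }],
    stream: true,
  });
  let reply = "";
  for await (const chunk of chunks) {
    // A chunk may carry no choices or an empty delta; treat both as "no text".
    const fragment = chunk.choices[0]?.delta.content ?? "";
    if (fragment.length > 0) {
      reply += fragment;
      onToken(fragment); // e.g. append to a DOM node for a live-updating UI
    }
  }
  return reply;
}
```

In a page, `onToken` would typically append each fragment to the chat bubble being rendered, so the user sees text as it is generated rather than after the full reply completes.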

Architecture Overview

Models are compiled ahead of time by the MLC-LLM toolchain, which builds on Apache TVM, into WebGPU kernels. At runtime, WebLLM downloads the pre-compiled model weights and a lightweight WASM runtime into the browser. The engine manages GPU memory, the KV cache, and tokenization locally. Its API mirrors the OpenAI chat completions interface, so existing code can switch to local inference with minimal changes.
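One way to picture the "minimal changes" claim is application code that depends only on the shared `chat.completions.create` shape, so the backend can be swapped. The sketch below is illustrative: `ChatBackend`, `CompletionRequest`, and `summarize` are hypothetical names, and the optional `model` field reflects the assumption that a cloud API needs a model name per request while a local engine is already bound to the model it loaded.

```typescript
// Hand-written types covering only the fields this sketch touches.
interface CompletionRequest {
  messages: { role: "system" | "user" | "assistant"; content: string }[];
  temperature?: number;
  model?: string; // assumed: required by cloud backends, optional for a local engine
}

interface CompletionResponse {
  choices: { message: { content: string | null } }[];
}

interface ChatBackend {
  chat: {
    completions: {
      create(request: CompletionRequest): Promise<CompletionResponse>;
    };
  };
}

// Application logic written once against the shared interface.
async function summarize(
  backend: ChatBackend,
  text: string,
  model?: string,
): Promise<string> {
  const request: CompletionRequest = {
    messages: [
      { role: "system", content: "Summarize the user's text in one sentence." },
      { role: "user", content: text },
    ],
    temperature: 0.2,
    // Only include `model` when the caller supplies one.
    ...(model !== undefined ? { model } : {}),
  };
  const response = await backend.chat.completions.create(request);
  return response.choices[0]?.message.content ?? "";
}
```

With this structure, switching from a hosted client to an in-browser engine means changing the object passed as `backend`, not the function body.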

Self-Hosting & Configuration

Install via npm or use the CDN bundle for quick prototyping
Requires a WebGPU-capable browser (Chrome 113+, Edge 113+, or Firefox Nightly)
Models are downloaded once and cached in browser storage (the Cache API by default, with IndexedDB available as an option) for offline reuse
Configure model choice, temperature, and max tokens through the engine options
No backend server, API keys, or cloud dependencies needed
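The requirements above suggest two small checks before loading a model: confirm the browser exposes WebGPU, and normalize the generation settings. The sketch below is a hedged illustration. `supportsWebGPU`, `resolveOptions`, the default values, and the 0 to 2 temperature range (borrowed from the OpenAI convention) are assumptions of this example, not values taken from WebLLM.

```typescript
// WebGPU-capable browsers expose a `gpu` property on `navigator`.
// The parameter is typed loosely so the check also works where DOM types lack `gpu`.
function supportsWebGPU(nav: object | undefined): boolean {
  return nav !== undefined && "gpu" in nav && (nav as { gpu?: unknown }).gpu != null;
}

interface GenerationOptions {
  temperature: number;
  max_tokens: number;
}

// Illustrative defaults chosen for this sketch.
const DEFAULT_OPTIONS: GenerationOptions = { temperature: 0.7, max_tokens: 256 };

// Merges caller overrides with the defaults and clamps them to sane ranges.
function resolveOptions(overrides: Partial<GenerationOptions> = {}): GenerationOptions {
  const merged = { ...DEFAULT_OPTIONS, ...overrides };
  return {
    temperature: Math.min(2, Math.max(0, merged.temperature)),
    max_tokens: Math.max(1, Math.floor(merged.max_tokens)),
  };
}
```

In a page, `supportsWebGPU(navigator)` can gate the model download, with a fallback message for unsupported browsers; the resolved options can then be spread into each chat completion request.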

Key Features

Full privacy: all inference happens on-device with zero data leaving the browser
OpenAI-compatible API makes migration from cloud to local seamless
Streaming support for responsive chat interfaces
Pre-built model library covering popular open-weight LLMs
Works on any platform with WebGPU support including laptops and tablets

Comparison with Similar Tools

Ollama: a native binary for local inference, whereas WebLLM runs in the browser with nothing to install
llama.cpp (WASM): CPU-bound WASM execution, whereas WebLLM is GPU-accelerated through WebGPU
Transformers.js: ONNX-based browser inference, whereas WebLLM runs TVM-compiled GPU kernels
LM Studio: a desktop app with its own UI, whereas WebLLM is an embeddable library for web developers

FAQ

Q: Which browsers support WebLLM? A: Chrome and Edge 113+ have stable WebGPU support. Firefox Nightly also works. Safari support is emerging.

Q: How large are the model downloads? A: Quantized models range from 1-4 GB depending on the model and quantization level. They are cached after the first download.
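As a rough cross-check of the 1-4 GB range, download size can be approximated from parameter count and quantization width. This is back-of-the-envelope arithmetic under a stated assumption: every weight is stored at the same bit width, ignoring quantization scales, tokenizer files, and any layers kept at higher precision, so real downloads will be somewhat larger. The `estimateWeightGB` helper is hypothetical.

```typescript
// Approximate weight size in gigabytes (1 GB = 1e9 bytes),
// assuming a uniform bit width for every parameter.
function estimateWeightGB(parameterCount: number, bitsPerWeight: number): number {
  const bytes = (parameterCount * bitsPerWeight) / 8;
  return bytes / 1e9;
}

// An 8B-parameter model at 4 bits per weight comes to about 4 GB,
// and a 2B-parameter model at 4 bits to about 1 GB.
const eightBillionAt4Bit = estimateWeightGB(8e9, 4);
const twoBillionAt4Bit = estimateWeightGB(2e9, 4);
```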

Q: Can I use my own fine-tuned model? A: Yes. You can compile custom models using the MLC-LLM toolchain and load them into WebLLM.

Q: Is it fast enough for real-time chat? A: On modern GPUs, WebLLM achieves 30-80 tokens per second for 7-8B parameter models, which is sufficient for interactive chat.
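To put the quoted throughput in terms of waiting time, the arithmetic can be written out as below. The helpers are hypothetical and only convert between token counts, rates, and elapsed time; they ignore model load time and prompt processing, which add latency before the first token appears.

```typescript
// Time to generate a reply of a given length at a steady decoding rate.
function secondsForReply(replyTokens: number, tokensPerSecond: number): number {
  if (tokensPerSecond <= 0) throw new RangeError("tokensPerSecond must be positive");
  return replyTokens / tokensPerSecond;
}

// Observed decoding rate, given how many tokens arrived in a measured interval.
function measuredTokensPerSecond(tokenCount: number, elapsedMs: number): number {
  if (elapsedMs <= 0) throw new RangeError("elapsedMs must be positive");
  return tokenCount / (elapsedMs / 1000);
}

// A 300-token reply takes 10 s at 30 tokens/s and 3.75 s at 80 tokens/s.
const slowCase = secondsForReply(300, 30);
const fastCase = secondsForReply(300, 80);
```

Counting fragments in a streaming loop and timing them with `performance.now()` gives the inputs for `measuredTokensPerSecond`, which is a simple way to check what a particular device actually achieves.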
