Configs · July 3, 2026 · 1 min read

WebLLM — High-Performance In-Browser LLM Inference Engine

Run large language models directly in your browser with WebGPU acceleration. No server required, full privacy, powered by Apache TVM.

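Agent ready

Agent can install directly

This asset is installable. The agent first selects the current runtime, checks the install plan, and then runs the matching command.

Native · 98/100 · Policy: allowed
Agent entry: any MCP/CLI agent
Type: Skill
Install: Single
Trust level: Established
Entry: WebLLM
Direct install command:
npx -y tokrepo@latest install 9a2c4722-771d-11f1-9bc6-00163e2b0d79 --target codex

Do a dry run first to confirm the install plan, then run this command.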

Introduction

WebLLM is an inference engine that runs large language models entirely inside the browser using WebGPU. It eliminates the need for server-side inference, keeping all data on the user's device while delivering near-native GPU performance through the MLC-LLM compilation stack.

What WebLLM Does

  • Runs LLMs like Llama, Mistral, Phi, and Gemma directly in the browser via WebGPU
  • Provides an OpenAI-compatible chat completions API for drop-in usage
  • Supports structured JSON output and function calling
  • Handles model caching in the browser for faster subsequent loads
  • Enables streaming token generation with real-time UI updates
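As a rough illustration of the streaming point above, the sketch below consumes an OpenAI-style streamed chat completion and hands each text fragment to a UI callback. The interfaces are simplified local stand-ins rather than WebLLM's own types, and `streamReply` and `fakeEngine` are hypothetical names introduced here so the example runs without a GPU or a model download.

```typescript
// Simplified, local stand-ins for the OpenAI-style streaming shapes.
// These are assumptions for illustration, not imports from WebLLM.
interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

interface StreamChunk {
  choices: { delta: { content?: string | null } }[];
}

interface StreamingEngine {
  chat: {
    completions: {
      create(request: {
        messages: ChatMessage[];
        stream: true;
      }): Promise<AsyncIterable<StreamChunk>>;
    };
  };
}

// Streams one reply, calling onToken for each text fragment as it arrives,
// and resolves with the full reply once the stream ends.
async function streamReply(
  engine: StreamingEngine,
  messages: ChatMessage[],
  onToken: (fragment: string) => void,
): Promise<string> {
  const chunks = await engine.chat.completions.create({ messages, stream: true });
  let reply = "";
  for await (const chunk of chunks) {
    const fragment = chunk.choices[0]?.delta.content ?? "";
    if (fragment.length > 0) {
      reply += fragment;
      onToken(fragment); // e.g. append to a chat bubble in the page
    }
  }
  return reply;
}

// Hypothetical fake engine that emits three fixed fragments.
const fakeEngine: StreamingEngine = {
  chat: {
    completions: {
      async create() {
        async function* fragments(): AsyncGenerator<StreamChunk> {
          for (const text of ["Hello", ", ", "world"]) {
            yield { choices: [{ delta: { content: text } }] };
          }
        }
        return fragments();
      },
    },
  },
};

streamReply(fakeEngine, [{ role: "user", content: "Say hello" }], (fragment) =>
  console.log("fragment:", fragment),
).then((reply) => console.log("full reply:", reply));
```

In a real page the engine would come from the library instead of the fake, for example via `CreateMLCEngine` from the `@mlc-ai/web-llm` package; the real types are richer than the stand-ins here, so a cast or the library's own types may be needed.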

Architecture Overview

Models are compiled ahead of time through the MLC-LLM toolchain, which builds on Apache TVM, into WebGPU kernels packaged with a lightweight WASM runtime. At runtime, the browser downloads that compiled model library together with the quantized model weights. The engine manages GPU memory, the KV cache, and tokenization locally. It exposes an OpenAI-style chat completions API, so code written against the OpenAI SDK can switch to local inference with minimal changes.
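To make the "minimal changes" claim concrete, here is a small sketch in which application code depends only on the shared `chat.completions.create` shape, so the backend behind it can be swapped. `CompletionBackend`, `ask`, and `echoBackend` are hypothetical names for this example, and the request fields are a simplified subset; a cloud client would typically also need a `model` field, which is an assumption about how the two setups differ rather than something stated above.

```typescript
type Message = { role: "system" | "user" | "assistant"; content: string };

// Simplified subset of an OpenAI-style chat completion request.
interface CompletionRequest {
  messages: Message[];
  temperature?: number;
  max_tokens?: number;
}

// Anything exposing this shape can serve as the backend: a cloud client
// or an in-browser engine (assumed to share this call shape).
interface CompletionBackend {
  chat: {
    completions: {
      create(request: CompletionRequest): Promise<{
        choices: { message: { content: string | null } }[];
      }>;
    };
  };
}

// Application code written once against the shared shape.
async function ask(backend: CompletionBackend, question: string): Promise<string> {
  const response = await backend.chat.completions.create({
    messages: [
      { role: "system", content: "You are a concise assistant." },
      { role: "user", content: question },
    ],
    temperature: 0.7,
    max_tokens: 128,
  });
  return response.choices[0]?.message.content ?? "";
}

// Hypothetical stand-in backend that echoes the last message back.
const echoBackend: CompletionBackend = {
  chat: {
    completions: {
      async create(request) {
        const last = request.messages[request.messages.length - 1];
        return { choices: [{ message: { content: `echo: ${last?.content ?? ""}` } }] };
      },
    },
  },
};

ask(echoBackend, "ping").then((answer) => console.log(answer)); // logs "echo: ping"
```

Because `ask` never names a concrete client, switching between remote and local inference comes down to which object is passed in.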

Self-Hosting & Configuration

  • Install via npm or use the CDN bundle for quick prototyping
  • Requires a WebGPU-capable browser (Chrome 113+, Edge 113+, or Firefox Nightly)
  • Models are downloaded once and cached in browser storage (the Cache API or IndexedDB, depending on configuration) for offline reuse
  • Configure model choice, temperature, and max tokens through the engine options
  • No backend server, API keys, or cloud dependencies needed
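Since the engine depends on WebGPU, a page may want to check for it before starting a multi-gigabyte download. The sketch below uses the standard `navigator.gpu.requestAdapter()` call; `checkWebGPU`, `GpuNavigatorLike`, and the status strings are hypothetical names chosen for this example, and the navigator is passed in as a parameter so the logic can also run outside a browser.

```typescript
// Narrow, local description of the part of `navigator` this check reads.
interface GpuNavigatorLike {
  gpu?: { requestAdapter(): Promise<object | null> };
}

type WebGPUStatus = "ready" | "no-api" | "no-adapter";

// Distinguishes "browser has no WebGPU API" from "API exists but no usable GPU".
async function checkWebGPU(nav: GpuNavigatorLike): Promise<WebGPUStatus> {
  if (!nav.gpu) return "no-api";
  const adapter = await nav.gpu.requestAdapter();
  return adapter ? "ready" : "no-adapter";
}

// Usage with a stand-in object that has no `gpu` property.
checkWebGPU({}).then((status) => console.log(status)); // logs "no-api"
```

In a page this would be called with the global `navigator` (a cast may be needed depending on the TypeScript DOM typings in use), showing a fallback message for any status other than `"ready"`.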

Key Features

  • Full privacy: all inference happens on-device with zero data leaving the browser
  • OpenAI-compatible API makes migration from cloud to local seamless
  • Streaming support for responsive chat interfaces
  • Pre-built model library covering popular open-weight LLMs
  • Works on any platform with WebGPU support including laptops and tablets

Comparison with Similar Tools

  • Ollama: a native binary for local inference, whereas WebLLM runs in the browser with nothing to install
  • llama.cpp (WASM): CPU-bound WASM execution, whereas WebLLM is GPU-accelerated through WebGPU
  • Transformers.js: ONNX-based browser inference, whereas WebLLM runs TVM-compiled GPU kernels
  • LM Studio: a desktop app with its own UI, whereas WebLLM is an embeddable library for web developers

FAQ

Q: Which browsers support WebLLM? A: Chrome and Edge 113+ have stable WebGPU support. Firefox Nightly also works. Safari support is emerging.

Q: How large are the model downloads? A: Quantized models range from 1-4 GB depending on the model and quantization level. They are cached after the first download.
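The 1-4 GB range can be sanity-checked with back-of-the-envelope arithmetic: weight size is roughly parameter count times bits per weight, divided by 8. The sketch below is only an estimate under that assumption; it ignores quantization scales, any layers kept at higher precision, and tokenizer files, so real downloads will differ. The parameter counts are illustrative round numbers, not measurements of specific models.

```typescript
// Approximate weight size in bytes: parameters × bits per weight ÷ 8.
function estimateWeightBytes(parameterCount: number, bitsPerWeight: number): number {
  return (parameterCount * bitsPerWeight) / 8;
}

// Decimal gigabytes (1 GB = 1e9 bytes).
function toGigabytes(bytes: number): number {
  return bytes / 1e9;
}

const examples = [
  { label: "3B parameters at 4-bit", params: 3e9, bits: 4 },   // ~1.5 GB
  { label: "8B parameters at 4-bit", params: 8e9, bits: 4 },   // ~4.0 GB
  { label: "8B parameters at 16-bit", params: 8e9, bits: 16 }, // ~16.0 GB
];

for (const { label, params, bits } of examples) {
  const gb = toGigabytes(estimateWeightBytes(params, bits));
  console.log(`${label}: ~${gb.toFixed(1)} GB`);
}
```

The last row shows why quantization matters for browser delivery: the same 8B model at 16-bit precision would be about four times larger than its 4-bit form.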

Q: Can I use my own fine-tuned model? A: Yes. You can compile custom models using the MLC-LLM toolchain and load them into WebLLM.

Q: Is it fast enough for real-time chat? A: On modern GPUs, WebLLM achieves 30-80 tokens per second for 7-8B parameter models, which is sufficient for interactive chat.
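To put the quoted throughput in user-facing terms, the sketch below converts tokens per second into waiting time and shows how a rate could be computed from a measured run. The 300-token reply length is an assumed example, and the calculation covers generation only, leaving out the time spent processing the prompt before the first token appears.

```typescript
// Time to generate a reply of a given length at a steady decode rate.
function secondsToGenerate(tokenCount: number, tokensPerSecond: number): number {
  if (tokensPerSecond <= 0) throw new RangeError("tokensPerSecond must be positive");
  return tokenCount / tokensPerSecond;
}

// Decode rate from a measured run (token count and elapsed milliseconds).
function measuredTokensPerSecond(tokenCount: number, elapsedMs: number): number {
  if (elapsedMs <= 0) throw new RangeError("elapsedMs must be positive");
  return tokenCount / (elapsedMs / 1000);
}

const replyTokens = 300; // assumed reply length for illustration
console.log(secondsToGenerate(replyTokens, 30)); // 10 seconds at the low end
console.log(secondsToGenerate(replyTokens, 80)); // 3.75 seconds at the high end
console.log(measuredTokensPerSecond(120, 2000)); // 60 tokens per second
```

With streaming, the user starts reading as soon as the first fragments arrive, so the perceived wait is shorter than these totals suggest.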
