Scripts · May 31, 2026 · 1 min read

WebLLM — In-Browser LLM Inference Engine

Run large language models entirely in the browser with WebGPU acceleration and no server required.

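Agent ready

Agent-installable

This asset is installable: the agent first selects the current runtime, checks the install plan, and then runs the matching command.

Native · 98/100 · Policy: allowed
Agent entry: any MCP/CLI agent
Type: Skill
Install: Single
Trust level: Established
Entry: WebLLM
Direct install command:
npx -y tokrepo@latest install 4a347063-5cea-11f1-9bc6-00163e2b0d79 --target codex

Confirm the install plan with a dry run first, then run this command.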

Introduction

WebLLM brings large language model inference directly into the browser, using WebGPU for hardware acceleration. Because no server-side computation is involved, prompts and responses stay on the user's device, and chat and text generation can work offline once the model weights have been downloaded and cached. It runs in any modern browser that supports WebGPU.

What WebLLM Does

  • Runs LLMs (Llama, Mistral, Phi, Gemma, Qwen) entirely client-side in the browser
  • Leverages WebGPU for near-native GPU performance without plugins or extensions
  • Provides an OpenAI-compatible chat completions API for drop-in integration
  • Supports streaming responses, JSON mode, and function calling
  • Caches model weights in browser storage so subsequent loads skip the download
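
As a rough illustration of the OpenAI-compatible streaming shape listed above, the sketch below consumes a streamed reply. The `Chunk`, `ChatMessage`, and `ChatEngine` types are local stand-ins written for this example, modeled on the OpenAI chunk layout (`choices[0].delta.content`); nothing is imported, and the helper names `deltaText` and `streamReply` are hypothetical.

```typescript
// Local stand-in types for this sketch; they mirror the OpenAI streaming
// chunk layout but are not imported from any library.
interface Chunk {
  choices: { delta: { content?: string | null } }[];
}

interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

interface ChatEngine {
  chat: {
    completions: {
      create(req: { messages: ChatMessage[]; stream: true }): Promise<AsyncIterable<Chunk>>;
    };
  };
}

// Extract the text fragment carried by one chunk (empty string if there is none).
function deltaText(chunk: Chunk): string {
  return chunk.choices[0]?.delta.content ?? "";
}

// Stream a reply, forwarding each fragment to `onToken` and returning the full text.
async function streamReply(
  engine: ChatEngine,
  messages: ChatMessage[],
  onToken: (text: string) => void,
): Promise<string> {
  const stream = await engine.chat.completions.create({ messages, stream: true });
  let full = "";
  for await (const chunk of stream) {
    const text = deltaText(chunk);
    if (text) {
      onToken(text); // e.g. append to the chat UI as tokens arrive
      full += text;
    }
  }
  return full;
}
```

In a page that loads the library, the engine returned by `CreateMLCEngine` from the `@mlc-ai/web-llm` package is expected to fit this shape; the library's own type definitions are the authority on the exact fields.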

Architecture Overview

WebLLM compiles models through Apache TVM's machine learning compiler stack into WebGPU-optimized shaders. At runtime, a lightweight JavaScript engine loads quantized model weights into GPU memory via the WebGPU API, executes transformer attention and feed-forward layers as compute shaders, and exposes an OpenAI-compatible interface. A service worker mode allows background inference without blocking the main UI thread.

Self-Hosting & Configuration

  • Install via npm or load from a CDN script tag in any web project
  • Choose from dozens of pre-quantized models hosted on Hugging Face
  • Configure quantization level (q4f16, q4f32) to balance quality vs. VRAM usage
  • Set context window size and generation parameters (temperature, top-p) per request
  • Use the service worker engine variant for multi-tab or PWA deployments
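
The per-request generation parameters mentioned above can be sketched as a small request builder. This is a minimal sketch under stated assumptions: `buildChatRequest` is a hypothetical helper, the field names follow the OpenAI chat completions convention (`temperature`, `top_p`, `max_tokens`), and the default values and clamping ranges are illustrative choices rather than WebLLM's documented defaults.

```typescript
// Hypothetical helper types for this sketch (OpenAI-style field names).
interface GenerationParams {
  temperature?: number; // sampling temperature
  top_p?: number;       // nucleus sampling cutoff
  max_tokens?: number;  // cap on generated tokens
}

interface ChatRequest extends GenerationParams {
  messages: { role: "system" | "user" | "assistant"; content: string }[];
  stream: boolean;
}

// Keep a value inside [lo, hi].
const clamp = (x: number, lo: number, hi: number): number =>
  Math.min(hi, Math.max(lo, x));

// Assemble a request body, filling in assumed defaults and clamping
// out-of-range sampling values instead of passing them through.
function buildChatRequest(
  userText: string,
  params: GenerationParams = {},
  systemPrompt?: string,
): ChatRequest {
  const messages: ChatRequest["messages"] = [];
  if (systemPrompt) messages.push({ role: "system", content: systemPrompt });
  messages.push({ role: "user", content: userText });
  return {
    messages,
    stream: false,
    temperature: clamp(params.temperature ?? 0.7, 0, 2), // 0.7 is an assumed default
    top_p: clamp(params.top_p ?? 0.95, 0, 1),            // 0.95 is an assumed default
    ...(params.max_tokens !== undefined ? { max_tokens: params.max_tokens } : {}),
  };
}
```

Building the request in one place keeps sampling settings explicit per call, which makes it easier to compare outputs across quantization variants.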

Key Features

  • Zero server dependency means complete data privacy for end users
  • OpenAI-compatible API makes migration from cloud LLMs straightforward
  • Supports structured output via JSON mode and grammar-guided decoding
  • Pre-built model library covers instruction-tuned and code-generation models
  • Works on Chrome, Edge, and any browser with WebGPU support

Comparison with Similar Tools

  • Ollama — runs models locally on desktop but requires a native binary; WebLLM runs purely in-browser
  • llama.cpp (WASM) — compiles to WebAssembly with CPU-only execution; WebLLM uses WebGPU for GPU acceleration
  • Transformers.js — targets smaller encoder models via ONNX Runtime; WebLLM handles full-size decoder LLMs
  • LM Studio — desktop GUI application requiring installation; WebLLM needs only a web page
  • PrivateGPT — server-side Python stack; WebLLM is a client-side JavaScript library

FAQ

Q: What browsers support WebLLM? A: Any browser with WebGPU enabled, including Chrome 113+, Edge 113+, and recent builds of Firefox and Safari.
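
A page can check for WebGPU before attempting to load a model. The sketch below uses the standard `navigator.gpu.requestAdapter()` entry point; `NavigatorLike`, `GpuLike`, `hasWebGPUApi`, and `canUseWebGPU` are hypothetical names introduced here so the code type-checks without WebGPU type definitions installed.

```typescript
// Minimal structural view of the parts of `navigator` this sketch touches.
interface GpuLike {
  requestAdapter(): Promise<object | null>;
}

interface NavigatorLike {
  gpu?: GpuLike;
}

// Cheap synchronous check: is the WebGPU entry point exposed at all?
function hasWebGPUApi(nav: NavigatorLike): boolean {
  return nav.gpu !== undefined;
}

// Fuller check: the API can be present while no usable adapter is available
// (for example on unsupported hardware), so ask for an adapter as well.
async function canUseWebGPU(nav: NavigatorLike): Promise<boolean> {
  if (!nav.gpu) return false;
  try {
    return (await nav.gpu.requestAdapter()) !== null;
  } catch {
    return false;
  }
}
```

In a browser this would be called as `canUseWebGPU(navigator as NavigatorLike)`, falling back to a server-backed or CPU path when it resolves to `false`.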

Q: How much VRAM do I need? A: A 4-bit quantized 8B model requires roughly 4-5 GB of GPU memory. Smaller 1-3B models work with 2 GB.
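
The figures above follow from simple arithmetic: weight memory is roughly parameter count times bits per weight, divided by 8 to get bytes. The sketch below makes that explicit; the function names are hypothetical, and the flat overhead allowance for KV cache and runtime buffers is an illustrative assumption, not a measured value.

```typescript
// Back-of-the-envelope weight footprint: parameters x bits per weight / 8 bytes.
// Returns decimal gigabytes (1 GB = 1e9 bytes).
function weightGB(paramCount: number, bitsPerWeight: number): number {
  return (paramCount * bitsPerWeight) / 8 / 1e9;
}

// Adds a flat allowance for KV cache, activations, and runtime buffers.
// The 0.75 GB default is an assumption chosen for illustration only.
function estimateVramGB(
  paramCount: number,
  bitsPerWeight: number,
  overheadGB = 0.75,
): number {
  return weightGB(paramCount, bitsPerWeight) + overheadGB;
}

console.log(weightGB(8e9, 4));       // 4
console.log(estimateVramGB(8e9, 4)); // 4.75
console.log(weightGB(3e9, 4));       // 1.5
```

Real usage also depends on context length and on per-group quantization metadata, so treat the result as a lower-bound estimate.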

Q: Can I fine-tune or add my own models? A: WebLLM is an inference engine and does not fine-tune models itself. It consumes pre-compiled model libraries, so you can compile a custom or separately fine-tuned model with the MLC-LLM toolchain and host it for WebLLM to load.

Q: Does it work on mobile devices? A: WebGPU support on mobile is still emerging. Android Chrome has partial support; iOS Safari support is in development.
