Skills · April 26, 2026 · 1 min read

WebLLM — High-Performance In-Browser LLM Inference

A JavaScript library that runs large language models directly in the browser using WebGPU, enabling private on-device AI without a server.

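Agent ready

Agents can install this directly

This asset is installable. The agent first selects the current runtime, checks the install plan, and then runs the matching command.

Native · 98/100 · Policy: Allow
Agent entry: any MCP/CLI agent
Type: Skill
Install: Single
Trust level: Established
Entry: WebLLM
Direct install command:
npx -y tokrepo@latest install 993569ba-416b-11f1-9bc6-00163e2b0d79 --target codex

Do a dry run first to confirm the install plan, then run this command.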

Introduction

WebLLM brings LLM inference to the browser using WebGPU acceleration. It compiles models via Apache TVM into a format that runs natively on the GPU through the browser's WebGPU API. Inference needs no server, and prompts and outputs never leave the device.

What WebLLM Does

  • Runs quantized LLMs (Llama, Mistral, Phi, Qwen, Gemma) in the browser at near-native speed
  • Uses WebGPU for GPU-accelerated inference without plugins or extensions
  • Provides an OpenAI-compatible chat completions API in JavaScript
  • Supports streaming responses, JSON mode, and function calling
  • Caches model weights in the browser so that subsequent loads skip the download
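
As a rough illustration of the streaming, OpenAI-style API listed above, the sketch below accumulates streamed text fragments into a full reply. The interfaces `ChatMessage`, `StreamChunk`, and `ChatEngine` and the helpers `deltaText` and `streamReply` are hypothetical names introduced here; they only mirror the OpenAI streaming shape, and the real types ship with @mlc-ai/web-llm.

```typescript
// Illustrative structural types (assumptions, not WebLLM's own declarations).
interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

interface StreamChunk {
  choices: { delta: { content?: string | null } }[];
}

interface ChatEngine {
  chat: {
    completions: {
      create(req: { messages: ChatMessage[]; stream: true }): Promise<AsyncIterable<StreamChunk>>;
    };
  };
}

// Extract the text fragment from one streamed chunk ("" when it carries none).
function deltaText(chunk: StreamChunk): string {
  return chunk.choices[0]?.delta.content ?? "";
}

// Stream a reply, pass each fragment to onToken, and return the full text.
async function streamReply(
  engine: ChatEngine,
  messages: ChatMessage[],
  onToken: (fragment: string) => void,
): Promise<string> {
  const chunks = await engine.chat.completions.create({ messages, stream: true });
  let reply = "";
  for await (const chunk of chunks) {
    const fragment = deltaText(chunk);
    reply += fragment;
    onToken(fragment);
  }
  return reply;
}

// In a browser, the engine would typically be created with the library, e.g.:
//   import { CreateMLCEngine } from "@mlc-ai/web-llm";
//   const engine = await CreateMLCEngine("Llama-3.1-8B-Instruct-q4f32_1-MLC"); // example model ID
```

Passing the engine in as a parameter keeps the helper independent of how the model was loaded, so the same function could be exercised with a stub during development.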

Architecture Overview

WebLLM uses MLC-LLM's compilation pipeline to convert models into TVM-optimized WebGPU shaders. The runtime loads model weights into GPU memory via the WebGPU API and executes transformer layers as compute shader dispatches. A JavaScript wrapper exposes the OpenAI-compatible API.

Self-Hosting & Configuration

  • Install via npm: @mlc-ai/web-llm
  • No server needed; the library runs entirely client-side
  • Pre-compiled model variants available for different quantization levels
  • Configure max generation length, temperature, and system prompts via the API
  • Works in Chrome, Edge, and other browsers with WebGPU support
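
The configuration options mentioned above (generation length, temperature, system prompt) map naturally onto an OpenAI-style request object. The following is a minimal sketch under that assumption: `GenerationConfig` and `buildChatRequest` are hypothetical helpers, and the temperature bounds follow the common OpenAI convention rather than anything confirmed for WebLLM.

```typescript
interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

// Hypothetical app-level settings; field names are illustrative.
interface GenerationConfig {
  systemPrompt: string;
  temperature: number; // assumed valid range: 0 to 2 (OpenAI convention)
  maxTokens: number;   // upper bound on generated tokens
}

// OpenAI-style request body (non-streaming).
interface ChatRequest {
  messages: ChatMessage[];
  temperature: number;
  max_tokens: number;
}

// Validate the settings and assemble the request object.
function buildChatRequest(userPrompt: string, config: GenerationConfig): ChatRequest {
  if (config.temperature < 0 || config.temperature > 2) {
    throw new RangeError("temperature must be between 0 and 2");
  }
  if (!Number.isInteger(config.maxTokens) || config.maxTokens <= 0) {
    throw new RangeError("maxTokens must be a positive integer");
  }
  return {
    messages: [
      { role: "system", content: config.systemPrompt },
      { role: "user", content: userPrompt },
    ],
    temperature: config.temperature,
    max_tokens: config.maxTokens,
  };
}
```

The resulting object would then be passed to the engine's chat completions call; validating up front keeps malformed settings from reaching the model at all.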

Key Features

  • Full privacy: all computation happens on the user's device
  • OpenAI-compatible API makes it a drop-in replacement for cloud calls
  • Supports 4-bit and 3-bit quantized models to fit in consumer GPU VRAM
  • Service Worker mode enables background LLM processing
  • Web Worker support keeps the main thread responsive during generation

Comparison with Similar Tools

  • Ollama: desktop-native local LLM runner; WebLLM runs in the browser with nothing to install
  • llama.cpp (WASM): CPU-based WASM port; WebLLM uses WebGPU for GPU acceleration
  • Transformers.js: Hugging Face's browser ML library; WebLLM focuses on LLM chat with better GPU utilization
  • MLC-LLM: the parent project covering native platforms; WebLLM targets the browser specifically
  • llamafile: single-file LLM executables; WebLLM needs no executable, only model weights that the browser downloads once and caches

FAQ

Q: Which browsers support WebLLM? A: Chrome 113+, Edge 113+, and other Chromium-based browsers with WebGPU enabled. Firefox support is experimental.
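
Because support varies by browser, an application may want to check for WebGPU before loading a model. The sketch below relies only on the standard `navigator.gpu.requestAdapter()` entry point; `GpuCapableNavigator`, `hasWebGPUApi`, and `detectWebGPU` are hypothetical names introduced for illustration.

```typescript
// The small part of `navigator` this check touches.
interface GpuCapableNavigator {
  gpu?: { requestAdapter(): Promise<object | null> };
}

// Synchronous check: does the browser expose the WebGPU API at all?
function hasWebGPUApi(nav: GpuCapableNavigator): boolean {
  return nav.gpu !== undefined && nav.gpu !== null;
}

// Asynchronous check: the API may exist while no usable GPU adapter is available.
async function detectWebGPU(
  nav: GpuCapableNavigator,
): Promise<"ok" | "no-api" | "no-adapter"> {
  if (!hasWebGPUApi(nav) || !nav.gpu) return "no-api";
  const adapter = await nav.gpu.requestAdapter();
  return adapter ? "ok" : "no-adapter";
}

// Browser usage (the cast is needed when WebGPU typings are not installed):
//   const status = await detectWebGPU(navigator as unknown as GpuCapableNavigator);
```

Distinguishing "no-api" from "no-adapter" lets the page show a more useful message: the first suggests switching or updating the browser, the second points at the device or driver.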

Q: What models can I run? A: Pre-compiled models include Llama 3, Mistral, Phi-3, Qwen, Gemma, and others in various quantization levels.
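
One way to choose among the pre-compiled variants is to filter a model list by the memory each entry says it needs. This is a sketch, assuming entries shaped like the records in the library's prebuilt model list (a `model_id` plus an optional `vram_required_MB`); `ModelEntry` and `modelsThatFit` are hypothetical, and the field names should be checked against the installed version.

```typescript
// Assumed shape of one model record (illustrative subset).
interface ModelEntry {
  model_id: string;
  vram_required_MB?: number;
}

// Return the IDs of models that fit the budget, largest first.
// Entries without a stated requirement are skipped rather than guessed at.
function modelsThatFit(models: ModelEntry[], vramBudgetMB: number): string[] {
  return models
    .filter((m) => m.vram_required_MB !== undefined && m.vram_required_MB <= vramBudgetMB)
    .sort((a, b) => (b.vram_required_MB ?? 0) - (a.vram_required_MB ?? 0))
    .map((m) => m.model_id);
}
```

Sorting largest first puts the most capable model that still fits at the front of the result; `filter` returns a new array, so the input list is left unchanged.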

Q: How much VRAM do I need? A: A 4-bit quantized 8B model needs roughly 4-5 GB of GPU memory. Smaller 3B models run on integrated GPUs.
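
The 4-5 GB figure can be sanity-checked with back-of-the-envelope arithmetic: weights take parameters × bits ÷ 8 bytes, and the key/value cache adds memory that grows with context length. The model shape below (32 layers, 8 KV heads, head dimension 128, 16-bit cache, 4096-token context) is an assumption approximating a Llama-3-8B-style model, and the estimate ignores quantization scales, activations, and runtime buffers.

```typescript
interface ModelShape {
  parameters: number;      // total weight count
  bitsPerWeight: number;   // e.g. 4 for 4-bit quantization
  layers: number;
  kvHeads: number;
  headDim: number;
  kvBytesPerValue: number; // 2 for a 16-bit KV cache
}

// Bytes needed to hold the quantized weights.
function weightBytes(m: ModelShape): number {
  return (m.parameters * m.bitsPerWeight) / 8;
}

// Bytes for keys and values (factor 2) across all layers for a given context.
function kvCacheBytes(m: ModelShape, contextTokens: number): number {
  return 2 * m.layers * m.kvHeads * m.headDim * m.kvBytesPerValue * contextTokens;
}

// Rough total in decimal gigabytes.
function estimateGB(m: ModelShape, contextTokens: number): number {
  return (weightBytes(m) + kvCacheBytes(m, contextTokens)) / 1e9;
}

// Assumed shape for an 8B model at 4-bit quantization.
const llama8b: ModelShape = {
  parameters: 8e9,
  bitsPerWeight: 4,
  layers: 32,
  kvHeads: 8,
  headDim: 128,
  kvBytesPerValue: 2,
};

console.log(estimateGB(llama8b, 4096).toFixed(2)); // prints "4.54"
```

Under these assumptions the weights account for 4 GB and the cache for about 0.54 GB, which lands inside the range quoted above.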

Q: Can I fine-tune models with WebLLM? A: No. WebLLM is inference-only. Fine-tune models offline and compile them for WebLLM using the MLC-LLM toolchain.
