# WebLLM — In-Browser LLM Inference Engine

> Run large language models entirely in the browser with WebGPU acceleration and no server required.

## Install
```bash
npm install @mlc-ai/web-llm
```

## Quick Use
```javascript
import { CreateMLCEngine } from "@mlc-ai/web-llm";
const engine = await CreateMLCEngine("Llama-3.1-8B-Instruct-q4f32_1-MLC");
const reply = await engine.chat.completions.create({
  messages: [{ role: "user", content: "Hello!" }]
});
console.log(reply.choices[0].message.content);
```

## Introduction
WebLLM brings large language model inference directly into the browser, using WebGPU for hardware acceleration. Because no server-side computation is involved, prompts and outputs never leave the device, and once the model weights are cached, chat and text generation can work offline in any modern browser that supports WebGPU.

## What WebLLM Does
- Runs LLMs (Llama, Mistral, Phi, Gemma, Qwen) entirely client-side in the browser
- Leverages WebGPU for near-native GPU performance without plugins or extensions
- Provides an OpenAI-compatible chat completions API for drop-in integration
- Supports streaming responses, JSON mode, and function calling
- Caches model weights in browser storage so later loads skip the download
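
Streaming follows the OpenAI pattern: passing `stream: true` yields an async iterable of chunks, each carrying a piece of text in `choices[0].delta.content`. A minimal sketch, assuming `engine` was created with `CreateMLCEngine` as in Quick Use; `streamReply` and `onToken` are hypothetical names used only for illustration:

```javascript
// Hypothetical helper: streams one reply and returns the full text.
// `engine` is assumed to be an engine returned by CreateMLCEngine.
async function streamReply(engine, prompt, onToken = () => {}) {
  const chunks = await engine.chat.completions.create({
    messages: [{ role: "user", content: prompt }],
    stream: true, // ask for incremental chunks instead of one final message
  });

  let text = "";
  for await (const chunk of chunks) {
    // Some chunks (for example the final one) may carry no content.
    const delta = chunk.choices[0]?.delta?.content ?? "";
    text += delta;
    onToken(delta);
  }
  return text;
}
```

In a page, `onToken` would typically append each piece to the DOM so the reply appears as it is generated.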

## Architecture Overview
WebLLM compiles models through Apache TVM's machine learning compiler stack into WebGPU-optimized shaders. At runtime, a lightweight JavaScript engine loads quantized model weights into GPU memory via the WebGPU API, executes transformer attention and feed-forward layers as compute shaders, and exposes an OpenAI-compatible interface. Web worker and service worker engine variants move inference off the main thread so the UI stays responsive.

## Self-Hosting & Configuration
- Install via npm or load from a CDN script tag in any web project
- Choose from dozens of pre-quantized models hosted on Hugging Face
- Pick a quantization variant (for example `q4f16_1` or `q4f32_1`) to balance quality against VRAM usage
- Set the context window size when loading a model, and generation parameters (temperature, top-p) per request
- Use the service worker engine variant for multi-tab or PWA deployments
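
These options land in two places: load-time settings passed to `CreateMLCEngine`, and sampling parameters passed with each request. A sketch, assuming the `initProgressCallback` engine option and the `context_window_size` chat option behave as described in the WebLLM documentation; `loadEngine` and `ask` are hypothetical helpers, and the `webllm` module is passed in as an argument so the snippet does not depend on a particular bundler setup:

```javascript
// Hypothetical helper: loads a model with a progress callback and a
// reduced context window. `webllm` is the imported "@mlc-ai/web-llm" module.
async function loadEngine(webllm, modelId, onProgress) {
  return webllm.CreateMLCEngine(
    modelId,
    // Engine config: invoked repeatedly while weights download and load.
    { initProgressCallback: (report) => onProgress(report.progress, report.text) },
    // Chat options applied at load time (assumed to override the model default).
    { context_window_size: 2048 },
  );
}

// Hypothetical helper: sampling parameters are chosen per request.
async function ask(engine, prompt) {
  const reply = await engine.chat.completions.create({
    messages: [{ role: "user", content: prompt }],
    temperature: 0.7,
    top_p: 0.9,
    max_tokens: 256,
  });
  return reply.choices[0].message.content;
}
```

In an application this might be called as `loadEngine(await import("@mlc-ai/web-llm"), "Llama-3.1-8B-Instruct-q4f32_1-MLC", (p, msg) => console.log(p, msg))`.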

## Key Features
- Zero server dependency means complete data privacy for end users
- OpenAI-compatible API makes migration from cloud LLMs straightforward
- Supports structured output via JSON mode and grammar-guided decoding
- Pre-built model library covers instruction-tuned and code-generation models
- Works in Chrome, Edge, and any other browser with WebGPU support
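
JSON mode is requested through the OpenAI-style `response_format` field. A minimal sketch, assuming `engine` comes from `CreateMLCEngine`; `askForJson` is a hypothetical helper and the system prompt wording is illustrative only:

```javascript
// Hypothetical helper: asks for a JSON object and parses the reply.
async function askForJson(engine, prompt) {
  const reply = await engine.chat.completions.create({
    messages: [
      { role: "system", content: "Answer with a single JSON object." },
      { role: "user", content: prompt },
    ],
    response_format: { type: "json_object" }, // request JSON-constrained output
  });
  // The content arrives as a JSON string; JSON.parse throws if the
  // output was truncated (for example by a low max_tokens limit).
  return JSON.parse(reply.choices[0].message.content);
}
```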

## Comparison with Similar Tools
- **Ollama**: runs models locally on desktop but requires a native binary; WebLLM runs purely in-browser
- **llama.cpp (WASM)**: compiles to WebAssembly and typically runs on the CPU only; WebLLM uses WebGPU for GPU acceleration
- **Transformers.js**: general-purpose library covering many model types via ONNX Runtime; WebLLM focuses on chat-style decoder LLMs
- **LM Studio**: desktop GUI application requiring installation; WebLLM needs only a web page
- **PrivateGPT**: server-side Python stack; WebLLM is a client-side JavaScript library

## FAQ
**Q: What browsers support WebLLM?**
A: Any browser with WebGPU enabled, including Chrome 113+, Edge 113+, and recent builds of Firefox and Safari.
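
Rather than checking version numbers, a page can probe for WebGPU directly through the standard `navigator.gpu.requestAdapter()` call. A minimal sketch; `hasWebGPU` is a hypothetical helper, and the `nav` parameter exists only so the check can be exercised outside a browser:

```javascript
// Returns true when a WebGPU adapter can be obtained.
// `nav` defaults to the page's navigator object.
async function hasWebGPU(nav = globalThis.navigator) {
  if (!nav || !("gpu" in nav)) return false; // API not exposed at all
  try {
    const adapter = await nav.gpu.requestAdapter();
    return Boolean(adapter); // null means no suitable GPU was found
  } catch {
    return false; // treat any failure as "unsupported"
  }
}
```

A page could call this before loading a model and show a fallback message when it resolves to `false`.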

**Q: How much VRAM do I need?**
A: A 4-bit quantized 8B model requires roughly 4-5 GB of GPU memory. Smaller 1-3B models work with 2 GB.
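
The weight portion of that figure follows from simple arithmetic: parameter count times bits per weight. A rough sketch that deliberately ignores quantization scales, the KV cache, and activations, all of which add to the total; `weightGigabytes` is a hypothetical helper:

```javascript
// Approximate size of the quantized weights alone, in gigabytes (10^9 bytes).
function weightGigabytes(paramCount, bitsPerWeight) {
  const bytes = (paramCount * bitsPerWeight) / 8;
  return bytes / 1e9;
}

console.log(weightGigabytes(8e9, 4)); // 4
console.log(weightGigabytes(3e9, 4)); // 1.5
```

The gap between these numbers and the totals quoted above is runtime state such as the KV cache, which grows with context length.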

**Q: Can I fine-tune or add my own models?**
A: WebLLM consumes pre-compiled model libraries. You can compile custom models using the MLC-LLM toolchain and host them for WebLLM to load.
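
A custom model is registered by passing an `appConfig` whose `model_list` points at the hosted weights and the compiled WebGPU library. A sketch under assumptions: the field names `model`, `model_id`, and `model_lib` are taken from recent WebLLM releases (older releases used different names, so check the version in use), `customAppConfig` is a hypothetical helper, and every URL and ID below is a placeholder:

```javascript
// Hypothetical helper: builds an appConfig describing one self-hosted model.
function customAppConfig({ modelId, weightsUrl, libUrl }) {
  return {
    model_list: [
      {
        model_id: modelId, // name later passed to CreateMLCEngine
        model: weightsUrl, // where the MLC-converted weights are hosted
        model_lib: libUrl, // compiled WebGPU .wasm library for this model
      },
    ],
  };
}

const appConfig = customAppConfig({
  modelId: "my-model-q4f16_1-MLC", // placeholder
  weightsUrl: "https://example.com/my-model-q4f16_1-MLC", // placeholder
  libUrl: "https://example.com/my-model-webgpu.wasm", // placeholder
});
```

The result would then be passed as the engine config, for example `CreateMLCEngine("my-model-q4f16_1-MLC", { appConfig })`.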

**Q: Does it work on mobile devices?**
A: WebGPU support on mobile is still emerging. Android Chrome has partial support; iOS Safari support is in development.

## Sources
- https://github.com/mlc-ai/web-llm
- https://webllm.mlc.ai/

---
Source: https://tokrepo.com/en/workflows/asset-4a347063
Author: Script Depot