# WebLLM — Run Large Language Models Directly in the Browser

> WebLLM is an MLC project that brings LLM inference to web browsers using WebGPU. It runs models like LLaMA, Mistral, and Phi entirely client-side with no server required, enabling private AI chat and text generation from any modern browser.

## Quick Use

Install the package:

```bash
npm install @mlc-ai/web-llm
```

Then create an engine and chat:

```javascript
import { CreateMLCEngine } from "@mlc-ai/web-llm";

const engine = await CreateMLCEngine("Llama-3.1-8B-Instruct-q4f32_1-MLC");

const reply = await engine.chat.completions.create({
  messages: [{ role: "user", content: "Hello!" }],
});

console.log(reply.choices[0].message.content);
```

## Introduction

WebLLM brings large language model inference directly into the web browser using WebGPU acceleration. Developed by the MLC AI team, it compiles and optimizes models to run entirely on the client device, eliminating server costs and keeping all data private. It provides an OpenAI-compatible API that makes integrating browser-based AI straightforward.

## What WebLLM Does

- Runs quantized LLMs (LLaMA, Mistral, Phi, Gemma, Qwen) entirely in the browser via WebGPU
- Provides an OpenAI-compatible chat completions API for drop-in replacement of cloud endpoints
- Supports streaming responses, JSON mode, and function calling in the browser
- Caches model weights in browser storage for instant subsequent loads
- Enables private AI applications where no data leaves the user's device

## Architecture Overview

WebLLM uses Apache TVM and MLC-LLM to compile model computation graphs into WebGPU shader programs. Models are quantized (typically to 4 bits) and split into cached shards that are downloaded on first use. A JavaScript runtime orchestrates the GPU kernels, KV-cache management, and token sampling. The engine exposes an async API that mirrors the OpenAI SDK, so existing client code works with minimal changes.
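Because the shards are fetched and cached on first use, initial load can take minutes for a multi-gigabyte model. The engine constructor accepts an `initProgressCallback` option for surfacing that progress to the user; a minimal sketch (the model ID here is illustrative — pick one from the pre-compiled list):

```javascript
import { CreateMLCEngine } from "@mlc-ai/web-llm";

// Report shard download and compilation progress during the first load.
// Subsequent loads hit the browser cache and initialize much faster.
const engine = await CreateMLCEngine("Phi-3.5-mini-instruct-q4f16_1-MLC", {
  initProgressCallback: (report) => {
    // report.text is a human-readable progress message
    console.log(report.text);
  },
});
```

Wiring this callback to a progress bar is the usual pattern, since an unannounced multi-gigabyte download is a poor first impression.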
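The JSON mode mentioned above follows the OpenAI-style `response_format` option. A hedged sketch (exact constrained-output variants may differ by version; the prompt is illustrative):

```javascript
import { CreateMLCEngine } from "@mlc-ai/web-llm";

const engine = await CreateMLCEngine("Llama-3.1-8B-Instruct-q4f32_1-MLC");

// Constrain the model's output to valid JSON.
const reply = await engine.chat.completions.create({
  messages: [
    { role: "user", content: "List three colors as a JSON object." },
  ],
  response_format: { type: "json_object" },
});

console.log(JSON.parse(reply.choices[0].message.content));
```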
## Self-Hosting & Configuration

- Install the npm package: `npm install @mlc-ai/web-llm`
- Choose a model from the pre-compiled model list (e.g., Phi-3.5-mini, LLaMA-3.1-8B)
- Models download and cache in IndexedDB on first load (typically 2-5 GB for an 8B model)
- WebGPU must be enabled in the browser (Chrome 113+, Edge 113+, Firefox Nightly)
- Custom models can be compiled with the MLC-LLM toolchain and loaded via a model URL

## Key Features

- Zero server infrastructure — all inference runs on the user's GPU via WebGPU
- OpenAI-compatible API with chat completions, streaming, and function calling
- Model caching in IndexedDB eliminates re-downloads across sessions
- Service Worker deployment support, enabling background AI processing
- Pre-compiled model library covering popular open-weight models at various quantization levels

## Comparison with Similar Tools

- **Ollama** — a local LLM runner, but it requires native installation; WebLLM runs in any browser tab
- **llama.cpp (WASM)** — CPU-only WASM builds are slower; WebLLM uses WebGPU for GPU acceleration
- **Transformers.js** — Hugging Face browser inference focused on smaller models; WebLLM handles 7B+ models
- **LM Studio** — a native desktop app with good UX, but not embeddable in web applications
- **vLLM** — a server-side high-throughput engine; WebLLM is client-side for privacy and zero-cost serving

## FAQ

**Q: What hardware is needed to run LLMs in the browser?**
A: A GPU with 4+ GB of VRAM and a WebGPU-capable browser. Most modern laptops and desktops with discrete or integrated GPUs can run 3B-8B models.

**Q: How fast is WebLLM compared to native inference?**
A: WebGPU adds some overhead versus native CUDA, but quantized models achieve 20-60 tokens/second on mid-range GPUs, fast enough for interactive chat.

**Q: Does data leave my device?**
A: No. All computation and model weights stay in the browser, and no network requests are made during inference (after the initial model download).
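**Q: How do I stream tokens as they are generated?**
A: Pass `stream: true` to the chat completions call and iterate the returned async iterable, mirroring the OpenAI SDK's streaming shape. A sketch based on the documented API (the prompt is illustrative):

```javascript
import { CreateMLCEngine } from "@mlc-ai/web-llm";

const engine = await CreateMLCEngine("Llama-3.1-8B-Instruct-q4f32_1-MLC");

// With stream: true, the call yields delta chunks instead of one reply.
const chunks = await engine.chat.completions.create({
  messages: [{ role: "user", content: "Write a haiku about WebGPU." }],
  stream: true,
});

let text = "";
for await (const chunk of chunks) {
  text += chunk.choices[0]?.delta?.content ?? "";
}
console.log(text);
```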
**Q: Can I use WebLLM in a Chrome Extension?**
A: Yes. WebLLM supports Service Worker contexts, making it suitable for browser extensions with background AI processing.

## Sources

- https://github.com/mlc-ai/web-llm
- https://webllm.mlc.ai

---
Source: https://tokrepo.com/en/workflows/6469e991-3d9d-11f1-9bc6-00163e2b0d79
Author: Script Depot