# WebLLM — High-Performance In-Browser LLM Inference

> A JavaScript library that runs large language models directly in the browser using WebGPU, enabling private on-device AI without a server.

## Install

Install the library from npm: `npm install @mlc-ai/web-llm`.

## Quick Use

```javascript
import { CreateMLCEngine } from "@mlc-ai/web-llm";

const engine = await CreateMLCEngine("Llama-3.1-8B-Instruct-q4f16_1-MLC");
const reply = await engine.chat.completions.create({
  messages: [{ role: "user", content: "Hello!" }]
});
console.log(reply.choices[0].message.content);
```

## Introduction

WebLLM brings LLM inference to the browser using WebGPU acceleration. It compiles models via Apache TVM into a format that runs natively on the GPU through the browser's WebGPU API. No server is required, and no data leaves the device.

## What WebLLM Does

- Runs quantized LLMs (Llama, Mistral, Phi, Qwen, Gemma) in the browser at near-native speed
- Uses WebGPU for GPU-accelerated inference without plugins or extensions
- Provides an OpenAI-compatible chat completions API in JavaScript
- Supports streaming responses, JSON mode, and function calling (see the sketches at the end of this page)
- Caches model weights in the browser so subsequent loads skip the download

## Architecture Overview

WebLLM uses MLC-LLM's compilation pipeline to convert models into TVM-optimized WebGPU shaders. The runtime loads model weights into GPU memory via the WebGPU API and executes transformer layers as compute shader dispatches. A JavaScript wrapper exposes the OpenAI-compatible API.

## Self-Hosting & Configuration

- Install via npm: `@mlc-ai/web-llm`
- No server needed; the library runs entirely client-side
- Pre-compiled model variants are available at different quantization levels
- Configure maximum generation length, temperature, and system prompts via the API
- Works in Chrome, Edge, and other browsers with WebGPU support

## Key Features

- Full privacy: all computation happens on the user's device
- OpenAI-compatible API makes it a drop-in replacement for cloud calls
- Supports 4-bit and 3-bit quantized models to fit in consumer GPU VRAM
- Service Worker mode enables background LLM processing
- Web Worker support keeps the main thread responsive during generation (see the worker sketch at the end of this page)

## Comparison with Similar Tools

- **Ollama** — desktop-native local LLM runner; WebLLM runs in the browser with no install
- **llama.cpp (WASM)** — CPU-based WASM port; WebLLM uses WebGPU for GPU acceleration
- **Transformers.js** — Hugging Face's browser ML library; WebLLM focuses on LLM chat with better GPU utilization
- **MLC-LLM** — the parent project covering native platforms; WebLLM targets the browser specifically
- **llamafile** — single-file LLM executables; WebLLM requires no download beyond browser caching

## FAQ

**Q: Which browsers support WebLLM?**
A: Chrome 113+, Edge 113+, and other Chromium-based browsers with WebGPU enabled. Firefox support is experimental.

**Q: What models can I run?**
A: Pre-compiled models include Llama 3, Mistral, Phi-3, Qwen, Gemma, and others at various quantization levels.

**Q: How much VRAM do I need?**
A: A 4-bit quantized 8B model needs roughly 4-5 GB of GPU memory. Smaller 3B models run on integrated GPUs.

**Q: Can I fine-tune models with WebLLM?**
A: No. WebLLM is inference-only. Fine-tune models offline and compile them for WebLLM using the MLC-LLM toolchain.

## Sources

- https://github.com/mlc-ai/web-llm
- https://webllm.mlc.ai/
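## Sketch: Checking WebGPU Support

The FAQ above notes that WebLLM needs a WebGPU-capable browser. A minimal feature check before loading a model lets a page fall back gracefully; the sketch below uses the standard `navigator.gpu` API and is illustrative rather than code taken from WebLLM itself.

```javascript
// Illustrative sketch: detect WebGPU before attempting to load a model.
// navigator.gpu is the standard WebGPU entry point; it is undefined in
// browsers without WebGPU support.
async function hasWebGPU() {
  if (!("gpu" in navigator)) return false;
  // requestAdapter() resolves to null when no usable GPU adapter exists.
  const adapter = await navigator.gpu.requestAdapter();
  return adapter !== null;
}

if (await hasWebGPU()) {
  console.log("WebGPU available; safe to load a WebLLM model.");
} else {
  console.log("No WebGPU; show a fallback UI or point users to a supported browser.");
}
```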
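## Sketch: Streaming a Response

Streaming is listed under "What WebLLM Does". Because the API mirrors OpenAI's chat completions, a streamed request returns an async iterable of delta chunks. This is a sketch of that pattern; the `stream` option and chunk shape follow the OpenAI convention and should be checked against the WebLLM version you install.

```javascript
import { CreateMLCEngine } from "@mlc-ai/web-llm";

// Load a pre-compiled model (model ID from WebLLM's prebuilt list, as in Quick Use).
const engine = await CreateMLCEngine("Llama-3.1-8B-Instruct-q4f16_1-MLC");

// stream: true is assumed to return an async iterable of OpenAI-style chunks.
const chunks = await engine.chat.completions.create({
  messages: [{ role: "user", content: "Explain WebGPU in two sentences." }],
  stream: true,
});

let text = "";
for await (const chunk of chunks) {
  // Each chunk carries an incremental delta, as in the OpenAI streaming format.
  text += chunk.choices[0]?.delta?.content ?? "";
}
console.log(text);
```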
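## Sketch: Running in a Web Worker

Key Features mentions Web Worker support for keeping the main thread responsive. The sketch below follows the worker pattern from the project's documentation as I recall it; the export names `CreateWebWorkerMLCEngine` and `WebWorkerMLCEngineHandler` have changed across releases, so treat them as assumptions and confirm against the current docs.

```javascript
// worker.js: runs inside the Web Worker and hosts the actual engine.
import { WebWorkerMLCEngineHandler } from "@mlc-ai/web-llm";

const handler = new WebWorkerMLCEngineHandler();
self.onmessage = (msg) => handler.onmessage(msg);
```

```javascript
// main.js: the page talks to the worker through a proxy engine.
import { CreateWebWorkerMLCEngine } from "@mlc-ai/web-llm";

const engine = await CreateWebWorkerMLCEngine(
  new Worker(new URL("./worker.js", import.meta.url), { type: "module" }),
  "Llama-3.1-8B-Instruct-q4f16_1-MLC"
);

// Same OpenAI-compatible surface as the main-thread engine.
const reply = await engine.chat.completions.create({
  messages: [{ role: "user", content: "Hello from a worker!" }],
});
console.log(reply.choices[0].message.content);
```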
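## Sketch: Generation Settings and JSON Mode

Self-Hosting & Configuration says generation length, temperature, and system prompts are set through the API, and JSON mode is listed among the features. The sketch below combines these; `initProgressCallback` and `response_format` are taken from the library's OpenAI-style options as I understand them, so verify the names against the installed version.

```javascript
import { CreateMLCEngine } from "@mlc-ai/web-llm";

// Assumed option: initProgressCallback reports download/compile progress,
// useful for a loading bar on the first visit before weights are cached.
const engine = await CreateMLCEngine("Llama-3.1-8B-Instruct-q4f16_1-MLC", {
  initProgressCallback: (report) => console.log(report.text),
});

// OpenAI-style generation settings plus JSON mode (assumed option names).
const reply = await engine.chat.completions.create({
  messages: [
    { role: "system", content: "Reply only with JSON." },
    { role: "user", content: "List three WebGPU-capable browsers as a JSON array." },
  ],
  response_format: { type: "json_object" },
  temperature: 0,
  max_tokens: 256,
});

console.log(JSON.parse(reply.choices[0].message.content));
```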