# WebLLM — High-Performance In-Browser LLM Inference Engine

> Run large language models directly in your browser with WebGPU acceleration. No server required, full privacy, powered by Apache TVM.

## Quick Use
```javascript
// Install first: npm install @mlc-ai/web-llm
import { CreateMLCEngine } from "@mlc-ai/web-llm";

// Downloads the model on first use, then initializes the engine.
const engine = await CreateMLCEngine("Llama-3.1-8B-Instruct-q4f32_1-MLC");
const reply = await engine.chat.completions.create({
  messages: [{ role: "user", content: "Hello" }],
});
console.log(reply.choices[0].message.content);
```

## Introduction
WebLLM is an inference engine that runs large language models entirely inside the browser using WebGPU. It eliminates the need for server-side inference, keeping all data on the user's device while delivering near-native GPU performance through the MLC-LLM compilation stack.

## What WebLLM Does
- Runs LLMs like Llama, Mistral, Phi, and Gemma directly in the browser via WebGPU
- Provides an OpenAI-compatible chat completions API for drop-in usage
- Supports structured JSON output and function calling
- Handles model caching in the browser for faster subsequent loads
- Enables streaming token generation with real-time UI updates
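
The structured-output capability listed above can be sketched with a small helper. This is a minimal sketch, not the library's own code: `askForJson` is a hypothetical name, `engine` is assumed to be any object exposing `chat.completions.create` (such as the one returned by `CreateMLCEngine`), and JSON mode is assumed to be requested with the OpenAI-style `response_format` field.

```javascript
// Hypothetical helper: asks the model for JSON and parses the reply.
// `engine` is assumed to expose the OpenAI-style chat.completions.create().
async function askForJson(engine, prompt) {
  const reply = await engine.chat.completions.create({
    messages: [
      { role: "system", content: "Reply with a single JSON object." },
      { role: "user", content: prompt },
    ],
    // Assumption: JSON mode is requested the same way as in the OpenAI API.
    response_format: { type: "json_object" },
  });
  const text = reply.choices[0].message.content;
  try {
    return { ok: true, value: JSON.parse(text) };
  } catch (err) {
    // Return the raw text so the caller can log it or retry.
    return { ok: false, raw: text };
  }
}
```

Returning a tagged result instead of throwing keeps the calling UI code simple: it can show the parsed value or fall back to the raw text.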

## Architecture Overview
Models are compiled ahead of time with MLC-LLM, which builds on Apache TVM to generate WebGPU kernels. At runtime, WebLLM downloads the quantized model weights and a compiled WebAssembly model library into the browser. The engine manages GPU memory, the KV cache, and tokenization locally. It exposes an API modeled on the OpenAI chat completions interface, so existing code can switch to local inference with minimal changes.
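
Because the weights are downloaded at runtime, the first load can take a while, so it helps to show progress. The sketch below formats a progress report for display. It assumes each report carries a `progress` fraction between 0 and 1 and a `text` message, and that the callback is registered through the `initProgressCallback` engine option; `formatProgress` is a hypothetical helper name.

```javascript
// Formats a load-progress report for display.
// Assumption: reports look like { progress: 0..1, text: string }.
function formatProgress(report) {
  const fraction = Math.min(1, Math.max(0, report.progress ?? 0));
  const percent = Math.round(fraction * 100);
  return `${percent}% ${report.text ?? ""}`.trim();
}

// Usage with WebLLM (not executed here):
//   const engine = await CreateMLCEngine(modelId, {
//     initProgressCallback: (report) => console.log(formatProgress(report)),
//   });
```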

## Self-Hosting & Configuration
- Install via npm or use the CDN bundle for quick prototyping
- Requires a WebGPU-capable browser (Chrome 113+, Edge 113+, or Firefox Nightly)
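- Models are downloaded once and cached in browser storage for offline reuse (the Cache API by default, with IndexedDB available as an option)
- Choose the model when creating the engine; set sampling options such as `temperature` and `max_tokens` per request or in the engine's chat options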
- No backend server, API keys, or cloud dependencies needed
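
Since WebGPU is a hard requirement, it is worth checking for it before starting a multi-gigabyte download. A minimal sketch using the standard `navigator.gpu.requestAdapter()` call follows; `supportsWebGPU` is a hypothetical helper name, and the `nav` parameter exists only so the check can be exercised outside a browser.

```javascript
// Returns true when a WebGPU adapter is available.
// `nav` defaults to the browser's navigator; pass a stub when testing.
async function supportsWebGPU(nav = globalThis.navigator) {
  if (!nav || !nav.gpu) return false; // API not exposed by this browser
  try {
    const adapter = await nav.gpu.requestAdapter();
    return adapter !== null && adapter !== undefined;
  } catch {
    return false; // the adapter request itself failed
  }
}
```

A page might call this on load and show a "browser not supported" message instead of the chat UI when it resolves to `false`.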

## Key Features
- Full privacy: all inference happens on-device with zero data leaving the browser
- OpenAI-compatible API makes migration from cloud to local seamless
- Streaming support for responsive chat interfaces
- Pre-built model library covering popular open-weight LLMs
- Works on any platform with WebGPU support including laptops and tablets
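
The streaming feature above can be sketched as a helper that forwards each text fragment to the UI as it arrives. This is an illustrative sketch under the assumption that a request with `stream: true` resolves to an async iterable of OpenAI-style chunks; `streamReply` and `onDelta` are hypothetical names.

```javascript
// Hypothetical helper: streams a reply and reports each text fragment.
async function streamReply(engine, messages, onDelta) {
  const chunks = await engine.chat.completions.create({
    messages,
    stream: true, // ask for incremental chunks instead of one final message
  });
  let full = "";
  for await (const chunk of chunks) {
    // Each chunk carries an optional text delta, as in the OpenAI API.
    const delta = chunk.choices[0]?.delta?.content ?? "";
    if (delta) {
      full += delta;
      onDelta(delta, full);
    }
  }
  return full;
}
```

In a chat UI, `onDelta` would typically append the fragment to the message element, and the returned string would be stored in the conversation history.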

## Comparison with Similar Tools
- **Ollama**: runs as a native binary for local inference; WebLLM runs in the browser with no install
- **llama.cpp (WASM)**: its WASM build is CPU-bound; WebLLM is GPU-accelerated through WebGPU
- **Transformers.js**: ONNX-based browser inference; WebLLM uses TVM-compiled GPU kernels
- **LM Studio**: a desktop app with its own UI; WebLLM is an embeddable library for web developers

## FAQ
**Q: Which browsers support WebLLM?**
A: Chrome and Edge 113+ have stable WebGPU support. Firefox Nightly also works. Safari support is emerging.

**Q: How large are the model downloads?**
A: Quantized models range from 1-4 GB depending on the model and quantization level. They are cached after the first download.
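
Given downloads of this size, a page can check available storage before fetching a model. The sketch below uses the standard `navigator.storage.estimate()` API; `hasRoomFor` is a hypothetical helper, and the numbers browsers report are estimates rather than guarantees.

```javascript
// Checks whether the origin's storage quota can hold a download of `bytes`.
// `storage` defaults to navigator.storage (the standard StorageManager API).
async function hasRoomFor(bytes, storage = globalThis.navigator?.storage) {
  // Return null when the browser cannot tell us (API missing).
  if (!storage || typeof storage.estimate !== "function") return null;
  const { quota = 0, usage = 0 } = await storage.estimate();
  return quota - usage >= bytes;
}
```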

**Q: Can I use my own fine-tuned model?**
A: Yes. You can compile custom models using the MLC-LLM toolchain and load them into WebLLM.
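
As a rough sketch of the loading step, a custom model is registered through an app config passed to the engine. The field names below (`model`, `model_id`, `model_lib`) are an assumption based on recent WebLLM releases, and older releases used different names, so check the version you install. `customModelConfig` and its parameters are hypothetical.

```javascript
// Builds an app config that registers one custom model.
// All three values are placeholders supplied by the caller.
function customModelConfig({ modelId, weightsUrl, libraryUrl }) {
  return {
    model_list: [
      {
        model: weightsUrl,     // where the converted weights are hosted
        model_id: modelId,     // name later passed to CreateMLCEngine
        model_lib: libraryUrl, // compiled model library (.wasm)
      },
    ],
  };
}

// Usage with WebLLM (not executed here):
//   const appConfig = customModelConfig({ modelId, weightsUrl, libraryUrl });
//   const engine = await CreateMLCEngine(modelId, { appConfig });
```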

**Q: Is it fast enough for real-time chat?**
A: On modern GPUs, WebLLM achieves 30-80 tokens per second for 7-8B parameter models, which is sufficient for interactive chat.
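
To check this figure on your own hardware, divide the number of generated tokens by the elapsed decode time. The helper below is a small sketch of that arithmetic; `tokensPerSecond` is a hypothetical name, and the token count is assumed to come from the reply's usage data or from counting streamed chunks.

```javascript
// Computes decode throughput from a token count and elapsed wall-clock time.
function tokensPerSecond(tokenCount, elapsedMs) {
  if (elapsedMs <= 0) return 0; // avoid division by zero
  return tokenCount / (elapsedMs / 1000);
}

// Example: 200 tokens generated in 4000 ms is 50 tokens per second.
```

In the browser, `performance.now()` before and after generation gives a suitable `elapsedMs`.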

## Sources
- https://github.com/mlc-ai/web-llm
- https://webllm.mlc.ai/

---
Source: https://tokrepo.com/en/workflows/asset-9a2c4722
Author: AI Open Source