
text-generation-webui (oobabooga) — Swiss-Army Local LLM UI

text-generation-webui is the Gradio-based multi-loader UI that researchers reach for when they need everything — multiple backends, LoRA training, quantization experiments, extensions, and a familiar chat UI in one package.

Why text-generation-webui

Long before Ollama had an API, before LM Studio existed, oobabooga’s text-generation-webui was how the open-source LLM community ran models at home. It’s a Gradio app with "Chat", "Instruct", "Model", "Parameters", "Training", "Session", and "Extensions" tabs — every knob exposed, every backend pluggable. For researchers and enthusiasts who want to understand what their model is doing, rather than just chat with it, it remains the most complete tool in the ecosystem.

The cost is complexity. Compared to Ollama’s one command, text-generation-webui is a full Python install with multiple loaders (Transformers, llama.cpp, ExLlamaV2, HQQ, AQLM) and their dependency chains. Updates occasionally break existing setups. For casual use, simpler tools win. For power users who want to swap loaders, experiment with samplers, train LoRAs, and install extensions, text-generation-webui is still unmatched.

It also has the strongest roleplay and creative-writing community of any local LLM UI. Persona cards, instruction templates, and many character-focused extensions originated here. If your use case goes beyond plain instruction-following chat (interactive fiction, character roleplay, agentic narrative), text-generation-webui has the culture.

Quick Start — One-Click Install + Web UI

The start_* scripts handle a conda env and loader installation automatically. --api exposes OpenAI-compatible endpoints on port 5000. Model format flexibility is the whole point — the same UI loads GGUF (llama.cpp), GPTQ (AutoGPTQ), EXL2 (ExLlamaV2), HQQ, AWQ, and unquantized safetensors. Match loader to model format in the "Model" tab.

# 1. Clone and run the appropriate installer for your OS
git clone https://github.com/oobabooga/text-generation-webui.git
cd text-generation-webui

# macOS:
./start_macos.sh
# Linux:
# ./start_linux.sh
# Windows:
# .\start_windows.bat

# The script creates a conda env, asks about GPU (CUDA / ROCm / CPU / MPS),
# and installs the right loaders. First run takes 5-15 minutes.

# 2. Open http://localhost:7860 in your browser.
#    - "Model" tab → download a model from Hugging Face (e.g. Qwen2.5-14B-Instruct-GGUF)
#    - "Chat" tab → start chatting

# 3. Or expose the OpenAI-compatible API for programmatic access
#    --api flag enables the API extension (usually on by default in recent builds)
./start_macos.sh --api --listen --model-dir models
# API endpoint: http://localhost:5000/v1/chat/completions

# 4. Train a LoRA (research workflow)
#    "Training" tab → pick dataset → configure hyperparams → run
#    Output LoRA adapter written to loras/<name>; re-load in Model tab.
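Once the server is running with --api, the endpoint speaks the OpenAI chat-completions shape. A minimal stdlib sketch, assuming the default port 5000 and that the API extension is enabled (the function names here are illustrative, not part of the project):

```python
import json
import urllib.request

API_URL = "http://localhost:5000/v1/chat/completions"  # default --api port

def build_request(prompt, model="local", max_tokens=256, temperature=0.7):
    """Build an OpenAI-style chat-completions payload.

    text-generation-webui serves whichever model is loaded in the UI,
    so the "model" field is largely cosmetic here.
    """
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": temperature,
    }

def chat(prompt):
    """POST the payload and return the assistant's reply text."""
    data = json.dumps(build_request(prompt)).encode()
    req = urllib.request.Request(
        API_URL, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

# chat("Explain GGUF in one sentence.")  # requires a model loaded in the UI
```

Because the payload matches the OpenAI shape, any OpenAI-compatible client library should also work by pointing its base URL at port 5000.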

Key Features

Multiple inference backends

Transformers (full precision or 8-bit), llama.cpp (GGUF), ExLlamaV2 (EXL2), HQQ, AQLM, AutoAWQ, GPTQ. Pick the best loader for your model and hardware.
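The usual format-to-loader pairings from the list above can be sketched as a small lookup. This helper is hypothetical (the UI's "Model" tab normally auto-detects the loader), but the mapping is worth knowing when a load fails:

```python
# Hypothetical helper: typical model-format → loader pairings in
# text-generation-webui. The mapping reflects common convention, not
# an official table from the project.
FORMAT_TO_LOADER = {
    ".gguf": "llama.cpp",           # quantized, CPU/GPU-split friendly
    "gptq": "AutoGPTQ / ExLlamaV2", # 4-bit GPU quantization
    "exl2": "ExLlamaV2",            # fast GPU-only quantization
    "awq": "AutoAWQ",
    ".safetensors": "Transformers", # unquantized, or 8-bit via bitsandbytes
}

def suggest_loader(model_name: str) -> str:
    """Guess a loader from a model folder/file name (rough heuristic)."""
    name = model_name.lower()
    for key, loader in FORMAT_TO_LOADER.items():
        if key in name:
            return loader
    return "Transformers"  # safe default for Hugging Face checkpoints

# suggest_loader("Qwen2.5-14B-Instruct-Q4_K_M.gguf") → "llama.cpp"
```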

Training (LoRA / QLoRA)

Built-in LoRA and QLoRA training from the "Training" tab. Not as fast as Axolotl/Unsloth, but the most accessible UI for "train a LoRA on my dataset".

Extensions ecosystem

SD integration, TTS, long-term memory, multimodal vision, web search, and dozens more. Install via the built-in UI from the ooba-extensions hub.

Chat + Instruct + Notebook modes

Chat for back-and-forth, Instruct for zero-shot prompts, Notebook for free-form completion and prompt hacking. Switch anytime.

Fine control over sampling

Every sampling parameter exposed: temperature, top_p, top_k, min_p, repetition penalty, dynamic temperature, mirostat, etc. Useful for creative writing and research.
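These samplers can also be set per-request over the API. In this sketch, the standard OpenAI fields (temperature, top_p, max_tokens) pass through everywhere; the extra fields use text-generation-webui's sampler names and are assumed to be accepted by its API — verify against your build's documentation:

```python
# Sketch of a chat-completions payload with extended sampler settings.
sampler_payload = {
    "messages": [{"role": "user", "content": "Write a limerick about GGUF."}],
    "temperature": 1.1,         # higher = more varied word choice
    "top_p": 0.95,              # nucleus sampling cutoff
    "min_p": 0.05,              # drop tokens below 5% of the top token's prob
    "top_k": 50,                # consider only the 50 most likely tokens
    "repetition_penalty": 1.1,  # discourage verbatim loops
    "max_tokens": 200,
}
```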

Character cards

Tavern-style character cards, persona management, and roleplay-oriented templates. The ecosystem around character-based use cases is more developed here than anywhere else in local LLM land.

Comparison

| Tool | Audience | Flexibility | Learning Curve | Best Fit |
| --- | --- | --- | --- | --- |
| text-generation-webui (this tool) | Power users, researchers | Very high | Medium-high | Experimentation, LoRA, roleplay |
| Ollama | Developers | Medium | Low | Ship fast, API-first |
| LM Studio | All users | Medium | Very low | Non-developer GUI |
| Jan | All users | Medium | Very low | OSS-purist desktop app |

Use Cases

01. Model experimentation

Swap loaders, test quantizations, compare samplers on the same prompt. Gradio UI shows results immediately — much faster iteration than a CLI.

02. LoRA fine-tuning

Train small LoRA adapters on custom datasets without leaving the UI. Useful for research and for domain-specific personalization before committing to Axolotl/Unsloth pipelines.

03. Creative / roleplay / character AI

Strongest character-card and persona ecosystem. If your use case is interactive fiction or persistent character chat, this is the community and tooling where that work happens.

Pricing & License

text-generation-webui: AGPL-3.0 open source. Free to use for any purpose, but note the AGPL's network clause: if you modify it and host it as a service, your modified source must be released under AGPL as well — check the legal implications before embedding it in commercial products.

Hardware cost: scales with loader and model. Transformers full-precision is RAM-hungry; llama.cpp GGUF is the most forgiving. EXL2 is fast on GPU with good quality.
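As a rough rule of thumb (a ballpark approximation, not a figure from the project docs): memory ≈ parameter count × bytes per weight, plus overhead for KV cache, activations, and loader buffers. A quick sketch:

```python
def estimate_model_memory_gb(params_billions: float, bits_per_weight: float,
                             overhead: float = 1.2) -> float:
    """Rough memory footprint: params × bytes/weight × overhead factor.

    The 20% overhead for KV cache, activations, and loader buffers is
    an assumed ballpark, not a measured value.
    """
    bytes_total = params_billions * 1e9 * (bits_per_weight / 8)
    return round(bytes_total * overhead / 1e9, 1)

# A 14B model at different precisions (approximate GB):
# full fp16:          estimate_model_memory_gb(14, 16)
# 8-bit:              estimate_model_memory_gb(14, 8)
# Q4_K_M (~4.5 bpw):  estimate_model_memory_gb(14, 4.5)
```

This is why GGUF quantizations are "the most forgiving": dropping from 16 to ~4.5 bits per weight cuts the footprint by roughly 3-4×, and llama.cpp can split what remains between RAM and VRAM.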

Operational cost: heavier than Ollama or Jan — multi-loader installs can break on Python/CUDA upgrades. Budget time for first-run setup and occasional breakages when upgrading.


Frequently Asked Questions

Is text-generation-webui still relevant in 2026?

Yes, for specific audiences. For API-driven chat, Ollama has won. For GUI chat, LM Studio has won. For research, LoRA training, and roleplay, text-generation-webui remains the most complete tool. Pick based on your use case.

Can I use text-generation-webui as an API backend?

Yes — enable --api on startup. The OpenAI-compatible API exposes chat completions and completions endpoints on port 5000 by default. Quality of compatibility is good but not identical to Ollama/vLLM; verify edge cases for your tool integrations.

How does training here compare to Axolotl or Unsloth?

text-generation-webui’s training UI is the most accessible way to train a LoRA without writing YAML configs. Axolotl and Unsloth are more flexible, faster, and more memory-efficient — required for serious fine-tuning. text-generation-webui is great for "try training" and iterating on small datasets.

Does it run on Apple Silicon?

Yes. Pick the Metal-enabled llama.cpp loader or the Transformers MPS backend. Not as optimized as MLX, but workable. For pure chat on Apple Silicon, Ollama or MLX-based tools are faster.

What about the AGPL license?

AGPL-3.0 means: you can use it freely; if you modify and network-expose it, the modified source must be AGPL too. For personal use or internal tools, no issue. For SaaS products embedding the code, consult legal. Alternative: build your own UI (in Gradio or otherwise) directly on top of llama.cpp / vLLM, both of which are MIT / Apache licensed.
