text-generation-webui — A Gradio Web UI for Local LLMs
oobabooga's text-generation-webui is the "AUTOMATIC1111 of LLMs": a feature-rich Gradio interface for chatting with and serving local language models. It supports llama.cpp, Transformers, ExLlamaV2, and dozens of model formats.
What it is
text-generation-webui (commonly called 'oobabooga') is a feature-rich Gradio web interface for chatting with and serving local language models. It supports multiple backends including llama.cpp, Transformers, ExLlamaV2, and dozens of model formats. The one-line installer detects your hardware (CUDA, ROCm, MPS, CPU) and configures the appropriate backend automatically.
The project targets users who want to run LLMs locally with a user-friendly web interface. It provides chat, notebook, and API modes, model management, LoRA loading, and extension support.
How it saves time or tokens
text-generation-webui eliminates the need to write Python scripts for local model inference. The web UI provides a chat interface, parameter tuning, model comparison, and API endpoints without any code. The one-line installer handles Python environments, CUDA dependencies, and backend compilation. For experimentation with different models and parameters, the UI approach is faster than editing scripts.
How to use
- Clone and run the installer:
git clone https://github.com/oobabooga/text-generation-webui
cd text-generation-webui
./start_linux.sh # or start_windows.bat / start_macos.sh
- Select your hardware during setup (CUDA, ROCm, MPS, or CPU).
- Download a model from the Model tab and start chatting.
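To call the server from scripts (as in the Example below), launch it with the API enabled. A minimal sketch, assuming the --api and --model flags available in recent releases; run the start script with --help to confirm the exact flag names for your version:
# enable the OpenAI-compatible API (default port 5000) and preload a downloaded model
./start_linux.sh --api --model <model-folder-name>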
Example
Using the API for programmatic access:
import requests

response = requests.post(
    'http://localhost:5000/v1/chat/completions',
    json={
        'model': 'loaded-model',
        'messages': [
            {'role': 'user', 'content': 'Explain quantum computing briefly'}
        ],
        'temperature': 0.7,
        'max_tokens': 200,
    }
)
print(response.json()['choices'][0]['message']['content'])
The API follows the OpenAI chat completions format, making it a drop-in replacement for API-based workflows.
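Because of that compatibility, the official openai Python client can also be pointed at the local server. A minimal sketch, assuming the openai v1+ SDK; the api_key value is a placeholder, since the local endpoint does not require a key unless you configure one:
from openai import OpenAI

# point the official client at the local server instead of api.openai.com
client = OpenAI(base_url='http://localhost:5000/v1', api_key='not-needed')

reply = client.chat.completions.create(
    model='loaded-model',  # text-generation-webui serves whichever model is currently loaded
    messages=[{'role': 'user', 'content': 'Explain quantum computing briefly'}],
    temperature=0.7,
    max_tokens=200,
)
print(reply.choices[0].message.content)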
Related on TokRepo
- Local LLM with text-generation-webui — Detailed guide for text-generation-webui setup
- Local LLM Providers — Compare local LLM running tools including Ollama and LM Studio
Common pitfalls
- The installer creates a large Python environment (several GB). Ensure sufficient disk space before installation.
- VRAM requirements vary by model and quantization. A 7B model at 4-bit quantization needs roughly 6GB of VRAM (see the rough estimate after this list). Check model requirements before downloading.
- Some model formats (GPTQ, AWQ, EXL2) require specific backends. Not all backends are compatible with all formats.
- Always check the official documentation for the latest version-specific changes and migration guides before upgrading in production environments.
- For team deployments, agree on shared model, quantization, and sampling-parameter presets so results stay consistent across developers.
- Model quantization levels (4-bit, 8-bit, 16-bit) trade quality for speed and memory usage. Start with 4-bit quantization for testing and increase precision for production quality.
- The web UI exposes an API endpoint by default. In shared environments, configure authentication or restrict access to localhost to prevent unauthorized model usage.
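A rough back-of-the-envelope check for the VRAM pitfall above: weight memory is roughly (parameters x bits per weight) / 8 bytes, plus overhead for the KV cache, activations, and the framework. A minimal sketch; the 2 GB overhead allowance is an illustrative assumption, not a measured value:
def estimate_vram_gb(params_billion, bits_per_weight, overhead_gb=2.0):
    """Very rough VRAM estimate: weight memory plus a flat allowance
    for KV cache, activations, and framework overhead (assumed ~2 GB)."""
    weight_gb = params_billion * 1e9 * bits_per_weight / 8 / 1e9
    return weight_gb + overhead_gb

# 7B at 4-bit: ~3.5 GB of weights + ~2 GB overhead, roughly 5.5-6 GB total
print(round(estimate_vram_gb(7, 4), 1))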
Frequently Asked Questions
What model formats does it support?
It supports GGUF (llama.cpp), GPTQ, AWQ, EXL2 (ExLlamaV2), and standard Hugging Face Transformers format. Each format has different performance characteristics and VRAM requirements.
Can it replace the OpenAI API in existing applications?
Yes. The built-in API server follows the OpenAI chat completions format. This means you can use it as a local replacement for OpenAI's API in applications that support custom endpoints.
What hardware does it run on?
The UI works on NVIDIA GPUs (CUDA), AMD GPUs (ROCm), Apple Silicon (MPS), and CPU-only setups. GPU acceleration dramatically improves inference speed. A minimum of 8GB VRAM is recommended for 7B parameter models.
Can I use LoRA adapters?
Yes. The UI supports loading LoRA adapters on top of base models. This lets you use fine-tuned models without merging the adapters, saving disk space and enabling quick switching.
How does it compare to Ollama?
Ollama provides a simpler CLI-focused experience for running models. text-generation-webui offers a richer web UI with more parameter controls, multiple backends, extension support, and model comparison features. Ollama is easier to set up; text-generation-webui provides more flexibility.
Citations (3)
- text-generation-webui GitHub — text-generation-webui is a Gradio UI for local LLMs
- text-generation-webui Wiki — Multiple backend support: llama.cpp, Transformers, ExLlamaV2
- llama.cpp GitHub — llama.cpp for efficient LLM inference