text-generation-webui — A Gradio Web UI for Local LLMs
oobabooga's text-generation-webui is the "AUTOMATIC1111 of LLMs": a feature-rich Gradio interface for chatting with and serving local language models. It supports llama.cpp, Transformers, ExLlamaV2, and dozens of model formats.
Installation agent prête
Cet actif peut être installé après choix du runtime, vérification du plan et exécution de la commande adaptée.
npx -y tokrepo@latest install b0d2eaa8-37db-11f1-9bc6-00163e2b0d79 --target codexÀ exécuter après confirmation du plan en dry-run.
What it is
text-generation-webui (commonly called 'oobabooga') is a feature-rich Gradio web interface for chatting with and serving local language models. It supports multiple backends including llama.cpp, Transformers, ExLlamaV2, and dozens of model formats. The one-line installer detects your hardware (CUDA, ROCm, MPS, CPU) and configures the appropriate backend automatically.
The project targets users who want to run LLMs locally with a user-friendly web interface. It provides chat, notebook, and API modes, model management, LoRA loading, and extension support.
How it saves time or tokens
text-generation-webui eliminates the need to write Python scripts for local model inference. The web UI provides a chat interface, parameter tuning, model comparison, and API endpoints without any code. The one-line installer handles Python environments, CUDA dependencies, and backend compilation. For experimentation with different models and parameters, the UI approach is faster than editing scripts.
How to use
- Clone and run the installer:
git clone https://github.com/oobabooga/text-generation-webui
cd text-generation-webui
./start_linux.sh # or start_windows.bat / start_macos.sh
- Select your hardware during setup (CUDA, ROCm, MPS, or CPU).
- Download a model from the Model tab and start chatting.
Example
Using the API for programmatic access:
import requests
response = requests.post(
'http://localhost:5000/v1/chat/completions',
json={
'model': 'loaded-model',
'messages': [
{'role': 'user', 'content': 'Explain quantum computing briefly'}
],
'temperature': 0.7,
'max_tokens': 200,
}
)
print(response.json()['choices'][0]['message']['content'])
The API follows the OpenAI chat completions format, making it a drop-in replacement for API-based workflows.
Related on TokRepo
- Local LLM with text-generation-webui — Detailed guide for text-generation-webui setup
- Local LLM Providers — Compare local LLM running tools including Ollama and LM Studio
Common pitfalls
- The installer creates a large Python environment (several GB). Ensure sufficient disk space before installation.
- VRAM requirements vary by model and quantization. A 7B model at 4-bit quantization needs roughly 6GB VRAM. Check model requirements before downloading.
- Some model formats (GPTQ, AWQ, EXL2) require specific backends. Not all backends are compatible with all formats.
- Always check the official documentation for the latest version-specific changes and migration guides before upgrading in production environments.
- For team deployments, establish clear guidelines on configuration and usage patterns to ensure consistency across developers.
- Model quantization levels (4-bit, 8-bit, 16-bit) trade quality for speed and memory usage. Start with 4-bit quantization for testing and increase precision for production quality.
- The web UI exposes an API endpoint by default. In shared environments, configure authentication or restrict access to localhost to prevent unauthorized model usage.
Questions fréquentes
It supports GGUF (llama.cpp), GPTQ, AWQ, EXL2 (ExLlamaV2), and standard Hugging Face Transformers format. Each format has different performance characteristics and VRAM requirements.
Yes. The built-in API server follows the OpenAI chat completions format. This means you can use it as a local replacement for OpenAI's API in applications that support custom endpoints.
The UI works on NVIDIA GPUs (CUDA), AMD GPUs (ROCm), Apple Silicon (MPS), and CPU-only setups. GPU acceleration dramatically improves inference speed. A minimum of 8GB VRAM is recommended for 7B parameter models.
Yes. The UI supports loading LoRA adapters on top of base models. This lets you use fine-tuned models without merging the adapters, saving disk space and enabling quick switching.
Ollama provides a simpler CLI-focused experience for running models. text-generation-webui offers a richer web UI with more parameter controls, multiple backends, extension support, and model comparison features. Ollama is easier to set up; text-generation-webui provides more flexibility.
Sources citées (3)
- text-generation-webui GitHub— text-generation-webui is a Gradio UI for local LLMs
- text-generation-webui Wiki— Multiple backend support: llama.cpp, Transformers, ExLlamaV2
- llama.cpp GitHub— llama.cpp for efficient LLM inference
En lien sur TokRepo
Fil de discussion
Actifs similaires
Text Generation WebUI — Local LLM Chat Interface
Text Generation WebUI is a Gradio interface for running LLMs locally. 46.4K+ GitHub stars. Multiple backends, vision, training, image gen, OpenAI-compatible API. 100% offline.
HuggingFace Chat UI — Open-Source AI Chat Interface
Chat UI is Hugging Face's open-source web interface for conversational AI, powering HuggingChat and supporting any text-generation model via TGI, Ollama, or OpenAI-compatible APIs with features like web search, tool use, and multimodal input.
Unsloth — 2x Faster Local LLM Training & Inference
Unsloth is a unified local interface for running and training AI models. 58.7K+ GitHub stars. 2x faster training with 70% less VRAM across 500+ models including Qwen, DeepSeek, Llama, Gemma. Web UI wi
CogVideo — Text and Image to Video Generation
An open-source video generation framework from Zhipu AI supporting text-to-video and image-to-video with CogVideoX models. Generates high-quality clips up to 6 seconds.