Introduction
FastChat is the platform behind LMSYS Chatbot Arena, providing tools to train, serve, and evaluate large language models. Developed by the LMSYS team at UC Berkeley, it powers the largest open LLM evaluation platform and offers a production-ready distributed serving system.
What FastChat Does
- Serves LLMs through a distributed architecture with controller, model workers, and API gateway
- Provides an OpenAI-compatible REST API for drop-in replacement in existing applications
- Supports multi-model deployment with automatic load balancing across workers
- Enables side-by-side model comparison through the Arena interface
- Trains and fine-tunes chat models using multi-turn conversation data
Architecture Overview
FastChat uses a three-component serving architecture: a controller that manages model worker registration and routing, model workers that load and serve individual models, and a Gradio web server or OpenAI-compatible API server as the frontend. Workers can run on different machines with different GPUs, and the controller routes requests based on availability.
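The routing idea can be illustrated with a minimal sketch. The worker addresses, model names, and the shortest-queue policy shown here are illustrative assumptions, not FastChat's actual internals (the real controller supports multiple dispatch strategies):

```python
# Minimal sketch of availability-based routing, as a controller might do it.
# Worker addresses and model names below are hypothetical.
workers = {
    "http://gpu-a:21002": {"models": ["vicuna-7b"], "queue_length": 3},
    "http://gpu-b:21002": {"models": ["vicuna-7b"], "queue_length": 1},
    "http://gpu-c:21002": {"models": ["llama-2-13b"], "queue_length": 0},
}

def pick_worker(model: str) -> str:
    """Route a request to the least-loaded worker serving `model`."""
    candidates = [addr for addr, w in workers.items() if model in w["models"]]
    if not candidates:
        raise ValueError(f"no worker serves {model}")
    return min(candidates, key=lambda addr: workers[addr]["queue_length"])

print(pick_worker("vicuna-7b"))  # gpu-b has the shortest queue for that model
```

Because workers register themselves with the controller, adding capacity is just launching another worker process; no frontend configuration changes are needed.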
Self-Hosting & Configuration
- Install via pip: `pip install "fschat[model_worker,webui]"`
- Start the controller: `python3 -m fastchat.serve.controller`
- Launch model workers: `python3 -m fastchat.serve.model_worker --model-path <path>`
- Run the web UI: `python3 -m fastchat.serve.gradio_web_server`
- Use the OpenAI API server: `python3 -m fastchat.serve.openai_api_server`
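With the API server running, any OpenAI-style client can talk to it at `/v1/chat/completions`. A minimal standard-library sketch; the port (8000 is FastChat's default) and model name are assumptions for a typical local setup:

```python
import json
import urllib.request

# Base URL of a locally running FastChat OpenAI-compatible API server.
BASE_URL = "http://localhost:8000/v1"

def build_chat_request(model, messages, temperature=0.7):
    """Build an OpenAI-style chat completion payload for POST /chat/completions."""
    return {"model": model, "messages": messages, "temperature": temperature}

payload = build_chat_request(
    "vicuna-7b-v1.5",  # hypothetical model name; use whatever your worker serves
    [{"role": "user", "content": "Hello!"}],
)

# Sending the request requires a live server, so it is left commented out here:
# req = urllib.request.Request(
#     f"{BASE_URL}/chat/completions",
#     data=json.dumps(payload).encode(),
#     headers={"Content-Type": "application/json"},
# )
# resp = json.load(urllib.request.urlopen(req))
# print(resp["choices"][0]["message"]["content"])

print(json.dumps(payload))
```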
Key Features
- OpenAI-compatible API endpoint for seamless integration with existing tools
- Multi-model serving with automatic worker registration and load balancing
- Support for 50+ model architectures including LLaMA, Mistral, and Qwen
- Built-in Chatbot Arena for pairwise model evaluation with Elo ratings
- Efficient inference with support for vLLM, LightLLM, and ExLlamaV2 backends
Comparison with Similar Tools
- vLLM — Focused on high-throughput serving; FastChat adds multi-model orchestration and evaluation
- Text Generation Inference — Single-model serving without built-in evaluation or Arena
- Ollama — Simpler local deployment but no distributed multi-worker architecture
- LiteLLM — API proxy for cloud providers; FastChat serves self-hosted models
- Open WebUI — Chat frontend only; FastChat includes serving infrastructure and training
FAQ
Q: Can I serve multiple models simultaneously? A: Yes. Launch separate model workers for each model and the controller will route requests automatically.
Q: Does FastChat support quantized models? A: Yes. It supports GPTQ, AWQ, and other quantization formats through its model worker backends.
Q: What is Chatbot Arena? A: Chatbot Arena is a crowdsourced evaluation platform where users chat with two anonymous models and vote for the better response, producing Elo-style rankings.
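The Elo-style update behind those rankings can be sketched in a few lines. The K-factor and starting rating are illustrative choices, and the live leaderboard applies more involved statistical fitting on top of pairwise votes:

```python
def expected_score(r_a: float, r_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def elo_update(r_a: float, r_b: float, a_wins: bool, k: float = 32.0):
    """Return updated (r_a, r_b) ratings after one pairwise vote."""
    e_a = expected_score(r_a, r_b)
    s_a = 1.0 if a_wins else 0.0
    # The loser's change mirrors the winner's, so total rating is conserved.
    return r_a + k * (s_a - e_a), r_b + k * ((1.0 - s_a) - (1.0 - e_a))

# Two models start at 1000; model A wins one vote.
a, b = elo_update(1000.0, 1000.0, a_wins=True)
print(round(a, 1), round(b, 1))  # 1016.0 984.0
```

Because models are anonymous during the chat, votes reflect response quality rather than brand recognition.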
Q: Can I use FastChat with the OpenAI Python SDK? A: Yes. Point the OpenAI client base_url to your FastChat API server and it works as a drop-in replacement.