Introduction
FastChat is the platform behind LMSYS Chatbot Arena, providing tools to train, serve, and evaluate large language models. Developed by the LMSYS team at UC Berkeley, it powers the largest open LLM evaluation platform and offers a production-ready distributed serving system.
What FastChat Does
- Serves LLMs through a distributed architecture with controller, model workers, and API gateway
- Provides an OpenAI-compatible REST API for drop-in replacement in existing applications
- Supports multi-model deployment with automatic load balancing across workers
- Enables side-by-side model comparison through the Arena interface
- Trains and fine-tunes chat models using multi-turn conversation data
Architecture Overview
FastChat uses a three-component serving architecture: a controller that manages model worker registration and routing, model workers that load and serve individual models, and a Gradio web server or OpenAI-compatible API server as the frontend. Workers can run on different machines with different GPUs, and the controller routes requests based on availability.
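The routing idea can be illustrated with a minimal sketch. The worker addresses, model names, and the shortest-queue policy shown here are illustrative assumptions, not FastChat's actual internals (the real controller supports multiple dispatch strategies):

```python
# Minimal sketch of availability-based routing, as a controller might do it.
# Worker addresses and model names below are hypothetical.
workers = {
    "http://gpu-a:21002": {"models": ["vicuna-7b"], "queue_length": 3},
    "http://gpu-b:21002": {"models": ["vicuna-7b"], "queue_length": 1},
    "http://gpu-c:21002": {"models": ["llama-2-13b"], "queue_length": 0},
}

def pick_worker(model: str) -> str:
    """Route a request to the least-loaded worker serving `model`."""
    candidates = [addr for addr, w in workers.items() if model in w["models"]]
    if not candidates:
        raise ValueError(f"no worker serves {model}")
    return min(candidates, key=lambda addr: workers[addr]["queue_length"])

print(pick_worker("vicuna-7b"))  # gpu-b has the shortest queue for that model
```

Because workers register themselves with the controller, adding capacity is just launching another worker process; no frontend configuration changes are needed.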
Self-Hosting & Configuration
- Install via pip: `pip install "fschat[model_worker,webui]"`
- Start the controller: `python3 -m fastchat.serve.controller`
- Launch model workers: `python3 -m fastchat.serve.model_worker --model-path <path>`
- Run the web UI: `python3 -m fastchat.serve.gradio_web_server`
- Use the OpenAI API server: `python3 -m fastchat.serve.openai_api_server`
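With the API server running, any OpenAI-style client can talk to it at `/v1/chat/completions`. A minimal standard-library sketch; the port (8000 is FastChat's default) and model name are assumptions for a typical local setup:

```python
import json
import urllib.request

# Base URL of a locally running FastChat OpenAI-compatible API server.
BASE_URL = "http://localhost:8000/v1"

def build_chat_request(model, messages, temperature=0.7):
    """Build an OpenAI-style chat completion payload for POST /chat/completions."""
    return {"model": model, "messages": messages, "temperature": temperature}

payload = build_chat_request(
    "vicuna-7b-v1.5",  # hypothetical model name; use whatever your worker serves
    [{"role": "user", "content": "Hello!"}],
)

# Sending the request requires a live server, so it is left commented out here:
# req = urllib.request.Request(
#     f"{BASE_URL}/chat/completions",
#     data=json.dumps(payload).encode(),
#     headers={"Content-Type": "application/json"},
# )
# resp = json.load(urllib.request.urlopen(req))
# print(resp["choices"][0]["message"]["content"])

print(json.dumps(payload))
```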
Key Features
- OpenAI-compatible API endpoint for seamless integration with existing tools
- Multi-model serving with automatic worker registration and load balancing
- Support for 50+ model architectures including LLaMA, Mistral, and Qwen
- Built-in Chatbot Arena for pairwise model evaluation with Elo ratings
- Efficient inference with support for vLLM, LightLLM, and ExLlamaV2 backends
Comparison with Similar Tools
- vLLM — Focused on high-throughput serving; FastChat adds multi-model orchestration and evaluation
- Text Generation Inference — Single-model serving without built-in evaluation or Arena
- Ollama — Simpler local deployment but no distributed multi-worker architecture
- LiteLLM — API proxy for cloud providers; FastChat serves self-hosted models
- Open WebUI — Chat frontend only; FastChat includes serving infrastructure and training
FAQ
Q: Can I serve multiple models simultaneously? A: Yes. Launch separate model workers for each model and the controller will route requests automatically.
Q: Does FastChat support quantized models? A: Yes. It supports GPTQ, AWQ, and other quantization formats through its model worker backends.
Q: What is Chatbot Arena? A: Chatbot Arena is a crowdsourced evaluation platform where users chat with two anonymous models and vote for the better response, producing Elo-style rankings.
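The Elo-style update behind those rankings can be sketched in a few lines. The K-factor and starting rating are illustrative choices, and the live leaderboard applies more involved statistical fitting on top of pairwise votes:

```python
def expected_score(r_a: float, r_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def elo_update(r_a: float, r_b: float, a_wins: bool, k: float = 32.0):
    """Return updated (r_a, r_b) ratings after one pairwise vote."""
    e_a = expected_score(r_a, r_b)
    s_a = 1.0 if a_wins else 0.0
    # The loser's change mirrors the winner's, so total rating is conserved.
    return r_a + k * (s_a - e_a), r_b + k * ((1.0 - s_a) - (1.0 - e_a))

# Two models start at 1000; model A wins one vote.
a, b = elo_update(1000.0, 1000.0, a_wins=True)
print(round(a, 1), round(b, 1))  # 1016.0 984.0
```

Because models are anonymous during the chat, votes reflect response quality rather than brand recognition.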
Q: Can I use FastChat with the OpenAI Python SDK? A: Yes. Point the OpenAI client base_url to your FastChat API server and it works as a drop-in replacement.