# LitServe — Fast AI Model Serving Engine

> Serve AI models 2x faster than FastAPI with built-in batching, streaming, GPU autoscaling, and multi-model endpoints. From the Lightning AI team.

## Install

Save the content below to `.claude/skills/` or append to your `CLAUDE.md`:

## Quick Use

```bash
pip install litserve
```

```python
import litserve as ls

class MyAPI(ls.LitAPI):
    def setup(self, device):
        self.model = load_model(device)

    def predict(self, x):
        return self.model(x)

server = ls.LitServer(MyAPI(), accelerator="gpu")
server.run(port=8000)
```

## What is LitServe?

LitServe is a high-performance AI model serving engine built on top of FastAPI. It adds batching, streaming, GPU management, and autoscaling — making it simple to deploy any AI model as a production API with minimal code.

**Answer-Ready**: LitServe is a fast AI model serving engine from Lightning AI that adds batching, streaming, GPU autoscaling, and multi-model support on top of FastAPI.

## Core Features

### 1. Automatic Batching
Combine multiple requests into one GPU batch:

```python
server = ls.LitServer(
    MyAPI(),
    max_batch_size=16,
    batch_timeout=0.01,  # Wait 10ms to fill batch
)
```

### 2. Streaming Responses

```python
class StreamAPI(ls.LitAPI):
    def predict(self, x):
        for token in self.model.generate(x):
            yield token

server = ls.LitServer(StreamAPI(), stream=True)
```

### 3. Multi-GPU & Autoscaling

```python
# Use all available GPUs
server = ls.LitServer(MyAPI(), accelerator="gpu", devices="auto")

# Scale workers per device
server = ls.LitServer(MyAPI(), accelerator="gpu", workers_per_device=4)
```

### 4. Multi-Model Endpoints

```python
server = ls.LitServer(
    {
        "gpt": GPTApi(),
        "bert": BERTApi(),
        "whisper": WhisperApi(),
    }
)
# POST /predict/gpt, /predict/bert, etc.
```

### 5. OpenAI-Compatible API

```python
server = ls.LitServer(
    MyLLMApi(),
    spec=ls.OpenAISpec(),  # /v1/chat/completions compatible
)
```

## Performance

| Feature | FastAPI | LitServe |
|---------|---------|----------|
| Batching | Manual | Built-in |
| Streaming | Manual | Built-in |
| GPU Mgmt | Manual | Automatic |
| Throughput | 1x | ~2x |

## FAQ

**Q: How does it compare to vLLM or TGI?**
A: vLLM/TGI are LLM-specific. LitServe serves any model (vision, audio, tabular) with a unified API.

**Q: Can I use it with PyTorch/TensorFlow?**
A: Yes, framework-agnostic. Any Python model works.

**Q: Production ready?**
A: Yes, built by Lightning AI. Includes health checks, metrics, and Docker support.

## Source & Thanks

- GitHub: [Lightning-AI/LitServe](https://github.com/Lightning-AI/LitServe) (3k+ stars)
- Docs: [litserve.lightning.ai](https://litserve.lightning.ai)

<!-- ZH -->


## Quick Start

```bash
pip install litserve
```

Deploy an AI model as a production-grade API in five lines of code.

## What is LitServe?

LitServe is a high-performance AI model serving engine from Lightning AI. Built on top of FastAPI, it adds batching, streaming output, GPU management, and auto-scaling.

**In one sentence**: LitServe is a fast AI model serving engine with built-in batching, streaming output, GPU auto-scaling, and multi-model support.

## Core Features

### 1. Automatic Batching
Combines multiple requests into a single GPU inference pass.

### 2. Streaming Responses
Token-by-token streaming output.

### 3. Multi-GPU Auto-Scaling
Automatically detects and allocates GPU resources.

### 4. Multi-Model Endpoints
Deploy multiple models on a single server.

### 5. OpenAI Compatible
Supports the standard `/v1/chat/completions` interface.

## FAQ

**Q: How does it compare to vLLM?**
A: vLLM focuses on LLMs; LitServe supports any model (vision, audio, tabular).

**Q: Production ready?**
A: Yes — built by Lightning AI with health checks and metrics monitoring.

## Source & Thanks

- GitHub: [Lightning-AI/LitServe](https://github.com/Lightning-AI/LitServe) (3k+ stars)

---
Source: https://tokrepo.com/en/workflows/litserve-fast-ai-model-serving-engine-c9d3044a
Author: Prompt Lab