Cette page est affichée en anglais. Une traduction française est en cours.
ConfigsMay 24, 2026·2 min de lecture

GPUStack — GPU Cluster Manager for AI Model Deployment

An open-source GPU cluster manager that orchestrates inference engines like vLLM and SGLang for scalable, multi-node AI model serving.

Prêt pour agents

Cet actif peut être lu et installé directement par les agents

TokRepo expose une commande CLI universelle, un contrat d'installation, le metadata JSON, un plan selon l'adaptateur et le contenu raw pour aider les agents à juger l'adaptation, le risque et les prochaines actions.

Needs Confirmation · 64/100Policy : confirmer
Surface agent
Tout agent MCP/CLI
Type
Skill
Installation
Single
Confiance
Confiance : Established
Point d'entrée
GPUStack
Commande CLI universelle
npx tokrepo install fbee68ba-57ad-11f1-9bc6-00163e2b0d79

Introduction

GPUStack is an open-source GPU cluster manager designed to simplify the deployment and orchestration of LLM inference at scale. It manages multiple GPU workers, automatically distributes models across available hardware, and exposes an OpenAI-compatible API, letting teams serve AI models without manually configuring inference engines.

What GPUStack Does

  • Manages a pool of GPU workers from a single control plane
  • Automatically selects and configures inference backends (vLLM, SGLang, llama.cpp)
  • Distributes models across multiple GPUs with tensor parallelism
  • Provides an OpenAI-compatible API endpoint for all deployed models
  • Offers a web dashboard for monitoring GPU utilization and model health

Architecture Overview

GPUStack uses a controller-worker architecture. The controller maintains model registry state, schedules deployments, and exposes the API gateway. Workers register their GPU inventory and run inference engine processes. The scheduler matches model requirements (VRAM, compute) to available workers, handling placement, scaling, and failover automatically. Communication uses gRPC between controller and workers.

Self-Hosting & Configuration

  • Install via the one-line shell script on Linux systems
  • Controller and workers authenticate via shared token
  • Supports NVIDIA CUDA, AMD ROCm, and Huawei Ascend accelerators
  • Configure model sources from Hugging Face, ModelScope, or local paths
  • Scale by adding worker nodes with the same install command

Key Features

  • One-command setup for both controller and worker nodes
  • Automatic backend selection (vLLM for large models, llama.cpp for edge)
  • Multi-GPU tensor parallelism across networked machines
  • Built-in model catalog with one-click deployment from Hugging Face
  • Resource-aware scheduling prevents over-committing GPU memory

Comparison with Similar Tools

  • Ollama — single-machine focus; GPUStack manages multi-node clusters
  • vLLM/SGLang — raw inference engines; GPUStack orchestrates them with cluster management
  • Ray Serve — general-purpose; GPUStack is purpose-built for LLM model serving
  • KServe — Kubernetes-native; GPUStack works on bare metal without K8s

FAQ

Q: Does GPUStack require Kubernetes? A: No. It runs on bare-metal Linux, VMs, or inside containers. Kubernetes is optional.

Q: Can I mix GPU types in one cluster? A: Yes. The scheduler is aware of GPU capabilities and assigns models to compatible hardware.

Q: How does it handle model updates? A: Rolling updates swap model versions with zero downtime by draining traffic before switching backends.

Q: Is multi-tenant isolation supported? A: GPUStack provides API key management and per-key rate limiting for basic multi-tenancy.

Sources

Fil de discussion

Connectez-vous pour rejoindre la discussion.
Aucun commentaire pour l'instant. Soyez le premier à partager votre avis.

Actifs similaires