Scripts · May 3, 2026 · 3 min read

SkyPilot — Run AI Workloads on Any Cloud or Kubernetes

SkyPilot is an open-source framework for running AI workloads across any cloud provider or Kubernetes cluster with automatic cost optimization and unified management.

Introduction

SkyPilot is an open-source framework that lets AI teams run training, fine-tuning, and serving workloads across 20+ cloud providers and Kubernetes clusters without locking into any single vendor. It automatically finds the cheapest available GPUs, handles provisioning, and manages the lifecycle of cloud resources to cut costs while maintaining reliability.

What SkyPilot Does

  • Launches AI workloads on the cheapest available GPUs across AWS, GCP, Azure, and 15+ other clouds
  • Manages Kubernetes clusters alongside cloud VMs with a unified interface
  • Automatically recovers from spot/preemptible instance interruptions by relaunching the job, which resumes from checkpoints saved by your code
  • Serves models with built-in autoscaling, load balancing, and multi-region replicas
  • Provides a job queue with gang scheduling for multi-node distributed training
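Spot recovery only saves work if the job itself can resume from a checkpoint. The following is a minimal sketch of that pattern, not SkyPilot code: a toy training loop that saves its step counter to a file and picks up from it after a simulated preemption. The function names, the JSON checkpoint format, and the `fail_at` parameter are all assumptions made for illustration. In a real job the checkpoint would live on storage that survives the instance, such as a mounted bucket.

```python
import json
import os
import tempfile


def train(checkpoint_path, total_steps, fail_at=None):
    """Toy training loop that resumes from checkpoint_path if it exists."""
    step = 0
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            step = json.load(f)["step"]

    while step < total_steps:
        if fail_at is not None and step == fail_at:
            # Hypothetical stand-in for a spot instance being reclaimed.
            raise RuntimeError("simulated preemption")
        step += 1  # stand-in for one real training step

        # Write to a temp file and rename, so an interruption mid-write
        # cannot leave a half-written checkpoint behind.
        tmp_path = checkpoint_path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump({"step": step}, f)
        os.replace(tmp_path, checkpoint_path)
    return step


def run_with_recovery(checkpoint_path, total_steps, fail_at):
    """Simulate one preemption followed by a relaunch of the same job."""
    try:
        return train(checkpoint_path, total_steps, fail_at=fail_at)
    except RuntimeError:
        # The relaunched job continues from the last saved step.
        return train(checkpoint_path, total_steps)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        path = os.path.join(workdir, "ckpt.json")
        print(run_with_recovery(path, total_steps=10, fail_at=4))
```

The relaunch step is the part an orchestrator can automate. The save and load logic stays in the training script.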

Architecture Overview

SkyPilot uses a controller VM (or local process) that acts as a scheduler. When a task is submitted, the optimizer queries real-time pricing across all configured clouds, selects the cheapest region and instance type that meets the resource requirements, provisions the cluster via cloud APIs, and dispatches the job. For spot instances, SkyPilot monitors for preemptions and automatically migrates the workload to another zone or cloud. The serving controller manages replicas with health checks and load-based autoscaling.
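The selection step described above can be sketched as a filter-and-sort over a price table. This is an illustrative sketch only, not SkyPilot's optimizer: the `Offering` type, the function names, and every cloud name, instance name, and price in `CATALOG` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Offering:
    cloud: str
    region: str
    instance_type: str
    gpu: str
    gpu_count: int
    hourly_usd: float


# Hypothetical catalog; names and prices are made up for illustration.
CATALOG = [
    Offering("cloud-a", "region-1", "a-gpu-1", "A100", 1, 4.10),
    Offering("cloud-a", "region-2", "a-gpu-1", "A100", 1, 3.70),
    Offering("cloud-b", "region-1", "b-gpu-1", "A100", 1, 3.20),
    Offering("cloud-c", "region-1", "c-gpu-1", "A100", 1, 2.90),
    Offering("cloud-b", "region-1", "b-gpu-8", "A100", 8, 25.00),
]


def ranked_offerings(catalog, gpu, gpu_count, enabled_clouds):
    """Return offerings that satisfy the request, cheapest first."""
    candidates = [
        o for o in catalog
        if o.cloud in enabled_clouds
        and o.gpu == gpu
        and o.gpu_count >= gpu_count
    ]
    return sorted(candidates, key=lambda o: o.hourly_usd)


def cheapest_offering(catalog, gpu, gpu_count, enabled_clouds):
    """Pick the cheapest match, or raise if nothing satisfies the request."""
    ranked = ranked_offerings(catalog, gpu, gpu_count, enabled_clouds)
    if not ranked:
        raise ValueError(f"no offering with {gpu_count}x {gpu}")
    return ranked[0]


if __name__ == "__main__":
    enabled = {"cloud-a", "cloud-b"}  # clouds with working credentials
    best = cheapest_offering(CATALOG, "A100", 1, enabled)
    print(best.cloud, best.region, best.instance_type, best.hourly_usd)
```

The ranked list also serves as a failover order: if provisioning fails on the first entry, or a spot instance is preempted there, the next entry is the natural one to try.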

Self-Hosting & Configuration

  • Install via pip: pip install "skypilot-nightly[aws,gcp,azure]" (quote the extras so shells such as zsh do not expand the brackets, and list only the clouds you use)
  • Run sky check to verify cloud credentials are configured correctly
  • Define workloads in YAML task files specifying resources, setup commands, and run commands
  • Use ~/.sky/config.yaml to set default regions, instance preferences, and cost limits
  • Deploy the SkyPilot controller on Kubernetes for team-shared job management
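To make the task-file step concrete, here is a small sketch that renders a task file with resources, setup commands, and run commands. The helper `render_task_yaml` is hypothetical and exists only for this example. The top-level keys it writes (`name`, `resources`, `setup`, `run`) and the `accelerators` and `use_spot` fields follow the SkyPilot task format as I understand it, so check them against the current documentation before relying on them.

```python
def render_task_yaml(name, accelerators, setup, run, use_spot=False):
    """Render a minimal task file as YAML text (hypothetical helper)."""

    def indent_block(text):
        # Indent each command line so it sits inside a YAML "|" block.
        return "\n".join("  " + line for line in text.strip().splitlines())

    lines = [
        f"name: {name}",
        "",
        "resources:",
        f"  accelerators: {accelerators}",
        f"  use_spot: {'true' if use_spot else 'false'}",
        "",
        "setup: |",
        indent_block(setup),
        "",
        "run: |",
        indent_block(run),
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_task_yaml(
        name="finetune",
        accelerators="A100:1",
        setup="pip install -r requirements.txt",
        run="python train.py",
        use_spot=True,
    ))
```

Saving the output as task.yaml and running sky launch task.yaml would then hand the task to SkyPilot. The script names above are placeholders for your own setup and training commands.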

Key Features

  • Multi-cloud optimizer finds the cheapest GPUs across 20+ providers in real time
  • Spot instance support with automatic failover and checkpoint recovery
  • Managed serving with autoscaling, rolling updates, and multi-replica load balancing
  • Job queue with fair scheduling and priority support for team workflows
  • Works with existing cloud credentials — no separate account or agent needed

Comparison with Similar Tools

  • Terraform — Infrastructure provisioning tool; SkyPilot adds AI-specific scheduling and cost optimization
  • Ray — Distributed compute framework; SkyPilot handles cloud provisioning and can launch Ray clusters
  • Modal — Serverless GPU cloud; SkyPilot runs on your own cloud accounts for cost control
  • Kubernetes — Container orchestration; SkyPilot sits above K8s and cloud VMs as a unified layer
  • RunPod / Lambda — GPU cloud providers; SkyPilot aggregates across all providers to find the best price

FAQ

Q: Which clouds does SkyPilot support? A: AWS, GCP, Azure, OCI, Lambda, RunPod, Fluidstack, Paperspace, Cudo, IBM, Samsung Cloud Platform (SCP), and Kubernetes clusters, among others. Cloudflare is supported for R2 object storage rather than compute.

Q: How much cost savings can SkyPilot achieve? A: By selecting the cheapest region and using spot instances with automatic recovery, teams typically see 3-10x cost reduction compared to on-demand pricing in a single region.
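As a rough worked example of where such a factor comes from, the sketch below compares an on-demand price against a cheaper spot price, allowing for extra runtime lost to preemptions. All prices and the overhead figure are hypothetical, and `savings_factor` is an illustrative helper, not part of SkyPilot.

```python
def job_cost(hourly_usd, gpu_hours, recovery_overhead=0.0):
    """Total cost of a job, with extra runtime expressed as a fraction."""
    return hourly_usd * gpu_hours * (1.0 + recovery_overhead)


def savings_factor(on_demand_hourly, spot_hourly, recovery_overhead):
    """How many times cheaper the spot run is than the on-demand run."""
    on_demand = job_cost(on_demand_hourly, gpu_hours=1.0)
    spot = job_cost(spot_hourly, gpu_hours=1.0,
                    recovery_overhead=recovery_overhead)
    return on_demand / spot


if __name__ == "__main__":
    # Hypothetical numbers: $4.00/h on-demand in one region, $1.00/h spot
    # in the cheapest region, 25% extra runtime redone after preemptions.
    print(job_cost(4.00, gpu_hours=100))                          # on-demand
    print(job_cost(1.00, gpu_hours=100, recovery_overhead=0.25))  # spot
    print(savings_factor(4.00, 1.00, recovery_overhead=0.25))
```

With these made-up inputs the on-demand run costs $400, the spot run costs $125, and the factor is 3.2x. The real figure depends on the spot discount available at launch time and on how often the job is interrupted.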

Q: Can I use SkyPilot for model serving? A: Yes, sky serve provides production-ready model serving with autoscaling, load balancing, and multi-cloud replicas.
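Load-based autoscaling of the kind mentioned here usually comes down to a small rule: divide observed load by the load one replica should handle, round up, and clamp to configured bounds. The sketch below shows that rule under assumed parameter names. It is not the logic sky serve actually runs.

```python
import math


def desired_replicas(observed_qps, target_qps_per_replica,
                     min_replicas, max_replicas):
    """Replica count needed for the observed load, clamped to the bounds."""
    if target_qps_per_replica <= 0:
        raise ValueError("target_qps_per_replica must be positive")
    if min_replicas > max_replicas:
        raise ValueError("min_replicas must not exceed max_replicas")
    needed = math.ceil(observed_qps / target_qps_per_replica)
    return max(min_replicas, min(max_replicas, needed))


if __name__ == "__main__":
    # Hypothetical service: each replica handles about 10 queries/second.
    for qps in (0, 25, 1000):
        print(qps, desired_replicas(qps, 10, min_replicas=1, max_replicas=5))
```

A production autoscaler would also smooth the load signal over a time window and wait between scaling decisions, so that short spikes do not cause replicas to be added and removed repeatedly.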

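Q: Does SkyPilot require changes to my training code? A: No, SkyPilot runs your existing scripts as-is. You only need a YAML file describing the resource requirements and commands. The one exception is resuming after spot preemptions: for that, the script must save and reload its own checkpoints.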
