Introduction
Kubeflow makes deploying machine learning workflows on Kubernetes simple, portable, and scalable. Originally started at Google, it packages best-of-breed ML tools into a cohesive platform that runs anywhere Kubernetes runs. From experimentation in notebooks to production model serving, Kubeflow covers the entire ML lifecycle.
What Kubeflow Does
- Orchestrates ML pipelines as DAGs with Kubeflow Pipelines and Argo Workflows
- Provides Jupyter notebook servers managed on Kubernetes for interactive development
- Runs distributed training jobs for TensorFlow, PyTorch, MPI, and XGBoost
- Serves models with KServe (formerly KFServing) for autoscaling inference endpoints
- Manages experiments, runs, and artifacts with built-in metadata tracking
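The distributed-training support above is driven by plain Kubernetes manifests. As a sketch, a TFJob submitted to the training operator might look like the following; the namespace, image, and command are placeholders:

```yaml
# Hypothetical TFJob: one chief and two workers running the same
# training script. Image and namespace are illustrative only.
apiVersion: kubeflow.org/v1
kind: TFJob
metadata:
  name: mnist-distributed
  namespace: team-a
spec:
  tfReplicaSpecs:
    Chief:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: tensorflow   # TFJob expects this container name
              image: registry.example.com/mnist-train:latest
              command: ["python", "train.py"]
    Worker:
      replicas: 2
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: tensorflow
              image: registry.example.com/mnist-train:latest
              command: ["python", "train.py"]
```

Applying this manifest creates one chief and two worker pods; the operator injects the cluster topology into each pod via the TF_CONFIG environment variable so TensorFlow's distribution strategies can discover their peers.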
Architecture Overview
Kubeflow is a collection of Kubernetes-native components. The central dashboard provides a unified UI. Kubeflow Pipelines uses Argo Workflows to execute each pipeline step as a pod. The Training Operator's CRDs (TFJob, PyTorchJob, and others) describe distributed training topologies. KServe deploys inference graphs with canary rollouts and GPU autoscaling. All components are defined as Kubernetes CRDs and installed and upgraded through kustomize manifests.
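To make the KServe piece concrete, here is a sketch of an InferenceService that routes 10% of traffic to a newly updated model revision (the storage URI is a placeholder):

```yaml
# Hypothetical InferenceService with a canary rollout.
# canaryTrafficPercent takes effect when updating an existing service.
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: iris-classifier
  namespace: team-a
spec:
  predictor:
    canaryTrafficPercent: 10   # 10% to the new revision, 90% to the last good one
    model:
      modelFormat:
        name: sklearn
      storageUri: gs://example-bucket/models/iris   # placeholder model location
```

Once the canary revision looks healthy, raising `canaryTrafficPercent` to 100 promotes it; KServe handles the revision bookkeeping underneath.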
Self-Hosting & Configuration
- Deploy using kustomize manifests on any Kubernetes 1.25+ cluster
- Requires Istio for service mesh, Dex for authentication, and cert-manager for TLS
- Cloud-specific distributions available for AWS, GCP, and Azure with managed integrations
- Configure resource quotas per namespace to isolate team workloads and GPU allocation
- Use Kubeflow Profiles to create multi-tenant environments with RBAC isolation
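The last two bullets come together in the Profile CRD: one object that creates a namespace, grants the owner RBAC access, and optionally caps resources. A sketch, with an assumed user and quota values:

```yaml
# Hypothetical Profile: namespace + owner RBAC + resource quota.
apiVersion: kubeflow.org/v1
kind: Profile
metadata:
  name: team-a                      # also becomes the namespace name
spec:
  owner:
    kind: User
    name: alice@example.com         # placeholder identity from Dex
  resourceQuotaSpec:
    hard:
      requests.cpu: "32"
      requests.memory: 128Gi
      requests.nvidia.com/gpu: "4"  # caps the team's GPU allocation
```

Each team gets its own Profile, and the dashboard scopes notebooks, pipelines, and serving resources to the namespaces a user can access.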
Key Features
- Kubeflow Pipelines: reusable, versioned ML workflows with a graphical UI for visualizing and comparing runs
- Notebook Servers: spawn Jupyter or VS Code environments on Kubernetes with GPU support
- Distributed Training: native operators for TensorFlow, PyTorch, Horovod, and MPI workloads
- KServe: production model serving with autoscaling, A/B testing, and canary deployments
- Katib: hyperparameter tuning and neural architecture search as Kubernetes jobs
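As an illustration of the Katib bullet, the following Experiment sketch random-searches a learning rate over parallel trial Jobs; the training image, metric name, and bounds are assumptions:

```yaml
# Hypothetical Katib Experiment: random search over a learning rate.
apiVersion: kubeflow.org/v1beta1
kind: Experiment
metadata:
  name: lr-random-search
  namespace: team-a
spec:
  objective:
    type: maximize
    objectiveMetricName: accuracy    # assumed metric emitted by train.py
  algorithm:
    algorithmName: random
  maxTrialCount: 12
  parallelTrialCount: 3
  parameters:
    - name: lr
      parameterType: double
      feasibleSpace:
        min: "0.001"
        max: "0.1"
  trialTemplate:
    primaryContainerName: training
    trialParameters:
      - name: learningRate
        reference: lr
        description: Learning rate passed to the training script
    trialSpec:
      apiVersion: batch/v1
      kind: Job
      spec:
        template:
          spec:
            restartPolicy: Never
            containers:
              - name: training
                image: registry.example.com/train:latest   # placeholder
                command:
                  - python
                  - train.py
                  - "--lr=${trialParameters.learningRate}"
```

Katib substitutes a sampled value into each trial Job, scrapes the objective metric from the trial's output, and surfaces the best configuration in its UI.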
Comparison with Similar Tools
- MLflow — Lighter experiment tracking; Kubeflow offers full pipeline orchestration on K8s
- SageMaker — AWS-managed ML platform; Kubeflow is cloud-agnostic and self-hosted
- Ray — Distributed compute framework; Kubeflow provides a broader ML platform experience
- Metaflow — Netflix's ML workflow tool; simpler but less Kubernetes-native
- Vertex AI — Google's managed ML platform; its Pipelines service runs workflows authored with the Kubeflow Pipelines SDK
FAQ
Q: Do I need a large Kubernetes cluster to run Kubeflow? A: A minimal install runs on a 4-node cluster. For production with GPUs and multi-tenancy, scale according to workload needs.
Q: Can I use Kubeflow without the full platform? A: Yes. Individual components like Kubeflow Pipelines or KServe can be installed standalone without the full Kubeflow deployment.
Q: Does Kubeflow support GPU workloads? A: Yes. Kubeflow leverages Kubernetes GPU scheduling. Training operators and notebook servers can request GPU resources natively.
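For example, a notebook server requesting a GPU is just a Notebook resource carrying a standard Kubernetes resource limit; the image and namespace below are placeholders:

```yaml
# Hypothetical GPU-backed notebook server.
apiVersion: kubeflow.org/v1
kind: Notebook
metadata:
  name: gpu-notebook
  namespace: team-a
spec:
  template:
    spec:
      containers:
        - name: gpu-notebook
          image: registry.example.com/jupyter-cuda:latest   # placeholder image
          resources:
            limits:
              nvidia.com/gpu: 1   # scheduled onto a GPU node by Kubernetes
```

The same `nvidia.com/gpu` limit works in TFJob, PyTorchJob, and KServe pod specs, since they all embed ordinary Kubernetes container definitions.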
Q: How does Kubeflow handle experiment tracking? A: Kubeflow Pipelines tracks runs, parameters, metrics, and artifacts. For richer experiment tracking, it integrates with MLflow or Weights & Biases.