Apr 16, 2026 · 3 min read

Kubeflow — Machine Learning Toolkit for Kubernetes

An open-source platform for deploying, orchestrating, and managing ML workflows on Kubernetes. Kubeflow brings portable and scalable machine learning pipelines, notebook servers, model training, and serving to any Kubernetes cluster.

Introduction

Kubeflow makes deploying machine learning workflows on Kubernetes simple, portable, and scalable. Originally started at Google, it packages best-of-breed ML tools into a cohesive platform that runs anywhere Kubernetes runs. From experimentation in notebooks to production model serving, Kubeflow covers the entire ML lifecycle.

What Kubeflow Does

  • Orchestrates ML pipelines as DAGs with Kubeflow Pipelines and Argo Workflows
  • Provides Jupyter notebook servers managed on Kubernetes for interactive development
  • Runs distributed training jobs for TensorFlow, PyTorch, MPI, and XGBoost
  • Serves models with KServe (formerly KFServing) for autoscaling inference endpoints
  • Manages experiments, runs, and artifacts with built-in metadata tracking
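The pipeline-as-DAG idea in the first bullet can be sketched in plain Python: each step declares its upstream dependencies, and the orchestrator executes steps in topological order. This is a minimal stdlib sketch with hypothetical step names; real Kubeflow Pipelines are authored with the `kfp` SDK and executed as pods by Argo Workflows.

```python
# Minimal sketch of DAG-ordered step execution, the core idea behind
# Kubeflow Pipelines (which compiles pipeline definitions into Argo
# Workflows). Step names and this toy executor are illustrative only.
from graphlib import TopologicalSorter

# Each step maps to the list of steps it depends on.
pipeline = {
    "ingest": [],
    "preprocess": ["ingest"],
    "train": ["preprocess"],
    "evaluate": ["train"],
    "deploy": ["evaluate"],
}

def run_order(dag):
    """Return a valid execution order for the pipeline steps."""
    return list(TopologicalSorter(dag).static_order())

print(run_order(pipeline))
```

In the real system each node becomes a containerized step (a pod), and Argo enforces the same dependency ordering across the cluster.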

Architecture Overview

Kubeflow is a collection of Kubernetes-native components. The central dashboard provides a unified UI. Kubeflow Pipelines uses Argo Workflows to execute ML pipeline steps as pods. Training Operators (TFJob, PyTorchJob) create distributed training topologies. KServe deploys inference graphs with canary rollouts and GPU autoscaling. All components are defined as Kubernetes CRDs and installed via kustomize manifests or a packaged distribution.
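The CRD-based design means a distributed training job is just a Kubernetes object the training operator watches. As a sketch, here is a minimal PyTorchJob manifest built as a Python dict; the image name and replica counts are placeholders.

```python
# Sketch of a PyTorchJob custom resource (the CRD watched by the
# PyTorch training operator). Image and replica counts are placeholders.
def pytorch_job(name, image, workers=2):
    def replica(n):
        return {
            "replicas": n,
            "restartPolicy": "OnFailure",
            "template": {"spec": {"containers": [
                # the operator expects the main container to be named "pytorch"
                {"name": "pytorch", "image": image},
            ]}},
        }
    return {
        "apiVersion": "kubeflow.org/v1",
        "kind": "PyTorchJob",
        "metadata": {"name": name},
        "spec": {"pytorchReplicaSpecs": {
            "Master": replica(1),       # coordinates the job
            "Worker": replica(workers), # run the distributed training loop
        }},
    }

job = pytorch_job("mnist-ddp", "example.io/train:latest", workers=2)
```

Applied with `kubectl`, the operator turns this single object into a master pod plus worker pods wired together for distributed training.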

Self-Hosting & Configuration

  • Deploy using kustomize manifests on any Kubernetes 1.25+ cluster
  • Requires Istio for service mesh, Dex for authentication, and cert-manager for TLS
  • Cloud-specific distributions available for AWS, GCP, and Azure with managed integrations
  • Configure resource quotas per namespace to isolate team workloads and GPU allocation
  • Use Kubeflow Profiles to create multi-tenant environments with RBAC isolation
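Multi-tenancy in the last two bullets is driven by the Profile CRD: each Profile creates a namespace with RBAC bindings for its owner, and can carry a per-namespace resource quota. A sketch follows; the user email and GPU quota are hypothetical values.

```python
# Sketch of a Kubeflow Profile with a per-namespace resource quota.
# Owner email and quota values are placeholders.
def profile(name, owner_email, gpu_quota="2"):
    return {
        "apiVersion": "kubeflow.org/v1",
        "kind": "Profile",
        "metadata": {"name": name},  # also becomes the namespace name
        "spec": {
            "owner": {"kind": "User", "name": owner_email},
            # resourceQuotaSpec follows the core/v1 ResourceQuota schema
            "resourceQuotaSpec": {
                "hard": {"requests.nvidia.com/gpu": gpu_quota},
            },
        },
    }

team_a = profile("team-a", "alice@example.com", gpu_quota="2")
```

Each team then works inside its own namespace, and the quota caps how many GPUs that team's notebooks and training jobs can request.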

Key Features

  • Kubeflow Pipelines: reusable, versioned ML workflows with a visual pipeline editor
  • Notebook Servers: spawn Jupyter or VS Code environments on Kubernetes with GPU support
  • Distributed Training: native operators for TensorFlow, PyTorch, Horovod, and MPI workloads
  • KServe: production model serving with autoscaling, A/B testing, and canary deployments
  • Katib: hyperparameter tuning and neural architecture search as Kubernetes jobs
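Of these, KServe's declarative serving model is worth a concrete example: an InferenceService names the model format and storage location, and KServe provisions an autoscaling inference endpoint around it. A sketch, with a placeholder storage URI:

```python
# Sketch of a KServe InferenceService for a scikit-learn model.
# The storage URI is a placeholder; in practice it points at a
# model artifact in S3, GCS, or similar object storage.
def inference_service(name, storage_uri):
    return {
        "apiVersion": "serving.kserve.io/v1beta1",
        "kind": "InferenceService",
        "metadata": {"name": name},
        "spec": {"predictor": {
            "model": {
                "modelFormat": {"name": "sklearn"},
                "storageUri": storage_uri,
            },
        }},
    }

svc = inference_service("iris-clf", "gs://example-bucket/models/iris")
```

Swapping the model format or adding a second predictor revision is how canary and A/B rollouts are expressed in the same resource.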

Comparison with Similar Tools

  • MLflow — Lighter experiment tracking; Kubeflow offers full pipeline orchestration on K8s
  • SageMaker — AWS-managed ML platform; Kubeflow is cloud-agnostic and self-hosted
  • Ray — Distributed compute framework; Kubeflow provides a broader ML platform experience
  • Metaflow — Netflix's ML workflow tool; simpler but less Kubernetes-native
  • Vertex AI — Google's managed ML platform; its pipelines service runs Kubeflow Pipelines definitions, while Kubeflow itself stays open source and self-hosted

FAQ

Q: Do I need a large Kubernetes cluster to run Kubeflow? A: No. A minimal install fits on a small cluster, even a single well-provisioned node. For production with GPUs and multi-tenancy, scale according to workload needs.

Q: Can I use Kubeflow without the full platform? A: Yes. Individual components like Kubeflow Pipelines or KServe can be installed standalone without the full Kubeflow deployment.

Q: Does Kubeflow support GPU workloads? A: Yes. Kubeflow leverages Kubernetes GPU scheduling. Training operators and notebook servers can request GPU resources natively.
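GPU scheduling here is plain Kubernetes: a container requests the `nvidia.com/gpu` extended resource in its limits, and the scheduler places the pod on a GPU node. A sketch of the relevant container fragment (the image name is a placeholder):

```python
# Container spec fragment requesting NVIDIA GPUs via the standard
# Kubernetes extended-resource mechanism. Image name is a placeholder.
def gpu_container(image, gpus=1):
    return {
        "name": "trainer",
        "image": image,
        "resources": {"limits": {"nvidia.com/gpu": str(gpus)}},
    }

c = gpu_container("example.io/train:latest", gpus=1)
```

The same fragment works inside a notebook server pod, a TFJob/PyTorchJob replica template, or a KServe predictor.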

Q: How does Kubeflow handle experiment tracking? A: Kubeflow Pipelines tracks runs, parameters, metrics, and artifacts. For richer experiment tracking, it integrates with MLflow or Weights and Biases.
