Kubeflow — Machine Learning Toolkit for Kubernetes
An open-source platform for deploying, orchestrating, and managing ML workflows on Kubernetes. Kubeflow brings portable and scalable machine learning pipelines, notebook servers, model training, and serving to any Kubernetes cluster.
What it is
Kubeflow provides Jupyter notebook servers, ML pipeline orchestration (Kubeflow Pipelines), distributed model training operators, hyperparameter tuning (Katib), and model serving (KServe). All components run as Kubernetes-native resources.
Kubeflow targets ML engineers and platform teams who run Kubernetes and want a standardized way to manage the ML lifecycle. It makes ML workflows portable across any Kubernetes cluster, whether on-premises, in the cloud, or hybrid.
How it saves time or tokens
Kubeflow eliminates the need to build custom infrastructure for each ML workflow stage. Pipelines define reproducible multi-step workflows as code. Katib automates hyperparameter search across multiple trials. KServe handles model deployment with auto-scaling and A/B testing. Everything runs on Kubernetes, so you leverage existing cluster management skills and infrastructure.
How to use
- Install Kubeflow on an existing Kubernetes cluster:
kubectl apply -k 'github.com/kubeflow/manifests/example?ref=v1.9'
- Access the dashboard by port-forwarding:
kubectl port-forward svc/istio-ingressgateway -n istio-system 8080:80
- Create notebooks, build pipelines, and submit training jobs through the web UI or SDK.
Example
# Install Kubeflow on existing K8s cluster
kubectl apply -k 'github.com/kubeflow/manifests/example?ref=v1.9'
# Access the dashboard
kubectl port-forward svc/istio-ingressgateway -n istio-system 8080:80
# Open http://localhost:8080
# Define a Kubeflow Pipeline
from kfp import dsl, compiler

@dsl.component
def preprocess(data_path: str) -> str:
    # Preprocessing logic goes here; return the path to the processed data
    processed_path = data_path + '.processed'
    return processed_path

@dsl.component
def train(data_path: str) -> str:
    # Training logic goes here; return the path to the trained model
    model_path = data_path + '.model'
    return model_path

@dsl.pipeline(name='ML Pipeline')
def ml_pipeline(data_path: str):
    preprocess_task = preprocess(data_path=data_path)
    train(data_path=preprocess_task.output)

# Compile the pipeline for submission to a Kubeflow Pipelines instance
compiler.Compiler().compile(pipeline_func=ml_pipeline, package_path='ml_pipeline.yaml')
Related on TokRepo
- DevOps Tools — Kubernetes and infrastructure automation
- Automation Tools — ML and data pipeline automation
Common pitfalls
- Kubeflow requires a functioning Kubernetes cluster with sufficient resources. The full installation consumes significant CPU and memory. Consider starting with a minimal profile.
- Istio is a dependency for the full Kubeflow installation. If your cluster already runs a different service mesh, there may be conflicts.
- Kubeflow Pipelines v2 uses a different SDK and pipeline format than v1. Check which version your installation supports before writing pipelines.
Frequently Asked Questions
Does Kubeflow require a specific cloud provider?
No. Kubeflow runs on any Kubernetes cluster, including GKE, EKS, AKS, and on-premises clusters. The installation uses standard Kubernetes resources. Some cloud providers offer pre-configured Kubeflow distributions.
What is Kubeflow Pipelines?
Kubeflow Pipelines is a component for defining and running multi-step ML workflows as directed acyclic graphs (DAGs). Each step runs in a container, and the pipeline handles data passing between steps, caching, and retry logic.
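To make the DAG model concrete, here is a minimal pure-Python sketch of the ideas behind a pipeline runner: steps as nodes, outputs passed downstream, and a result cache so each step executes at most once. This is an illustration only, not the kfp SDK; the names Step and run_dag are hypothetical.

```python
# Illustrative sketch only: a toy DAG runner showing the concepts behind
# Kubeflow Pipelines (steps as nodes, data passing, result caching).
# It is NOT the kfp SDK; Step and run_dag are invented for this example.
from dataclasses import dataclass, field

@dataclass
class Step:
    name: str
    func: callable
    upstream: list = field(default_factory=list)  # names of dependency steps

def run_dag(steps, initial_input):
    """Run steps in dependency order, passing each step's output downstream."""
    by_name = {s.name: s for s in steps}
    results = {}  # acts as the cache: a step runs at most once

    def run(step):
        if step.name in results:
            return results[step.name]  # cache hit, skip re-execution
        inputs = [run(by_name[dep]) for dep in step.upstream]
        results[step.name] = step.func(*(inputs or [initial_input]))
        return results[step.name]

    for s in steps:
        run(s)
    return results

steps = [
    Step('preprocess', lambda path: path + '.processed'),
    Step('train', lambda path: path + '.model', upstream=['preprocess']),
]
print(run_dag(steps, 'data.csv'))
# {'preprocess': 'data.csv.processed', 'train': 'data.csv.processed.model'}
```

In real Kubeflow Pipelines each step runs in its own container and outputs are exchanged as artifacts, but the dependency-ordered execution and caching behave analogously.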
What is Katib?
Katib is Kubeflow's hyperparameter tuning system. You define a search space (learning rate, batch size, etc.), an objective metric, and a search algorithm (random, Bayesian, grid). Katib launches parallel trials and tracks the best configuration.
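The search loop Katib automates can be sketched in plain Python. This is a conceptual illustration of random search, not the Katib API: in Katib the search space and objective are declared in an Experiment resource and each trial runs as a Kubernetes job, while here the objective is a stand-in function.

```python
# Illustrative sketch only: what Katib-style random search does conceptually.
# Katib runs trials as Kubernetes jobs; here a trial is a function call
# and the objective function is a made-up stand-in for a training run.
import random

search_space = {
    'learning_rate': (1e-4, 1e-1),    # continuous range
    'batch_size': [16, 32, 64, 128],  # discrete choices
}

def objective(params):
    # Stand-in for a real training run that reports a validation metric;
    # higher is better, peaking near learning_rate = 0.01.
    return -abs(params['learning_rate'] - 0.01) - params['batch_size'] / 1000

def random_search(n_trials, seed=0):
    rng = random.Random(seed)
    best_params, best_score = None, float('-inf')
    for _ in range(n_trials):
        params = {
            'learning_rate': rng.uniform(*search_space['learning_rate']),
            'batch_size': rng.choice(search_space['batch_size']),
        }
        score = objective(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

best, score = random_search(n_trials=20)
print(best)
```

Katib's Bayesian and grid algorithms replace the sampling step, but the trial/objective/best-so-far structure is the same.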
What is KServe?
KServe (formerly KFServing) is Kubeflow's model serving component. It deploys trained models as scalable inference endpoints with auto-scaling, canary rollouts, and support for TensorFlow, PyTorch, XGBoost, and custom serving runtimes.
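A deployed endpoint speaks KServe's V1 inference protocol: a JSON body of the form {"instances": [...]} posted to /v1/models/<name>:predict. The sketch below builds such a request with the standard library; the host and model name are hypothetical placeholders for a deployed InferenceService.

```python
# Sketch of calling a KServe V1 inference endpoint. The host and model
# name below are hypothetical placeholders, not a running service.
import json
import urllib.request

def build_predict_request(host, model_name, instances):
    """Build an HTTP request following KServe's V1 inference protocol."""
    body = json.dumps({'instances': instances}).encode()
    url = f'http://{host}/v1/models/{model_name}:predict'
    return urllib.request.Request(
        url, data=body, headers={'Content-Type': 'application/json'})

req = build_predict_request('localhost:8080', 'sklearn-iris',
                            [[6.8, 2.8, 4.8, 1.4]])
print(req.full_url)  # http://localhost:8080/v1/models/sklearn-iris:predict
# urllib.request.urlopen(req) would send it once the endpoint is reachable.
```

The response, when a model is actually serving, is a JSON object with a "predictions" field mirroring the order of the submitted instances.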
What are the minimum resource requirements?
A minimal Kubeflow installation needs at least 4 CPUs and 8 GB of RAM. The full installation with all components requires more. Start with a minimal profile and enable components as needed.
Citations (3)
- Kubeflow GitHub — Kubeflow provides ML pipelines, notebook servers, training, and serving on Kuber…
- Kubeflow Documentation — Kubeflow Pipelines for reproducible ML workflows
- Kubeflow Official Site — Kubernetes-native machine learning platform architecture
Related Assets
Flax — Neural Network Library for JAX
A high-performance neural network library built on JAX, providing a flexible module system used extensively across Google DeepMind and the JAX research community.
PyCaret — Low-Code Machine Learning in Python
An open-source AutoML library that wraps scikit-learn, XGBoost, LightGBM, CatBoost, and other ML libraries into a unified low-code interface for rapid experimentation.
DGL — Deep Graph Library for Scalable Graph Neural Networks
A high-performance framework for building graph neural networks on top of PyTorch, TensorFlow, or MXNet, designed for both research prototyping and production-scale graph learning.