KServe — Scalable ML Model Serving on Kubernetes
What it is
KServe is a CNCF project that provides a standardized, Kubernetes-native platform for deploying, scaling, and managing machine learning models in production. It supports inference runtimes including TensorFlow, PyTorch, XGBoost, vLLM, and custom containers.
ML engineers, platform teams, and MLOps practitioners use KServe to deploy models behind a consistent API without writing custom serving infrastructure. It handles autoscaling, canary rollouts, and model versioning through Kubernetes custom resources.
How it saves time or tokens
KServe abstracts away the complexity of serving infrastructure. Instead of writing custom Flask or FastAPI servers for each model, you declare an InferenceService resource and KServe handles routing, scaling (including scale-to-zero), and load balancing. This reduces deployment time from days to minutes and eliminates boilerplate serving code.
How to use
- Install KServe on your Kubernetes cluster using the provided manifests or Helm chart (a minimal install sketch follows this list).
- Create an InferenceService YAML defining your model location and runtime.
- Apply the resource and KServe provisions the serving pods, configures autoscaling, and exposes a prediction endpoint.
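As a sketch, a minimal serverless install looks roughly like the commands below. The release versions and manifest names are assumptions, and serverless mode additionally needs Knative Serving and a networking layer such as Istio, so follow the official install guide for your cluster:
# cert-manager is a prerequisite for KServe's admission webhooks (version shown is illustrative)
kubectl apply -f https://github.com/cert-manager/cert-manager/releases/download/v1.14.0/cert-manager.yaml
# KServe CRDs and controller; substitute the release you are targeting
kubectl apply -f https://github.com/kserve/kserve/releases/download/v0.13.0/kserve.yaml
# Built-in ClusterServingRuntimes (sklearn, xgboost, torchserve, and friends)
kubectl apply -f https://github.com/kserve/kserve/releases/download/v0.13.0/kserve-cluster-resources.yaml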
Example
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: sklearn-iris
spec:
  predictor:
    model:
      modelFormat:
        name: sklearn
      storageUri: 'gs://kserve-examples/models/sklearn/1.0/model'
      resources:
        requests:
          cpu: '1'
          memory: 2Gi
# Deploy the model
kubectl apply -f sklearn-iris.yaml
# Test the prediction endpoint
curl -X POST http://sklearn-iris.default.example.com/v1/models/sklearn-iris:predict \
-H 'Content-Type: application/json' \
-d '{"instances": [[5.1, 3.5, 1.4, 0.2]]}'
Related on TokRepo
- AI Tools for DevOps — Kubernetes and infrastructure automation tools
- AI Tools for Automation — ML pipeline and deployment automation
Common pitfalls
- Not configuring resource requests and limits, leading to OOM kills on large models.
- Enabling scale-to-zero without understanding cold start latency for your use case.
- Using the default Knative setup without tuning concurrency and queue depth for your traffic patterns; a tuning sketch follows this list.
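A hedged sketch of those knobs on the predictor spec. Field names like scaleTarget and scaleMetric exist in recent KServe releases, but verify them against your version's API reference:
spec:
  predictor:
    minReplicas: 1            # keep one warm replica to avoid cold starts entirely
    maxReplicas: 5            # cap scale-out so a traffic spike cannot exhaust the cluster
    containerConcurrency: 4   # hard cap on in-flight requests per pod; excess requests queue
    scaleMetric: concurrency  # autoscale on concurrent requests rather than CPU
    scaleTarget: 2            # add pods once roughly 2 requests per pod are in flight
    model:
      modelFormat:
        name: sklearn
      storageUri: 'gs://kserve-examples/models/sklearn/1.0/model'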
Frequently Asked Questions
Which model frameworks does KServe support?
KServe supports TensorFlow, PyTorch, scikit-learn, XGBoost, LightGBM, vLLM, Triton Inference Server, and custom containers. Each framework has a built-in serving runtime that handles model loading and inference.
Can KServe serve models on GPUs?
Yes. You can request GPU resources in your InferenceService spec. KServe works with the NVIDIA GPU Operator on Kubernetes and supports CUDA-based runtimes for frameworks like PyTorch and vLLM.
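A sketch of a GPU-backed predictor; the model location is hypothetical, and it assumes the NVIDIA device plugin (or GPU Operator) is installed so nvidia.com/gpu is a schedulable resource:
spec:
  predictor:
    model:
      modelFormat:
        name: pytorch
      storageUri: 'gs://your-bucket/models/resnet'  # hypothetical model path
      resources:
        limits:
          nvidia.com/gpu: '1'  # GPUs are set as limits; Kubernetes mirrors them into requests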
Does KServe support scale-to-zero?
Yes. KServe integrates with Knative to support scale-to-zero: pods are terminated when there is no traffic and spun up on demand. This reduces costs for infrequently used models but introduces cold start latency.
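In serverless mode, enabling it is a one-line change on the predictor:
spec:
  predictor:
    minReplicas: 0  # allow Knative to terminate all pods when the service is idle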
How do canary rollouts work in KServe?
KServe supports canary rollouts by letting you split traffic by percentage between model versions in the InferenceService spec. You can gradually shift traffic from the old model to the new one while monitoring metrics.
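Roughly: you update the InferenceService to point at the new model and set canaryTrafficPercent, and the previous revision keeps the remaining traffic. The 2.0 storageUri below is a hypothetical new version:
spec:
  predictor:
    canaryTrafficPercent: 10  # 10% of requests go to this revision; raise it to promote
    model:
      modelFormat:
        name: sklearn
      storageUri: 'gs://kserve-examples/models/sklearn/2.0/model'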
Is KServe production-ready?
Yes. KServe is a CNCF incubating project used in production by multiple organizations, and it provides the monitoring, logging, and autoscaling features that production ML serving needs. The InferenceService API is currently at v1beta1.
Citations (3)
- KServe GitHub — CNCF project for standardized Kubernetes-native ML serving
- KServe Documentation — Support for TensorFlow, PyTorch, XGBoost, vLLM runtimes
- KServe API Reference — InferenceService API for model deployment