BentoML — Build AI Model Serving APIs
BentoML builds model inference REST APIs and multi-model serving systems from Python scripts. 8.6K+ GitHub stars. Auto Docker, dynamic batching, any ML framework. Apache 2.0.
Agent-ready installation
This asset can be installed after choosing the runtime, checking the plan, and running the matching command.
npx -y tokrepo@latest install 8885a870-2236-43c7-b948-5c0d330e17de --target codex
Run after confirming the plan in dry-run mode.
What it is
BentoML is a Python framework for packaging and serving machine learning models as production-ready REST APIs. You decorate a Python class with @bentoml.service and methods with @bentoml.api, and BentoML handles Docker containerization, dynamic request batching, model loading, and API endpoint generation. It supports any ML framework including PyTorch, TensorFlow, Hugging Face Transformers, and scikit-learn.
The tool targets ML engineers and platform teams who need to deploy model inference endpoints without building custom API servers and Docker images from scratch.
How it saves time or tokens
BentoML eliminates the boilerplate of building FastAPI/Flask wrappers around model inference code. A single @bentoml.service decorator replaces hundreds of lines of server setup, health checks, request parsing, and Docker configuration. Dynamic batching automatically groups incoming requests to maximize GPU utilization, improving throughput without changing application code.
How to use
- Install BentoML:
pip install -U bentoml
- Create a service file:
# service.py
import bentoml


@bentoml.service
class Summarizer:
    def __init__(self):
        from transformers import pipeline
        self.pipeline = pipeline('summarization')

    @bentoml.api
    def summarize(self, text: str) -> str:
        result = self.pipeline(text, max_length=130)
        return result[0]['summary_text']
- Run locally and containerize:
bentoml serve service:Summarizer
bentoml build
bentoml containerize summarizer:latest
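Once the service is running locally, it can be called over HTTP. The sketch below uses only the Python standard library and rests on a few assumptions: that the server listens on `http://localhost:3000`, that the `summarize` method is exposed at `/summarize`, and that it accepts a JSON body keyed by parameter name. `build_summarize_request` and `call_summarize` are hypothetical helper names, not part of BentoML.

```python
import json
import urllib.request

# Assumed address of a locally running `bentoml serve` process.
BASE_URL = "http://localhost:3000"


def build_summarize_request(text: str, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build a POST request for the (assumed) /summarize route."""
    payload = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/summarize",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def call_summarize(text: str, base_url: str = BASE_URL, timeout: float = 60.0) -> str:
    """Send the request and return the raw response body.

    Requires the service to be running; raises URLError otherwise.
    """
    request = build_summarize_request(text, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

Building the request separately from sending it makes the payload easy to inspect before any network call is made.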
Example
import bentoml
import numpy as np


@bentoml.service(
    traffic={'timeout': 60},
    resources={'gpu': 1, 'memory': '4Gi'}
)
class ImageClassifier:
    def __init__(self):
        import torch
        # Runs once at startup: fetch and load ResNet-50.
        # Newer torchvision releases prefer the weights= argument over pretrained=True.
        self.model = torch.hub.load(
            'pytorch/vision', 'resnet50', pretrained=True
        )
        self.model.eval()

    # batchable=True lets BentoML stack concurrent requests along axis 0.
    @bentoml.api(batchable=True, batch_dim=0)
    def classify(self, images: np.ndarray) -> list:
        import torch
        tensor = torch.from_numpy(images).float()
        with torch.no_grad():
            outputs = self.model(tensor)
        # Index of the highest-scoring class for each image in the batch.
        return outputs.argmax(dim=1).tolist()
Related on TokRepo
- AI tools for coding -- Developer tools for AI application development
- Automation tools -- ML pipeline and deployment automation
Common pitfalls
- The __init__ method runs once at startup; placing slow model loading here is correct, but forgetting to set adequate resource limits causes OOM kills in containers
- Dynamic batching requires the batchable=True flag and consistent input shapes; variable-length inputs need padding or separate handling
- BentoML builds create large Docker images when model weights are embedded; use external model registries for models over 2GB
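The padding mentioned for variable-length inputs can be sketched in plain Python. This is a minimal illustration, not a BentoML API: `pad_batch` is a hypothetical helper, and the choice of `0` as the padding value is an assumption that depends on the model.

```python
from typing import List, Sequence, Tuple


def pad_batch(
    sequences: Sequence[Sequence[int]], pad_value: int = 0
) -> Tuple[List[List[int]], List[int]]:
    """Right-pad every sequence to the length of the longest one.

    Returns the padded rows plus the original lengths, so the padding
    can be masked out or trimmed after inference.
    """
    lengths = [len(seq) for seq in sequences]
    max_len = max(lengths, default=0)
    padded = [list(seq) + [pad_value] * (max_len - len(seq)) for seq in sequences]
    return padded, lengths


padded, lengths = pad_batch([[5, 6, 7], [8], [9, 10]])
print(padded)   # [[5, 6, 7], [8, 0, 0], [9, 10, 0]]
print(lengths)  # [3, 1, 2]
```

Keeping the original lengths alongside the padded rows is what allows a batched endpoint to return per-request results that ignore the padding.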
Frequently asked questions
Which ML frameworks does BentoML support?
BentoML supports PyTorch, TensorFlow, Keras, Hugging Face Transformers, scikit-learn, XGBoost, LightGBM, ONNX, and any framework that can run inference in Python. The framework-agnostic design means you write standard Python inference code and BentoML handles the serving infrastructure.
How does dynamic batching work?
When batchable=True is set on an API method, BentoML collects incoming requests within a configurable time window, groups them into a batch, and sends the batch through the model in a single forward pass. This maximizes GPU utilization by amortizing per-request overhead across multiple inputs.
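The collect-then-run pattern described here can be illustrated with a small standard-library sketch. It is a conceptual model only and does not reproduce BentoML's internal scheduler; `collect_batch`, `run_batched`, and their parameters are hypothetical names chosen for this example.

```python
import queue
import time
from typing import Any, Callable, List, Tuple


def collect_batch(requests: queue.Queue, max_batch_size: int, max_wait_s: float) -> List[Any]:
    """Wait for one request, then gather more until the batch is full
    or the time window closes, whichever comes first."""
    batch = [requests.get()]  # block until at least one request arrives
    deadline = time.monotonic() + max_wait_s
    while len(batch) < max_batch_size:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        try:
            batch.append(requests.get(timeout=remaining))
        except queue.Empty:
            break
    return batch


def run_batched(
    requests: queue.Queue,
    model_fn: Callable[[List[Any]], List[Any]],
    max_batch_size: int = 4,
    max_wait_s: float = 0.01,
) -> List[Tuple[Any, Any]]:
    """Run model_fn once on a whole batch and pair each input with its output."""
    batch = collect_batch(requests, max_batch_size, max_wait_s)
    outputs = model_fn(batch)  # one call for the whole batch
    return list(zip(batch, outputs))


pending: queue.Queue = queue.Queue()
for value in [1, 2, 3, 4, 5]:
    pending.put(value)

# A stand-in "model" that doubles each input.
print(run_batched(pending, lambda xs: [x * 2 for x in xs], max_batch_size=4, max_wait_s=1.0))
```

The two limits trade latency for throughput: a longer window fills larger batches, while a shorter one returns sooner with fewer inputs per model call.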
Can BentoML services be deployed to Kubernetes?
Yes. BentoML generates Docker images that can be deployed to any container orchestrator. The bentoml containerize command produces standard Docker images. BentoCloud provides managed Kubernetes deployment, and you can also deploy to any self-managed Kubernetes cluster.
How does BentoML compare to TorchServe?
TorchServe is built specifically for serving PyTorch models. BentoML is framework-agnostic, supports any Python ML library, and provides a simpler decorator-based API. BentoML also handles Docker packaging and multi-model composition more naturally.
Is BentoML open source?
Yes. BentoML is Apache 2.0 licensed. The core framework is fully open source. BentoCloud is the optional paid managed platform for deployment and scaling, but you can self-host everything with the open-source tools.
Cited sources (3)
- BentoML GitHub: BentoML builds model serving APIs from Python scripts
- BentoML Documentation: Dynamic batching and auto Docker containerization
- BentoML Framework Guide: Supports PyTorch, TensorFlow, Hugging Face, and other frameworks
Source and credits
Created by BentoML. Licensed under Apache 2.0. bentoml/BentoML — 8,600+ GitHub stars
Similar assets
NVIDIA Triton Inference Server — Multi-Framework Model Serving at Scale
Triton Inference Server is NVIDIA's production model serving platform. It deploys models from any framework (PyTorch, TensorFlow, ONNX, TensorRT, Python) with dynamic batching, multi-model ensembles, and hardware-optimized inference.
ONNX Runtime — Cross-Platform ML Model Inference Engine
ONNX Runtime is a high-performance inference engine for machine learning models in the ONNX format. Developed by Microsoft, it accelerates model serving across CPU, GPU, and specialized hardware with a unified API for Python, C++, C#, Java, and JavaScript.
KServe — Scalable ML Model Serving on Kubernetes
KServe is a CNCF project that provides a standardized Kubernetes-native platform for deploying, scaling, and managing machine learning models in production with support for TensorFlow, PyTorch, XGBoost, vLLM, and custom inference runtimes.
LitServe — Fast AI Model Serving Engine
Serve AI models 2x faster than FastAPI with built-in batching, streaming, GPU autoscaling, and multi-model endpoints. From the Lightning AI team.