Skills · March 31, 2026 · 1 min read

BentoML — Build AI Model Serving APIs

BentoML builds model inference REST APIs and multi-model serving systems from Python scripts. It provides automatic Docker image generation and dynamic batching, and works with any ML framework. 8.6K+ GitHub stars. Apache 2.0 licensed.

Script Depot · Community

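Agent ready

Installable directly by an agent

This asset is installable. The agent first selects the current runtime, reviews the install plan, and then runs the matching command.

Native · 98/100 · Policy: Allowed

Agent entry point: any MCP/CLI agent

Type: Skill

Install: Single

Trust level: Established

Entry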

BentoML — Build AI Model Serving APIs

Direct install command

npx -y tokrepo@latest install 8885a870-2236-43c7-b948-5c0d330e17de --target codex

Do a dry run first to confirm the install plan, then run this command.

TL;DR

BentoML turns Python model inference code into production REST APIs with auto Docker packaging and dynamic batching.

§01

What it is

BentoML is a Python framework for packaging and serving machine learning models as production-ready REST APIs. You decorate a Python class with @bentoml.service and methods with @bentoml.api, and BentoML handles Docker containerization, dynamic request batching, model loading, and API endpoint generation. It supports any ML framework including PyTorch, TensorFlow, Hugging Face Transformers, and scikit-learn.

The tool targets ML engineers and platform teams who need to deploy model inference endpoints without building custom API servers and Docker images from scratch.

§02

How it saves time or tokens

BentoML eliminates the boilerplate of building FastAPI/Flask wrappers around model inference code. A single @bentoml.service decorator replaces hundreds of lines of server setup, health checks, request parsing, and Docker configuration. Dynamic batching automatically groups incoming requests to maximize GPU utilization, improving throughput without changing application code.

§03

How to use

Install BentoML:

pip install -U bentoml

Create a service file:

# service.py
import bentoml

@bentoml.service
class Summarizer:
    def __init__(self):
        from transformers import pipeline
        self.pipeline = pipeline('summarization')

    @bentoml.api
    def summarize(self, text: str) -> str:
        result = self.pipeline(text, max_length=130)
        return result[0]['summary_text']

Run locally and containerize:

bentoml serve service:Summarizer
bentoml build
bentoml containerize summarizer:latest
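
Once `bentoml serve` is running, the endpoint can be called over plain HTTP. The sketch below is a minimal client using only the Python standard library. It assumes the server listens on http://localhost:3000, that the route is named after the API method (`/summarize`), and that the JSON body is keyed by the method's parameter name (`text`). Check these against your BentoML version. `build_summarize_request` and `call_summarize` are hypothetical helper names, not part of BentoML.

```python
import json
import urllib.request

# Assumed address of a local `bentoml serve` process; adjust if yours differs.
BASE_URL = "http://localhost:3000"


def build_summarize_request(text: str, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build a POST request for the `summarize` endpoint (hypothetical helper)."""
    # The JSON key mirrors the `text` parameter of Summarizer.summarize.
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/summarize",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def call_summarize(text: str, base_url: str = BASE_URL, timeout: float = 60.0) -> str:
    """Send the request and return the raw response body as text."""
    request = build_summarize_request(text, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

With the service from service.py running locally, `call_summarize("some long article text")` should return the summary produced by the pipeline.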

§04

Example

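import bentoml
import numpy as np

@bentoml.service(
    traffic={'timeout': 60},
    resources={'gpu': 1, 'memory': '4Gi'}
)
class ImageClassifier:
    def __init__(self):
        import torch
        # Load ResNet-50 once at startup; `weights=` replaces the
        # deprecated `pretrained=True` argument in torchvision.
        self.model = torch.hub.load(
            'pytorch/vision', 'resnet50', weights='IMAGENET1K_V1'
        )
        self.model.eval()

    # batchable=True lets BentoML merge concurrent requests along axis 0.
    @bentoml.api(batchable=True, batch_dim=0)
    def classify(self, images: np.ndarray) -> list:
        import torch
        # Expects image data shaped (N, 3, H, W).
        tensor = torch.from_numpy(images).float()
        with torch.no_grad():
            outputs = self.model(tensor)
        # Index of the highest-scoring class for each image in the batch.
        return outputs.argmax(dim=1).tolist()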

§05

Related on TokRepo

AI tools for coding -- Developer tools for AI application development
Automation tools -- ML pipeline and deployment automation

§06

Common pitfalls

The __init__ method runs once at startup, so it is the right place for slow model loading. However, a container memory limit set below what the loaded model needs leads to OOM kills.
Dynamic batching requires the batchable=True flag and consistent input shapes. Variable-length inputs need padding or separate handling.
BentoML builds produce large Docker images when model weights are embedded. Use an external model registry for models over 2GB.
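
The padding point above can be sketched in a few lines of plain Python. This is a minimal illustration, not a BentoML utility: `pad_batch` is a hypothetical helper, and the pad value of 0 is an assumption that depends on your model or tokenizer.

```python
from typing import List, Sequence, Tuple


def pad_batch(
    sequences: Sequence[Sequence[int]], pad_value: int = 0
) -> Tuple[List[List[int]], List[int]]:
    """Right-pad every sequence to the longest length in the batch.

    Returns the padded rows plus the original lengths, so later steps
    can ignore the padding positions.
    """
    lengths = [len(seq) for seq in sequences]
    max_len = max(lengths, default=0)
    padded = [list(seq) + [pad_value] * (max_len - len(seq)) for seq in sequences]
    return padded, lengths


padded, lengths = pad_batch([[5, 6, 7], [8], [9, 10]])
print(padded)   # [[5, 6, 7], [8, 0, 0], [9, 10, 0]]
print(lengths)  # [3, 1, 2]
```

Returning the original lengths alongside the padded rows lets the caller build an attention mask or trim the outputs after the batched forward pass.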

FAQ

What ML frameworks does BentoML support?

BentoML supports PyTorch, TensorFlow, Keras, Hugging Face Transformers, scikit-learn, XGBoost, LightGBM, ONNX, and any framework that can run inference in Python. The framework-agnostic design means you write standard Python inference code and BentoML handles the serving infrastructure.

How does dynamic batching work?

When batchable=True is set on an API method, BentoML collects incoming requests within a configurable time window, groups them into a batch, and sends the batch through the model in a single forward pass. This maximizes GPU utilization by amortizing per-request overhead across multiple inputs.
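
The time-window idea can be illustrated with a small deterministic simulation. This is a simplified sketch of the general technique, not BentoML's actual scheduler, which may behave differently. The function names and the `max_latency_ms` and `max_batch_size` parameters are this sketch's own.

```python
from typing import Callable, List, Sequence, Tuple


def group_into_batches(
    arrivals: Sequence[Tuple[float, str]],
    max_latency_ms: float,
    max_batch_size: int,
) -> List[List[str]]:
    """Group (arrival_time_ms, payload) pairs into batches.

    A batch closes when it is full, or when the next request arrives more
    than `max_latency_ms` after the first request in that batch.
    """
    batches: List[List[str]] = []
    current: List[str] = []
    window_start = 0.0
    for arrival_ms, payload in sorted(arrivals, key=lambda item: item[0]):
        window_expired = bool(current) and arrival_ms - window_start > max_latency_ms
        if window_expired or len(current) >= max_batch_size:
            batches.append(current)
            current = []
        if not current:
            window_start = arrival_ms  # first request opens a new window
        current.append(payload)
    if current:
        batches.append(current)
    return batches


def run_batched(
    batches: List[List[str]], model_fn: Callable[[List[str]], List[str]]
) -> List[str]:
    """Call the model once per batch and flatten the outputs per request."""
    results: List[str] = []
    for batch in batches:
        results.extend(model_fn(batch))
    return results


arrivals = [(0, "a"), (2, "b"), (4, "c"), (30, "d"), (31, "e")]
batches = group_into_batches(arrivals, max_latency_ms=10, max_batch_size=2)
print(batches)  # [['a', 'b'], ['c'], ['d', 'e']]
print(run_batched(batches, lambda batch: [s.upper() for s in batch]))
# ['A', 'B', 'C', 'D', 'E']
```

Five requests result in three model calls here. The per-call overhead is shared across each batch, at the cost of a bounded wait for the first request in every window.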

Can BentoML deploy to Kubernetes?

Yes. BentoML generates Docker images that can be deployed to any container orchestrator. The bentoml containerize command produces standard Docker images. BentoCloud provides managed Kubernetes deployment, and you can also deploy to any self-managed Kubernetes cluster.

How does BentoML compare to TorchServe?

TorchServe is PyTorch-specific and focused on serving PyTorch models. BentoML is framework-agnostic, supports any Python ML library, and provides a simpler decorator-based API. BentoML also handles Docker packaging and multi-model composition more naturally.

Is BentoML free and open source?

Yes. BentoML is Apache 2.0 licensed. The core framework is fully open source. BentoCloud is the optional paid managed platform for deployment and scaling, but you can self-host everything with the open-source tools.


Source and acknowledgments

Created by BentoML. Licensed under Apache 2.0. bentoml/BentoML — 8,600+ GitHub stars


Related assets

NVIDIA Triton Inference Server — Multi-Framework Model Serving at Scale

ONNX Runtime — Cross-Platform ML Model Inference Engine

KServe — Scalable ML Model Serving on Kubernetes

LitServe — Fast AI Model Serving Engine