Scripts · May 19, 2026 · 1 min read

ONNX Runtime — Cross-Platform ML Inference Accelerator

ONNX Runtime is Microsoft's high-performance inference engine for machine learning models in the ONNX format. It supports CPU, GPU, and specialized hardware accelerators across Linux, Windows, macOS, iOS, Android, and the web browser.

Introduction

ONNX Runtime (ORT) is a cross-platform inference and training accelerator compatible with models from PyTorch, TensorFlow, scikit-learn, and other frameworks exported to the ONNX format. It is used in production at Microsoft across Office, Azure, Bing, and Windows.

What ONNX Runtime Does

  • Loads and runs ONNX models with automatic graph optimizations
  • Supports hardware acceleration via execution providers (CUDA, TensorRT, DirectML, OpenVINO, CoreML, XNNPACK)
  • Provides APIs for Python, C/C++, C#, Java, JavaScript, Objective-C, and Swift
  • Supports quantization (INT8, INT4) and mixed precision for faster inference
  • Includes ONNX Runtime GenAI for optimized LLM and generative model serving

Architecture Overview

ORT's core is a C++ inference engine that takes an ONNX graph, applies graph optimizations (constant folding, operator fusion, layout transformation), and partitions the graph across the registered execution providers in priority order, with the CPU provider as the default fallback. Each EP (e.g., CUDAExecutionProvider, TensorrtExecutionProvider) registers optimized kernel implementations for the operators it supports. The session object manages model loading, memory allocation, and thread pooling.
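
To make one of these graph optimizations concrete, here is a minimal pure-Python sketch of constant folding on a toy graph. It is an illustration only, not ORT's actual implementation: the `Node` class, the `fold_constants` function, and the two-operator set are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Node:
    op: str                   # "Add" or "Mul" in this toy operator set
    inputs: Tuple[str, str]   # names of the two input values
    output: str               # name of the value this node produces


OPS = {"Add": lambda a, b: a + b, "Mul": lambda a, b: a * b}


def fold_constants(
    nodes: List[Node], constants: Dict[str, float]
) -> Tuple[List[Node], Dict[str, float]]:
    """Evaluate nodes whose inputs are all known constants; keep the rest."""
    known = dict(constants)
    remaining: List[Node] = []
    for node in nodes:  # nodes are assumed to be in topological order
        if all(name in known for name in node.inputs):
            a, b = (known[name] for name in node.inputs)
            known[node.output] = OPS[node.op](a, b)  # computed once, ahead of time
        else:
            remaining.append(node)  # depends on a runtime input
    return remaining, known


graph = [
    Node("Mul", ("two", "three"), "scale"),  # both inputs constant: folded away
    Node("Add", ("x", "scale"), "y"),        # needs runtime input x: kept
]
remaining, known = fold_constants(graph, {"two": 2.0, "three": 3.0})
```

After the pass, only the `Add` node is left to execute at inference time, and `scale` has become a precomputed constant.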

Self-Hosting & Configuration

  • Install CPU version: pip install onnxruntime; GPU version: pip install onnxruntime-gpu
  • Export models from PyTorch using torch.onnx.export() or from TensorFlow via tf2onnx
  • Configure execution providers by passing a provider list to InferenceSession
  • Tune thread count, memory arena, and graph optimization level via SessionOptions
  • Deploy on mobile using the ONNX Runtime Mobile package with reduced operator sets
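
The provider list and SessionOptions steps above can be sketched as follows. This is a minimal example under stated assumptions: `choose_providers` and `create_session` are hypothetical helper names, the thread count is arbitrary, and `onnxruntime` is imported lazily so that the provider-selection helper can be used even where the package is not installed.

```python
from typing import List, Sequence


def choose_providers(preferred: Sequence[str], available: Sequence[str]) -> List[str]:
    """Keep the preferred providers that are actually available, in order,
    and make sure the CPU provider is present as the last resort."""
    chosen = [p for p in preferred if p in available]
    if "CPUExecutionProvider" not in chosen:
        chosen.append("CPUExecutionProvider")
    return chosen


def create_session(model_path: str, preferred: Sequence[str], num_threads: int = 4):
    """Build an InferenceSession with tuned options (requires onnxruntime)."""
    import onnxruntime as ort  # lazy import: only needed when a session is built

    opts = ort.SessionOptions()
    opts.intra_op_num_threads = num_threads   # threads used inside one operator
    opts.enable_cpu_mem_arena = True          # reuse CPU memory between runs
    opts.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL

    providers = choose_providers(preferred, ort.get_available_providers())
    return ort.InferenceSession(model_path, sess_options=opts, providers=providers)
```

With a model file on disk (the path `model.onnx` here is a placeholder), usage would look like `session = create_session("model.onnx", ["CUDAExecutionProvider"])` followed by `session.run(None, {session.get_inputs()[0].name: input_array})`, where `input_array` is a NumPy array matching the model's input shape.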

Key Features

  • Broad hardware coverage: NVIDIA GPU, AMD GPU, Intel CPU/GPU, Apple Neural Engine, Qualcomm NPU
  • Graph optimizations reduce latency without any model changes
  • Quantization tools for INT8 and INT4 with calibration workflows
  • ONNX Runtime GenAI provides optimized pipelines for LLMs (Phi, Llama, Mistral)
  • WebAssembly and WebGPU backends enable in-browser ML inference
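
The arithmetic behind INT8 quantization with calibration can be sketched in a few lines of plain Python. This is a simplified illustration of affine (scale and zero point) quantization using a min/max calibration rule; the function names are hypothetical, and ORT's own tooling in the `onnxruntime.quantization` module handles whole models and offers more calibration methods.

```python
from typing import List, Tuple

QMIN, QMAX = -128, 127  # representable range of a signed 8-bit integer


def calibrate(values: List[float]) -> Tuple[float, int]:
    """Derive scale and zero point from observed min/max values.
    The range is widened to include 0.0 so that zero is exactly representable."""
    lo = min(min(values), 0.0)
    hi = max(max(values), 0.0)
    scale = (hi - lo) / (QMAX - QMIN) or 1.0  # avoid a zero scale for all-zero data
    zero_point = round(QMIN - lo / scale)
    return scale, zero_point


def quantize(x: float, scale: float, zero_point: int) -> int:
    """Map a float to INT8: round(x / scale) + zero_point, clamped to the range."""
    q = round(x / scale) + zero_point
    return max(QMIN, min(QMAX, q))


def dequantize(q: int, scale: float, zero_point: int) -> float:
    """Map an INT8 value back to an approximate float."""
    return (q - zero_point) * scale


samples = [-1.0, 0.0, 0.4, 2.0]          # made-up calibration data
scale, zero_point = calibrate(samples)
roundtrip = [dequantize(quantize(v, scale, zero_point), scale, zero_point) for v in samples]
```

For values inside the calibrated range, the round-trip error is bounded by half the scale, which is the precision traded away for the smaller integer representation.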

Comparison with Similar Tools

  • TensorRT — NVIDIA-specific with maximum GPU performance; ORT is cross-platform and supports TensorRT as a backend
  • OpenVINO — Intel-focused inference toolkit; ORT includes OpenVINO as an execution provider
  • llama.cpp — specialized for LLM inference on CPUs and GPUs using GGUF models; ORT covers broader ML model types
  • TFLite — Google's mobile inference runtime; ORT offers wider hardware EP coverage
  • Triton Inference Server — NVIDIA's model serving platform; ORT is the inference engine, not the serving layer

FAQ

Q: Which ML frameworks can export to ONNX? A: PyTorch, TensorFlow, scikit-learn, XGBoost, LightGBM, and many others have ONNX export support.

Q: Does ONNX Runtime support training? A: Yes. ORT includes training acceleration for PyTorch models using ORTModule, which applies graph optimizations during training.

Q: Can I run ONNX Runtime in a web browser? A: Yes. The onnxruntime-web package runs models in the browser via WebAssembly or WebGPU.

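Q: How do I choose the right execution provider? A: Pass your preferred providers to InferenceSession as a list in priority order. ORT assigns each part of the graph to the first listed provider that supports it, and anything unsupported falls back to later providers, ending with the CPU provider.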
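
How provider priority plays out across a graph can be illustrated with a toy model. The sketch below is a deliberate simplification: real execution providers claim subgraphs through ORT's internal partitioning logic, and the `assign_nodes` function and the operator coverage sets here are hypothetical.

```python
from typing import Dict, List, Set


def assign_nodes(
    ops: List[str], providers: List[str], supported: Dict[str, Set[str]]
) -> Dict[str, str]:
    """Map each operator to the first provider, in priority order, that supports it."""
    assignment: Dict[str, str] = {}
    for op in ops:
        for provider in providers:
            if op in supported.get(provider, set()):
                assignment[op] = provider
                break
        else:
            # No provider in the list can run this operator.
            raise ValueError(f"no provider supports {op}")
    return assignment


providers = ["CUDAExecutionProvider", "CPUExecutionProvider"]  # priority order
supported = {
    "CUDAExecutionProvider": {"Conv", "Relu"},             # hypothetical coverage
    "CPUExecutionProvider": {"Conv", "Relu", "NonZero"},   # hypothetical coverage
}
plan = assign_nodes(["Conv", "Relu", "NonZero"], providers, supported)
```

In this toy setup `Conv` and `Relu` land on the first provider, while `NonZero` falls through to the CPU provider because the first one does not list it.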
