ONNX Runtime — Cross-Platform ML Inference Accelerator

Introduction

ONNX Runtime (ORT) is a cross-platform inference and training accelerator compatible with models from PyTorch, TensorFlow, scikit-learn, and other frameworks exported to the ONNX format. It is used in production at Microsoft across Office, Azure, Bing, and Windows.

What ONNX Runtime Does

Loads and runs ONNX models with automatic graph optimizations
Supports hardware acceleration via execution providers (CUDA, TensorRT, DirectML, OpenVINO, CoreML, XNNPACK)
Provides APIs for Python, C/C++, C#, Java, JavaScript, Objective-C, and Swift
Enables quantization (INT8, INT4) and mixed-precision for faster inference
Includes ONNX Runtime GenAI for optimized LLM and generative model serving

Architecture Overview

ORT's core is a C++ inference engine that takes an ONNX graph, applies platform-aware graph optimizations (operator fusion, constant folding, layout transformation), and dispatches operators to the best available execution provider. Each EP (e.g., CUDAExecutionProvider, TensorrtExecutionProvider) registers optimized kernel implementations. The session object manages model loading, memory allocation, and thread pooling.
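As a rough illustration of one of the graph optimizations named above, the sketch below applies constant folding to a toy two-node graph. The `Node` class, the `fold_constants` helper, and the value names are hypothetical and exist only for this example. This is not ORT's optimizer, just the idea behind it: a node whose inputs are all known ahead of time is evaluated once and dropped from the graph.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Node:
    """A toy graph node (hypothetical; not an ONNX or ORT type)."""
    op: str                  # "Add" or "Mul"
    inputs: Tuple[str, str]  # names of the two input values
    output: str              # name of the produced value


def fold_constants(nodes: List[Node], constants: Dict[str, float]):
    """Evaluate nodes whose inputs are all constants.

    Assumes `nodes` is in topological order. Returns the nodes that must
    still run at inference time and the enlarged constant table.
    """
    ops = {"Add": lambda a, b: a + b, "Mul": lambda a, b: a * b}
    known = dict(constants)
    remaining = []
    for node in nodes:
        if all(name in known for name in node.inputs):
            a, b = (known[name] for name in node.inputs)
            known[node.output] = ops[node.op](a, b)  # folded ahead of time
        else:
            remaining.append(node)  # depends on a runtime input
    return remaining, known


graph = [
    Node("Mul", ("w", "scale"), "w_scaled"),  # both inputs constant: foldable
    Node("Add", ("x", "w_scaled"), "y"),      # needs runtime input x: kept
]
remaining, known = fold_constants(graph, {"w": 3.0, "scale": 0.5})
print([n.output for n in remaining])  # ['y']
print(known["w_scaled"])              # 1.5
```

After folding, only the `Add` node is left to execute per request; the `Mul` result has become a stored constant.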

Self-Hosting & Configuration

Install the CPU build with pip install onnxruntime, or the GPU build with pip install onnxruntime-gpu
Export models from PyTorch using torch.onnx.export() or from TensorFlow via tf2onnx
Configure execution providers by passing a provider list to InferenceSession
Tune thread count, memory arena, and graph optimization level via SessionOptions
Deploy on mobile using the ONNX Runtime Mobile package with reduced operator sets
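The configuration steps above can be sketched in Python roughly as follows. The helper names `choose_providers` and `create_session`, the default thread count, and the `model.onnx` path are assumptions made for illustration. The `onnxruntime` import is deferred into `create_session`, so the provider-filtering helper can be tried without the package installed.

```python
from typing import Iterable, List

CPU = "CPUExecutionProvider"


def choose_providers(preferred: Iterable[str], available: Iterable[str]) -> List[str]:
    """Keep the preferred providers that are available, in priority order.

    The CPU provider is appended as a last resort if it is not already listed.
    """
    available = set(available)
    chosen = [p for p in preferred if p in available]
    if CPU not in chosen:
        chosen.append(CPU)
    return chosen


def create_session(model_path: str,
                   preferred=("CUDAExecutionProvider", CPU),
                   num_threads: int = 4):
    """Build an InferenceSession with explicit SessionOptions (illustrative values)."""
    import onnxruntime as ort  # deferred: requires `pip install onnxruntime`

    opts = ort.SessionOptions()
    opts.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
    opts.intra_op_num_threads = num_threads  # threads used inside one operator
    opts.enable_cpu_mem_arena = True         # reuse CPU memory between runs

    providers = choose_providers(preferred, ort.get_available_providers())
    return ort.InferenceSession(model_path, sess_options=opts, providers=providers)


# On a machine without a CUDA build, only the CPU provider survives the filter.
print(choose_providers(["CUDAExecutionProvider", CPU], [CPU]))
```

A session created this way would typically be used as `session.run(None, {session.get_inputs()[0].name: input_array})`, where `input_array` is a NumPy array matching the model's input shape and dtype.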

Key Features

Broad hardware coverage: NVIDIA GPU, AMD GPU, Intel CPU/GPU, Apple Neural Engine, Qualcomm NPU
Graph optimizations reduce latency without any model changes
Quantization tools for INT8 and INT4 with calibration workflows
ONNX Runtime GenAI provides optimized pipelines for LLMs (Phi, Llama, Mistral)
WebAssembly and WebGPU backends enable in-browser ML inference
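To make the INT8 quantization feature listed above more concrete, here is a minimal sketch of the standard affine (scale and zero-point) mapping between floats and 8-bit integers. The function names and the calibration range of 0.0 to 2.55 are assumptions chosen for the example. ORT's own quantization tools do considerably more (calibration over real data, per-channel parameters, operator rewriting), so treat this only as the underlying arithmetic.

```python
def quant_params(x_min: float, x_max: float, qmin: int = -128, qmax: int = 127):
    """Derive scale and zero point from an observed (calibrated) value range."""
    # Widen the range to include 0.0 so that zero is exactly representable.
    x_min, x_max = min(x_min, 0.0), max(x_max, 0.0)
    if x_max == x_min:
        raise ValueError("value range is empty")
    scale = (x_max - x_min) / (qmax - qmin)
    zero_point = round(qmin - x_min / scale)
    return scale, zero_point


def quantize(x: float, scale: float, zero_point: int,
             qmin: int = -128, qmax: int = 127) -> int:
    """Map a float to INT8, clamping values outside the calibrated range."""
    q = round(x / scale) + zero_point
    return max(qmin, min(qmax, q))


def dequantize(q: int, scale: float, zero_point: int) -> float:
    """Map an INT8 value back to an approximate float."""
    return (q - zero_point) * scale


scale, zero_point = quant_params(0.0, 2.55)  # scale is about 0.01
q = quantize(1.0, scale, zero_point)
print(q, dequantize(q, scale, zero_point))
print(quantize(10.0, scale, zero_point))     # out of range, clamps to 127
```

The round trip loses at most about one `scale` step of precision for in-range values, which is the accuracy cost traded for smaller weights and faster integer kernels.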

Comparison with Similar Tools

TensorRT — NVIDIA-specific with maximum GPU performance; ORT is cross-platform and supports TensorRT as a backend
OpenVINO — Intel-focused inference toolkit; ORT includes OpenVINO as an execution provider
llama.cpp — specialized for LLM inference on CPUs and GPUs using quantized GGUF models; ORT covers broader ML model types
TFLite — Google's mobile inference runtime; ORT offers wider hardware EP coverage
Triton Inference Server — NVIDIA's model serving platform; ORT is the inference engine, not the serving layer

FAQ

Q: Which ML frameworks can export to ONNX? A: PyTorch, TensorFlow, scikit-learn, XGBoost, LightGBM, and many others have ONNX export support.

Q: Does ONNX Runtime support training? A: Yes. ORT includes training acceleration for PyTorch models using ORTModule, which applies graph optimizations during training.

Q: Can I run ONNX Runtime in a web browser? A: Yes. The onnxruntime-web package runs models in the browser via WebAssembly or WebGPU.

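Q: How do I choose the right execution provider? A: Pass your preferred providers as a list in priority order. ORT assigns each part of the graph to the highest-priority provider that supports it and falls back to the next provider in the list, ultimately the CPU provider, for any operators that are not supported.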
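The provider priority behavior can be pictured with a small simulation. The sketch below is a toy model, not ORT code: the `assign_nodes` helper is hypothetical, and the operator coverage listed for each provider is invented for the example and does not reflect what the real providers support.

```python
from typing import Dict, List, Set, Tuple


def assign_nodes(node_ops: List[str],
                 providers: List[str],
                 supported: Dict[str, Set[str]]) -> List[Tuple[str, str]]:
    """Assign each operator to the first provider, in priority order, that supports it."""
    assignment = []
    for op in node_ops:
        for ep in providers:
            if op in supported[ep]:
                assignment.append((op, ep))
                break
        else:
            raise ValueError(f"no provider supports {op}")
    return assignment


# Hypothetical coverage: the GPU provider lacks one operator the CPU provider has.
supported = {
    "CUDAExecutionProvider": {"Conv", "Relu", "MatMul"},
    "CPUExecutionProvider": {"Conv", "Relu", "MatMul", "NonMaxSuppression"},
}
plan = assign_nodes(
    ["Conv", "Relu", "NonMaxSuppression"],
    ["CUDAExecutionProvider", "CPUExecutionProvider"],
    supported,
)
for op, ep in plan:
    print(f"{op:>18} -> {ep}")
```

In this toy run the first two operators land on the GPU provider and the third falls back to the CPU provider, which is why a model can run even when the preferred provider covers only part of it.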
