ONNX Runtime — Cross-Platform ML Inference and Training Accelerator

Introduction

ONNX Runtime is an open-source inference engine by Microsoft that accelerates machine learning model execution across a wide range of hardware. It supports models exported from PyTorch, TensorFlow, scikit-learn, and other frameworks via the ONNX (Open Neural Network Exchange) format.

What ONNX Runtime Does

Runs ONNX-format models on CPU, GPU, NPU, and edge devices with optimized performance
Provides execution providers for CUDA, TensorRT, DirectML, OpenVINO, CoreML, and more
Supports both inference and training workloads with the same runtime
Integrates with Python, C/C++, C#, Java, JavaScript, and Objective-C
Applies graph optimizations and operator fusion automatically at load time

Architecture Overview

ONNX Runtime loads an ONNX model graph and applies a series of graph transformations (constant folding, operator fusion, layout optimization) before dispatching operations to execution providers (EPs). Each provider targets specific hardware: the CUDA EP for NVIDIA GPUs, the TensorRT EP for further optimization on NVIDIA GPUs, the CoreML EP for Apple devices, etc. The runtime walks the providers in the priority order you give it and assigns each node to the first provider that supports it, with the CPU provider as the fallback. This enables heterogeneous execution within a single model.
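
The per-node dispatch described above can be pictured as a walk over a priority list. The sketch below is a toy model only, not ONNX Runtime's partitioning code: the function `assign_nodes` and the `SUPPORTED` capability table are hypothetical, and the operator sets in it are invented for illustration.

```python
# Toy model of per-node provider assignment. This is NOT ONNX Runtime's
# partitioning code; the supported-operator sets below are invented.
from typing import Dict, List, Set


def assign_nodes(node_ops: List[str],
                 providers: List[str],
                 supported: Dict[str, Set[str]]) -> List[str]:
    """Give each node to the first provider, in priority order, that supports its operator."""
    assignment = []
    for op in node_ops:
        for provider in providers:
            if op in supported.get(provider, set()):
                assignment.append(provider)
                break
        else:
            # No provider in the list claimed this operator.
            raise ValueError(f"no provider supports operator {op!r}")
    return assignment


# Hypothetical capability table (real provider names, made-up operator sets).
SUPPORTED = {
    "CUDAExecutionProvider": {"Conv", "Relu", "MatMul"},
    "CPUExecutionProvider": {"Conv", "Relu", "MatMul", "NonMaxSuppression"},
}
PRIORITY = ["CUDAExecutionProvider", "CPUExecutionProvider"]

if __name__ == "__main__":
    graph = ["Conv", "Relu", "NonMaxSuppression"]
    for op, ep in zip(graph, assign_nodes(graph, PRIORITY, SUPPORTED)):
        print(f"{op:>18} -> {ep}")
```

In this toy graph the first two nodes land on the GPU provider and the third falls back to the CPU provider, which is the kind of split a single model can end up with.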

Self-Hosting & Configuration

Install via pip, conda, NuGet, Maven, or npm depending on your language
Select execution providers by passing them to InferenceSession: e.g., ['CUDAExecutionProvider', 'CPUExecutionProvider']
Use the onnxruntime-gpu package for NVIDIA GPU acceleration, choosing the build that matches your installed CUDA version (11.x or 12.x)
Tune thread count with SessionOptions().intra_op_num_threads for CPU inference
Quantize models with ONNX Runtime's built-in quantization tools to reduce model size and latency
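
The provider and thread settings listed above might be combined as in the following minimal sketch. It assumes the onnxruntime Python package is installed and that a model file exists at the path you pass in. The helpers `pick_providers` and `create_session` are hypothetical names introduced here, not part of the library.

```python
from typing import List, Optional


def pick_providers(preferred: List[str], available: List[str]) -> List[str]:
    """Keep the preferred providers that are available, in priority order, ending with CPU."""
    chosen = [p for p in preferred if p in available]
    if "CPUExecutionProvider" not in chosen:
        chosen.append("CPUExecutionProvider")
    return chosen


def create_session(model_path: str, num_threads: Optional[int] = None):
    """Build an InferenceSession that prefers CUDA and falls back to CPU."""
    # Imported lazily so pick_providers can be used without onnxruntime installed.
    import onnxruntime as ort

    options = ort.SessionOptions()
    if num_threads is not None:
        # Threads used to parallelize work inside a single operator on CPU.
        options.intra_op_num_threads = num_threads

    providers = pick_providers(
        ["CUDAExecutionProvider", "CPUExecutionProvider"],
        ort.get_available_providers(),
    )
    return ort.InferenceSession(model_path, sess_options=options, providers=providers)
```

A call such as create_session("model.onnx", num_threads=4) would return a session, and session.get_providers() reports which providers were actually registered. Filtering against get_available_providers() first is a design choice in this sketch so that the same code runs on machines with and without a GPU build.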

Key Features

Broad hardware coverage via 20+ execution providers across cloud, desktop, mobile, and IoT
Automatic graph optimizations reduce inference latency without manual tuning
ONNX format interoperability lets you train in any framework and deploy uniformly
Quantization support (INT8, FP16) for smaller models and faster inference on constrained devices
Production-grade stability used in Microsoft products including Office, Bing, and Azure
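
As an illustration of the quantization support mentioned above, the sketch below applies dynamic INT8 weight quantization with the onnxruntime.quantization module. It assumes onnxruntime is installed and that the input model exists. The helper names `quantized_path` and `quantize_to_int8`, and the ".int8" file-naming convention, are assumptions made for this example.

```python
from pathlib import Path


def quantized_path(model_path: str) -> str:
    """Derive an output name such as model.int8.onnx (a naming convention of this sketch)."""
    p = Path(model_path)
    return str(p.with_name(p.stem + ".int8" + p.suffix))


def quantize_to_int8(model_path: str) -> str:
    """Write a dynamically quantized copy of the model and return its path."""
    # Imported lazily so quantized_path can be used without onnxruntime installed.
    from onnxruntime.quantization import QuantType, quantize_dynamic

    out_path = quantized_path(model_path)
    quantize_dynamic(
        model_input=model_path,
        model_output=out_path,
        weight_type=QuantType.QInt8,  # store weights as signed 8-bit integers
    )
    return out_path
```

Dynamic quantization needs no calibration data, which makes it a convenient first experiment. Whether it preserves accuracy for a given model is something to measure rather than assume.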

Comparison with Similar Tools

TensorRT — NVIDIA-only, deeper GPU optimization but no cross-platform portability
OpenVINO — Intel-focused inference; ONNX Runtime supports Intel via the OpenVINO EP
TFLite — Targets mobile/embedded for TensorFlow models; ONNX Runtime covers more frameworks
Triton Inference Server — Model serving platform; ONNX Runtime is the inference engine it can host
llama.cpp — Specialized for LLM inference; ONNX Runtime is a general-purpose ML runtime

FAQ

Q: Do I need to convert my PyTorch model to ONNX first? A: Yes. Use torch.onnx.export() to convert a PyTorch model to ONNX format before loading it in ONNX Runtime.
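
A minimal sketch of that conversion step is shown below, assuming PyTorch is installed and that the model takes a single tensor input. The helpers `batch_dynamic_axes` and `export_to_onnx`, and the tensor names "input" and "output", are hypothetical choices for this example. Argument handling in torch.onnx.export has changed between PyTorch releases, so check the documentation for your version.

```python
from typing import Dict, List


def batch_dynamic_axes(names: List[str]) -> Dict[str, Dict[int, str]]:
    """Mark dimension 0 of each named tensor as a variable-size batch axis."""
    return {name: {0: "batch"} for name in names}


def export_to_onnx(model, example_input, out_path: str = "model.onnx") -> str:
    """Export a single-input PyTorch module to an ONNX file and return the path."""
    # Imported lazily so batch_dynamic_axes can be used without PyTorch installed.
    import torch

    model.eval()  # disable dropout and batch-norm updates before tracing
    torch.onnx.export(
        model,
        (example_input,),
        out_path,
        input_names=["input"],
        output_names=["output"],
        dynamic_axes=batch_dynamic_axes(["input", "output"]),
    )
    return out_path
```

The exported file can then be loaded with onnxruntime.InferenceSession and run with session.run(None, {"input": array}), where the key matches the input name chosen at export time.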

Q: Can ONNX Runtime run large language models? A: Yes. The ONNX Runtime GenAI library supports transformer-based LLM inference with KV-cache optimization and beam search.

Q: Does it support training or only inference? A: Both. The onnxruntime-training package supports fine-tuning and full training with optimized memory usage.

Q: What platforms does ONNX Runtime run on? A: Windows, Linux, macOS, Android, iOS, and various embedded systems. The CPU build ships as a single library; hardware-accelerated builds also need the vendor's libraries (for example, CUDA and cuDNN for the CUDA EP).
