# ONNX Runtime — Cross-Platform ML Model Inference Engine

> ONNX Runtime is a high-performance inference engine for machine learning models in the ONNX format. Developed by Microsoft, it accelerates model serving across CPU, GPU, and specialized hardware with a unified API for Python, C++, C#, Java, and JavaScript.

## Quick Use

Install the package, then verify the runtime and list the available execution providers:

```bash
pip install onnxruntime
python -c "
import onnxruntime as ort
import numpy as np

# Example: run a model exported from PyTorch/TF
# session = ort.InferenceSession('model.onnx')
# result = session.run(None, {'input': np.random.randn(1, 3, 224, 224).astype(np.float32)})
print(f'ONNX Runtime {ort.__version__} ready, providers: {ort.get_available_providers()}')
"
```

## Introduction

ONNX Runtime is Microsoft's cross-platform inference accelerator for machine learning models. It supports the Open Neural Network Exchange (ONNX) format, enabling models trained in PyTorch, TensorFlow, scikit-learn, and other frameworks to run on any target hardware with optimized performance and minimal code changes.

## What ONNX Runtime Does

- Executes ONNX-format models with automatic graph optimizations and kernel fusion
- Accelerates inference on CPUs (x86, ARM), NVIDIA GPUs (CUDA, TensorRT), AMD GPUs, and NPUs
- Provides execution providers that plug in hardware-specific acceleration without code changes
- Supports quantization (INT8, FP16) to reduce model size and increase throughput
- Enables on-device inference for mobile (iOS, Android) and web (WebAssembly, WebGPU) targets

## Architecture Overview

ONNX Runtime loads an ONNX graph and applies optimization passes (constant folding, operator fusion, layout transformation) before partitioning subgraphs to available execution providers. Each execution provider handles hardware-specific code generation. A session-based API manages model loading, input binding, and output retrieval.
The runtime uses a thread pool for parallel operator execution on CPU and streams to overlap work on GPU.

## Self-Hosting & Configuration

- CPU inference: `pip install onnxruntime`; GPU: `pip install onnxruntime-gpu`
- Convert PyTorch models: `torch.onnx.export(model, dummy_input, 'model.onnx')` or use the Optimum library
- Select execution provider: `ort.InferenceSession('model.onnx', providers=['CUDAExecutionProvider'])`
- Quantize models with `onnxruntime.quantization.quantize_dynamic()` for INT8 inference
- Deploy to edge via ONNX Runtime Mobile with a reduced operator set for smaller binary size

## Key Features

- Broad hardware support through 20+ execution providers including TensorRT, DirectML, and OpenVINO
- Graph optimizations that automatically fuse operators for up to 3x speedup over naive execution
- Supports ONNX opset versions 7 through 21 for wide model compatibility
- Training support via ORTModule for accelerating PyTorch training with ONNX Runtime kernels
- ONNX Runtime Web enables ML inference in the browser via WebAssembly and WebGPU

## Comparison with Similar Tools

- **PyTorch (eager)** — flexible for research, but ONNX Runtime typically delivers 2-3x faster inference
- **TensorRT** — best raw NVIDIA GPU performance, but NVIDIA-only and requires more setup
- **OpenVINO** — Intel-optimized inference, but narrower hardware coverage than ONNX Runtime
- **TFLite** — lightweight for mobile, but limited to TensorFlow models and fewer optimizations
- **Triton Inference Server** — model serving platform that can use ONNX Runtime as a backend

## FAQ

**Q: Which model frameworks can export to ONNX?**
A: PyTorch (via torch.onnx or torch.export), TensorFlow (via tf2onnx), scikit-learn (via skl2onnx), and many others. Hugging Face Optimum automates export for transformer models.

**Q: Does ONNX Runtime support dynamic input shapes?**
A: Yes, ONNX models can have dynamic axes (e.g., variable batch size or sequence length), and ONNX Runtime handles them efficiently.
**Q: How much speedup can I expect over PyTorch?**
A: Typically 1.5-3x for inference, depending on the model and hardware. Graph optimization and operator fusion drive most gains.

**Q: Can I use ONNX Runtime for LLM inference?**
A: Yes, ONNX Runtime GenAI provides optimized LLM inference with support for models like Phi, LLaMA, and Mistral in ONNX format.

## Sources

- https://github.com/microsoft/onnxruntime
- https://onnxruntime.ai

---

Source: https://tokrepo.com/en/workflows/0e90de1c-3d9d-11f1-9bc6-00163e2b0d79
Author: Script Depot