torchvision — Computer Vision Models, Datasets & Transforms for PyTorch

Introduction

torchvision is the official computer vision library in the PyTorch ecosystem. It ships production-ready model architectures, pretrained weights, dataset loaders, and image transforms so researchers and engineers can build vision pipelines without reimplementing common components.

What torchvision Does

Provides 50+ pretrained model architectures (ResNet, EfficientNet, ViT, Swin, Faster R-CNN)
Includes dataset wrappers for ImageNet, COCO, VOC, CIFAR, and more
Offers composable image transforms (v2 API with joint image/target transforms)
Supplies utilities for bounding box, mask, and keypoint manipulation
Bundles efficient C++/CUDA operators for NMS, RoI pooling, and deformable convolutions

Architecture Overview

torchvision is organized around four main modules: models (pretrained architectures), datasets (downloading and loading benchmark datasets), transforms (preprocessing and augmentation pipelines), and ops (C++/CUDA operators). The transforms v2 API accepts arbitrary nested input structures and applies the same randomly sampled parameters to an image and its annotations, so boxes and masks stay aligned with the pixels.

Self-Hosting & Configuration

Install the torchvision release that matches your installed PyTorch version and CUDA build
Use pip, conda, or build from source for custom CUDA support
Pretrained weights are downloaded on first use and cached locally; set TORCH_HOME to change the cache location
Combine with torchdata or torch.utils.data.DataLoader for batched loading
Configure transforms pipelines declaratively using transforms.Compose

Key Features

Multi-weight API allowing selection of specific pretrained checkpoints per model
Transforms v2 with support for bounding boxes, segmentation masks, and videos
Models can be exported to ONNX through torch.onnx.export for deployment
Video reading and decoding utilities via torchvision.io
Quantization-ready model variants for efficient inference

Comparison with Similar Tools

timm — Larger model zoo for image classification; torchvision covers detection and segmentation too
Albumentations — Richer augmentation library but not tightly integrated with PyTorch models
OpenCV — General-purpose vision library; torchvision is specifically for deep learning workflows
Keras Applications — TensorFlow ecosystem equivalent; fewer detection/segmentation models

FAQ

Q: How do I load a pretrained model? A: Use torchvision.models.resnet50(weights="IMAGENET1K_V2"). The multi-weight API lets you pick specific checkpoint versions.

Q: Can torchvision handle video data? A: Yes. torchvision.io provides video reading, and transforms v2 supports video tensor augmentation.

Q: What is transforms v2? A: The new transform API that jointly transforms images and their annotations (boxes, masks) with consistent random parameters.

Q: Does torchvision support object detection? A: Yes. It includes Faster R-CNN, Mask R-CNN, RetinaNet, FCOS, and SSD with pretrained COCO weights.
