TOKREPO · AI Arsenal

Stable

Multimodal Vision Pipeline

Ten production picks for developers who ship vision: augmentation, pretrained backbones, real-time and open-set detection, segmentation, text and document OCR, screenshot parsing for agents, postprocessing, and dataset and model evaluation.

10 resources

About this pack

What's in this pack

This is the stack you wire together when a generic vision-language API isn't enough — when you need to classify your own images, detect domain-specific objects, OCR your own document layouts, parse screenshots for an agent, or analyze video frame-by-frame at production latency. Every pick here is production-grade, actively maintained, and represents a distinct layer of the pipeline a vision developer will own end to end.

This pack is deliberately different from a generic "computer vision tools" round-up. It's organized by the actual order in which you reach for things when shipping: you do not start at model.predict() — you start at data, you train or fine-tune a backbone, you wire detection or segmentation into your hot path, you postprocess detections into business logic, and you keep an eval set running on every change.

The through-line is the same realization most vision teams hit around the third week of production: the demo image worked because you cherry-picked it; the only number that matters is performance on a real held-out eval set, measured before and after every change. Half the picks here exist to make that loop fast.

Install in this order (augment → backbone → detect → segment → OCR → screenshot → postprocess → eval)

Albumentations — image augmentation. Start here because no model, however large, fixes a thin training set. Albumentations is fast (NumPy + OpenCV), composable, and transforms bounding boxes, masks, and keypoints in sync with the image. It is a common augmentation layer in Kaggle solutions and production training pipelines.
timm (PyTorch Image Models) — pretrained backbones. When you need a ResNet, ConvNeXt, EfficientNet, ViT, or DINOv2 backbone for classification or as a feature extractor, timm has it with a unified API and weights you can swap in two lines. Don't roll your own ResNet definition; don't fine-tune off a random checkpoint of unknown provenance.
Ultralytics YOLO — real-time object detection. YOLO11 is the speed-accuracy default for detection, instance segmentation, classification, pose estimation, and oriented bounding boxes. One Python API and CLI covers training, validation, prediction, and export to ONNX / TensorRT / CoreML / TFLite. This is the inference workhorse for most production detection pipelines.
Grounding DINO — open-set detection with text prompts. When you need to detect categories that aren't in any pretrained label set, Grounding DINO accepts a text prompt ("a red emergency stop button", "a serial number sticker") and returns boxes — no fine-tuning required. Pair with SAM 2 for grounded segmentation in one shot.
SAM 2 (Segment Anything 2) — promptable segmentation across images and video. Click a point or draw a box and SAM 2 returns a mask, and it tracks segments across video frames natively. The combination of "prompt anywhere, segment anything" and video tracking is what makes it the default segmentation step in many modern annotation and analysis pipelines.
PaddleOCR — production OCR for 100+ languages. Lightweight (small enough for edge), accurate on natural-scene text, and battle-tested across screenshots, signs, packaging, and UI. Use it when you have prose-style text on heterogeneous backgrounds.
Surya — document OCR for 90+ languages. Where PaddleOCR is the all-rounder, Surya is purpose-built for scanned and PDF documents: text recognition, line-level detection, reading order, and layout analysis (titles, tables, figures). Reach for it when the input is a document, not a photo.
OmniParser — screen parsing for AI agents. Converts a screenshot into structured data (interactive icons, semantic regions, labels) that a downstream LLM can act on. This is the missing layer between "vision model can describe a screenshot" and "agent can click the right button." Essential if you're building a screenshot-driven agent.
Supervision — postprocessing toolkit by Roboflow. The composable piece you've been writing inline: detection annotators, polygon zones, line counters, ByteTrack tracking, dataset converters, and dozens of small utilities that turn raw model output into business logic. Pairs natively with YOLO and Grounding DINO.
FiftyOne — dataset curation and model evaluation. The eval loop. Visualize predictions next to ground truth, find label errors, slice metrics by metadata, embed and cluster, and compare two models on the same eval set. It earns its slot because it makes the loop fast enough to actually run.
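To make "bounding boxes transformed in sync with the image" concrete, here is a library-free sketch of what a crop has to do to its boxes. It is not Albumentations code: the function names and the 0.5 default are assumptions for illustration. Albumentations exposes the same idea through the min_visibility option of its BboxParams.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max) in pixels


def box_area(box: Box) -> float:
    """Area of a box; zero when the box is empty or inverted."""
    x_min, y_min, x_max, y_max = box
    return max(0.0, x_max - x_min) * max(0.0, y_max - y_min)


def crop_boxes(boxes: List[Box], crop: Box, min_visibility: float = 0.5) -> List[Box]:
    """Clip boxes to a crop window and shift them into crop coordinates.

    Boxes whose visible fraction inside the crop falls below
    min_visibility are dropped instead of being kept as slivers.
    """
    crop_x0, crop_y0, crop_x1, crop_y1 = crop
    kept = []
    for box in boxes:
        original_area = box_area(box)
        if original_area == 0.0:
            continue  # degenerate label, nothing to keep
        x0, y0, x1, y1 = box
        clipped = (max(x0, crop_x0), max(y0, crop_y0), min(x1, crop_x1), min(y1, crop_y1))
        if box_area(clipped) / original_area < min_visibility:
            continue  # the crop cut away too much of this object
        kept.append((clipped[0] - crop_x0, clipped[1] - crop_y0,
                     clipped[2] - crop_x0, clipped[3] - crop_y0))
    return kept


boxes = [(10, 10, 50, 50), (90, 90, 130, 130), (200, 200, 240, 240)]
crop = (0, 0, 100, 100)
print(crop_boxes(boxes, crop))  # only the first box survives the crop
```

The second box keeps about 6% of its area inside the crop, so training on it would teach the model that a corner sliver is a full object. Dropping it is the safer default; lower the threshold only after looking at the augmented samples.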

How they fit together (production vision pipeline)

┌─────────────────────────────────────────────────────────────┐
│  TRAINING / FINE-TUNE                                       │
│   Albumentations  ──►  augmented batches                    │
│        │                                                    │
│        ▼                                                    │
│   timm backbone  ──►  features / head                       │
│        │                                                    │
│        ▼                                                    │
│   Ultralytics YOLO  ──►  trained checkpoint (.pt → .onnx)   │
└─────────────────────────────────────────────────────────────┘
                          │
                          ▼
┌─────────────────────────────────────────────────────────────┐
│  INFERENCE (hot path)                                       │
│                                                             │
│   Image / frame                                             │
│        │                                                    │
│        ├──► YOLO (closed-set, real-time)                    │
│        ├──► Grounding DINO (open-set, text-prompted)        │
│        ├──► SAM 2 (segmentation, image + video tracking)    │
│        ├──► PaddleOCR (prose / scene text)                  │
│        ├──► Surya (documents, PDFs)                         │
│        └──► OmniParser (screenshots → structured UI tree)   │
│        │                                                    │
│        ▼                                                    │
│   Supervision  ──►  annotate, track, count, zone, convert   │
└─────────────────────────────────────────────────────────────┘
                          │
                          ▼
┌─────────────────────────────────────────────────────────────┐
│  EVAL                                                       │
│   FiftyOne  ◄── ground truth + predictions                  │
│   slice by metadata · find label errors · compare models    │
└─────────────────────────────────────────────────────────────┘

The split is deliberate. Training is a batch job that produces a checkpoint. Inference is a hot path where you pick the right specialist for the input (a detector for objects, an OCR engine for text, OmniParser for UI). Eval wraps everything — without FiftyOne or an equivalent, you cannot tell whether last week's augmentation tweak helped or quietly regressed mAP on the slice that matters.
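The "pick the right specialist for the input" step can be sketched as a small routing table. Everything here is hypothetical: the input kinds, the handler names, and the string results stand in for real model calls, and a real pipeline would pass images rather than file names.

```python
from typing import Callable, Dict


# Stub handlers standing in for the real specialists (all hypothetical).
def run_detector(payload: str) -> str:
    return f"detector({payload})"


def run_scene_ocr(payload: str) -> str:
    return f"scene_ocr({payload})"


def run_document_ocr(payload: str) -> str:
    return f"document_ocr({payload})"


def run_screen_parser(payload: str) -> str:
    return f"screen_parser({payload})"


# One entry per input kind; adding a specialist is a one-line change.
ROUTES: Dict[str, Callable[[str], str]] = {
    "photo": run_detector,
    "scene_text": run_scene_ocr,
    "document": run_document_ocr,
    "screenshot": run_screen_parser,
}


def route(kind: str, payload: str) -> str:
    """Send the payload to the specialist registered for its input kind."""
    try:
        handler = ROUTES[kind]
    except KeyError:
        raise ValueError(f"no specialist registered for input kind {kind!r}") from None
    return handler(payload)


print(route("document", "invoice.pdf"))
print(route("screenshot", "settings.png"))
```

Failing loudly on an unknown kind is a deliberate choice in this sketch: silently sending a screenshot to a natural-image detector is exactly the kind of mistake that looks fine in a demo.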

Tradeoffs you'll hit

Closed-set vs open-set detection — YOLO is fast and accurate on the categories you trained for. Grounding DINO is slower but accepts arbitrary text prompts and works zero-shot on novel categories. Default to YOLO when the label set is fixed and you have data. Reach for Grounding DINO when the label set keeps changing or when you cannot afford to label thousands of examples per class.
PaddleOCR vs Surya vs vision-language model OCR — PaddleOCR wins on natural-scene text and edge deployment. Surya wins on documents, reading order, and layout. A general vision-language model (GPT-4o, Claude, Qwen-VL) wins on "describe what you see and extract whatever is relevant" — but at much higher cost per page and with no guarantee of consistent structure. Most production stacks use a specialist (PaddleOCR / Surya) and only fall back to a vision-language model for low-confidence pages.
SAM 2 vs YOLO segmentation — YOLO-seg is faster and trains on a fixed label set with supervised masks. SAM 2 is promptable (click a point, draw a box) and tracks across video frames, but it does not know category labels by itself. Combine them: Grounding DINO produces a box from a text prompt and SAM 2 segments inside it. Reach for YOLO-seg when you have your own labeled mask data and want real-time throughput.
OmniParser vs raw vision-language model on a screenshot — A vision-language model can describe a screenshot but cannot reliably output {button: "Submit", coords: [x,y,w,h]} consistently enough to drive clicks. OmniParser is purpose-built for that structured output. If you are building a screenshot-driven agent, the parser belongs in the pipeline; the vision-language model belongs downstream, reasoning over OmniParser's structured output.
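One way to wire "parser first, vision-language model downstream" is to hand the model a numbered list of elements and let it answer with an ID, never with coordinates. The sketch below assumes a hypothetical UIElement schema with [x, y, w, h] boxes; it is not OmniParser's actual output format, so adapt the fields to what your parser returns.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class UIElement:
    """Hypothetical parsed element; not OmniParser's real schema."""
    element_id: int
    kind: str
    label: str
    box: Tuple[int, int, int, int]  # (x, y, w, h) in screen pixels


def to_prompt(elements: List[UIElement]) -> str:
    """Serialize elements as numbered lines for the reasoning model."""
    return "\n".join(f'[{e.element_id}] {e.kind} "{e.label}"' for e in elements)


def click_point(elements: List[UIElement], element_id: int) -> Tuple[int, int]:
    """Resolve the ID chosen by the model to the center of its box."""
    for element in elements:
        if element.element_id == element_id:
            x, y, w, h = element.box
            return (x + w // 2, y + h // 2)
    raise KeyError(f"unknown element id {element_id}")


elements = [
    UIElement(0, "button", "Submit", (100, 200, 80, 40)),
    UIElement(1, "textbox", "Email", (100, 120, 240, 32)),
]
print(to_prompt(elements))
print(click_point(elements, 0))
```

Because the model only ever picks an ID, a hallucinated answer fails with a KeyError instead of clicking a random pixel.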

Common pitfalls

Augmentations that destroy the label — Random crops that cut off bounding boxes, color jitter that hides the sticker color you are trying to detect, mosaic with mixup on an imbalanced 10-class dataset. Visualize augmented batches before training. Plotting a handful of Albumentations outputs takes a few lines; the failure mode is skipping the visualization.
Using a pretrained backbone without checking input statistics — timm models expect specific normalization (mean / std) that varies by checkpoint family. Feeding ImageNet-normalized tensors to a checkpoint trained with other statistics (CLIP-pretrained weights, or ViT checkpoints normalized with a mean and std of 0.5) silently degrades accuracy. Read the values from the checkpoint with timm.data.resolve_data_config and stop guessing.
No eval set = no progress — the most common failure mode in vision teams. Without a few hundred hand-curated images with ground truth, every change is vibes-based. Build the eval set in week one. Add every misclassification a user reports. FiftyOne exists to make this loop fast.
Single threshold across all classes — One global confidence threshold optimized for mAP overall, then deployed to a customer that only cares about the rare class with 12 examples. Per-class thresholds, tuned on the slice that matters, are almost always worth the extra code.
OCR without a confidence floor — PaddleOCR and Surya both return confidence scores alongside the recognized text. Production pipelines that ship every OCR string downstream regardless of confidence get embarrassed by single-character flips on receipt totals. Floor at a threshold and route low-confidence pages to human review or a vision-language model fallback.
Forgetting that screenshots are not natural images — Models trained on COCO / ImageNet do not generalize to UI screenshots. Use OmniParser for structure, fine-tune YOLO on actual UI screenshots if you need a closed-set detector for icons, and do not assume Grounding DINO's defaults will find a "close button" reliably without prompt tuning.
Postprocessing as a one-off script — Every team writes the "draw boxes, count crossings, dump CSV" code inline three times before standardizing. Use Supervision once, save the team a quarter of duplicated utilities.
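The per-class threshold pitfall above is easy to sketch without any detection library. The snippet assumes you already have, for one class on your eval slice, each detection's confidence and whether it matched a ground-truth box. The F1 criterion, the candidate grid, and all names are illustrative choices, not a prescribed method.

```python
from typing import Dict, List, Sequence, Tuple

Scored = Tuple[float, bool]    # (confidence, matched a ground-truth box)
Detection = Tuple[str, float]  # (class_name, confidence)


def f1_at(threshold: float, scored: List[Scored], num_ground_truth: int) -> float:
    """F1 for one class if only detections at or above threshold are kept."""
    tp = sum(1 for conf, hit in scored if conf >= threshold and hit)
    fp = sum(1 for conf, hit in scored if conf >= threshold and not hit)
    fn = num_ground_truth - tp
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def tune_threshold(scored: List[Scored], num_ground_truth: int,
                   candidates: Sequence[float]) -> float:
    """Pick the candidate threshold with the best F1 on the eval slice."""
    return max(candidates, key=lambda t: f1_at(t, scored, num_ground_truth))


def filter_detections(detections: List[Detection], thresholds: Dict[str, float],
                      default_threshold: float = 0.5) -> List[Detection]:
    """Keep detections that clear their own class threshold."""
    return [(name, conf) for name, conf in detections
            if conf >= thresholds.get(name, default_threshold)]


# Rare class: three ground-truth objects, mostly detected at low confidence.
rare_scored = [(0.9, True), (0.4, True), (0.35, False), (0.3, True), (0.2, False)]
thresholds = {"rare_defect": tune_threshold(rare_scored, 3, [0.25, 0.5, 0.75])}
print(thresholds)

detections = [("rare_defect", 0.3), ("scratch", 0.3), ("scratch", 0.8)]
print(filter_detections(detections, thresholds))
```

With a global 0.5 threshold the rare class would lose two of its three true detections; the tuned value keeps them while the common class stays at the default.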

INSTALL · ONE COMMAND

$ tokrepo install pack/multimodal-vision-pipeline

pass it to your agent, or paste it into your terminal

What's included

10 resources ready to install

Skill#01

Albumentations — Fast Image Augmentation Library for ML Pipelines

Albumentations is a fast and flexible image augmentation library for machine learning that supports classification, segmentation, and detection tasks with a composable transform API.

by Script Depot·201 views

$ tokrepo install albumentations-fast-image-augmentation-library-ml-pipelines-43c2bcef

Skill#02

timm — Pretrained Vision Models and Layers for PyTorch

timm (PyTorch Image Models) is a collection of pretrained image classification models, layers, utilities, and training scripts maintained by Ross Wightman and hosted on Hugging Face.

by AI Open Source·249 views

$ tokrepo install timm-pretrained-vision-models-layers-pytorch-b2a4ac4b

Skill#03

Ultralytics YOLO — State-of-the-Art Object Detection

Production-ready object detection, segmentation, classification, and pose estimation models with a simple Python API and CLI, supporting training, validation, and deployment in a single package.

by AI Open Source·190 views

$ tokrepo install ultralytics-yolo-state-art-object-detection-d66bdad8

Prompt#04

Grounding DINO — Open-Set Object Detection with Text Prompts

Grounding DINO combines a DINO-based detector with grounded pre-training to detect arbitrary objects described in natural language, enabling open-vocabulary object detection.

by AI Open Source·85 views

$ tokrepo install grounding-dino-open-set-object-detection-text-prompts-c61831c4

Skill#05

SAM 2 — Segment Anything in Images and Videos

Meta's next-generation Segment Anything Model that extends promptable segmentation from images to videos. SAM 2 tracks and segments objects across video frames in real-time with a unified architecture.

by AI Open Source·278 views

$ tokrepo install sam-2-segment-anything-images-videos-c9dc9efb

Skill#06

PaddleOCR — AI-Powered OCR Toolkit for 100+ Languages

A lightweight, production-ready OCR system supporting 100+ languages. Bridges documents and images to structured data for LLM pipelines.

by Script Depot·231 views

$ tokrepo install paddleocr-ai-powered-ocr-toolkit-100-languages-175147cb

Skill#07

Surya — Document OCR for 90+ Languages

Surya is a document OCR toolkit with 19.5K+ GitHub stars. Text recognition in 90+ languages, layout analysis, table detection, reading order, and LaTeX OCR. Benchmarks favorably against cloud OCR serv

by Script Depot·574 views

$ tokrepo install surya-document-ocr-90-languages-66bc0630

Skill#08

OmniParser — Screen Parsing Toolkit for AI Agents

OmniParser by Microsoft Research converts screenshots into structured data that AI agents can understand and act upon, enabling vision-based GUI automation across desktop and web applications.

by AI Open Source·172 views

$ tokrepo install omniparser-screen-parsing-toolkit-ai-agents-edfd1172

Skill#09

Supervision — Reusable Computer Vision Tools by Roboflow

A Python library of composable building blocks for detecting, tracking, classifying, and annotating objects in images and video streams.

by AI Open Source·143 views

$ tokrepo install supervision-reusable-computer-vision-tools-roboflow-4a9bbd36

Skill#10

FiftyOne — Visual AI Data Curation and Model Analysis

An open-source toolkit for building high-quality datasets and evaluating computer vision models through interactive visualization.

by Script Depot·169 views

$ tokrepo install fiftyone-visual-ai-data-curation-model-analysis-a764718a

Frequently asked questions

Why does this pack include both PaddleOCR and Surya? Aren't they overlapping?

They cover different surfaces and most production stacks end up running both. PaddleOCR is the right default for natural-scene text: screenshots, signs, packaging, photos with text. Surya is purpose-built for documents: scanned PDFs, multi-column layouts, reading order, table detection, and form fields. The pack lists both because the question is not "which OCR" but "which surface" — once you have invoices and product photos in the same product, you will reach for both.

Where do general vision-language models like GPT-4o, Claude, or Qwen-VL fit?

Upstream of agents and downstream of specialists, not as a replacement for either. A vision-language model is great at "describe what's in this image" and "answer a question about this chart" and as a fallback for low-confidence OCR pages. It is not the right tool for real-time object detection at 30 FPS, for screenshot parsing into structured coordinates, or for high-volume document OCR — the cost per call and the consistency overhead of structured output make a specialist cheaper and more reliable for those layers. Use the vision-language model as the reasoning layer over OmniParser / PaddleOCR / YOLO output, not as a substitute.
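The "fallback for low-confidence OCR pages" rule can be sketched as a page-level gate. The snippet assumes the OCR engine's output has already been flattened to (text, confidence) pairs; the 0.85 floor and the function name are placeholders to tune on your own eval set, not values from either library.

```python
from typing import List, Tuple

OcrLine = Tuple[str, float]  # (recognized text, confidence in [0, 1])


def route_page(lines: List[OcrLine], floor: float = 0.85) -> str:
    """Return "accept" when every line clears the floor, else "fallback".

    An empty page is sent to fallback too: no text is not the same
    as confidently read text.
    """
    if not lines:
        return "fallback"
    weakest = min(confidence for _, confidence in lines)
    return "accept" if weakest >= floor else "fallback"


clean_page = [("TOTAL 41.90", 0.97), ("VAT 8.38", 0.91)]
smudged_page = [("TOTAL 41.90", 0.97), ("VAT 8.3B", 0.62)]
print(route_page(clean_page))
print(route_page(smudged_page))
```

Gating on the weakest line rather than the page average is a deliberate choice here: one flipped digit in a total matters more than a high mean.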

Do I really need both YOLO and Grounding DINO?

Not always — pick by use case. If your label set is fixed, you have data, and latency matters (security cameras, defect detection, sports analytics), YOLO is the right answer and Grounding DINO is overkill. If your label set keeps changing or you cannot afford to label thousands of examples per class (a product team that needs to find a new attribute every sprint), Grounding DINO earns its slot because it accepts a text prompt and works zero-shot. Many real stacks ship YOLO for the core categories and Grounding DINO as the escape hatch for new ones.
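The "YOLO for the core categories, Grounding DINO as the escape hatch" setup comes down to splitting the requested labels before calling either model. This is a minimal sketch with hypothetical names. The " . " separator in the text prompt follows the convention used in Grounding DINO's examples; treat it as an assumption and check it against the version you run.

```python
from typing import Iterable, List, Set, Tuple


def split_labels(requested: Iterable[str],
                 trained_labels: Set[str]) -> Tuple[List[str], List[str]]:
    """Separate labels the closed-set detector knows from the ones it does not."""
    closed_set: List[str] = []
    open_set: List[str] = []
    for label in requested:
        (closed_set if label in trained_labels else open_set).append(label)
    return closed_set, open_set


def build_text_prompt(labels: List[str]) -> str:
    """Join open-set labels into one text prompt (separator is an assumption)."""
    return " . ".join(labels) + " ." if labels else ""


trained = {"person", "forklift"}
requested = ["person", "serial number sticker", "forklift", "red emergency stop button"]

closed_set, open_set = split_labels(requested, trained)
print(closed_set)
print(build_text_prompt(open_set))
```

When the open-set list is empty the prompt is empty as well, which is the signal to skip the slower model for that request.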

Is FiftyOne worth installing if I already log metrics to Weights & Biases or MLflow?

Yes, for a different reason. W&B and MLflow are good at metric and run tracking; FiftyOne is the tool that lets you actually look at the data and predictions side by side, slice mAP by metadata, find label errors, and compare two models on the same examples. The teams that ship the largest accuracy gains spend their time inside FiftyOne, not staring at scalar metric curves. Use both: W&B / MLflow for run history, FiftyOne for the example-level loop that drives changes.

What is the smallest viable vision eval set I can start with?

Two hundred to five hundred images with ground truth labels, sliced into at least three meta-buckets that matter to your product (e.g. lighting condition, camera angle, customer tier). Build it in week one before you tune anything else. Every real user-reported failure becomes one more example. By month three you'll have one to three thousand and FiftyOne queries become how you spot regressions before they ship. The hard part is not the count — it's the discipline of slicing by metadata so you can see when a change helps one bucket and hurts another.
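Slicing by metadata can be sketched in plain Python before you adopt any tool. The example uses per-image correctness (classification accuracy, not mAP) for a baseline and a candidate model; the record fields and bucket names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple, Tuple


class EvalRecord(NamedTuple):
    bucket: str              # metadata slice, e.g. lighting condition
    baseline_correct: bool
    candidate_correct: bool


def accuracy_by_bucket(records: List[EvalRecord]) -> Dict[str, Tuple[float, float]]:
    """Return {bucket: (baseline accuracy, candidate accuracy)}."""
    totals = defaultdict(lambda: [0, 0, 0])  # count, baseline hits, candidate hits
    for record in records:
        counts = totals[record.bucket]
        counts[0] += 1
        counts[1] += record.baseline_correct
        counts[2] += record.candidate_correct
    return {bucket: (base / n, cand / n) for bucket, (n, base, cand) in totals.items()}


def regressed_buckets(records: List[EvalRecord], tolerance: float = 0.0) -> List[str]:
    """Buckets where the candidate is worse than the baseline by more than tolerance."""
    return sorted(bucket for bucket, (base, cand) in accuracy_by_bucket(records).items()
                  if cand < base - tolerance)


records = (
    [EvalRecord("daylight", True, True)] * 3
    + [EvalRecord("daylight", False, True)]
    + [EvalRecord("low_light", True, True)] * 2
    + [EvalRecord("low_light", True, False)]
    + [EvalRecord("low_light", False, False)]
)
print(accuracy_by_bucket(records))
print(regressed_buckets(records))
```

Both models get 6 of 8 images right overall, so an aggregate metric shows no change. The per-bucket view shows the candidate gained on daylight and lost on low_light, which is the regression the aggregate hides.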
