Multimodal Vision Pipeline
Ten production picks for developers shipping vision: augmentation, pretrained backbones, real-time and open-set detection, segmentation, text and document OCR, screenshot parsing for agents, postprocessing, and dataset and model evaluation.
What's in this pack
This is the stack you wire together when a generic vision-language API isn't enough — when you need to classify your own images, detect domain-specific objects, OCR your own document layouts, parse screenshots for an agent, or analyze video frame-by-frame at production latency. Every pick here is production-grade, actively maintained, and represents a distinct layer of the pipeline a vision developer will own end to end.
This pack is deliberately different from a generic "computer vision tools" round-up. It's organized by the actual order in which you reach for things when shipping: you do not start at model.predict() — you start at data, you train or fine-tune a backbone, you wire detection or segmentation into your hot path, you postprocess detections into business logic, and you keep an eval set running on every change.
The through-line is the same realization most vision teams hit around the third week of production: the demo image worked because you cherry-picked it; the only number that matters is performance on a real held-out eval set, measured before and after every change. Half the picks here exist to make that loop fast.
Install in this order (augment → backbone → detect → segment → OCR → screenshot → postprocess → eval)
- Albumentations — image augmentation. Start here because no model, however large, fixes a thin training set. Albumentations is fast (NumPy + OpenCV), composable, and supports bounding boxes, masks, and keypoints transformed in sync with the image. It's the standard preprocessing layer behind most Kaggle wins and most production training pipelines.
- timm (PyTorch Image Models) — pretrained backbones. When you need a ResNet, ConvNeXt, EfficientNet, ViT, or DINOv2 backbone for classification or as a feature extractor, timm has it with a unified API and weights you can swap in two lines. Don't roll your own ResNet definition; don't fine-tune off a random checkpoint of unknown provenance.
- Ultralytics YOLO — real-time object detection. YOLO11 is the speed-accuracy default for detection, instance segmentation, classification, pose estimation, and oriented bounding boxes. One Python API and CLI covers training, validation, prediction, and export to ONNX / TensorRT / CoreML / TFLite. This is the inference workhorse for most production detection pipelines.
- Grounding DINO — open-set detection with text prompts. When you need to detect categories that aren't in any pretrained label set, Grounding DINO accepts a text prompt ("a red emergency stop button", "a serial number sticker") and returns boxes — no fine-tuning required. Pair with SAM 2 for grounded segmentation in one shot.
- SAM 2 (Segment Anything 2) — promptable segmentation across images and video. Click a point or draw a box and SAM 2 returns a mask, and it tracks segments across video frames natively. The combination of "prompt anywhere, segment anything" plus video tracking is what makes it the segmentation step for most modern annotation and analysis pipelines.
- PaddleOCR — production OCR for 100+ languages. Lightweight (small enough for edge), accurate on natural-scene text, and battle-tested across screenshots, signs, packaging, and UI. Use it when you have prose-style text on heterogeneous backgrounds.
- Surya — document OCR for 90+ languages. Where PaddleOCR is the all-rounder, Surya is purpose-built for scanned and PDF documents: text recognition, line-level detection, reading order, and layout analysis (titles, tables, figures). Reach for it when the input is a document, not a photo.
- OmniParser — screen parsing for AI agents. Converts a screenshot into structured data (interactive icons, semantic regions, labels) that a downstream LLM can act on. This is the missing layer between "vision model can describe a screenshot" and "agent can click the right button." Essential if you're building a screenshot-driven agent.
- Supervision — postprocessing toolkit by Roboflow. The composable piece you've been writing inline: detection annotators, polygon zones, line counters, ByteTrack tracking, dataset converters, and dozens of small utilities that turn raw model output into business logic. Pairs natively with YOLO and Grounding DINO.
- FiftyOne — dataset curation and model evaluation. The eval loop. Visualize predictions next to ground truth, find label errors, slice metrics by metadata, embed and cluster, compare two models on the same eval set. The single tool most production teams credit with measurable accuracy gains — because it makes the loop fast enough to actually run.
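To make "transformed in sync with the image" concrete, here is a minimal library-free sketch of a horizontal flip that moves a bounding box along with the pixels. The function name hflip_with_boxes and the nested-list image format are illustrative assumptions, not the Albumentations API; in a real pipeline the augmentation library does this bookkeeping once you declare your box format.

```python
from typing import List, Tuple

# (x_min, y_min, x_max, y_max) in pixels; x_max and y_max are exclusive.
Box = Tuple[int, int, int, int]


def hflip_with_boxes(image: List[List[int]], boxes: List[Box]):
    """Flip an image left-right and mirror its boxes the same way.

    Hypothetical helper for illustration only.
    """
    width = len(image[0])
    flipped = [row[::-1] for row in image]
    # Mirroring swaps the roles of x_min and x_max around the image width.
    flipped_boxes = [
        (width - x_max, y_min, width - x_min, y_max)
        for (x_min, y_min, x_max, y_max) in boxes
    ]
    return flipped, flipped_boxes


# A 2x4 "image" whose rightmost column holds the object (value 9).
image = [
    [0, 0, 0, 9],
    [0, 0, 0, 9],
]
boxes = [(3, 0, 4, 2)]

flipped_image, flipped_boxes = hflip_with_boxes(image, boxes)
```

After the flip the object sits in the leftmost column and the box follows it. If only the pixels were flipped, the label would still point at the now-empty right edge, which is the silent failure the synced transform prevents.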
How they fit together (production vision pipeline)
┌─────────────────────────────────────────────────────────────┐
│ TRAINING / FINE-TUNE │
│ Albumentations ──► augmented batches │
│ │ │
│ ▼ │
│ timm backbone ──► features / head │
│ │ │
│ ▼ │
│ Ultralytics YOLO ──► trained checkpoint (.pt → .onnx) │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ INFERENCE (hot path) │
│ │
│ Image / frame │
│ │ │
│ ├──► YOLO (closed-set, real-time) │
│ ├──► Grounding DINO (open-set, text-prompted) │
│ ├──► SAM 2 (segmentation, image + video tracking) │
│ ├──► PaddleOCR (prose / scene text) │
│ ├──► Surya (documents, PDFs) │
│ └──► OmniParser (screenshots → structured UI tree) │
│ │ │
│ ▼ │
│ Supervision ──► annotate, track, count, zone, convert │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ EVAL │
│ FiftyOne ◄── ground truth + predictions │
│ slice by metadata · find label errors · compare models │
└─────────────────────────────────────────────────────────────┘
The split is deliberate. Training is a batch job that produces a checkpoint. Inference is a hot path where you pick the right specialist for the input (a detector for objects, an OCR engine for text, OmniParser for UI). Eval wraps everything — without FiftyOne or an equivalent, you cannot tell whether last week's augmentation tweak helped or quietly regressed mAP on the slice that matters.
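One way to read "pick the right specialist for the input" is as a small dispatch table in front of the hot path. The sketch below assumes three input kinds and uses stub functions in place of real model calls; every name in it (run_detector, run_document_ocr, run_screen_parser, route) is hypothetical.

```python
from typing import Callable, Dict


# Stubs standing in for real model calls; each returns a tagged string
# so the routing decision is visible without loading any model.
def run_detector(payload: str) -> str:
    return f"detector:{payload}"


def run_document_ocr(payload: str) -> str:
    return f"document_ocr:{payload}"


def run_screen_parser(payload: str) -> str:
    return f"screen_parser:{payload}"


# Assumed mapping from input kind to specialist.
SPECIALISTS: Dict[str, Callable[[str], str]] = {
    "photo": run_detector,
    "document": run_document_ocr,
    "screenshot": run_screen_parser,
}


def route(input_kind: str, payload: str) -> str:
    """Send the payload to the specialist registered for its input kind."""
    try:
        handler = SPECIALISTS[input_kind]
    except KeyError:
        raise ValueError(f"no specialist registered for {input_kind!r}")
    return handler(payload)


result = route("screenshot", "login_page.png")
```

Failing loudly on an unknown input kind is a deliberate choice here: sending a screenshot through a natural-image detector tends to produce plausible-looking but useless output.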
Tradeoffs you'll hit
- Closed-set vs open-set detection — YOLO is fast and accurate on the categories you trained for. Grounding DINO is slower but accepts arbitrary text prompts and works zero-shot on novel categories. Default to YOLO when the label set is fixed and you have data. Reach for Grounding DINO when the label set keeps changing or when you cannot afford to label thousands of examples per class.
- PaddleOCR vs Surya vs vision-language model OCR — PaddleOCR wins on natural-scene text and edge deployment. Surya wins on documents, reading order, and layout. A general vision-language model (GPT-4o, Claude, Qwen-VL) wins on "describe what you see and extract whatever is relevant" — but at much higher cost per page and with no guarantee of consistent structure. Most production stacks use a specialist (PaddleOCR / Surya) and only fall back to a vision-language model for low-confidence pages.
- SAM 2 vs YOLO segmentation — YOLO-seg is faster and trains with mask supervision on a fixed label set. SAM 2 is promptable (click a point, draw a box) and tracks across video frames, but does not know category labels by itself. Combine them: Grounding DINO produces a box from a text prompt and SAM 2 segments inside it. Reach for YOLO-seg when you have your own labeled mask data and want real-time throughput.
- OmniParser vs raw vision-language model on a screenshot — A vision-language model can describe a screenshot but cannot reliably output {button: "Submit", coords: [x,y,w,h]} consistently enough to drive clicks. OmniParser is purpose-built for that structured output. If you are building a screenshot-driven agent, the parser belongs in the pipeline; the vision-language model belongs downstream, reasoning over OmniParser's structured output.
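The value of structured screen output is that a click becomes a lookup plus arithmetic rather than a guess. The sketch below assumes a simplified element record with a label, an [x, y, w, h] box, and an interactive flag; this schema and the click_point helper are hypothetical and do not reproduce OmniParser's actual output format.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class UIElement:
    """Assumed minimal shape of one parsed screen element."""
    label: str
    box: Tuple[int, int, int, int]  # (x, y, w, h) in screen pixels
    interactive: bool


def click_point(elements: List[UIElement], label: str) -> Optional[Tuple[int, int]]:
    """Return the center of the first interactive element matching the label."""
    wanted = label.lower()
    for element in elements:
        if element.interactive and element.label.lower() == wanted:
            x, y, w, h = element.box
            return (x + w // 2, y + h // 2)
    return None  # nothing clickable with that label on this screen


elements = [
    UIElement("Cancel", (40, 300, 100, 40), True),
    UIElement("Submit", (160, 300, 100, 40), True),
    UIElement("Terms of service", (40, 260, 220, 20), False),
]

target = click_point(elements, "submit")
```

Returning None when no interactive match exists gives the downstream reasoning model something explicit to react to, for example by scrolling or re-parsing, instead of clicking a stale coordinate.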
Common pitfalls
- Augmentations that destroy the label — Random crops that cut off bounding boxes, color jitters that hide a sticker color you're actually trying to detect, mosaic with mixup on a 10-class imbalanced dataset. Visualize augmented batches before training. Albumentations makes this one line; the failure mode is skipping the visualization.
- Using a pretrained backbone without checking input statistics — timm models expect specific normalization (mean / std) that varies by checkpoint family. Feeding ImageNet-normalized tensors to a checkpoint trained with different statistics (CLIP-pretrained weights, for example) silently degrades accuracy. Use timm.data.resolve_data_config and stop guessing.
- No eval set = no progress — the most common failure mode in vision teams. Without a few hundred hand-curated images with ground truth, every change is vibes-based. Build the eval set in week one. Add every misclassification a user reports. FiftyOne exists to make this loop fast.
- Single threshold across all classes — One global confidence threshold optimized for mAP overall, then deployed to a customer that only cares about the rare class with 12 examples. Per-class thresholds, tuned on the slice that matters, are almost always worth the extra code.
- OCR without a confidence floor — PaddleOCR and Surya both return per-line confidences. Production pipelines that ship every OCR string downstream regardless of confidence get embarrassed by single-character flips on receipt totals. Floor at a threshold and route low-confidence pages to human review or a vision-language model fallback.
- Forgetting that screenshots are not natural images — Models trained on COCO / ImageNet do not generalize to UI screenshots. Use OmniParser for structure, fine-tune YOLO on actual UI screenshots if you need a closed-set detector for icons, and do not assume Grounding DINO's defaults will find a "close button" reliably without prompt tuning.
- Postprocessing as a one-off script — Every team writes the "draw boxes, count crossings, dump CSV" code inline three times before standardizing. Use Supervision once, save the team a quarter of duplicated utilities.
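The per-class threshold pitfall above is small enough to sketch in a few lines. This example assumes detections arrive as (label, score) pairs and that thresholds were tuned offline on the slice that matters; the class names, numbers, and the filter_detections helper are all illustrative.

```python
from typing import Dict, List, NamedTuple


class Detection(NamedTuple):
    label: str
    score: float


def filter_detections(
    detections: List[Detection],
    thresholds: Dict[str, float],
    default: float = 0.5,
) -> List[Detection]:
    """Keep a detection only if it clears the threshold for its own class.

    Classes without a tuned entry fall back to the default threshold.
    """
    return [d for d in detections if d.score >= thresholds.get(d.label, default)]


# Assumed values: the rare class gets a lower bar to protect recall.
thresholds = {"scratch": 0.25, "dent": 0.60}

detections = [
    Detection("scratch", 0.31),
    Detection("dent", 0.55),
    Detection("stain", 0.52),
    Detection("stain", 0.40),
]

kept = filter_detections(detections, thresholds)
```

With a single global threshold of 0.5, the scratch at 0.31 would be dropped and the dent at 0.55 kept; the per-class table reverses both decisions.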
Frequently asked questions
Why does this pack include both PaddleOCR and Surya? Don't they overlap?
They cover different surfaces and most production stacks end up running both. PaddleOCR is the right default for natural-scene text: screenshots, signs, packaging, photos with text. Surya is purpose-built for documents: scanned PDFs, multi-column layouts, reading order, table detection, and form fields. The pack lists both because the question is not "which OCR" but "which surface" — once you have invoices and product photos in the same product, you will reach for both.
Where do general vision-language models like GPT-4o, Claude, or Qwen-VL fit?
Upstream of agents and downstream of specialists, not as a replacement for either. A vision-language model is great at "describe what's in this image" and "answer a question about this chart" and as a fallback for low-confidence OCR pages. It is not the right tool for real-time object detection at 30 FPS, for screenshot parsing into structured coordinates, or for high-volume document OCR — the cost per call and the consistency overhead of structured output make a specialist cheaper and more reliable for those layers. Use the vision-language model as the reasoning layer over OmniParser / PaddleOCR / YOLO output, not as a substitute.
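The "fallback for low-confidence OCR pages" pattern can be sketched as a routing function over a page's OCR lines. The example assumes each line arrives as a (text, confidence) pair with confidence in [0, 1]; the 0.85 floor and the route_page helper are illustrative choices, not values taken from PaddleOCR or Surya.

```python
from typing import List, Tuple

OcrLine = Tuple[str, float]  # (recognized text, confidence in [0, 1])


def route_page(lines: List[OcrLine], floor: float = 0.85) -> str:
    """Return "accept" if every line clears the floor, else "fallback".

    An empty page is treated as a failure and sent to the fallback path,
    which might be human review or a vision-language model.
    """
    if not lines:
        return "fallback"
    worst = min(confidence for _, confidence in lines)
    return "accept" if worst >= floor else "fallback"


clean_page = [("Invoice 1042", 0.98), ("Total: 118.40", 0.93)]
shaky_page = [("Invoice 1042", 0.98), ("Total: 1l8.40", 0.61)]

clean_decision = route_page(clean_page)
shaky_decision = route_page(shaky_page)
```

Gating on the worst line rather than the page average is a deliberate choice in this sketch: one misread total matters more than a high mean score across the rest of the page.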
Do I really need both YOLO and Grounding DINO?
Not always — pick by use case. If your label set is fixed, you have data, and latency matters (security cameras, defect detection, sports analytics), YOLO is the right answer and Grounding DINO is overkill. If your label set keeps changing or you cannot afford to label thousands of examples per class (a product team that needs to find a new attribute every sprint), Grounding DINO earns its slot because it accepts a text prompt and works zero-shot. Many real stacks ship YOLO for the core categories and Grounding DINO as the escape hatch for new ones.
Is FiftyOne worth installing if I already log metrics to Weights & Biases or MLflow?
Yes, for a different reason. W&B and MLflow are good at metric and run tracking; FiftyOne is the tool that lets you actually look at the data and predictions side by side, slice mAP by metadata, find label errors, and compare two models on the same examples. The teams that ship the largest accuracy gains spend their time inside FiftyOne, not staring at scalar metric curves. Use both: W&B / MLflow for run history, FiftyOne for the example-level loop that drives changes.
What is the smallest viable vision eval set I can start with?
Two hundred to five hundred images with ground truth labels, sliced into at least three meta-buckets that matter to your product (e.g. lighting condition, camera angle, customer tier). Build it in week one before you tune anything else. Every real user-reported failure becomes one more example. By month three you'll have one to three thousand and FiftyOne queries become how you spot regressions before they ship. The hard part is not the count — it's the discipline of slicing by metadata so you can see when a change helps one bucket and hurts another.
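The point about a change helping one bucket and hurting another can be shown with plain per-slice accuracy. The sketch below assumes each eval record is a (slice name, prediction correct) pair; the slice names, the tiny record lists, and both helper functions are made up for illustration and are not FiftyOne calls.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Record = Tuple[str, bool]  # (metadata slice, was the prediction correct)


def accuracy_by_slice(records: List[Record]) -> Dict[str, float]:
    """Compute accuracy separately for each metadata slice."""
    totals: Dict[str, int] = defaultdict(int)
    hits: Dict[str, int] = defaultdict(int)
    for slice_name, correct in records:
        totals[slice_name] += 1
        hits[slice_name] += int(correct)
    return {name: hits[name] / totals[name] for name in totals}


def regressions(before: Dict[str, float], after: Dict[str, float]) -> List[str]:
    """List slices present in both runs where accuracy dropped."""
    return sorted(name for name in before if name in after and after[name] < before[name])


baseline = [
    ("daylight", True), ("daylight", False), ("daylight", False), ("daylight", False),
    ("low_light", True), ("low_light", True),
]
candidate = [
    ("daylight", True), ("daylight", True), ("daylight", True), ("daylight", True),
    ("low_light", True), ("low_light", False),
]

before = accuracy_by_slice(baseline)
after = accuracy_by_slice(candidate)
regressed = regressions(before, after)
```

Here overall accuracy rises from 3 of 6 to 5 of 6 while the low_light slice falls from 1.0 to 0.5, so an aggregate metric alone would report the change as a clear win.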