Configs · May 21, 2026 · 3 min read

Grounding DINO — Open-Set Object Detection with Text Prompts

Grounding DINO combines a DINO-based detector with grounded pre-training to detect arbitrary objects described in natural language, enabling open-vocabulary object detection.

Universal CLI command
npx tokrepo install c61831c4-54cb-11f1-9bc6-00163e2b0d79

Introduction

Grounding DINO is an open-set object detection model from IDEA Research that combines the DINO detection architecture with grounded pre-training. Unlike traditional detectors limited to fixed category lists, Grounding DINO can detect any object described in a text prompt, bridging the gap between closed-set detection and open-vocabulary understanding.

What Grounding DINO Does

  • Detects arbitrary objects specified by natural language descriptions
  • Returns bounding boxes with confidence scores and matched text phrases
  • Handles multiple object categories in a single text prompt, with the categories separated by periods
  • Supports referring expression comprehension for specific object identification
  • Provides an open-vocabulary alternative to COCO-trained fixed-category detectors
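
The behaviour listed above can be sketched as a single call. This is a minimal sketch, assuming the `groundingdino` package from the official repository is installed and that a config file and checkpoint have already been downloaded. The wrapper name `detect` is hypothetical, and the default thresholds are the values used in the repository's README example, not tuned recommendations.

```python
def detect(image_path, prompt, config_path, weights_path,
           box_threshold=0.35, text_threshold=0.25):
    """Run text-prompted detection on one image.

    Returns (boxes, logits, phrases): the boxes, one confidence score
    per box, and the prompt phrase matched to each box.
    """
    # Imported lazily so this file can be loaded without the package installed.
    from groundingdino.util.inference import load_model, load_image, predict

    model = load_model(config_path, weights_path)
    # load_image returns the original image array and the preprocessed tensor.
    _image_source, image = load_image(image_path)
    boxes, logits, phrases = predict(
        model=model,
        image=image,
        caption=prompt,
        box_threshold=box_threshold,
        text_threshold=text_threshold,
    )
    return boxes, logits, phrases


# Example call (all paths are placeholders for files you supply):
# boxes, logits, phrases = detect(
#     "street.jpg", "person . bicycle . traffic light .",
#     "GroundingDINO_SwinT_OGC.py", "groundingdino_swint_ogc.pth")
```

Keeping the import inside the function is only a convenience for this sketch; in an application the model would normally be loaded once and reused across images.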

Architecture Overview

Grounding DINO extends the DINO (DETR with Improved deNoising anchOr boxes) architecture with a text encoder branch. A Swin Transformer or similar backbone extracts image features, while a BERT-style encoder processes the text prompt. Cross-modality fusion layers enable feature exchange between vision and language branches at multiple scales. The detection head produces bounding boxes grounded to input text phrases using contrastive alignment between region features and token embeddings.
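
To make the last step more concrete, the following toy sketch scores each region feature against each token embedding with a dot product followed by a sigmoid. It is an illustration under stated assumptions, not the model's code: the function name `alignment_scores` and the two-dimensional vectors are invented for the example, and the real model works on learned, much higher-dimensional features.

```python
import math


def alignment_scores(region_features, token_embeddings):
    """Return scores[i][j]: sigmoid of the dot product between
    region feature i and token embedding j."""
    scores = []
    for region in region_features:
        row = []
        for token in token_embeddings:
            logit = sum(r * t for r, t in zip(region, token))
            row.append(1.0 / (1.0 + math.exp(-logit)))
        scores.append(row)
    return scores


# Toy inputs (assumed values): two regions, three prompt tokens.
regions = [[2.0, 0.0], [0.0, 2.0]]
tokens = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
scores = alignment_scores(regions, tokens)
# Region 0 aligns best with token 0, region 1 with token 1,
# and neither aligns with token 2.
```

A high score in row i, column j means box i is grounded to token j, which is how a box ends up labelled with a phrase from the prompt rather than a fixed class index.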

Self-Hosting & Configuration

  • Install via pip or clone the repository and build from source
  • Requires PyTorch 1.12+ and, for practical speed, a CUDA-capable GPU with at least 6 GB of VRAM (the official repository also documents a slower CPU-only mode)
  • Pre-trained weights available for Swin-T (172M) and Swin-B (232M) backbones
  • Adjustable box_threshold and text_threshold control detection sensitivity
  • Pairs naturally with SAM for text-prompted segmentation (Grounded-SAM pipeline)
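
How the two thresholds interact can be sketched in plain Python. This is a simplified illustration, assuming each candidate box carries one score per prompt token: `box_threshold` decides whether a box is kept at all, and `text_threshold` decides which tokens form its phrase. The function name and the score values are hypothetical, and the real implementation works on tensors and tokenizer output.

```python
def filter_detections(token_scores, tokens,
                      box_threshold=0.35, text_threshold=0.25):
    """Keep boxes whose best token score exceeds box_threshold and
    label each kept box with the tokens scoring above text_threshold.

    token_scores[i][j] is the score of candidate box i for token j.
    Returns a list of (box_index, confidence, phrase) tuples.
    """
    kept = []
    for index, row in enumerate(token_scores):
        confidence = max(row)
        if confidence <= box_threshold:
            continue  # no token matches this box strongly enough
        phrase = " ".join(
            tok for tok, score in zip(tokens, row) if score > text_threshold
        )
        kept.append((index, confidence, phrase))
    return kept


tokens = ["traffic", "light", "person"]
token_scores = [
    [0.60, 0.55, 0.05],  # kept, phrase "traffic light"
    [0.10, 0.08, 0.20],  # dropped: best score is below box_threshold
    [0.04, 0.03, 0.72],  # kept, phrase "person"
]
detections = filter_detections(token_scores, tokens)
# detections == [(0, 0.6, "traffic light"), (2, 0.72, "person")]
```

Raising `box_threshold` trades recall for fewer false positives, while raising `text_threshold` shortens the phrases attached to each box.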

Key Features

  • Open-vocabulary detection removes the need for fixed category training
  • Text-grounded approach detects novel objects without retraining
  • Strong zero-shot transfer: the paper reports 52.5 AP on COCO for its largest model without using COCO training data, ahead of many supervised detectors
  • Multi-phrase queries detect different object types in a single forward pass
  • Easily combined with Segment Anything Model for grounded segmentation

Comparison with Similar Tools

  • OWL-ViT — Google open-vocabulary detector using CLIP features, simpler but less accurate
  • YOLO-World — real-time open-vocabulary detector, faster but less precise on rare objects
  • GLIP — grounded language-image pre-training for detection, predecessor approach
  • Detic — extends detector vocabulary using image classification data, different training strategy
  • Florence-2 — Microsoft unified vision model with detection capability but broader and less specialized

FAQ

Q: How do I specify multiple object types to detect? A: Separate object descriptions with periods in the text prompt, for example: "person . bicycle . traffic light".
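
A small helper can build such a prompt from a list of category names. This is a sketch only: `build_prompt` is a hypothetical name, and lowercasing the text and ending it with a period are conventions assumed from the repository's examples rather than strict requirements stated here.

```python
def build_prompt(categories):
    """Join category names into one period-separated prompt string."""
    cleaned = [c.strip().lower() for c in categories if c.strip()]
    if not cleaned:
        raise ValueError("at least one category is required")
    return " . ".join(cleaned) + " ."


prompt = build_prompt(["Person", "bicycle ", "Traffic Light"])
# prompt == "person . bicycle . traffic light ."
```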

Q: Can Grounding DINO detect objects it was never trained on? A: Yes, that is the core capability. It generalizes to novel objects described in text, though accuracy depends on how well the text description matches visual features.

Q: How does Grounded-SAM work? A: Grounded-SAM pipelines use Grounding DINO to detect bounding boxes from text, then feed those boxes as prompts to SAM to generate precise segmentation masks.
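
The glue between the two models is mostly a box format conversion. The sketch below assumes the detector returns boxes as normalized (center x, center y, width, height) and that the segmenter expects absolute (x0, y0, x1, y1) pixel corners; the function name is hypothetical, and a real pipeline would do this on tensors for all boxes at once.

```python
def cxcywh_to_xyxy_pixels(box, image_width, image_height):
    """Convert one normalized (cx, cy, w, h) box to pixel (x0, y0, x1, y1)."""
    cx, cy, w, h = box
    x0 = (cx - w / 2) * image_width
    y0 = (cy - h / 2) * image_height
    x1 = (cx + w / 2) * image_width
    y1 = (cy + h / 2) * image_height
    return (x0, y0, x1, y1)


# A box centered in a 1000x500 image, 20% wide and 40% tall.
corners = cxcywh_to_xyxy_pixels((0.5, 0.5, 0.2, 0.4), 1000, 500)
# corners is approximately (400, 150, 600, 350)
```

Each converted box is then passed to SAM as a box prompt, and SAM returns a mask for the object inside it.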

Q: What is the inference speed? A: On an NVIDIA A100, the Swin-T model processes approximately 10-15 images per second at 800x1333 resolution.
