YOLO-World: Real-Time Open-Vocabulary Object Detection

https://github.com/AILab-CVC/YOLO-World/blob/master/yolo_world/models/layers/yolo_bricks.py

Open Vocabulary Object Detection

Open-vocabulary detection (OVOD) aims to generalize beyond the limited number of base classes labeled during the training phase. The goal is to detect novel classes defined by an unbounded (open) vocabulary at inference. [From link]

Novelty

Traditional Object Detector

Previous Open-Vocabulary Detector

YOLO-World Model Architecture

Text Encoder (CLIP)

YOLO Detector

YOLOv8 Backbone

Re-parameterizable Vision-Language Path Aggregation Network (Vision-Language PAN)

Text-guided CSPLayer (T-CSPLayer)

Dark Bottleneck (C2f Layer)

From: https://openmmlab.medium.com/dive-into-yolov8-how-does-this-state-of-the-art-model-work-10f18f74bab1

Max-Sigmoid

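X_l^{\prime} = X_l \cdot \delta\left(\max_{j \in \{1..C\}}\left(X_l W_j^{\top}\right)\right)^{\top}

Here X_l is the image feature map at level l, W_j is the embedding of the j-th of C texts, and \delta is the sigmoid function. Each pixel is rescaled by how strongly it matches its best-matching text.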
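A minimal NumPy sketch of the max-sigmoid gate above, assuming image features of shape (H, W, D) and text embeddings of shape (C, D). The function names are illustrative, and the implementation in the linked repository may differ in details such as head grouping, scaling, or bias terms.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def max_sigmoid_attention(x, w):
    """x: (H, W, D) image features, w: (C, D) text embeddings."""
    # Similarity of every pixel to every text embedding: (H, W, C)
    sim = x @ w.T
    # Keep only the best-matching text per pixel, squash to (0, 1): (H, W)
    gate = sigmoid(sim.max(axis=-1))
    # Broadcast the per-pixel gate over the D channels
    return x * gate[..., None]

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 4, 8))   # toy feature map
w = rng.standard_normal((3, 8))      # three toy text embeddings
out = max_sigmoid_attention(x, w)
```

Because the gate lies strictly between 0 and 1, the output keeps the shape of X_l and never increases the magnitude of any feature; pixels that match no text are suppressed.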

Image-Pooling Attention (I-Pooling Attention)

W^{\prime} = W + \text{MultiHead-Attention}(W, \tilde{X}, \tilde{X})
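The update above can be sketched in NumPy as follows. In the YOLO-World paper, \tilde{X} is built by max-pooling each of three feature scales into 3x3 regions, giving 27 patch tokens. The sketch assumes that setup and, for brevity, uses a projection-free multi-head attention (no learned query/key/value weights), which is a simplification and not the repository's layer. All function names are hypothetical.

```python
import numpy as np

def pool_to_3x3(x):
    """Max-pool an (H, W, D) map into 3x3 regions -> (9, D)."""
    rows = np.array_split(np.arange(x.shape[0]), 3)
    cols = np.array_split(np.arange(x.shape[1]), 3)
    return np.stack([x[np.ix_(r, c)].max(axis=(0, 1))
                     for r in rows for c in cols])

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(q, k, v, num_heads):
    """Simplified attention without learned projections.
    q: (C, D) queries; k, v: (N, D) keys and values."""
    c, d = q.shape
    hd = d // num_heads
    # Split channels into heads: (heads, tokens, head_dim)
    qh = q.reshape(c, num_heads, hd).transpose(1, 0, 2)
    kh = k.reshape(-1, num_heads, hd).transpose(1, 0, 2)
    vh = v.reshape(-1, num_heads, hd).transpose(1, 0, 2)
    attn = softmax(qh @ kh.transpose(0, 2, 1) / np.sqrt(hd))  # (heads, C, N)
    out = attn @ vh                                           # (heads, C, hd)
    return out.transpose(1, 0, 2).reshape(c, d)

def image_pooling_attention(w, feature_maps, num_heads=4):
    # Concatenate pooled tokens from all scales: (27, D) for three scales
    x_tilde = np.concatenate([pool_to_3x3(f) for f in feature_maps])
    # Residual update of the text embeddings with image context
    return w + multi_head_attention(w, x_tilde, x_tilde, num_heads)

rng = np.random.default_rng(0)
feats = [rng.standard_normal((s, s, 16)) for s in (24, 12, 6)]
w = rng.standard_normal((5, 16))     # five toy text embeddings
w_new = image_pooling_attention(w, feats)
```

The text embeddings act as queries and the pooled image tokens as keys and values, so W' has the same shape as W but carries image-aware information.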

Text Contrastive Head

Region-Text Matching

s_{k,j} = \alpha \cdot \text{L2-Norm}(e_k) \cdot \text{L2-Norm}(w_j)^{\top} + \beta
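A small NumPy sketch of this similarity, assuming e holds K region embeddings of shape (K, D) and w holds C text embeddings of shape (C, D). In the model \alpha and \beta are learnable; here they are plain arguments with illustrative default values.

```python
import numpy as np

def l2_normalize(v, eps=1e-12):
    # Scale each row to unit length; eps guards against zero vectors
    return v / np.maximum(np.linalg.norm(v, axis=-1, keepdims=True), eps)

def region_text_similarity(e, w, alpha=1.0, beta=0.0):
    """Return the (K, C) matrix of scores s[k, j]."""
    return alpha * l2_normalize(e) @ l2_normalize(w).T + beta

e = np.array([[3.0, 4.0], [1.0, 0.0]])   # two toy region embeddings
w = np.array([[3.0, 4.0], [0.0, 1.0]])   # two toy text embeddings
s = region_text_similarity(e, w)
# s is [[1.0, 0.8], [0.6, 0.0]]: cosine similarities before scale and shift
```

With \alpha = 1 and \beta = 0 the score reduces to cosine similarity, so the first region, whose embedding points in the same direction as the first text, scores 1.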

Loss

\mathcal{L}(I)=\mathcal{L}_{\mathrm{con}}+\lambda_I \cdot\left(\mathcal{L}_{\mathrm{iou}}+\mathcal{L}_{\mathrm{dfl}}\right)

\mathcal{L}_{\mathrm{con}} is the region-text contrastive loss.

\mathcal{L}_{\mathrm{iou}} is the IoU loss.

\mathcal{L}_{\mathrm{dfl}} is the distribution focal loss (DFL).

\lambda_I is an indicator factor: it is set to 1 when the input image I comes from detection or grounding data, and to 0 when it comes from image-text data.
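The role of the indicator \lambda_I can be illustrated with a few lines of Python. This is only a sketch: the three loss terms are passed in as already-computed numbers, and the function and argument names are hypothetical.

```python
def total_loss(l_con, l_iou, l_dfl, from_detection_or_grounding):
    """Combine the loss terms for one image I."""
    # lambda_I switches the box-regression terms on or off
    lambda_i = 1.0 if from_detection_or_grounding else 0.0
    return l_con + lambda_i * (l_iou + l_dfl)

# Detection or grounding sample: all three terms contribute
loss_det = total_loss(0.5, 0.2, 0.3, from_detection_or_grounding=True)
# Image-text sample: only the contrastive term contributes
loss_txt = total_loss(0.5, 0.2, 0.3, from_detection_or_grounding=False)
```

Image-text samples therefore train only the region-text matching, while detection and grounding samples also train box regression through the IoU and DFL terms.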

Dengbuqi's Blog

[CVPR 2024] YOLO-World: Open-vocabulary Object Detection (OVOD)
