
[CVPR 2024] YOLO-World: Open-Vocabulary Object Detection (OVOD)

YOLO-World: Real-Time Open-Vocabulary Object Detection

https://github.com/AILab-CVC/YOLO-World/blob/master/yolo_world/models/layers/yolo_bricks.py

Open-Vocabulary Object Detection

Open-vocabulary detection (OVOD) aims to generalize beyond the limited set of base classes labeled during training. The goal is to detect novel classes defined by an unbounded (open) vocabulary at inference time.

Novelty

Traditional Object Detector: trained on a fixed, pre-defined set of categories, so it can only detect the classes it was labeled with.

Previous Open-Vocabulary Detector: pairs a large detector with a heavy text encoder and encodes the user prompts online for every image, which makes inference slow.

YOLO-World: a lightweight YOLO detector with a prompt-then-detect paradigm; the vocabulary is encoded once offline and re-parameterized into the model, enabling real-time open-vocabulary detection.

YOLO-World Model Architecture

Text Encoder (CLIP)
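YOLO-World encodes the user's vocabulary with a pre-trained CLIP text encoder to obtain one word embedding per prompt. A minimal sketch using the Hugging Face `transformers` CLIP API (the checkpoint name is an assumption for illustration; the repo ships its own encoder wrapper):

```python
import torch
from transformers import CLIPTokenizer, CLIPTextModelWithProjection

# Checkpoint choice is illustrative; YOLO-World uses a pre-trained
# CLIP text encoder to embed the prompt vocabulary offline.
name = "openai/clip-vit-base-patch32"
tokenizer = CLIPTokenizer.from_pretrained(name)
encoder = CLIPTextModelWithProjection.from_pretrained(name)

vocabulary = ["person", "dog", "red backpack"]  # user-defined prompts
inputs = tokenizer(vocabulary, padding=True, return_tensors="pt")
with torch.no_grad():
    w = encoder(**inputs).text_embeds  # (C, D): word embeddings W
```

Because the vocabulary can be encoded once offline, the resulting embeddings can later be re-parameterized into the detector weights for deployment.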

YOLO Detector

YOLOv8 Backbone


Re-parameterizable Vision-Language Path Aggregation Network (RepVL-PAN)


Text-guided CSPLayer (T-CSPLayer)

Dark Bottleneck (C2f Layer)

From: https://openmmlab.medium.com/dive-into-yolov8-how-does-this-state-of-the-art-model-work-10f18f74bab1

Max-Sigmoid Attention

$$X_l^{\prime} = X_l \cdot \delta\left(\max_{j \in \{1..C\}}\left(X_l W_j^{\top}\right)\right)^{\top}$$
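Read as code: for each scale $l$, the layer computes the similarity of every pixel feature with every word embedding, takes the max over words, squashes it with a sigmoid ($\delta$), and re-weights the features. A minimal PyTorch sketch of the equation (shapes and the function name are illustrative, not the repo's `MaxSigmoidAttnBlock`):

```python
import torch

def max_sigmoid_attention(x_l, w):
    """X'_l = X_l * sigmoid(max_j(X_l W_j^T)), one gate per pixel.

    x_l: (B, N, D) image features at scale l, N = H * W locations
    w:   (B, C, D) text embeddings, one row per vocabulary word
    """
    # Similarity between every pixel feature and every word: (B, N, C)
    sim = torch.einsum("bnd,bcd->bnc", x_l, w)
    # Max over the C words, then the sigmoid delta: (B, N)
    attn = sim.max(dim=-1).values.sigmoid()
    # Re-weight the image features by the per-location gate
    return x_l * attn.unsqueeze(-1)
```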

Image-Pooling Attention (I-Pooling Attention)

$$W^{\prime} = W + \text{MultiHead-Attention}(W, \tilde{X}, \tilde{X})$$
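Here $\tilde{X}$ are 27 patch tokens obtained by max-pooling the three multi-scale feature maps to $3{\times}3$ each; the text embeddings $W$ act as queries and are updated with image-aware information. A minimal sketch, assuming three PAN feature maps with a shared channel width (class and argument names are illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ImagePoolingAttention(nn.Module):
    """I-Pooling Attention sketch: W' = W + MHA(W, X~, X~)."""

    def __init__(self, dim, num_heads=8):  # dim must divide by num_heads
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, w, feats):
        # feats: list of three (B, D, H, W) multi-scale feature maps
        tokens = [
            F.adaptive_max_pool2d(f, 3).flatten(2).transpose(1, 2)  # (B, 9, D)
            for f in feats
        ]
        x = torch.cat(tokens, dim=1)  # (B, 27, D) pooled tokens X~
        out, _ = self.attn(w, x, x)   # text embeddings W as the queries
        return w + out
```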

Text Contrastive Head

Region-Text Matching

$$s_{k,j} = \alpha \cdot \text{L2-Norm}(e_k) \cdot \text{L2-Norm}(w_j)^{\top} + \beta$$
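The head L2-normalizes both the object embedding $e_k$ and the word embedding $w_j$ before taking their dot product; $\alpha$ and $\beta$ are learnable scaling and shifting factors that stabilize region-text training. A minimal sketch with scalar $\alpha$, $\beta$:

```python
import torch
import torch.nn.functional as F

def region_text_similarity(e, w, alpha=1.0, beta=0.0):
    """s_{k,j} = alpha * L2-Norm(e_k) . L2-Norm(w_j)^T + beta.

    e: (B, K, D) object embeddings, w: (B, C, D) word embeddings.
    alpha/beta are learnable in the paper; scalars here for brevity.
    """
    e = F.normalize(e, dim=-1)  # L2-Norm(e_k)
    w = F.normalize(w, dim=-1)  # L2-Norm(w_j)
    return alpha * torch.einsum("bkd,bcd->bkc", e, w) + beta
```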

Loss

$$\mathcal{L}(I) = \mathcal{L}_{\mathrm{con}} + \lambda_I \cdot \left(\mathcal{L}_{\mathrm{iou}} + \mathcal{L}_{\mathrm{dfl}}\right)$$

$\mathcal{L}_{\mathrm{con}}$ is the region-text contrastive loss.
$\mathcal{L}_{\mathrm{iou}}$ is the IoU loss.
$\mathcal{L}_{\mathrm{dfl}}$ is the distributed focal loss.
$\lambda_I$ is an indicator factor, set to 1 when the input image $I$ comes from detection or grounding data and 0 when it comes from image-text data.
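In code, the indicator simply switches the box-regression terms off for image-text samples, which carry no ground-truth boxes. A sketch (function name is illustrative):

```python
def total_loss(l_con, l_iou, l_dfl, has_boxes):
    """L(I) = L_con + lambda_I * (L_iou + L_dfl).

    has_boxes: True for detection/grounding samples (lambda_I = 1),
    False for image-text samples without box labels (lambda_I = 0).
    """
    lam = 1.0 if has_boxes else 0.0
    return l_con + lam * (l_iou + l_dfl)
```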