Vision models use CLIP or something similar, which has no conception of specific objects in the image. It sees embeddings that correlate with text embeddings. Training takes an image and a description like 'there are birds sitting on a power line in front of a blue sky with some clouds', gets the embedding for that text and the embedding for that picture, and pulls them into alignment. Ask if there are birds in the image and it would know, but not how many, unless captions routinely stated the number of birds sitting on things and that number actually matched what was in the images it trained on. If you want to count objects you want something like YOLO.
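A minimal sketch of the difference, assuming the Hugging Face transformers CLIP checkpoint and the ultralytics package (birds.jpg is a placeholder path): CLIP scores an image against whole captions, so captions that differ only in the count come out nearly tied, while a detector just gives you boxes to tally.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Score one image against captions that differ mainly in the bird count.
# The embeddings capture "birds on a power line" much more strongly than
# the exact number, so the count barely moves the similarity.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("birds.jpg")  # hypothetical local image
captions = [
    "three birds sitting on a power line in front of a blue sky",
    "seven birds sitting on a power line in front of a blue sky",
    "an empty power line with no birds against a blue sky",
]
inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**inputs).logits_per_image  # image-to-text similarity
print(logits.softmax(dim=-1))  # counts barely separate; absence of birds does

# For an actual count, run a detector and tally boxes of the class you care about.
from ultralytics import YOLO

detector = YOLO("yolov8n.pt")  # COCO-pretrained checkpoint
result = detector("birds.jpg")[0]
names = result.names  # class-id -> class-name mapping
bird_count = sum(1 for c in result.boxes.cls.tolist() if names[int(c)] == "bird")
print(f"birds detected: {bird_count}")
```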
VLMs like PaliGemma and Florence-2 support object detection and segmentation, so it's becoming more common to have YOLO-like capabilities built directly into VLMs.
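Florence-2 exposes detection as a task prompt rather than a separate detection head. A rough sketch following its model card (the "<OD>" task token and the post_process_generation helper come from the model's custom code loaded with trust_remote_code, so details may vary between checkpoints):

```python
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "microsoft/Florence-2-base"
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

image = Image.open("birds.jpg")  # same placeholder image as above
task = "<OD>"  # generic object detection task token
inputs = processor(text=task, images=image, return_tensors="pt")
with torch.no_grad():
    ids = model.generate(**inputs, max_new_tokens=512)
text = processor.batch_decode(ids, skip_special_tokens=False)[0]
parsed = processor.post_process_generation(text, task=task, image_size=image.size)
print(parsed)  # e.g. {'<OD>': {'bboxes': [...], 'labels': [...]}}
# Counting is then just tallying parsed[task]["labels"] for the label you want.
```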
Another benefit of VLMs that support object detection is that they are open-vocabulary, meaning you don't have to define the classes ahead of time. Fine-tuning also tends to preserve the existing detection capabilities instead of erasing the previous classes the way fine-tuning YOLO does.
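Open vocabulary in practice just means passing a free-text phrase instead of a fixed class list. Reusing the model and processor from the sketch above; the task token here follows the Florence-2 model card and is an assumption that may differ across versions:

```python
# Detect an arbitrary phrase that was never defined as a class anywhere.
task = "<OPEN_VOCABULARY_DETECTION>"
phrase = "bird perched on a wire"  # free text, not a predefined class
inputs = processor(text=task + phrase, images=image, return_tensors="pt")
with torch.no_grad():
    ids = model.generate(**inputs, max_new_tokens=512)
text = processor.batch_decode(ids, skip_special_tokens=False)[0]
print(processor.post_process_generation(text, task=task, image_size=image.size))
```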