Vision models use CLIP or something similar, which has no conception of specific objects in the image. It sees embeddings that correlate with text embeddings. Training takes an image and a description like 'there are birds sitting on a power line in front of a blue sky with some clouds', gets the embedding for that text and the embedding for that picture, and pulls them into alignment. Ask if there are birds in the image and it would know, but not how many, unless captions routinely stated the number of birds sitting on things and that number actually matched what was in the images it trained on. If you want to count objects you want something like YOLO.
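A minimal sketch of the difference, assuming the Hugging Face transformers CLIP checkpoint and the ultralytics package (birds.jpg is a placeholder path): CLIP scores an image against whole captions, so captions that differ only in the count come out nearly tied, while a detector just gives you boxes to tally.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Score one image against captions that differ mainly in the bird count.
# The embeddings capture "birds on a power line" much more strongly than
# the exact number, so the count barely moves the similarity.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("birds.jpg")  # hypothetical local image
captions = [
    "three birds sitting on a power line in front of a blue sky",
    "seven birds sitting on a power line in front of a blue sky",
    "an empty power line with no birds against a blue sky",
]
inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**inputs).logits_per_image  # image-to-text similarity
print(logits.softmax(dim=-1))  # counts barely separate; absence of birds does

# For an actual count, run a detector and tally boxes of the class you care about.
from ultralytics import YOLO

detector = YOLO("yolov8n.pt")  # COCO-pretrained checkpoint
result = detector("birds.jpg")[0]
names = result.names  # class-id -> class-name mapping
bird_count = sum(1 for c in result.boxes.cls.tolist() if names[int(c)] == "bird")
print(f"birds detected: {bird_count}")
```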
VLMs like PaliGemma and Florence-2 support object detection and segmentation, so it's becoming more common to have YOLO-like capabilities built directly into VLMs.
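Florence-2 exposes detection as a task prompt rather than a separate detection head. A rough sketch following its model card (the "<OD>" task token and the post_process_generation helper come from the model's custom code loaded with trust_remote_code, so details may vary between checkpoints):

```python
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "microsoft/Florence-2-base"
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

image = Image.open("birds.jpg")  # same placeholder image as above
task = "<OD>"  # generic object detection task token
inputs = processor(text=task, images=image, return_tensors="pt")
with torch.no_grad():
    ids = model.generate(**inputs, max_new_tokens=512)
text = processor.batch_decode(ids, skip_special_tokens=False)[0]
parsed = processor.post_process_generation(text, task=task, image_size=image.size)
print(parsed)  # e.g. {'<OD>': {'bboxes': [...], 'labels': [...]}}
# Counting is then just tallying parsed[task]["labels"] for the label you want.
```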
Another benefit of VLMs that support object detection is that they are open-vocabulary, meaning you don't have to define the classes ahead of time. Fine-tuning also tends to preserve the existing detection capabilities instead of erasing the previous classes the way fine-tuning YOLO does.
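Open vocabulary in practice just means passing a free-text phrase instead of a fixed class list. Reusing the model and processor from the sketch above; the task token here follows the Florence-2 model card and is an assumption that may differ across versions:

```python
# Detect an arbitrary phrase that was never defined as a class anywhere.
task = "<OPEN_VOCABULARY_DETECTION>"
phrase = "bird perched on a wire"  # free text, not a predefined class
inputs = processor(text=task + phrase, images=image, return_tensors="pt")
with torch.no_grad():
    ids = model.generate(**inputs, max_new_tokens=512)
text = processor.batch_decode(ids, skip_special_tokens=False)[0]
print(processor.post_process_generation(text, task=task, image_size=image.size))
```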