I found that while CLIPSeg is slower than YOLOn, it is still pretty fast, and it gave me much better results without any training.

If you want to detect objects and speed matters enough that you can’t use an LLM architecture, you can give it a try too.
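
For anyone who wants to try it, here is a minimal sketch of zero-shot detection with CLIPSeg. It assumes the Hugging Face `transformers` implementation and the `CIDAS/clipseg-rd64-refined` checkpoint, neither of which is mentioned above; they are just one common way to run the model. CLIPSeg returns a heatmap per text prompt rather than boxes, so the `mask_to_box` helper, the `detect` function, and the 0.5 threshold are my own illustrative choices, not part of any library.

```python
def mask_to_box(prob_map, threshold=0.5):
    """Return (x_min, y_min, x_max, y_max) around pixels above threshold, or None.

    prob_map is a 2D list of probabilities (rows of pixels).
    Hypothetical helper for illustration; not part of CLIPSeg itself.
    """
    xs, ys = [], []
    for y, row in enumerate(prob_map):
        for x, p in enumerate(row):
            if p > threshold:
                xs.append(x)
                ys.append(y)
    if not xs:
        return None
    return (min(xs), min(ys), max(xs), max(ys))


def detect(image, prompts, threshold=0.5):
    """Run CLIPSeg on a PIL image with a list of text prompts, one box per prompt.

    Boxes are in the coordinates of the model's output heatmap,
    not the original image, so rescale them if you need pixel positions.
    """
    # Imported here so mask_to_box can be used without these packages installed.
    import torch
    from transformers import CLIPSegForImageSegmentation, CLIPSegProcessor

    name = "CIDAS/clipseg-rd64-refined"  # assumed checkpoint
    processor = CLIPSegProcessor.from_pretrained(name)
    model = CLIPSegForImageSegmentation.from_pretrained(name)
    model.eval()

    # The same image is paired with every prompt.
    inputs = processor(
        text=prompts,
        images=[image] * len(prompts),
        padding=True,
        return_tensors="pt",
    )
    with torch.no_grad():
        logits = model(**inputs).logits

    # With a single prompt the batch dimension may be squeezed away.
    if logits.dim() == 2:
        logits = logits.unsqueeze(0)

    probs = torch.sigmoid(logits)
    return {
        prompt: mask_to_box(p.tolist(), threshold)
        for prompt, p in zip(prompts, probs)
    }
```

Calling something like `detect(Image.open("photo.jpg").convert("RGB"), ["a dog", "a bicycle"])` should give one rough box (or `None`) per prompt. The threshold usually needs tuning per prompt, and one box per prompt will merge multiple instances of the same object, so treat this as a starting point.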