Google recently released a paper, "Image Generators are Generalist Vision Learners", about exactly this. They fine-tuned Nano Banana Pro into what they call Vision Banana, which can do segmentation and other vision tasks.

https://arxiv.org/abs/2604.20329

Very interesting. It seems they use functions of the form image(image, text) to process or filter images, effectively generating an arbitrary bitmap(image), where the bitmap has the same dimensions as the input image.
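
To make the image(image, text) idea concrete, here is a rough sketch of the interface only. None of this is from the paper: `toy_image_fn`, the `Image` type, and the prompt string are made-up names, and the brightness threshold is just a stand-in for whatever the fine-tuned model actually does. The point is only that the output is a bitmap with the same height and width as the input.

```python
from typing import Callable, List, Tuple

Pixel = Tuple[int, int, int]   # RGB, 0-255 per channel
Image = List[List[Pixel]]      # rows of pixels

# Hypothetical interface: (image, text) -> image with the same dimensions.
ImageFn = Callable[[Image, str], Image]


def toy_image_fn(image: Image, prompt: str) -> Image:
    """Stand-in for a learned model (an assumption, not the paper's method).

    Supports a single hard-coded prompt and answers it by thresholding
    average brightness, returning a black/white mask as an RGB image.
    """
    if prompt != "segment bright regions":
        raise ValueError(f"toy function does not understand: {prompt!r}")
    white, black = (255, 255, 255), (0, 0, 0)
    return [
        [white if sum(pixel) / 3 > 127 else black for pixel in row]
        for row in image
    ]


def same_shape(a: Image, b: Image) -> bool:
    """True if both images have the same number of rows and columns."""
    return len(a) == len(b) and all(len(ra) == len(rb) for ra, rb in zip(a, b))


# A 2x2 input image: the output mask is also 2x2.
image: Image = [
    [(250, 250, 250), (10, 10, 10)],
    [(0, 0, 0), (200, 180, 220)],
]
mask = toy_image_fn(image, "segment bright regions")
```

In the real thing the text prompt would presumably pick the task (segmentation, depth, and so on) and the model would paint the answer as pixels, but the shape contract would be the same as in this toy.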