In my experience, LLMs tend to take noticeably longer to process images than text.

The image data has to be fetched first, so there's plain I/O time before any processing even starts.

IIRC there's also a pre-processing step (embedding/tokenization?) before images are fed to the LLM?

I hit this issue while optimizing LLM request times. I ended up lowering the image resolution. That cost some accuracy, but it was a trade-off I could live with.
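A rough sketch of that kind of downscaling, not the exact code I ran. `fit_within` and `patch_count` are made-up helper names, and the 16-pixel patch size is only an assumption about how a ViT-style image encoder might tile the input; real models differ, so treat the patch numbers as illustrative only.

```python
import math


def fit_within(width, height, max_side):
    """Scale (width, height) so the longer side is at most max_side.

    Keeps the aspect ratio and never upscales.
    """
    longest = max(width, height)
    if longest <= max_side:
        return width, height
    scale = max_side / longest
    return max(1, round(width * scale)), max(1, round(height * scale))


def patch_count(width, height, patch=16):
    """Rough number of square patches needed to cover the image.

    Assumption: the encoder splits the image into patch x patch tiles.
    """
    return math.ceil(width / patch) * math.ceil(height / patch)


if __name__ == "__main__":
    original = (4000, 3000)
    resized = fit_within(*original, max_side=1024)
    print("resized to:", resized)
    print("patches before:", patch_count(*original))
    print("patches after:", patch_count(*resized))
```

The target size from `fit_within` can then be passed to whatever image library does the actual resize. Under the patch assumption, the work per image shrinks roughly with the pixel count, which would be consistent with the speedup I saw.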

I wonder if the image tokens stay in the prefix cache?