Hacker News

In my experience, LLMs tend to take noticeably longer to process images than text.

It has to fetch the image data first, so part of that is just I/O time before any processing starts.

IIRC there's a pre-processing step (embedding/tokenization?) before images are fed to the LLM?
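
For a rough sense of why that step scales with image size: ViT-style vision encoders cut the image into fixed-size patches, and each patch becomes one token-like embedding. A minimal sketch, assuming a 14-pixel patch and padding up to a whole number of patches. Both are illustrative assumptions, not any particular model's settings, and `patch_count` is a made-up helper name.

```python
import math

def patch_count(width: int, height: int, patch: int = 14) -> int:
    """Patches produced by a ViT-style encoder (assumed: square patches,
    image padded up to a whole number of patches per side)."""
    if width <= 0 or height <= 0 or patch <= 0:
        raise ValueError("width, height and patch must be positive")
    return math.ceil(width / patch) * math.ceil(height / patch)

# Halving each side cuts the patch count to roughly a quarter.
print(patch_count(1024, 1024), patch_count(512, 512))
```

Real pipelines often add tiling, fixed input sizes, or token pooling on top, so treat this as an order-of-magnitude picture only.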

Hit this issue while optimizing LLM request times. Ended up lowering the image resolution. Lost some accuracy, but that was bearable.
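
Roughly what that downscaling can look like, as a sketch: cap the longer side and keep the aspect ratio. The 768-pixel cap and the `fit_within` name are assumptions for illustration, not what any provider recommends.

```python
def fit_within(width: int, height: int, max_side: int = 768) -> tuple[int, int]:
    """Return (w, h) scaled so the longer side is at most max_side.
    Keeps aspect ratio (rounded down to whole pixels) and never upscales."""
    longest = max(width, height)
    if longest <= max_side:
        return width, height
    # Integer math avoids floating-point rounding surprises.
    new_w = max(1, width * max_side // longest)
    new_h = max(1, height * max_side // longest)
    return new_w, new_h

print(fit_within(4000, 3000))  # target size to pass to an image resize call
```

With Pillow, the resulting tuple could be passed to `Image.resize` before encoding the image for the request. The right cap depends on how much accuracy you can give up, so it is worth measuring on your own inputs.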

I wonder if image inputs stay in the prefix cache?