There’s some poor logic in this writeup. Yes, images can contain more information than words, but the extra information an image of a word conveys is usually not relevant to the intent of the communication, at least not for the purposes assumed here. That is, pre-rendering the text you would have typed into ChatGPT and uploading it as an image instead will not better convey the meaning and intent behind your words.

If it gives better results (and no evidence of that is presented), that would be interesting, but it wouldn’t be because of the larger data size of the uploaded image vs. the text.
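Just to make the size point concrete, here’s a rough sketch (assuming Pillow is installed; the prompt and canvas size are arbitrary) comparing the bytes of a prompt as plain text with the same prompt rendered into a PNG. The image comes out far larger, but the extra bytes encode pixels, not any additional meaning.

```python
from io import BytesIO
from PIL import Image, ImageDraw

prompt = "Summarize the attached quarterly report in three bullet points."

# Size of the prompt as plain UTF-8 text.
text_bytes = len(prompt.encode("utf-8"))

# Render the same words onto a small white canvas with the default font,
# then measure the size of the resulting PNG.
img = Image.new("RGB", (640, 40), "white")
ImageDraw.Draw(img).text((10, 10), prompt, fill="black")
buf = BytesIO()
img.save(buf, format="PNG")
image_bytes = buf.tell()

print(f"text: {text_bytes} bytes, PNG render: {image_bytes} bytes")
# The PNG is orders of magnitude larger, yet carries no extra intent.
```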

This seems like the correct answer. However, if we shrank all prompts down to pixel fonts and made the interaction between the user and the LLM a constant exchange of very small GIFs, then that would be something.