This is interesting. There might be an information-theoretic reason — perhaps 'spatial tokenization' is more informative than 'words tokenization'.