This isn't exactly true, as tokens live in the embedding space, which is n-dimensional, like 256 or 512 or whatever (so you might see one word, but it's actually an array of a bunch of numbers). With that said, I think it's pretty intuitive that continuous tokens are more efficient than discrete ones, simply because the LLM itself is basically a continuous function (with coefficients/parameters ∈ ℝ).
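Concretely, a toy PyTorch sketch of what "one word is actually an array of numbers" looks like (the vocab and dimension sizes here are made up for illustration):

    import torch
    import torch.nn as nn

    vocab_size, d_model = 50_000, 512      # illustrative sizes

    embed = nn.Embedding(vocab_size, d_model)

    token_id = torch.tensor([1234])        # one discrete token id
    vec = embed(token_id)                  # its continuous representation
    print(vec.shape)                       # torch.Size([1, 512]) -- 512 real numbers
    print(vec.dtype)                       # torch.float32, i.e. (an approximation of) values in ℝ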
We call an embedding space n-dimensional, but in this context I would consider it 1-dimensional, as in it's a 1D vector of n values. The terminology just sucks. If we described images the same way we describe embeddings, a 2-megapixel image would have to be called 2-million-dimensional (or 8-million-dimensional if we count RGBA as four separate values).
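To make the two counting conventions concrete (numpy, with illustrative sizes):

    import numpy as np

    embedding = np.zeros(512)              # "512-dimensional" embedding: one axis, 512 entries
    image = np.zeros((1080, 1920, 4))      # ~2 MP RGBA image: three axes

    print(embedding.ndim, embedding.size)  # 1 512
    print(image.ndim, image.size)          # 3 8294400 -- ~8 million values if flattened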
I would also argue tokens are outside the embedding space, and that a large part of the magic of LLMs (and many other types of neural network) is the ability to map sequences of rather crude inputs (tokens) into a more meaningful embedding space, and then map from that embedding space back to tokens we humans understand.
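Roughly what I mean, as a PyTorch sketch (toy sizes, and skipping the transformer stack in the middle):

    import torch
    import torch.nn as nn

    vocab_size, d_model = 50_000, 512

    embed = nn.Embedding(vocab_size, d_model)   # crude token ids -> embedding space
    unembed = nn.Linear(d_model, vocab_size)    # embedding space -> scores over tokens

    token_ids = torch.tensor([[17, 42, 99]])    # discrete inputs, outside the embedding space
    x = embed(token_ids)                        # (1, 3, 512) continuous vectors
    # ... transformer layers would operate on x here ...
    logits = unembed(x)                         # (1, 3, 50000) scores per vocabulary token
    next_token = logits[:, -1].argmax(dim=-1)   # back to a discrete token id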
Those are just dimensions of different things, and it's usually pretty clear from context which is meant. A color space has 3 dimensions, or 4 with transparency. An image pixel has 6 dimensions (xy + RGBA) if we take its color into account, but only 2 spatial dimensions. If you think of an image as a function that maps continuous xy coordinates to continuous RGBA values, then you have an infinite-dimensional function space. Embeddings have their own dimensions, but none of them relate to position in the text at hand, which is why text in this context is said to be 1D and an image is said to be 2D.
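In tensor terms (shapes are purely illustrative):

    import torch

    batch, seq_len, d_model = 2, 128, 512
    text = torch.zeros(batch, seq_len, d_model)   # one position axis  -> "1D" data
    image = torch.zeros(batch, 3, 224, 224)       # two spatial axes   -> "2D" data

    # The embedding/channel axis (d_model, or RGB) describes each element,
    # not its position in the text or image -- that's the distinction being made here.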