Since vectors are usually normalized to the surface of an n-sphere and the relevant distance for outputs (via loss functions) is cosine similarity, "near orthogonality" is what matters in practice. This means that during training you want to move unrelated representations apart on the sphere so that their outputs become "more orthogonal". This works especially well since you are stuck with limited-precision floating-point numbers on any realistic hardware anyway.
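A minimal sketch of the near-orthogonality claim (dimension and sample count are arbitrary choices for illustration): random unit vectors in high dimensions have pairwise cosine similarities concentrated near zero, roughly on the order of 1/sqrt(d).

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 10_000, 100  # illustrative dimension and vector count

# Sample Gaussian vectors and project them onto the unit (d-1)-sphere.
v = rng.standard_normal((n, d))
v /= np.linalg.norm(v, axis=1, keepdims=True)

# For unit vectors, pairwise cosine similarity is just the dot product.
cos = v @ v.T
off_diag = cos[~np.eye(n, dtype=bool)]

# Typical |cos| is around 1/sqrt(d) ~ 0.01 here: nearly orthogonal.
print(f"mean |cos|: {np.abs(off_diag).mean():.4f}")
print(f"max  |cos|: {np.abs(off_diag).max():.4f}")
```

Even the worst pair out of thousands stays far from |cos| = 1, which is why you can pack exponentially many "almost distinct" directions into one space.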

Btw. this is not an original idea from the linked blog or the YouTube video it references. As far as I know, the relevance of this lemma for AI (or at least neural machine learning) was brought up more than a decade ago by C. Eliasmith. So it has been around long before architectures like GPT that could realistically be trained on such insanely high-dimensional world knowledge.

> Since vectors are usually normalized to the surface of an n-sphere (...)

In classification tasks, each feature is normalized independently. Otherwise, an entry with features Foo and Bar would appear less Foo after normalization depending on the value of Bar.

These vectors are not normalized onto n-spheres, and their codomain ends up being a hypercube.
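A quick sketch of the distinction (the data and scales are made up for illustration): per-feature min-max scaling maps every entry into [0, 1], so the codomain is the unit hypercube, while row norms vary freely, i.e. rows do not land on an n-sphere.

```python
import numpy as np

rng = np.random.default_rng(0)
# Three features with wildly different scales (illustrative).
X = rng.standard_normal((200, 3)) * np.array([1.0, 50.0, 0.1])

# Normalize each feature (column) independently to [0, 1].
X_scaled = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# Every entry lands in [0, 1]: the codomain is the hypercube [0, 1]^3.
print(X_scaled.min(), X_scaled.max())  # 0.0 1.0

# ...but the row norms are all over the place: not an n-sphere.
norms = np.linalg.norm(X_scaled, axis=1)
print(f"row norm range: {norms.min():.3f} .. {norms.max():.3f}")
```

Contrast with L2 normalization (`X / ||X||` per row), which does put every row on the unit sphere but destroys the independent per-feature scaling.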