I think the author is too focused on the case where all vectors are orthogonal and, as a consequence, overestimates how much error would be acceptable in practice. The challenge isn't keeping orthogonal vectors almost orthogonal, but preserving the distance ordering among vectors that are far from orthogonal. Even much smaller values of epsilon can give you trouble there.
So the claim that "This research suggests that current embedding dimensions (1,000-20,000) provide more than adequate capacity for representing human knowledge and reasoning." is way too optimistic in my opinion.
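To make that concrete, here is a rough numpy sketch (the dimensions and the 0.72/0.70 similarities are made-up illustrative values): two items whose true cosine similarities to a query differ by only 0.02 can swap order under a JL-style random projection, even though each individual similarity is preserved to within a small epsilon.

    import numpy as np

    rng = np.random.default_rng(0)
    d, k = 2048, 256   # made-up original / projected dimensions

    def unit(v):
        return v / np.linalg.norm(v)

    q = unit(rng.normal(size=d))

    def with_cosine(c):
        # build a unit vector whose cosine similarity to q is exactly c
        r = rng.normal(size=d)
        r = unit(r - (r @ q) * q)              # component orthogonal to q
        return c * q + np.sqrt(1 - c**2) * r

    a = with_cosine(0.72)   # the genuinely closer item
    b = with_cosine(0.70)   # barely farther from the query

    flips, trials = 0, 100
    for _ in range(trials):
        P = rng.normal(size=(k, d)) / np.sqrt(k)      # JL-style Gaussian projection
        qp, ap, bp = unit(P @ q), unit(P @ a), unit(P @ b)
        if qp @ ap < qp @ bp:                         # the "closer" item changed
            flips += 1

    print("true cosines:", round(q @ a, 3), round(q @ b, 3))
    print(f"ordering flipped in {flips} of {trials} random projections")

The exact flip count depends on the random draws, but the point is that a per-pair error well under the usual JL epsilon is enough to scramble fine-grained distance orderings.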
Since vectors are usually normalized to the surface of an n-sphere, and the relevant distance for outputs (via the loss function) is cosine similarity, "near orthogonality" is what matters in practice. This means that during training you want to move unrelated representations around on the sphere so that they become "more orthogonal" in the outputs. This works especially well since you are stuck with limited-precision floating-point numbers on any realistic hardware anyway.
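A quick sketch of why "near orthogonality" is cheap in high dimensions (the sizes here are arbitrary): pairwise cosine similarities of random unit vectors concentrate around 0 with spread roughly 1/sqrt(d), so the higher the dimension, the less training has to fight for it.

    import numpy as np

    rng = np.random.default_rng(1)
    n = 1000                                             # number of random directions (arbitrary)
    for d in (16, 256, 4096):
        X = rng.normal(size=(n, d))
        X /= np.linalg.norm(X, axis=1, keepdims=True)    # project rows onto the unit sphere
        cos = X @ X.T                                    # all pairwise cosine similarities
        off = cos[np.triu_indices(n, k=1)]               # keep only distinct pairs
        print(f"d={d:5d}  std={off.std():.4f}  max|cos|={np.abs(off).max():.4f}")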
Btw, this is not an original idea from the linked blog or the YouTube video it references. The relevance of this lemma for AI (or at least for neural machine learning) was brought up more than a decade ago by C. Eliasmith, as far as I know. So it has been around since long before architectures like GPT that could actually be realistically trained on such insanely high-dimensional world knowledge.
> Since vectors are usually normalized to the surface of an n-sphere (...)
In classification tasks, each feature is normalized independently. Otherwise, an entry with features Foo and Bar would be made out to be less Foo just because its Bar value is large, once the row is normalized as a whole.
These vectors are not normalized onto n-spheres, and their codomain ends up being a hypercube.
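To illustrate the distinction (toy data, made-up feature names): per-feature scaling treats each coordinate independently and maps rows into a hypercube, while per-row L2 normalization puts rows on the unit sphere and lets a large Bar value shrink the Foo component.

    import numpy as np

    rng = np.random.default_rng(2)
    # toy rows with two made-up features, "Foo" and "Bar", on very different scales
    X = np.abs(rng.normal(size=(5, 2))) * np.array([1.0, 100.0])

    # per-feature (column-wise) min-max scaling: each feature independently
    # mapped into [0, 1], so rows live in a hypercube and Bar never dilutes Foo
    X_cube = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

    # per-row L2 normalization: rows land on the unit sphere, and a large Bar
    # value shrinks the Foo component of the same row
    X_sphere = X / np.linalg.norm(X, axis=1, keepdims=True)

    print(np.linalg.norm(X_sphere, axis=1))          # all 1.0 -> points on a sphere
    print(X_cube.min(axis=0), X_cube.max(axis=0))    # 0s and 1s -> inside a unit cube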
I agree the OP's argument is a bad one. But I'm still optimistic about the representational capacity of those 20k dimensions.
I also doubt that all vectors are orthogonal and/or independent.
Re: distance metrics and curvilinear spaces and skew coordinates: https://news.ycombinator.com/item?id=41873650 :
> How does the distance metric vary with feature order?
> Do algorithmic outputs diverge or converge given variance in sequence order of all orthogonal axes? Does it matter which order the dimensions are stated in; is the output sensitive to feature order, but does it converge regardless? [...]
>> Are the [features] described with high-dimensional spaces really all 90° geometrically orthogonal?
> If the features are not statistically independent, I don't think it's likely that they're truly orthogonal; which might not affect the utility of a distance metric that assumes that they are all orthogonal
Which statistical models disclaim that their output may not be significant if used with non-independent features? Naive Bayes, Linear Regression and Logistic Regression, LDA, PCA, and linear models in general are unreliable with non-independent features.
What are some of the hazards of L1 (Lasso) and L2 (Ridge) regularization? What are some of the worst cases with outliers? What does regularization do if applied to non-independent and/or non-orthogonal and/or non-linear data?
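For the non-independence part, here is a rough numpy sketch (toy data, arbitrary noise levels): with two nearly collinear features, plain least squares tends to produce wildly unstable coefficients, while a small L2 (ridge) penalty pulls them back toward something sensible, at the price of bias.

    import numpy as np

    rng = np.random.default_rng(3)
    n = 200
    x1 = rng.normal(size=n)
    x2 = x1 + 0.001 * rng.normal(size=n)        # nearly a copy of x1 (non-independent)
    X = np.column_stack([x1, x2])
    y = x1 + x2 + 0.1 * rng.normal(size=n)      # true coefficients: (1, 1)

    def ridge(X, y, lam):
        # closed-form L2-regularized least squares; lam = 0 is plain OLS
        return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

    print("OLS  :", ridge(X, y, 0.0))   # often far from (1, 1), with opposite-sign swings
    print("ridge:", ridge(X, y, 1.0))   # shrunk back toward roughly (1, 1)

An L1 (Lasso) penalty would instead tend to keep one of the correlated pair and zero out the other more or less arbitrarily, which is another way non-independence bites.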
Impressive but probably insufficient because [non-orthogonality] cannot be so compressed.
There is also the standing question of whether there can be simultaneous encoding in a fundamental gbit.