I envy your intuition about high-dimensional spaces, as I have none (other than "here lies dragons"). (I think your intuition is broadly correct, seeing as billions of collision tests feels quite inadequate given the size of the space.)

> Just intuitively, in such a high dimensional space, two random vectors are basically orthogonal.

What's the intuition here? Law of large numbers?

And how is orthogonality related to distance? Expansion of |a-b|^2 = |a|^2 + |b|^2 - 2<a,b> = 2 - 2<a,b> for unit vectors, which is roughly 2 if they are basically orthogonal?
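A quick numpy sanity check (my own, not from the thread) of that expansion and of the near-orthogonality claim, using 768 dimensions since that number comes up later in the thread:

```python
import numpy as np

# Two random unit vectors in 768 dimensions.
rng = np.random.default_rng(0)
n = 768
a = rng.standard_normal(n); a /= np.linalg.norm(a)
b = rng.standard_normal(n); b /= np.linalg.norm(b)

dot = a @ b
dist_sq = np.linalg.norm(a - b) ** 2

# The identity |a-b|^2 = 2 - 2<a,b> holds for unit vectors...
assert np.isclose(dist_sq, 2 - 2 * dot)
# ...and the dot product is tiny (on the order of 1/sqrt(768) ~ 0.036),
# so the squared distance is close to 2.
print(dot, dist_sq)
```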

> Since the outputs are normalized, that corresponds to a ridiculously tiny patch on the surface of the unit sphere.

I also have no intuition regarding the surface of the unit sphere in high-dimensional vector spaces. I believe it vanishes. I suppose this patch also vanishes in terms of area. But what's the relative rate of those terms going to zero?

> > Just intuitively, in such a high dimensional space, two random vectors are basically orthogonal.

> What's the intuition here? Law of large numbers?

Imagine for simplicity that we consider only vectors pointing parallel/antiparallel to coordinate axes.

- In 1D, you have two possibilities: {+e_x, -e_x}. So if you pick two random vectors from this set, the probability of getting something orthogonal is 0.

- In 2D, you have four possibilities: {±e_x, ±e_y}. If we pick one random vector and get e.g. +e_x, then picking another one randomly from the set has a 50% chance of getting something orthogonal (±e_y are 2/4 possibilities). Same for other choices of the first vector.

- In 3D, you have six possibilities: {±e_x, ±e_y, ±e_z}. Repeat the same experiment, and you'll find a 66.7% chance of getting something orthogonal.

- In the limit of ND, you can see that the chance of getting something orthogonal is 1 - 1/N, which tends to 100% as N becomes large.

Now, this discretization is a simplification of course, but I think it gets the intuition right.
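To back up the discretized argument, here's a small Monte Carlo check I put together (not part of the original comment): draw two random signed axis vectors in N dimensions and count how often they land on different axes, i.e. are orthogonal.

```python
import numpy as np

rng = np.random.default_rng(1)

def orthogonal_fraction(N, trials=100_000):
    """Fraction of random pairs of signed axis vectors that are orthogonal.

    A signed axis vector is +/- e_i; two of them are orthogonal exactly
    when they lie on different axes, which happens with probability 1 - 1/N.
    """
    axis1 = rng.integers(0, N, trials)
    axis2 = rng.integers(0, N, trials)
    return np.mean(axis1 != axis2)

for N in (2, 3, 100):
    print(N, orthogonal_fraction(N))  # roughly 0.5, 0.667, 0.99
```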

I think that's a good answer for practical purposes.

Theoretically, I can claim that N random vectors of zero-mean real numbers (say standard deviation of 1 per element) will "with probability 1" span an N-dimensional space. I can even grind on, subtracting the parallel parts of each vector pair, until I have N orthogonal vectors. ("Gram-Schmidt" from high school.) I believe I can "prove" that.

So then mapping using those vectors is "invertible." Nyeah. But back in numerical reality, I think the resulting inverse will become practically useless as N gets large.
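A sketch of that claim in numpy (my own illustration, with QR standing in for Gram-Schmidt since it's the numerically stable way to do the same orthogonalization): N random Gaussian vectors form an invertible matrix with probability 1, and orthogonalizing them gives N mutually orthogonal vectors.

```python
import numpy as np

rng = np.random.default_rng(3)
N = 256
# N random Gaussian vectors as the columns of V; with probability 1 they
# span R^N, so V is invertible.
V = rng.standard_normal((N, N))

# QR factorization: Q's columns are the Gram-Schmidt-orthonormalized
# versions of V's columns.
Q, R = np.linalg.qr(V)
assert np.allclose(Q.T @ Q, np.eye(N), atol=1e-10)

# The condition number is finite, so V is invertible in exact arithmetic;
# how usable the inverse stays as N grows is the numerical question above.
print(np.linalg.cond(V))
```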

That's without the nonlinear elements. Which are designed to make the system non-invertible. It's not shocking if someone proves mathematically that this doesn't quite technically work. I think it would only be interesting if they can find numerically useful inverses for an LLM that has interesting behavior.

All -- I haven't thought very clearly about this. If I've screwed something up, please correct me gently but firmly. Thanks.

For 768 dimensions, you'd still expect to hit the 1/N case with a few billion samples, though. That's a 1/N of about 0.13%, which quite frankly isn't that rare at all?

Of course our vectors aren't restricted to the coordinate axes, but 1/N still isn't that small compared to billions of samples.


> What's the intuition here? Law of large numbers?

For unit vectors the cosine of the angle between them is a1*b1+a2*b2+...+an*bn.

Each of the terms has mean 0 and when you sum many of them the sum concentrates closer and closer to 0 (intuitively the positive and negative terms will tend to cancel out, and in fact the standard deviation is 1/√n).
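An empirical check of that concentration (my addition): sample many pairs of random unit vectors and look at the spread of their cosines, which should be centered on 0 with standard deviation about 1/√n.

```python
import numpy as np

rng = np.random.default_rng(2)
n, trials = 768, 20_000

# Batches of random unit vectors, normalized row by row.
a = rng.standard_normal((trials, n))
b = rng.standard_normal((trials, n))
a /= np.linalg.norm(a, axis=1, keepdims=True)
b /= np.linalg.norm(b, axis=1, keepdims=True)

# Cosine of the angle for each pair: a1*b1 + ... + an*bn.
cosines = np.sum(a * b, axis=1)
print(cosines.mean())                 # close to 0
print(cosines.std(), 1 / np.sqrt(n))  # both roughly 0.036
```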

> > Just intuitively, in such a high dimensional space, two random vectors are basically orthogonal.

> What's the intuition here? Law of large numbers?

Yep, the large number being the number of dimensions.

As you add another dimension to a random point on a unit sphere, you create another new way for that point to be far away from a starting neighbor. Increase the dimensions a lot and almost all random neighbors end up on the equator relative to the starting point. The 'equator' here is the intersection of the unit sphere with the hyperplane (just like a 2D plane in 3D) of dimension n-1 whose normal is the starting point; that intersection is an (n-2)-dimensional 'variety', or shape, embedded in the original n-dimensional space, just like the earth's equator is a 1-dimensional object.

The mathematical name for this is 'concentration of measure' [1].

It feels weird to think about, but there's also a unit change in here. Paris is about 1/8 of a circle away from the north pole (8 such angle segments of freedom), on a circle. But if that were the whole definition of Paris's location, then on the 3D earth there would be infinitely many Parises. There is only one, though. Once we take longitude into account, we also get Montreal, Vancouver, Tokyo, etc., each 1/8 away (and now we have 64 solid-angle segments of freedom).

[1] https://www.johndcook.com/blog/2017/07/13/concentration_of_m...

> What's the intuition here? Law of large numbers?

"Concentration of measure"

https://en.wikipedia.org/wiki/Concentration_of_measure