As I read it, what they did there was a sanity-check by trusting the birthday paradox. Kind of: "If you get orthogonal vectors due to mere chance once, that's okay, but you try it billions of times and still get orthogonal vectors every time, mere chance seems a very unlikely explanation."
This has nothing to do with the birthday paradox. That paradox presumes a small countable state space (365) and a large enough # of observations.
In this case, it's a mathematical fact that two random vectors in a high-dimensional space are very likely to be nearly orthogonal.
A slightly stronger (and more relevant) statement is that the number of mutually nearly orthogonal vectors you can simultaneously pack into an N-dimensional space is exponential in N. Here “mutually nearly orthogonal” can be formally defined as: choose some threshold epsilon>0; the set S of unit vectors is mutually nearly orthogonal if the maximum absolute value of the pairwise dot products between distinct members of S is less than epsilon. The exponential growth of the size of this set with N holds (amazingly) for any value of epsilon (although the rate of growth does obviously depend on that value).
This is pretty unintuitive for us 3D beings.
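For anyone who wants to check the near-orthogonality claim numerically, here is a minimal sketch in plain Python. The function names, the choice of 50 vectors, and the dimensions 3, 30 and 768 are illustrative assumptions of mine, not anything taken from the work being discussed. It samples random unit vectors and reports the largest absolute pairwise dot product, which should shrink as the dimension grows.

```python
import math
import random

def random_unit_vector(dim, rng):
    # Gaussian components give a uniformly random direction after normalising.
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def max_abs_cosine(dim, n_vectors=50, seed=0):
    """Largest |dot product| over all pairs of n_vectors random unit vectors."""
    rng = random.Random(seed)
    vs = [random_unit_vector(dim, rng) for _ in range(n_vectors)]
    worst = 0.0
    for i in range(n_vectors):
        for j in range(i + 1, n_vectors):
            dot = sum(a * b for a, b in zip(vs[i], vs[j]))
            worst = max(worst, abs(dot))
    return worst

if __name__ == "__main__":
    # Hypothetical dimensions chosen for illustration only.
    for dim in (3, 30, 768):
        print(dim, round(max_abs_cosine(dim), 3))
```

In 3 dimensions some pair among the 50 ends up almost parallel, while in 768 dimensions even the worst pair stays close to a right angle.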
Edit: there are other clarifications, eg authors on X, so this comment is irrelevant.
The birthday paradox relies on there being a small number of possible birthdays (365-366).
There are not a small number of dimensions being used in the LLM.
The GP argument makes sense to me.
It doesn't need a small number -- rather it relies on you being able to find a pairing amongst any of your candidates, rather than find a pairing for a specific birthday.
That's the paradoxical part: the number of potential pairings for a very small number of people is much higher than one might think, and so for 365 options (in the birthday example) you can get even chances with far fewer than 365, and even far fewer than ½ × 365 people.
I think you're misunderstanding. If you have an extremely large number of possible birthdays, like 2^256, you will almost certainly never find two people with the same birthday (this is why a SHA-256 collision has never been found). That's what the top-level comment was comparing this to.
We're not using precise numbers here, but a large number of dimensions leads to a very large number of options. 365 is only about 19^2, but 2^100 is astronomically larger than 10^9.
The birthday paradox threshold is approximately the square root of the number of possibilities. You expect to find a collision in 365 possibilities in ~sqrt(365) = ~19 tries.
You expect to find a collision in 2^256 possibilities in ~sqrt(2^256) = ~2^128 tries.
You expect to find a collision in 10^10000 possibilities in ~sqrt(10^10000) = ~10^5000 tries.
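To put numbers on the square-root rule, here is a small sketch using the standard approximation P(collision) ≈ 1 - exp(-k(k-1)/(2N)) for k draws from N equally likely options. The function names are mine, and the formula is an approximation rather than the exact birthday probability. Note that it puts the even-odds point at about sqrt(2 ln 2 · N), roughly 1.18 · sqrt(N), so the plain sqrt(N) figure is an order-of-magnitude estimate.

```python
import math

def collision_probability(n_draws, n_options):
    """Approximate P(at least one repeat) among n_draws uniform picks from n_options."""
    # 1 - exp(-k(k-1) / (2N)); expm1 keeps precision when the result is tiny.
    return -math.expm1(-n_draws * (n_draws - 1) / (2 * n_options))

def draws_for_even_odds(n_options):
    """Approximate number of draws giving a 50% chance of a repeat."""
    return math.sqrt(2 * math.log(2) * n_options)

if __name__ == "__main__":
    print(collision_probability(23, 365))        # close to one half
    print(draws_for_even_odds(365))              # between 22 and 23
    print(collision_probability(10**9, 2**256))  # vanishingly small
```

A billion draws against 2^256 options gives a collision probability that is effectively zero, which is the SHA-256 point made above.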
The number of dimensions used is 768, wrote someone, and that isn't really very different from 365. But even if the number were big, it could hardly escape fate: x has to be very big to keep (1-(1/x))¹⁰⁰⁰⁰⁰⁰⁰⁰⁰ near 1.
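How big "very big" has to be can be sketched directly, since (1 - 1/x)^n is approximately exp(-n/x). The snippet below is only an illustration: the function name is mine, n = 10^9 is taken from the exponent above, and the sample values of x are arbitrary.

```python
import math

def prob_no_hit(x, n_trials=10**9):
    """(1 - 1/x) ** n_trials, computed via log1p to stay accurate for huge x."""
    return math.exp(n_trials * math.log1p(-1.0 / x))

if __name__ == "__main__":
    # Arbitrary sample values of x, from 768 up to 2^256.
    for x in (768, 1e6, 1e9, 1e12, 2.0**256):
        print(f"x = {x:.3g}: {prob_no_hit(x):.6f}")
```

The value only gets close to 1 once x is far larger than the number of trials, so whether this argument bites depends entirely on whether the relevant x is 768 or something like 2^256.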
Just to clarify, the total number of possible birthdays is 365 (Jan 1 through Dec 31), but a 768-dimensional continuous vector means there are 768 numbers, each of which can have values from -1 to 1 (at whatever precision floating point can represent). One float has about 2B numbers between -1 and 1 iirc, so 2B ^ 768 is a lot more than 365.
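The "about 2B" figure can be checked by counting bit patterns, assuming the floats are IEEE 754 float32 (an assumption on my part; the thread doesn't say which precision is used). This sketch uses only the standard library, and the helper name is mine.

```python
import math
import struct

def float32_bits(x):
    """Bit pattern of x as an IEEE 754 float32, read as an unsigned 32-bit integer."""
    return struct.unpack("<I", struct.pack("<f", x))[0]

# Non-negative float32 values sort in the same order as their bit patterns,
# so the patterns 0 .. bits(1.0) enumerate every float32 in [0, 1].
nonneg = float32_bits(1.0) + 1
# Mirror for the negatives and count zero once (+0.0 and -0.0 compare equal).
values_in_unit_range = 2 * nonneg - 1

# Base-10 exponent of values_in_unit_range ** 768, i.e. the number of
# distinct 768-component vectors with every component in [-1, 1].
log10_states = 768 * math.log10(values_in_unit_range)

if __name__ == "__main__":
    print(values_in_unit_range)  # 2130706433
    print(log10_states)
```

So the count of representable vectors is on the order of 10^7164, against 365 possible birthdays.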
I may have misunderstood — don't they test for orthogonality? Orthogonality would seem to drop much of the information in the vectors.
That assumes the random process by which vectors are generated places them at random angles to each other. It doesn't: it places them almost always very, very nearly at (high-dim) right angles.
The underlying geometry isn't random; to this order, it's deterministic.