Edit: there are other clarifications, eg authors on X, so this comment is irrelevant.
The birthday paradox relies on there being a small number of possible birthdays (365-366).
The LLM is not using a small number of dimensions.
The GP argument makes sense to me.
It doesn't need a small number -- rather it relies on you being able to find a pairing amongst any of your candidates, rather than finding a pairing for a specific birthday.
That's the paradoxical part: the number of potential pairings among a very small number of people is much higher than one might think, so with 365 options (in the birthday example) you get even odds with far fewer than 365 people, and even far fewer than ½ × 365.
I think you're misunderstanding. If you have an extremely large number like 2^256 you will almost certainly never find two people with the same birthday (this is why a SHA256 collision has never been found). That's what the top-level comment was comparing this to.
We're not using precise numbers here, but a large number of dimensions leads to a very large number of options. 365 is only about 19², but 2^100 is astronomically larger than 10^9.
The birthday paradox equation is approximately the square root. You expect to find a collision in 365 possibilities in ~sqrt(365) = ~19 tries.
You expect to find a collision in 2^256 possibilities in ~sqrt(2^256) = ~2^128 tries.
You expect to find a collision in 10^10000 possibilities in ~sqrt(10^10000) = ~10^5000 tries.
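A quick back-of-envelope check of the square-root rule above, using the standard approximation P(collision) ≈ 1 − exp(−k(k−1)/2N) for k draws from N options (the formula is a textbook estimate, not something from the parent comment):

```python
import math

def collision_prob(n_options, k_draws):
    # P(at least one collision) ~= 1 - exp(-k(k-1) / (2N))
    # for k draws from N equally likely options
    return 1 - math.exp(-k_draws * (k_draws - 1) / (2 * n_options))

# ~sqrt(N) draws gives roughly even odds regardless of how big N is
print(collision_prob(365, 19))         # ~0.37
print(collision_prob(365, 23))         # ~0.50 -- the classic 23-people result
print(collision_prob(2**256, 2**128))  # ~0.39, but 2^128 tries is infeasible
```

The point is that the ~50% threshold scales with sqrt(N): trivial for N = 365, hopeless for N = 2^256.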
The number of dimensions used is 768, wrote someone, and that isn't really very different from 365. But even if the number were big, it could hardly escape fate: x has to be very big to keep (1-(1/x))¹⁰⁰⁰⁰⁰⁰⁰⁰⁰ near 1.
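To put numbers on "x has to be very big": for large x, (1 − 1/x)^N ≈ exp(−N/x), so a sketch of how that expression behaves for a few candidate values of x (the choice of values is mine, for illustration):

```python
import math

N = 100_000_000  # the exponent from the expression above

# (1 - 1/x)^N ~= exp(-N/x) when x is large
for x in (10**8, 10**9, 10**10, 2**256):
    print(x, math.exp(-N / x))
# x = 10^8  -> ~0.37  (collisions near certain)
# x = 10^9  -> ~0.90
# x = 10^10 -> ~0.99
# x = 2^256 -> ~1.0   (no collision, effectively ever)
```

So x needs to dwarf N before the no-collision probability stays near 1.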
Just to clarify, the total dimension of birthdays is 365 (Jan 1 through Dec 31), but a 768-dimensional continuous vector means there are 768 numbers, each of which can have values from -1 to 1 (at whatever precision floating point can represent). 1 float has about 2B representable values between -1 and 1 iirc, so 2B^768 is a lot more than 365.
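Taking the comment's rough figures at face value (~2 billion float values per coordinate is the comment's own "iirc" estimate, not a verified count), the size of that space works out to:

```python
import math

values_per_dim = 2_000_000_000  # rough per-coordinate count from the comment
dims = 768

# log10 of the number of distinct vectors: 768 * log10(2e9)
log10_states = dims * math.log10(values_per_dim)
print(f"~10^{log10_states:.0f} possible vectors, vs 365 birthdays")
```

That's on the order of 10^7000, so even the birthday-paradox sqrt(N) shortcut leaves you needing ~10^3500 samples for a likely collision.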
I may have misunderstood — don't they test for orthogonality? Orthogonality would seem to drop much of the information in the vectors.