I'm the person who ran the test.

To hopefully clarify a bit...

I intentionally chose input data large enough that the LLM would score around 50% accuracy, in order to maximise the discriminative power of the test.

Can you expand on how you did this?

I ran a small test with just a couple of formats and about 100 records, saw that the accuracy was higher than I wanted, then increased the number of records until the accuracy dropped to roughly 50% (e.g. 100 -> 200 -> 500 -> 1000, though I forget the precise numbers).
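
As a rough sketch of that procedure (not the actual script): step through increasing record counts and stop at the first one where accuracy has fallen to about the target. The `measure_accuracy` callable, the 0.05 tolerance, and the `fake_accuracy` curve are all made up for illustration; only the 100 -> 200 -> 500 -> 1000 progression and the 50%-ish target come from the description above.

```python
from typing import Callable, Iterable, Optional, Tuple


def calibrate_record_count(
    measure_accuracy: Callable[[int], float],
    sizes: Iterable[int] = (100, 200, 500, 1000),
    target: float = 0.5,
    tolerance: float = 0.05,  # assumed; "50%-ish" is not defined precisely
) -> Optional[Tuple[int, float]]:
    """Return the first (record_count, accuracy) where accuracy is at or below
    target + tolerance, or None if no size in `sizes` gets that low."""
    for n_records in sizes:
        accuracy = measure_accuracy(n_records)
        if accuracy <= target + tolerance:
            return n_records, accuracy
    return None


def fake_accuracy(n_records: int) -> float:
    # Hypothetical stand-in for running the LLM on n_records and scoring it.
    # The curve is invented so the example runs; it is not measured data.
    return 0.4 + 100 / (100 + n_records)


if __name__ == "__main__":
    print(calibrate_record_count(fake_accuracy))
```

In a real run, `measure_accuracy` would build an input with that many records, query the model, and return the fraction of answers that were correct.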