I'm surprised by the accuracy. In practice, I feel like I generally get much better results.

I'm the person who ran the test.

The context I used in the test was pretty large. You'll see much better (near 100%) accuracy if you're using smaller amounts of context.

[I chose the context size so that the LLM would be scoring in the ballpark of 50% accuracy (with variation between formats) to maximise the discriminative power of the test.]
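To illustrate why a roughly 50% baseline helps, here is a rough sketch. It assumes accuracy follows a logistic curve in some latent "ease" score and that switching formats shifts that score by a fixed amount. Both the model and the `logit_shift` value are assumptions for illustration, not something measured in the test. Under that assumption, the same shift produces the largest accuracy gap near 50% and almost none near 100%.

```python
import math

def accuracy(logit: float) -> float:
    """Logistic link: map a latent 'ease' score to an expected accuracy."""
    return 1.0 / (1.0 + math.exp(-logit))

def format_gap(base_accuracy: float, logit_shift: float = 0.3) -> float:
    """Accuracy gap between two formats that differ by a fixed logit shift.

    logit_shift is a made-up value used only for illustration.
    """
    base_logit = math.log(base_accuracy / (1.0 - base_accuracy))
    return accuracy(base_logit + logit_shift) - base_accuracy

# Same underlying format effect, evaluated at different baseline accuracies.
gaps = {p: format_gap(p) for p in (0.5, 0.7, 0.9, 0.99)}
for p, gap in gaps.items():
    print(f"baseline {p:.2f}: gap {gap:.4f}")
```

The gap shrinks steadily as the baseline moves toward 100%, which is the ceiling effect that a mid-range baseline avoids.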

Do you measure your results in a repeatable way? In a way where your hypotheses about accuracy are falsifiable? Or do they just “feel” right?
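A repeatable measurement doesn't need much. Here is a minimal sketch, assuming a hypothetical `ask` callable that wraps whatever model you're using and a list of `(question, expected_answer)` pairs. Neither is from the test above. It fixes the random seed so the same questions are sampled on every run, and reports a 95% Wilson score interval so the accuracy claim comes with an error bar.

```python
import math
import random

def wilson_interval(correct: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = correct / total
    denom = 1 + z**2 / total
    centre = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return centre - half, centre + half

def evaluate(ask, questions, sample_size=100, seed=0):
    """Score `ask` on a seeded sample of (question, expected) pairs.

    `ask` is a hypothetical stand-in for a call to your model.
    """
    rng = random.Random(seed)  # fixed seed: same sample on every run
    sample = rng.sample(questions, min(sample_size, len(questions)))
    correct = sum(ask(q) == expected for q, expected in sample)
    return correct, len(sample)

# Usage with a stub in place of a real model.
def stub_model(question: str) -> str:
    return question.upper()

questions = [("a", "A"), ("b", "B"), ("c", "x")]
correct, total = evaluate(stub_model, questions)
low, high = wilson_interval(correct, total)
print(f"{correct}/{total} correct, 95% CI [{low:.2f}, {high:.2f}]")
```

If two setups have overlapping intervals, the data doesn't yet support saying one is better, which is what makes the claim falsifiable.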