Usually I find this kind of variation is due to context management.

Accuracy can decreases at large context sizes. OpenAI's compaction handles this better than anyone else, but it's still an issue.

If you are seeing this kind of thing start a new chat and re-run the same query. You'll usually see an improvement.

This is called context rot