> You change the prompt slightly, in a way where a human who understands the topic would still trivially give the right response, and the LLM outputs an answer that is both wrong/irrelevant and unpredictably, non-humanly wrong: no human who showed understanding with the first answer could be expected to answer the second question in the same bizarre manner as the LLM.
I think this should make you question whether the prompt change was really as trivial as you imply. Providing an example would help clarify.
Here's an entire paper [0] showing the impact of extremely minor structural changes on the quality of a model's results. Something as simple as omitting a colon in the prompt can lead to notably degraded (or improved) performance.
0. https://arxiv.org/pdf/2310.11324
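
To make that concrete, here's a minimal sketch of the kind of perturbation the paper studies: the same question asked with and without a trailing colon on the answer cue. (This assumes an OpenAI-style chat API; the model name, templates, and question are placeholders, not the paper's actual setup.)

    # Two prompts that differ only in whether the answer cue ends with a colon.
    from openai import OpenAI  # assumes the `openai` package and an API key are configured

    client = OpenAI()

    QUESTION = "Which planet in our solar system has the most moons?"

    TEMPLATES = {
        "with_colon":    "Question: {q}\nAnswer:",
        "without_colon": "Question: {q}\nAnswer",
    }

    for name, template in TEMPLATES.items():
        prompt = template.format(q=QUESTION)
        resp = client.chat.completions.create(
            model="gpt-4o-mini",   # any chat model; the point is the prompt diff
            messages=[{"role": "user", "content": prompt}],
            temperature=0,         # reduce sampling noise to isolate the formatting effect
        )
        print(f"--- {name} ---")
        print(resp.choices[0].message.content)

Run over a benchmark rather than a single question, differences like this are where the paper reports the accuracy swings.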