> How well does this work when you slightly change the question? Rephrase it, or use a bicycle/truck/ship/plane instead of car?
I didn't test this, but I suspect current SotA models would get variations within that specific class of question correct if they were forced into their advanced/deep modes, which invoke MoE-style (or similar) reasoning structures.
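For concreteness, this is roughly what I mean by "forcing" it, assuming an OpenAI-style API (the reasoning_effort parameter exists in their chat completions API for reasoning models; the model name below is just a placeholder):

    # Illustrative only: explicitly requesting the expensive reasoning tier
    # instead of letting the vendor's router decide how hard to think.
    from openai import OpenAI

    client = OpenAI()
    response = client.chat.completions.create(
        model="o3-mini",               # placeholder: any reasoning-capable model
        reasoning_effort="high",       # force the deep/advanced mode
        messages=[{"role": "user", "content": "<the one-sentence trick question>"}],
    )
    print(response.choices[0].message.content)

Consumer chat UIs expose the same thing as a "think harder" / extended-thinking toggle.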
I assumed failures on the original question were due more to model-routing optimizations failing to classify the question as one requiring advanced reasoning. I read a paper the other day that put advanced reasoning (like MoE) at roughly 10x to 75x more computationally expensive. LLM vendors aren't subsidizing model costs as much as they used to, so I assume SotA cloud models are always attempting some of these optimizations unless the user forces the issue.
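To make that failure mode concrete, here's a toy sketch of the kind of routing pre-processor I'm imagining. Every name and the cost ratio are assumptions for illustration, not any vendor's actual implementation:

    # Hypothetical routing pre-processor: a cheap classifier decides whether a
    # prompt gets the expensive reasoning path. Trick questions fail exactly
    # when this classifier guesses wrong.
    CHEAP_COST = 1.0   # relative cost of the fast, shallow pass
    DEEP_COST = 25.0   # somewhere in the assumed 10x-75x range

    def looks_hard(prompt: str) -> bool:
        """Cheap heuristic stand-in for the router's classifier."""
        markers = ("prove", "step by step", "derive", "exactly how many")
        return any(m in prompt.lower() for m in markers)

    def cheap_model(prompt: str) -> str:
        return f"[shallow answer to: {prompt}]"

    def deep_model(prompt: str) -> str:
        return f"[deliberate reasoning answer to: {prompt}]"

    def route(prompt: str, force_deep: bool = False) -> str:
        """Use the expensive path only if forced or classified as hard."""
        if force_deep or looks_hard(prompt):
            return deep_model(prompt)   # ~DEEP_COST per call
        return cheap_model(prompt)      # ~CHEAP_COST per call

    # A rephrased trick question can slip past the heuristic and get the
    # shallow pass, even though the deep pass would have answered correctly:
    q = "Quick one: which is heavier, a ton of cars or a ton of bicycles?"
    print(route(q))
    print(route(q, force_deep=True))

The interesting variable here isn't the deep model's ceiling; it's whether looks_hard() fires.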
I think these one-sentence "LLM trick questions" may increasingly be testing the optimization pre-processors more than the full extent of a SotA model's maximum capability.