> How well does this work when you slightly change the question? Rephrase it, or use a bicycle/truck/ship/plane instead of car?
I didn't test this, but I suspect current SotA models would get variations within that specific class of question correct if they were forced into their advanced/deep modes, which invoke MoE-style (or similar) reasoning structures.
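For concreteness, this is roughly what I mean by "forcing" it, assuming an OpenAI-style API (the reasoning_effort parameter exists in their chat completions API for reasoning models; the model name below is just a placeholder):

    # Illustrative only: explicitly requesting the expensive reasoning tier
    # instead of letting the vendor's router decide how hard to think.
    from openai import OpenAI

    client = OpenAI()
    response = client.chat.completions.create(
        model="o3-mini",               # placeholder: any reasoning-capable model
        reasoning_effort="high",       # force the deep/advanced mode
        messages=[{"role": "user", "content": "<the one-sentence trick question>"}],
    )
    print(response.choices[0].message.content)

Consumer chat UIs expose the same thing as a "think harder" / extended-thinking toggle.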
I assumed failures on the original question were due more to model-routing optimizations failing to classify the question as one requiring advanced reasoning. I read a paper the other day that put advanced reasoning (like MoE) at roughly 10x to 75x more computationally expensive. LLM vendors aren't subsidizing model costs as much as they used to, so I assume SotA cloud models are always attempting some of these optimizations unless the user forces the issue.
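To make that failure mode concrete, here's a toy sketch of the kind of routing pre-processor I'm imagining. Every name and the cost ratio are assumptions for illustration, not any vendor's actual implementation:

    # Hypothetical routing pre-processor: a cheap classifier decides whether a
    # prompt gets the expensive reasoning path. Trick questions fail exactly
    # when this classifier guesses wrong.
    CHEAP_COST = 1.0   # relative cost of the fast, shallow pass
    DEEP_COST = 25.0   # somewhere in the assumed 10x-75x range

    def looks_hard(prompt: str) -> bool:
        """Cheap heuristic stand-in for the router's classifier."""
        markers = ("prove", "step by step", "derive", "exactly how many")
        return any(m in prompt.lower() for m in markers)

    def cheap_model(prompt: str) -> str:
        return f"[shallow answer to: {prompt}]"

    def deep_model(prompt: str) -> str:
        return f"[deliberate reasoning answer to: {prompt}]"

    def route(prompt: str, force_deep: bool = False) -> str:
        """Use the expensive path only if forced or classified as hard."""
        if force_deep or looks_hard(prompt):
            return deep_model(prompt)   # ~DEEP_COST per call
        return cheap_model(prompt)      # ~CHEAP_COST per call

    # A rephrased trick question can slip past the heuristic and get the
    # shallow pass, even though the deep pass would have answered correctly:
    q = "Quick one: which is heavier, a ton of cars or a ton of bicycles?"
    print(route(q))
    print(route(q, force_deep=True))

The interesting variable here isn't the deep model's ceiling; it's whether looks_hard() fires.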
I think these one-sentence "LLM trick questions" may increasingly be testing the optimization pre-processors more than the full extent of a SotA model's maximum capability.