The real story here is not how stupid the responses are - it's that the model chokes on a question even a young child could adequately answer.

Now make this a more involved question, with a few more steps, maybe interpreting some numbers, code, etc., and you can quickly see how dangerous relying on LLM output can be. Each and every intermediate step along the way can be a "should I walk or should I drive" situation. And the step before that can be one too. Turtles all the way down, so to speak.

I don't question that (coding) LLMs started being useful in my day-to-day work around the time Opus 4.5 was released. I'm a paying customer. But it should be clear that keeping a human out of the loop for any decision with real impact should be considered negligence.

I think the models don't treat it as a riddle, but rather as a practical question. With the latter, it makes sense that the car is already at the car wash, otherwise the question makes no sense.

EDIT: I framed the question as a riddle and all models except for Llama 4 Scout failed anyway.