> They're much, much better at that now.
Because that specific failure case was widely reported on, and subsequent retraining specifically included examples to ensure the model could handle variants of that question. That doesn't address the underlying issue though -- while it's obvious that these models do "learn" and "generalize" by any reasonable and non-anthropocentric definition of the terms, it really does seem like the 'radius' of generalization is smaller than we would like, and that these models are very prone to getting stuck in 'ruts' around things they've seen in their training data. Solving this by band-aid patching every such rut that makes the news is just not a viable long-term solution: the whole world is a minefield of niche problems that look kinda like other problems but have different answers.
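
For concreteness, here's a minimal sketch of the kind of probe I mean, using the widely discussed trivial river-crossing variant as a stand-in for the general class. `query_model` is a hypothetical placeholder for whatever inference API you'd actually call, and the string-matching check is obviously a crude heuristic, not a real eval:

```python
# Sketch of a "rut" probe: take a puzzle the model has almost certainly
# memorized, perturb it so the correct answer changes, and check whether
# the answer tracks the perturbation or snaps back to the memorized one.
from typing import Callable

VARIANT = (
    "A farmer with a wolf, a goat, and a cabbage must cross a river. "
    "The boat easily holds the farmer and all three items at once. "
    "What is the minimum number of crossings?"
)  # Correct answer: one crossing. The classic constrained version needs seven.

def probe(query_model: Callable[[str], str]) -> bool:
    """Return True if the model appears stuck in the classic-puzzle rut."""
    answer = query_model(VARIANT).lower()
    # Crude heuristic: a rutted model reproduces the multi-trip solution
    # ("take the goat first...") instead of noticing one trip suffices.
    return not ("one crossing" in answer or "1 crossing" in answer)
```

The point being: any single probe like this can be (and apparently has been) patched away with targeted training data, but the space of such perturbations is effectively unbounded.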