I don’t think this example proves your point. There’s no indication that the model actually worked this out from the input context rather than regurgitating it from its training weights. A better test would be to subtly modify the books fed to the model so that there were actually 51 spells and see if it pulls out the extra one, or to change the names of some spells, etc.
In your example, it might be the case that the model is simply spitting out the consensus view rather than actually finding or constructing this information on its own.
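
For concreteness, here's a rough sketch of what that perturbation test could look like. Everything here is hypothetical: the spell names, the book text, and query_model() are placeholders you'd swap for your actual corpus and model call.

    def query_model(prompt: str) -> str:
        # Placeholder: a real test would call an actual LLM API here.
        # This dummy just echoes the prompt so the sketch runs end to end.
        return prompt

    def perturb_book(book_text: str) -> str:
        # Rename an existing spell and append a fabricated 51st one, so the
        # "correct" answer can only come from the context, not training data.
        renamed = book_text.replace("Fireball", "Emberlance")  # hypothetical rename
        extra = "\n\nThe fifty-first spell, Lumivex, conjures a lattice of cold light.\n"
        return renamed + extra

    def run_test(book_text: str) -> None:
        prompt = "Using only the book below, list every spell it describes.\n\n"
        baseline = query_model(prompt + book_text)
        perturbed = query_model(prompt + perturb_book(book_text))

        # If the model is actually reading the context, the perturbed run
        # should surface the fabricated spell and the new name; if it's
        # regurgitating training data, it will likely miss both and may
        # leak the old name instead.
        print("Fabricated spell found:", "Lumivex" in perturbed)
        print("Renamed spell found:   ", "Emberlance" in perturbed)
        print("Old name leaked:       ", "Fireball" in perturbed)

The point of the rename in particular is that it can't be answered by consensus: the training data, if memorized, would actively push the model toward the wrong answer.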
Ah, that's a good point.