I dunno, I can see an argument that something like IMO word problems occupy a categorically different language space than a corpus of historiography. For one, even when expressed in English, math is still highly, highly structured: definitions of terms are totally unambiguous, logical tautologies can be expressed in only a few tokens, etc. It's incredibly impressive that these rich structures can be learned by such a flexible model class, but it seems (to me) much closer to excelling at chess or another structured game than to something as ambiguous as the synthesis of historical narratives.

> Now tell me a system like this can't take source material and all the expert writings so far, and come up with various interpretations based on those combinations. And tell me it'll be less accurate than some historian's "vibes".

Framing it as the kind of problem where accuracy is a well-defined concept is exactly the error this article is talking about. The historian's "vibes" and "feelings" are literally the product you're trying to mimic with the LLM output, not an error to be smoothed out. I have no doubt that LLMs can have a real impact in this field, especially as turbopowered search engines and text-management tools. But the point of human narrative history is fundamentally that we tell it to ourselves, and make sense of it by talking about it. Removing the human from the loop is, IMO, like trying to replace the therapy client with a chat agent.