Figure 12 shows probabilities, I think; it actually does seem to be 100% at temperature 0.1 for certain pretraining runs.

And Figure 12 is not about the Dyck/balanced-brackets grammar; it's about something not properly described in the paper.

  > it actually does seem to be 100%
For all Dyck grammar sequences, infinitely many of them? ;)

Well, they used strings of < 800 chars; you probably run into context-window and training limits at some point (they mention a result that you need at least something of GPT-2 size to begin recognizing more intricate CFGs, like their synthetic cfg3f). But then again, your physical real-world computer, which is conceptually "Turing complete", can't handle "infinite strings" either.

> Dyck/balanced-brackets grammar

Yes, it's not the Dyck grammar but another CFG they created, which they call the "cfg3" family.

Of course I agree the stack (/pushdown automaton) is the simpler and perfectly optimal structure for this task, but I think it's unfair to say that LLMs _cannot_ recognize or generate CFGs.
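For reference, the stack-based recognizer really is just a few lines; here's a minimal toy sketch of my own (not code from the paper) for the plain multi-bracket Dyck language:

```python
# Toy pushdown-automaton-style recognizer for balanced brackets (Dyck words).
# The stack *is* the automaton's memory: push openers, pop on closers.
PAIRS = {")": "(", "]": "[", "}": "{"}

def is_dyck(s: str) -> bool:
    """Return True iff s is a balanced (Dyck) string over (), [], {}."""
    stack = []
    for ch in s:
        if ch in "([{":
            stack.append(ch)          # push every opener
        elif ch in PAIRS:
            if not stack or stack.pop() != PAIRS[ch]:
                return False          # unmatched or mismatched closer
        else:
            return False              # non-bracket symbol
    return not stack                  # no openers left dangling

print(is_dyck("([]{})"))  # True
print(is_dyck("([)]"))    # False
```

Linear time, and the only unbounded resource is the stack depth, which is exactly the part a fixed-depth transformer has to approximate rather than implement directly.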

(Then again, I know you didn't make any such broad refutation; I mostly wanted to bring up that paper to show that it is possible for them to at least "grok" certain CFGs with a low enough error ratio that they must have internalized the underlying grammar. In fact, I believe the paper goes on to apply interpretability methods to actually trace the circuits with which the model encodes the grammar, which puts to rest any notion of it simply "parroting" the data.) But these were "synthetic" LLMs specifically trained on that grammar; these results probably don't apply in practice to your ChatGPT, which was trained mostly on human text.