>> If you want to saturate any model today give it a string and a grammar and ask it to generate the string from the grammar.

I'm not sure I understand what that means - could you explain please?

It means applying explicit rules about how text can be generated: for example, reliably generating valid JSON. Currently we use constrained decoding to accomplish this (e.g. at a given position the grammar says the next token must be one of three valid options, and everything else is masked out).

Now you can imagine giving an LLM arbitrary validity rules for generating text. I think that’s what they mean by “grammar”.
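
To make that concrete, here's a toy sketch of the core constrained-decoding step, masking the logits so only grammar-valid tokens can be sampled. This is not any particular library's API; the vocabulary, logits, and valid set are made up for illustration:

    import math
    import random

    def sample_constrained(logits, valid_ids):
        # Mask out every token the grammar forbids at this position...
        masked = [l if i in valid_ids else -math.inf for i, l in enumerate(logits)]
        # ...then softmax-and-sample over whatever survives.
        m = max(masked)
        weights = [math.exp(l - m) for l in masked]
        return random.choices(range(len(weights)), weights=weights, k=1)[0]

    # Toy example: the model's raw preference is "banana", but at this
    # position the grammar only allows one of three city tokens.
    vocab  = ["Rome", "Paris", "London", "banana", "end_path"]
    logits = [2.0, 1.5, 0.5, 3.0, 0.1]
    valid  = {0, 1, 2}
    print(vocab[sample_constrained(logits, valid)])  # always a city, never "banana"

Real implementations (llama.cpp's grammar sampler, Outlines, and the like) keep a parser state alongside generation so the set of valid token ids is recomputed after every emitted token, but the masking step above is the whole trick.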

I'm not GP, but here goes:

LLMs operate on tokens, which are words or word fragments, so they have limited ability to work on a letter-by-letter basis; they can't reliably count the letters in a sentence, for example. That's why "give it a string and a grammar and ask it to generate the string from the grammar" can't be done by inference alone: the model would happily generate tokens that don't match the grammar.
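
You can see the granularity directly with a tokenizer. A quick sketch using OpenAI's tiktoken package (other models use different vocabularies, so the exact splits vary):

    import tiktoken  # pip install tiktoken

    enc = tiktoken.get_encoding("cl100k_base")
    for text in ["Rome", "end_path", "antidisestablishmentarianism"]:
        pieces = [enc.decode([tid]) for tid in enc.encode(text)]
        print(f"{text!r} -> {pieces}")  # multi-letter chunks, not single characters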

But you can use a grammar-based sampler and it'll generate valid strings just fine. llama.cpp can do this out of the box if you give it a grammar in its GBNF format (a BNF-style notation); it then only samples tokens that keep the output valid.
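
For example, a toy grammar file (hypothetical name cities.gbnf; GBNF quotes its terminals and names its start rule root):

    # cities.gbnf: one or more city names separated by spaces
    root ::= city (" " city)*
    city ::= "Rome" | "Paris" | "London"

Passed via --grammar-file, the sampler will refuse any token that would break the grammar (the binary name depends on your build; newer builds call it llama-cli):

    ./llama-cli -m model.gguf --grammar-file cities.gbnf -p "Name some cities: "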

It's not about the generation, it's about verification: the challenge is to check whether a *given* string can be derived from the grammar, which a constrained sampler doesn't do for you.

Changing my tests from the strings I was interested in to common words of four or more letters _did_ improve the ability of reasoning LLMs to get the right answer, at the cost of the context exploding to thousands of tokens.

Unfortunately I can't tell you by how much, because the couple of dozen tests I ran after reading your post ate the $50 I keep in an account for these kinds of things.

The following question ate through 8k thinking tokens before Claude 3.7 Sonnet (extended thinking) got the right answer:

---

Given the following grammar:

    <start> ::= <path>
    <path> ::= Rome <path> | Paris <path> | London <path> | end_path <routes>
    <routes> ::= <path> | end_route <company>
    <company> ::= end_company | <path>

Is the following sentence valid:

    Rome Paris Rome end_path Rome London end_path end_company

---
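
For reference, the membership check itself is mechanical. Here's a minimal recursive-descent checker for that grammar (a plain Python sketch written for this comment, not anything the model used; the grammar happens to be LL(1), so the next token always determines which production to take):

    PATH_STARTS = {"Rome", "Paris", "London"}

    def derivable(tokens):
        pos = 0

        def peek():
            return tokens[pos] if pos < len(tokens) else None

        def eat(expected):
            nonlocal pos
            if peek() != expected:
                raise ValueError(f"expected {expected!r}, got {peek()!r} at {pos}")
            pos += 1

        def path():
            # <path> ::= Rome <path> | Paris <path> | London <path> | end_path <routes>
            if peek() in PATH_STARTS:
                eat(peek())
                path()
            else:
                eat("end_path")
                routes()

        def routes():
            # <routes> ::= <path> | end_route <company>
            if peek() == "end_route":
                eat("end_route")
                company()
            else:
                path()

        def company():
            # <company> ::= end_company | <path>
            if peek() == "end_company":
                eat("end_company")
            else:
                path()

        try:
            path()                      # <start> ::= <path>
            return pos == len(tokens)   # no trailing tokens allowed
        except ValueError:
            return False

    sentence = "Rome Paris Rome end_path Rome London end_path end_company"
    print(derivable(sentence.split()))  # False

It prints False: after the second end_path the grammar needs a <routes>, which must either start another <path> (Rome, Paris, London, or end_path) or consume end_route, and end_company does neither. In fact every valid sentence in this grammar has to terminate through "end_route end_company", so the expected answer is "not valid".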

Incidentally, it got the right answer no fewer than four times in the thinking token stream. I'd not seen this model behave like this before.