It's not about the generation; it's about the verification.

Changing my tests from the strings I was interested in to common words of four or more letters _did_ improve the ability of reasoning LLMs to get the right answer, at the cost of the context exploding to thousands of tokens.

Unfortunately I can't tell you by how much, because the couple of dozen tests I ran after reading your post ate through the $50 I keep in an account for these kinds of things.

The following question burned through 8k thinking tokens before Claude 3.7 Sonnet (extended thinking) arrived at the right answer:

---

Given the following grammar:

    <start> ::= <path>
    <path> ::= Rome <path> | Paris <path> | London <path> | end_path <routes>
    <routes> ::= <path> | end_route <company>
    <company> ::= end_company | <path>

Is the following sentence valid?

    Rome Paris Rome end_path Rome London end_path end_company

---
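
For what it's worth, this grammar happens to be LL(1): every alternative in a rule starts with a distinct token, so membership can be checked mechanically with a few lines of recursive descent. Here's a minimal sketch in Python (names and structure are my own; this isn't how I ran the tests, just a cheap independent check):

    # Minimal LL(1) recursive-descent recognizer for the grammar above.
    def parse(tokens):
        pos = 0

        def peek():
            return tokens[pos] if pos < len(tokens) else None

        def eat(expected):
            nonlocal pos
            if peek() != expected:
                raise SyntaxError(f"expected {expected!r}, got {peek()!r} at {pos}")
            pos += 1

        def path():
            # <path> ::= Rome <path> | Paris <path> | London <path> | end_path <routes>
            tok = peek()
            if tok in ("Rome", "Paris", "London"):
                eat(tok)
                path()
            elif tok == "end_path":
                eat("end_path")
                routes()
            else:
                raise SyntaxError(f"unexpected {tok!r} in <path> at {pos}")

        def routes():
            # <routes> ::= <path> | end_route <company>
            if peek() == "end_route":
                eat("end_route")
                company()
            else:
                path()

        def company():
            # <company> ::= end_company | <path>
            if peek() == "end_company":
                eat("end_company")
            else:
                path()

        path()  # <start> ::= <path>
        if pos != len(tokens):
            raise SyntaxError(f"trailing tokens: {tokens[pos:]}")

    sentence = "Rome Paris Rome end_path Rome London end_path end_company"
    try:
        parse(sentence.split())
        print("valid")
    except SyntaxError as e:
        print(f"invalid: {e}")

Running it rejects the sentence at the trailing end_company: in this grammar, end_company is only reachable through <company>, which only follows end_route in <routes>.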

Incidentally, it got the right answer no fewer than four times in the thinking token stream. I hadn't seen this model behave like that before.