It's not about the generation, it's about verification.
Changing my tests from the strings I was interested in to common words of four or more letters _did_ improve the ability of reasoning LLMs to get the right answer, at the cost of the context exploding to thousands of tokens.
Unfortunately I can't tell you by how much, because the couple of dozen tests I ran after reading your post ate through the $50 I keep in an account for this sort of thing.
The following question burned through 8k thinking tokens before Claude 3.7 Sonnet (extended thinking) got the right answer:
---
Given the following grammar:
<start> ::= <path>
<path> ::= Rome <path> | Paris <path> | London <path> | end_path <routes>
<routes> ::= <path> | end_route <company>
<company> ::= end_company | <path>
Is the following sentence valid: Rome Paris Rome end_path Rome London end_path end_company
---
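For anyone who wants to check the answer mechanically: every alternative in this grammar starts with a distinct terminal, so it's LL(1), and a recursive-descent recognizer with one token of lookahead settles validity with no search at all. A minimal sketch in Python (the names are mine, not anything from the post above):

```python
def parse(tokens):
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def eat(expected):
        nonlocal pos
        if peek() != expected:
            raise SyntaxError(f"expected {expected!r}, got {peek()!r} at token {pos}")
        pos += 1

    def path():
        # <path> ::= Rome <path> | Paris <path> | London <path> | end_path <routes>
        if peek() in ("Rome", "Paris", "London"):
            eat(peek())
            path()
        elif peek() == "end_path":
            eat("end_path")
            routes()
        else:
            raise SyntaxError(f"no <path> alternative starts with {peek()!r}")

    def routes():
        # <routes> ::= <path> | end_route <company>
        if peek() == "end_route":
            eat("end_route")
            company()
        else:
            path()

    def company():
        # <company> ::= end_company | <path>
        if peek() == "end_company":
            eat("end_company")
        else:
            path()

    path()                      # <start> ::= <path>
    return pos == len(tokens)   # every token must be consumed

sentence = "Rome Paris Rome end_path Rome London end_path end_company"
try:
    print("valid" if parse(sentence.split()) else "invalid: trailing tokens")
except SyntaxError as e:
    print("invalid:", e)
```

Running this rejects the sentence at the final end_company, since at that point <routes> can only begin with end_route or the first token of a <path>. That one-token check is the bookkeeping the model spent 8k tokens circling around.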
Incidentally, it got the right answer no fewer than four times in the thinking token stream. I'd not seen this model behave like this before.