There are ways around this. You can push the success rate close to 100% if you use chain of thought and a quorum selection. It isn't great, and it slows response times, but if 85% isn't good enough, you just need to flip the coin about 5 times to get nearly(!) guaranteed results.
Good insight here. We actually did not include thinking in this model, partly because we saw how incredibly fast it was when it only emits the minimum number of tokens needed to output an answer.
Thinking helps performance scores but we'll leave it up to users to add additional tokens if they want. Our goal here was the leanest weight and token base for blazing fast performance for you all.
Coin flipping works only if the failures are roughly independent. More important is the complexity ceiling above which the model fails every time.
So my solution to non-binary failure states is:
1. Generate a potential solution
2. If the solution is complex, chunk it up into logical parts
3. Vote on each chunk and select those with more than k votes
By doing this you can filter out outliers (not always desirable) and pull the signal out of the noise.
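A minimal sketch of step 3, assuming the solution has already been split into chunks and that each "voter" is some callable returning approve/reject for a chunk. In practice a voter would probably be another model call; the lambdas below are hypothetical stand-ins, and `select_chunks` is a name made up for this example.

```python
from typing import Callable

# A voter looks at one chunk and returns True (approve) or False (reject).
Voter = Callable[[str], bool]

def select_chunks(chunks: list[str], voters: list[Voter], k: int) -> list[str]:
    """Keep only the chunks that receive more than k approving votes."""
    kept = []
    for chunk in chunks:
        votes = sum(1 for voter in voters if voter(chunk))
        if votes > k:
            kept.append(chunk)
    return kept

if __name__ == "__main__":
    # Toy chunks of a generated solution; the second one is deliberately broken.
    chunks = ["x = 1", "y = x +", "print(x)"]
    # Hypothetical voters standing in for independent model judgments.
    voters = [
        lambda c: not c.endswith("+"),       # rejects dangling operators
        lambda c: "=" in c or "print" in c,  # expects an assignment or a print
        lambda c: len(c) > 3,                # rejects trivially short chunks
    ]
    print(select_chunks(chunks, voters, k=2))  # prints ['x = 1', 'print(x)']
```

Raising k filters harder, which is where the "not always desirable" part comes in: a correct but unusual chunk gets dropped along with the noise.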