I have heard this argument before, but never actually seen concrete evals.

The argument goes that because we are intentionally constraining the model - I believe OpenAI's method is to take a softmax over the logits (I think, rusty on my ML math), sort the tokens by probability, and then take the first one that the current state machine allows - we get less creativity.

Maybe, but a one-off vibes example is hardly proof. I still use structured output regularly.

Oh, and tool calling is almost certainly implemented atop structured output. After all, it's forcing the model to respond with JSON matching a schema that describes the tool's arguments. I struggle to believe that this is adequate for tool calling but inadequate for general-purpose use.
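
To make that concrete, here's a rough sketch (the get_weather tool and its schema are made up, not any particular vendor's API): a tool definition is really just a JSON Schema for its arguments, and a tool call is just output constrained to that schema.

    import json
    import jsonschema  # pip install jsonschema

    # Hypothetical tool definition: the "parameters" field is just a JSON Schema,
    # and the model's arguments are forced to conform to it.
    get_weather = {
        "name": "get_weather",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string"},
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
            },
            "required": ["city"],
        },
    }

    # What the model is constrained to emit for the tool call:
    model_output = '{"city": "Oslo", "unit": "celsius"}'

    # Validating tool arguments is exactly the same problem as validating
    # any other structured output.
    jsonschema.validate(json.loads(model_output), get_weather["parameters"])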

> but never actually seen concrete evals.

The team behind the Outlines library has produced several sets of evals and repeatedly shown the opposite: that constrained decoding improves model performance (including examples of "CoT" which the post claims isn't possible). [0,1]

There was a paper that claimed constrained decoding hurt performance, but it had some fundamental errors, which the Outlines team also wrote about [2].

People get weirdly superstitious when it comes to constrained decoding, as though it were somehow "limiting the model," when it's as simple as applying a conditional probability distribution to the logits. I also suspect this post is largely to justify the fact that BAML parses the results (since the post is written by them).
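
For anyone who wants to see how little is going on, here is a toy numpy sketch (made-up vocabulary and a hand-picked "allowed" set, not any particular library's code): mask the tokens the grammar's current state disallows, renormalize, sample. That's the entire "constraint".

    import numpy as np

    rng = np.random.default_rng(0)

    logits = np.array([2.0, 1.0, 0.5, -1.0, 3.0])  # toy vocab of 5 tokens
    allowed = np.array([0, 2, 4])                  # tokens the grammar allows next

    # Mask disallowed tokens, then softmax: this is just conditioning the
    # model's distribution on the allowed set.
    masked = np.full_like(logits, -np.inf)
    masked[allowed] = logits[allowed]
    probs = np.exp(masked - masked.max())
    probs /= probs.sum()

    next_token = rng.choice(len(logits), p=probs)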

0. https://blog.dottxt.ai/performance-gsm8k.html

1. https://blog.dottxt.ai/oss-v-gpt4.html

2. https://blog.dottxt.ai/say-what-you-mean.html

To be fair, there is "real harm" from constraining LLM outputs in some cases: for example, forcing a lipogram (banning the letter "E") can make a model respond with misspelled words (the "E" simply deleted) rather than words that genuinely don't contain the letter "E" at all. This is why some authors propose special decoders to fix that diversity problem. See this paper, and much of what it cites, for examples: https://arxiv.org/abs/2410.01103

This is independent of a "quality" or "reasoning" problem, which simply does not exist when using structured generation.

Edit (to respond):

I am claiming that there is no harm to reasoning, not claiming that CoT reasoning before structured generation isn't happening.

> "reasoning" problem which simply does not exist/happen when using structured generation

The first article demonstrates exactly how to implement structured generation with CoT. Do you mean “reasoning” other than traditional CoT (like DeepSeek)? I’ll have to look for a reference, but I recall the Outlines team also handling this latter case.
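
For reference, the trick is roughly this shape (a toy Pydantic v2 schema of my own, not the exact one from the dottxt posts): put a free-text reasoning field ahead of the constrained answer field, assuming the generator fills fields in the order the schema lists them.

    from pydantic import BaseModel


    class Response(BaseModel):
        reasoning: str  # generated first: free-form chain of thought
        answer: int     # generated last: the tightly constrained final answer


    # A schema like this is what gets handed to the constrained decoder.
    print(Response.model_json_schema())

So the CoT happens inside the structured output rather than being cut off by it.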