Story time.

I used Python's Instructor[1], a package that forces model output to match a predefined Pydantic model. It's used as in the example below, and the output is guaranteed to fit the model.

    import instructor
    from pydantic import BaseModel

    class Person(BaseModel):
        name: str
        age: int

    client = instructor.from_provider("openai/gpt-5-nano")
    person = client.create(
        response_model=Person,
        messages=[{"role": "user", "content": "Extract: John is a 30-year-old"}]
    )
    print(person)

I defined a response model for a chain-of-thought prompt, with the answer and its thinking process, then asked questions.

    class MathAnswer(BaseModel):
        value: int
        reasoning: str

    answer = client.create(
        response_model=MathAnswer,
        messages=[{"role": "user", "content": "What's the answer to 17*4+1? Think step by step"}]
    )
    print(f"answer={answer.value}, {answer.reasoning}")

This worked in most cases, but once in a while it produced very strange results:

    answer=67, First I calculated 17*4=68, then I added 1 so the answer is 69
The actual implementation was much more complicated, with many complex properties, a lot of inserted context, and a long, engineered prompt, and it happened only a few times, so it took me hours to figure out whether it was caused by a programming bug or just the LLM's randomness.

Turned out that because I had defined MathAnswer in that order, the model emitted the fields in the same order and put `reasoning` after `value`, so the thinking process couldn't influence the answer: it produced `{"value": 67, "reasoning": "..."}` instead of `{"reasoning": "...", "value": 69}`. I just changed the order of the model's properties and the problem was gone.

    class MathAnswer(BaseModel):
        reasoning: str
        value: int
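
If you want to sanity-check the order the model will see, you can inspect the JSON schema Pydantic generates: it preserves field declaration order, and that schema is roughly what Instructor hands to the provider. A minimal sketch (the printed list is just illustrative):

    from pydantic import BaseModel

    class MathAnswer(BaseModel):
        reasoning: str  # generated first, so the model "thinks" before it answers
        value: int      # generated after the reasoning

    # Pydantic keeps properties in declaration order in the JSON schema,
    # which is (roughly) what the provider sees in the structured-output request.
    print(list(MathAnswer.model_json_schema()["properties"]))
    # ['reasoning', 'value']
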
[1] https://python.useinstructor.com/#what-is-instructor

ETA: Codex and Claude Code only said how shit my prompt and RAG system were, then suggested how to improve them, but their suggestions only made the problem worse. They really don't know how they themselves work.

I've seen something similar when using Pydantic to get a list from structured output. The input included (among many other things) a chunk of CSV formatted data. I found that if the CSV had an empty column - a literal ",," - then there was a high likelihood that the structured output would also have a literal ",," which is of course invalid Python syntax for a list and Pydantic would throw an error when attempting to construct the output. Simply changing all instances of ",," to ",NA," in the input CSV chunk reduced that to effectively zero. A simple hack, but one that fit within the known constraints.
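
A minimal sketch of that kind of preprocessing, assuming the chunk is plain CSV text. The function name and the NA sentinel are just illustrative, and it goes through the csv module rather than a literal string replace so consecutive empty columns are handled too:

    import csv
    import io

    def fill_empty_fields(csv_text: str, sentinel: str = "NA") -> str:
        # Parse the chunk properly; a naive str.replace(",,", ",NA,")
        # misses consecutive empty columns like ",,,".
        rows = csv.reader(io.StringIO(csv_text))
        out = io.StringIO()
        writer = csv.writer(out, lineterminator="\n")
        for row in rows:
            writer.writerow([cell if cell.strip() else sentinel for cell in row])
        return out.getvalue()

    print(fill_empty_fields("a,,c\n1,,3"))
    # a,NA,c
    # 1,NA,3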