I've wondered if it's because structured outputs rely on visual cues to impart meaning, and turning them into token streams looses that spatial structure.
I've wondered if it's because structured outputs rely on visual cues to impart meaning, and turning them into token streams looses that spatial structure.