I'm also curious whether anyone has a definitive test set for this? Kind of like how Simon Willison uses the pelican riding a bicycle?

Good question - we're working on case studies for this.

My theory: models are heavily trained on HTML/XML, and many use XML tags in their own system prompts, so they're naturally fluent in that syntax. In our testing, that makes nested structures more reliable.

Structured output endpoints help a lot with JSON, though.
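
If you want to compare the two yourself, here's a rough sketch of rendering the same nested payload both ways before dropping it into a prompt. Everything in it is made up for illustration: the `to_xml` helper, the `task` payload, and the tag names are assumptions, not something from our case studies or any particular library.

```python
import json
from xml.sax.saxutils import escape


def to_xml(tag, value, indent=0):
    """Render a nested dict/list/scalar as indented XML-style tags (hypothetical helper)."""
    pad = "  " * indent
    if isinstance(value, dict):
        # Each key becomes a child tag nested one level deeper.
        inner = "\n".join(to_xml(k, v, indent + 1) for k, v in value.items())
        return f"{pad}<{tag}>\n{inner}\n{pad}</{tag}>"
    if isinstance(value, list):
        # Lists repeat the same tag once per item.
        return "\n".join(to_xml(tag, item, indent) for item in value)
    # Scalars are escaped so characters like < and & can't break the markup.
    return f"{pad}<{tag}>{escape(str(value))}</{tag}>"


# Illustrative payload only.
task = {
    "goal": "summarize",
    "constraints": {"max_words": 50, "tone": "neutral"},
    "source": ["intro", "results"],
}

xml_prompt = to_xml("task", task)
json_prompt = json.dumps({"task": task}, indent=2)

print(xml_prompt)
print(json_prompt)
```

Send each version with the same instructions and check how often the nested fields come back intact. That only gives you a rough signal for your own prompts, not a definitive test set.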