In my experience, 4o was already really good at this task. I'd be curious to see an in-depth 4o vs. o3 benchmark.