I mean, this is actually straightforward if you've been paying even the remotest attention.

Chronologically:

GPT-4, GPT-4 Turbo, GPT-4o, o1-preview/o1-mini, o1/o3-mini/o3-mini-high/o1-pro, gpt-4.5, gpt-4.1

Model iterations, by training paradigm:

SGD pretraining with RLHF: GPT-4 -> turbo -> 4o

SGD pretraining w/ RL on verifiable tasks to improve reasoning ability: o1-preview/o1-mini -> o1/o3-mini/o3-mini-high (technically the same product with a higher reasoning token budget) -> o3/o4-mini (not yet released)

reasoning model with some sort of Monte Carlo Search algorithm on top of reasoning traces: o1-pro

Some sort of training pipeline that does well with sparser data, but doesn't incorporate reasoning (I'm speculating here; the training and architecture paradigms aren't that clear for this generation): gpt-4.5, gpt-4.1 (likely fine-tuned on 4.5)
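For what "search on top of reasoning traces" might mean in practice, here's a toy best-of-N sketch: sample several candidate traces, score each with some verifier, keep the winner. The `sample_trace` and `score_trace` functions are hypothetical stand-ins; this is an illustration of the general idea, not OpenAI's actual pipeline.

```python
import random

def sample_trace(prompt, rng):
    # Hypothetical stand-in for a model call that returns one reasoning trace.
    return f"trace for {prompt!r} (seed {rng.random():.3f})"

def score_trace(trace):
    # Hypothetical stand-in for a learned verifier / reward model.
    return len(trace)  # placeholder scoring, not a real quality signal

def best_of_n(prompt, n=8, seed=0):
    # Sample n candidate traces and return the highest-scoring one.
    rng = random.Random(seed)
    traces = [sample_trace(prompt, rng) for _ in range(n)]
    return max(traces, key=score_trace)

print(best_of_n("2+2?"))
```

Real systems presumably do something much richer (tree search over partial traces, learned process reward models), but the compute-for-quality trade is the same shape: more samples and a selection step on top of a fixed base model.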

By performance: hard to tell! Depends on what your task is, just like with humans. There are plenty of benchmarks. Roughly, for me, the rankings by task are:

Creative Writing: gpt-4.5 -> gpt-4o

Business Comms: o1-pro -> o1 -> o3-mini

Coding: o1-pro -> o3-mini (high) -> o1 -> o3-mini (low) -> o1-mini

Shooting the shit: gpt-4o -> o1

This isn't to deny that their marketing nomenclature is bad, just to point out that it's not that confusing for people who are actively working with these models and have a reasonable memory of the past two years.