Marginally on-topic: I'd love it if the charts included prior models, including GPT-4 and GPT-3.5.
Not all systems upgrade every few months. A major question is when we hit a step improvement in performance big enough to warrant a re-eval, a redesign of prompts, etc.
There's a small bleeding edge, and a much larger number of followers.