The pelican is really getting old as a standalone evaluation metric. By now it is almost certainly in the training set, if models aren't explicitly tuned to produce it for the press on HN alone.
Keep the pelican, but isn't it time to add something more novel that all current and past models struggle with?
One-shot canvas and SVG images or animations also shouldn't be an issue at all at this scale; even Qwen running locally on 24 GB cards can do impressive ones.
I don't understand why this test gets any attention. Other than the pelicans, which aren't a good test, there's no meat in this article.
Relevant: https://news.ycombinator.com/item?id=47839493
It also seems like all of the models have converged on very similar images.