This Pelican benchmark has become irrelevant. SVG is already ubiquitous.

We need a new, authentic scenario.

Like identifying names of skateboard tricks from the description? https://skatebench.t3.gg/

I don’t care how practical it may or may not be, this is my new favorite LLM benchmark

I couldn't find an about page or similar?

Here's the public sample https://github.com/T3-Content/skatebench/blob/main/bench/tes...

I don't think there's a good description anywhere. https://youtube.com/@t3dotgg talks about it from time to time.

o3-pro is better than 5.2 pro! And GPT 5 high is best. Really quite interesting.

  1. Take the top ten searches on Google Trends 
     (on the day of a new model release)
  2. Concatenate them
  3. SHA-1 hash the result
  4. Use the hash as a seed to perform a random noun-verb 
     lookup in an agreed-upon, large dictionary
  5. Construct a sentence using an agreed-upon, stable 
     algorithm that generates reasonably coherent prompts
     from an immensely deep probability space
That's the prompt. Every existing model is given that prompt and compared side-by-side.

You can generate a few such sentences for more samples.
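The steps above could be sketched roughly like this. Everything concrete here is an illustrative placeholder, not part of any real benchmark: the search terms, the word lists, and the sentence template in step 5 are all made up, and the real scheme would need a shared canonical dictionary and sentence algorithm.

```python
import hashlib
import random

def derive_seed(trending_searches):
    """Steps 1-3: concatenate the day's top searches and SHA-1 hash them."""
    digest = hashlib.sha1("".join(trending_searches).encode("utf-8")).hexdigest()
    return int(digest, 16)

def build_prompt(seed, nouns, verbs):
    """Steps 4-5: deterministic noun-verb lookup from agreed-upon word lists."""
    rng = random.Random(seed)  # same seed -> same picks for everyone
    noun = rng.choice(nouns)
    verb = rng.choice(verbs)
    # Placeholder for the "stable sentence-construction algorithm" in step 5.
    return f"Generate an SVG of a {noun} trying to {verb} a bicycle."

# Placeholder inputs; the real ones would come from Google Trends
# and a large shared dictionary.
searches = ["example search one", "example search two"]
nouns = ["pelican", "otter", "walrus"]
verbs = ["ride", "juggle", "balance"]

print(build_prompt(derive_seed(searches), nouns, verbs))
```

Because `random.Random` is deterministic for a given seed, anyone who agrees on the inputs and word lists reproduces the identical prompt, while nobody can predict it before the Trends data exists.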

Alternatively, take the top ten F500 stock performers. Any easy signal works if it provides enough randomness, is easy to agree upon, and leaves no time to game.

Teams can also pre-generate candidate problems from the same scheme to attempt improvement across the board, but they won't have the exact questions on test day.