I was curious enough to have Codex create a similar benchmark: https://github.com/jcheng5/table-formats
With 1000 rows, 100 samples, and the markdown-kv format, I got these scores:
- gpt-4.1-nano: 52%
- gpt-4.1-mini: 72%
- gpt-4.1: 93%
- gpt-5: 100%
I was so surprised by gpt-5 getting 100% that I ran it again with 1000 samples. It got 999 correct and one wrong.
To reproduce it yourself, clone the repo, add a .env file with OPENAI_API_KEY, run `uv sync`, and then run:
`uv run inspect eval evals/table_formats_eval.py@table_formats_markdown_kv --model openai/gpt-5 --limit 100`
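Spelled out end to end, the steps look roughly like this (POSIX shell; the API key value is a placeholder):

```sh
git clone https://github.com/jcheng5/table-formats
cd table-formats
echo 'OPENAI_API_KEY=<your-key>' > .env   # or create .env in an editor
uv sync
uv run inspect eval evals/table_formats_eval.py@table_formats_markdown_kv \
  --model openai/gpt-5 --limit 100
```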
Update: the number of rows also makes a massive difference, unsurprisingly; at 100 rows, gpt-4.1-nano scores 95%+ for both markdown-kv and csv. Both the model and the record count seem to matter a lot more than the format.
gpt-5 also got 100/100 for both CSV and JSON.
Cool tool. I tried a few different things to get it to work with google/gemini-2.5-pro, but couldn't figure it out.
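If it's anything like the OpenAI setup, it should just be a matter of pointing Inspect at the Google provider. A minimal sketch, assuming a GOOGLE_API_KEY entry in the same .env file and that google-genai is the provider package Inspect expects:

```sh
# Assumes GOOGLE_API_KEY=<your-key> has been added to .env
uv add google-genai   # Google provider dependency (skip if already installed)
uv run inspect eval evals/table_formats_eval.py@table_formats_markdown_kv \
  --model google/gemini-2.5-pro --limit 100
```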
Unfortunately, I started getting "quota exceeded" errors almost immediately, but it did give 6/6 correct answers before it crapped out.
Thanks! That worked perfectly.
100 samples:
- gemini-2.5-pro: 100%
- gemini-2.5-flash: 97%
Curious: how many iterations of each benchmark did you run, and what was the variance?
How about PNG?