I was curious enough to have Codex create a similar benchmark: https://github.com/jcheng5/table-formats

With 1000 rows, 100 samples, and the markdown-kv format, I got these scores:

- gpt-4.1-nano: 52%

- gpt-4.1-mini: 72%

- gpt-4.1: 93%

- gpt-5: 100%

I was so surprised by gpt-5 getting 100% that I ran it again with 1000 samples. It got 999 correct, and one wrong.

To reproduce it yourself, clone the repo, add a .env file with OPENAI_API_KEY, `uv sync`, and then run:

    uv run inspect eval evals/table_formats_eval.py@table_formats_markdown_kv --model openai/gpt-5 --limit 100
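
Putting those steps together, it's roughly something like this (with your own key, of course):

    git clone https://github.com/jcheng5/table-formats
    cd table-formats
    echo "OPENAI_API_KEY=<your-key-here>" > .env
    uv sync
    # then run the inspect eval command above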

Update: the number of rows also makes a massive difference, unsurprisingly; at 100 rows, gpt-4.1-nano scores 95%+ for both markdown-kv and CSV. Both model and record count seem to matter a lot more than format.

gpt-5 also got 100/100 for both CSV and JSON.

    uv run inspect eval evals/table_formats_eval.py@table_formats_csv --model openai/gpt-5 --limit 100
    uv run inspect eval evals/table_formats_eval.py@table_formats_json --model openai/gpt-5 --limit 100

Cool tool. I tried a few different things to get it to work with google/gemini-2.5-pro, but couldn't figure it out.

    uv add google-genai
    uv run scripts/run_benchmarks.py --models google/gemini-2.5-pro --formats markdown_kv --limit 100

And add GOOGLE_API_KEY=<your-key-here> to a file called .env in the repo root.
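If you already created a .env with OPENAI_API_KEY for the earlier runs, just append the second key rather than overwriting the file, e.g.:

    echo "GOOGLE_API_KEY=<your-key-here>" >> .env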

Unfortunately I started getting "quota exceeded" almost immediately, but it did give 6/6 correct answers before it crapped out.

Thanks! That worked perfectly.

100 samples:

- gemini-2.5-pro: 100%

- gemini-2.5-flash: 97%

Curious: how many iterations did you run of each benchmark and what was the variance?

How about PNG?