> where accuracy is paramount

> accuracy: 60%

Not to mention that the least poorly performing format is probably the stupidest way to encode tabular data, beating even XML. But I guess that's the new normal, because we're trying to shoehorn conversational AI models into every use case rather than, say, training finetunes that are better at particular tasks. (Yes, of course you can't train finetunes when the model is a proprietary black box on someone else's computer.) Something about hammers and nails…

I'm the person who ran the test.

To explain the 60% a bit more...

With small amounts of input data, the accuracy is near 100%. As you increase the size of the input data, the accuracy gradually decreases.

For this test, I intentionally chose an input data set large enough that the LLM would score in the region of 50% accuracy (with variation between formats) in order to maximise the discriminative power of the test.
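
For anyone who wants to poke at this themselves, the shape of the harness is pretty simple. A rough sketch, assuming the OpenAI Python client and a toy row generator (the prompts, model name, and data here are my own placeholders, not the exact ones from the post):

    # Rough sketch only -- not the original test harness.
    import csv, io, random
    from openai import OpenAI

    client = OpenAI()

    def make_rows(n):
        # Hypothetical synthetic records, similar in spirit to the post's tables.
        return [{"id": i, "name": f"Charlie A{i}", "age": random.randint(20, 65),
                 "salary": random.randint(40_000, 120_000)} for i in range(1, n + 1)]

    def to_csv(rows):
        buf = io.StringIO()
        writer = csv.DictWriter(buf, fieldnames=rows[0].keys())
        writer.writeheader()
        writer.writerows(rows)
        return buf.getvalue()

    def ask(table_text, question):
        resp = client.chat.completions.create(
            model="gpt-4.1-nano",
            messages=[{"role": "user",
                       "content": f"{table_text}\n\n{question}\nReturn just the number."}],
        )
        return resp.choices[0].message.content.strip()

    rows = make_rows(500)
    sample = random.sample(rows, 50)
    correct = sum(ask(to_csv(rows), f"What is the salary of {r['name']}?") == str(r["salary"])
                  for r in sample)
    print(f"accuracy: {correct / len(sample):.0%}")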

Thanks for your work on this! It's a very legitimate problem domain to optimize LLMs for. I've produced a comprehensive eval based on your post and run it against 30 models, each tasked with recalling specific data from 500 rows in different tabular formats. Have a look at the results here: https://weval.org/analysis/table-format-sensitivity__combine...

As you can see, it's near-100% recall across all formats for a good chunk of frontier models, with a few (curiously, mostly Claude) failing a basic prompt-adherence check ("Return just the number") but still returning the right answers. The major failures are from Mistral Medium, Llama Maverick, Llama 3 70b Instruct, Mistral Nemo, Gemma 3 12b It, GPT 4o/4.1 Mini, etc.

Based on these limited tests, here are the format leaderboards, FWIW:

    CSV: 84.25%
    Markdown Table: 82.65%
    YAML: 81.85%
    JSON Lines (jsonl): 79.85%
    Markdown key-value: 79.83%
    Pipe-delimited: 79.45%
    Natural language summary: 78.65%
    JSON: 77.73%
    HTML table: 75.80%
    XML: 73.80%

So, the biggest takeaway really is: Use the best model you can reasonably afford, then format will matter less. The cheapest 100% coverage models are Gemini 2.5 Flash and Deepseek Chat V3.1.

And if you have no control over model, then use CSV or Markdown Table.

Thank you for including the tokens needed for each test.

It looks to me like the most concise way of representing each of these tables was CSV, followed by a standard markdown table; the token count appears to be 1/2 or 1/3 of the other options. For experiments not in mice (GPT-4.1-nano) but in larger models, or with larger context beyond the data table itself, my guess is that preserving context might be higher value than the higher LLM-legibility of Markdown-KV.
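
The token counts are easy to sanity-check yourself. A quick sketch with tiktoken (o200k_base is my assumption for the GPT-4.1 family's tokenizer; the sample record is made up):

    # Sketch: compare token counts of the same record in CSV vs markdown key-value.
    # o200k_base is assumed here; swap in whatever encoding your model actually uses.
    import tiktoken

    enc = tiktoken.get_encoding("o200k_base")

    record = {"id": 1, "name": "Charlie A0", "age": 56, "city": "New York"}

    csv_text = ",".join(record.keys()) + "\n" + ",".join(str(v) for v in record.values())
    kv_text = "## Record 1\n\n```\n" + "\n".join(f"{k}: {v}" for k, v in record.items()) + "\n```"

    print("CSV tokens:        ", len(enc.encode(csv_text)))
    print("Markdown-KV tokens:", len(enc.encode(kv_text)))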

Wouldn't it be more useful to measure the number of rows the model can process while still hitting 100% accuracy?

> As you increase the size of the input data, the accuracy gradually decreases.

Interesting.

Regarding your section "Limitations and Areas for Further Study":

What I'd be curious to see in future work would be:

    - changing the order of the data on each table type
    - changing the order of the questions

I'm curious whether it fails on the same items, whether the failures change depending on location, or whether it's a positional bias (a quick shuffling sketch is below).

Is it always a specific question? Is it always a specific value? Is it always question #x (or around question #x)? Does it tend towards x or y for certain types of questions?
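
If the harness takes the rows and questions as plain lists, the permutation part is cheap to bolt on. A rough sketch (the function names are mine, and `ask_fn` is assumed to return True/False per question):

    # Sketch: re-run the same questions over several random orderings of the
    # input rows (and of the questions themselves), to see whether failures
    # track position rather than content.
    import random

    def run_with_permutations(rows, questions, ask_fn, n_shuffles=5, seed=0):
        rng = random.Random(seed)
        results = []
        for _ in range(n_shuffles):
            shuffled_rows = rows[:]
            shuffled_questions = questions[:]
            rng.shuffle(shuffled_rows)
            rng.shuffle(shuffled_questions)
            # Record (question, correct?) pairs so failures can be mapped
            # back to their position in this particular ordering.
            results.append([(q, ask_fn(shuffled_rows, q)) for q in shuffled_questions])
        return results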

Good idea

LLMs have documented position biases, with a skew towards the first and last positions. This is strongest across messages, due to system prompt + current-question training data, but it's present in list data in general.

Exactly. But in the papers I've seen, the tests are usually based on multiple-choice answers:

    Where do you eat?
    A) floor
    B) table
    C) dirt

In this case, the questions asked have a definite answer; the bias would then be in the order of the input data. It's different enough that it triggered my curiosity.

Aren't the best performing (markdown tables) and the worst (pipe-delimited tables) basically the same format?

The best performing isn't markdown tables, it's markdown key/value pairs:

  ## Record 1
  
  ```
  id: 1
  name: Charlie A0
  age: 56
  city: New York
  department: Operations
  salary: 67896
  years_experience: 7
  project_count: 1
  ```

That makes sense to me, because the problem with formats like CSV and regular markdown tables is that it's too easy for the model to mistakenly associate a value in a row with the wrong header.

Explicit key/value formats like this or YAML or JSON objects make that a lot less likely.
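
For what it's worth, generating that winning layout from a list of dicts is trivial. A small sketch matching the example quoted above:

    # Sketch: serialise records as markdown key/value blocks, mirroring the
    # "## Record N" + fenced key: value layout shown above.
    def to_markdown_kv(rows):
        chunks = []
        for i, row in enumerate(rows, start=1):
            body = "\n".join(f"{key}: {value}" for key, value in row.items())
            chunks.append(f"## Record {i}\n\n```\n{body}\n```")
        return "\n\n".join(chunks)

    print(to_markdown_kv([{"id": 1, "name": "Charlie A0", "age": 56, "city": "New York"}]))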

I was surprised that XML (56%), with its closing tags, wasn't as good as YAML/KV (60%), though line breaks do perform the same kind of grouping function.

Then I realized from the table that XML used about 50% more tokens (~75K vs ~50K) for similar accuracy, and for the first time felt a kind of sympathy for the LLM…

Yeah, that was my intuition as well. I think the KV-Markdown format gains an additional advantage over JSON and YAML because the special header syntax helps break up the records.

They used GPT-4.1 nano; results would be quite different with Sonnet or GPT-5.

I was looking for the frontier curve where they tested their benchmark across different models, since this sort of behavior is highly sensitive to parameter count, architecture, training, and fine-tuning. It's a practically useful question, so I was really disappointed that a) they didn't publish their code so you could test it yourself, and b) they didn't do even a cursory examination of other models and sizes.

Or just regular GPT-4.1; it's a quite capable model.

trust me bro, the next model bro, it's just way better bro

To be fair nano was an absolute crap model when it came out