Interesting. Curious whether this reproduces across models, I made a comprehensive eval based on your post and ran it against 30 models, each tasked with recalling specific data points from 500 rows presented in different tabular formats. Have a look at the results here: https://weval.org/analysis/table-format-sensitivity__combine...
As you can see, recall is near 100% across all formats for a good chunk of frontier models, with a few (curiously, mostly Claude) failing basic prompt adherence ("Return just the number") while still returning the right answers. The major failures come from Mistral Medium, Llama Maverick, Llama 3 70B Instruct, Mistral Nemo, Gemma 3 12B IT, GPT-4o/4.1 Mini, etc.
Based on these limited tests, here's the leaderboard by format, FWIW:
CSV: 84.25%
Markdown Table: 82.65%
YAML: 81.85%
JSON Lines (jsonl): 79.85%
Markdown key-value: 79.83%
Pipe-delimited: 79.45%
Natural language summary: 78.65%
JSON: 77.73%
HTML table: 75.80%
XML: 73.80%
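
For anyone wanting to reproduce: the core of the setup is just rendering identical rows into each format and asking the model to recall a single cell. A simplified Python sketch of the idea (not the actual weval config; the rows, column names, and question are made up for illustration):

    # Render the same rows into several serializations, then build a recall prompt.
    # Simplified sketch only; the real eval uses 500 rows and more formats.
    import csv, io, json

    rows = [
        {"id": 1, "name": "Alice", "score": 91.5},
        {"id": 2, "name": "Bob", "score": 78.0},
    ]

    def to_csv(rows):
        buf = io.StringIO()
        w = csv.DictWriter(buf, fieldnames=rows[0].keys())
        w.writeheader()
        w.writerows(rows)
        return buf.getvalue()

    def to_markdown(rows):
        cols = list(rows[0].keys())
        lines = ["| " + " | ".join(cols) + " |",
                 "| " + " | ".join("---" for _ in cols) + " |"]
        lines += ["| " + " | ".join(str(r[c]) for c in cols) + " |" for r in rows]
        return "\n".join(lines)

    def to_jsonl(rows):
        return "\n".join(json.dumps(r) for r in rows)

    for name, render in {"csv": to_csv, "markdown": to_markdown, "jsonl": to_jsonl}.items():
        prompt = ("Here is a table:\n\n" + render(rows) +
                  "\n\nWhat is the score for name=Bob? Return just the number.")
        # send `prompt` to each model and grade the response against "78" / "78.0"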
IMO the biggest takeaway is: use the best model you can reasonably afford, and the format you choose will matter less. The cheapest 100%-coverage models are Gemini 2.5 Flash and DeepSeek Chat V3.1, FWIW. If you have no control over the model, though, use CSV or a Markdown table, as these have the highest chance of success.

The MAJOR issue that we might not want to admit is that there are a thousand confounders that prevent any meaningful canonical learning here. Crucially: the data within the tabular structure itself matters HUGELY. The scary probabilistic nature of LLMs means the very subject of your queries can affect how the query is run, which is quite absurd from an IO/computing-purity perspective. This is why tooling is so important. Enable the LLM to write and execute code safely (sketched below), and you don't need to worry about such free-prose frailties.
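
To make that last point concrete, here's a minimal sketch of the tool route (illustrative only; the sandboxing and the model call are stubbed out, and the table/column names are hypothetical). Instead of pasting 500 rows into the prompt, hand the model a read-only SQL tool and let it query the real data:

    # Hypothetical read-only SQL tool exposed to the model; the data never has to
    # survive a round trip through the context window, so format stops mattering.
    import sqlite3

    def run_sql_readonly(db_path, sql):
        # Open read-only; a real deployment should also sandbox and time-limit this.
        conn = sqlite3.connect(f"file:{db_path}?mode=ro", uri=True)
        try:
            return conn.execute(sql).fetchall()
        finally:
            conn.close()

    # Given the schema, the model emits a query instead of re-reading prose:
    model_generated_sql = "SELECT score FROM people WHERE name = 'Bob';"
    # result = run_sql_readonly("people.db", model_generated_sql)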