TL;DR: They are estimates produced by giving an LLM (generally o3-mini-high due to cost, some o1-preview) a large corpus of grounding data to reason over and asking it to use its general world knowledge to return estimates it was confident in. When those estimates were escalated to better LLMs like o1-pro and manually verified, they proved good enough that I thought they warranted release.

You can read about the background on how I did them in more detail in the about/methodology section: https://www.opennutrition.app/about (see "Technical Approach")
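Roughly, the escalation step looks something like the sketch below. It is simplified for illustration: the helper names, the self-reported confidence field, and the 0.8 threshold are stand-ins, not the exact production pipeline described on the methodology page.

    from dataclasses import dataclass

    @dataclass
    class Estimate:
        food: str
        calories_per_100g: float
        confidence: float        # model's self-reported confidence, 0..1
        needs_review: bool = False

    def ask_model(model: str, food: str, grounding: str) -> Estimate:
        """Hypothetical LLM call: the prompt includes the grounding corpus
        for this food and asks the model to return values it is confident in."""
        raise NotImplementedError("wire up your LLM provider here")

    def estimate_with_escalation(food: str, grounding: str,
                                 threshold: float = 0.8) -> Estimate:
        """Try a cheaper reasoning model first; escalate low-confidence
        answers to a stronger model and flag them for manual checking."""
        first = ask_model("o3-mini", food, grounding)
        if first.confidence >= threshold:
            return first
        second = ask_model("o1-pro", food, grounding)
        second.needs_review = True   # human spot-check before release
        return second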

You need to add a disclaimer for this data. People could rely on it being accurate, and you simply can't prove that it is.

Every page on the website where that kind of in-depth data is available carries a prominent disclaimer that states, among other things: "We strive to ensure accuracy and quality using authoritative sources and AI-based validation; however, we make no guarantees regarding completeness, accuracy, or timeliness. Always confirm nutritional data independently when accuracy is critical."

At that point, if you are not sure a data point is accurate, should you really display it? You have no proof apart from "the LLM said it was OK," which is kind of poor.

I disagree with the idea that data must be accompanied by a guarantee of accuracy to be used or published. That standard would rule out almost all datasets for which the underlying data is not programmatically generated.

My guess is that this dataset is probably more accurate on the whole than many datasets used by the kinds of calorie-tracking apps that outsource their collection of nutrition information to users. But an analysis would be required.

Regardless, the only workable approach is to describe the provenance of your data and explain what steps have been taken to ensure accuracy. Then anyone who wants to use the data can take that information into account.
