Big fan of reader mode. For me, a direction better than llms.txt would be to encourage sites to improve their markup (think semantic web era) so agents could get the text version from that the way reader mode does. Would achieve the same thing - save tokens.
This isn't difficult and I think the reason it hasn't been done is that publishers want clicks and ad views. Which begs the question: why would they start doing it for agents?
Agents don't buy stuff they see in an ad
So why serve them at all?
If your website itself is advertising a product or service you sell you would still want LLMs to see and fetch it. If you are a news site, blog, or any other website that doesn’t exist to sell something, you are only harmed by ai agents.
In those situations you wouldn't have ads on the human version of the site either, surely?
Sure, if it’s paywalled. Web hosting isn’t free
modern agents already do this via content negotiation and will attempt to retrieve the markdown version of a given site
https://www.sanity.io/learn/course/markdown-routes-with-next...
But that isn't that different from requesting the llms.txt version. Why not just make it so the useful content you want the LLM to focus on is easily retrievable from the same HTML the user's browser gets?
The sanity.io page writes:
> serving agents a bunch of HTML might just bloat their context window.
That's only true if you assume the the agent can't extract the useful text before it goes into the model as tokens. Your browser's reader mode uses heuristics to identify what the actual content is in a large HTML response and strips away the rest.
To me this is a far better approach than worrying about an llms.txt files or looking at HTTP headers to see if markdown is preferred. Such efforts could easily be directed at ensuring the useful content on your site carries the appropriate markup for an agent or any other tool to extract it. And it would require less work to implement for the publisher of the content.