How do you reliably convert HTML to MD for any page on the internet? I remember struggling with this in the past.
How hard would it be to build an MCP that's basically a proxy for web search, except it always tries to return a Markdown version of the web pages instead of passing raw HTML?
Basically Sosumi.ai, but instead of working only for Apple docs, it works for any web page (including every doc on the internet).
I think HTML -> Markdown is a bit of a red herring.
In many cases, a Markdown distillation of HTML can improve the signal-to-noise ratio — especially for sites mired in <div> tag soup (intentionally or not). But that's an optimization for token efficiency; LLMs can usually figure things out.
The motivation behind Sosumi is better understood as a matter of accessibility. The way AI assistants typically fetch content from the web precludes them from getting any useful information from developer.apple.com.
You could start to solve the generalized problem with an MCP that 1) used a headless browser to access content for sites that require JS, and 2) used sampling (i.e. having a tool use the host LLM) to summarize / distill HTML.
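A rough sketch of what that could look like with the Python MCP SDK and Playwright. The server name, tool name, prompt, and the `ctx.session.create_message` sampling call are my assumptions about the current SDK surface, not anything Sosumi actually does:

```python
# Sketch of an MCP tool: render a JS-heavy page headlessly, then ask the
# host LLM (via sampling) to distill the HTML into Markdown.
# Assumes `pip install mcp playwright` and `playwright install chromium`.
from mcp.server.fastmcp import FastMCP, Context
from mcp.types import SamplingMessage, TextContent
from playwright.async_api import async_playwright

mcp = FastMCP("page-to-markdown")  # hypothetical server name


@mcp.tool()
async def fetch_as_markdown(url: str, ctx: Context) -> str:
    """Fetch a page with a headless browser and distill it to Markdown."""
    # 1) Headless browser, so JS-only sites actually produce content.
    async with async_playwright() as p:
        browser = await p.chromium.launch()
        page = await browser.new_page()
        await page.goto(url, wait_until="networkidle")
        html = await page.content()
        await browser.close()

    # 2) Sampling: hand the rendered HTML back to the host LLM to distill.
    result = await ctx.session.create_message(
        messages=[
            SamplingMessage(
                role="user",
                content=TextContent(
                    type="text",
                    text=f"Convert this HTML to clean Markdown, dropping nav/ads:\n\n{html}",
                ),
            )
        ],
        max_tokens=4000,
    )
    if isinstance(result.content, TextContent):
        return result.content.text
    return str(result.content)


if __name__ == "__main__":
    mcp.run()
```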
https://pure.md is exactly what you're looking for.
But stripping complex formats like html & pdf down to simple markdown is a hard problem. It's nearly impossible to infer what the rendered page looks like by looking at the raw html / pdf code. https://github.com/mozilla/readability helps but it often breaks down over unconventional div structures. I heard the state of the art solution is using multimodal LLM OCR to really look at the rendered page and rewrite the thing in markdown.
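For the plain-HTML cases, the usual pipeline is a readability pass followed by an HTML-to-Markdown converter. A minimal sketch using the Python ports (readability-lxml and html2text, both assumed installed), with the same caveat that it breaks on unconventional div structures:

```python
# Minimal HTML -> Markdown pipeline: readability isolates the article body,
# then html2text turns the remaining markup into Markdown.
# Assumes `pip install requests readability-lxml html2text`.
import requests
import html2text
from readability import Document


def page_to_markdown(url: str) -> str:
    html = requests.get(url, timeout=30).text

    # Readability strips nav, sidebars, and most of the div soup --
    # but, as noted above, it can fail on unusual structures.
    doc = Document(html)
    article_html = doc.summary()

    converter = html2text.HTML2Text()
    converter.ignore_images = False
    converter.body_width = 0  # don't hard-wrap lines
    return f"# {doc.title()}\n\n" + converter.handle(article_html)


print(page_to_markdown("https://example.com")[:500])
```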
Which makes me wonder: how did OpenAI make their model read pdf, docx and images at all?
APIs such as Jina AI's Reader do this pretty well. The output isn't as clean as Sosumi's for Apple docs, but it's free and does a decent job.
https://jina.ai/reader
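Usage is just prefixing the target URL with the reader endpoint; a quick sketch (no API key needed for basic use, as far as I know):

```python
# Jina Reader: prefix any URL with https://r.jina.ai/ to get a Markdown rendering.
import requests

url = "https://developer.apple.com/documentation/swiftui/view"
resp = requests.get(f"https://r.jina.ai/{url}", timeout=60)
print(resp.text[:500])  # Markdown version of the page
```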
https://github.com/vivekVells/mcp-pandoc