So where do the HTML versions of the papers come from?

Ground truth.

What do you mean by that? That researchers should be authoring their papers in HTML?