The first part I implemented was the basic parser -> SVG renderer (restricted to the simplest TikZ constructs), and then I put in a basic drag-and-drop interface to validate whether the architecture was promising. Code structure was decided pretty much entirely by Codex; it asks my opinion with multiple-choice questions during plan mode, which I like. I tend to alternate between feature expansion and code quality passes (e.g. making sure no files are too big, the folder structure makes sense, test coverage is good, etc.).

Indeed, I have scripts for compiling a given TikZ figure using LaTeX (in particular with dvisvgm, so I get an SVG instead of a PDF) as well as with my JS-based renderer. I run those scripts on various corpora, mostly particular pages from the TikZ manual (see https://tikz.dev), but there are also a few books about TikZ that have downloadable zips of all the examples they use. I then compare the two renderings by eye and give Codex a list of which figures are wrong and why, and it then goes and fixes the underlying issues.
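
As a rough sketch, the LaTeX half of such a comparison script might look like the following. This is not the actual script: the function names, the `standalone` wrapper, and the choice of flags are all assumptions for illustration. It writes the figure into a minimal document, runs `latex` to get a DVI, and then runs `dvisvgm` to turn that into an SVG.

```python
import subprocess
from pathlib import Path

# Assumed wrapper: a minimal standalone document around one figure.
# "@FIGURE@" is a placeholder invented for this sketch.
STANDALONE_TEMPLATE = (
    "\\documentclass[tikz]{standalone}\n"
    "\\begin{document}\n"
    "@FIGURE@\n"
    "\\end{document}\n"
)


def wrap_tikz(figure_source: str) -> str:
    """Embed a tikzpicture in a compilable standalone document."""
    return STANDALONE_TEMPLATE.replace("@FIGURE@", figure_source.strip())


def build_commands(tex_path: Path, out_dir: Path) -> list[list[str]]:
    """Return the two command lines: latex (tex -> dvi), dvisvgm (dvi -> svg)."""
    dvi_path = out_dir / (tex_path.stem + ".dvi")
    svg_path = out_dir / (tex_path.stem + ".svg")
    return [
        ["latex", "-interaction=nonstopmode", "-halt-on-error",
         f"-output-directory={out_dir}", str(tex_path)],
        # --no-fonts draws glyphs as paths instead of embedding fonts.
        ["dvisvgm", "--no-fonts", f"--output={svg_path}", str(dvi_path)],
    ]


def compile_reference_svg(figure_source: str, name: str, out_dir: Path) -> Path:
    """Compile one figure to SVG; needs latex and dvisvgm on PATH."""
    out_dir.mkdir(parents=True, exist_ok=True)
    tex_path = out_dir / f"{name}.tex"
    tex_path.write_text(wrap_tikz(figure_source), encoding="utf-8")
    for command in build_commands(tex_path, out_dir):
        subprocess.run(command, check=True, capture_output=True)
    return out_dir / f"{name}.svg"
```

Running `compile_reference_svg` over every example in a corpus gives the reference SVGs to set side by side with the output of the JS-based renderer.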

You'd think that finding discrepancies between the renderers could be done automatically, but it hasn't worked well in my experience. The models are multimodal but still kinda blind; they think two pictures are the same even if they are very much not the same. But once you tell them what's wrong, they're pretty good at iterating until it is fixed. (One could also try to do a pixel diff of rasterized images, but that's super noisy, and text rendering isn't going to be pixel-perfect anyway.)
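
To make the noise problem concrete, here is a toy pixel diff, assuming images are plain lists of grayscale rows (the helper names and the threshold are made up for this sketch; a real version would rasterize the SVGs first). A thin line shifted by a single pixel looks identical by eye, yet every pixel of the line is flagged twice, once where it was and once where it moved to.

```python
def diff_fraction(image_a, image_b, threshold=32):
    """Fraction of pixels whose grayscale values differ by more than threshold."""
    same_shape = len(image_a) == len(image_b) and all(
        len(row_a) == len(row_b) for row_a, row_b in zip(image_a, image_b)
    )
    if not same_shape:
        raise ValueError("images must have the same dimensions")
    total = sum(len(row) for row in image_a)
    if total == 0:
        return 0.0
    differing = sum(
        1
        for row_a, row_b in zip(image_a, image_b)
        for a, b in zip(row_a, row_b)
        if abs(a - b) > threshold
    )
    return differing / total


def vertical_line(width, height, x):
    """White image (255) with a one-pixel-wide black (0) line in column x."""
    return [[0 if col == x else 255 for col in range(width)] for _ in range(height)]


reference = vertical_line(8, 8, 3)
shifted = vertical_line(8, 8, 4)  # same line, one pixel to the right

print(diff_fraction(reference, reference))  # 0.0
print(diff_fraction(reference, shifted))    # 0.25
```

A quarter of this small image differs even though the drawing is essentially correct, and a genuinely wrong figure (say, a missing arrowhead) can easily score lower than that, so the raw number does not separate real bugs from harmless offsets.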