The lazy-loading approach for the API reference is clever. Did you consider embedding a smaller, distilled subset of the most-used classes permanently in context, and only lazy-loading the long tail? Curious whether the overhead of deciding what to load ever causes cascading errors where the wrong API gets pulled in. Also interested in the visual QA separation - using a model that sees only screenshots and never the code is a genuinely good idea for avoiding confirmation bias. Did you try using Claude for that role too, or was Gemini Flash specifically chosen for cost reasons?
The two-tier loading is exactly that: ~128 common classes always visible as one-liners, with full docs loaded on demand. The agent does sometimes pull the wrong class, but since each task runs in a forked context, a bad lookup doesn't cascade beyond that task. Gemini Flash for QA was a capability choice: it catches spatial issues like z-fighting and clipping, which wasn't a given, since vision models aren't typically trained on broken game screenshots. It turned out to work surprisingly well.
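The two-tier split plus per-task isolation might look roughly like this. This is a minimal sketch, not the actual implementation; all names (`COMMON_CLASSES`, `FULL_DOCS`, `lookup`, the example class docs) are hypothetical, and the real system presumably works over prompt context rather than Python lists:

```python
# Tier 1: common classes, permanently in context as one-line summaries.
# (Illustrative entries; the real reference has ~128 of these.)
COMMON_CLASSES = {
    "Sprite": "Sprite(image, x, y) -- drawable 2D object",
    "Scene": "Scene(name) -- container for game objects",
}

# Tier 2: full documentation, loaded only when the agent asks for it.
FULL_DOCS = {
    "Sprite": "Sprite(image, x, y)\n  .move(dx, dy): translate\n  .draw(surface): render",
    "ParticleEmitter": "ParticleEmitter(rate)\n  .emit(): spawn particles",
}

def build_base_context() -> str:
    """One-liner summaries that every task always sees."""
    return "\n".join(COMMON_CLASSES.values())

def lookup(class_name: str, task_context: list) -> str:
    """Lazy-load full docs into a per-task (forked) context.

    A wrong lookup only pollutes task_context, which is discarded
    when the task ends, so a bad pull can't cascade across tasks.
    """
    doc = FULL_DOCS.get(class_name, "[no docs found for %s]" % class_name)
    task_context.append(doc)
    return doc

base = build_base_context()   # shared, permanent tier
task_ctx = []                 # forked, per-task tier
lookup("ParticleEmitter", task_ctx)
```

The key property the reply describes falls out of the structure: `base` is shared and immutable, while anything `lookup` drags in lives only in `task_ctx` for that one task.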