The two-tier loading is exactly that — ~128 common classes always visible as one-liners, full docs loaded on demand. The agent does sometimes pull the wrong class, but since each task runs in a forked context, a bad lookup doesn't cascade beyond that task. Gemini Flash for QA was a capability choice — it's able to catch spatial issues like z-fighting and clipping, which is not obvious since vision models aren't typically trained on broken game screenshots. Turned out to work surprisingly well.