Very interested in this! Can you share more about the modelling method (e.g., three.js?), the task list, and outputs here?
I think there's probably some good juice to squeeze in terms of spatial awareness with a benchmark along these lines:
- give 3d modelling task
- render and snapshot from a variety of angles
- feed to third-party vision model for a "what is this" type query
- grade on end-to-end accuracy
Bonus points for asking the vision model something like "how beautiful is this 1-10".
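A rough sketch of that loop, in case it helps make the proposal concrete. Everything here is an assumption for illustration: `Task`, `run_benchmark`, and the three callables (`build_model`, `render`, `identify`) are hypothetical names, and the substring match is just the simplest possible grader. The modelling system, renderer, and vision model get plugged in by the caller.

```python
from dataclasses import dataclass
from typing import Any, Callable, Sequence, Tuple


@dataclass
class Task:
    prompt: str          # the 3D modelling task given to the system under test
    expected_label: str  # what a viewer should recognise, e.g. "chair"


# Camera angles as (azimuth, elevation) in degrees. The set is arbitrary.
DEFAULT_ANGLES: Sequence[Tuple[float, float]] = (
    (0, 20), (90, 20), (180, 20), (270, 20), (45, 60),
)


def run_benchmark(
    tasks: Sequence[Task],
    build_model: Callable[[str], Any],            # hypothetical: prompt -> model
    render: Callable[[Any, float, float], Any],   # hypothetical: model + angle -> image
    identify: Callable[[Sequence[Any]], str],     # hypothetical: images -> "what is this" answer
    angles: Sequence[Tuple[float, float]] = DEFAULT_ANGLES,
) -> float:
    """Return end-to-end accuracy: the share of tasks the vision model names correctly."""
    if not tasks:
        return 0.0
    correct = 0
    for task in tasks:
        model = build_model(task.prompt)
        snapshots = [render(model, az, el) for az, el in angles]
        answer = identify(snapshots)
        # Simplest possible grading: the expected label appears in the answer.
        if task.expected_label.lower() in answer.lower():
            correct += 1
    return correct / len(tasks)
```

The "how beautiful is this 1-10" idea would fit as a second callable alongside `identify`, returning a number instead of a label.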
I don't have the eval results live yet, so I can't share them.
I was benchmarking with a soon-to-be-released new version of my AI CAD modeling software[0]. It's basically an agent with access to tools that can execute build123d scripts, get sculpted models, use Blender to combine sculpts + parametric models, inspect the model (visually and with code), search datasheets, ...
I tried what you're recommending a while ago (asking an AI to evaluate from different angles), and the AI evaluations were extremely bad: barely any correlation with what I scored. Things have gotten better, but I don't trust it enough yet.
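One way to put a number on "barely any correlation" is a rank correlation between the human scores and the AI scores for the same set of models. This is a minimal sketch, not the method used above: Spearman's rho is my choice here, and the function names are made up for illustration. It uses only the standard library.

```python
from math import sqrt
from typing import List, Sequence


def ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Plain Pearson correlation coefficient."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined when one side is constant")
    return cov / (spread_x * spread_y)


def spearman(human_scores: Sequence[float], ai_scores: Sequence[float]) -> float:
    """Spearman's rho: Pearson correlation computed on the ranks."""
    if len(human_scores) != len(ai_scores) or len(human_scores) < 2:
        raise ValueError("need two score lists of equal length >= 2")
    return pearson(ranks(human_scores), ranks(ai_scores))
```

A value near 1 means the AI orders the models the way the human grader does; near 0 means its ordering is unrelated, regardless of how the two scales differ.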
Here is how I score adherence (and how the AI did as well, but I also tried methods where it would just give back a boolean "pass" or not):
Here is the scenario list (prompts are much more detailed):
[0]: https://grandpacad.com
Very cool project. Thanks for sharing!