Just ran and scored 63 3D model generations (via code) across high and no reasoning. The 3D modeling benchmark quickly shows the spatial, logic, and code performance of a model, so I think it's a very good indicator of quality.
Here are the results compared to Gemini 3.5 Flash:
Model + config            CodeErr/gen  Cost/gen  Median time  Quality
gemini-3.5-flash, low     0.71         $0.18     68s          baseline
GLM 5.2, reasoning high   0.61         $0.18     289s         -6.0%
GLM 5.2, reasoning off    1.52         $0.10     126s         -13.6%
Although GLM 5.2 is cheaper with reasoning off, it is significantly slower in both configurations, and the results are worse overall. Surprisingly, high reasoning produces fewer code errors than Gemini 3.5 Flash, but when I actually look at the models, they are worse.

Edit: I recently ran evals with Kimi 2.7 and MiniMax-M3, and this is clearly the open source SOTA model, by far.
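To put the "cheaper but slower" trade-off in numbers, here is a rough sketch that divides each GLM 5.2 row by the Gemini baseline. The figures are copied from the table above. The names RESULTS, BASELINE, and ratios_vs_baseline are made up for illustration and are not part of the benchmark harness.

```python
# Figures copied from the table above:
# (code errors per generation, cost per generation in USD, median time in seconds)
RESULTS = {
    "gemini-3.5-flash, low": (0.71, 0.18, 68),
    "GLM 5.2, reasoning high": (0.61, 0.18, 289),
    "GLM 5.2, reasoning off": (1.52, 0.10, 126),
}
BASELINE = "gemini-3.5-flash, low"


def ratios_vs_baseline(results, baseline):
    """Return each non-baseline row divided element-wise by the baseline row."""
    base = results[baseline]
    return {
        name: tuple(value / ref for value, ref in zip(row, base))
        for name, row in results.items()
        if name != baseline
    }


RATIOS = ratios_vs_baseline(RESULTS, BASELINE)

for name, (err, cost, seconds) in RATIOS.items():
    # A ratio above 1 means "more than the baseline" for that column.
    print(f"{name}: errors x{err:.2f}, cost x{cost:.2f}, time x{seconds:.2f}")
```

A ratio below 1 in the cost column only appears for the reasoning-off run, while both GLM runs have a time ratio well above 1.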
Very interested in this! Can you share more about the modelling method (eg, three js?), the task list, and outputs here?
I think there's probably some good juice to squeeze in terms of spatial awareness by doing a benchmark something like:
- give 3d modelling task
- render and snapshot from a variety of angles
- feed to third-party vision model for a "what is this" type query
- grade on end-to-end accuracy
Bonus points for asking the vision model something like "how beautiful is this 1-10".
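The steps above could be wired together roughly like this. It is only a sketch: generate, render, and identify are hypothetical callables standing in for the modelling agent, the renderer, and the third-party vision model, and the four camera angles are an arbitrary choice.

```python
from typing import Callable, Iterable, Sequence, Tuple


def end_to_end_accuracy(
    tasks: Iterable[Tuple[str, str]],
    generate: Callable[[str], object],
    render: Callable[[object, float], bytes],
    identify: Callable[[Sequence[bytes]], str],
    angles: Sequence[float] = (0.0, 90.0, 180.0, 270.0),
) -> float:
    """Fraction of tasks where the vision model names the intended object.

    tasks    -- (prompt, expected_label) pairs
    generate -- hypothetical: prompt -> 3D model
    render   -- hypothetical: (model, camera angle in degrees) -> image bytes
    identify -- hypothetical: list of images -> "what is this" answer
    """
    hits = 0
    total = 0
    for prompt, expected_label in tasks:
        model = generate(prompt)
        # Snapshot the same model from several angles.
        shots = [render(model, angle) for angle in angles]
        guess = identify(shots)
        # Case-insensitive exact match; a real grader would likely be fuzzier.
        if guess.strip().lower() == expected_label.strip().lower():
            hits += 1
        total += 1
    return hits / total if total else 0.0
```

The "how beautiful is this 1-10" question would be a second hypothetical callable over the same snapshots, averaged instead of matched.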
I don't have the eval results live yet, so I cannot share them yet.
I was benchmarking using a soon-to-be-released new version of my AI CAD modeling software[0]. It's basically an agent that has access to tools that can execute build123d scripts, get sculpted models, use Blender to combine sculpts + parametric models, inspect the model (visually and with code), search datasheets, ...
I tried what you recommend a while ago (asking an AI to evaluate using different angles) and the AI evaluations were extremely bad - barely any correlation to what I scored. Things have gotten better, but I don't trust it enough yet.
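For anyone wanting to check the same thing on their own evals, agreement between human and AI scores can be measured with a plain Pearson correlation. This is a generic sketch, not the method used above, and the two score lists are made-up placeholder data.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of the same length, at least 2 items")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return cov / sqrt(var_x * var_y)


# Placeholder data, NOT real eval results: one score per generation.
human_scores = [8, 3, 6, 9, 2, 5]
ai_scores = [6, 5, 7, 6, 5, 6]

print(f"human vs AI correlation: {pearson(human_scores, ai_scores):.2f}")
```

Values near 1 mean the AI ranks generations much like the human does; values near 0 mean barely any relationship. For ordinal 1-10 scores, a rank correlation may be a better fit.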
Here is how I score adherence (and how AI did as well, but I tried methods where it would just give back a boolean "pass" or not):
Here is the scenario list (prompts are much more detailed):
[0]: https://grandpacad.com
Very cool project. Thanks for sharing!
Would you be able to run it against Gemini Flash (not Lite) 3.0, high thinking?
Absolutely. Running it now, will update this comment in about 30 mins.
Edit: Surprisingly very good results with 3.0 flash with high thinking.
Cost: $0.06
Duration: 3.22 min
Code Errors: 1.3 per attempt (meaning on average it had to retry 1.3 times)
Adherence was on par with 3.5 flash Low thinking
Thanks! I’ve still been using 3.0 a lot, the price-to-performance ratio absolutely kills compared to Google’s other and newer offerings.