The eval results aren't live yet, so I can't share them.
I was benchmarking with a soon-to-be-released version of my AI CAD modeling software[0]. It's basically an agent with access to tools that can execute build123d scripts, get sculpted models, use Blender to combine sculpts + parametric models, inspect the model (visually and with code), search datasheets, ...
I tried what you recommend a while ago (asking an AI to evaluate using different angles) and the AI evaluations were extremely bad - barely any correlation with what I scored. Things have gotten better, but I don't trust AI scoring enough yet.
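For anyone wanting to run the same sanity check, here is a minimal sketch of comparing human and AI scores with a Pearson correlation. This is not code from my eval harness; the function name `pearson` and the idea of holding the scores in two parallel lists are assumptions made for illustration.

```python
import math


def pearson(human_scores, ai_scores):
    """Pearson correlation between two equally long lists of scores.

    Hypothetical helper for illustration, not part of any real eval harness.
    """
    if len(human_scores) != len(ai_scores) or len(human_scores) < 2:
        raise ValueError("need two lists of the same length, at least 2 items")

    mean_h = sum(human_scores) / len(human_scores)
    mean_a = sum(ai_scores) / len(ai_scores)

    # Sum of products of deviations from the means (unnormalized covariance).
    cov = sum((h - mean_h) * (a - mean_a)
              for h, a in zip(human_scores, ai_scores))
    # Square roots of the summed squared deviations for each list.
    dev_h = math.sqrt(sum((h - mean_h) ** 2 for h in human_scores))
    dev_a = math.sqrt(sum((a - mean_a) ** 2 for a in ai_scores))

    if dev_h == 0 or dev_a == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return cov / (dev_h * dev_a)
```

A value near 1 means the AI ranks scenarios roughly the way the human does; a value near 0 is the "barely any correlation" case.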
Here is how I score adherence (the AI used the same scale, though I also tried methods where it would just return a boolean "pass" or not):
<0.2 → Poor – Misses core intent; largely irrelevant or incorrect.
<0.4 → Weak – Partially relevant; significant omissions or errors.
<0.6 → Fair – Covers main points but lacks completeness or precision.
<0.8 → Good – Mostly accurate; minor gaps or deviations.
<=1.0 → Excellent – Fully aligned; precise, comprehensive, and faithful to intent.
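In code, the scale above amounts to a simple threshold lookup. This is only a sketch: the names `BOUNDS`, `LABELS`, and `adherence_label` are made up for illustration, and it assumes scores are floats in the range 0 to 1.

```python
import bisect

# Upper bounds from the scale: < 0.2 is Poor, < 0.4 is Weak, and so on.
# Anything from 0.8 up to and including 1.0 is Excellent.
BOUNDS = [0.2, 0.4, 0.6, 0.8]
LABELS = ["Poor", "Weak", "Fair", "Good", "Excellent"]


def adherence_label(score: float) -> str:
    """Map an adherence score in [0, 1] to its label (hypothetical helper)."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be between 0 and 1")
    # bisect_right counts how many bounds are <= score, so a score exactly
    # on a bound (e.g. 0.2) falls into the next band, matching the "<" rules.
    return LABELS[bisect.bisect_right(BOUNDS, score)]
```

The boolean variant would then just be a cutoff on the same score, with the cutoff itself being a judgment call.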
Here is the scenario list (the actual prompts are much more detailed):
dragon-bottle-stopper
editing-param-mid-conv
editing-parametric-enclosure
editing-swap-material-param
editing-text-edit-cube
multi-turn-bird-house
multi-turn-dice-tower
multi-turn-modular-planter
multi-turn-phone-stand
multi-turn-shelf
one-shot-bookend
one-shot-cable-clip
one-shot-chess-queen
one-shot-coaster
one-shot-coffee-cup
one-shot-dog-tag
one-shot-dragon-figurine
one-shot-hex-bracket
one-shot-keychain-fob
one-shot-low-poly-tree
one-shot-pegboard-hook
one-shot-pi4-case
one-shot-threaded-jar
[0]: https://grandpacad.com
Very cool project. Thanks for sharing!