How do you guys manage regressions as a whole with every new model update? A massive end-to-end problem-solving test set to see how the models compare?

A mix of evals and vibes.

What's that ratio, exactly?


Are you doing any Digital Twin testing or simulations? I imagine you can't test a product like Claude Code using traditional means.

"Evals and vibes" can I put that on a t shirt?

Remember when they shipped that version that didn't actually start or run? At work we were goofing on them a bit, until I said, "Wait, how did their tests even pass on that?" We realized that whatever their CI/CD process was, at the time it wasn't running against the actual release binary. I imagine their variation on how most engineers think about CI/CD is indicative of some other patterns (or a lack of traditional patterns).
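The failure mode described above — tests passing while the shipped binary is broken — is what a post-build smoke test guards against: CI exercises the release artifact itself, not just the source tree. A minimal sketch, where `mytool` and its version string are made-up stand-ins for whatever a real build step produces:

```shell
set -eu

# Stand-in for the build step: produce the artifact that will actually ship.
# (In a real pipeline this would be the packaged/compiled release binary.)
mkdir -p dist
printf '#!/bin/sh\necho "mytool 1.2.3"\n' > dist/mytool
chmod +x dist/mytool

# Smoke test the shipped artifact: if the release binary can't even start
# and print its version, fail the pipeline before publishing.
out="$(./dist/mytool --version)"
case "$out" in
  "mytool "*) echo "smoke test passed" ;;
  *)          echo "smoke test failed: $out" >&2; exit 1 ;;
esac
```

The key design point is that the smoke test invokes `./dist/mytool` — the file CI is about to release — rather than running unit tests against the checked-out source, which can pass even when packaging is broken.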

As someone who used to work on Windows, I had envisioned a similarly scoped e2e testing harness, along the lines of Windows Vista/7 (knowing about bugs and issues doesn't mean you can necessarily fix them ... hence Vista, then 7), and I assumed Anthropic must back some enterprise guarantee with the testing matrix I imagined had to exist. Long way of saying: I think they might just YOLO regressions by constantly updating their testing/acceptance criteria.

I use a self-documenting recursive workflow: https://github.com/doubleuuser/rlm-workflow