Agree w/ you on the model's tendency to butcher things. Performance-wise, this almost feels like the GPT-OSS model.
I need to incorporate "risk of major failure" into bluey bench. Spark is a dangerous model. It doesn't strongly internalize the consequences of the commands it runs, even on xhigh. As a result, I'm seeing a high tendency to run destructive commands.
For instance, I asked it to assign random numbers to the filenames of the videos in my folder to run the benchmark. It accidentally deleted the files on most of the runs. The funniest part is that it comes back to you within a few seconds and says something like "Whoops, I have to keep it real, I just deleted the files in your folder."
Ouch, at least it fesses up. I ran into problems with it first refusing to use git "because of system-level rules in the session". Then later it randomly amended a commit and force-pushed it because it had made a dumb mistake. I guess it was embarrassed.