Do you write/maintain evals? This is something I want to get into more. Otherwise I feel really blind and feel compelled to just drop money on frontier models.
Not really. I have one I made for fun where I let LLMs control a text editor called Kakoune, and then give them no other way to do things, to see how they deal with it, but that's not really a scenario I expect them to do well at.
So far most of them have done very poorly on that one, because they are all overtrained on just executing shell commands.
A former colleague of mine and I made a simple test as a baseline of "everything worth using should be able to do this pretty easily and swiftly", but that's just some very minor code generation with a very straightforward, boilerplate-type pattern.
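For anyone wanting a starting point, a baseline check of that kind might look roughly like the sketch below. It is not the actual test described above: `CASES`, `fake_model`, and `run_baseline` are hypothetical names, the stub stands in for a real model call, and the substring checks are deliberately crude.

```python
import time
from typing import Callable

# Hypothetical cases: each pairs a prompt with a cheap check on the output.
CASES = [
    ("Write a Python function add(a, b) that returns a + b.",
     lambda out: "def add(a, b)" in out and "return a + b" in out),
    ("Write a Python function is_even(n) that returns n % 2 == 0.",
     lambda out: "def is_even(n)" in out and "n % 2 == 0" in out),
]

def fake_model(prompt: str) -> str:
    """Stand-in for a real model call (assumption); only handles the first case."""
    if "add(a, b)" in prompt:
        return "def add(a, b):\n    return a + b\n"
    return "Sorry, I can't help with that."

def run_baseline(model: Callable[[str], str], max_seconds: float = 5.0) -> dict:
    """Run every case once and count passes and overly slow responses."""
    passed = 0
    slow = 0
    for prompt, check in CASES:
        start = time.perf_counter()
        output = model(prompt)
        elapsed = time.perf_counter() - start
        if elapsed > max_seconds:
            slow += 1  # "swiftly" is part of the baseline, so track latency too
        if check(output):
            passed += 1
    return {"total": len(CASES), "passed": passed, "slow": slow}

if __name__ == "__main__":
    print(run_baseline(fake_model))
```

Swapping `fake_model` for a function that calls whatever model you are evaluating is the only change needed; in practice you would probably run the generated code instead of matching substrings.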