At work, the paid GitLab Duo (which is supposed to be a blend of various top models) gets anything in our more complex codebase hilariously wrong every time. Maybe our codebase is obscure to it (though it shouldn't be, it's standard Java stuff with the usual open source libs), but it just can't add real value for anything beyond small snippets here and there.

For me, the litmus test for any LLM is flawless creation of complex regexes from a well-formed prompt. I don't mean trivial stuff like email validation, but rather expressions at the limits of the regex specs. Not almost there, but exactly there.
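
A minimal sketch of what such a pass/fail check could look like, in Python for brevity. Everything in it is an assumption made up for illustration: the task (a non-empty string with no two identical adjacent characters), the candidate pattern, the test cases, and the helper names `failures` and `passes`. The point is only that a single wrong verdict fails the whole test.

```python
import re

# Hypothetical candidate, as an LLM might return it: a non-empty string with
# no two identical adjacent characters (negative lookahead + backreference).
CANDIDATE = r"(?!.*(.)\1).+"

# Hypothetical test cases: (input, should_match)
CASES = [
    ("abc", True),
    ("abab", True),
    ("aab", False),
    ("abcc", False),
    ("", False),
]


def failures(pattern, cases):
    """Return the cases where the pattern's verdict differs from the expected one."""
    compiled = re.compile(pattern)
    return [
        (text, expected)
        for text, expected in cases
        if (compiled.fullmatch(text) is not None) != expected
    ]


def passes(pattern, cases):
    """All-or-nothing: one wrong verdict means the pattern fails."""
    return not failures(pattern, cases)


if __name__ == "__main__":
    print(passes(CANDIDATE, CASES))
    # An "almost there" variant that only checks the first two characters:
    print(failures(r"(?!(.)\1).+", CASES))
```

The second pattern looks plausible and gets four of the five cases right, which is exactly the kind of answer that counts as a fail here.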