> You are a senior SWE-Bench reviewer, make no mistakes.
I don't know what a better approach would look like while still remaining feasible, however this approach of telling a LLM to make a subjective judgement seems fundamentally flawed.
> You are a senior SWE-Bench reviewer, make no mistakes.
I don't know what a better approach would look like while still remaining feasible, however this approach of telling a LLM to make a subjective judgement seems fundamentally flawed.
More importantly, I suspect this actually hinders the work. If the LLM does make a mistake, it's now incentivized to downplay it instead of acknowledging and correcting.
This approach is effectively seeding the context with how you want the LLM to behave/operate ("senior reviewer", i.e. the style of the responses you want) and the context/domain in which the LLM is operating in ("SWE-Bench").
This is common in system prompts and frames the responses.
For example, you'd get different responses saying:
1. you are a pirate writing sea shanties about programming;
2. you are a news reporter writing an article on physics;
3. you are a senior software engineer with complete knowledge of PostgreSQL.
For 1 you could get responses along the lines of the Wellerman sea shanty -- "There once was a program that was set to C ...".
The "make no mistakes" bit does look dubious. It would be interesting comparing the results with and without that bit and trying alternative ways of getting the same desired behavior.
This is not actually what the reviewer prompt says, or perhaps it is, I don't know since they don't make it public. I'm just pointing out how it seems like a bad idea to ask a LLM to make a subjective judgement on things like "taste". If the SOTA LLM witting the code could not produce tasteful code then why would a different LLM be able to judge the "taste" of that code?
Which LLM should we even use to judge taste? Is it giving an unfair advantage to Model X if we use Model X as the judge? Maybe we should use multiple models as the judge, but now the model that's best at recognising and praising its own code has an advantage. The whole thing is just an unsolvable problem when a LLM is the judge.
> Is it giving an unfair advantage to Model X if we use Model X as the judge?
There have been studies that showed that models tended to rate responses from their own family of models better than equivalent responses from other families, eg. gpt-4 would prefer a response from gpt-3
The “make no mistakes” admonition does seem pretty silly (it’s been skewered to death on yt), but… it’s easy to imagine how it might work. E.g. it could be interpreted as simply as “check your work”.
Of course, no-one seems to be (publicly) doing the comparative measurements that might allow us to reach rational conclusions here.
Conversations in its training data that explicitly mentioned "make no mistakes" don't strike me as particularly rich sources of high-quality reasoning signals. They strike me as conversations with Pointy-haired Bosses.
I'm not sure if they've fixed this, but older models have a tendency to ignore negation as `no`, `not`, etc. all occur frequently in the training data so are weighted less strongly than the verbs and nouns.
The advice I've heard is to emphasize the traits you want, not discourage the traits you don't. So rather than saying "make no mistakes" you can do something like you suggested with writing it as "check your work" or "ensure you answer correctly and concisely".