The input data is still human produced. Who decides which code follows the specification and which doesn't? And who produces that code? Are you sure that the code another model produces will look like that? If not, nothing will prevent you from running into adversarial inputs.
And sure, coverage and lints are objective metrics, but they don't directly imply the correctness of a test. Some tests can reach high coverage and pass all the lint checks but still be incorrect or test the wrong thing!
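Something like this, to make it concrete (made-up function and test, just for illustration, but runnable under pytest): the test executes every line of the function and is perfectly lint-clean, yet it passes even though the implementation is wrong.

    # Hypothetical example: 100% line coverage, lint-clean, still wrong.
    def apply_discount(price, rate):
        # Bug: should be price * (1 - rate).
        return price * rate

    def test_apply_discount():
        # Covers every line of apply_discount and passes all lint checks,
        # but only checks the result's type, never the value.
        result = apply_discount(100.0, 0.2)
        assert isinstance(result, float)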
Whether the test passes or not is, at most, correlated with whether it's correct. But similarly, for an image recognizer the question of whether an image is a flower or not is also objective and correlated, and yet researchers keep finding adversarial inputs for image recognizers due to biases in their training data. What makes you think this won't happen here too?
> The input data is still human produced
So are the rules for the game of Go or chess? Specifying code that satisfies (or doesn't satisfy) a spec is a problem statement - evaluation is automatic.
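Roughly what I mean by "evaluation is automatic", as a sketch (the spec, the harness and the my_sort name are all made up for illustration): once the spec is executable, scoring a candidate solution needs no human in the loop.

    # Hypothetical harness: an executable spec makes checking automatic.
    def spec_ok(sort_fn):
        # "Specification" for a sorting function, expressed as checks.
        cases = [[], [3, 1, 2], [5, 5, 1], list(range(10, 0, -1))]
        return all(sort_fn(c) == sorted(c) for c in cases)

    def evaluate(candidate_source):
        # Execute candidate code and score it against the spec; no human needed.
        env = {}
        try:
            exec(candidate_source, env)  # assumes the candidate defines my_sort
            return 1.0 if spec_ok(env["my_sort"]) else 0.0
        except Exception:
            return 0.0

    print(evaluate("def my_sort(xs):\n    return sorted(xs)"))  # 1.0
    print(evaluate("def my_sort(xs):\n    return xs"))          # 0.0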
> but they don't directly imply the correctness of a test.
I'd be willing to bet that if you start with an existing coding model and continue training it with coverage/lint metrics and evaluation as feedback, you'd get better at generating tests. It would be slow, and figuring out how to build a problem dataset from existing codebases would be the hard part.
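Rough sketch of the feedback signal I have in mind (pytest-cov and ruff are real tools, but "mypackage", the weights and the wiring here are just assumptions):

    # Hypothetical reward for a model-generated test file:
    # 0 if the test fails, otherwise a mix of coverage and lint cleanliness.
    import json
    import subprocess

    def reward(test_file):
        run = subprocess.run(
            ["pytest", "-q", test_file, "--cov=mypackage", "--cov-report=json"]
        )
        if run.returncode != 0:
            return 0.0  # the generated test didn't even pass
        with open("coverage.json") as f:
            cov = json.load(f)["totals"]["percent_covered"] / 100.0
        lint_ok = subprocess.run(["ruff", "check", test_file]).returncode == 0
        return 0.7 * cov + 0.3 * (1.0 if lint_ok else 0.0)  # arbitrary weights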
> So are the rules for the game of Go or chess?
The rules are well defined, and you can easily write a program that tells you whether a move is valid or not, or whether a game has been won or not. This lets you generate a virtually infinite amount of data to train the model on without human intervention.
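For scale, the whole "verifier" for a game can be a few lines (tic-tac-toe here as a stand-in for Go/chess, purely to keep it short); once you have it, every random self-play game is a labelled training example with no human involved.

    # Tic-tac-toe as a stand-in: move legality and the win condition are
    # trivially checkable, so labelled games can be generated endlessly.
    import random

    LINES = [(0,1,2), (3,4,5), (6,7,8), (0,3,6), (1,4,7), (2,5,8), (0,4,8), (2,4,6)]

    def legal(board, move):
        # A move is legal iff the target square is empty.
        return board[move] == " "

    def winner(board):
        # Return "X" or "O" if a line is completed, else None.
        for a, b, c in LINES:
            if board[a] != " " and board[a] == board[b] == board[c]:
                return board[a]
        return None

    def random_game():
        board, player = [" "] * 9, "X"
        while winner(board) is None and " " in board:
            move = random.choice([i for i in range(9) if legal(board, i)])
            board[move] = player
            player = "O" if player == "X" else "X"
        return board, winner(board)  # an automatically labelled example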
> Specifying code that satisfies (or doesn't satisfy) a spec is a problem statement
This would be true if you fixed one specific program (just as in Go or chess you fix the specific rules of the game and then train a model on them) and wanted to know whether that specific program satisfies some given specification (which would be the input of your model). But if instead you want the model to work with any program, then the program will have to become part of the input too, and you'll have to train it on a number of programs, which will have to be provided somehow.
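Put differently, the shape of the evaluator changes (signatures made up, just to show the difference):

    # Fixed game: the rules are baked into the evaluator; only positions/moves vary.
    def evaluate_go_move(position, move) -> bool: ...

    # General coding model: the program and the spec are themselves inputs,
    # so someone still has to supply the distribution of (spec, program) pairs.
    def evaluate_candidate(spec, program) -> bool: ...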
> and figuring out how to build a problem dataset from existing codebases would be the hard part
This is the "Human Feedback" part that the tweet author talks about, and the one that will always be flawed.