I think it should be tested on goals.
E.g. crack this puzzle, or fix this code so these tests pass. (A human can then verify it didn't cheese things.)
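To make that concrete, here is a rough sketch of what a goal-based check might look like. Everything in it is made up for illustration: `passes_goal`, `CASES`, and the two candidate fixes are hypothetical names, and the "goal" is just a toy squaring function.

```python
from typing import Callable, Iterable, Tuple


def passes_goal(candidate: Callable[[int], int],
                cases: Iterable[Tuple[int, int]]) -> bool:
    """Return True only if the candidate matches the expected output on every case."""
    return all(candidate(x) == expected for x, expected in cases)


# Hypothetical goal: "fix this code so these tests pass" for a squaring function.
CASES = [(0, 0), (2, 4), (-3, 9)]


def honest_fix(x: int) -> int:
    # Actually solves the task.
    return x * x


def cheesed_fix(x: int) -> int:
    # Hard-codes the known test inputs instead of solving the task.
    return {0: 0, 2: 4, -3: 9}.get(x, 0)


print(passes_goal(honest_fix, CASES))
print(passes_goal(cheesed_fix, CASES))
```

Both candidates reach the stated goal on `CASES`, which is why the human check matters: reading `cheesed_fix`, or trying an input the tests did not cover such as `(5, 25)`, shows it never solved the problem.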