For anyone not familiar, this is SWE-bench: https://huggingface.co/datasets/princeton-nlp/SWE-bench
One of the examples in the dataset was taken from this issue:
https://github.com/pvlib/pvlib-python/issues/1028
What the AI is expected to produce:
https://github.com/pvlib/pvlib-python/pull/1181/commits/89d2...
Make up your own mind about the test.
My favorite was always the HumanEval dataset.
Surely in all of GitHub such code doesn't exist! Surely in all of GitHub we can filter such code out by n-gram!
Maybe my favorite part is that it has 60 authors and became the de facto benchmark for a while.
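
To make the n-gram point concrete, here is a rough sketch of what overlap filtering looks like and why it is easy to slip past. Everything in it is my own illustration: the whitespace tokenizer, the window size n=5, and the toy snippets are assumptions, not what the HumanEval or SWE-bench authors actually ran.

```python
def ngrams(tokens, n):
    """Return the set of all contiguous n-token windows."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlaps(candidate, reference, n=5):
    """True if candidate shares at least one n-gram with reference.

    Tokenization is a plain whitespace split (an assumption made
    for this sketch; real pipelines may tokenize differently).
    """
    return bool(ngrams(candidate.split(), n) & ngrams(reference.split(), n))


# Hypothetical benchmark solution and two "found on GitHub" variants.
benchmark_solution = "def add(a, b): return a + b"
verbatim_copy = "# utils\ndef add(a, b): return a + b"
renamed_copy = "def add(x, y): return x + y"

print(overlaps(verbatim_copy, benchmark_solution))  # exact copy is caught
print(overlaps(renamed_copy, benchmark_solution))   # renamed copy is not
```

A verbatim copy gets flagged, but the same function with renamed variables shares no 5-gram with the original and passes the filter.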