For anyone not familiar, this is SWE-bench: https://huggingface.co/datasets/princeton-nlp/SWE-bench

One of the examples in the dataset was taken from this issue:

https://github.com/pvlib/pvlib-python/issues/1028

And here's what the AI is expected to do:

https://github.com/pvlib/pvlib-python/pull/1181/commits/89d2...
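
Roughly: given the issue text and the repo at a base commit, the model has to produce a patch; the harness applies it and re-runs the tests that the real fix turned from failing to passing. A minimal sketch of that loop (the function name, paths, and git/pytest calls are my illustration, not SWE-bench's actual harness):

  import subprocess

  def resolves_issue(repo_dir, patch_file, fail_to_pass):
      # Apply the model-generated patch to the repo at the issue's base commit.
      if subprocess.run(["git", "apply", patch_file], cwd=repo_dir).returncode != 0:
          return False  # the patch doesn't even apply cleanly
      # Solved iff the previously-failing tests for this issue now pass.
      return subprocess.run(["pytest", *fail_to_pass], cwd=repo_dir).returncode == 0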

Make up your own mind about the test.

My favorite was always the HumanEval dataset.

  Problem:
    1) we want to train on GitHub repos
    2) most benchmarks are contaminated, and training on GitHub would definitely contaminate this one

  Solution:
    Hand-write new problems!!!
    ... leetcode style ...
    ... and we'll check if the completion passes the tests (see the sketch below)

  Example:
    What's the decimal part of this float?
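
Concretely, that task looks something like this. A sketch resembling HumanEval's truncate_number problem; the prompt, signature, and tests here are my paraphrase, not the dataset verbatim:

  def truncate_number(number: float) -> float:
      """Given a positive float, return its decimal (fractional) part."""
      return number % 1.0

  # Grading: run the model's completion against hidden unit tests like these.
  assert truncate_number(3.5) == 0.5
  assert abs(truncate_number(1.33) - 0.33) < 1e-6
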
Surely in all of GitHub such code doesn't exist!

Sure, in all of GitHub we can filter such code out by n-gram!
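
(For reference, "filter by n-gram" means decontamination along these lines; a minimal sketch, where whitespace tokenization and n=13 are my illustrative choices:)

  def ngrams(tokens, n=13):
      return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

  def is_contaminated(train_doc, benchmark_doc, n=13):
      # Flag a training document that shares any n-gram with the benchmark.
      return bool(ngrams(train_doc.split(), n) & ngrams(benchmark_doc.split(), n))

The point being: a one-liner like number % 1.0 is exactly the kind of code that n-gram filtering can't meaningfully scrub from GitHub-scale data.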

Maybe my favorite part is that it has 60 authors and became the de facto benchmark for a while.