My favorite was always the HumanEval dataset.
Problem:
1) we want to train on GitHub repos
2) most datasets are spoiled: their solutions are already on GitHub, so training on GitHub would definitely leak the answers
Solution:
Hand write new problems!!!
... LeetCode style ....
... and we'll check if the generated code passes the tests
Example:
What's the decimal part of this float?
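A minimal sketch of what such a problem and its check could look like. The names `decimal_part`, `PROMPT`, `COMPLETION`, and `passes_tests` are made up for illustration and are not the benchmark's actual harness; the idea is only that the model sees a signature plus docstring, writes the body, and the result is run against hand-written tests.

```python
# Hypothetical problem in the "signature + docstring" style.
PROMPT = '''def decimal_part(number: float) -> float:
    """Return the decimal part of a positive float.
    >>> decimal_part(3.5)
    0.5
    """
'''

# Stand-in for what a model might generate as the function body.
COMPLETION = "    return number % 1.0\n"


def passes_tests(prompt: str, completion: str) -> bool:
    """Run prompt + completion and check it against hand-written tests."""
    namespace = {}
    try:
        # Assumption: running in-process is fine for a toy example.
        # Untrusted model output should be executed in a sandbox instead.
        exec(prompt + completion, namespace)
        f = namespace["decimal_part"]
        assert f(3.5) == 0.5
        assert abs(f(1.25) - 0.25) < 1e-9
        assert abs(f(123.0) - 0.0) < 1e-9
        return True
    except Exception:
        # Any crash or failed assertion counts as a failed sample.
        return False
```

The score is then just the fraction of problems (or samples) for which this kind of check returns True.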
Surely in all of GitHub such code doesn't exist!
Surely in all of GitHub we can filter such code out by n-gram!
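For reference, a rough sketch of what an n-gram filter might look like. Everything here is an assumption for illustration: whitespace tokenization, the window size `n`, and the helper names `ngrams` and `is_contaminated` are not taken from any particular pipeline.

```python
def ngrams(text: str, n: int) -> set:
    """Return the set of n-token windows in text (whitespace tokens)."""
    tokens = text.split()
    # If the text has fewer than n tokens, the range is empty.
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def is_contaminated(document: str, benchmark_texts: list, n: int = 8) -> bool:
    """Flag a training document that shares any n-gram with a benchmark item."""
    doc_grams = ngrams(document, n)
    return any(doc_grams & ngrams(b, n) for b in benchmark_texts)


benchmark = ["def decimal_part(number): return number % 1.0"]

# A verbatim copy is caught...
copied = "# found online\ndef decimal_part(number): return number % 1.0"
# ...but the same idea with different names and arithmetic is not.
rewritten = "def frac(x): return x - int(x)"
```

With `n=4`, `is_contaminated(copied, benchmark, n=4)` is True while `is_contaminated(rewritten, benchmark, n=4)` is False, which is exactly why the filter is less reassuring than it sounds.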
Maybe my favorite part is that it has around 60 authors and became the de facto benchmark for a while.