Assuming that is indeed what most of the benchmark does: if the LLMs are as bad at it as the numbers suggest, then it seems like a perfectly good benchmark. I would definitely want them to be able to do stuff like that when I let them write my code.