> I've had _every_ model fail this

That's probably because LLMs can't reliably follow step-by-step procedures (e.g. counting).