This article says Anthropic's models can write out the entire benchmark solution set word for word from memory.