I'm not at all surprised that LLMs are good at interactive theorem proving.

But I am surprised that Claude Code is good at it. Maybe I'm just not using it right, but when I have a task in a less popular programming language, Gemini 2.5 Pro tends to be much better than Claude Opus, despite Opus doing better on benchmarks. Better in the sense that Gemini will typically bang out a correct solution with best practices immediately, while Opus might take half a day of correcting basic errors and misunderstandings.

If anyone knows what I'm talking about and knows a way to improve this, please let me know. Claude still feels to me like an over-eager engineer, more interested in getting things done fast than in doing them correctly. It may be that with a task like theorem proving you can just burn credits and time, because there's a clear success criterion.

I use the Gemini MCP, and my CLAUDE.md has instructions to consult Gemini for strategy. Seems to work well. I'm on the lowest Claude plan, so I don't know how this would compare with, say, Opus.
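In case it helps, here's roughly what that section of my CLAUDE.md looks like (paraphrased from memory; the tool name `mcp__gemini__ask` is a made-up placeholder, yours will depend on how the server is registered):

```markdown
## Consulting Gemini

- Before writing any proof, call the Gemini MCP tool (placeholder name:
  `mcp__gemini__ask`) with the goal statement and ask for a high-level strategy.
- If two tactic attempts in a row fail, send Gemini the current proof state
  and the error messages before trying anything else.
- Prefer Gemini's suggested lemma decomposition over improvising your own.
```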

Separately, I have been meaning to implement a cheating detector, since I have run into Claude modifying problem statements, adding axioms, etc.
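A rough sketch of what I mean, assuming a Lean project (the `baseline/` and `work/` paths and the regexes are all made up for illustration):

```python
#!/usr/bin/env python3
"""Sketch of a cheating detector for LLM-written Lean proofs.

Compares Claude's working copies against pristine problem files and flags
changed theorem statements or suspicious keywords. Paths are assumptions.
"""
import re
import sys
from pathlib import Path

BASELINE = Path("baseline")  # pristine problem statements (made-up path)
WORK = Path("work")          # Claude's working directory (made-up path)

# Keywords that usually signal cheating in a Lean development.
SUSPICIOUS = re.compile(r"\b(axiom|sorry)\b")


def statements(text: str) -> str:
    """Crudely grab each theorem/lemma header up to the ':=' starting its proof."""
    return "\n".join(re.findall(r"(?:theorem|lemma)[\s\S]*?(?=:=)", text))


def check(base_file: Path) -> list[str]:
    work_file = WORK / base_file.name
    if not work_file.exists():
        return [f"{base_file.name}: missing from work tree"]
    work_text = work_file.read_text()
    issues = []
    if statements(work_text) != statements(base_file.read_text()):
        issues.append(f"{base_file.name}: theorem statement changed")
    issues += [
        f"{base_file.name}: suspicious keyword {m.group(1)!r}"
        for m in SUSPICIOUS.finditer(work_text)
    ]
    return issues


if __name__ == "__main__":
    found = [i for f in BASELINE.glob("*.lean") for i in check(f)]
    print("\n".join(found) or "clean")
    sys.exit(1 if found else 0)
```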

Okay thanks I'll try that.

> have run into Claude modifying problem statements, adding axioms, etc.

Same here. I've thought about creating a utility that tells Claude it has to keep going until a test suite exits with status zero. But I'm concerned Claude would just fake everything to make the test pass.
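For the record, the sort of wrapper I was imagining, with the test command and the non-interactive `claude -p` call both as assumptions about your setup:

```python
#!/usr/bin/env python3
"""Sketch: re-invoke Claude until the check passes (exit status zero).

TEST_CMD and the `claude -p` invocation are assumptions; the point is the
loop structure, not the exact commands.
"""
import subprocess

TEST_CMD = ["lake", "build"]  # or whatever your test runner is (assumption)
MAX_ROUNDS = 10               # cap so a stuck Claude doesn't burn credits forever

for rounds in range(MAX_ROUNDS):
    result = subprocess.run(TEST_CMD, capture_output=True, text=True)
    if result.returncode == 0:
        print(f"passed after {rounds} fix round(s)")
        break
    # Feed the failure back to Claude non-interactively.
    prompt = (
        "The check below failed. Fix the code. Do NOT modify theorem "
        "statements, add axioms, or use sorry.\n\n"
        + result.stdout
        + result.stderr
    )
    subprocess.run(["claude", "-p", prompt])
else:
    print(f"still failing after {MAX_ROUNDS} rounds")
```

Even then you'd probably want to run something like the cheating detector upthread after each round, since "make the test pass" is exactly the objective Claude is tempted to game.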

Do you need an MCP for that? Why not tell Claude to call the Gemini CLI?

You could do that. My understanding is that an MCP tool gives Claude less to "think about" than having to drive another program's CLI correctly, and therefore avoids context clutter. I could be wrong.