Why would you not have the bot write in a formal language (e.g. Lean) and then just typecheck it? Then you only need to decide that your definitions were interesting. LLMs are now extremely good at programming given a compiler/typechecker, so I'd expect them to be good at formal math as well. It's nearly (if not precisely) the exact same activity.

That’s what they do usually I understand - llm generates proof in lean, and proof checker proves.