I've been strapping different LLM-based setups to Lean 4 with a variety of prompting methods. My biggest conclusion so far is that LLMs are worse at formalizing than humans are. Additionally, for Lean 4 specifically, I don't think there's enough training data.
I'm of the opinion that formalization is the biggest bottleneck for current-generation LLMs. However, I don't think that necessarily means LLMs don't benefit from formal methods. Given existing abstractions, Lean 4's exceptional tooling allows for more efficient iteration with LLMs and requires less human supervision, since Lean's language server provides specific, actionable feedback whenever the LLM makes a mistake. I've also noticed that LLMs can reason about code written in Lean 4 far more effectively than about code written in Python, despite Python having orders of magnitude more training data than Lean.
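To give a toy sketch of what I mean by actionable feedback (the theorem and tactic here are just an illustration, not taken from my actual pipeline): when a generated tactic fails, Lean reports the exact unsolved goal at the exact line, and that message can be fed straight back into the prompt.

```lean
-- Toy illustration: a tiny Lean 4 theorem an LLM might generate.
-- If the model had emitted a wrong tactic on the line below, Lean's language
-- server would report something like "unsolved goals ⊢ n + 0 = n" pinned to
-- that line -- the specific, actionable feedback mentioned above.
theorem add_zero_example (n : Nat) : n + 0 = n := by
  rfl  -- `n + 0` reduces definitionally to `n`, so `rfl` closes the goal
```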
Nonetheless, I concur that LLMs don't yet know how to translate a request stated in a prompt into a complete Lean 4 formalization. My practice so far has usually been to first choose an existing reference file that is close to my desired goal and use it as "inspiration" for how the LLM should go about the formalization.
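For concreteness, here is the scale of statement I mean, with made-up names (this is a hypothetical sketch, not a file from my workflow): the prompt "the sum of two even naturals is even" should turn into something like the following, and the reference file shows the model which idioms to imitate.

```lean
-- Hypothetical target: formalizing "the sum of two even naturals is even".
-- Evenness is spelled out with ∃ to keep the example self-contained;
-- with Mathlib, `Even` would be the idiomatic spelling.
theorem even_add_even (a b : Nat)
    (ha : ∃ k, a = 2 * k) (hb : ∃ k, b = 2 * k) :
    ∃ k, a + b = 2 * k := by
  obtain ⟨m, hm⟩ := ha
  obtain ⟨n, hn⟩ := hb
  exact ⟨m + n, by omega⟩  -- linear arithmetic closes a + b = 2 * (m + n)
```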
Yeah, we really need LLMs to work swimmingly with Lean 4. The current state is hot garbage: they don't understand proof composition, exploring proof extensions, lemma search, etc. They don't explore an open-ended node of a mathematical knowledge graph by substituting in various options.
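To be fair, some building blocks for lemma search already exist on the Lean side; for example, the `exact?` tactic does a library-wide search and prints a concrete suggestion, which is exactly the kind of signal an LLM-in-the-loop could consume. A toy example (not an existing integration):

```lean
import Mathlib.Tactic  -- widens the lemma database that `exact?` can search

-- `exact?` looks for a lemma that closes the goal and prints a
-- "Try this:" suggestion, here `exact Nat.add_comm a b`.
example (a b : Nat) : a + b = b + a := by
  exact?
```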
I'd happily work with someone on a conversational theorem prover, if anyone's up for it.
Join the Lean Zulip. There are many people interested in this.
https://leanprover.zulipchat.com/