What you’re describing would turn deterministic engineering into the same march of 9s that FSD and robotics are going through now, but for every single workflow. If you can’t inspect the code for correctness and debug it, then your test suite has to be perfect and cover every possible outcome. Since that’s not possible for nontrivial software, you’re starting a march of 9s toward 100% correctness for each solution.
That accounting software will need 100M unit tests before you can be certain it covers all your legal requirements (hyperbole, but you get the idea). Who’s going to verify all those tests? Do you need a reference implementation to compare against?
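To make the reference-implementation question concrete, here is a rough sketch of what comparing against one looks like. Everything in it is made up for illustration: `reference_vat` stands in for a trusted, human-readable implementation, `opaque_vat` stands in for generated code nobody reads, and the "round half up" rule is an assumed legal requirement, not a real one.

```python
from decimal import Decimal, ROUND_HALF_UP

RATES = (5, 7, 19, 20)  # hypothetical tax rates in percent


def reference_vat(net_cents: int, rate_percent: int) -> int:
    """Hypothetical trusted implementation: round half up to the nearest cent."""
    exact = Decimal(net_cents) * Decimal(rate_percent) / Decimal(100)
    return int(exact.quantize(Decimal("1"), rounding=ROUND_HALF_UP))


def opaque_vat(net_cents: int, rate_percent: int) -> int:
    """Hypothetical stand-in for uninspected code.

    Python's built-in round() rounds exact halves to the nearest even
    integer, which silently disagrees with the assumed legal rule.
    """
    return round(net_cents * rate_percent / 100)


def find_mismatches(max_net_cents: int) -> list[tuple[int, int]]:
    """Compare both implementations on every input below max_net_cents."""
    return [
        (net, rate)
        for net in range(max_net_cents)
        for rate in RATES
        if reference_vat(net, rate) != opaque_vat(net, rate)
    ]


# A plausible hand-written unit test passes on the opaque version...
assert opaque_vat(100, 20) == 20
# ...while a sweep against the reference still turns up disagreements.
print(find_mismatches(100))
```

The sweep only works here because the input space is tiny and a reference exists. For real accounting software you can't enumerate the inputs, and someone still has to write and verify the reference, which is the original problem again.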
Making LLM work opaque to inspection is kind of like pasting the result of a mathematical proof without any of the supporting steps (which is almost worthless, AFAIK).