I do not trust the LLM to do it correctly. You and I do not have the same experience with them, and we should not assume everyone does. To me, it doesn’t make sense to ask that question.

We should be able to measure this. I think verifying things is something an LLM can do better than a human.

You and I disagree on this specific point.

Edit: I find your comment a bit distasteful. If you can provide a scenario where an LLM gets the verification wrong, that’s a good discussion point. I don’t see many places where LLMs can’t verify as well as humans. If I implement a new piece of business logic, say “users from country X should not be able to use this feature”, an LLM can easily verify it by generating its own sample API call and checking the response.
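
As a rough sketch of the kind of check I mean (the staging URL, the X-User-Country header, and the 403/200 expectations are all made up for illustration; a real API would differ), the LLM could generate and run something like this on its own:

```python
# Hypothetical verification script for the rule
# "users from country X should not be able to use this feature".
# Endpoint, header name, and expected status codes are assumptions.
import requests

BASE_URL = "https://staging.example.com/api/feature"  # assumed staging endpoint


def check_feature_access(country_code: str) -> int:
    """Call the feature endpoint as a user from the given country
    and return the HTTP status code."""
    resp = requests.post(
        BASE_URL,
        json={"action": "use_feature"},
        headers={"X-User-Country": country_code},  # assumed geo header
        timeout=10,
    )
    return resp.status_code


if __name__ == "__main__":
    # Blocked country should be rejected (assumed 403),
    # everyone else should get through (assumed 200).
    blocked = check_feature_access("XX")  # "XX" stands in for country X
    allowed = check_feature_access("US")
    assert blocked == 403, f"expected 403 for blocked country, got {blocked}"
    assert allowed == 200, f"expected 200 for allowed country, got {allowed}"
    print("geo-restriction check passed")
```

Whether the model gets the real endpoint and header right on the first try is beside the point; it can read the codebase or the API docs, adjust, and rerun, which is exactly the verification loop I’m describing.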