A one-way audio channel is indeed too weak for a person to distinguish a person from a recording, but a bidirectional audio channel is easily strong enough: the person can verbally ask the person-or-recording a question and see if it is acknowledged.

I claim that a modern frontier LLM can be given simple instructions that make it impossible for a person to reliably distinguish it from a person over a bidirectional text-only medium.