I'm running a local voice agent on a Mac Mini M4: Qwen ASR for STT and Qwen TTS, both on Apple Silicon via MLX, with Claude as the LLM. No API costs beyond the Claude subscription, but the interesting part is that the LLM is agentic because it's Claude Code. It reads and writes files, spawns background agents, and controls devices, all through voice.
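For anyone curious about the shape of the loop, here's a minimal sketch of one voice turn. The function bodies are placeholders, not the real MLX or Claude Code APIs; in my setup the agent call shells out to the `claude` CLI, and the STT/TTS calls wrap the MLX models.

```python
def transcribe(audio_path: str) -> str:
    """Placeholder for the MLX Qwen ASR call."""
    return "turn on the office lights"

def ask_agent(prompt: str) -> str:
    """Placeholder for Claude Code; the real version invokes the CLI
    non-interactively, and the agent can read/write files and run tools
    before replying."""
    return f"done: {prompt}"

def speak(text: str) -> None:
    """Placeholder for the MLX Qwen TTS call."""
    print(f"[tts] {text}")

def handle_turn(audio_path: str) -> str:
    """One full voice turn: record -> transcribe -> agent -> speak."""
    text = transcribe(audio_path)
    reply = ask_agent(text)
    speak(reply)
    return reply
```

It's fully blocking (each stage waits for the previous one to finish), which is exactly the latency problem v2 is meant to fix.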

The insights about VAD and streaming pipelines in this thread are exactly what I'm looking at for v2. Moving to a WebSocket streaming pipeline with proper voice activity detection would narrow the latency gap significantly, even with local models.
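To make the endpointing idea concrete, here's a crude stdlib-only sketch of the kind of VAD gate I have in mind: classify each 16-bit PCM frame by energy, and declare end-of-utterance after a run of silent frames. The threshold and patience values are made-up defaults, and a real pipeline would use a proper VAD (WebRTC VAD, Silero) rather than raw energy.

```python
import struct

def frame_energy(frame: bytes) -> float:
    # 16-bit little-endian PCM; mean absolute amplitude as a crude energy proxy
    samples = struct.unpack(f"<{len(frame) // 2}h", frame)
    return sum(abs(s) for s in samples) / max(len(samples), 1)

def is_speech(frame: bytes, threshold: float = 500.0) -> bool:
    # threshold is an illustrative value, not tuned
    return frame_energy(frame) > threshold

class EndpointDetector:
    """Flags end-of-utterance after `patience` consecutive silent frames
    (e.g. 25 x 20 ms frames ~= 500 ms of trailing silence)."""

    def __init__(self, patience: int = 25):
        self.patience = patience
        self.silent = 0
        self.in_utterance = False

    def feed(self, frame: bytes) -> bool:
        """Feed one audio frame; returns True when an utterance just ended."""
        if is_speech(frame):
            self.in_utterance = True
            self.silent = 0
            return False
        if self.in_utterance:
            self.silent += 1
            if self.silent >= self.patience:
                self.in_utterance = False
                self.silent = 0
                return True
        return False
```

In a WebSocket pipeline, each incoming frame would be fed to the detector while also being streamed to the ASR model, so transcription starts before the user stops talking and finalizes the moment `feed()` fires.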