I'm doing this now with Home Assistant voice. The TTS, STT, and LLMs involved all run locally on my network. It's absurdly superior to every other voice assistant product. (It would be nice if it were just a pure multi-modal model, though.)