More than you can physically fit in a phone like that. Many hundreds if not thousands of watts of GPU.

That's not true. You could run such an LLM on a lower-end laptop GPU, or even a phone GPU, with very low power draw and a small memory footprint. This isn't 2023 anymore; a Santa-specific LLM would not be that intensive.
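Rough numbers, assuming 4-bit quantized weights and hand-waving away KV cache and runtime overhead (which add maybe 10-30% depending on context length):

```python
# Back-of-envelope memory for the weights of a 4-bit quantized model.
def weight_footprint_gb(n_params_billion: float, bits_per_weight: float = 4.0) -> float:
    """Approximate size of the weights alone, in GB."""
    return n_params_billion * 1e9 * bits_per_weight / 8 / 1e9

for size in (3, 8, 70):
    print(f"{size}B params @ 4-bit ≈ {weight_footprint_gb(size):.1f} GB")
# 3B  ≈  1.5 GB -> fits in phone RAM
# 8B  ≈  4.0 GB -> fits on a mid-range laptop GPU
# 70B ≈ 35.0 GB -> not phone territory
```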

I've done a fair amount of fine-tuning for conversational voice use cases. Smaller models can do a really good job on a few things: routing to bigger models, constrained scenarios (think ordering food items from a specific and known menu), and focused tool use.
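A minimal sketch of that routing pattern, with hypothetical callables standing in for the small and big models (none of this is any particular API):

```python
from typing import Callable

MENU = {"burger", "fries", "shake"}  # the known, constrained domain

def route(user_utterance: str,
          small_llm: Callable[[str], str],
          big_llm: Callable[[str], str]) -> str:
    """Small model handles what it was fine-tuned for; everything else escalates."""
    # Ask the small model only for a coarse intent label, a task it can
    # be fine-tuned to do reliably.
    intent = small_llm(
        "Label this as ORDER or OTHER, one word only: " + user_utterance
    ).strip().upper()

    if intent == "ORDER":
        # Constrained scenario: the menu bounds what a valid answer looks like,
        # so the small model stays on the rails.
        return small_llm(
            f"Take this order using only items from {sorted(MENU)}: {user_utterance}"
        )
    # Open-ended conversation: hand off to the bigger model.
    return big_llm(user_utterance)
```

The small model only ever sees a one-word classification task or a menu-bounded task; anything genuinely open-ended goes upstream.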

But medium-sized and small models never hit that sweet spot between open-ended conversation and reasonably on-the-rails responsiveness to what the user has just said. We don't yet know how to build models <100B parameters that do that. Seems pretty clear that we'll get there, given the pace of improvement. But we're not there yet.

Now maybe you could argue that a kid is going to be happy with a model that you train to be relatively limited and predictable. And given that kids will talk for hours to a stuffie that doesn't talk back at all, on some level this is a fair point! But you can also argue the other side: kids are the very best open-ended conversationalists in the world. They'll take a conversation anywhere! So giving them an 8B parameter, 4-bit quantized Santa would be a shame.

But on that compute budget it’s gonna sound so stupid. Oh right. Santa.

It's a children's toy; how nuanced do its responses need to be?

I agree. It just took me a while to figure it out. A 3B param LLM would do perfectly well.
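Something like this gets a 3B model into roughly 2 GB; the checkpoint name is just one example of a ~3B instruct model, and bitsandbytes assumes a CUDA GPU (a GGUF runtime does the same job elsewhere):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "Qwen/Qwen2.5-3B-Instruct"  # example checkpoint; any small causal LM works

# Load-time 4-bit quantization via bitsandbytes.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)

prompt = "You are Santa. A child asks: what do the reindeer eat?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=80)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```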

I've been running LLMs and TTS capable of this on my laptop since last year.
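A minimal fully-local version of that loop, assuming llama-cpp-python with some downloaded GGUF checkpoint (the path below is a placeholder) and pyttsx3 for offline TTS; both are just stand-ins for whatever local stack you prefer:

```python
from llama_cpp import Llama
import pyttsx3

# Any small quantized GGUF checkpoint; the path is a placeholder.
llm = Llama(model_path="santa-3b-q4_k_m.gguf", n_ctx=2048, verbose=False)
tts = pyttsx3.init()  # offline system TTS

SYSTEM = "You are Santa Claus. Keep replies short, warm, and kid-friendly."

def reply(child_says: str) -> str:
    out = llm.create_chat_completion(
        messages=[
            {"role": "system", "content": SYSTEM},
            {"role": "user", "content": child_says},
        ],
        max_tokens=80,
        temperature=0.7,
    )
    return out["choices"][0]["message"]["content"]

text = reply("Do you really know if I've been good?")
print(text)
tts.say(text)
tts.runAndWait()
```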