Hacker News

A common approach I've seen is to start with paid/larger/hosted models for a workflow where you don't yet know exactly how it'll behave when you first put it together. Over time you lock down how things more or less work, but you still need free-form text parsing of some kind, so you end up replacing the bigger models with carefully post-trained small models.

Besides that, there are a ton of use cases for smaller models across a bunch of different things. We're unlikely to be able to run LLMs (the actually Large kind) on smartphones for a while, whereas the smaller LLMs already seem to run on-device in experiments.