Hacker News

Wait, how does this work? If you load one 40 GB LLM, doesn't loading four more 40 GB LLMs take up an extra 160 GB of memory?

It will typically be the same 40 GB model, loaded once but called with many different inputs simultaneously, so the weights are shared rather than duplicated for each input.
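
To put rough numbers on that, here is a minimal sketch of the arithmetic. The 40 GB figure is from the question. The 1 GB of per-request state (things like the KV cache and activations) is purely an assumed placeholder, and `ServingMemory` is a made-up name for illustration, not any real library's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServingMemory:
    """Back-of-the-envelope memory model for serving one LLM (hypothetical)."""

    weights_gb: float      # model weights, e.g. the 40 GB from the question
    per_request_gb: float  # ASSUMED per-request state (KV cache, activations)

    def separate_copies(self, n_requests: int) -> float:
        """Memory if every request got its own full copy of the model."""
        return n_requests * (self.weights_gb + self.per_request_gb)

    def shared_weights(self, n_requests: int) -> float:
        """Memory if all requests share a single copy of the weights."""
        return self.weights_gb + n_requests * self.per_request_gb


# 1.0 GB per request is an assumption for illustration, not a measured value.
mem = ServingMemory(weights_gb=40.0, per_request_gb=1.0)

naive_gb = mem.separate_copies(5)    # five independent 40 GB copies
batched_gb = mem.shared_weights(5)   # one 40 GB copy, five concurrent inputs

print(f"five separate copies:         {naive_gb:.0f} GB")
print(f"one shared copy, five inputs: {batched_gb:.0f} GB")
```

The weights are paid for once. What grows with the number of concurrent inputs is only the per-request state, which in practice depends on things like context length and is usually much smaller than the weights.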