This is impressive. I've been experimenting with the Gemini API for a side project, and the latency gap between local and cloud inference is something I keep thinking about. How does memory usage scale with the 500B models?
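For context, here's roughly how I've been timing round trips on the cloud side. A minimal sketch, assuming the `google-generativeai` Python SDK; the model name and prompt are just placeholders:

```python
# Rough round-trip latency probe for a Gemini API call.
# Assumes GOOGLE_API_KEY is set; model name is illustrative.
import os
import time

import google.generativeai as genai

genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
model = genai.GenerativeModel("gemini-1.5-flash")

start = time.perf_counter()
response = model.generate_content("Reply with one word.")
elapsed = time.perf_counter() - start

print(f"round-trip latency: {elapsed:.2f}s")
print(response.text)
```

Even with a one-word completion, the network round trip dominates compared to a small local model, which is what keeps me curious about the memory/latency trade-off at the 500B scale.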