If it's something like:
- v4.5: 1x cost, 100% quality, 100% speed but maybe sometimes 80% speed because of load - v4.6: 3x cost, 105% quality, 80% speed most of the time depends - v4.7: 9x cost, 115% quality, 90% speed most of the time
Then people will either stick with v4.5 for everything it can do and, if knowledgeable, use v4.7+ for critical or specific tasks.
But if we add the option of:
LocalLLM: one time hardware + electricity cost, good enough quality for 90% of work, good enough speed for 90% of work, no vendor lock in/sudden cost spikes...
Then there is an edge to running it yourself unless you can burn investor cash to get to the next level.
I think the recent headlines on org token spend plus my own experience just today (June 1) with the new Copilot Pro limits is going to push those with the compute to run locally.
As of about 1pm today I did something to hit 47% of my entire June premium requests (copilot Pro, not converted).
As of 2pm I'm using Gemma 4 E4B on a 12gb GPU (with large context window) off my desktop to power VS Code with Copilot on my laptop. I'm going to build an AMD Strix Halo system next week when parts arrive so I can queue up a few models in parallel or work with something I need that much RAM for.
I'm not lifting the earth with my LLM setup. Gemma 4 E4B is solid for accelerating my current projects. and it's costing me pennies more per hour vs blowing half my Copilot Pro plan in a distracted morning.
I'm at a vendor conference this weekend that is showing off their Agent/Agentic workflows. Nobody can tell me how they balance the cost long term. Hopefully whoever the vendor is paying for their cloud LLM token usage doesn't spike cost in a year (or the vendor themselves) after companies convert and are trapped VMware style with these agent processes. You can bring your own (cloud) model subscription. I need to find out if we can point it back to our own local LLM endpoint and try local models for the same processes. Even if it takes 5x longer, it could be cheaper and more secure.