I've seen the Microsoft Aurora team make a compelling argument that weather is an interesting counterexample to the AI-energy-waste narrative. Once deployed at scale, inference with these models is actually a sizable energy/compute improvement over classical simulation and forecasting methods. Of course it is energy intensive to train the model, but the usage itself is more energy efficient.
There's also the efficiency argument from new capability: even a tiny bit better weather forecast is highly economically valuable (and saves a lot of wasted energy) if it means that 1 city doesn't have to evacuate because of an erroneous hurricane forecast, say. But how much would it cost to get that same improvement out of the classical methods? I don't know, but I would guess quite a lot.
And one of the biggest ironies of AI scaling is that where scaling succeeds the most in improving efficiency, we realize it the least, because we don't even think of it as an option. An example: a Transformer (or RNN) is not the only way to predict text. We have scaling laws for n-grams and text perplexity (most famously, from Jeff Dean et al. at Google back in the 2000s), so you can actually ask the question, 'how much would I have to scale up n-grams to achieve the necessary perplexity for a useful code writer competitive with Claude Code, say?' This is a perfectly reasonable, well-defined question, as high-order n-grams could in theory write code given enough data and big enough lookup tables, and so it can be answered. The answer will look something like 'if we turned the whole earth into computronium, it still wouldn't be remotely enough'. The efficiency ratio is not 10:1 or 100:1 but closer to ∞:1. The efficiency gain is so big no one even thinks of it as an efficiency gain, because you just couldn't do it before using AI! You would have humans do it, or not do it at all.
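Back-of-envelope, with numbers I'm picking purely for illustration (a ~50k-token vocabulary and a context of only 20 tokens, which is far shorter than what useful code completion actually needs):

    # Rough count of the lookup-table entries an n-gram code writer would need.
    # Vocabulary size and n-gram order are illustrative assumptions, not measured figures.
    V = 50_000                      # assumed token vocabulary
    n = 20                          # assumed n-gram order (tiny vs. an LLM's context window)

    distinct_contexts = V ** (n - 1)            # possible (n-1)-token histories
    print(f"{distinct_contexts:.2e} possible contexts")          # ~1.9e+89

    ATOMS_IN_EARTH = 1e50                       # rough order-of-magnitude estimate
    print(f"~{distinct_contexts / ATOMS_IN_EARTH:.1e} contexts per atom of the planet")

Even before storing a single count, the context space alone dwarfs any physically plausible lookup table, which is why the ratio comes out looking like ∞:1.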
> even a tiny bit better weather forecast is highly economically valuable (and saves a lot of wasted energy) if it means that 1 city doesn't have to evacuate because of an erroneous hurricane forecast
Here is NOAA on the improvements:
> 8% better predictions for track, and 10% better predictions for intensity, especially at longer forecast lead times — with overall improvements of four to five days.(1)
I’d love someone to explain what these measurements mean though. Does better track mean 8% narrower angle? Something else? Compared to what baseline?
And am I reading this right that that improvement is measured at the point 4-5 days out from landfall? What’s the typical lead time for calling an evacuation, more or less than four days?
(1)https://www.noaa.gov/news/new-noaa-system-ushers-in-next-gen...
To have a competitive code writer with n-grams you need more than to "scale up the n-grams": you need a corpus that includes all possible code that someone would want to write. And at that point you'd be better off with a lossless full-text index like an r-index. But the lack of any generalizability in this approach, coupled with its Markovian nature, would make this kind of model extremely brittle. It would be efficient, though; you'd just need to somehow compute all possible language beforehand. tl;dr: language models really are reasoning and generalizing over the domain they're trained on.
Now that we’ve saved infinite energy all carbon tax credit markets are unnecessary! Big win for the climate! pollutes
These were obviously much simpler neural nets, but we did have some models in my domain whose role was to speed up design evaluation.
E.g., you want to find a really good design. Designs are fairly easy to generate but expensive to evaluate and score: we can quickly generate millions of designs, but evaluating one can take 100 ms to 1 s, with simulations that are not easy to parallelize on a GPU. We ended up training models that try to predict that score. They don't predict it perfectly, but you can be 99% sure that the actual score of a design is within a certain distance of the predicted one.
So if you normally want the 10 best designs out of your 1 million, you can now have the model predict the best 1,000 first and be reasonably certain your top 10 is a subset of those 1,000. Then you only need to run your simulation on those 1,000.
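A minimal sketch of that filtering idea (the design generator, the stand-in "expensive" score, and the choice of surrogate are all placeholders I'm inventing for illustration):

    import numpy as np
    from sklearn.ensemble import GradientBoostingRegressor

    rng = np.random.default_rng(0)

    def generate_designs(n, dim=8):
        # Cheap to produce: here just random points in a toy design space.
        return rng.uniform(-1.0, 1.0, size=(n, dim))

    def expensive_score(design):
        # Stand-in for a 100 ms - 1 s simulation that is hard to parallelize on a GPU.
        return float(np.sum(np.sin(3 * design)) - np.sum(design ** 2))

    # 1. Fit a surrogate on a sample small enough to simulate exhaustively.
    train_x = generate_designs(2_000)
    train_y = np.array([expensive_score(d) for d in train_x])
    surrogate = GradientBoostingRegressor().fit(train_x, train_y)

    # 2. Let the surrogate rank the full million and keep a generous shortlist.
    candidates = generate_designs(1_000_000)
    shortlist = candidates[np.argsort(surrogate.predict(candidates))[-1_000:]]

    # 3. Run the real simulation only on the shortlist to recover the true top 10.
    true_scores = np.array([expensive_score(d) for d in shortlist])
    top10 = shortlist[np.argsort(true_scores)[-10:]]

The shortlist size is the knob: it has to be generous enough that the surrogate's error band can't push a true top-10 design out of it.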
Heuristic branch-and-bound
It's definitely interesting that some neural nets can reduce compute requirements, but that's certainly not making a dent in the LLM part of the pie.
Sam Altman has made a lot of grandiose claims about how much power he's going to need to scale LLMs, but the evidence seems to suggest the amount of power required to train and operate LLMs is a lot more modest than he would have you believe. (DeepSeek reportedly being trained for just $5M, for example.)
I saw a claim that DeepSeek had piggybacked off of some aspect of training that ChatGPT had done, and so that cost needed to be included when evaluating DeepSeek.
This training part of LLMs is still mostly Greek to me, so if anyone could explain that claim as true or false and the reasons why, I’d appreciate it
I think the claim that DeepSeek was trained for $5M is a little questionable. But OpenAI is trying to raise $100B which is 20,000 times as much as $5M. Though even at $1B I think it's probably not that big a deal for Google or OpenAI. My feeling is they can profit on the prices they are charging for their LLM APIs, and that the dominant compute cost is inference, not training. Though obviously that's only true if you're selling billions of dollars worth of API calls like Google and OpenAI.
OpenAI has had $20B in revenue this year, and it seems likely to me they have spent considerably less than that on compute for training GPT5. Probably not $5M, but quite possibly under $1B.
So LLMs predict the next token. Basically, you train them by taking training data that's N tokens long and, for X = 1 to N, optimizing the model to predict token X using tokens 1 to X-1.
There's no reason you couldn't generate training data for a model by getting output from another model. You could even get the probability distribution of output tokens from the source model and train the target model to repeat that probability distribution, instead of a single word. That'd be faster, because instead of it learning to say "Hello!" and "Hi!" from two different examples, one where it says hello and one where it says hi, you'd learn to say both from one example that has a probability distribution of 50% for each output.
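A toy sketch of the two objectives side by side, with random tensors standing in for real model outputs (this is just the generic hard-label vs. soft-target setup, not any lab's actual recipe):

    import torch
    import torch.nn.functional as F

    vocab, batch, seq = 1000, 4, 16
    student_logits = torch.randn(batch, seq, vocab, requires_grad=True)  # stand-in for the model's output

    # Hard targets: the usual objective, predict token X given tokens 1..X-1.
    next_tokens = torch.randint(0, vocab, (batch, seq))
    hard_loss = F.cross_entropy(student_logits.reshape(-1, vocab), next_tokens.reshape(-1))

    # Soft targets: match another model's full next-token distribution instead,
    # e.g. learning 50% "Hello!" / 50% "Hi!" from a single example.
    teacher_logits = torch.randn(batch, seq, vocab)
    soft_loss = F.kl_div(
        F.log_softmax(student_logits, dim=-1),
        F.softmax(teacher_logits, dim=-1),
        reduction="batchmean",
    )

    (hard_loss + soft_loss).backward()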
Sometimes DeepSeek said its name is ChatGPT. This could be because they used Q&A pairs from ChatGPT for training, or because they scraped conversations other people posted of themselves talking to ChatGPT. Or for unknown reasons where the model just decided to respond that way, like mixing up the semantics of wanting to say "I'm an AI" with all the scraped data referring to AI as ChatGPT.
Short of admission or leaks of DeepSeek training data it's hard to tell. Conversely, DeepSeek really went hard into an architecture that is cheap to train, using a lot of weird techniques to optimize their training process for their hardware.
Personally, I think they did. Research shows that a model can be greatly improved with a relatively small set of high-quality Q&A pairs. But I'm not sure the cost evaluation should be influenced that much, because the ChatGPT training price was only paid once; it doesn't have to be repaid for every new model that cribs its answers.
And an LLM can be more energy efficient than a human -- and that's precisely when you should use it.
If it's more energy efficient, it's doing something different, and there's no guarantee it's more accurate in the long term. Weather is horribly difficult to predict, and we are only just alright at it. Maybe LLMs can guess at the same rate we can calculate, but I am doubtful.
Well, that was a garbled response, oops. I am just cautious because, while transformers get the statistical guessing right, you can get the right answer statistically and still fail to keep improving accuracy in the long term. Clearly this model does better than the current one, but extending it to be even better seems basically intractable beyond throwing more data at it, and if it has derived the wrong underlying model, you simply cannot know.
That's precisely when (insert hand-wavy motion) we should use any of this.
This jumped out at me as well - very interesting that it actually reduces necessary compute in this instance
The press statement is full of stuff like this:
"Area for future improvement: developers continue to improve the ensemble’s ability to create a range of forecast outcomes."
Someone else noted the models are fairly simple.
My question is "what happens if you scale up to attain the same levels of accuracy throughout? Will it still be as efficient?"
My reading is that these models work well in other regions but I reserve a certain skepticism because I think it's healthy in science, and also because I think those ultimately in charge have yet to prove reliable judges of anything scientific.
> My question is "what happens if you scale up to attain the same levels of accuracy throughout? Will it still be as efficient?"
I've done some work in this area, and the answer is probably 'more efficient, but not quite as spectacularly efficient.'
In a crude, back-of-the-envelope sense, AI-NWP models run about three orders of magnitude faster than notionally equivalent physics based NWP models. Those three orders of magnitude divide approximately evenly between three factors:
1. AI-NWP models produce much sparser outputs compared to physics-based models. That means fewer variables and levels, but also coarser timesteps. If a model needs to run 10x as often to produce an output every 30m rather than every 6h, that's an order of magnitude right there.
2. AI-NWP models are "GPU native," while physics-based models emphatically aren't. Hypothetically running physics-based models on GPUs would gain most of an order of magnitude back.
3. AI-NWP models have very high arithmetic intensity compared to physics-based NWP models, since the former are "matrix-matrix multiplications all the way down." Traditional NWP models perform relatively little work per grid point in comparison, which puts them on the wrong (badly memory-bandwidth-limited) side of the roofline plot.
I'd expect a full-throated AI-NWP model to give up most of the gains from #1 (to have dense outputs), and dedicated work on physics-based NWP might close the gap on #2. However, that last point seems much more durable to me.
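For a rough sense of how those factors multiply out (the individual numbers below are my own illustrative guesses, not measurements):

    # Decomposing the ~3 orders of magnitude into the three factors above.
    output_cadence = (6 * 60) / 30     # 30-minute physics timesteps vs. 6-hourly AI outputs: ~12x
    gpu_native     = 8                 # assumed gain from a GPU-native formulation
    intensity      = 10                # assumed gain from dense matmuls vs. memory-bound stencils

    print(output_cadence * gpu_native * intensity)   # ~960, i.e. roughly 1000x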
"it's more efficient if you ignore the part where it's not"
> "it's more efficient if you ignore the part where it's not"
Even when you include training, the payoff period is not that long. Operational NWP is enormously expensive because high-resolution models run under soft real-time deadlines; having today's forecast tomorrow won't do you any good.
The bigger problem is that traditional models have decades of legacy behind them, and getting them to work on GPUs is nontrivial. That means that in a real way, AI model training and inference comes at the expense of traditional-NWP systems, and weather centres globally are having to strike new balances without a lot of certainty.
It's more efficient anyway because the inference is what everyone will use for forecasting. Researchers will be using huge amounts of compute to develop better models, but that's also currently the case, and it isn't the majority of weather simulation use.
There's an interesting parallel to Formula One, where there are limits on the computational resources teams can use to design their cars, and where they can use an aerodynamic model that was previously trained to get pretty good outcomes with less compute use in the actual design phase.
I suggest reading up on fixed costs vs variable costs and why it is generally preferable to push costs to fixed.
Assuming you’re not throwing the whole thing out after one forecast, it is probably better to reduce runtime energy usage even if it means using more for one-time training.
I mean that’s cute, but surely you can add up the two parts (single training plus globally distributed inference) and understand that the net efficiency would be an improvement?