When I was working for startups trying to develop foundation models circa 2015, we were more concerned with training than with inference.

Today, with models that are actually useful, training cost matters much less than inference cost. A 10x increase in training costs is not necessarily prohibitive if you get a 10x decrease in inference costs.
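A rough way to see when that trade pays off (all the numbers below are hypothetical placeholders, just to show where the break-even sits):

    # Back-of-envelope break-even for trading training cost against inference cost.
    # All numbers here are hypothetical placeholders.
    train_cost = 10e6          # baseline training cost, USD
    infer_cost_per_1k = 0.01   # baseline inference cost per 1k output tokens, USD

    def total_cost(tokens_served, train_mult, infer_mult):
        """Lifetime cost if training gets train_mult times pricier
        and inference gets infer_mult times cheaper."""
        inference = (tokens_served / 1e3) * infer_cost_per_1k * infer_mult
        return train_cost * train_mult + inference

    # 10x pricier training for 10x cheaper inference only wins past some volume:
    for tokens in (1e9, 1e13, 1e15):
        base = total_cost(tokens, 1, 1)
        traded = total_cost(tokens, 10, 0.1)
        print(f"{tokens:.0e} tokens served: baseline ${base:,.0f} vs traded ${traded:,.0f}")

At small serving volumes the 10x training hit dominates; only at very large serving volumes does the cheaper inference win.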

I still don't have a GPT-3-class model that was trained without copyright infringement. I'd have so many uses for it, from research to production. What's stopping me is the $30 million training cost for 180B models. Even a 30B model like Mosaic's cost over a million dollars.
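Those figures line up with the usual back-of-envelope of ~6 FLOPs per parameter per training token; the token counts, GPU throughput, utilization, and hourly price below are my assumptions, not the actual numbers behind those runs:

    # Rough training-cost estimate from the ~6 * params * tokens FLOP rule of thumb.
    # Token counts, GPU specs, utilization, and price are assumptions.
    def train_cost_usd(params, tokens,
                       peak_flops=312e12,      # A100 BF16 peak
                       utilization=0.4,        # assumed utilization
                       usd_per_gpu_hour=2.0):  # assumed rental price
        total_flops = 6 * params * tokens
        flops_per_gpu_hour = peak_flops * utilization * 3600
        return total_flops / flops_per_gpu_hour * usd_per_gpu_hour

    print(f"180B params, 3.5T tokens: ~${train_cost_usd(180e9, 3.5e12):,.0f}")
    print(f"30B params, 1T tokens:    ~${train_cost_usd(30e9, 1e12):,.0f}")

Depending on the assumptions, you land within a small factor of those figures.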

So, I strongly disagree, unless we're talking about the five or six companies that already spend tens of millions on training and do so repeatedly. Outside of them, medium-to-large models are trained infrequently, often as one-offs, by a small number of other companies. The rest of us are stuck with their pretraining efforts because we can't afford our own.

On my end, I'd rather see a model that drops pretraining costs to almost nothing but costs 10-32x more to run inference. My uses would produce mere MB of output versus the hundreds of GB to TB of data that pretraining requires. A competitive model that cost 32x current prices to run would probably still be profitable for me. Optimizations, which are plentiful for inference, might bring it down further.
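Rough numbers on that asymmetry (the baseline price, markup, and bytes-per-token below are placeholders):

    # Why 32x pricier inference can still be cheap: my output volume is tiny
    # compared to a pretraining corpus. All prices and sizes are placeholders.
    usd_per_1m_tokens = 10.0   # assumed baseline output price
    markup = 32                # hypothetical inference-cost multiplier
    bytes_per_token = 4        # rough average for English text

    def cost_of_output(output_bytes):
        tokens = output_bytes / bytes_per_token
        return tokens / 1e6 * usd_per_1m_tokens * markup

    print(f"100 MB of generated output: ~${cost_of_output(100e6):,.0f}")
    print(f"1 TB (pretraining-corpus scale): ~${cost_of_output(1e12):,.0f}")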

I think you're right, but there has to be a limit. If I'm training a model, I'm going to do a significant amount of inference to evaluate it and support the training.

Why are you making something cheap more expensive than it needs to be?

It's not cheap. It costs anywhere from millions to $100 million, depending on the model. I was responding to this tradeoff:

"A 10x increase in training costs is not necessarily prohibitive if you get a 10x decrease in inference costs."

Given millions and up, I'd like training to be 10x cheaper even if inference were 10x more expensive. Then it could do research or coding for me at $15/hr instead of $1.50/hr. I'd just use it carefully, with batching.

Calculating the gradient requires a forward pass (inference) and a backward pass (backpropagation).

They're in the same ballpark, with the backward pass being maybe twice as expensive as the forward pass. So let's say a training step is roughly three times the cost of a forward pass.
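That ratio falls out of the standard FLOP accounting for dense layers (a rule of thumb, not measured numbers):

    # Rule-of-thumb FLOPs per token for a dense model with N parameters:
    # forward  ~ 2N (one multiply-accumulate per parameter),
    # backward ~ 4N (gradients w.r.t. activations plus gradients w.r.t. weights).
    N = 30e9

    forward_flops = 2 * N
    backward_flops = 4 * N
    train_step = forward_flops + backward_flops   # ~6N per token

    print(train_step / forward_flops)   # -> 3.0, i.e. ~3x a forward pass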

You can't make training faster by making inference slower.

I was responding to their claim by provisionally assuming it was correct; I don't know the cost data myself. Now I'll assume what you say is true.

That leaves the computation and memory use of the two passes, plus inter-layer communication.

I don't think backpropagation occurs in the brain, since the brain appears to use local learning, though global optimization probably happens during sleep/dreaming. I have a lot of papers on removing backpropagation, Hebbian learning, and local learning rules.
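For a sense of what "local" means here, a minimal Oja-style Hebbian update, where each weight change depends only on that layer's own input and output, with no error signal propagated back from later layers (an illustrative sketch, not any specific paper's method):

    # Minimal local (Hebbian-style) learning rule: Oja's rule applied per output unit.
    # Each weight update uses only the layer's own input x and output y -- no backprop.
    import numpy as np

    rng = np.random.default_rng(0)
    W = rng.normal(scale=0.1, size=(16, 64))   # 64 inputs -> 16 units
    lr = 1e-2

    def local_update(W, x):
        y = W @ x                              # layer output
        # Hebbian term y*x, minus a decay term that keeps the weights bounded
        W += lr * (np.outer(y, x) - (y ** 2)[:, None] * W)
        return W

    for _ in range(1000):
        W = local_update(W, rng.normal(size=64))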

From there, many groups are publishing methods for training at 8-bit precision and below. A recent one mixed low-bit training with sub-1-bit storage for the weights. The NoLayer architecture might address inter-layer communication better.
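To illustrate the storage side of that, here's the basic quantize/dequantize trick behind low-bit weights (just the general idea, not any specific paper's training recipe):

    # Store weights as int8 plus a scale, dequantize on the fly.
    import numpy as np

    def quantize_int8(w):
        scale = np.abs(w).max() / 127.0
        q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
        return q, scale

    def dequantize(q, scale):
        return q.astype(np.float32) * scale

    w = np.random.default_rng(0).normal(size=(1024, 1024)).astype(np.float32)
    q, s = quantize_int8(w)
    print("bytes fp32:", w.nbytes, " bytes int8:", q.nbytes)   # 4x smaller
    print("max abs error:", np.abs(w - dequantize(q, s)).max())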

People keep trying to build analog accelerators, but there are mismatches between what neural networks need and what the analog hardware provides. Recent work has come up with analog NNs that map well onto analog hardware.
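One common way to close that gap is to model the hardware's imperfections during training, e.g. injecting weight noise in the forward pass so the network learns to tolerate imprecise devices; a generic illustration, not tied to any particular accelerator:

    # Noise-aware forward pass: model analog non-idealities as multiplicative
    # noise on the weights so the trained network tolerates imprecise hardware.
    import numpy as np

    rng = np.random.default_rng(0)

    def noisy_forward(W, x, rel_noise=0.05):
        W_noisy = W * (1.0 + rel_noise * rng.normal(size=W.shape))
        return np.tanh(W_noisy @ x)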

A combination of those would likely bring costs down dramatically for both inference and training. Energy use would be lower, too.