These techniques are not new. And the reason why they’re usually not used is on page 9 in the paper. They require about 10x as many training iterations.
When I was working for startups trying to develop foundation models circa 2015, we were more concerned with training than with inference.
Today, with models that are actually useful, training cost matters much less than inference cost. A 10x increase in training costs is not necessarily prohibitive if you get a 10x decrease in inference costs.
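A back-of-the-envelope sketch of that break-even, with hypothetical numbers (the $30M pretraining figure echoes the GPT-3-class estimate downthread; the per-token price is made up):

    # Hypothetical: when does 10x training cost pay for itself?
    base_train = 30e6          # assumed $30M pretraining run
    base_price_per_tok = 1e-6  # assumed $1 per million tokens served

    def lifetime_cost(tokens_served, train_mult, infer_mult):
        return base_train * train_mult + tokens_served * base_price_per_tok * infer_mult

    # Baseline vs. 10x training cost with 10x cheaper inference:
    for toks in (1e12, 1e14, 1e16):
        base = lifetime_cost(toks, 1, 1)
        alt = lifetime_cost(toks, 10, 0.1)
        print(f"{toks:.0e} tokens served: ${base:,.0f} vs ${alt:,.0f}")

The swap only pays off once you serve enough tokens for inference to dominate the bill, which is exactly the situation the big labs are in.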
I still don't have a GPT-3-class model that was trained without copyright infringement. I'd have so many uses for it, from research to production. What's stopping me is the $30 million training cost for 180B models. Even a 30B like Mosaic cost over a million dollars.
So, I strongly disagree, unless we're talking about the five or six companies that already spend tens of millions on training and keep doing it over and over. Outside of them, medium to large models are trained infrequently, or as one-offs, by a small number of other companies. The rest of us are stuck with their pretraining efforts because we can't afford our own.
On my end, I'd rather see a model that drops pretraining costs to almost nothing but costs 10-32x more to do inference. My uses would produce mere megabytes of output, versus the hundreds of GB to TB that pretraining requires. A competitive use that costs 32x current prices would probably still be profitable for me. Optimizations, which are plentiful for inference, might bring it down further.
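Rough numbers for that reverse tradeoff (hypothetical: a few MB of generated text is on the order of a few hundred thousand tokens, and the per-million-token price is assumed):

    # Hypothetical: expensive inference is still cheap at low output volume.
    output_tokens = 250_000       # roughly a MB or two of generated text
    price_per_million = 2.00      # assumed current price per million tokens
    for mult in (1, 10, 32):
        cost = output_tokens / 1e6 * price_per_million * mult
        print(f"{mult:>2}x inference price: ${cost:.2f} for the whole job")

Even at 32x, the bill is dollars, not the millions a pretraining run costs.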
I think you're right, but there has to be a limit. If I'm training a model, I'm going to do a significant amount of inference to evaluate it and support the training.
Why are you making something cheap more expensive than it needs to be?
It's not cheap. It costs anywhere from millions to $100 million, depending on the model. I was responding to this tradeoff:
"A 10x increase in training costs is not necessarily prohibitive if you get a 10x decrease in inference costs."
Given millions and up, I'd like training to be 10x cheaper even if inference were 10x more expensive. Then it could do research or coding for me at $15/hr instead of $1.50/hr. I'd just use it carefully, with batching.
Calculating the gradient requires a forward pass (inference) and a backward pass (backpropagation).
They cost roughly the same, with the backward pass being maybe 50% more expensive. So let's say a full training step is about three times the cost of a forward pass.
You can't make training faster by making inference slower.
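The standard back-of-the-envelope FLOPs estimate lands in the same place (the usual ~2P forward / ~4P backward per-token accounting for dense transformers; exact ratios vary by architecture):

    # Rough FLOPs per token for a dense model with P parameters.
    P = 30e9                 # e.g. a 30B-parameter model
    forward = 2 * P          # inference cost per token
    backward = 4 * P         # backprop cost per token
    step = forward + backward
    print(f"training step ~ {step / forward:.1f}x a forward pass")  # ~3.0x

So a training step is, to first order, a few forward passes per token; slowing inference down slows training down with it.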
I was responding to their claim by starting from the assumption that it may be correct. I don't know the cost data myself. Now, I'll assume what you say is true.
That leaves the computation and memory use of the two passes, plus interlayer communication.
I think backpropagation doesn't occur in the brain, since it appears to use local learning, but global optimization probably happens during sleep/dreaming. I have a lot of papers on removing backpropagation, Hebbian learning, and local learning rules.
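For reference, a minimal sketch of what a local rule looks like (plain Hebbian update with weight decay; purely illustrative, not any specific paper's method):

    import numpy as np

    def hebbian_update(W, x, y, lr=0.01, decay=0.001):
        # Local rule: each weight changes based only on the pre-synaptic
        # activity x and post-synaptic activity y it connects, plus a small
        # decay term. No global error signal is backpropagated.
        return W + lr * np.outer(y, x) - decay * W

    rng = np.random.default_rng(0)
    W = rng.normal(scale=0.1, size=(16, 32))
    x = rng.normal(size=32)
    y = np.tanh(W @ x)       # forward pass through one layer
    W = hebbian_update(W, x, y)

The appeal is that each layer's update needs nothing from the layers above it, which is exactly the interlayer-communication cost mentioned above.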
From there, many groups are publishing ways to train at 8-bit precision and below. A recent one mixed low-bit training with sub-1-bit storage for weights. The NoLayer architecture might address interlayer communication better.
People keep trying to build analog accelerators, but there are mismatches between what the networks need and what the hardware provides. Recent work has come up with analog NN designs that map well onto analog hardware.
A combination of those would likely drive costs down dramatically for both inference and training. Energy use would be lower, too.
Unless each iteration is 90% faster
This.
In fact, it can be slower because hardware is probably not optimized for the 1-bit case, so there may be a lot of low-hanging fruit for hardware designers and we may see improvements in the next iteration of hardware.
Isn't digital (binary) hardware literally optimized for 1-bit case by definition?
People are confusing word size…
The CPU can handle up to word-size bits at once. I believe they mean that a lot of assembly was written for integer math (word size 4+) and not bit math. However, it is unlikely we'll see improvements in this area, because by definition using 64-bit floats already uses the max word size. So… that's the max throughput. Sending 1 bit at a time vs 64 bits would be considerably slower, so this entire approach is funny.
No, because there are algorithmic shortcuts that allow approximations and skipped steps compared to a strict step-by-step binary calculation, using in-memory bit reads and implicit rules, among other structural advantages in how GPU and CPU instruction sets are implemented in hardware.
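A concrete example of the packing point: with ±1 weights and activations packed into machine words, one word-wide XOR plus a popcount evaluates 64 one-bit multiply-accumulates at once. Real binary-NN kernels do this on GPUs; the sketch below is just the idea in Python:

    # Binary dot product via bit packing: encode -1 as 0 and +1 as 1,
    # then dot(a, b) = n - 2 * popcount(a XOR b) for an n-element chunk.
    def binary_dot(a_bits: int, b_bits: int, n: int = 64) -> int:
        xor = (a_bits ^ b_bits) & ((1 << n) - 1)
        return n - 2 * bin(xor).count("1")

    # Check against the naive elementwise version.
    import random
    a = [random.choice((-1, 1)) for _ in range(64)]
    b = [random.choice((-1, 1)) for _ in range(64)]
    pack = lambda v: sum(1 << i for i, s in enumerate(v) if s == 1)
    assert binary_dot(pack(a), pack(b)) == sum(x * y for x, y in zip(a, b))

So 1-bit math isn't "sending 1 bit at a time"; it's 64 multiplies per word-sized operation, which is why people expect dedicated hardware support to help further.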
FPGAs could be highly competitive for models with unusual but small bit widths, especially single bits, since their optimizers will handle that easily.
Yea I saw that training perplexity and thought hmmm...
Turns out using floats is a feature and not a bug?
No, I don't think so, in that I don't think anyone has ever called that a bug.
In the paper summary they did not call it a bug explicitly, but they do say there are 32x improvements from using single bits instead.
That's an obvious exaggeration. The competition is using smaller weights already, some of which are floating point and some of which aren't.
And they use full size floats for training.
That means their paper is actually worse than SOTA, which is concerned with training natively in fp4, without full precision [0], for QAT.
[0] "full precision" in ML usually means 16 bit floats like bfloat16
I wouldn't say "worse". It's focusing on inference cost and leaving training at a default for now.
To memory, sure. At the cost of 32x slower speeds.