Chip CEO here. It really depends on what "design" or "production" means. Does "design" mean that the design was complete? Does "production" mean the beginning of production, i.e. tapeout? If measuring from RTL-freeze to tapeout, this is a fairly typical (even somewhat unimpressive) timeline (accounting for some unexpected issues) for a large, complex 3nm chip. If measuring from concept (no RTL at all, block diagram of architecture) to tapeout, this is an amazing timeline. The truth is probably somewhere in between. A more concrete statement would use actual technical milestones and gates.
Not a chip CEO, but I read this article and thought that they're working on some kind of application specific chip only for serving models. Similar to how an FPGA can optimize certain tasks.
Given constant weights / biases of a Transformer / DNN, you could use pipelining to feed forward calculations through the array one layer at a time. For DNNs with thousands of layers you might see 1:1 speed up per layer channel.
I doubt they would undergo this process for marginal gains.
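To make the layer-at-a-time pipelining idea concrete, here is a toy sketch in Python. It is only an illustration under stated assumptions: `run_pipeline` and the three lambda "layers" are made-up stand-ins for layers with fixed weights, and nothing here reflects how any real chip is built. It shows that N inputs through L stages take N + L - 1 ticks when pipelined, versus N * L layer-steps when run one input at a time.

```python
def run_pipeline(layers, inputs):
    """Tick-by-tick pipeline: each stage applies its layer to the value
    the previous stage produced on the previous tick.
    None marks an empty stage, so inputs must not be None."""
    stages = [None] * len(layers)   # value latched at the output of each stage
    pending = list(inputs)
    outputs, ticks = [], 0
    # Keep ticking while inputs remain or values are still in flight.
    while pending or any(s is not None for s in stages[:-1]):
        feed = pending.pop(0) if pending else None
        new_stages = [None] * len(layers)
        for i, layer in enumerate(layers):
            src = feed if i == 0 else stages[i - 1]
            new_stages[i] = layer(src) if src is not None else None
        stages = new_stages
        ticks += 1
        if stages[-1] is not None:
            outputs.append(stages[-1])
    return outputs, ticks

# Hypothetical "layers" with constant parameters baked in.
LAYERS = [lambda x: x * 2, lambda x: x + 1, lambda x: x * 3]
results, ticks = run_pipeline(LAYERS, [1, 2])
print(results, ticks)
```

With two inputs and three stages the pipeline finishes in 4 ticks, where running each input through all layers in turn would need 6 layer-steps. The gap widens as the number of inputs grows.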
With a striking lack of numbers, I'm not confident. In my experience, everything underspecified in a marketing release is unflattering. They're also not a chip design company, but they're probably trying to keep up in the eyes of investors. As the article mentions, several of their competitors are chip designers and already have working production inference chips.
When you have a few billion dollars you can hire chip people and partner with a chip company.
That's not to say I expect they'll ship something competitive with Google's custom AI hardware on the first go, since Google has been at it for quite a while, but there are very few technical problems large sums of money won't solve.
Yeah, I'm not sure how competitive it is without any specs. Just from it being "inference only" that puts it on the same level as Google's 2015 TPUv1.
Yes, my statement was not about the quality or performance of the chip -- simply the tapeout timeline that was stated, by itself.
I don't understand what the second paragraph is saying.
In very crude terms, AFAICT, if you have a bunch of matrix multiplications, but one of the matrices (the one with model weights) doesn't change, you can seriously speed up the computation. One thing is that you don't need to re-fetch the elements of the constant matrix; you can keep it near the ALUs. Then you can maybe detect and ignore sparse / empty blocks by marking them once.
IDK how the custom hardware exploits this; would love to hear any ideas!
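As a rough software analogy for the "mark empty blocks once" idea, here is a minimal Python sketch. It makes no claim about how real hardware does it: the function names, the tile size `BLOCK`, and the example matrix are all assumptions made for illustration. The constant weight matrix is scanned a single time to record which tiles contain any nonzero value, and every later matrix-vector product visits only those tiles.

```python
BLOCK = 2  # tile edge length (arbitrary choice for this sketch)

def nonzero_blocks(weights, block=BLOCK):
    """Scan the constant weight matrix once; return origins of tiles with any nonzero."""
    rows, cols = len(weights), len(weights[0])
    live = []
    for r0 in range(0, rows, block):
        for c0 in range(0, cols, block):
            if any(weights[r][c] != 0
                   for r in range(r0, min(r0 + block, rows))
                   for c in range(c0, min(c0 + block, cols))):
                live.append((r0, c0))
    return live

def matvec(weights, live, x, block=BLOCK):
    """Compute weights @ x, touching only the tiles marked live."""
    rows, cols = len(weights), len(weights[0])
    y = [0.0] * rows
    for r0, c0 in live:
        for r in range(r0, min(r0 + block, rows)):
            for c in range(c0, min(c0 + block, cols)):
                y[r] += weights[r][c] * x[c]
    return y

W = [
    [1.0, 2.0, 0.0, 0.0],
    [3.0, 4.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 5.0],
]
LIVE = nonzero_blocks(W)                    # done once, when the weights are fixed
y = matvec(W, LIVE, [1.0, 1.0, 1.0, 1.0])   # reused for every new input
print(LIVE, y)
```

Here two of the four tiles are all zero, so each product does half the multiply-adds. The scan cost is paid once because the weights never change.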
> IDK how the custom hardware exploits this; would love to hear any ideas!
You might like this article [1], titled "FPGA-based CNN Acceleration using Pattern-Aware Pruning". More context and details can be found in the PhD thesis of Léo Pradels [2].
[1]: https://inria.hal.science/hal-04689673/document
[2]: https://theses.hal.science/tel-05021575v1/file/PRADELS_Leo.p...
Current accelerators (TPUs, various on-chip NPUs) are something close to this. Systolic array is the established computer architecture term for flowing data from computation to computation without the overhead of a register file or von Neumann bottleneck.
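For anyone who hasn't seen a systolic array spelled out, here is a small cycle-by-cycle simulation in Python. It is a sketch under assumptions: it models an output-stationary grid (each cell keeps its own running sum while operands stream past), chosen only because it is short to write. Real accelerators may use other dataflows, such as keeping the weights stationary, and `systolic_matmul` is a made-up name.

```python
def systolic_matmul(A, B):
    """Simulate an n x p grid of multiply-accumulate cells computing A @ B."""
    n, m, p = len(A), len(B), len(B[0])
    acc = [[0] * p for _ in range(n)]     # each cell's running sum (stays in place)
    a_reg = [[0] * p for _ in range(n)]   # A values moving left to right
    b_reg = [[0] * p for _ in range(n)]   # B values moving top to bottom

    for t in range(n + m + p - 2):
        # Visit cells from the far corner so each one reads its neighbours'
        # values from the previous cycle, as hardware registers would.
        for i in reversed(range(n)):
            for j in reversed(range(p)):
                if j > 0:
                    a_in = a_reg[i][j - 1]
                else:
                    k = t - i  # row i is fed one cycle later than row i - 1
                    a_in = A[i][k] if 0 <= k < m else 0
                if i > 0:
                    b_in = b_reg[i - 1][j]
                else:
                    k = t - j  # column j is fed one cycle later than column j - 1
                    b_in = B[k][j] if 0 <= k < m else 0
                a_reg[i][j], b_reg[i][j] = a_in, b_in
                acc[i][j] += a_in * b_in
    return acc

C = systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
print(C)
```

Operands are only ever read from the grid edges or from a neighbouring cell, never from a shared memory, which is the property the term refers to.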
Random thought. Once models stabilise, could you possibly hardcode the model in gates? Or are they too large for a single chip?
https://www.anuragk.com/blog/posts/Taalas.html
https://taalas.com/
Wow, if they can get something like this working, what happens to all this infrastructure? Hyperscalers must be getting the assumed lifespan of that stuff wrong, considering the next gen will be 1000x more efficient.
The question isn’t whether it works (it does); the question is whether there are buyers for hardware that is obsolete the day it ships. Models evolve much more quickly than hardware can keep up.
Presumably at some point the rapid progress of models will plateau, at least insofar as a model could be frozen in time and remain economically useful for the expected life of hardware. Especially if it comes with compelling benefits e.g. dramatically lower latency and/or dramatically higher performance per watt.
If you can build chips that could run one specific LLM 100x faster than anything else, it would have a use case that nothing else could match.
Those Taalas chips apparently run at 1/10 the power of current SOTA GPU setups. If they can execute even partially on their plan, it'll be a literal game changer.
https://www.cerebras.ai/ is exactly that! Holy shit it's fast.
Cerebras is not that. Cerebras isn’t tied to a particular model like Taalas is. The latter is even faster than Cerebras.
Right, but there exist problems that need to be routinely solved and can be solved on GLM 5.2. Is the model state of the art when it is published? No. But when it comes out you could optimize it and let your solver run forever for quite cheap, and that could be useful if the only problems you want it to solve (for cheap) are solvable by that model.
And the high water mark of what can be solved by open models will keep going up.
One obvious use case is edge computing, such as in industrial applications that cannot tolerate the risk of a network link or cloud service going down. Even embedded use cases are possible, such as an image classifier model in a security camera.
In fact any application where the task is stable and the model good enough to address that task. As you suggest, industrial applications where a robot must deal with variants of the same repetitive task. Or a military drone which needs to be jamming proof.
> Or a military drone which needs to be jamming proof.
That, if used in war, I would think, would need the ability to be updated frequently. For example, your enemy might find out (say by running tests on hardware they captured from you) that painting some red paint in a particular shape (a smiley might even work) on their hardware prevented your drones from attacking them because it confuses that pattern with the Red Cross logo.
Those are really two different things. One is the computer vision that could be “hard coded” and the other is the image library, which would be updated regularly. Look at facial recognition. You can download and run a facial recognition model on your GPU that looks at a library of your personal photos. The model doesn’t change when it scans your photos for faces, it just writes the data associated with a “face” to whatever library. When you add a new picture, it adds that face data and compares it to the library for a match. The actual model never needs to change. It is the same as the one I downloaded and ran on my GPU for my photos. If it was written on chips we both bought and installed, it would work the same way.[1]
[1] Yes, this is a massive simplification
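The split between a fixed model and an updatable library can be sketched in a few lines of Python. This is a toy under loud assumptions: `embed` is a made-up stand-in that just normalises a vector (a real face model is far more complex), and `library`, `enroll`, `identify`, and the 0.95 threshold are all hypothetical. The point is only that enrolling a new face changes data, not the function.

```python
import math

def embed(pixels):
    """Frozen stand-in for the fixed model: maps an input to a unit-length vector."""
    norm = math.sqrt(sum(p * p for p in pixels)) or 1.0
    return tuple(p / norm for p in pixels)

library = {}  # the only mutable part: name -> stored embedding

def enroll(name, pixels):
    """Add a face to the library; the model itself is untouched."""
    library[name] = embed(pixels)

def identify(pixels, threshold=0.95):
    """Return the best-matching enrolled name, or None if nothing is close enough."""
    query = embed(pixels)
    best_name, best_score = None, threshold
    for name, ref in library.items():
        score = sum(q * r for q, r in zip(query, ref))  # cosine similarity
        if score >= best_score:
            best_name, best_score = name, score
    return best_name

enroll("alice", [1.0, 0.0, 0.0])
enroll("bob", [0.0, 1.0, 0.0])
print(identify([0.9, 0.1, 0.0]))
```

If `embed` were burned into silicon, `enroll` and `identify` would keep working unchanged, because they only read and write the library.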
You keep the "reasoning core" burned and play the cat-and-mouse game at the I/O edge. Enemy invents a smiley shield, your R&D figures out some filtering step that defeats this effect without compromising general image recognition. Then the enemy figures out a new trick, your R&D invents a countermeasure, and so on - point is, this can happen for a long time in layers on top of the core model. If the enemy invents some robust way to attack the core that cannot be filtered out, it's game over for that hardware, but that is a much more difficult task and might take longer than expected service time of a given batch of drones.
Sort of mirrors how biological organisms work. E.g. in a bird, the core functionality of knowing how to fly is burned in. Hunting food is probably a combination of experiential learning on top of instinctive behavior, and is somewhat adaptable to local conditions.
There may be all sorts of stable use case models that this could be interesting for. Imagine permanent voice translation circuits at a tiny fraction of the current price, glasses that subtitle the world with long battery life.
They are betting on fast release cycles coupled with much lower costs (purchase and operations) mixed with the ability to have dynamic fine tunes on top of the static model.
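One way a "dynamic fine tune on top of a static model" could look is a small swappable correction added to the output of frozen weights. The Python sketch below assumes a low-rank adapter in the style of LoRA. That is my assumption for illustration, not something the article or the vendor states, and `FROZEN_W`, `forward`, and `adapter_v1` are made-up names.

```python
# Stand-in for weights fixed in silicon (a tuple, so it cannot be modified).
FROZEN_W = ((1.0, 0.0), (0.0, 1.0))

def matvec(M, x):
    """Plain matrix-vector product on nested sequences."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]

def forward(x, adapter=None):
    """Frozen path plus an optional low-rank correction: W x + up (down x)."""
    y = matvec(FROZEN_W, x)
    if adapter is not None:
        down, up = adapter                   # down: r x d, up: d x r, with r small
        delta = matvec(up, matvec(down, x))
        y = [a + b for a, b in zip(y, delta)]
    return y

# A rank-1 adapter: cheap to store and swap, unlike the frozen matrix.
adapter_v1 = ([[1.0, 1.0]], [[0.5], [0.0]])

base_out = forward([2.0, 3.0])
tuned_out = forward([2.0, 3.0], adapter_v1)
print(base_out, tuned_out)
```

The adapter holds far fewer numbers than the base matrix would at realistic sizes, which is why it is plausible to keep it in rewritable memory while the bulk of the model stays fixed.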
The models have to run on something or they're useless. They can't run on future hardware today, and people want to use models today. So, if hardware is obsolete the day it ships, we're all using obsolete hardware, and there's no alternative to that.
Taalas encodes the model into the hardware itself. The two are inextricably coupled. It’s like buying a CNC router that can’t be reprogrammed to build anything other than a specific predetermined kitchen cabinet. And the model used inside is frozen many months before the hardware ships, since the process from tapeout to production takes that long.
In contrast, tomorrow’s models will typically run, although perhaps more slowly, on general-purpose inference hardware that was released today or even years ago.
Basically getting around the branch predictor problem with generalized compute architectures https://en.wikipedia.org/wiki/Branch_predictor
If you look at the timelines for the hiring of the hardware team, this was an extremely fast and high risk implementation from concept to tapeout. Amazing it works at all during bringup.
>If measuring from RTL-freeze to tapeout, this is a fairly typical (even somewhat unimpressive) timeline (accounting for some unexpected issues) for a large, complex 3nm chip.
Even for a company’s first design?
I don't think you get the newcomer novelty buff when your val approaches 13 digits.
Big companies are lumbering behemoths, crude assemblages of barely cobbled-together incentives and principal agent problems in a trenchcoat. Getting them to change direction, or worse, try something new at scale, is a massive undertaking.
Nah, you just need to get the CEO behind it. Most coordination issues get solved when the CEO is breathing down your neck to get something done. Trouble is that they don't do this enough.
CEOs have limited bandwidth, and can only breathe down so many necks at once.
Eh, zero guarantees on that one.
The Fire Phone was Jeff Bezos' personal baby, and we know how that went. Then there was the Apple G4 Cube with Steve Jobs, the Model X's Falcon Wing doors and Elon, and let's not even talk about the Metaverse and Zuck.
> The Fire Phone was Jeff Bezos' personal baby, and we know how that went.
I'd rather guess that Jeff Bezos' opinion on what makes a good phone is/was different from the opinion of many potential buyers.
An Amazon phone with Amazon Video, playing Amazon Music, making phone calls through the Amazon messenger, with an Amazon Browser that overlays ads to Amazon products, and has Amazon Voice Recognition ... blah blah blah
I imagine when you are a billionaire from one company, every time you hear the name of the company you hear your name, so you can't really think about what Joe Schmoe wants in a phone independently of your ego.
I guess this is what Steve Jobs was better at. SOME focus on the customer independent of his ego and Apple Apple Apple. I did say ... SOME.
Actually, you've provided examples that prove the point. None of those were especially good (though everyone wanted the G4 Cube), and yet they made it to market anyway. Why?
Because the CEO was behind it, breathing down their necks.
Pretty much every example is considered an abysmal failure that often cost the actual workers their careers while their CEO carried on.
If you consider that outcome a worthwhile endeavor, I don't know what else to say.
He's definitely not talking about worthy endeavour.
He's talking about an endeavour reaching the market.
I'm sure if Zuckerberg wants to spend $10B on Nuclear Fusion it will happen.
It’s fission, not fusion:
https://www.esgdive.com/news/meta-inks-nuclear-deals-terrapo...
…and if they do all of this, it’ll be closer to $20B than 10!
If all it took to get viable fusion power was a FAANG CEO with $10B to burn, I'd be first to petition for it to happen, and even throw whatever money I can spare onto that pyre.
The typical way a chip effort in a non-chip company works is that the "design" is the RTL (e.g. SystemVerilog that defines the behavior of the chip) and then this is handed off to a third-party "design house" (such as Broadcom) that turns that code into a real image of a chip, which is called a GDS (basically you can think of this as a very big layer by layer photoshop file) that can actually be sent to a fab. This is called "backend design", in contrast to the "frontend design" (the RTL itself).
As another commenter said, Broadcom is very experienced with backend design (as well as the supply chain management, testing, etc. that comes after the chip is taped out) and so this can't be regarded as a "first chip". Richard Ho (the head of hardware at OpenAI) is also extremely experienced and used to be the head of the Google TPU effort -- where he actually worked with Broadcom in a similar tapeout already. So yes, this is not a "first design"!
I wonder if Broadcom borrowed IP between the Google TPU and this design. How would you ever know it didn't happen?
There is no real way to prevent this, but there are ways to increase the cost of doing so. For example, one level of obfuscation is that OAI could internally run synthesis and adopt a “netlist-in” model in which Broadcom gets a netlist - a description of a huge number of gates and wires and how they connect - instead of the plain Verilog (or other language). It is possible to reverse engineer the netlist, but it’s a certain level of indirection and effort.
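To show what "gates and wires and how they connect" means next to behavioural code, here is a toy half adder in Python, written once as a one-line behaviour and once as a netlist. The data layout, the `simulate` helper, and the wire names `n1`..`n4` are invented for this sketch. Real netlists use formats such as structural Verilog and contain vastly more cells.

```python
# Behavioural description: says *what* the block computes.
def half_adder_behaviour(a, b):
    return a + b   # result is 0, 1 or 2

# Gate library for the toy netlist.
GATES = {
    "AND": lambda a, b: a & b,
    "XOR": lambda a, b: a ^ b,
}

# Netlist: (gate type, input wires, output wire). The anonymous wire names
# carry no hint of intent, which is part of what makes a netlist harder to read.
HALF_ADDER = [
    ("XOR", ("n1", "n2"), "n3"),   # sum bit
    ("AND", ("n1", "n2"), "n4"),   # carry bit
]

def simulate(netlist, inputs):
    """Evaluate gates in order; assumes they are listed in topological order."""
    wires = dict(inputs)
    for kind, ins, out in netlist:
        wires[out] = GATES[kind](*(wires[w] for w in ins))
    return wires

w = simulate(HALF_ADDER, {"n1": 1, "n2": 1})
print(w["n3"], w["n4"])
```

Both forms compute the same function, but recovering "this adds two bits" from the netlist takes analysis. At millions of gates that effort is the indirection described above.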
A big part of the semiconductor industry also operates on a reputation basis. Broadcom (like TSMC) is a neutral party as a design house, but if they did something like this, it might ruin that reputation.
More likely that the AI training set contained the IP of others, and we all know how that turns out.
This isn't Broadcom's first design.
Yeah, "first chip" here likely means they contracted Broadcom (or a firm with similar experience) to do all the heavy lifting. Building out your own in-house teams for this sort of thing is a decade-long project - just look at how much of Apple's early chips was licensed ARM / PowerVR cores.
Apple didn't have the talent in-house until they bought Intrinsity, who had worked with Samsung on Apple's earlier Arm chips as well. https://en.wikipedia.org/wiki/Intrinsity
That’s not quite fair. As I recall there were about 1,500 people in that part of the hardware org circa the mid-2000s. Before PA Semi there were pretty established teams already doing VLSI/PD/verification/validation, PCB, and of course analog/mixed hardware, in their own work and in conjunction with Samsung, old Broadcom, Qualcomm, etc. Lots of in-house work went into all those bespoke monitors, phones, Apple TV, AirPorts, etc.
My recollection is that PA Semi was very much for the architectural and design talent, even though it was an “asset purchase” and all the existing Power & military chips were hived off.
For Intrinsity I recall a lot of interest was actually in their existing graphics work and EDA. ISTR that those early mobile GPUs were what they focused on.
I was in the Mansfield org circa ‘07-11. I spent a lot of time flying between Cupertino and Austin/Bee Caves that first year.
I think the folks at PA Semi had some chops too.
The way I heard it, PA Semi was the singular driving force that led to Apple Silicon, but I'm not any kind of insider; that's just the chatter I heard.
Whoever it was, whooo, that's hot shit. I remember an M1 MacBook Air just cleaning the clock of an Intel MacBook Pro and thinking "x86_64 has real competition again".
Great silicon. I'm over it with not having root on my own machine, so I've left the ecosystem, but it's really nice hardware, can't dispute that.
It would be interesting to know Apple's true/inside attitude towards people putting Linux on their hardware. They don't seem very interested in helping, but I dunno whether they actively sabotage it either.
> The way I heard it PA Semi was the singular driving force that led to Apple Silicon
And a lot of them are sitting under Qualcomm via the Nuvia acquisition.
PA Semi group did the logic designs. I think they're talking about physical design though.