I wanna see an inference chip where the weights are part of the rom of the chip.

There would be 1 multiplier per weight (and since they're constant, the whole thing turns into a bunch of simple adders), and the total pipelined system throughput would be one token per clock cycle.
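To make the "constant multipliers turn into adders" point concrete, here is a minimal sketch, assuming non-negative integer weights. The function names are made up for illustration. A fixed weight is split into its set bits, and each set bit becomes one shifted copy of the input feeding an adder. A real synthesis flow would likely do better (signed digits, sharing terms between weights), so treat this as the idea only.

```python
def shift_add_terms(weight: int) -> list[int]:
    """Return the positions of the set bits in a non-negative constant weight."""
    if weight < 0:
        raise ValueError("this sketch only handles non-negative weights")
    return [bit for bit in range(weight.bit_length()) if (weight >> bit) & 1]


def multiply_by_constant(x: int, weight: int) -> int:
    """Multiply x by a fixed weight using only shifts and additions."""
    total = 0
    for bit in shift_add_terms(weight):
        # In hardware a constant shift is just wiring, so only the add costs gates.
        total += x << bit
    return total


if __name__ == "__main__":
    # 10 = 0b1010, so x * 10 becomes (x << 1) + (x << 3): two terms, one adder.
    print(shift_add_terms(10), multiply_by_constant(7, 10))
```

A weight with few set bits needs few adders, and a zero weight needs none at all.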

That means you can probably have millions of users simultaneously using a single bit of silicon, with perhaps 500 million tokens per second coming out the output bus.

Downside is this chip would be huuuuge - a whole wafer.

Wafer level faults probably won't matter though - neural nets are resistant to a few missing or wrong weights.
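A toy way to poke at that claim, assuming stuck-at-zero faults and a single random linear layer (not a trained network, so it says nothing definitive about a real LLM). It zeroes a random fraction of weights and reports how far the layer output moves. All names and sizes here are illustrative.

```python
import math
import random


def relative_error_after_faults(n_in=256, n_out=64, fault_fraction=0.01, seed=0):
    """Relative L2 change in a random linear layer's output after zeroing weights."""
    rng = random.Random(seed)
    weights = [[rng.gauss(0, 1) for _ in range(n_in)] for _ in range(n_out)]
    x = [rng.gauss(0, 1) for _ in range(n_in)]

    def matvec(w):
        return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]

    clean = matvec(weights)
    # Model each defect as a weight stuck at zero (an assumption of this sketch).
    faulty = [
        [0.0 if rng.random() < fault_fraction else w for w in row]
        for row in weights
    ]
    damaged = matvec(faulty)

    err = math.sqrt(sum((a - b) ** 2 for a, b in zip(clean, damaged)))
    norm = math.sqrt(sum(a * a for a in clean))
    return err / norm


if __name__ == "__main__":
    for fraction in (0.001, 0.01, 0.1):
        print(fraction, relative_error_after_faults(fault_fraction=fraction))
```

In this toy setup the error grows roughly with the square root of the fault fraction. Whether that degradation is acceptable after dozens of stacked layers is a separate question.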

Due to the speed the industry moves, you'd want to race from model weights to production super fast, make 50 wafers, use them for a year, then bin them when that model is obsolete.

this appeared some time ago, https://taalas.com/, but I'm sure there's others thinking these same thoughts. this would be best for small models imo, nothing frontier because that changes too fast

you can try it out here: https://chatjimmy.ai/

that's so fast it feels fake

13,789 tok/s

Well I've gotten one of those "holy fuck this is the future" deeply unsettled anxious feelings in my gut again. It's been a week or 2, it was time.

i only found one discussion of the tech here on HN

https://news.ycombinator.com/item?id=47103661

It's indeed super fast, but the output is complete BS hallucination. Not sure what's the value of this.

It's a proof of concept that it's possible to etch a neural net into a chip and get massive performance (and efficiency) boost

By the way, you've seen Cerebras? It's not gone as far as what you described - loads of cores and RAM but you still load up the weights onto it as software and they need to be streamed into the chip for large models - but it is a whole wafer.

Cerebras is a whole lot of SRAM, basically a ton more L1/L2 cache, hence increasing throughput.

They're pretty supply constrained right now though and their production costs seem prohibitive.

The interesting players at the moment are from Toronto: Taalas (print the model onto the silicon) and Tenstorrent (dataflow programming based hardware)

There is a huge downside to weights being modifiable - it means you need to have multipliers (not simply adders), and SRAM to store those weights.

I suspect for equal performance, that's probably a 5x increase in silicon area (and therefore cost).

>> I wanna see an inference chip where the weights are part of the rom of the chip.

I've been wondering about that for a while now. For a lot of tasks putting weights in ROM is probably OK. OTOH:

>> There would be 1 multiplier per weight...

I'm not sure that is a good idea. Maybe if it's quantized down to 2 bits... Otherwise maybe a small ROM near each multiplier (or row of them or whatever) so the multipliers could handle N distinct matrix operations without having to move the data from far away.

Another fun thought is to have a row of MAC units on DRAM so a DRAM row would be a vector. Row size might be 64Kbit or 8K weights if they're 8bit. This also keeps the weights and calcs on the same chip. I'm not sure this would put enough multipliers on one chip though. Systolic arrays can have tens or hundreds of thousands each doing one op per clock cycle.

Analog chips could also be very interesting, instead of using digital signals and processing them against the weights in the ROM. I have no idea if that scales to such big models though.

The drawback is in keeping signal fidelity (e.g. dissipation, temperature etc.) and in the conversion between analogue and digital.

Nonetheless, yes, there are already implemented solutions for small NNs (I understand mostly acting as triggers).

You don't need a single wafer, you can split the model into many smaller different chips and connect inputs/outputs.

Skip VHDL and directly go for GDSII / OASIS. Try to find similar vectors so you get re-usable blocks.

You can dynamically calibrate a chip by fine tuning output.

This may be extreme, or completely stupid, but why are we not using genetics to "grow" chips in a chemical soup yet? Similar to Verilog/VHDL, don't we have some similar language to express circuits using gene sequences?

I've worked for one of Europe's biggest synthetic biology labs and I know lots of biologists are low-key interested, but current players in semiconductors see it as kind of a tarpit.

IBM used to have a program using DNA origami for lithography back in 2009, which makes sense as lithography masks are a pain to make. I really wish I knew why the program was stopped, but most of the researchers are retired by now.

As to whether you can just "grow" the whole chip from scratch, the answer is probably, but it would require lots of non-trivial scientific discoveries. For instance, we can't really make sizable chips using DNA without horrible defect rates. Biology is much better at making redundant Rube Goldberg machines than very precise machines with no tolerance for errors.

I think we'd have a better chance of success if we made very weird kinds of chips that better took advantage of the medium, perhaps even something that we "train" rather than just use out of the box.

I'd love it if anyone here knew more about this !

Would it be comparatively easy to make neuromorphic chips instead of traditional chips? I believe probabilistic algorithms like those employed by LLMs must be more tolerant of defects as well..?

We lack robust frameworks for 'forward engineering' stochastic thermodynamic computation over molecular free-energy landscapes (which is basically what a "chemical soup" is doing) like we do for analog/optical/digital computing. This is why, as a field, medicine is so heavily empirical and reverse engineering oriented.

Man... I had to chatgpt your comment just to understand. But I do now.

Basically, unlike the current chip manufacturing process, where every stage is deterministic and precise, the soup-world, the chemistry, is not. And we do not have accurate enough models to handle it in a deterministic way or to model it precisely.

My respect for nature's engineering just shot up tenfold.

Are you referencing the 1998 short story "Taklamakan" by Bruce Sterling?

Thanks.. just looked it up. Seems super interesting.

Do that at scale

Bacteria do that at scale, far far bigger than all chips combined. All it takes is chemical soup and a few starter seed DNAs.

Ah, so we're not talking creating full on brains after all?

> "Downside is this chip would be huuuuge - a whole wafer."

Why don't we have chips like that? If a CPU the size of a postage stamp can do x amount of performance, imagine how much performance you could get if you used an entire wafer of chips running in parallel. Obviously there would be certain use cases, like you couldn't fit an entire wafer in a phone, but still

Using the space of an entire wafer for one chip would result in extremely low manufacturing yields. Even with state of the art silicon cleanrooms, there will still be defects in parts of the output.

With CPUs and GPUs, chip makers can disable faulty cores and bin them as lower SKUs to get some yield out of it. But if you're using an entire wafer to embed weights, and a speck of dust causes a printing defect that makes the weights wrong, the entire wafer is worthless.
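A rough sketch of why die area hurts so much, using the simple Poisson yield model (yield = exp(-area × defect density)). The defect density below is an illustrative assumption, not a figure from any fab, and the model treats a single defect anywhere as fatal, which is exactly the "no redundancy" case described above.

```python
import math


def poisson_yield(area_cm2: float, defects_per_cm2: float) -> float:
    """Probability that a die of the given area has zero defects (Poisson model)."""
    return math.exp(-area_cm2 * defects_per_cm2)


if __name__ == "__main__":
    defect_density = 0.1                  # defects per cm^2, assumed for illustration
    small_die = 1.0                       # cm^2, roughly postage-stamp scale
    full_wafer = math.pi * 15.0 ** 2      # cm^2, area of a 300 mm wafer

    print("small die yield :", poisson_yield(small_die, defect_density))
    print("full wafer yield:", poisson_yield(full_wafer, defect_density))
```

Under these assumptions the small die yields around 90% while a defect-free wafer is effectively impossible, so any wafer-scale design has to tolerate defects, either with spare cores or by accepting some wrong weights.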

Do failed wafers have to go in the trash, or can you recycle them?

You can grind some of the raw silicon out of a finished wafer but I don't think it'd be suitable to use in another batch of product. So instead of having the weights on the wafer like OOP was suggesting, hardware inference has been trending toward having a wafer of lots and lots of smaller cores with fast SRAM.

What's the difference between disabling faulty cores and disabling the parts of the wafer that have defects?

I'm not an expert, but I think those are the same thing. But for an LLM etched onto a whole wafer, it doesn't make sense to disable part of it since that would remove some weights entirely.

Is that defect easy to detect?

We do. The Cerebras line of Wafer Scale Engines is exactly an entire wafer of cores running in parallel with fast memory next to each one. It's intended for very high throughput LLM inference. https://www.cerebras.ai/chip

One token per clock cycle at 1B parameters would imply 2 ExaFLOPS (at a ~1 GHz clock), consuming about 10 kW
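The arithmetic behind that figure, as a sketch. It assumes 2 FLOPs per parameter per token (one multiply, one add) and a 1 GHz clock. The clock rate is my assumption, chosen because it is what makes the numbers come out to 2 ExaFLOPS. The 10 kW figure is taken from the comment as given.

```python
def sustained_flops(params: float, clock_hz: float, flops_per_param: float = 2.0) -> float:
    """FLOPS needed to emit one token per clock cycle."""
    return params * flops_per_param * clock_hz


def joules_per_flop(power_watts: float, flops: float) -> float:
    """Energy budget per operation implied by a power figure."""
    return power_watts / flops


if __name__ == "__main__":
    flops = sustained_flops(params=1e9, clock_hz=1e9)   # 1B params, assumed 1 GHz
    print("FLOPS:", flops)
    print("J/FLOP at 10 kW:", joules_per_flop(10_000, flops))
```

So 10 kW would mean a budget of about 5 femtojoules per operation.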

I've also been thinking about this, although the forward pass of a transformer model also involves some heavier operations like normalization, reciprocals, exponentiations or other non-linearities (GeLU, SiLU), which may (though typically don't) involve learned weights as operands.
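For reference, a sketch of the operations being talked about, written in plain Python. GeLU (exact erf form) and SiLU take no learned parameters. RMSNorm is shown as one example of a normalization that does carry a learned per-channel gain. Which variants a given model uses is an assumption here.

```python
import math


def silu(x: float) -> float:
    """SiLU / swish: x * sigmoid(x). Needs an exponential and a reciprocal."""
    return x / (1.0 + math.exp(-x))


def gelu(x: float) -> float:
    """GeLU, exact form: x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def rms_norm(xs: list[float], gain: list[float], eps: float = 1e-6) -> list[float]:
    """RMSNorm: divide by the root-mean-square, then scale by a learned gain."""
    rms = math.sqrt(sum(x * x for x in xs) / len(xs) + eps)
    return [g * x / rms for x, g in zip(xs, gain)]


if __name__ == "__main__":
    print(silu(1.0), gelu(1.0))
    print(rms_norm([3.0, 4.0], gain=[1.0, 1.0]))
```

None of these reduce to constant-coefficient adders the way a fixed matrix multiply does, since the divisor in the norm depends on the activations.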

Supposedly memristors would be ideal for this (and it would be reprogrammable), but then again, memristors seem to be the carbon nanotubes of the computing world.

> weights [as] part of the rom of the chip

Not really that: you are pointing to Compute-In-Memory (CIM) - techniques where the data (here, a multiplier value) is part of the processor (here, the multiplying circuit).

The problem of "fetch and process" is bypassed completely architecturally: the data is there where the processing happens - it's not moved, there is no latency.

firmware upgrade would mean flashing a huge BIN file.

How would the pipelining work when the next token depends on the last token?
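One possible answer, sketched as a small simulation: a single sequence can't keep the pipeline full, but many independent sequences (different users) can be interleaved so every stage is busy. This is an assumption about how such a chip would be used, and it ignores attention state / KV cache entirely. The function and parameter names are made up.

```python
from collections import deque


def tokens_emitted(pipeline_depth: int, num_sequences: int, cycles: int) -> int:
    """Count tokens produced when a sequence must wait for its previous token."""
    ready = deque(range(num_sequences))   # sequences allowed to issue a new token
    in_flight = deque()                   # (finish_cycle, sequence_id), in issue order
    emitted = 0
    for cycle in range(cycles):
        # Retire any token whose last pipeline stage completes this cycle.
        while in_flight and in_flight[0][0] == cycle:
            _, seq = in_flight.popleft()
            emitted += 1
            ready.append(seq)             # its sequence may now issue the next token
        # The first stage accepts at most one new token per cycle.
        if ready:
            seq = ready.popleft()
            in_flight.append((cycle + pipeline_depth, seq))
    return emitted


if __name__ == "__main__":
    for users in (1, 2, 4, 8):
        print(users, "users:", tokens_emitted(pipeline_depth=4, num_sequences=users, cycles=40))
```

With one user the pipeline produces a token every `pipeline_depth` cycles. Once there are at least as many active sequences as stages, it reaches one token per cycle in aggregate, though each individual user still sees the full pipeline latency per token.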

“Wafer level faults probably won't matter though - neural nets are resistant to a few missing or wrong weights.”

Brain science people “love” traumatic brain injury cases because it can help explore what happens when bits of the “brain wafer” get damaged. We’ve learned a lot from such things.

I wonder if people are intentionally “destroying” parts of the model weights to learn more about what happens? Like could you strategically wipe a gig of the model so it’s “all zeros” and see what happens?

I have to wonder

This is called mechanistic interpretability. There are lots of fascinating insights already since you can do basically everything down to the neuron or weight level thousands of times. The human brain is many orders of magnitude harder to make sense of.

well it's actually called ablation, and it's one way to do mech interp. Anthropic's got a bunch of work on mech interp here https://transformer-circuits.pub/, like SAEs and NLAs

Of course tampering with chunks or nodes in the NNs is a way to study the "spawned" (through gradient descent etc.) configuration and "reverse-engineer the black box" to get "AI transparency".

Anthropic published an important work around one year and a half ago.

> Anthropic published an important work around one year and a half ago

> #Tracing the thoughts of a large language model#

https://www.anthropic.com/research/tracing-thoughts-language...

https://news.ycombinator.com/item?id=43495617 (27 March 2025)

Reminds me of Golden Gate Claude (https://www.anthropic.com/news/golden-gate-claude)