Hacker News

I wanna see an inference chip where the weights are part of the rom of the chip.

There would be 1 multiplier per weight (and since they're constant, the whole thing turns into a bunch of simple adders), and the total pipelined system throughput would be one token per clock cycle.
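To make the "constant multiplier turns into adders" point concrete, here is a minimal sketch. It assumes non-negative integer weights, and the function names are made up for illustration. In hardware each shift is just wiring, so only the additions cost gates.

```python
def shift_add_terms(weight: int) -> list[int]:
    """Return the bit positions that are set in a non-negative constant weight."""
    if weight < 0:
        raise ValueError("this sketch only handles non-negative weights")
    return [bit for bit in range(weight.bit_length()) if (weight >> bit) & 1]


def multiply_by_constant(x: int, weight: int) -> int:
    """Multiply x by a fixed weight using only shifts and additions."""
    total = 0
    for bit in shift_add_terms(weight):
        # Each set bit of the weight contributes one shifted copy of x.
        total += x << bit
    return total
```

For example, a weight of 10 (binary 1010) needs just two shifted copies of the input and one adder, and a weight of 0 needs no hardware at all.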

That means you can probably have millions of users simultaneously using a single bit of silicon, with perhaps 500 million tokens per second coming out the output bus.
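One way to read the "millions of users" part: each user's next token depends on their previous one, so a single user can only have one token in the pipeline at a time, and the pipeline only stays full if enough independent users are interleaved. A toy simulation of that reading, where `depth`, `streams`, and the round-robin issue policy are all assumptions for illustration:

```python
def simulate(depth: int, streams: int, cycles: int) -> int:
    """Count tokens issued when each stream may have only one token in flight."""
    ready_at = [0] * streams  # cycle at which each stream may issue its next token
    issued = 0
    for cycle in range(cycles):
        # The pipeline accepts at most one new token per clock cycle.
        for s in range(streams):
            if ready_at[s] <= cycle:
                # The stream's next token needs this one's result,
                # which only leaves the pipeline `depth` cycles later.
                ready_at[s] = cycle + depth
                issued += 1
                break
    return issued
```

With a 4-stage pipeline, one stream issues a token only every 4th cycle, while 4 or more streams keep it issuing every cycle.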

Downside is this chip would be huuuuge - a whole wafer.

Wafer level faults probably won't matter though - neural nets are resistant to a few missing or wrong weights.

Due to the speed the industry moves, you'd want to race from model weights to production super fast, make 50 wafers, use them for a year, then bin them when that model is obsolete.

sometimelurker a day ago [ - ]

this appeared some time ago, https://taalas.com/, but I'm sure there's others thinking these same thoughts. this would be best for small models imo, nothing frontier because that changes too fast

1e1a a day ago [ - ]

you can try it out here: https://chatjimmy.ai/

Meetvelde 19 hours ago [ - ]

that's so fast it feels fake

the_sleaze_ 18 hours ago [ - ]

13,789 tok/s

Well I've gotten one of those "holy fuck this is the future" deeply unsettled anxious feelings in my gut again. It's been a week or 2, it was time.

froh 14 hours ago [ - ]

i only found one discussion of the tech here on HN

https://news.ycombinator.com/item?id=47103661

agazso 13 hours ago [ - ]

It's indeed super fast, but the output is complete BS hallucination. Not sure what's the value of this.

runeks 8 hours ago [ - ]

It's a proof of concept that it's possible to etch a neural net into a chip and get a massive performance (and efficiency) boost.

Smaug123 a day ago [ - ]

By the way, you've seen Cerebras? It's not gone as far as what you described - loads of cores and RAM but you still load up the weights onto it as software and they need to be streamed into the chip for large models - but it is a whole wafer.

trouve_search a day ago [ - ]

Cerebras is a whole lot of SRAM, basically a ton more L1/L2 cache, hence increasing throughput.

They're pretty supply constrained right now though and their production costs seem prohibitive.

The interesting players at the moment are from Toronto: Taalas (print the model onto the silicon) and Tenstorrent (dataflow programming based hardware)

londons_explore a day ago [ - ]

There is a huge downside to weights being modifiable - it means you need to have multipliers (not simply adders), and SRAM to store those weights.

I suspect for equal performance, that's probably a 5x increase in silicon area (and therefore cost).

phkahler a day ago [ - ]

>> I wanna see an inference chip where the weights are part of the rom of the chip.

I've been wondering about that for a while now. For a lot of tasks putting weights in ROM is probably OK. OTOH:

>> There would be 1 multiplier per weight...

I'm not sure that is a good idea. Maybe if it's quantized down to 2 bits... Otherwise maybe a small ROM near each multiplier (or row of them or whatever) so the multipliers could handle N distinct matrix operations without having to move the data from far away.

Another fun thought is to have a row of MAC units on DRAM so a DRAM row would be a vector. Row size might be 64Kbit or 8K weights if they're 8bit. This also keeps the weights and calcs on the same chip. I'm not sure this would put enough multipliers on one chip though. Systolic arrays can have tens or hundreds of thousands each doing one op per clock cycle.
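The row-size arithmetic, as a quick sketch (the helper name is hypothetical; the 64 Kbit row and the 8-bit and 2-bit widths are the ones mentioned above):

```python
def weights_per_row(row_bits: int, bits_per_weight: int) -> int:
    """Number of whole weights that fit in one DRAM row."""
    return row_bits // bits_per_weight


ROW_BITS = 64 * 1024  # a 64 Kbit row

weights_8bit = weights_per_row(ROW_BITS, 8)  # 8K weights per row
weights_2bit = weights_per_row(ROW_BITS, 2)  # 32K weights if quantized to 2 bits
```

So one row activation would feed 8K MAC units at 8 bits, which is well below the tens or hundreds of thousands a systolic array can have.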

cyptus a day ago [ - ]

analog chips could also be very interesting instead of using digital signals and processing them against the weights in the ROM. I have no idea if that scales with such big models though.

mdp2021 20 hours ago [ - ]

The drawback is in keeping signal fidelity (e.g. dissipation, temperature etc.) and in the conversion between analogue and digital.

Nonetheless, yes, there are already implemented solutions for small NNs (I understand mostly acting as triggers).

whazor 7 hours ago [ - ]

You don't need a single wafer, you can split the model into many smaller different chips and connect inputs/outputs.

Skip VHDL and directly go for GDSII / OASIS. Try to find similar vectors so you get re-usable blocks.

You can dynamically calibrate a chip by fine tuning output.

freakynit 16 hours ago [ - ]

This may be extreme, or, completely stupid, but, why are we not using genetics to "grow" chips in a chemical soup yet? Similar to Verilog/VHDL, don't we have some similar language to express circuits using gene sequences?

marcosqanil 12 hours ago [ - ]

I've worked for one of Europe's biggest synthetic biology labs and I know lots of biologists are low-key interested, but current players in semiconductors see it as kind of a tarpit.

IBM used to have a program using DNA origami for lithography back in 2009, which makes sense as lithography masks are a pain to make. I really wish I knew why the program was stopped, but most of the researchers are retired by now.

As to whether you can just "grow" the whole chip from scratch, the answer is probably, but it would require lots of non-trivial scientific discoveries. For instance, we can't really make sizable chips using DNA without horrible defect rates. Biology is much better at making redundant Rube Goldberg machines than very precise machines with no tolerance for errors.

I think we'd have a better chance of success if we made very weird kinds of chips that better took advantage of the medium, perhaps even something that we "train" rather than just use out of the box.

I'd love it if anyone here knew more about this!

freakynit 11 hours ago [ - ]

Would it be comparatively easy to make neuromorphic chips instead of traditional chips? I believe probabilistic algorithms like those employed by LLMs must be more tolerant of defects as well..?

whalee 15 hours ago [ - ]

We lack robust frameworks for 'forward engineering' stochastic thermodynamic computation over molecular free-energy landscapes (which is basically what a "chemical soup" is doing) like we do for analog/optical/digital computing. This is why, as a field, medicine is so heavily empirical and reverse engineering oriented.

freakynit 12 hours ago [ - ]

Man... I had to chatgpt your comment just to understand. But I do now.

Basically, unlike the current chip manufacturing process, where every stage is deterministic and precise, the soup-world, the chemistry, is not. And we do not have models accurate enough to handle it deterministically or to model it precisely.

My respect for nature's engineering just shot up tenfold.

AceJohnny2 15 hours ago [ - ]

Are you referencing the 1998 short story "Taklamakan" by Bruce Sterling?

freakynit 12 hours ago [ - ]

Thanks.. just looked it up. Seems super interesting.

fallat 16 hours ago [ - ]

Do that at scale

freakynit 16 hours ago [ - ]

Bacteria do that at scale, far far bigger than all chips combined. All it takes is chemical soup and a few starter seed DNAs.

fallat 6 hours ago [ - ]

Ah, so we're not talking about creating full-on brains after all?

voidUpdate 12 hours ago [ - ]

> "Downside is this chip would be huuuuge - a whole wafer."

Why don't we have chips like that? If a CPU the size of a postage stamp can do x amount of performance, imagine how much performance you could get if you used an entire wafer of chips running in parallel. Obviously there would be certain use cases, like you couldn't fit an entire wafer in a phone, but still

ngomez 12 hours ago [ - ]

Using the space of an entire wafer for one chip would result in extremely low manufacturing yields. Even with state of the art silicon cleanrooms, there will still be defects in parts of the output.

With CPUs and GPUs, chip makers can disable faulty cores and bin them as lower SKUs to get some yield out of it. But if you're using an entire wafer to embed weights, and a speck of dust causes a printing defect that makes the weights wrong, the entire wafer is worthless.
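A rough sketch of why area hurts so much, using the simple Poisson yield model (yield = exp(-area × defect density)). The defect density and die size below are assumed for illustration, not real fab data, and the wafer is taken to be a 300 mm disc with no edge exclusion.

```python
import math


def poisson_yield(area_cm2: float, defects_per_cm2: float) -> float:
    """Probability that a region of the given area contains zero defects."""
    return math.exp(-area_cm2 * defects_per_cm2)


DEFECT_DENSITY = 0.1  # defects per cm^2 (assumed, illustrative only)
SMALL_DIE_AREA = 1.0  # cm^2 (assumed small die)
WAFER_AREA = math.pi * 15.0 ** 2  # cm^2, 300 mm diameter wafer (assumed)

small_die_yield = poisson_yield(SMALL_DIE_AREA, DEFECT_DENSITY)
whole_wafer_yield = poisson_yield(WAFER_AREA, DEFECT_DENSITY)
```

Under these assumptions about nine in ten small dies come out defect-free, while a defect-free whole wafer is essentially never going to happen. The model only applies if a single defect kills the part, which is the case described above where nothing can be disabled or tolerated.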

voidUpdate 9 hours ago [ - ]

Do failed wafers have to go in the trash, or can you recycle them?

ngomez 41 minutes ago [ - ]

You can grind some of the raw silicon out of a finished wafer but I don't think it'd be suitable to use in another batch of product. So instead of having the weights on the wafer like OOP was suggesting, hardware inference has been trending toward having a wafer of lots and lots of smaller cores with fast SRAM.

Jyaif 8 hours ago [ - ]

What's the difference between disabling faulty cores and disabling the parts of the wafer that have defects?

RussianCow 4 hours ago [ - ]

I'm not an expert, but I think those are the same thing. But for an LLM etched onto a whole wafer, it doesn't make sense to disable part of it since that would remove some weights entirely.

cactusplant7374 9 hours ago [ - ]

Is that defect easy to detect?

kimsey0 10 hours ago [ - ]

We do. The Cerebras line of Wafer Scale Engines is exactly an entire wafer of cores running in parallel with fast memory next to each one. It's intended for very high throughput LLM inference. https://www.cerebras.ai/chip

WithinReason 8 hours ago [ - ]

One token per clock cycle at 1B parameters would imply 2 ExaFLOPS, consuming about 10 kW
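The 2 ExaFLOPS figure works out if you assume a 1 GHz clock and 2 FLOPs (one multiply, one add) per parameter per token. Neither assumption is stated explicitly, so treat this as one plausible reconstruction of the arithmetic:

```python
PARAMS = 1_000_000_000  # 1B parameters
FLOPS_PER_PARAM = 2  # one multiply plus one add per weight (assumed)
CLOCK_HZ = 1_000_000_000  # 1 GHz clock (assumed)

tokens_per_second = CLOCK_HZ  # one token per clock cycle
flops = PARAMS * FLOPS_PER_PARAM * tokens_per_second
exaflops = flops / 1e18

POWER_WATTS = 10_000  # the quoted 10 kW
joules_per_flop = POWER_WATTS / flops  # energy budget implied per operation
```

Taken at face value, 10 kW spread over 2 ExaFLOPS leaves about 5 femtojoules per operation.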

a day ago [ - ]

[deleted]

yuriyguts a day ago [ - ]

I've also been thinking about this. Although the forward pass of a transformer model also involves some heavier operations like normalization, reciprocals, exponentiations or other non-linearities (GeLU, SiLU) which may (though typically don't) involve learned weights as operands.

Salgat 21 hours ago [ - ]

Supposedly memristors would be ideal for this (and it would be reprogrammable), but then again, memristors seem to be the carbon nanotubes of the computing world.

mdp2021 20 hours ago [ - ]

> weights [as] part of the rom of the chip

Not really that: you are pointing to Compute-In-Memory (CIM) - techniques where the data (here, a multiplier value) is part of the processor (here, the multiplying circuit).

The problem of "fetch and process" is bypassed completely architecturally: the data is there where the processing happens - it's not moved, there is no latency.

zkmon a day ago [ - ]

firmware upgrade would mean flashing a huge BIN file.

HDThoreaun 19 hours ago [ - ]

How would the pipelining work when the next token depends on the last token?

cruffle_duffle a day ago [ - ]

“Wafer level faults probably won't matter though - neural nets are resistant to a few missing or wrong weights.”

Brain science people “love” traumatic brain injury cases because it can help explore what happens when bits of the “brain wafer” get damaged. We’ve learned a lot from such things.

I wonder if people are intentionally “destroying” parts of the model weights to learn more about what happens? Like could you strategically wipe a gig of the model so it’s “all zeros” and see what happens?

I have to wonder

zurfer a day ago [ - ]

This is called mechanistic interpretability. There are lots of fascinating insights already since you can do basically everything down to the neuron or weight level thousands of times. The human brain is many orders of magnitude harder to make sense of.

sometimelurker a day ago [ - ]

well it's actually called ablation, and it's one way to do mech interp. Anthropic's got a bunch of work on mech interp here https://transformer-circuits.pub/, like SAEs and NLAs
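A toy version of a weight ablation, just to show the mechanics: zero out a random fraction of a layer's weights and measure how far the output moves. This is a single random linear layer in plain Python, not a real model, and all the names and sizes are made up for illustration.

```python
import random


def make_layer(rows: int, cols: int, seed: int = 0) -> list[list[float]]:
    """Build a random weight matrix standing in for one layer."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def forward(weights: list[list[float]], x: list[float]) -> list[float]:
    """Matrix-vector product: one output per row of weights."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def ablate(weights: list[list[float]], fraction: float, seed: int = 1) -> list[list[float]]:
    """Return a copy with roughly `fraction` of the weights set to zero."""
    rng = random.Random(seed)
    return [[0.0 if rng.random() < fraction else w for w in row] for row in weights]


def relative_error(reference: list[float], damaged: list[float]) -> float:
    """Euclidean distance between outputs, relative to the reference norm."""
    num = sum((a - b) ** 2 for a, b in zip(reference, damaged)) ** 0.5
    den = sum(a * a for a in reference) ** 0.5
    return num / den


layer = make_layer(64, 256)
x = [1.0] * 256
baseline = forward(layer, x)
error_at_1_percent = relative_error(baseline, forward(ablate(layer, 0.01), x))
```

Real ablation studies target specific neurons, heads, or features rather than random weights, and look at behaviour rather than a raw output norm, but the loop is the same: damage, rerun, compare.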

Cantinflas a day ago [ - ]

mdp2021 19 hours ago [ - ]

Of course tampering with chunks or nodes in the NNs is a way to study the "spawned" (through gradient descent etc.) configuration and "reverse-engineer the black box" to get "AI transparency".

Anthropic published an important work around one year and a half ago.

mdp2021 14 hours ago [ - ]

> Anthropic published an important work around one year and a half ago

> #Tracing the thoughts of a large language model#

https://www.anthropic.com/research/tracing-thoughts-language...

https://news.ycombinator.com/item?id=43495617 (27 March 2025)

Computer0 a day ago [ - ]

Reminds me of Golden Gate Claude (https://www.anthropic.com/news/golden-gate-claude)