Asking as someone who is really out of the loop: how much of ML development these days touches these “lower level” parts of the stack? I’d expect that by now most of the work would be high level, and the infra would be mostly commoditized.

> how much of ML development these days touches these “lower level” parts of the stack? I’d expect that by now most of the work would be high level

Every time the high-level architecture of models changes, there are new lower-level optimizations to be done. Even recent releases like GPT-OSS add new areas for improvement, like MXFP4, which require the lower-level parts to be created and optimized.
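For concreteness, a hedged sketch of the MXFP4 idea (per the OCP Microscaling format: blocks of up to 32 values share one power-of-two scale, and each element is a 4-bit E2M1 float). `quantize_block` and the round-to-nearest choice are my own illustration, not a real kernel:

```python
import math

# The 8 non-negative magnitudes representable in E2M1 (sign bit is separate).
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def quantize_block(block):
    """Quantize up to 32 floats to MXFP4-style values (illustrative only)."""
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return 1.0, [0.0] * len(block)
    # Shared power-of-two scale so the largest magnitude fits within 6.0,
    # the E2M1 maximum (real MXFP4 stores this scale as an 8-bit exponent).
    scale = 2.0 ** math.ceil(math.log2(amax / 6.0))
    out = []
    for x in block:
        mag = min(abs(x) / scale, 6.0)
        nearest = min(E2M1_GRID, key=lambda g: abs(g - mag))
        out.append(math.copysign(nearest * scale, x))
    return scale, out

scale, q = quantize_block([0.1, -0.7, 1.3, 2.9])
print(scale, q)  # 0.5 [0.0, -0.75, 1.5, 3.0]
```

Real kernels pack the 4-bit codes and run fused on the accelerator; this just shows why a shared block scale plus a tiny element grid is enough to keep the largest weights representable.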

How often do hardware optimizations get created for lower level optimization of LLMs and Tensor physics? How reconfigurable are TPUs? Are there any standardized feature flags for TPUs yet?

Is TOPS/Whr a good efficiency metric for TPUs and for LLM model hosting operations?

From https://news.ycombinator.com/item?id=45775181 re: current TPUs in 2025; "AI accelerators" :

> How does Cerebras WSE-3 with 44GB of 'L2' on-chip SRAM compare to Google's TPUs, Tesla's TPUs, NorthPole, Groq LPU, Tenstorrent's, and AMD's NPU designs?

this is like 5 different questions all across the landscape - what exactly do you think answers will do for you?

> How often do hardware optimizations get created for lower level optimization of LLMs and Tensor physics?

LLMs? all the time? "tensor physics" (whatever that is) never

> How reconfigurable are TPUs?

very? as reconfigurable as any other programmable device?

> Are there any standardized feature flags for TPUs yet?

have no idea what a feature flag is in this context nor why they would be standardized (there's only one manufacturer/vendor/supplier of TPUs).

> Is TOPS/Whr a good efficiency metric for TPUs and for LLM model hosting operations?

i don't see why it wouldn't be? you're just asking is (stuff done)/(energy consumed) a good measure of efficiency to which the answer is yes?

> have no idea what a feature flag is in this context nor why they would be standardized (there's only one manufacturer/vendor/supplier of TPUs).

x86, ARM, and RISC-V have all standardized on feature flags, which can be reviewed on Linux via /proc/cpuinfo or with dmidecode.

  grep -E '^processor|Features|^BogoMIPS|^CPU' /proc/cpuinfo
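The same flags can be read programmatically; a minimal Linux-only sketch (`cpu_flags` is a hypothetical helper, not a standard API):

```python
import os

def cpu_flags(path="/proc/cpuinfo"):
    """Collect CPU feature flags ('flags' on x86, 'Features' on ARM)."""
    flags = set()
    with open(path) as f:
        for line in f:
            key, _, value = line.partition(":")
            if key.strip() in ("flags", "Features"):
                flags.update(value.split())
    return flags

if os.path.exists("/proc/cpuinfo"):  # Linux only
    print("avx2" in cpu_flags(), "asimd" in cpu_flags())
```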
There are multiple TPU vendors. I listed multiple AI accelerator TPU products in the comment you are replying to.

> How reconfigurable are TPUs?

TIL Google's TPUs are reconfigurable via OCS (Optical Circuit Switches), which can switch the pods between, for example, 3D torus or twisted-torus configurations.
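For intuition, a plain 3D torus just wraps the neighbor links around each axis (the twisted variant shifts some of those wraparound links; omitted here). A minimal sketch, with `torus_neighbors` as a hypothetical helper:

```python
# Neighbors of a chip at (x, y, z) in an X*Y*Z 3D torus:
# +/-1 along each axis, with wraparound at the edges.
def torus_neighbors(coord, dims):
    x, y, z = coord
    X, Y, Z = dims
    return [
        ((x + 1) % X, y, z), ((x - 1) % X, y, z),
        (x, (y + 1) % Y, z), (x, (y - 1) % Y, z),
        (x, y, (z + 1) % Z), (x, y, (z - 1) % Z),
    ]

# A corner chip in a 4x4x4 torus still has six neighbors thanks to wraparound:
print(torus_neighbors((0, 0, 0), (4, 4, 4)))
```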

(FWIW also, quantum libraries mostly have Line qubits and Lattice qubits. There is a recent "Layer Coding" paper that aims to surpass Surface Coding.)

But for classical TPUs:

I had already started drafting a response to myself to improve those criteria. Paraphrasing from 2.5 Pro:

> Don't rank by TOPS/wHr alone; rank by TOPS/wHr @ [Specific Precision]. Don't rank by Memory Bandwidth alone; rank by Effective Bandwidth @ [Specific Precision].

Hardware Rank criteria for LLM hosting costs:

Criterion 1: EGB (Effective Generative Bandwidth) = Memory Bandwidth (GB/s) / Precision (Bytes)

Criterion 2: GE (Generative Efficiency) = EGB / Total Board Power (Watts)

Criterion 3: TTFT Potential = Raw TOPS @ Prompt Precision

LLM hosting metrics: Tokens Per Second (TPS) for throughput, Time to First Token (TTFT) for latency, and Tokens Per Joule for efficiency.
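As arithmetic, the criteria above for a hypothetical accelerator; every number below is invented for illustration:

```python
# Hypothetical accelerator spec -- all numbers invented for illustration.
mem_bandwidth_gbs = 3000.0   # GB/s
precision_bytes = 0.5        # e.g. 4-bit weights (MXFP4)
board_power_w = 700.0        # Watts, total board power

# Criterion 1: Effective Generative Bandwidth. Decode is usually
# memory-bound, so this approximates max weight-params streamed per second.
egb = mem_bandwidth_gbs / precision_bytes          # Gparams/s

# Criterion 2: Generative Efficiency = EGB per Watt.
ge = egb / board_power_w                           # Gparams/s/W

# Hosting efficiency metric: tokens per joule, from a measured throughput.
tokens_per_s = 250.0                               # measured TPS (invented)
tokens_per_joule = tokens_per_s / board_power_w

print(egb, round(ge, 3), round(tokens_per_joule, 4))
```

Note how the same board scores twice as high on EGB at 4-bit precision as at 8-bit, which is exactly why the criteria pin each ranking to a specific precision.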

> There are multiple TPU vendors

There are not - TPU is literally a Google trademark:

> Tensor Processing Unit (TPU) is an AI accelerator application-specific integrated circuit (ASIC) developed by Google.

https://en.wikipedia.org/wiki/Tensor_Processing_Unit

The rest of what you're talking about is irrelevant

There are some not-so-niche communities, like the FlashAttention and LinearFlashAttention repos. New code/optimizations get committed on a weekly basis. They find a couple of percent here and there all the time. How useful their kernels actually are in terms of producing good results remains to be seen, but their implementations are often much better (in FLOPS) than what was proposed in the original papers.

It's just like game optimization: cache-friendliness and memory-hierarchy awareness are huge in attention mechanisms. But programming the backward pass in these lower-level stacks is definitely not fun; tensor calculus breaks my brain.
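A back-of-the-envelope sketch of why the memory hierarchy dominates: naive attention writes and re-reads the N×N score matrix in HBM, while FlashAttention-style tiling keeps the scores in on-chip SRAM. The traffic model below is my own simplification (one head, forward pass only, fp16, ignores K/V re-reads across tiles):

```python
# Rough HBM traffic for one attention head at fp16 (2 bytes/element).
def naive_traffic_bytes(n, d, bytes_per=2):
    qkvo = 4 * n * d * bytes_per      # read Q, K, V; write O
    scores = 2 * n * n * bytes_per    # write, then re-read, the N x N scores
    return qkvo + scores

def tiled_traffic_bytes(n, d, bytes_per=2):
    return 4 * n * d * bytes_per      # scores never leave SRAM (simplified)

n, d = 8192, 128  # sequence length, head dimension
print(naive_traffic_bytes(n, d) / tiled_traffic_bytes(n, d))  # 33.0
```

The ratio grows as 1 + N/(2d), so at long sequence lengths the naive kernel's runtime is set almost entirely by HBM traffic, not FLOPS.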

a recent wave of interest in bitwise-equivalent execution has had a lot of kernels at this level pumped out.

new attention mechanisms also often need new kernels to run at any reasonable rate

there's definitely a breed of frontend-only ML dev that dominates the space, but a lot of novel exploration needs new kernels