Heh, I made something very similar for the Qwen3 models a while back. It only runs Qwen3, supports only some quants, loads from GGUF, and has inference optimized by Claude (in a loop). The whole thing is compact (just a couple of files) and easy to reason about. I made it for my students so they could tinker with it and learn (add different decoding strategies, add abliteration, etc.). Popular frameworks are large, complex, and harder to hack on, while educational projects usually focus on something outdated like GPT-2.

Even though the project was meant to be educational, it gave me an idea I can't get out of my head: what if we started building ultra-optimized inference engines tailored to an exact GPU+model combination? GPUs are expensive and getting harder to come by each day. If you remove enough abstractions and code directly against the exact hardware/model, you can probably optimize things quite a lot (I hope). Maybe run an agent that tries to optimize inference in a loop (like autoresearch), empirically testing speed/quality.
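A minimal sketch of the optimize-in-a-loop idea, assuming you already have a speed harness and a quality harness (all the knobs and measurement functions here are hypothetical stand-ins):

```python
import itertools

# Hypothetical knobs an agent (or a plain grid search) could sweep
# for one specific GPU + model pair.
SEARCH_SPACE = {
    "block_size": [64, 128, 256],
    "num_warps": [4, 8],
    "kv_cache_dtype": ["fp16", "fp8"],
}

MIN_QUALITY = 0.95  # hypothetical floor so "faster" never silently means "worse"

def run_benchmark(config):
    # Stand-in: build the engine with `config`, generate N tokens, return tokens/sec.
    return float(config["block_size"])

def run_quality_eval(config):
    # Stand-in: perplexity or task score on a small eval set, normalized to [0, 1].
    return 1.0

best = None
for values in itertools.product(*SEARCH_SPACE.values()):
    config = dict(zip(SEARCH_SPACE.keys(), values))
    if run_quality_eval(config) < MIN_QUALITY:
        continue  # reject configs that trade away too much accuracy
    tps = run_benchmark(config)
    if best is None or tps > best[0]:
        best = (tps, config)

print("best config for this GPU+model:", best)
```

The interesting part isn't the loop itself, it's letting an agent propose the next config (or a whole kernel rewrite) instead of walking a fixed grid.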

The only problem with this is that once a model becomes outdated, you have to do it all again from scratch.

Ultra-optimized, HW-specific engines are what the Mojo language seems to be targeting, but I rarely hear about it here.

> what if we started building ultra-optimized inference engines tailored to an exact GPU+model combination?

The inference engines in use already include different backend building blocks optimized for different hardware.

While there are places where you can pick up some low-hanging fruit on less popular platforms, there isn't much room left to squeeze out significantly better performance with super-optimized model runners for specific GPU families. The core computations are already done by highly optimized kernels for each GPU.

There are forks of llama.cpp that are better optimized for particular CPU architectures, but (barring maintainer disagreements) a better use of time is to merge those improvements upstream rather than to build super-specific model+GPU runners.

DeepSeek's custom PTX code has previously outperformed CUDA code running on Nvidia H800 GPUs.

> DeepSeek made quite a splash in the AI industry by training its Mixture-of-Experts (MoE) language model with 671 billion parameters using a cluster featuring 2,048 Nvidia H800 GPUs in about two months, showing 10X higher efficiency than AI industry leaders like Meta. The breakthrough was achieved by implementing tons of fine-grained optimizations and usage of Nvidia's assembly-like PTX (Parallel Thread Execution) programming instead of Nvidia's CUDA for some functions,

https://www.tomshardware.com/tech-industry/artificial-intell...

Custom code targeting one specific hardware implementation can improve performance quite a bit.

When you support multiple backends, you end up having to abstract over them. Each backend may implement the abstraction to the best of its capability, but you still have to deal with the abstraction sitting between your workload and its compute. Wouldn't it be nice if you didn't need that abstraction? That's what GP is talking about, I'm sure: optimizing the workload directly for the hardware, rather than merely the workload and the backend for the abstraction.

This takes me to the famous high-throughput FizzBuzz code-golf answer [1]. If we could apply optimizations like that to inference, maybe we could increase speeds 10x or more.

[1] https://codegolf.stackexchange.com/questions/215216/high-thr...

I love scrolling and reading through this, thinking: yeah, of course Python is slower than Java; oh wow, Rust is pretty much on par, I wonder what the Java devs did. Then you hit the asm answer and your jaw drops.

Check out cpp at 208.3 GiB/s, 3x faster than asm.

Yeah, because (and here's the trick) they are clever and do less work.

Optimizing things usually means "think of a way to do the same thing with less effort".

Hire the laziest programmer :)

I'll add to this: What if chips were designed for the model? What would happen if we moved from digital to analog (vectors represented not as bits but as voltages)? Could the compute-heavy matrix multiplications be done via op-amps? And could this analog approach be far more efficient than what bit representation allows?
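To make the op-amp part concrete: the classic inverting summing amplifier already computes a weighted sum, V_out = -R_f * (V_1/R_1 + V_2/R_2 + ... + V_n/R_n), i.e. a dot product where the conductances 1/R_i act as the weights. One amplifier per output row gives you an analog matrix-vector multiply, in principle (resistor values here are purely illustrative).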

There is https://taalas.com/. Their chips are all digital, though. The weights are written to silicon.

I've built something like this. One issue is that LLMs are actually terrible at writing good shaders. I've spent way too much time trying to get them not to be so awful at it.

I tried getting every SOTA LLM I could (GPT 5, Opus 4.6, Deepseek V4 pro, glm-5) to write a Metal 4 shader for a bottle USDZ and none of them got it right. They screwed up the normals and textures, total mess. I tried to do it in Metal 3 and it was still crappy.

Just curious if you've tried GPT 5.5 Pro?

Another suggestion for optimizing local inference - the Hermes team talks a lot on X about how much better results are when you use custom parsers tuned to the nuances of each model. Some models might like to use a trailing `,` in JSON output, some don't - so if your parser can handle the quirks of the specific model, then you get higher-performing functionality.
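As a small illustration of the trailing-comma case (this is my own hypothetical helper, not anything the Hermes team ships):

```python
import json
import re

def parse_model_json(text: str):
    """Parse tool-call JSON from a model, tolerating a trailing-comma quirk."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        # Strip trailing commas before a closing brace/bracket, then retry.
        # (Naive: would also touch commas inside strings; fine for a sketch.)
        cleaned = re.sub(r",\s*([}\]])", r"\1", text)
        return json.loads(cleaned)

print(parse_model_json('{"tool": "search", "args": {"q": "llamas",},}'))
```

The real win they describe is knowing which quirks your specific model has and handling exactly those, instead of using a generic strict parser that rejects otherwise-good outputs.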

What if PyTorch were extended to have a pluggable compiler? For M GPU types and N models, if the backend allows it, run a specialized compiler for each combination?
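torch.compile already accepts a custom backend callable that receives the captured FX graph, so that seems like the natural hook. A minimal sketch (the backend below does nothing clever, it just marks where a model+GPU-specific compiler would slot in):

```python
import torch

def specialized_backend(gm: torch.fx.GraphModule, example_inputs):
    # A real backend would rewrite the graph, fuse ops, and pick kernels
    # tuned for one specific GPU + model pair. Here we just run it as-is.
    print(f"captured FX graph with {len(list(gm.graph.nodes))} nodes")
    return gm.forward  # must return a callable

model = torch.nn.Sequential(torch.nn.Linear(16, 16), torch.nn.ReLU())
compiled = torch.compile(model, backend=specialized_backend)
compiled(torch.randn(2, 16))
```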

This feels closer to ATLAS/FFTW than to a model runner: the generated kernel ages out; the tuning harness is the bit you actually want to keep.