> The frontier is the verifier.
Um, yes? The big value that AMD had in the x86 market over competitors was their verification model. This has been known for decades.
> 3-seed nextpnr P&R on a Gowin GW2A-LV18 (Tang Nano 20K) — median Fmax × CoreMark iter/cycle = fitness
Every single "improvement" is basically about routing around how absolutely abysmally bad the Gowin FPGAs are. Kudos to that, I guess?
Gowin FPGAs have extraordinarily bad carry chains and block-to-block routing. They are literally so bad that a 32-bit ripple carry is almost as fast as the carry-skip version, even if you route it manually. The jump prediction here is almost entirely about avoiding arithmetic altogether (which most other FPGAs would handle without a problem).
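For anyone who hasn't met the two adders, here is a minimal Python sketch of ripple carry versus carry skip. The functional models are standard; the delay functions are a crude textbook unit-delay model, and the `skip_cost` parameter is a hypothetical knob standing in for slow skip-path routing. None of it is measured Gowin timing.

```python
def ripple_add(a, b, width=32, cin=0):
    """Ripple-carry add, one full adder per bit. Returns (sum, carry_out)."""
    carry, total = cin, 0
    for i in range(width):
        x, y = (a >> i) & 1, (b >> i) & 1
        total |= (x ^ y ^ carry) << i
        carry = (x & y) | (carry & (x ^ y))
    return total, carry


def carry_skip_add(a, b, width=32, block=4, cin=0):
    """Carry-skip add. Each block ripples internally, but when every bit in
    a block propagates, the block's carry-in bypasses straight to carry-out.
    Assumes width is a multiple of block."""
    carry, total = cin, 0
    mask = (1 << block) - 1
    for lo in range(0, width, block):
        xa, xb = (a >> lo) & mask, (b >> lo) & mask
        block_sum, ripple_cout = ripple_add(xa, xb, block, carry)
        total |= block_sum << lo
        all_propagate = (xa ^ xb) == mask
        carry = carry if all_propagate else ripple_cout
    return total, carry


def ripple_delay(width):
    """Worst case: the carry crosses every bit (one unit per bit)."""
    return width


def carry_skip_delay(width, block, skip_cost=1):
    """Worst case: ripple through the first block, skip the middle blocks,
    ripple through the last block. skip_cost is hypothetical: the price of
    one skip mux plus its routing, in the same units as one ripple stage."""
    n_blocks = width // block
    return 2 * block + (n_blocks - 2) * skip_cost
```

With an ideal skip path, `carry_skip_delay(32, 4)` is well below `ripple_delay(32)`. Push `skip_cost` up to around 4 units per skip and the two come out equal, which is the situation described above: the skip logic is functionally fine, but the routing eats the win.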
Memory accesses are super slow and locked to clock edges rather than being level sensitive (which is why ID/RF and WB take entire cycles and no optimization could change that). The additions are all routing around that (note the immutability of the ID and WB phases).
To top it off, the 5-stage pipeline is an annoying quirk of the RISC-V architecture having an immediate offset on its load instruction. If the RISC-V load mandated 0 as the offset, the MEM read phase could overlap the EX phase, since no ALU would be necessary (store doesn't care, because the result goes to memory rather than back to the register file, so RF writeback isn't an issue). The absolutely horrific add performance of the Gowin FPGAs makes this acute.
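To spell out where that add comes from, here is a small Python sketch of the address computation for an RV32I load. The field layout is the standard I-type encoding; the function names and the `regs` list are just illustrative.

```python
def sign_extend(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement number."""
    sign = 1 << (bits - 1)
    return (value & (sign - 1)) - (value & sign)


def decode_load(instr):
    """Split a 32-bit RV32I LOAD (opcode 0000011) into its I-type fields."""
    if instr & 0x7F != 0x03:
        raise ValueError("not a LOAD opcode")
    return {
        "rd": (instr >> 7) & 0x1F,
        "funct3": (instr >> 12) & 0x7,
        "rs1": (instr >> 15) & 0x1F,
        "imm": sign_extend(instr >> 20, 12),
    }


def load_address(instr, regs):
    """Effective address = rs1 + sign-extended imm. This add is the ALU work
    that has to finish before the memory read can start."""
    fields = decode_load(instr)
    return (regs[fields["rs1"]] + fields["imm"]) & 0xFFFFFFFF


regs = [0] * 32
regs[2] = 0x1000                  # sp
lw_a0_8_sp = 0x00812503           # lw a0, 8(sp)
address = load_address(lw_a0_8_sp, regs)
```

If the offset were architecturally fixed at 0, `load_address` would collapse to `regs[rs1]` and the register read could feed the memory port directly, with no adder in the path.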
Finally, try to put this on a board. I found that anything above about 175MHz out of nextpnr failed to execute on actual hardware (please correct me if this is no longer valid; it's been over a year since I tried nextpnr on the Sipeed Tang Primer 20K). That's right around where a 32-bit add plus some routing sits on these FPGAs. There's something a bit off in nextpnr's timing analysis code, and the AI is almost certainly optimizing into it.
That having been said: I would LOVE somebody to bounce AI off of reversing the architecture and bitstreams for the stupid-ass closed-source FPGAs. Now THAT would be a project worth throwing a couple of grad students and a bunch of subsidized AI tokens at.
Assuming that your claims about Gowin FPGA flaws are correct, isn’t the point of this experiment that it was able to exploit these flaws without manual guidance?
His claims are indeed correct, and yes, you got my point, thanks! On top of that, the loop produced architectural gains that are not exclusive to the Gowin FPGA (CoreMark/MHz is higher than VexRiscv).
Amazing comment.
As a non-hardware guy, I read, “well, duh, for a 20yr practitioner dealing with the intricacies of specific FPGA series, all this makes tons of sense”.
It only makes sense to me because I tried to implement a RISC-V on these Gowin FPGAs and banged into the limitations and can distill them down. A junior engineer looks at this post-AI, shrugs, and says "I'm done."
The AI doesn't flag "Hey, my adder sucks. Move to a better FPGA architecture." A junior engineer pre-AI would have to bang on this a while, get frustrated at the critical paths, and eventually ask for help. At which point we would both look at this, identify that the adder was doing a 32-bit ripple carry, both have a "WTF?!" moment, and switch FPGA families.
In addition, the AI also doesn't flag how close to the margin you are. To my eye, almost all the Fmax gains look like PnR (place and route) noise. The DIV/REM obviously isn't and the replay predictor looks real. To top it off, the branch predictor wins look anomalously low to my eye.
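A rough Python sketch of the sanity check I mean: compute the fitness the way the post defines it (median Fmax across seeds times CoreMark iterations per cycle) and compare the Fmax gain against the seed-to-seed spread. The "gain must exceed the spread" rule is my own crude heuristic, not something from the post, and the numbers are made up for illustration.

```python
from statistics import median


def fitness(fmax_mhz_by_seed, coremark_iter_per_cycle):
    """Median Fmax over the P&R seeds times CoreMark iterations per cycle."""
    return median(fmax_mhz_by_seed) * coremark_iter_per_cycle


def fmax_gain_exceeds_seed_spread(baseline_fmax, candidate_fmax):
    """Crude noise check: is the median Fmax gain larger than the worst
    seed-to-seed spread seen in either run?"""
    spread = max(
        max(baseline_fmax) - min(baseline_fmax),
        max(candidate_fmax) - min(candidate_fmax),
    )
    gain = median(candidate_fmax) - median(baseline_fmax)
    return gain > spread


# Hypothetical 3-seed Fmax results in MHz, not taken from the post.
baseline = [118.2, 121.0, 119.5]
candidate = [120.1, 123.4, 121.8]
looks_real = fmax_gain_exceeds_seed_spread(baseline, candidate)
```

In this made-up case the median moves by less than the spread between seeds, so `looks_real` comes out false. Three seeds is a thin sample either way; more seeds would give a better picture of the noise floor.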
This is what a bunch of us are yelling about with AI. AI gets you a thing. AI gets you no insight into that thing. And because the juniors will use the AI, they will never learn the insight.
Side note: The granularity of the CM/MHz numbers looks a bit suspicious. Why are there identical entries?
The frontier is the verifier not just in the sense of this project, but for every project. If we have a good verifier for a task, any task, this type of loop can be applied to it. Today LLMs are good enough to tackle FPGA projects, but this type of loop will be applicable to many more things.
Board should be arriving next week. I will let you know!
> I would LOVE somebody to bounce AI off of reversing the architecture and bitstreams for the stupid-ass closed-source FPGAs.
The only reason I'm using Gowin is that it has slightly more mature open-source tooling. Maybe we can apply this loop to nextpnr as well.
Please apply this loop to nextpnr for any of the commodity Xilinx, Altera, or Lattice parts. For example, everything about Lattice has been stuck for almost a decade at this point.