> RISC-V will get there, eventually.
Not trolling: I legitimately don't see why this is assumed to be true. It is one of those things that is true only once it has been achieved. Otherwise we would be able to create super high performance Sparc or SuperH processors, and we don't.
As you note, Arm once was fast, then slow, then fast. RISC-V has never actually been fast. It has enabled surprisingly good implementations by small numbers of people, but competing at the high end (mobile, desktop or server) it is not.
I think the bigger question is does RISC-V need to be fast? Who wants to make it fast?
I'm a chip designer and I see people using RISC-V as small processor cores for things like PCIE link training or various bookkeeping tasks. These don't need to be fast, they need to be small and low power which means they will be relatively slow.
Most people on tech review sites only care about desktop / laptop / server performance. They may know about some of the ARM Cortex A series CPUs that have MMUs and can run desktop or smartphone Linux versions.
They generally don't care about the ARM Cortex M or R versions for embedded and real time use. Those are the areas where you don't need high performance and where RISC-V is already replacing ARM.
EDIT:
I'll add that there are companies that COULD make a fast RISC-V implementation.
Intel, AMD, Apple, Qualcomm, or Nvidia could redirect their existing teams to design a high performance RISC-V CPU. But why should they? They are heavily invested in their existing x86 and ARM CPU lines. Amazon and Google are using licensed ARM cores in their server CPUs.
What is the incentive for any of them to make a high performance RISC-V CPU? The only reason I can think of is that Softbank keeps raising ARM licensing costs and it gets high enough that it is more profitable to hire a team and design your own RISC-V CPU.
Of your list, Qualcomm and Nvidia are fairly likely to make high-performance RISC-V CPUs. Qualcomm because Arm sued them to try to stop them from designing their own Arm chips without paying a lot more money, and Nvidia because they already have a lot of teams making RISC-V chips, so it seems likely that they will try to unify on the one that doesn't require licensing.
Yeah, they could but then what is the market? Qualcomm wants to sell smartphone chips and Android can run on RISC-V and most Android Java apps could in theory run.
But if you look at the Intel x86 smartphone chips from about 10 years ago they had to make an ARM to x86 emulator because even the Java apps contained native ARM instructions for performance reasons.
Qualcomm is trying to push their ARM Snapdragon chips in Windows laptops but I don't think they are selling well.
Nvidia could also make RISC-V based chips but where would they go? Nvidia is moving further away from the consumer space to the data center space. So even if Nvidia made a really fast RISC-V CPU it would probably be for the server / data center market and they may not even sell it to ordinary consumers.
Or if they did it could be like the Ampere ARM chips for servers. Yeah you can buy one as an ordinary consumer but they were in the $4,000 range last time I looked. How many people are going to buy that?
> Qualcomm is trying to push their ARM Snapdragon chips in Windows laptops but I don't think they are selling well.
That definitely seems to be the case. I think they likely would have more luck with RISC-V phones (much less app brand loyalty) or servers (Arm in the server has done a lot better than on Windows).
For Nvidia, if they made a consumer RISC-V CPU it would be a gaming handheld/console (Switch 3 or similar) once the AI bubble pops. Before that, it would likely be server CPUs that cost $10k for big AI systems. Before that, I could see them expanding the role of RISC-V in their GPUs (likely not visible to users).
Many PC hardware enthusiasts say they want a RISC-V or ARM CPU but then when these systems exist they don't actually want them.
Why? Because they want something like a $300 CPU and $150 motherboard using standard DDR4/5 DIMMs that is RISC-V or ARM or something not x86 but is faster than x86. The sub $1000 systems that hardware companies make that are RISC-V or ARM chips are low end embedded single board systems that are too slow for these people. The really fast systems are $4000 server level chips that they can't afford. The only company really bringing fast non-x86 CPUs with consumer level pricing is Apple. We can also include Qualcomm but I'm skeptical of the software infrastructure and compatibility since they are relying on x86 emulation for windows.
China is likely where it would come from - ARM and x86 are owned by Western companies.
> I think the bigger question is does RISC-V need to be fast? Who wants to make it fast?
Honestly, the initial reaction is it sounds like cope, and I know this because I've been saying it for ages to angry reactions. RISC-V looks for all the world like it is designed for competing with the 32 bit Arm ecosystem but that the designers didn't, and still don't, understand what 64 bit Arm is about.
Secondly, it's been necessary to claim such things are forever on the way in order to maintain hype and get software support. Without it you wouldn't see nearly so much Linux buildchain work. (See the open source SuperH implementations for what happens if you admit you don't go for high performance).
Finally though, as process nodes get smaller you can afford to put much more complex blocks in the same area, which can then burst through a series of operations and power off again, many times a second. (Edit to add: of course you know that, but it's still counter intuitive the extent to which it changes things over time. People have things like floating point support in places that not too long ago would have been completely minimalist, and there are some really extreme examples around).
> I'll add that there are companies that COULD make a fast RISC-V implementation.
Again, there is no proof of this until it actually happens. When Qualcomm were trying they wanted to change the spec of RISC-V, and I strongly suspect that is actually necessary.
RISC-V doesn't have the pitfalls of Sparc (register windows, branch delay slots), largely because we learned from that. It's in fact a very "boring" architecture. There's no one that expects it'll be hard to optimize for. There are at least 2 designs that have taped out in small runs and have high end performance.
RISC-V does not have the pitfalls of experimental ISAs from 45 years ago, but it has other pitfalls that have not existed in almost any ISA since the first vacuum-tube computers, like the lack of means for integer overflow detection and the lack of indexed addressing.
Especially the lack of integer overflow detection is a choice of great stupidity, for which there exists no excuse.
Detecting integer overflow in hardware is extremely cheap, its cost is absolutely negligible. On the other hand, detecting integer overflow in software is extremely expensive, increasing both the program size and the execution time considerably, because each arithmetic operation must be replaced by multiple operations.
Because of the unacceptable cost, normal RISC-V programs choose to ignore the risk of overflows, which makes them unreliable.
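To make the software cost concrete, here is a minimal portable C sketch of a checked signed add, assuming no compiler builtins or trapping flags are used. The name `checked_add_i64` is hypothetical. At the source level one add turns into up to four comparisons plus the add; what a compiler actually emits for a given ISA will differ.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: returns true if a + b would overflow int64_t.
 * On success (return value false) the sum is stored in *sum.
 * The check is done before the add, so no signed overflow ever occurs. */
bool checked_add_i64(int64_t a, int64_t b, int64_t *sum)
{
    if ((b > 0 && a > INT64_MAX - b) ||   /* would exceed the maximum */
        (b < 0 && a < INT64_MIN - b))     /* would fall below the minimum */
        return true;
    *sum = a + b;
    return false;
}
```

A caller would branch on the return value after every such operation, which is where the extra code size comes from.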
The highest performance implementations of RISC-V from previous years were forced to introduce custom extensions for indexed addressing, but those used inefficient encodings, because something like indexed addressing must be in the base ISA, not in an extension.
OK, look.
Since my previous attempt to measure the impact of trap on signed overflow didn't seem to have moved your position one bit, I thought I'd give it a go in the most representative way I could think of:
I built the same version of clang on an x86, an aarch64 and a RISC-V system using clang. Then I built another version with the `-ftrapv` flag enabled and compared the compile times of compiling programs using these clang builds running on real hardware:
As you can see, once again the overhead of -ftrapv is quite low. Surprisingly, the -ftrapv overhead seems the highest on the Cortex-A78. My guess is that this is because clang generates a separate brk with a unique immediate for every overflow check, while on RISC-V it always branches to one unimp per function.
Please tell me if you have a better suggestion for measuring the real world impact.
Or heck, give me some artificial worst case code. That would also be an interesting data point.
Notes:
* The format is mean±variance
* Spacemit X100 is a Cortex-A76 like OoO RISC-V core and A100 an in-order RISC-V core.
* I tried to clock all of the cores to the same frequency of about 2.2GHz, except for the A55, which ran at 1.8GHz, but I linearly scaled the results.
* Program A was the chibicc (8K loc) compiler and program B microjs (30K loc).
> On the other hand, detecting integer overflow in software is extremely expensive, increasing both the program size and the execution time considerably,
Most languages don't care about integer overflow. Your typical C program will happily wrap around.
If I really want to detect overflow, I can do this:
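As a minimal sketch in C rather than RISC-V assembly, and assuming the check meant here is the unsigned add-then-compare idiom, it looks roughly like this. The name `add_u64_wraps` is made up for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: stores a + b (mod 2^64) in *sum and
 * returns true if the unsigned addition wrapped around. */
bool add_u64_wraps(uint64_t a, uint64_t b, uint64_t *sum)
{
    *sum = a + b;      /* unsigned arithmetic wraps by definition in C */
    return *sum < a;   /* wrapped if and only if the result is below an operand */
}
```

Note that this covers unsigned wraparound only; it says nothing about signed overflow.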
Which is one more instruction, which is not great, not terrible.
Because the other commenter wasn’t posting the actual answer, I went to find the documentation about checking for integer overflow and it’s right here https://docs.riscv.org/reference/isa/unpriv/rv32.html#2-1-4-...
And what did I find? Yep that code is right from the manual for unsigned integer overflow.
For signed addition if you know one of the signs (eg it’s a compile time constant) the manual says
But the general case for signed addition if you need to check for overflow and don’t have knowledge of the signs
From what I’ve read most native compiled code doesn’t really check for overflows in optimised builds, but this is more of an issue for JavaScript et al where they may detect the overflow and switch the underlying type? I’m definitely no expert on this.
A bit more reading shows there's a three instruction general case version for 32-bit additions on the 64-bit RISC-V ISA. I'm not familiar with RISC-V assembly and they didn't provide an example, but I _think_ it's as easy as this since 64-bit add wouldn't match the 32-bit overflowed add.
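A hedged C sketch of that widening idea, not the manual's assembly: do the add at 64-bit width, where two 32-bit operands cannot overflow, and flag the cases where the wide result no longer fits in 32 bits. The name `add_i32_overflows` is invented for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: returns true if a + b does not fit in int32_t.
 * On success (return value false) the sum is stored in *sum. */
bool add_i32_overflows(int32_t a, int32_t b, int32_t *sum)
{
    int64_t wide = (int64_t)a + (int64_t)b;   /* cannot overflow in 64 bits */
    if (wide < INT32_MIN || wide > INT32_MAX)
        return true;                          /* 32-bit result would differ */
    *sum = (int32_t)wide;                     /* value is in range, so exact */
    return false;
}
```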
Contrast with x86:
Neither x86-64 nor RISC-V is implemented by running each single instruction. They both recognize patterns in the code and translate those into micro-ops. On high performance chips like Rivos's (now Meta's) I doubt there'd be any difference in the amount of work done.
Code size is a benefit for x86-64 however - no one is arguing that - but you have to trade that against the difficulty of instruction decoding.
That is not the correct way to test for integer overflow.
The correct sequence of instructions is given in the RISC-V documentation and it needs more instructions.
"Integer overflow" means "overflow in operations with signed integers". It does not mean "overflow in operations with non-negative integers". The latter is normally referred to as "carry".
The 2 instructions given above detect carry, not overflow.
Carry is needed for multi-word operations, and these are also painful on RISC-V, but overflow detection is required much more frequently, i.e. it is needed at any arithmetic operation, unless it can be proven by static program analysis that overflow is impossible at that operation.
It's one more instruction only if you don't fuse those instructions in the decoder stage, but as the pattern is the one expected to be generated by compilers, implementations that care about performance are expected to fuse them.
I have no idea or practical experience with anything this low-level, so idk how much the following matters, it's just someone from the crowd offering unvarnished impressions:
It's easy to believe you're replying to something that has an element of hyperbole.
It's hard to believe "just do 2x as many instructions" and "ehhh who cares [i.e. your typical C program doesn't check for overflow]", coupled to a seemingly self-conscious repetition of a quip from the television series Chernobyl that is meant to reference sticking your head in the sand, retire the issue from discussion.
There was no hyperbole in what I have said.
The sequence of instructions given above is incorrect, it does not detect integer overflow (i.e. signed integer overflow). It detects carry, which is something else.
The correct sequence, which can be found in the official RISC-V documentation, requires more instructions.
Not checking for overflow in C programs is a serious mistake. All decent C compilers have compilation options for enabling checking for overflow. Such options should always be used, with the exception of the functions that have been analyzed carefully by the programmer and the conclusion has been that integer overflow cannot happen.
For example with operations involving counters or indices, overflow cannot normally happen, so in such places overflow checking may be disabled.
> On the other hand, detecting integer overflow in software is extremely expensive
this just isn't true. both addition and multiplication can check for overflow in <2 instructions.
Fewer than two is exactly one instruction. Which?
dammmit I meant <=2. https://godbolt.org/z/4WxeW58Pc sltu or snez for add/multiply respectively.
This result is misleading.
First, the code claims to be returning "unsigned long" from each of these functions, but the value will only ever be 0 or 1 (see [1]). The code is actually throwing away the result and just returning whether overflow occurred. If we take unsigned long *c as another argument to the function, so that we actually keep the result, we end up having to issue an extra instruction for multiplication (see [2]; I'm ignoring the sd instruction since it is simply there to dereference the *c pointer and wouldn't exist if the function got inlined).
Second, this is just unsigned overflow detection. If we do signed overflow detection, now we're up to 5 instructions for add and mul (see [3]). Considering that this is the bigger challenge, it compares quite unfavorably to architectures where this is just 2 instructions: the operation itself and a branch against a condition flag.
[1]: https://gcc.gnu.org/onlinedocs/gcc/Integer-Overflow-Builtins...
[2]: https://godbolt.org/z/7rWWv57nx
[3]: https://godbolt.org/z/PnzKaz4x5
That's fair. The good news is that for signed overflow, you can claw back to the cost of unsigned overflow if you know the sign of either argument (which is fairly common).
Yeah, it's not the end of the world, and as others mentioned, a good implementation can recognize the instruction pattern and optimize for it.
It's just a bizarre design choice. I understand wanting to get rid of condition flags, but not replacing them with nothing at all.
EDIT: It seems the same choice was made by MIPS, which is a clear inspiration for RISC-V.
The argument is that there are actually 4 distinct forms of replacement:
1. 64 bit signed math is a lot less overflow vulnerable than the 16/32 bit math that was extremely common 20 years ago
2. For the BigInt use-case, the Riscv design is pretty sensible since you want the top bits, not just presence of overflow
3. You can do integer operations on the FPU (using the inexact flag for detecting if rounding occurred).
4. Adding overflow detecting instructions can easily be done in an extension in the future if desired.
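For the BigInt case in point 2, a minimal two-limb add in C may help, assuming "top bits" is taken to include the carry out of a limb. The carry is consumed as a value in the next limb rather than as a flag to branch on. The `u128` type and `u128_add` are hypothetical names for this sketch.

```c
#include <stdint.h>

/* Hypothetical 128-bit unsigned integer built from two 64-bit limbs. */
typedef struct {
    uint64_t lo;
    uint64_t hi;
} u128;

/* Adds two 128-bit values limb by limb; the carry out of the high limb
 * is discarded, so the result is taken modulo 2^128. */
u128 u128_add(u128 a, u128 b)
{
    u128 r;
    r.lo = a.lo + b.lo;
    uint64_t carry = (r.lo < a.lo);   /* 1 if the low limb wrapped, else 0 */
    r.hi = a.hi + b.hi + carry;
    return r;
}
```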
I think in the case of MIPS, at least, the decision logic was simply: condition flags behave like an implicit register, making the use of that register explicit would complicate the instruction encoding, and that complication would be for little benefit since most compilers ignore flags anyway, except for situations which could be replaced with direct tests on the result(s).
[flagged]
+1 -- misinformation is best corrected quickly. If not, AI will propagate it and many will believe the erroneous information. I guess that would be viral hallucinations.
One can quickly correct misinformation without being rude. It's not hard, and does not lessen the impact of the correction to do so. There's no reason to tolerate the kind of rudeness the parent post exhibits.
As a counterexample, I point to another relatively boring RISC, PA-RISC. It took off not (just) because the architecture was straightforward, but because HP poured cash into making it quick, and PA-RISC continued to be a very competitive architecture until the mass insanity of Itanic arrived. I don't see RISC-V vendors making that level of investment, either because they won't (selling to cheap markets) or can't (no capacity or funding), and a cynical take would say they hide them behind NDAs so no one can look behind the curtain.
I know this is a very negative take. I don't try to hide my pro-Power ISA bias, but that doesn't mean I wouldn't like another choice. So far, however, I've been repeatedly disappointed by RISC-V. It's always "five or six years" from getting there.
I would not call PA-RISC boring. Already at launch there was no doubt that it was a better ISA than SPARC or MIPS, and later it was improved. At the time when PA-RISC 2.0 was replaced by Itanium it was not at all clear which of the 2 ISAs was better. The later failures to design high-performance Itanium CPUs make it plausible that if HP had kept PA-RISC 2.0 they might have had more competitive CPUs than with Itanium.
SPARC (derived from Berkeley RISC) and MIPS were pioneers that experimented with various features or lack of features, but they were inferior from many points of view to the earlier IBM 801.
The RISC ISAs developed later, including ARM, HP PA-RISC and IBM POWER, have avoided some of the mistakes of SPARC and MIPS, while also taking some features from IBM 801 (e.g. its addressing modes), so they were better.
ISAs fail to gain traction when the sufficiently smart compilers don't eventuate.
The x86-64 is a dog's breakfast of features. But due to its widespread use, compiler writers make the effort to create compilers that optimize for its quirks.
Itanium hardware designers were expecting the compiler writers to cater for its unique design. Intel is a semi company. As good as some of their compilers are, internally they invested more in their biggest seller and the Itanium never got the level of support that was anticipated at the outset.
I am a firm believer that if AMD hadn't been in a position to come up with the AMD64 architecture, those Itanium issues would eventually have been sorted out; Windows XP was already there and there was no other way forward for 64 bit.
It has never happened that a compiler was able to do static scheduling of general purpose instructions over the long term.
Every CPU changes the cycles it takes for many instructions, adds new instructions etc.
Out of order execution is a huge dividing line in performance for a reason. The CPU itself needs to figure these things out to minimize memory latency, cache latency, pipelining, prefetching and all that stuff.
I don't know anything about Itanium in particular, but AMD's NPU uses a VLIW architecture and they had to break backwards compatibility in the ISA for the second generation NPU (XDNA2) to get better performance.
I mean "boring" in the sense that its ISA was relatively straightforward, no performance-entangling kinks like delay slots, a good set of typical non-windowed GPRs, no wild or exotic operations. And POWER/PowerPC and PA-RISC weren't a lot later than SPARC or MIPS, either.
> RISC-V doesn't have the pitfalls of Sparc (register windows, branch delay slots),
You're saying ISA design does have implementation performance implications then? ;)
> There's no one that expects it'll be hard to optimize for
[Raises hand]
> There are at least 2 designs that have taped out in small runs and have high end performance.
Are these public?
Edit: I should add, I'm well aware of the cultural mismatch between HN and the semi industry, and have been caught in it more than a few times, but I also know the semi industry well enough to not trust anything they say. (Everything from well meaning but optimistic through to outright malicious depending on the company).
The 2 designs I'm thinking of are (tiresomely) under NDA, although I'm sure others will be able to say what they are. Last November I had a sample of one of them in my hand and played with the silicon at their labs, running a bunch of AI workloads. They didn't let me take notes or photographs.
> There's no one that expects it'll be hard to optimize for
No one who is an expert in the field, and we (at Red Hat) talk to them routinely.
Expert here, are these made for general purpose workloads or do you expect them to be fast for AI only?
I assume the TensTorrent TT-Ascalon is one of the CPU designs.
I don't think anybody suggests Oracle couldn't make faster SPARC processors, it's just that development of SPARC ended almost 10 years ago. At the time SPARC was abandoned, it was very competitive.
In single-threaded performance? That’s not how I remember it: Sun was pushing parallel throughput over everything else, with designs like the T-Series & Rock.
Perhaps not single thread, but Rock was a dead end a while before Oracle pulled the plug, and Sun/Oracle's core market of course was always servers not workstations. We used Niagara machines at my work around the T2 era, a long time ago, but they were very competitive if you could saturate the cores and had the RAM to back it up.
Sure, my work got a few of the Niagaras too and they were tremendous build machines for Solaris software.
But if you’re judging an ISA by performance scalability, you generally want to look at single-threaded performance.
Sparc stopped being competitive in the early 2000’s.
Because today, getting a fast CPU out isn't as much an engineering issue as it is about getting the investment for hiring a world-class fab.
The most promising RISC-V companies today have not set out to compete directly with Intel, AMD, Apple or Samsung, but are targeting a niche such as AI, HPC and/or high-end embedded such as automotive.
And you can bet that Qualcomm has RISC-V designs in-house, but is only making ARM chips right now because ARM is where the market for smartphone and desktop SoCs is. Once Google starts allowing RVA23 on Android / ChromeOS, the flood gates will open.
It's very much both. You need millions of dollars for the fab, but you also need ~5 years to get 3 generations of cpus out (to fix all the performance bugs you find in the first two)
Fast, RVA23-compatible microarchitectures already exist. Everything high performance seems to be based on RVA23, which is the current application profile and comparable to ARMv9 and x86-64v4.
However, it takes time from microarchitecture to chips, and from chips to products on shelves.
The very first RVA23-compatible chip to show up will likely be the spacemiT K3 SoC, due in development boards in April (i.e. next month).
More of them, and more performant ones, such as a development board with the Tenstorrent Ascalon CPU in the form of the Atlantis SoC, which was taped out recently, are coming this summer.
It is even possible such designs will show up in products aimed at the general public within the present year.