>Misaligned loads and stores are Zicclsm

Nope. See https://github.com/llvm/llvm-project/issues/110454 which was linked in the first issue. The spec authors have managed to made a mess even here.

Now they want to introduce yet another (sic!) extension Oilsm... It maaaaaay become part of RVA30, so in the best case scenario it will be decades before we will be able to rely on it widely (especially considering that RVA23 is likely to become heavily entrenched as "the default").

IMO the spec authors should've mandated that the base load/store instructions work only with aligned pointers and introduced misaligned instructions in a separate early extension. (After all, passing a misaligned pointer where your code does not expect it is a correctness issue.) But I would've been fine as well if they mandated that misaligned pointers should be always accepted. Instead we have to deal the terrible middle ground.

>atomic memory operations are made mandatory in Ziccamoa

In other words, forget about potential performance advantages of load-link/store-conditional instructions. `compare_exchange` and `compare_exchange_weak` will always compile into the same instructions.

And I guess you are fine with the page size part. I know there are huge-page-like proposals, but they do not resolve the fundamental issue.

I have other minor performance-related nits such `seed` CSR being allowed to produce poor quality entropy which means that we have bring a whole CSPRNG if we want to generate a cryptographic key or nonce on a low-powered micro-controller.

By no means I consider myself a RISC-V expert, if anything my familiarity with the ISA as a systems language programmer is quite shallow, but the number of accumulated disappointments even from such shallow familiarity has cooled my enthusiasm for RISC-V quite significantly.

RISC-V truly is the RyanAir of processors: Oh, you want FP maths? That's an optional extra, did you check that when you booked? And was that single or double-precision, all optional extras at an extra charge. Atomic instructions, that's an extra too, have your credit card details handy. Multiply and divide? Yeah, extras. Now, let me tell you about our high-end customer options, packed SIMD and user-level interrupts, only for business class users. And then there's our first-class benefits, hypervisor extensions for big spenders, and even more, all optional extras.

So it's modular. This is normally considered a good thing. It means you don't have to pay for features you don't need.

The ISA is open so there's no greedy corporation trying to upsell you. I mean there's an implementation and die area cost for each extension but it's not being set at an artificial level by a monopolist.

It's a good thing in many cases but not if you're going to be running applications distributed as binaries. Maybe if we go the Gentoo route of everybody always recompiling everything for their own system?

Then you stick to RVA23, which is comparable to ARMv9 and x86-64v4.

But that means a port of Linux can’t be to RISC-V, it has to be to a specific implementation of RISC-V, or if sufficient (which seems still debatable) to a specific common RISC-V profile.

You can target the minimum instruction set and it'll run everywhere. Albeit very slowly. Perhaps you use a fat binary to get reasonable performance in most cases.

This isn't easy but it can be done (and it is being done on x86, despite constantly evolving variations of AVX).

Then x86_64 is the cable television service of processors. "Oh, you want channel 5? Then you have to buy this bundle with 40 other channels you will never watch, including 7 channels in languages you do not speak."

>Multiply and divide

And where it actually mattered they did not introduce a separate extension. Integer division is significantly more complex than multiplication, so it may make sense for low-end microcontrollers to implement in hardware only the latter.

There is Zmmul for multiplication-but-not-divide.

Yes, adding instructions to your ISA has a cost

I think having separate unaligned load/store instructions would be a much worse design, not least because they use a lot of the opcode space. I don't understand why you don't just have an option to not generate misaligned loads for people that happen to be running on CPUs where it's really slow. You don't need to wait for a profile for that.

As for `seed`, if you're running on a microcontroller you can just look up the data sheet to see if it's seed entropy is sufficient. By the time you get to CPUs where portable code is important a CSPRNG is probably fine.

I agree about page size though. Svnapot seems overly complicated and gives only a fraction of the advantages of actually bigger pages.

>As for `seed`, if you're running on a microcontroller you can just look up the data sheet to see if it's seed entropy is sufficient.

It's a terrible attitude to have towards programmers, but looking at misaligned ops, I guess we can see a pattern from RISC-V authors here.

Most programmers do not target a concrete microcontroller and develop every line of code from scratch. They either develop portable libraries (e.g. https://docs.rs/getrandom) or build their projects using those libraries.

The whole raison d'être of an ISA is to provide a portable contract between hardware vendors and programmers . RISC-V authors shirk this responsibility with "just look at your micro specs, lol" attitude.

The option to generate or not generate misaligned loads/stores does exist (-mno-strict-align / -mstrict-align). But of course that's a compile-time option, and of course the preferred state would be to have use of them on by default, but RVA23 doesn't sufficiently guarantee/encourage them not being unreasonably-slow, leaving native misaligned loads/stores still effectively-unusable (and off by default on clang/gcc on -march=rva23u64).

aka, Zicclsm / RVA23 are entirely-useless as far as actually getting to make use of native misaligned loads/stores goes.

The cursed thing is that RVA23 does basically guarantees that `vle8.v` + `vmv.x.s` on misaligned addresses is fast.

Yeah, that is quite funky; and indeed gcc does that. Relatedly, super-annoying is that `vle64.v` & co could then also make use of that same hardware, but that's not guaranteed. (I suppose there could be awful hardware that does vle8.v via single-byte loads, which wouldn't translate to vle64.v?)

> RVA23 doesn't guatantee them not being unreasonably-slow

Right but it doesn't guarantee that anything is unreasonably slow does it? I am free to make an RVA23 compliant CPU with a div instruction that takes 10k cycles. Does that mean LLVM won't output div? At some point you're left with either -mcpu=<specific cpu> and falling back to reasonable assumptions about the actual hardware landscape.

Do ARM or x86 make any guarantees about the performance of misaligned loads/stores? I couldn't find anything.

I don't think x86/ARM particularly guarantee fastness, but at least they effectively encourage making use of them via their contributions to compilers that do. They also don't really need to given that they mostly control who can make hardware anyway. (at the very least, if general-purpose HW with horribly-slow misaligned loads/stores came out from them, people would laugh at it, and assume/hope that that's because of some silicon defect requiring chicken-bit-ing it off, instead of just not bothering to implement it)

Indeed one can make any instruction take basically-forever, but I think it's a fairly reasonable expectation that all supported hardware instructions/behaviors (at least non-deprecated ones) are not slower than a software implementation (on at least some inputs), else having said instruction is strictly-redundant.

And if any significant general-purpose hardware actually did a 10k-cycle div around the time the respective compiler defaults were decided, I think there's a good chance that software would have defaulted to calling division through a function such that an implementation can be picked depending on the running hardware. (let's ignore whether 10k-cycle-division and general-purpose-hardware would ever go together... but misaligned-mem-ops+general-purpose-hardware definitely do)

> if general-purpose HW with horribly-slow misaligned loads/stores came out from them

How is that different for RISC-V?

> I think it's a fairly reasonable expectation that all supported hardware instructions/behaviors (at least non-deprecated ones) are not slower than a software implementation

I agree! So just use misaligned loads if Zicclsm is supported. As you observed there's a feedback loop between what compilers output and what gets optimised in hardware. Since RVA23 hardware is basically non-existent at the moment you kind of have the opportunity to dictate to hardware "LLVM will use misaligned accesses on RVA23; if you make an RVA23 chip where this is horribly slow then people will laugh at you and assume it's some sort of silicon defect".

> How is that different for RISC-V?

RISC-V hardware with slow misaligned mem ops does exist to non-insignificant extent, and it seems not enough people have laughed at them, and instead compilers did just surrender and default to not using them.

> As you observed there's a feedback loop between what compilers output and what gets optimised in hardware.

Well, that loop needs to start somewhere, and it has already started, and started wrong. I suppose we'll see what happens with real RVA23 hardware; at the very least, even if it takes a decade for most hardware to support misaligned well, software could retroactively change its defaults while still remaining technically-RVA23-compatible, so I suppose that's good.

>So just use misaligned loads if Zicclsm is supported.

LLVM and GCC developers clearly disagree with you. In other words, re-iterating the previously raised point: Zicclsm is effectively useless and we have to wait decades for hypothetical Oilsm.

Most programmers will not know that the misaligned issue even exists, even less about options like -mno-strict-align. They just will compile their project with default settings and blame RISC-V for being slow.

RISC-V could've easily avoided all this mess by properly mandating misaligned pointer handling as part of the I extension.

Well, we don't necessarily have to wait for Oilsm; software that wants to could just choose to be opinionated and run massively-worse on suboptimal hardware. And, of course, once Oilsm hardware becomes the standard, it'd be fine to recompile RVA23-targeting software to it too.

> RISC-V could've easily avoided all this mess by properly mandating misaligned pointer handling as part of the I extension.

Rather hard to mandate performance by an open ISA. Especially considering that there could actually be scenarios where it may be necessary to chicken-bit it off; and of course the fact that there's already some questionability on ops crossing pages, where even ARM/x86 are very slow.

I am not saying that RISC-V should mandate performance. If anything, we wouldn't had the problem with Zicclsm if they did not bother with the stupid performance note.

I would be fine with any of the following 3 approaches:

1) Mandate that store/loads do not support misaligned pointers and introduce separate misaligned instructions (good for correctness, so its my personal preference).

2) Mandate that store/loads always support misaligned pointers.

3) Mandate that store/loads do not support misaligned pointers unless Zicclsm/Oilsm/whatever is available.

If hardware wants to implement a slow handling of misaligned pointers for some reason, it's squarely responsibility of the hardware's vendor. And everyone would know whom to blame for poor performance on some workloads.

We are effectively going to end up with 3, but many years later and with a lot of additional unnecessary mess associated with it. Arguably, this issue should've been long sorted out in the age of ratification of the I extension.

2 is basically infeasible with RISC-V being intended for a wide range of use-cases. 1 might be ok but introduces a bunch of opcode space waste.

Indeed extremely sad that Zicclsm wasn't a thing in the spec, from the very start (never mind that even now it only lives in the profiles spec); going through the git history, seems that the text around misaligned handling optionality goes all the way back to the very start of the riscv/riscv-isa-manual repo, before `Z*` extensions existed at all.

More broadly, it's rather sad that there aren't similar extensions for other forms of optional behavior (thing that was recently brought up is RVV vsetvli with e.g. `e64,mf2`, useful for massive-VLEN>DLEN hardware).

>1 might be ok but introduces a bunch of opcode space waste.

I wouldn't call it "waste". Moreover, it's fine for misaligned instructions to use a wider encoding or be less rich than their aligned counterparts. For example, they may not have the immediate offset or have a shorter one. One fun potential possibility is to encode the misaligned variant into aligned instructions using the immediate offset with all bits set to one, as a side effect it also would make the offset fully symmetric.

Of course that'd result in entirely-avoidable slowdown for the potentially-misaligned ops. Perhaps fine for a program that doesn't use them frequently, but quite bad for ones that need misaligned ops everywhere.

In terms of correctness, there's also the possibility of partially-misaligned ops (e.g. an 8B load with 4B alignment, loading two adjacent int32_t fields) so you're not handling everything with correct faults anyways.

Exactly, I 100% agree, and IMO toolchains should default to assuming fast misaligned load/store for RISC-V.

However, the spec has the explicit note:

> Even though mandated, misaligned loads and stores might execute extremely slowly. Standard software distributions should assume their existence only for correctness, not for performance.

Which was a mistake. As you said any instruction could be arbitrarily slow, and in other aspects where performance recommendations could actually be useful RVI usually says "we can't mandate implementation".

RISC-V is not particularly good at using opcode space, unfortunately.

I don't think it's too bad. The compressed extension was arguably a mistake (and shouldn't be in RVA23 IMO), but apart from that there aren't any major blunders. You're probably thinking about how JAL(R) basically always uses x1/x5 (or whatever it is), but I don't think that's a huge deal.

About 1/3 of the opcode space is used currently so there's a decent amount of space left.