A typical CALL is a 16 bit displacement and encodes in three bytes. A RET encodes in one.

On arm64, all instructions are four bytes. The BL and RET to effect the branching are 8 bytes of instructions already. Plus non-leaf functions need to push and pop the return address via some means (which generally depends on what the surrounding code is doing, so isn't a fixed cost).

Obviously making that work requires not just the parallel dispatch for all the individual bits, but a stack engine in front of the cache that can remember what it was doing. Not free. But it's 100% a big win in cache footprint.

Yeah totally. It's really easy to forget about the fact that x86 is abstracting a lot of stack operations away from you (and obviously that's part of why it's a useful abstraction!).

> A typical CALL is a 16 bit displacement and encodes in three bytes. A RET encodes in one.

True for `ret`, but I'm not convinced it's true for `call` on typical amd64 code. The vast majority I see are 5 bytes for a regular call, with a significant number of 6 bytes, e.g. `call *0xa4b4b(%rip)`, or 7 bytes if relative to a hi register. And a few are 2 bytes if indirect via a lo register, e.g. `call *%rax`, or 3 for e.g. `call *%r8`.
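To make those byte counts concrete, here is a small sketch that counts the bytes in a few standard x86-64 encodings. The displacement bytes are zero placeholders and `enc_len` is just a hypothetical helper for this illustration, not something from a real toolchain.

```shell
# Count the bytes in a space-separated hex encoding (hypothetical helper).
enc_len() { set -- $1; echo $#; }

enc_len "e8 00 00 00 00"        # call rel32             (E8 + 4-byte displacement)
enc_len "ff 15 00 00 00 00"     # call *disp32(%rip)     (FF /2, ModRM 0x15 + disp32)
enc_len "41 ff 90 00 00 00 00"  # call *disp32(%r8)      (REX.B prefix + FF /2 + disp32)
enc_len "ff d0"                 # call *%rax             (FF /2, ModRM 0xD0)
enc_len "41 ff d0"              # call *%r8              (REX.B prefix + FF /2)
enc_len "c3"                    # ret
```

This prints 5, 6, 7, 2, 3 and 1, one per line. Disassembling a real binary with `objdump -d` would show the same lengths with actual displacements filled in.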

But mostly 5 bytes, while virtually all calls on arm64 and riscv64 are 4 bytes with an occasional call needing an extra `adrp` or `lui/auipc` to give ±2 GB range.
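As a rough illustration of why only an occasional call needs the extra instruction, the reach of a single direct call follows from the width of its signed byte offset. This is a sketch using shell arithmetic; `reach_mib` is a hypothetical helper, and the offset widths are the architectural ones for `BL`, `JAL` and `CALL rel32`.

```shell
# Reach of one direct call instruction in MiB: a signed N-bit byte offset
# covers roughly +/- 2^(N-1) bytes (hypothetical helper for illustration).
reach_mib() { echo $(( (1 << ($1 - 1)) / 1048576 )); }

reach_mib 28   # arm64 BL: 26-bit immediate scaled by 4, so a 28-bit byte offset
reach_mib 21   # riscv64 JAL: 20-bit immediate scaled by 2, so a 21-bit byte offset
reach_mib 32   # amd64 CALL rel32: 32-bit byte offset
```

This prints 128, 1 and 2048, i.e. about ±128 MiB, ±1 MiB and ±2 GiB respectively. Most calls within one binary fit in the short form, and the linker or compiler falls back to the two-instruction sequence only for targets further away.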

But in any case, it is indisputable that on average, for real-world programs, fixed-length 4 byte arm64 matches 1-15 byte variable-length amd64 in code density and both are significantly beaten by two-length (2 or 4 byte) riscv64.

All you have to do to verify this is to just pop into the same OS and version e.g. Ubuntu 24.04 LTS on each ISA in Docker and run `size` on the contents of `/bin`, `/usr/bin` etc.

> All you have to do to verify this is to just pop into the same OS and version e.g. Ubuntu 24.04 LTS on each ISA in Docker and run `size` on the contents of `/bin`, `/usr/bin` etc.

(I cheated a bit and used the total size of the binary, as binutils isn't available out of the box in the ubuntu container. But it shouldn't be too different from text+bss+data.)

$ podman run --platform=linux/amd64 ubuntu:latest ls -l /usr/bin | awk '{sum += $5} END {print sum}'

22629493

$ podman run --platform=linux/arm64 ubuntu:latest ls -l /usr/bin | awk '{sum += $5} END {print sum}'

29173962

$ podman run --platform=linux/riscv64 ubuntu:latest ls -l /usr/bin | awk '{sum += $5} END {print sum}'

22677127

One can see that amd64 and riscv64 are actually very close (within about 0.2%), with in fact a slight edge to amd64. Both are far ahead of arm64 though, which comes out about 29% larger than amd64.

> (I cheated a bit and used the total size of the binary, as binutils isn't available out of the box in the ubuntu container. But it shouldn't be too different from text+bss+data.)

Please use `size`, it does matter.

It would literally change your conclusion here. RISC-V is denser than amd64; it's not even close.
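For anyone who wants to redo the comparison, here is a minimal sketch of totalling the columns, assuming binutils has been installed in the container (e.g. via `apt-get install binutils`) and that `size` prints its default Berkeley format. `sum_size` is a hypothetical helper name, and the numbers in the demonstration are made up, not measurements.

```shell
# Sum the text/data/bss columns of Berkeley-format `size` output read on stdin.
# The header line is skipped because its first three fields are not numeric.
sum_size() {
  awk '$1 ~ /^[0-9]+$/ && $2 ~ /^[0-9]+$/ && $3 ~ /^[0-9]+$/ {
         text += $1; data += $2; bss += $3
       }
       END { printf "text=%d data=%d bss=%d\n", text, data, bss }'
}

# Typical use inside each container (errors for non-ELF files are discarded):
#   size /usr/bin/* 2>/dev/null | sum_size

# Self-contained demonstration with invented numbers:
sum_size <<'EOF'
   text	   data	    bss	    dec	    hex	filename
   1000	    200	     50	   1250	    4e2	/usr/bin/example-a
   3000	    100	     25	   3125	    c35	/usr/bin/example-b
EOF
```

The demonstration prints `text=4000 data=300 bss=75`. The `text` total is the one that speaks to code density, since whole-file sizes also include symbol tables, relocation data, section padding and anything in `/usr/bin` that is a script or symlink rather than an ELF binary.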
