> While reviewing folks' C and C++ code, I've found the following reasons for inline assembly, where 1 is most common:

3a. rdpru (similar issues to cpuid) or rdpmc perhaps surrounded with lfence or cpuid inside the same assembly chunk

For obvious reasons, this is somewhat niche and may not even make it into production code, but it’s also important when you do need it. It’s also memory safe. I guess in such cases you’d use fast C rather than Fil-C though.

4a. rseq

Probably even less feasible than atomics TBH, as such blocks will usually also contain control flow (at least that implied by to the nature of rseqs).

> Before the advent of AI, writing a parser for x86_64 assembly would have been such an annoying task that I might have never gotten around to implementing support for memory safe inline assembly [...].

It is annoying, but even before the advent of AI that didn’t stop the developers of TCC for instance.

With that said, given Fil-C is Clang/LLVM-based, shouldn’t an assembly parser, at least, be already available somewhere? I was under the impression that Clang (unlike GCC) actually parsed asm blocks.

I had yet another reason to use inline assemblies in WAH [1]. In some compilers (I think it was GCC) intrinsics are imbued with `target("foo")` attributes, which cause forced-inline via `always_inline` attribute to fail somehow. I really needed that though because I was writing a fast bytecode compiler and being unable to force-inline meant each supported SIMD instruction had to pay function call overhead (which can be significant when your bytecode is literally a single native SIMD operation!).

[1] https://github.com/lifthrasiir/wah/

I’m having vague flashbacks here so I might be guessing wrong, but: the only reason I’ve ever seen an “inlining failed” error from GCC when using intrinsics (and it was an actual error) was when it actually meant “you can’t use this intrinsic in this configuration” (the second half of the message, “target specific option mismatch”, is more helpful if still cryptic). Thus the fix was to change the argument of the -march= option or (for dynamic dispatch) decorate the caller with the correct __attribute__((target)). E.g. if you pass -march=x86-64-v1 but try to use AVX you’ll get such an inlining error. (This is unlike MSVC which will always allow you to use any intrinsic supported by the compiler.)

I think you are right in the interpretation. In my case though dynamic dispatch was already integrated with bytecode lowering so the error wasn't helpful :(

Ah, so you want the function containing the interpreter loop to be compiled for the baseline architecture but some of the bytecode implementations inside it to use more advanced intrinsics? Yeah, I don’t think GCC has a good answer to that one. It also sounds gnarly from a general calling-convention perspective—how is the VZEROUPPER on exit supposed to be emitted if you can’t count on AVX?..

We recently (finally) got __attribute__((musttail)) in GCC[1], I’ve just tried it between functions with mismatched __attribute__((target))s and it does work, so theoretically you could code your interpreter that way. But it seems like you’re still bound to keep loading and storing vector state from and to memory and VZEROUPPERing your registers after each bytecode, and that doesn’t sound like a particularly good time.

[1] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119616