The jump from change #5 to #6 (inline caches + hidden-class object model) doing the bulk of the work here really tracks with how V8/JSC got fast historically — dynamic dispatch on property access is where naive interpreters die, and everything else is kind of rounding error by comparison. Nice that it's laid out so you can see the contribution of each step in isolation; most perf writeups just show the final number.

@jiusanzhou The interesting implementation detail in change #6 is how the inline caching is done in an AST-walking interpreter specifically. In bytecode interpreters, IC rewriting is natural — the "cache site" is a stable byte offset in the bytecode stream that you can patch. Here, the cache site is an AST node, so @pizlonator uses placement new to construct a specialized AST node on top of the generic one in place (via constructCache<>). It's self-modifying code at the AST level.
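A minimal sketch of that pattern — the names (GenericGetProperty, CachedGetProperty, Shape) are illustrative, not the actual zef classes. The generic node's slow path does the full lookup, then placement-news a specialized node over its own storage, so the next visit to the same AST node takes the fast path:

```cpp
#include <cassert>
#include <new>

struct Shape { int slotOfFoo; };            // stand-in for a hidden class

struct AstNode {
    virtual ~AstNode() = default;
    virtual int evaluate(Shape* shape, int* slots) = 0;
};

// Generic node. It reserves enough trailing space that the cached
// variant can be constructed over it in place.
struct GenericGetProperty : AstNode {
    char reserved[sizeof(void*) * 2] = {};
    int evaluate(Shape* shape, int* slots) override;
};

// Specialized node: remembers the hidden class it last saw and the slot
// offset, so repeated accesses with the same shape skip the lookup.
struct CachedGetProperty : AstNode {
    Shape* cachedShape;
    int cachedSlot;
    CachedGetProperty(Shape* s, int slot) : cachedShape(s), cachedSlot(slot) {}
    int evaluate(Shape* shape, int* slots) override {
        if (shape == cachedShape)           // fast path: cache hit
            return slots[cachedSlot];
        return slots[shape->slotOfFoo];     // slow path (re-specialization elided)
    }
};

static_assert(sizeof(CachedGetProperty) <= sizeof(GenericGetProperty),
              "cached variant must fit in the generic node's footprint");

int GenericGetProperty::evaluate(Shape* shape, int* slots) {
    int slot = shape->slotOfFoo;            // full (slow) lookup
    this->~GenericGetProperty();            // tear down the generic node...
    auto* cached = new (this) CachedGetProperty(shape, slot); // ...rebuild in place
    return cached->evaluate(shape, slots);
}
```

This is exactly why the generic node has to budget storage for the largest specialized variant — placement new into a smaller footprint would be undefined behavior.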

The tradeoff is that this requires mutable AST nodes, which conflicts with the immutable-AST assumption most compilers rely on (e.g., for sharing subtrees or parallelizing compilation). For a single-threaded interpreter it works cleanly, but it'd be a problem if you wanted to JIT-compile from the same AST on a background thread while the interpreter is mutating nodes.

I agree, though with a small caveat: this is for one specific benchmark that, I think, doesn’t reflect most real-world code.

I’m basing that on the 1.6% improvement they got from speeding up sqrt. That surprised me: to get an improvement that large, the benchmark must have been spending over 1.6% of its time in sqrt to start with.
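That bound is just Amdahl's law — a quick sanity check (the helper name is mine): if sqrt accounts for a fraction p of total runtime and gets faster by a factor s, the new total is (1 - p) + p/s, so the overall improvement can never exceed p.

```cpp
// Back-of-the-envelope Amdahl's-law bound for the sqrt claim above.
// p: fraction of total runtime spent in the optimized part.
// s: speedup factor applied to that part.
double overallImprovement(double p, double s) {
    double newTotal = (1.0 - p) + p / s;    // rest unchanged, hot part shrinks
    return 1.0 - newTotal;                  // capped at p even as s -> infinity
}
```

So a measured 1.6% overall win means p, the time share of sqrt, had to exceed 1.6%.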

Looking at the git repo, it seems that is indeed the case in the n-body simulation (https://github.com/pizlonator/zef/blob/master/ScriptBench/nb...).