SpiderMonkey actually ditched most of the profiling stuff in favor of transpiling the ICs generated at runtime into the IR used by the optimizing compiler, inlining them into the functions when they're used, and then sending the whole thing through the usual optimization pipeline. The technique is surprisingly effective.

I don't know what the best reference to link for this would be, but look up "Warp" or "WarpMonkey" if you're interested.



WarpMonkey doesn't get rid of the profiling stuff - the profiling is inherent in the ICs: we keep hit counts and other information for the various paths taken through code (including ICs) and use that to guide compilation later.

Warp's uniqueness is in how it implements the ICs. The design goal when we built the baseline JIT in SpiderMonkey was to split the code and data components of ICs. At the time, we were looking at V8's ICs, which were basically compiled code blocks with the relevant parameter data (e.g. a pointer to the hidden type to compare against) baked into the code.

We wanted to segregate the data from the code - e.g. so that all ShapedGetProp ICs can have a data stub with a pointer to their own shape, but share a pointer to the code. Effectively your ICs end up looking like small linked lists of C++ pure virtual objects (without the vtable indirection and just a single code pointer hanging off of the stub).
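
In rough, made-up C++ (these are not SpiderMonkey's actual types), the shape of that split is something like a stub object that carries only its data plus a pointer to shared code:

    // Purely illustrative sketch of the data/code split; invented names.
    #include <cstdint>

    struct Shape;  // the shape/hidden class the guard compares against

    struct ICStub {
        const uint8_t* sharedCode;  // per-kind machine code, shared by many stubs
        Shape* expectedShape;       // per-stub data that the shared code reads
        ICStub* next;               // next stub to try if this guard fails
    };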

Originally the "shared code" was emitted by a bunch of statically defined methods that emitted a fixed bit of assembly (one for each kind of stub). That became unwieldy as we added more stubs, so CacheIR was designed. CacheIR was a simple bytecode language that the stubs could express their logic in, which would get compiled down to machine code. The CacheIR bytecode would be a key to the compiled stub code.

That let stubs generate arbitrary CacheIR for their logic, but still share code between stubs that emitted the same logic.
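
A sketch of that sharing, again with invented names: the compiled stub code is cached by the CacheIR bytes, so stubs whose CacheIR is identical reuse the same machine code even though their per-stub data differs.

    // Illustrative only; compileCacheIR is a hypothetical compiler entry point.
    #include <cstdint>
    #include <map>
    #include <vector>

    using CacheIRBytes = std::vector<uint8_t>;   // a stub's logic, as bytecode
    using JitCode = void (*)();                  // stand-in for compiled code

    JitCode compileCacheIR(const CacheIRBytes& ir);

    std::map<CacheIRBytes, JitCode> stubCodeCache;

    JitCode getOrCompileStubCode(const CacheIRBytes& ir) {
        auto it = stubCodeCache.find(ir);
        if (it != stubCodeCache.end())
            return it->second;                   // identical CacheIR: share the code
        JitCode code = compileCacheIR(ir);
        stubCodeCache.emplace(ir, code);
        return code;
    }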

That led to the idea of Warp, where we noticed that one could build the input for an optimizing method-JIT compiler just by combining the profiling info the stubs produced with the CacheIR bytecode for those stubs.

Normally you'd start from bytecode, build an SSA, then do a pass where you apply type information.

With Warp, the design simplifies into stitching together a bunch of CacheIR chunks which already embed the optimization information you care about, and then compiling that.
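
As a purely illustrative sketch (none of these names are real), the transpilation step amounts to walking each IC's recorded CacheIR and emitting the corresponding optimizer IR nodes:

    #include <vector>

    enum class CacheOp { GuardIsObject, GuardShape, LoadSlot };

    struct MIRBuilder {            // stand-in for the optimizing compiler's builder
        void emitGuardIsObject() {}
        void emitGuardShape() {}
        void emitLoadSlot() {}
    };

    // For each bytecode site that has an IC, map its CacheIR op-by-op onto MIR.
    void transpileSite(MIRBuilder& mir, const std::vector<CacheOp>& cacheIR) {
        for (CacheOp op : cacheIR) {
            switch (op) {
                case CacheOp::GuardIsObject: mir.emitGuardIsObject(); break;
                case CacheOp::GuardShape:    mir.emitGuardShape();    break;
                case CacheOp::LoadSlot:      mir.emitLoadSlot();      break;
            }
        }
    }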

Ultimately it does the same thing as the other JITs, but it goes about it in a really nice and clean way. It kind of expresses some of the ideas that Maxime Chevalier-Boisvert was exploring in their work on basic block versioning.


Thanks for the more complete explanation!

> Normally you'd start from bytecode, build an SSA, then do a pass where you apply type information.

> With Warp, the design simplifies into stitching together a bunch of CacheIR chunks which already embed the optimization information you care about, and then compiling that.

This is what I meant by ditching most of the profiling stuff; I suppose I should have said "type inference stuff" to be more precise.

> Originally the "shared code" was emitted by a bunch of statically defined methods that emitted a fixed bit of assembly (one for each kind of stub). That became unwieldy as we added more stubs, so CacheIR was designed.

I remember all too well :) I worked on the first pass at adding megamorphic caches to the original stub generators that spat out (macro)assembly directly, before we had CacheIR. So much code duplication...


Ah, sorry for the misinterpretation!

Also, we may have overlapped on the team :)


My understanding is that branch prediction got better in the ‘10s and a bunch of techniques that didn’t work before do now.


The modern VM technique looks almost exactly like what the original PIC papers talked about in the 90s. There are some details that are different, but I'm not sure that the details come down to exploiting changes in branch prediction efficiency. I think the things that changed come mostly down to the fact that the original PIC paper was a first stab by a small team whereas modern VMs involve decades of engineering by larger teams (so everything that could get more complex as a consequence of tuning did get more complex).

So, while it's true that microarches changed in a lot of ways, the overall implications for how you build VMs are not so big.


Are you still using a threaded interpreter main loop? That didn't really come around until the mid-'90s, and I've been hearing for about ten years now that it's not a clear win anymore due to predictors being able to read through two levels of indirection.


The last time I ran the experiment of having a single jump, it was slower than jump-per-opcode-handler.

It's true that predictors are able to see through multiple levels, but a threaded interpreter gives them one extra level, and that ends up mattering as much as it always did.
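
For anyone who hasn't seen it, "jump-per-opcode-handler" in a toy interpreter (using the GCC/Clang computed-goto extension; a sketch, not anyone's real interpreter) looks like this - each handler ends with its own indirect jump, giving the predictor one prediction site per opcode instead of one shared site:

    #include <cstdint>
    #include <cstdio>

    int run(const uint8_t* pc) {
        static void* const handlers[] = { &&op_push1, &&op_add, &&op_halt };
        int acc = 0;
        goto *handlers[*pc++];          // initial dispatch

    op_push1:
        acc += 1;
        goto *handlers[*pc++];          // per-handler dispatch (the "threading")
    op_add:
        acc += acc;
        goto *handlers[*pc++];
    op_halt:
        return acc;
    }

    int main() {
        const uint8_t program[] = { 0, 0, 1, 2 };  // push1, push1, add, halt
        std::printf("%d\n", run(program));         // prints 4
    }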


Threaded dispatch is absolutely worth it. Wizard's interpreter gets anywhere from 10-35% performance improvement from threaded dispatch.


> that branch prediction got better in the ‘10s and a bunch of techniques that didn’t work before do now.

They got better than they had any right to be, but then we found out that Spectre & Meltdown were vulnerabilities rather than optimizations.

For example, a switch-based interpreter was as fast as a computed-goto one for a brief period between 2012 and 2018, but suddenly got slower again once CPUs could no longer rely on branch prediction to do prefetching.


While better predictors allow the speculation window to be larger on average, the real culprit is that large speculation window. Even if the branch predictor weren't very smart, it would still do well on a program with stable, predictable branches, thus allowing a large speculation window to open up. The vulnerability is that some of those branches guard really important things, like not going out of bounds of an array. So a Spectre attack, which works by exploiting a mispredicted branch, is a constructive attack where the gadget is tuned for the branch predictor anyway. The other part of an attack, the windowing gadget, just relies on feeding a really slow input into a branch. Neither of them would be particularly harder with a dumb predictor.
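
For concreteness, the classic published Spectre v1 victim pattern is just a bounds check plus a dependent load; the snippet below is the textbook illustration, not anyone's production code:

    #include <cstddef>
    #include <cstdint>

    uint8_t array1[16];
    uint8_t array2[256 * 512];
    size_t array1_size = 16;

    uint8_t victim(size_t x) {
        if (x < array1_size) {               // the guard that gets mispredicted
            return array2[array1[x] * 512];  // executed speculatively even when x
                                             // is out of bounds, leaving a cache
                                             // footprint that depends on the secret
        }
        return 0;
    }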


We talk about this a bit in our CacheIR paper. Search for "IonBuilder".

https://www.mgaudet.ca/s/mplr23main-preprint.pdf


It sounds like you're describing something similar to what the other JS VMs do.


The main thing we're doing differently in SM is that all of our ICs are generated using a simple linear IR (CacheIR), instead of generating machine code directly. For example, a simple monomorphic property access (obj.prop) would be GuardIsObject / GuardShape / LoadSlot. We can then lower that IR directly to MIR for the optimizing compiler.

It gives us a lot of flexibility in choosing what to guard, without having to worry as much about getting out of sync between the baseline ICs and the optimizer's frontend. To a first approximation, our CacheIR generators are the single source of truth for speculative optimization in SpiderMonkey, and the rest of the engine just mechanically follows their lead.

There are also some cool tricks you can do when your ICs have associated IR. For example, when calling a method on a superclass, with receivers of a variety of different subclasses, you often end up with a set of ICs that all 1. Guard the different shapes of the receiver objects, 2. Guard the shared shape of the holder object, then 3. Do the call. When we detect that, we can mechanically walk the IR, collect the different receiver shapes, and generate a single stub-folded IC that instead guards against a list of shapes. The cool thing is that stub folding doesn't care whether it's looking at a call IC, or a GetProp IC, or anything else: so long as the only thing that differs is a single GuardShape, you can make the transformation.
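
A made-up sketch of the folding step, assuming the opcode sequences have already been checked for equality:

    #include <vector>

    struct Shape;

    struct StubIR {
        std::vector<int> ops;        // CacheIR opcodes, shape operand factored out
        Shape* receiverShape;        // the one operand that differs between stubs
    };

    struct FoldedStub {
        std::vector<int> ops;            // the shared logic (call, getprop, ...)
        std::vector<Shape*> shapeList;   // GuardShape becomes "guard any of these"
    };

    // Assumes the caller has already verified that every stub's ops match.
    FoldedStub foldStubs(const std::vector<StubIR>& stubs) {
        FoldedStub folded{stubs.front().ops, {}};
        for (const StubIR& stub : stubs)
            folded.shapeList.push_back(stub.receiverShape);
        return folded;
    }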


> The main thing we're doing differently in SM is that all of our ICs are generated using a simple linear IR (CacheIR)

JSC calls this PolymorphicAccess. It’s a mini IR with a JIT that tries to emit optimal code based on this IR. Register allocation and everything, just for a very restricted IR.

It’s been there since I don’t remember when. I wrote it ages ago, and it has since evolved into a beast.


Taking a quick look at the JSC code, the main difference is that CacheIR is more pervasive and load-bearing. Even monomorphic cases go through CacheIR.

The main justification for CacheIR isn't that it enables us to do optimizations that can't be done in other ways. It's just a convenient unifying framework.


This is unique to SpiderMonkey, as far as I'm aware.



