Ion register allocator scales poorly due to spilling for calls
Categories
(Core :: JavaScript Engine: JIT, task, P1)
Tracking
()
Tracking | Status | |
---|---|---|
firefox133 | --- | fixed |
People
(Reporter: jandem, Assigned: jandem)
References
(Blocks 2 open bugs)
Details
(Whiteboard: [sp3])
Attachments
(5 files)
Ion's register allocator spills all registers around LIR instructions marked as calls. This is implemented by adding a small fixed range for each of the physical registers during live range building.
This introduces a performance issue when we're looking for an available register to assign to a bundle in tryAllocateAnyRegister
, because we iterate over all physical registers (and for each register over all its aliases - also bundles can have many live ranges). What frequently happens is that we try all of the physical registers and none are available because there's a call-instruction in the bundle's range. This is especially wasteful on ARM64 because it has so many registers.
The allocator already builds a separate tree of call-ranges, so we can add a fast path to tryAllocateAnyRegister
to check if there any call instructions we overlap with and return immediately if that's the case.
On my M1 running Octane in the shell this reduces the total number of calls to AvlTree::contains
from ~2.53 million to ~1.79 million, so a reduction of 29%. (The call-ranges tree should also be a bit smaller and isn't mutated after we create it, so is more cache-friendly.)
Assignee | ||
Comment 1•3 months ago
|
||
Denis, if you're able to test this, I'm curious if you see any improvements with this patch. I also sent this to try but it will take a while to get perf results for the Android devices...
Updated•3 months ago
|
Updated•3 months ago
|
Updated•3 months ago
|
Comment 2•3 months ago
|
||
Another option might be to have callee saved registers. The difficult point is that a frame would no longer live without the context on where to find the location of callee saved registers, but Ion frames are not seekable either. Thus, we could have a safepoint recording for all callee-saved registers to reduce the splilling pressure around call sites.
Assignee | ||
Comment 3•3 months ago
|
||
I have an additional patch to change the call ranges from an AvlTree
to a list of positions that we do a binary search on.
Together these patches make Ion compilation of a large Wasm module (OpenOffice, 200 MB) about 7% faster.
Comment 4•3 months ago
|
||
I think this is clever and worth doing. I do have one conceptual reservation,
although no suggestion how to fix it, as follows.
Currently there is only one "source of truth" about the effects of calls on
registers -- the physical register fixed ranges mentioned in comment 0. With
this new tree/array of call-ranges, we in effect acquire a second place where
it is specified how calls interact with registers. So far there's no problem,
because calls cause all registers to be spilled, so (providing there are no bugs)
the two mechanisms can carry the same information, and we're fine.
But if we at some point change the behaviour of calls to trash only a subset of
registers (== have some caller-saved regs), then the two mechanisms will end up
making conflicting claims about the lifetimes of the caller-saved regs.
I guess .. at least comment this in the patch?
Relatedly, is there some simple way we can check that the patch doesn't change
the allocations at all? Compute a checksum of derived from the sequence of
MoveGroup insertion points and check it is unchanged for a couple of large
inputs?
Assignee | ||
Comment 5•3 months ago
•
|
||
(In reply to Julian Seward [:jseward] from comment #4)
With this new tree/array of call-ranges, we in effect acquire a second place where
it is specified how calls interact with registers. So far there's no problem,
because calls cause all registers to be spilled, so (providing there are no bugs)
the two mechanisms can carry the same information, and we're fine.
This data structure isn't new though, the callRanges tree already exists today and is used for range splitting around calls.. so it's adding an extra consumer of that.
Comment 6•3 months ago
|
||
Ah, my apologies for the noise. I should have looked into this more carefully.
Assignee | ||
Comment 7•3 months ago
|
||
No worries. The suggestion to verify the allocation results is a good idea. I looked into this and the allocations can be different in some cases due to this code where we don't add a fixed range for a call instruction if the call-instruction also has a fixed output for that register.
This can affect the order we allocate/split bundles in. Today we can allocate a bundle to the call's output register and we'll later realize there's a conflict and evict it. With the suggested change we split the bundle immediately instead of doing it later.
I'll get some data on how this affects the register allocation output. The result should be very similar but we should confirm that.
Comment 8•3 months ago
|
||
(In reply to Jan de Mooij [:jandem] from comment #1)
Denis, if you're able to test this, I'm curious if you see any improvements with this patch. I also sent this to try but it will take a while to get perf results for the Android devices...
This seems like a promising improvement. I used perfetto to measure the time it took to complete register allocation on a Pixel 8 and this seems to be roughly a +5% win for that time. The numbers here are a geometric mean of the speedup in regalloc of each function that was compiled by Ion. (i.e. I compared the speedup of the same function before and after the patch and then took a geometric mean of those speedups.)
Charts-chartjs: 0.9737
Charts-observable-plot: 1.0301
Editor-CodeMirror: 1.0439
Editor-TipTap: 1.0385
NewsSite-Next: 1.0429
NewsSite-Nuxt: 1.0359
Perf-Dashboard: 1.0423
React-Stockcharts-SVG: 1.0426
TodoMVC-Angular-Complex-DOM: 1.0350
TodoMVC-Backbone: 1.0451
TodoMVC-JavaScript-ES5: 0.9067
TodoMVC-JavaScript-ES6-Webpack: 1.0275
TodoMVC-jQuery: 1.0502
TodoMVC-Lit-Complex-DOM: 1.0256
TodoMVC-Preact-Complex-DOM: 1.0229
TodoMVC-React-Complex-DOM: 1.0897
TodoMVC-React-Redux: 1.0223
TodoMVC-Svelte-Complex-DOM: 1.0722
TodoMVC-Vue: 1.0069
TodoMVC-WebComponents: 1.0396
The actual SP3 impact may be smaller than this since I had to make a couple adjustments to get consistent numbers such as set the device into performance mode, disable omt compilations, and reduce the Ion threshold down to about 500 to get more samples. Nevertheless, it seems to be a general win on most tests.
Assignee | ||
Comment 9•3 months ago
|
||
Thanks Denis, that's useful data.
I'm working on some other perf improvements for register allocation and other parts of the Ion backend in bug 1916442, but that's more about fixing pathological cases that come up with huge Wasm functions and these changes probably won't help Speedometer much.
Updated•3 months ago
|
Assignee | ||
Comment 10•2 months ago
|
||
In computeRequirement
we never set hint
to FIXED
, so we can turn the code for
that into an assertion.
After that, conflicting.empty()
is always true because the caller clears this
vector right before calling tryAllocateNonFixed
.
Assignee | ||
Comment 11•2 months ago
|
||
If the bundle contains a call instruction, we'd often try all physical registers
because calls currently require spilling all registers. Modern architectures such
as arm64 have so many available registers that this can be quite slow.
This speeds up Ion compilation of a 200 MB OpenOffice Wasm file by about 3-4% on x64.
The speed up is likely larger on arm64 because it has more registers.
Assignee | ||
Comment 12•2 months ago
|
||
This is only set to true
for call instructions and not for other fixed uses.
Assignee | ||
Comment 13•2 months ago
|
||
The canAllocate
variable is set before the loop and is never mutated after that
point. This means it's simpler to check it outside the loop instead of on each
iteration.
This also lets us move the chooseBundleSplit
call to the end of the function.
Assignee | ||
Comment 14•2 months ago
|
||
Note that the pfixed
argument for this function can be nullptr
so this code would
have crashed in that case if the if-body was reachable.
Comment 15•2 months ago
|
||
Comment 16•2 months ago
|
||
bugherder |
https://hg.mozilla.org/mozilla-central/rev/e9a3fcc87262
https://hg.mozilla.org/mozilla-central/rev/ee69209cfbe9
https://hg.mozilla.org/mozilla-central/rev/0e473de9393e
https://hg.mozilla.org/mozilla-central/rev/57689ae8d79d
https://hg.mozilla.org/mozilla-central/rev/dc6e3e292783
Comment 17•2 months ago
|
||
3.1% improvement on AWFY-Windows-webassembly embenchen-compile Windows/Linux / MacOS
Overall 1.6% improvement on MacOS-WebAssembly Godot
Description
•