Closed Bug 1915863 Opened 3 months ago Closed 2 months ago

Ion register allocator scales poorly due to spilling for calls

Categories

(Core :: JavaScript Engine: JIT, task, P1)

task

Tracking

()

RESOLVED FIXED
133 Branch
Tracking Status
firefox133 --- fixed

People

(Reporter: jandem, Assigned: jandem)

References

(Blocks 2 open bugs)

Details

(Whiteboard: [sp3])

Attachments

(5 files)

Ion's register allocator spills all registers around LIR instructions marked as calls. This is implemented by adding a small fixed range for each of the physical registers during live range building.

This introduces a performance issue when we're looking for an available register to assign to a bundle in tryAllocateAnyRegister, because we iterate over all physical registers (and for each register over all its aliases - also bundles can have many live ranges). What frequently happens is that we try all of the physical registers and none are available because there's a call-instruction in the bundle's range. This is especially wasteful on ARM64 because it has so many registers.

The allocator already builds a separate tree of call-ranges, so we can add a fast path to tryAllocateAnyRegister to check if there any call instructions we overlap with and return immediately if that's the case.

On my M1 running Octane in the shell this reduces the total number of calls to AvlTree::contains from ~2.53 million to ~1.79 million, so a reduction of 29%. (The call-ranges tree should also be a bit smaller and isn't mutated after we create it, so is more cache-friendly.)

Denis, if you're able to test this, I'm curious if you see any improvements with this patch. I also sent this to try but it will take a while to get perf results for the Android devices...

Flags: needinfo?(dpalmeiro)
Whiteboard: [sp3]
Blocks: sm-regalloc

Another option might be to have callee saved registers. The difficult point is that a frame would no longer live without the context on where to find the location of callee saved registers, but Ion frames are not seekable either. Thus, we could have a safepoint recording for all callee-saved registers to reduce the splilling pressure around call sites.

I have an additional patch to change the call ranges from an AvlTree to a list of positions that we do a binary search on.

Together these patches make Ion compilation of a large Wasm module (OpenOffice, 200 MB) about 7% faster.

I think this is clever and worth doing. I do have one conceptual reservation,
although no suggestion how to fix it, as follows.

Currently there is only one "source of truth" about the effects of calls on
registers -- the physical register fixed ranges mentioned in comment 0. With
this new tree/array of call-ranges, we in effect acquire a second place where
it is specified how calls interact with registers. So far there's no problem,
because calls cause all registers to be spilled, so (providing there are no bugs)
the two mechanisms can carry the same information, and we're fine.

But if we at some point change the behaviour of calls to trash only a subset of
registers (== have some caller-saved regs), then the two mechanisms will end up
making conflicting claims about the lifetimes of the caller-saved regs.

I guess .. at least comment this in the patch?

Relatedly, is there some simple way we can check that the patch doesn't change
the allocations at all? Compute a checksum of derived from the sequence of
MoveGroup insertion points and check it is unchanged for a couple of large
inputs?

(In reply to Julian Seward [:jseward] from comment #4)

With this new tree/array of call-ranges, we in effect acquire a second place where
it is specified how calls interact with registers. So far there's no problem,
because calls cause all registers to be spilled, so (providing there are no bugs)
the two mechanisms can carry the same information, and we're fine.

This data structure isn't new though, the callRanges tree already exists today and is used for range splitting around calls.. so it's adding an extra consumer of that.

Ah, my apologies for the noise. I should have looked into this more carefully.

No worries. The suggestion to verify the allocation results is a good idea. I looked into this and the allocations can be different in some cases due to this code where we don't add a fixed range for a call instruction if the call-instruction also has a fixed output for that register.

This can affect the order we allocate/split bundles in. Today we can allocate a bundle to the call's output register and we'll later realize there's a conflict and evict it. With the suggested change we split the bundle immediately instead of doing it later.

I'll get some data on how this affects the register allocation output. The result should be very similar but we should confirm that.

(In reply to Jan de Mooij [:jandem] from comment #1)

Denis, if you're able to test this, I'm curious if you see any improvements with this patch. I also sent this to try but it will take a while to get perf results for the Android devices...

This seems like a promising improvement. I used perfetto to measure the time it took to complete register allocation on a Pixel 8 and this seems to be roughly a +5% win for that time. The numbers here are a geometric mean of the speedup in regalloc of each function that was compiled by Ion. (i.e. I compared the speedup of the same function before and after the patch and then took a geometric mean of those speedups.)

Charts-chartjs:                      0.9737
Charts-observable-plot:              1.0301
Editor-CodeMirror:                   1.0439
Editor-TipTap:                       1.0385
NewsSite-Next:                       1.0429
NewsSite-Nuxt:                       1.0359
Perf-Dashboard:                      1.0423
React-Stockcharts-SVG:               1.0426
TodoMVC-Angular-Complex-DOM:         1.0350
TodoMVC-Backbone:                    1.0451
TodoMVC-JavaScript-ES5:              0.9067
TodoMVC-JavaScript-ES6-Webpack:      1.0275
TodoMVC-jQuery:                      1.0502
TodoMVC-Lit-Complex-DOM:             1.0256
TodoMVC-Preact-Complex-DOM:          1.0229
TodoMVC-React-Complex-DOM:           1.0897
TodoMVC-React-Redux:                 1.0223
TodoMVC-Svelte-Complex-DOM:          1.0722
TodoMVC-Vue:                         1.0069
TodoMVC-WebComponents:               1.0396

The actual SP3 impact may be smaller than this since I had to make a couple adjustments to get consistent numbers such as set the device into performance mode, disable omt compilations, and reduce the Ion threshold down to about 500 to get more samples. Nevertheless, it seems to be a general win on most tests.

Flags: needinfo?(dpalmeiro)

Thanks Denis, that's useful data.

I'm working on some other perf improvements for register allocation and other parts of the Ion backend in bug 1916442, but that's more about fixing pathological cases that come up with huge Wasm functions and these changes probably won't help Speedometer much.

Severity: -- → N/A
Priority: -- → P1
Depends on: 1922227

In computeRequirement we never set hint to FIXED, so we can turn the code for
that into an assertion.

After that, conflicting.empty() is always true because the caller clears this
vector right before calling tryAllocateNonFixed.

If the bundle contains a call instruction, we'd often try all physical registers
because calls currently require spilling all registers. Modern architectures such
as arm64 have so many available registers that this can be quite slow.

This speeds up Ion compilation of a 200 MB OpenOffice Wasm file by about 3-4% on x64.
The speed up is likely larger on arm64 because it has more registers.

This is only set to true for call instructions and not for other fixed uses.

The canAllocate variable is set before the loop and is never mutated after that
point. This means it's simpler to check it outside the loop instead of on each
iteration.

This also lets us move the chooseBundleSplit call to the end of the function.

Note that the pfixed argument for this function can be nullptr so this code would
have crashed in that case if the if-body was reachable.

Pushed by jdemooij@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/e9a3fcc87262 part 1 - Remove dead code from tryAllocateNonFixed. r=jseward https://hg.mozilla.org/integration/autoland/rev/ee69209cfbe9 part 2 - Stop searching for an available register when we find a call instruction. r=jseward https://hg.mozilla.org/integration/autoland/rev/0e473de9393e part 3 - Rename pfixed/fixed to hasCall. r=jseward https://hg.mozilla.org/integration/autoland/rev/57689ae8d79d part 4 - Simplify control flow in processBundle a bit. r=jseward https://hg.mozilla.org/integration/autoland/rev/dc6e3e292783 part 5 - Turn an always-false if-statement into an assertion. r=jseward

3.1% improvement on AWFY-Windows-webassembly embenchen-compile Windows/Linux / MacOS

Overall 1.6% improvement on MacOS-WebAssembly Godot

You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: