Bug 1952778 (Open), opened 8 months ago, updated 2 months ago

OMT Baseline compilation is 1.2x slower compared with MT Baseline compilation on this artificial testcase

Categories

(Core :: JavaScript Engine, task, P5)

People

(Reporter: mayankleoboy1, Unassigned)

References

(Blocks 1 open bug)

Details

Attachments

(2 files)

Attached file Empty loop.HTML

Steps to reproduce:
1. Open the attached testcase.
2. Enter a number and press "generate".

N=2500000

N=3500000

What is the browser doing in the period in the OMT case where no activity is shown, which probably contributes the most to the perf gap? https://share.firefox.dev/41Qt0jP

Slowness with OMT baseline compilation is currently expected, and there is ongoing work to improve it with techniques like batching. Filing this bug anyway as a simple (but artificial) testcase.

N=1500000
https://share.firefox.dev/3FtB3dy

Empty switch-statement and if-then-else bodies don't seem to get OMT Baseline compilation?

This testcase is creating an absurdly large script (well over the maximum size for Ion compilation). Parsing it takes up most of our time, but baseline compiling it is also slow.

The difference appears to be that, while OMT baseline compilation successfully moves the bulk of the compilation off-thread (saving about 1s), for some reason finishCompile takes much longer in the OMT case: 1700 ms vs 150 ms. It's just memcpy, so it doesn't make much sense that it's 10x slower. Maybe there's some sort of cache coherency problem moving it between threads?
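That hypothesis is at least easy to poke at in isolation. Here's a minimal standalone C++ sketch (nothing SpiderMonkey-specific; the buffer size is arbitrary) that compares copying a buffer last written by the same thread against one last written by another thread:

#include <chrono>
#include <cstring>
#include <iostream>
#include <thread>
#include <vector>

using Clock = std::chrono::steady_clock;

// Time one memcpy of src into dst, in milliseconds.
static double timedCopy(std::vector<char>& dst, const std::vector<char>& src) {
  auto start = Clock::now();
  std::memcpy(dst.data(), src.data(), src.size());
  return std::chrono::duration<double, std::milli>(Clock::now() - start).count();
}

int main() {
  constexpr size_t kSize = 32 * 1024 * 1024;  // arbitrary; pick something cache-sized
  std::vector<char> src(kSize), dst(kSize);

  // Same-thread case: this thread just wrote the source, so whatever it still
  // holds in cache is already on the right core.
  std::memset(src.data(), 1, kSize);
  std::cout << "source written on this thread:    " << timedCopy(dst, src) << " ms\n";

  // Cross-thread case: another thread wrote the source, so the copying thread
  // starts cold and the modified lines have to migrate between cores.
  std::thread writer([&] { std::memset(src.data(), 2, kSize); });
  writer.join();
  std::cout << "source written on another thread: " << timedCopy(dst, src) << " ms\n";
  return 0;
}

If that doesn't show anything close to a 10x gap, the pause is probably not (just) a cache effect.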

The net result is that both approaches take ~1.7s of main-thread time to baseline compile.

The other big difference is that the OMT version is slower on the main thread. One example is that baseline code spends longer in ToBoolFallback. This is because we're running in blinterp while the (very long) compilation is completed off-thread, and we have to keep attaching ICs for the JumpIfFalse in the loop conditional. In baseline, we do enough analysis to realize that the input will already be a boolean.

More broadly, we spend almost no time in blinterp in the MT case, but nearly half of our time in blinterp in the OMT case. This is an inevitable tradeoff of OMT compilation; it's probably a bit worse in this case because the function is enormous and the compile is slow, so we have more overall blinterp time than we would in a less pathological case.
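To make that tradeoff concrete, here's a deliberately simplified, hypothetical C++ sketch of the off-thread shape (made-up types and timings, not SpiderMonkey's actual scheduler): execution keeps going in the slower tier while the compile runs on another thread, and only switches over once the result is ready to link.

#include <chrono>
#include <cstdio>
#include <future>
#include <thread>

// Hypothetical stand-ins for a script and its compiled form.
struct Script { long loopIterations; };
struct BaselineCode {};

// Pretend baseline compile: for a huge script this takes a while.
static BaselineCode compileBaseline(const Script&) {
  std::this_thread::sleep_for(std::chrono::milliseconds(200));
  return BaselineCode{};
}

int main() {
  Script script{3'500'000};

  // OMT flavor: kick off the compile, then keep interpreting until it's ready.
  // (The MT flavor would just call compileBaseline() here and block.)
  auto pending = std::async(std::launch::async, compileBaseline, std::cref(script));

  long slowTierIterations = 0;
  for (long i = 0; i < script.loopIterations; i++) {
    if (pending.wait_for(std::chrono::seconds(0)) == std::future_status::ready) {
      BaselineCode code = pending.get();  // "link" the finished compile
      (void)code;  // ...and run the rest of the loop in compiled code...
      break;
    }
    slowTierIterations++;  // still in the slow tier (blinterp in the real engine)
  }
  std::printf("iterations executed in the slow tier: %ld\n", slowTierIterations);
  return 0;
}

The longer the compile (and the bigger the function), the more iterations land in the slow tier, which is the effect described above.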

So the only unexpected part here is the big slowdown while linking the OMT compilation on the main thread. I'm not sure what to make of that. It seems plausible that it's in some way caused by the size of the script. I've had a hard time reproducing it locally, because I usually finish executing in blinterp before the OMT compile is finished. I did trigger the long pause once. I also managed to capture a profile where the baseline compile didn't finish in time, but I could see it being cancelled by a GC.

In both cases, the hotspot was a loop that looks like this:

mov rdi, qword [rbx + 0x10]  // Load field from current array element
mov qword [rbx + 0x10], 0x0  // Zero that field
test rdi, rdi                // Check to see if loaded value was already zero
jz 0x31fa1c1                 // If not:
call 0x980a000               //   call something
add rbx, 0x18                // Move to next element
cmp rbx, r15                 // Check if done
jb 0x31fa1ab                 // Loop

This appears to be iterating through an array with 3-word elements (0x18), loading the last word, checking if it's non-zero, and calling something if it is, zeroing it either way. There are no samples on the call, so I assume it is always zero in practice.

I can't quite get the offsets to line up exactly right in BaselineCompiler itself, but this looks a lot like the destructor for the opcodes_ vector inside BaselinePerfSpewer. OpcodeEntry is the right size, the str field is at the right offset, and this looks like an inlined version of the UniqueChars destructor. We're looping over every entry just to check whether its UniqueChars needs to be freed.
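For reference, a hypothetical reduction like this (made-up field names, std::unique_ptr standing in for UniqueChars, not the real BaselinePerfSpewer definitions) produces essentially that loop when the vector is destroyed: 0x18-byte elements whose last word is an owning pointer that's freed only if non-null.

#include <cstdint>
#include <memory>
#include <vector>

// Hypothetical reduction: three words per element, owning pointer last.
struct OpcodeEntry {
  uint32_t offset;               // word 0 (plus padding)
  uint64_t opcode;               // word 1
  std::unique_ptr<char[]> str;   // word 2: disassembly text, freed if non-null
};

int main() {
  std::vector<OpcodeEntry> opcodes;
  opcodes.push_back(OpcodeEntry{0, 0x42, nullptr});
  opcodes.push_back(OpcodeEntry{4, 0x43, std::make_unique<char[]>(16)});
  // When `opcodes` goes out of scope, the inlined unique_ptr destructors run
  // element by element: load the pointer, test it against null, and call the
  // deallocator only when it's set -- one iteration per recorded opcode.
  return 0;
}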

This implies that a chunk of the overhead here is the profiler profiling itself: this vector would be empty if we weren't profiling.

I think for now I'm comfortable saying that this isn't a problem in practice, unless we see a lot of future profiles with big pauses while linking baseline compilation.

(In reply to Mayank Bansal from comment #2)

> Empty switch-statement and if-then-else bodies don't seem to get OMT Baseline compilation?

This code only runs once, so if there are no loops, then there's no point in compiling it (or even going to the baseline interpreter). The cost of interpreting an op once in the C++ interpreter is going to be lower than the overhead of compiling that op (or allocating an IC for it).

Once there's a loop, then the ICs are at least speeding up the loop code itself.
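As a back-of-the-envelope illustration of that argument (all numbers invented for illustration, not measured engine constants), the decision reduces to a simple cost comparison:

#include <cstdio>

int main() {
  // Toy per-op costs, in arbitrary time units (invented, not measured).
  const double interpCost = 20.0;    // run the op once in the C++ interpreter
  const double compileCost = 500.0;  // baseline-compile the op and allocate ICs
  const double compiledCost = 2.0;   // run the op once as compiled code

  auto total = [&](double executions, bool compile) {
    return compile ? compileCost + executions * compiledCost
                   : executions * interpCost;
  };

  // Straight-line code that runs once: compiling is a pure loss.
  std::printf("run once:   interpret=%.0f  compile=%.0f\n", total(1, false), total(1, true));
  // An op inside a hot loop: the compile cost amortizes away almost immediately.
  std::printf("loop 10000: interpret=%.0f  compile=%.0f\n", total(10000, false), total(10000, true));
  return 0;
}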

Priority: -- → P5

These are profiles taken with Samply in a fresh Firefox profile where I never enabled the Gecko profiler:

Empty Loop, N=3500000

Empty switch, N=1500000

Blocks: 1980560