OMT Baseline compilation is 1.2x slower than MT Baseline compilation on this artificial testcase
Categories
(Core :: JavaScript Engine, task, P5)
Tracking
()
People
(Reporter: mayankleoboy1, Unassigned)
References
(Blocks 1 open bug)
Details
Attachments
(2 files)
Open the attached testcase
Enter a number and press generate
2500000
- OMT Baseline compilation: https://share.firefox.dev/41Q1rHk (9.1s)
- MT Baseline compilation: https://share.firefox.dev/3FepUNQ (7.5s)
- Chrome couldn't complete - it inhaled all my RAM
3500000
- OMT: https://share.firefox.dev/4ibeSaE (13s)
- MT: https://share.firefox.dev/43Elo5c (10.5s)
What is the browser doing during the period in the OMT case where no activity is shown, and which probably contributes the most to the perf gap: https://share.firefox.dev/41Qt0jP ?
Slowness with OMT baseline compilation is currently expected, with ongoing work to improve it using techniques like batching. But I'm filing this bug as a simple (but artificial) testcase.
Comment hidden (obsolete)
Reporter
Comment 2•8 months ago
N=1500000
https://share.firefox.dev/3FtB3dy
Empty switch statements and if-then-else bodies don't seem to get OMT Baseline compilation?
Comment 3•8 months ago
This testcase is creating an absurdly large script (well over the maximum size for Ion compilation). Parsing it takes up most of our time, but baseline compiling it is also slow.
The difference appears to be that, while OMT baseline compilation successfully moves the bulk of the compilation off-thread (saving about 1s), for some reason finishCompile takes much longer in the OMT case: 1700 ms vs 150 ms. It's just memcpy, so it doesn't make much sense that it's 10x slower. Maybe there's some sort of cache coherency problem moving it between threads?
The net result is that both approaches take ~1.7s of main-thread time to baseline compile.
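The cache-coherency guess could be probed in isolation with a micro-benchmark along these lines (a sketch, not SpiderMonkey code; the function name, buffer size, and fill pattern are all invented for illustration):

```cpp
#include <chrono>
#include <cstddef>
#include <cstring>
#include <thread>
#include <vector>

// Time a memcpy on the calling thread after the source buffer was last
// written either on this thread or on a worker thread. If cross-thread
// hand-off were the problem, the second case should be measurably slower,
// since the source's cache lines would be cold or owned by another core.
double timedCopyMs(bool fillOnWorker, size_t size = 64 * 1024 * 1024) {
    std::vector<unsigned char> src(size), dst(size);
    auto fill = [&] { std::memset(src.data(), 0xAB, size); };
    if (fillOnWorker) {
        std::thread worker(fill);  // source written off-thread
        worker.join();
    } else {
        fill();                    // source written on this thread
    }
    auto t0 = std::chrono::steady_clock::now();
    std::memcpy(dst.data(), src.data(), size);  // the copy under test
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}
```

Comparing `timedCopyMs(false)` against `timedCopyMs(true)` over several runs would show whether cache effects alone can plausibly account for a 10x gap; note that for buffers much larger than the last-level cache both cases are mostly DRAM-bound, so a large gap would suggest some other cause (page faults on freshly allocated memory, for example).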
The other big difference is that the OMT version is slower on the main thread. One example is that baseline code spends longer in ToBoolFallback. This is because we're running in blinterp while the (very long) compilation is completed off-thread, and we have to keep attaching ICs for the JumpIfFalse in the loop conditional. In baseline, we do enough analysis to realize that the input will already be a boolean.
More broadly, we spend almost no time in blinterp in the MT case, but nearly half of our time in blinterp in the OMT case. This is an inevitable tradeoff of OMT compilation; it's probably a bit worse in this case because the function is enormous and the compile is slow, so we have more overall blinterp time than we would in a less pathological case.
So the only unexpected part here is the big slowdown while linking the OMT compilation on the main thread. I'm not sure what to make of that. It seems plausible that it's in some way caused by the size of the script. I've had a hard time reproducing it locally, because I usually finish executing in blinterp before the OMT compile is finished. I did trigger the long pause once. I also managed to capture a profile where the baseline compile didn't finish in time, but I could see it being cancelled by a GC.
In both cases, the hotspot was a loop that looks like this:
mov rdi, qword [rbx + 0x10] // Load field from current array element
mov qword [rbx + 0x10], 0x0 // Zero that field
test rdi, rdi // Check to see if loaded value was already zero
jz 0x31fa1c1 // If it was zero, skip the call
call 0x980a000 // If it was non-zero, call something
add rbx, 0x18 // Move to next element
cmp rbx, r15 // Check if done
jb 0x31fa1ab // Loop
This appears to be iterating through an array with 3-word elements (0x18), loading the last word, checking if it's non-zero, and calling something if it is, zeroing it either way. There are no samples on the call, so I assume it is always zero in practice.
I can't quite get the offsets to line up exactly right in BaselineCompiler itself, but this looks a lot like the destructor for the opcodes_ vector inside BaselinePerfSpewer. OpcodeEntry is the right size, the str field is at the right offset, and this looks like an inlined version of the UniqueChars destructor. We're looping over each entry to check whether its UniqueChars needs to be freed.
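The pattern in the disassembly matches what a compiler typically emits when inlining an owning-pointer destructor over a vector of small structs. A minimal stand-alone sketch of that layout (names and field types are hypothetical, standing in for the OpcodeEntry described above: three words per entry, with the owned string pointer in the last word):

```cpp
#include <cstdint>
#include <memory>
#include <vector>

// Hypothetical 3-word (0x18-byte on x64) entry, mirroring the layout in
// the disassembly: two plain words plus an owning pointer at offset 0x10.
struct Entry {
    uint64_t opcode;              // offset 0x00
    uint64_t offset;              // offset 0x08
    std::unique_ptr<char[]> str;  // offset 0x10 - the word the loop tests
};

// When a std::vector<Entry> is destroyed, the compiler emits a loop much
// like the one profiled: walk each element, load the pointer field, test
// it, and call the deallocation function only when it is non-null.
void destroyEntries(std::vector<Entry> entries) {
    // The vector's destructor runs when `entries` goes out of scope here.
}
```

If every `str` is null (as the absence of samples on the call suggests), the loop still has to touch all 0x18-byte strides just to discover there is nothing to free.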
This implies that a chunk of the overhead here is the profiler profiling itself: this vector would be empty if we weren't profiling.
I think for now I'm comfortable saying that this isn't a problem in practice, unless we see a lot of future profiles with big pauses while linking baseline compilation.
(In reply to Mayank Bansal from comment #2)
Empty switch statements and if-then-else bodies don't seem to get OMT Baseline compilation?
This code only runs once, so if there are no loops, then there's no point in compiling it (or even going to the baseline interpreter). The cost of interpreting an op once in the C++ interpreter is going to be lower than the overhead of compiling that op (or allocating an IC for it).
Once there's a loop, then the ICs are at least speeding up the loop code itself.
Reporter
Comment 4•5 months ago
These are profiles captured with Samply in a new Firefox profile where I never enabled the Gecko Profiler:
Empty Loop, N=3500000
- OMT Baseline: https://share.firefox.dev/3FttgwL (9.5s)
- MT Baseline: https://share.firefox.dev/4kg8gsO (11s)
Empty switch, N=1500000
- OMT Baseline: https://share.firefox.dev/43aES0U (1.9s)
- MT Baseline: https://share.firefox.dev/4kya4wM (1.9s)