Closed Bug 1646663 Opened 3 months ago Closed 3 months ago

Mandelbrot SIMD benchmark speeds up by 15% with --enable-debug

Categories

(Core :: Javascript: WebAssembly, defect, P2)

x86_64
Linux
defect

Tracking

()

RESOLVED FIXED
mozilla80
Tracking Status
firefox80 --- fixed

People

(Reporter: lth, Assigned: lth)

References

(Blocks 1 open bug)

Details

Attachments

(1 file)

I have a simple SIMD/scalar mandelbrot benchmark, see github.com/lars-t-hansen/simd-bench; it compiles with current emscripten and runs fine in the browser and in the shell.

In the JS shell, I observed that a debug build was 15% faster than a release build. Specifically, this configuration:

../configure --enable-debug --enable-optimize --without-intl-api

runs the benchmark 15% faster than this (default) configuration:

../configure --disable-debug --enable-optimize --without-intl-api

In both cases the command line to run it is

~/m-u/js/src/$(BUILDDIR)/dist/bin/js --wasm-compiler=ion mandel.js

I've confirmed that the runs produce bitwise equivalent output, so it's not an obvious bug. (And for that matter, --enable-optimize doesn't matter, it's all about --enable-debug.)

This reminds me of bug 1138876, which was never resolved, and which suggested some weirdness with the register allocator.

The scalar code does not seem to be affected by this from what I've seen so far: it runs at the same speed with and without --enable-debug.

(I should add: Fedora Core 30 on x64 dual-Xeon system, fresh checkout of mozilla-unified, fresh ./mach bootstrap this morning.)

This appears to be a result of compiling with emcc -O3; with emcc -O2, performance is normal with the release build.

Status: ASSIGNED → RESOLVED
Closed: 3 months ago
Resolution: --- → INVALID

(I guess that's not really a great explanation, after all the same code was run in the two engines, and we still don't know why the debug build was faster than the non-debug build. Looking at the object code, it could come down to code alignment, as a few extra instructions are emitted in the debug build to check alignment of the stack pointer around calls, but this is speculative.)

Status: RESOLVED → REOPENED
Resolution: INVALID → ---

This is just loop alignment. Ion-compiled code does not align the loop head. (Ironically, baseline-compiled code does, and has for a long time.) When I insert alignment the perf difference disappears.

Any reason I should not simply nopAlign(CodeAlignment) at the loop head? It's a one-liner fix. I guess code size might suffer, I can measure the impact somehow.

Flags: needinfo?(sunfish)
Flags: needinfo?(jseward)
Flags: needinfo?(bbouvier)

A 15% hit seems extraordinarily high for a code alignment issue. Can you show the insns in the block, if it is short?

Any reason I should not simply nopAlign(CodeAlignment) at the loop head?

Would this affect compilation of JS via Ion, or wasm only? I'd be nervous about messing with the (presumably delicately balanced) JS machinery.

Assuming that nopAlign(CodeAlignment) behaves sanely and inserts the absolute minimum number of no-op insns (meaning, using the longest variants of the officially allowed nops), then this sounds reasonable to me.

Flags: needinfo?(jseward)

The change would be wasm-specific: we would masm.nopAlign(CodeAlignment) before masm.wasmInterruptCheck(...) in visitWasmInterruptCheck. This interrupt check is only used at loop heads (prologues have a separate one). If we were paranoid we could introduce a flag on the MIR/LIR nodes that carries information about whether the check is for a loop head or not, to avoid any future mishaps -- we would only insert the alignment for checks at loop heads.

I guess there's in principle a risk that the optimizer moves the interrupt check around in the code, but the MIR node has a setGuard() and the node is not marked as moveable. I suppose there's a risk that code might be inserted before it. I see there's an isLoopHeader() predicate on MIR blocks. Maybe emit the alignment before each such block instead (in the case of wasm compilation)?

I agree that the hit is large though of course the inner loop of Mandelbrot is not too many instructions (that said - at least 10 SIMD instructions and a number of scalar ones). Given how the program is compiled with -O3, the loop is duplicated and there are other things going on, so it's not easy to extract the specific code without further analysis (for which I have limited time, sadly). But the 15% comes from running the same wasm code on the same machines with two engines whose only difference is the alignment described above. As a check, I will try to run an experiment where the insertion of the alignment is controlled by an env var so that the executable literally does not change between the two cases.

Last time we've measured in bug 867001, we eventually did not align blocks in general because the performance impact was in the noise range, but this one-liner fix seems pretty harmless indeed.

Flags: needinfo?(bbouvier)

Another thing here is that my octacore Xeon running Linux is not consumer or typical Firefox-user hardware. Memo to self: I really need to look into this on my MBP, minimally, and reproducing it in a properly built browser would be best. (And if so, it's possible to get tests run on other hardware to see if it's an across-the-board issue.)

On the MBP (2.7Ghz quad i7), using an env var to control the code that's emitted, there's a consistent 3% [sic] speedup from aligning the loop head, and this is true whether the wasm code is compiled with -O2 or -O3. This is in the JS shell (--enable-optimize --disable-debug --enabl-release). Thus the 15% is either an artifact of the Xeon or the Linux environment or something else noisy.

Given that -O2 code yields the same result it may be easier to extract the machine code for the hot loops, should we want to.

I have re-run this test on the Xeon with the above patch (so, one shell executable, just the env flag to control the code) and the difference is not shrinking; if anything it is larger now, 669ms for unaligned vs 558ms for aligned (at emcc -O3).

Attachment #9158851 - Attachment description: Bug 1646663 - Conditionally align loop head (WIP) → Bug 1646663 - Conditionally align loop head
Attachment #9158851 - Attachment description: Bug 1646663 - Conditionally align loop head → Bug 1646663 - Conditionally align loop head. r=bbouvier
Pushed by lhansen@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/8e577c5158b8
Conditionally align loop head. r=bbouvier
Status: REOPENED → RESOLVED
Closed: 3 months ago3 months ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla80
Flags: needinfo?(sunfish)
You need to log in before you can comment on or make changes to this bug.