Closed Bug 1646663 Opened 5 years ago Closed 5 years ago

Mandelbrot SIMD benchmark speeds up by 15% with --enable-debug

Tracking

()

Status:

RESOLVED FIXED

Milestone:

mozilla80

Tracking Flags:

Tracking

Status

firefox80

---

fixed

People

(Reporter: lth, Assigned: lth)

References

(Blocks 1 open bug)

Details

Attachments

(1 file)

Bug 1646663 - Conditionally align loop head. r=bbouvier 5 years ago Lars T Hansen [:lth] 47 bytes, text/x-phabricator-request		Details \| Review

Lars T Hansen [:lth]

Assignee

Description

•

5 years ago

I have a simple SIMD/scalar mandelbrot benchmark, see github.com/lars-t-hansen/simd-bench; it compiles with current emscripten and runs fine in the browser and in the shell.

In the JS shell, I observed that a debug build was 15% faster than a release build. Specifically, this configuration:

../configure --enable-debug --enable-optimize --without-intl-api

runs the benchmark 15% faster than this (default) configuration:

../configure --disable-debug --enable-optimize --without-intl-api

In both cases the command line to run it is

~/m-u/js/src/$(BUILDDIR)/dist/bin/js --wasm-compiler=ion mandel.js

I've confirmed that the runs produce bitwise equivalent output, so it's not an obvious bug. (And for that matter, --enable-optimize doesn't matter, it's all about --enable-debug.)

This reminds me of bug 1138876, which was never resolved, and which suggested some weirdness with the register allocator.

The scalar code does not seem to be affected by this from what I've seen so far: it runs at the same speed with and without --enable-debug.

Lars T Hansen [:lth]

Assignee

Comment 1

•

5 years ago

(I should add: Fedora Core 30 on x64 dual-Xeon system, fresh checkout of mozilla-unified, fresh ./mach bootstrap this morning.)

Lars T Hansen [:lth]

Assignee

Comment 2

•

5 years ago

This appears to be a result of compiling with emcc -O3; with emcc -O2, performance is normal with the release build.

Status: ASSIGNED → RESOLVED

Closed: 5 years ago

Resolution: --- → INVALID

Lars T Hansen [:lth]

Assignee

Comment 3

•

5 years ago

(I guess that's not really a great explanation, after all the same code was run in the two engines, and we still don't know why the debug build was faster than the non-debug build. Looking at the object code, it could come down to code alignment, as a few extra instructions are emitted in the debug build to check alignment of the stack pointer around calls, but this is speculative.)

Status: RESOLVED → REOPENED

Resolution: INVALID → ---

Lars T Hansen [:lth]

Assignee

Comment 4

•

5 years ago

This is just loop alignment. Ion-compiled code does not align the loop head. (Ironically, baseline-compiled code does, and has for a long time.) When I insert alignment the perf difference disappears.

Any reason I should not simply nopAlign(CodeAlignment) at the loop head? It's a one-liner fix. I guess code size might suffer, I can measure the impact somehow.

Flags: needinfo?(sunfish)

Flags: needinfo?(jseward)

Flags: needinfo?(bbouvier)

Julian Seward [:jseward]

Comment 5

•

5 years ago

A 15% hit seems extraordinarily high for a code alignment issue. Can you show the insns in the block, if it is short?

Any reason I should not simply nopAlign(CodeAlignment) at the loop head?

Would this affect compilation of JS via Ion, or wasm only? I'd be nervous about messing with the (presumably delicately balanced) JS machinery.

Assuming that nopAlign(CodeAlignment) behaves sanely and inserts the absolute minimum number of no-op insns (meaning, using the longest variants of the officially allowed nops), then this sounds reasonable to me.

Flags: needinfo?(jseward)

Lars T Hansen [:lth]

Assignee

Comment 6

•

5 years ago

The change would be wasm-specific: we would masm.nopAlign(CodeAlignment) before masm.wasmInterruptCheck(...) in visitWasmInterruptCheck. This interrupt check is only used at loop heads (prologues have a separate one). If we were paranoid we could introduce a flag on the MIR/LIR nodes that carries information about whether the check is for a loop head or not, to avoid any future mishaps -- we would only insert the alignment for checks at loop heads.

I guess there's in principle a risk that the optimizer moves the interrupt check around in the code, but the MIR node has a setGuard() and the node is not marked as moveable. I suppose there's a risk that code might be inserted before it. I see there's an isLoopHeader() predicate on MIR blocks. Maybe emit the alignment before each such block instead (in the case of wasm compilation)?

I agree that the hit is large though of course the inner loop of Mandelbrot is not too many instructions (that said - at least 10 SIMD instructions and a number of scalar ones). Given how the program is compiled with -O3, the loop is duplicated and there are other things going on, so it's not easy to extract the specific code without further analysis (for which I have limited time, sadly). But the 15% comes from running the same wasm code on the same machines with two engines whose only difference is the alignment described above. As a check, I will try to run an experiment where the insertion of the alignment is controlled by an env var so that the executable literally does not change between the two cases.

Benjamin Bouvier [:bbouvier] (inactive)

Comment 7

•

5 years ago

Last time we've measured in bug 867001, we eventually did not align blocks in general because the performance impact was in the noise range, but this one-liner fix seems pretty harmless indeed.

Flags: needinfo?(bbouvier)

Lars T Hansen [:lth]

Assignee

Comment 8

•

5 years ago

Another thing here is that my octacore Xeon running Linux is not consumer or typical Firefox-user hardware. Memo to self: I really need to look into this on my MBP, minimally, and reproducing it in a properly built browser would be best. (And if so, it's possible to get tests run on other hardware to see if it's an across-the-board issue.)

Lars T Hansen [:lth]

Assignee

Comment 9

•

5 years ago

On the MBP (2.7Ghz quad i7), using an env var to control the code that's emitted, there's a consistent 3% [sic] speedup from aligning the loop head, and this is true whether the wasm code is compiled with -O2 or -O3. This is in the JS shell (--enable-optimize --disable-debug --enabl-release). Thus the 15% is either an artifact of the Xeon or the Linux environment or something else noisy.

Given that -O2 code yields the same result it may be easier to extract the machine code for the hot loops, should we want to.

Lars T Hansen [:lth]

Assignee

Comment 10

•

5 years ago

Attached file Bug 1646663 - Conditionally align loop head. r=bbouvier — Details

Lars T Hansen [:lth]

Assignee

Comment 11

•

5 years ago

I have re-run this test on the Xeon with the above patch (so, one shell executable, just the env flag to control the code) and the difference is not shrinking; if anything it is larger now, 669ms for unaligned vs 558ms for aligned (at emcc -O3).

Phabricator Automation

Updated

•

5 years ago

Attachment #9158851 - Attachment description: Bug 1646663 - Conditionally align loop head (WIP) → Bug 1646663 - Conditionally align loop head

Phabricator Automation

Updated

•

5 years ago

Attachment #9158851 - Attachment description: Bug 1646663 - Conditionally align loop head → Bug 1646663 - Conditionally align loop head. r=bbouvier

Pulsebot

Comment 12

•

5 years ago

Pushed by lhansen@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/8e577c5158b8 Conditionally align loop head. r=bbouvier

Noemi Erli[:noemi_erli]

Comment 13

•

5 years ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/8e577c5158b8

Status: REOPENED → RESOLVED

Closed: 5 years ago → 5 years ago

status-firefox80: --- → fixed

Resolution: --- → FIXED

Target Milestone: --- → mozilla80

Lars T Hansen [:lth]

Assignee

Updated

•

5 years ago

Flags: needinfo?(sunfish)

You need to log in before you can comment on or make changes to this bug.

Bugzilla

Mandelbrot SIMD benchmark speeds up by 15% with --enable-debug

Categories

(Core :: JavaScript: WebAssembly, defect, P2)

Tracking

()

People

(Reporter: lth, Assigned: lth)

References

(Blocks 1 open bug)

Details

Crash Data

Security

(public)

User Story

Attachments

(1 file)

Description

Comment 1

Comment 2

Comment 3

Comment 4

Comment 5

Comment 6

Comment 7

Comment 8

Comment 9

Comment 10

Comment 11

Updated

Updated

Comment 12

Comment 13

Updated

Attachment

General

Description

File Name

Content Type