Closed Bug 1316803 Opened 8 years ago Closed 7 years ago

Wasm baseline: fold constant rhs into integer operations

Categories

(Core :: JavaScript Engine: JIT, defect, P3)

defect

Tracking

()

RESOLVED FIXED

People

(Reporter: lth, Assigned: dannas, Mentored)

References

Details

Attachments

(11 files, 11 obsolete files)

6.44 KB, text/plain
Details
2.43 KB, text/plain
Details
4.87 KB, application/javascript
Details
1003 bytes, application/x-shellscript
Details
3.03 KB, text/plain
Details
1.44 KB, text/plain
Details
18.76 KB, patch
lth
: review+
Details | Diff | Splinter Review
3.52 KB, text/plain
Details
13.26 KB, patch
lth
: review+
Details | Diff | Splinter Review
9.84 KB, patch
Details | Diff | Splinter Review
1.34 KB, application/x-javascript
Details
There are many opportunities for cheaply folding in a constant rhs to arithmetic operations, we do this already for I32 add and shift operators, this reduces register pressure and instruction count.  Most places are already tagged with TODO in the code.

(Note, folding in constant rhs for conditionals is covered by bug 1286816.)
There are some special cases that are probably worth adressing:

- Bug 1276009, x ^ -1 is the way wasm expresses ~x, mbebinita may have other cases like this
- Multiplication by certain easy constants
- Divisions by constant-power-of-two if the lhs is nonnegative, maybe worth open-coding the sign check
  and specializing the path

The multiply and divide cases are of unclear value to me - it depends on what the front-end does, for whatever source language we care about... probably worth investigating in more depth.
I'll take a stab at this.

The TODO's in the code can be found with this search http://searchfox.org/mozilla-central/search?q=1316803&case=false&regexp=false&path=WasmBaselineCompile and the an example of folding for i32 adds can be found at http://searchfox.org/mozilla-central/source/js/src/wasm/WasmBaselineCompile.cpp#3748
That's great.  If it were me I would do add/sub/shift/rotate first (I now see the subtract primitives are not annotated with the bug numbers but they probably should be) and then worry about multiply, quotient, and remainder in subsequent rounds.
Some SpiderMonkey newbie questions:

* Why do we need fold optimizations in the BaseCompiler? Won't emscripten et.al. do those optimizations for us already?

* What is the relation between IonMonkey and wasm::BaseCompiler? AFAICT, for regular javascript, the BaselineCompiler is a transition step between interpreting and full-JIT compilation. But in http://searchfox.org/mozilla-central/source/js/src/wasm/WasmIonCompile.cpp#3793 it looks like it's either Ion or Baseline. (And I don't understand how the Ion code can be used for both javascript and wasm; the MIR and LIR bytecodes for javascript is not the same as the bytecode for wasm).

I interpret the wasm::BaseCompiler code as: "generate assembly code by calling the MacroAssembler". How can I inspect the generated assembly code for an wasm function? https://developer.mozilla.org/en-US/docs/Mozilla/Projects/SpiderMonkey/Hacking_Tips has some suggestions on how to use gdb for inspecting the odin and ion generated code. How do it for the BaseCompiler?
Attached file printwasm.py (obsolete) —
A script using the gdb python api for inspecting the assembly generated by the BaseCompiler. An attempt at codifying the manual instructions at https://developer.mozilla.org/en-US/docs/Mozilla/Projects/SpiderMonkey/Hacking_Tips#Printing_asm.jswasm_generated_assembly_code_(from_gdb)

Works ok but with two hiccups:
* There are 32 bytes at the beginning of each asm listing, that gdb is not able to disassemble. Why?
* How can I read the length of the assembly? Right now I'm decoding 64 bytes, but it would be neat if I could show only the relevant instructions.
Attached file printwasm.py (obsolete) —
Attached file printwasm.py
Attachment #8812606 - Attachment is obsolete: true
Attachment #8812607 - Attachment is obsolete: true
FWIW, the simplest ways I know how to look at the generated machine code:

(1) look at the codegen output, this is what I do most often:

$ IONFLAGS=codegen dist/bin/js --wasm-always-baseline whatever.js > foo.txt 2>&1
$ <whatever command you like for viewing a text file> foo.txt
search for "wasm baseline" in the file

After that it's pretty self-explanatory, if you know a little bit about how the compiler generates the function prologue code so that you're not confused by the jumping around.

(2) emit a breakpoint in your code and examine the code from gdb:

In the code, call masm.breakpoint() to emit a breakpoint at the start of a function, say, or after the prologue.
Run under gdb.
When gdb stops, x/10i shows you the next 10 instructions.
Use 'stepi' to step past the breakpoint.
Attached file annotated disassembly
Thanks for the suggestions to use IONFLAGS=codegen and setting a breakpoint from the MacroAssembler API.

Attaching my annotations on a disassembly of a Wast function.

Sorry for spamming this issue. Hopefully, that's enough lateral movements, now I can inspect the assembly and understand (sort of) the call flow. Time to move forward and do part 1:  add/sub/shift/rotate.
(In reply to Daniel Näslund from comment #9)
> Created attachment 8812902 [details]
> annotated disassembly

A minor correction:  Following the retq that follows .returnLabel, the instructions up to and including the next retq comprise the profiling epilogue.  After that, the jmpq + int3 and callq + int3 sequences are indeed traps.
Attached file 1316803_tests.js (obsolete) —
Attached patch wip-1316803-part1.patch (obsolete) — Splinter Review
Need to ponder a bit more on the shift and rotate instructions. Just need a little more time I guess.

The f32.add and f64.add instructions were marked with this bug number. But floating point add instructions don't provide immediate operands AFAICT. I asked sunfish the leading question: "why do hardware floating point op instr don't provide immediate operands". He suggested that floating point constants aren't as common and that arm/mips has a limited number of bits available.

The MacroAssembler does not expose an API that has immediate operands for adding floating point values either.

Is there a way to provide floating point immediate add operations? Or was the comment a copy-paste slip?

I'll start thinking about part 2. I wonder how we can get a feel for what kind of optimizations are worthwhile? Some sort of instrumentation to determine how often we run mult/div/remainder instructions with a constant operand.

Replace multiplications of factor two by shifts is one thing that can be easily done. Guess its' time to pick out Henry Warrens book Hackers Delight from the shelf and read up on divisions by constants.
Assignee: nobody → dannas
(In reply to Daniel Näslund from comment #12)
> Created attachment 8813867 [details] [diff] [review]
> wip-1316803-part1.patch
> 
> Is there a way to provide floating point immediate add operations? Or was
> the comment a copy-paste slip?

It wasn't a slip, but that the comment is there is no commitment to actually implement an optimization, it's just where the optimization would go if it were to be implemented.

On all platforms there are plenty of FP registers and it's almost certainly less important to implement this optimization for FP than for integer.

> Replace multiplications of factor two by shifts is one thing that can be
> easily done.

And if a small number of bits is set in the rhs it may be profitable to replace multiplication by a series of shifts and adds.  It seems probable that Ion already has some heuristics here for how few bits can be set but I haven't looked.

We have benchmarks, you know: https://github.com/kripken/embenchen/tree/master/asm_v_wasm
(In reply to Lars T Hansen [:lth] from comment #13)
> (In reply to Daniel Näslund from comment #12)

> > Replace multiplications of factor two by shifts is one thing that can be
> > easily done.
> 
> And if a small number of bits is set in the rhs it may be profitable to
> replace multiplication by a series of shifts and adds.  It seems probable
> that Ion already has some heuristics here for how few bits can be set but I
> haven't looked.

I'll investigate that, though IonMonkey appears to only deal with -1, 0, 1, and powers of two: http://searchfox.org/mozilla-central/source/js/src/jit/x86-shared/CodeGenerator-x86-shared.cpp#944

Agner Fog says that instruction latencies for imul on recent x86 is 3-4. If that's correct, then there wouldn't be much use in replacing imul with shift and adds. But when I checked the gcc.godbolt.org output for icc and gcc, they did most of these transformations: http://www.xarg.org/2010/04/optimizing-integer-multiplication/. I guess maybe the shift and add can use different ports.

Anyway. I'll try a few approaches: only imul; replace -1, 0, 1 and powers of two; do shift-add transformations up to some cutoff, and compare the benchmark results.

IonMonkey does division optimizations for non-powers-of-two http://searchfox.org/mozilla-central/source/js/src/jit/shared/CodeGenerator-shared.cpp#1671. That looks fiddly and complicated for a BaseCompiler that is meant to just do the minimal amount of optimization.

I'll start with just replacing powers-of-two and see what that gives me in speedup.

(The IonMonkey links are for x86 only. Haven't investigate if we doe something radically different for arm and mips).

> We have benchmarks, you know:
> https://github.com/kripken/embenchen/tree/master/asm_v_wasm

Ah, great! I wasn't aware of them.
BTW, there's no reason not to finish + review + land the patch for add/sub/shift/rotate/bitwise-ops now, we don't have to have a single patch for this bug.  In fact I'd probably prefer to review that patch separately from any multiply/divide work, since the latter is probably somewhat more complex and less obviously a benefit.
Attached file 1316803_tests.js
Attachment #8813859 - Attachment is obsolete: true
Attached file runtests.sh
Attached file assembly_optimized.txt
See assembly_optimized.txt for the assembly output for each operation.

This is supposed to be a performance improvement, but so far I haven't proved that it speed things up. I tried running the embenchen benchmark but saw no performance improvement. Thought, my testing was ad-hoc. I'll try again tomorrow and make sure to have a lightly loaded system, pin the test to one cpu and do a larger number of samples.

Do you have  a script for running the embenchen benchmarks from the shell and automate measurements? I've ran the benchmarks in benches/ one by one and checked user cpu time with time(8) so far.

I'll write a few artifical benchmarks for myself as well (e.g. perform N add-operations with constant rhs) to see what to expect in the best-case.

Are you ok with dropping constant folding for floating point? Should I give it more thought?
Attachment #8814718 - Flags: review?(lhansen)
(In reply to Daniel Näslund from comment #19)

> Do you have a script for running the embenchen benchmarks from the shell
> and automate measurements? 

I do now, because I needed it myself.  Clone https://github.com/lars-t-hansen/embenchen.  Then cd to asm_v_wasm/ (not benches/) and run wasm_bench.sh.
Comment on attachment 8814718 [details] [diff] [review]
bug_1316803_part_1_fold_constant_rhs_for_add_sub_shift_rotate_bitwise_ops.patch

Review of attachment 8814718 [details] [diff] [review]:
-----------------------------------------------------------------

Thank you!  Very nice, but we'll do one more round because of the problems noted below for rotate which require fixes + test cases.

As for data showing this is an optimization, we should run the benchmark suite but I expect the results to be mostly in the noise because the benchmark suite is pretty narrow.  If we had some crypto or some graphics code, then maybe...

Obviously contributions to the benchmark suite or the benchmark runner that alleviate that problem are very welcome, but I'm perfectly happy to take this patch without that.

::: js/src/wasm/WasmBaselineCompile.cpp
@@ +4463,5 @@
>  {
> +    int32_t c;
> +    if (popConstI32(c)) {
> +        RegI32 r = popI32();
> +        masm.rotateRight(Imm32(c), r, r);

"c & 31".  Apparently we do not have a test case that exercises that case, so it would be lovely if you could create one and include it with this patch.  We should also have a test for the case where c is zero, which the MacroAssembler allows but has to handle specially on some architectures.

@@ +4480,5 @@
>  {
> +    int64_t c;
> +    if (popConstI64(c)) {
> +        RegI64 r = popI64();
> +        masm.rotateRight64(Imm32(c), r, r, InvalidReg);

"c & 63", ditto.

@@ +4497,5 @@
>  {
> +    int32_t c;
> +    if (popConstI32(c)) {
> +        RegI32 r = popI32();
> +        masm.rotateLeft(Imm32(c), r, r);

"c & 31", ditto.

@@ +4514,5 @@
>  {
> +    int64_t c;
> +    if (popConstI64(c)) {
> +        RegI64 r = popI64();
> +        masm.rotateLeft64(Imm32(c), r, r, InvalidReg);

"c & 63", ditto.
Attachment #8814718 - Flags: review?(lhansen)
(In reply to Daniel Näslund from comment #19)
> 
> Are you ok with dropping constant folding for floating point?

Yes.  Let's drop that (and remove the comments, as you've done) and let that be a data-driven item, whenever we get around to cataloguing instruction sequences.
Fixed range of rotate count.

I've added a test for rotating a too large i64 number. There already exists such a test for i32 (a count &= 0x1f statement in the x64 MacroAssembler caused that test not to trigger)

AFAICT, there already exists tests for rotate counts of zero. The i32 case can be found here: http://searchfox.org/mozilla-central/source/js/src/jit-test/tests/wasm/integer.js#130

The i64 test for 0 rotate count is here: http://searchfox.org/mozilla-central/source/js/src/jit-test/tests/wasm/integer.js#209

To me, it was a surprise to learn that different hardware platforms treat too large shift and rotate amounts differently.
Attachment #8813867 - Attachment is obsolete: true
Attachment #8814718 - Attachment is obsolete: true
Attachment #8815042 - Flags: review?(lhansen)
Benchmark results from running lth's wasm_bench.sh script for the embenchen benchmarks. The results are inconclusive, but atleast there's no major slowdown.
(In reply to Daniel Näslund from comment #23)
> To me, it was a surprise to learn that different hardware platforms treat
> too large shift and rotate amounts differently.

FWIW, this goes for C++ as well -- and the rationale for C++ doing so is exactly that different hardware behaves differently on it, and picking any one semantics penalizes code compiled on whatever hardware implemented different semantics.
(In reply to Daniel Näslund from comment #24)
> Created attachment 8815064 [details]
> benchmark_results_part_1.txt
> 
> Benchmark results from running lth's wasm_bench.sh script for the embenchen
> benchmarks. The results are inconclusive, but atleast there's no major
> slowdown.

Actually, for my benchmark runner the first two columns are such that lower-is-better, but for the ratio in the third column higher-is-better (this is probably dumb but there you have it, I should print a proper legend, WILLFIX).  These numbers look pretty noisy; I'm going to do a local run as well, since I have some stable numbers from yesterday.
Compared to a stable run from yesterday (7 runs).  Most changes are extremely minor, to get a ++ below you only need to improve the ratio by .02 or more.

box2d           969     1675    .578 +
bullet          1191    2085    .571 -
conditionals    287     960     .298 +
copy            591     2935    .201 +
corrections     961     1467    .655 +
fannkuch        2186    4628    .472 +
fasta           739     2113    .349 -
ifs             197     294     .670 ++
linpack         8276    10401   .795 +
lua_binarytrees 3916    5941    .659 ++
lua_scimark     4147    7102    .583 +
memops          889     3510    .253 +
primes          1141    1155    .987 -
skinning        824     2187    .376 +
zlib            1144    1955    .585 +

I'd say this is just fine.
Attachment #8815042 - Flags: review?(lhansen) → review+
Keywords: checkin-needed
Pushed by ryanvm@gmail.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/d62d5b78e234
Fold constant rhs for add/sub/shift/rotate/bitwise-ops. r=lth
Keywords: checkin-needed
My apologies. I wasn't aware of the distinction between mozilla-central and mozilla-inbound.

I've applied for try server access, and will make sure in the future that my patches apply cleanly against mozilla-inbound.

I'll also make sure to use the "r?somebody" pattern in my commit messages to ensure that a push will fail-fast for missing r= stanzas.

Again, sorry for the inconvenience.
Flags: needinfo?(dannas)
Keywords: leave-open
When building for arm64 in treeherder, I get these build failures for part1:

/home/worker/workspace/build/src/js/src/wasm/WasmBaselineCompile.cpp:3990:31: error: use of deleted function 'void js::jit::MacroAssembler::add64(js::jit::Imm64, js::jit::Register64)'
/home/worker/workspace/build/src/js/src/jit/MacroAssembler.h:769:17: error: declared here
/home/worker/workspace/build/src/js/src/wasm/WasmBaselineCompile.cpp:4044:31: error: use of deleted function 'void js::jit::MacroAssembler::sub64(js::jit::Imm64, js::jit::Register64)'
/home/worker/workspace/build/src/js/src/jit/MacroAssembler.h:788:17: error: declared here
/home/worker/workspace/build/src/js/src/wasm/WasmBaselineCompile.cpp:4649:59: error: use of deleted function 'void js::jit::MacroAssembler::rotateRight64(js::jit::Imm32, js::jit::Register64, js::jit::Register64, js::jit::Register)'
/home/worker/workspace/build/src/js/src/jit/MacroAssembler.h:911:17: error: declared here
/home/worker/workspace/build/src/js/src/wasm/WasmBaselineCompile.cpp:4683:58: error: use of deleted function 'void js::jit::MacroAssembler::rotateLeft64(js::jit::Imm32, js::jit::Register64, js::jit::Register64, js::jit::Register)'
/home/worker/workspace/build/src/js/src/jit/MacroAssembler.h:902:17: error: declared here
make[3]: *** [Unified_cpp_js_src37.o] Error 1
make[3]: *** Waiting for unfinished jobs....
make[2]: *** [js/src/target] Error 2
make[1]: *** [compile] Error 2
make: *** [default] Error 2
subprocess.CalledProcessError: Command 'make -s -w -j6' returned non-zero exit status 2

A few operations with immediate operands are not defined for arm64. What to do now?

1. Add #ifndef JS_CODEGEN_ARM64 around the operations using constants. Gets a bit fiddly, but doable. Then create an issue for the missing methods. Or..

2. Add the missing methods to jit/arm64/MacroAssembler-inl.h. But then I'd have to test the functionality and I don't have access to a computer equipped with an arm64 processor. The arm simulator only supports 32 bits https://developer.mozilla.org/en-US/docs/Mozilla/Projects/SpiderMonkey/Hacking_Tips#Using_the_ARM_simulator. Can I use qemu? Some amazon VM?
Flags: needinfo?(lhansen)
(In reply to Daniel Näslund from comment #31)
> 2. Add the missing methods to jit/arm64/MacroAssembler-inl.h. But then I'd
> have to test the functionality and I don't have access to a computer
> equipped with an arm64 processor. The arm simulator only supports 32 bits
> https://developer.mozilla.org/en-US/docs/Mozilla/Projects/SpiderMonkey/
> Hacking_Tips#Using_the_ARM_simulator. Can I use qemu? Some amazon VM?

ARM64 has a separate simulator. On a 64-bit host system, you can build it by passing --enable-simulator=arm64 to configure.

The ARM64 IonMonkey JIT doesn't work, so it's hard to test these things. You could either implement them correctly, or implement them as MOZ_CRASH("NYI"), with an open bug.

The missing functions look easy enough to implement that I would suggest just implementing them. I would be happy to review that part of the patch.
Flags: needinfo?(lhansen)
Daniel, extra points to you if you implement these functions.  For a while I was doing that myself, but really, lately Arm64 has gotten in the way of getting work done and I've started just stubbing out everything with a MOZ_CRASH("NYI").  There must be dozens of those by now.  A few more won't make much difference.
I've verified that this patch now builds for arm64 by building with configure --enable-simulator=arm64. 

I wrote an implementation for Add64; reasoned that by following the pattern for the other overloaded Add methods I'd do ok.

For sub and rotate there were no examples available. So I stubbed them out. The work of implementing the stubs for arm64 is tracked since before in https://bugzilla.mozilla.org/show_bug.cgi?id=1313336
Attachment #8815042 - Attachment is obsolete: true
Attachment #8817216 - Flags: review?(lhansen)
Comment on attachment 8817216 [details] [diff] [review]
bug_1316803_v3_fold_constant_rhs_for_add_sub_shift_rotate_bitwise_ops.patch

Review of attachment 8817216 [details] [diff] [review]:
-----------------------------------------------------------------

Thank you!  Nice work.
Attachment #8817216 - Flags: review?(lhansen) → review+
Keywords: checkin-needed
Pushed by cbook@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/3bfbdec4ea08
part 1 - fold constant rhs for add/sub/shift/rotate/bitwise-ops r=lth
Keywords: checkin-needed
Sorry, I had to back this out for asserting in wasm.js (low 32 bits don't match) on Windows XP 32-bit:

https://hg.mozilla.org/integration/mozilla-inbound/rev/dffe4787171445ea01f99a41514d262f7882d120

Push with failure: https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&revision=3bfbdec4ea08c9fb7fb0a4ee624c1607e85a4013
Failure log: https://treeherder.mozilla.org/logviewer.html#?job_id=40379278&repo=mozilla-inbound

TEST-PASS | js\src\jit-test\tests\wasm\import-export.js | Success (code 0, args "--wasm-always-baseline")
TEST-PASS | js\src\jit-test\tests\wasm\import-gc.js | Success (code 0, args "--no-baseline --baseline-eager")
TEST-PASS | js\src\jit-test\tests\wasm\import-gc.js | Success (code 0, args "--no-baseline --wasm-always-baseline")
TEST-PASS | js\src\jit-test\tests\wasm\import-gc.js | Success (code 0, args "--no-baseline --no-baseline --no-ion")
TEST-PASS | js\src\jit-test\tests\wasm\js-reexport.js | Success (code 0, args "")
c:\builds\moz2_slave\m-in_w32_sm-plain-000000000000\src\js\src\jit-test\lib\wasm.js:67:9 Error: Assertion failed: got 1898136935, expected -2128394905: low 32 bits don't match
Stack:
  assertEqI64@c:\builds\moz2_slave\m-in_w32_sm-plain-000000000000\src\js\src\jit-test\lib\wasm.js:67:9
  _wasmFullPassInternal@c:\builds\moz2_slave\m-in_w32_sm-plain-000000000000\src\js\src\jit-test\lib\wasm.js:125:5
  wasmFullPassI64@c:\builds\moz2_slave\m-in_w32_sm-plain-000000000000\src\js\src\jit-test\lib\wasm.js:147:70
  testBinary64@c:\builds\moz2_slave\m-in_w32_sm-plain-000000000000\src\js\src\jit-test\tests\wasm\integer.js:32:5
  @c:\builds\moz2_slave\m-in_w32_sm-plain-000000000000\src\js\src\jit-test\tests\wasm\integer.js:206:5
Exit code: 3
FAIL - wasm\integer.js
TEST-UNEXPECTED-FAIL | js\src\jit-test\tests\wasm\integer.js | c:\builds\moz2_slave\m-in_w32_sm-plain-000000000000\src\js\src\jit-test\lib\wasm.js:67:9 Error: Assertion failed: got 1898136935, expected -2128394905: low 32 bits don't match (code 3, args "--wasm-always-baseline")
INFO exit-status     : 3
INFO timed-out       : False
INFO stderr         2> c:\builds\moz2_slave\m-in_w32_sm-plain-000000000000\src\js\src\jit-test\lib\wasm.js:67:9 Error: Assertion failed: got 1898136935, expected -2128394905: low 32 bits don't match
INFO stderr         2> Stack:
INFO stderr         2> assertEqI64@c:\builds\moz2_slave\m-in_w32_sm-plain-000000000000\src\js\src\jit-test\lib\wasm.js:67:9
INFO stderr         2> _wasmFullPassInternal@c:\builds\moz2_slave\m-in_w32_sm-plain-000000000000\src\js\src\jit-test\lib\wasm.js:125:5
INFO stderr         2> wasmFullPassI64@c:\builds\moz2_slave\m-in_w32_sm-plain-000000000000\src\js\src\jit-test\lib\wasm.js:147:70
INFO stderr         2> testBinary64@c:\builds\moz2_slave\m-in_w32_sm-plain-000000000000\src\js\src\jit-test\tests\wasm\integer.js:32:5
INFO stderr         2> @c:\builds\moz2_slave\m-in_w32_sm-plain-000000000000\src\js\src\jit-test\tests\wasm\integer.js:206:5

Please fix the issue and submit a new patch. Thank you.
Flags: needinfo?(dannas)
Changes v3 => v4
* Allocate valid temp register on x86 for rotateI64.
Attachment #8817216 - Attachment is obsolete: true
Flags: needinfo?(dannas)
Attachment #8817919 - Flags: review?(lhansen)
Attachment #8817919 - Flags: review?(lhansen) → review+
https://treeherder.mozilla.org/#/jobs?repo=try&revision=ec3372c74e8ccb03e4edd87d5456bfd292ffe948

Three "intermittent orange" (jandem helped me classify them):
tc(V): Vague valgrind error. Green when I re-ran the test.
tc(B): gcc segfault. Green when I re-ran.
SM(cgc): Known intermittent error.

All other tests pass.
Keywords: checkin-needed
Pushed by cbook@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/8f771233d340
b=Bug 1316803 part 1 - fold constant rhs for add/sub/shift/rotate/bitwise-ops. r=lth
Keywords: checkin-needed
The TODO comment said: optimize for lhs >= 0 && c is power of two. But we can replace division with shifts even for negative lhs. I did that.

The div/rem emit-methods for i64 are different for 32-bit platforms and 64-bit platforms. I've only modified the 64bit platforms. The 32 bit code was a little too entangled. I can do that as well, if we think that some improvements can be gained there (I recently saw numbers from some mozilla metrics page suggesting that 90% of all firefox users are on 32 bit platforms, so maybe we should consider that case as well).

I'll attach some embenchen benchmarks numbers. I saw some improvements.
Attachment #8820830 - Flags: review?(lhansen)
Comment on attachment 8820830 [details] [diff] [review]
bug1316803_part2_optimize_division_for_constant_rhs.patch

Review of attachment 8820830 [details] [diff] [review]:
-----------------------------------------------------------------

Sorry this took so long to review; Hawaii encourages laziness...

This is looking good but I'd like you to do another round with more test cases and with refactorings as suggested below.

I think not handling the 64-bit case on 32-bit platforms is fine.  The situation with many 32-bit users is temporary: The majority of Firefox users are on Windows, and we only recently started shipping 64-bit Windows builds.  Since 32-bit Firefox does not auto-upgrade to 64-bit the users must be persuaded to switch over; this is taking time.  I expect that we will soon have few desktop users that run 32-bit.  Mobile is a different matter but let's pick low-hanging fruit for now.  I'm much more interested in diverting your attention to the multiply-by-constant case than in fast 64-bit division :-)

::: js/src/jit-test/tests/wasm/integer.js
@@ +109,5 @@
>  testBinary32('div_s', -40, 2, -20);
>  testBinary32('div_u', -40, 2, 2147483628);
>  testBinary32('rem_s', 40, -3, 1);
>  testBinary32('rem_s', 0, -3, 0);
> +testBinary32('rem_s', 5, 2, 1);

I'd like to see a test that does not divide by 2, but by a higher power of two, since that is more interesting.  The same goes for div_s, div_u, and rem_u (which appears not to have a test for constant power-of-two rhs at all).

Since there are two code paths for div_s and rem_s depending on whether the lhs is negative or not, we want test cases here for both paths.

::: js/src/wasm/WasmBaselineCompile.cpp
@@ +4155,5 @@
>  void
>  BaseCompiler::emitQuotientI32()
>  {
> +    int32_t c;
> +    if (popConstI32(c)) {

Here I'd like to see a different pattern: instead of what you do here, introduce popConstPositivePowerOfTwoI32(uint32_t& c, uint8_fast_t& power) and use that instead - the second argument is the exponent.

(And since we know it is a power of two presumably the exponent can be computed quickly in that routine by counting leading or trailing zeroes, we may not need the full power / expense of FloorLog2?)

Then the code for the known-power-of-two case becomes the "then" branch for that condition, and the existing general code becomes the "else" branch, as so many other places in this compiler.  That gets rid of the try-to-pop-but-undo pattern used here.

Ditto for the 64-bit case.
Attachment #8820830 - Flags: review?(lhansen)
Changes v1 => v2
* Introduced popConstPositivePowerOfTwoIXX(). One quirk: I'm not allowing 1 as a valid power-of-two, to avoid problems with ARM shift instructions not accepting zero shift widths. The alternative was to add if statements for the shift-calls. This looks cleaner to me. 

* You suggested that maybe I should replace FloorLog2. I pasted the gcc mozilla::FloorLog2 code to https://godbolt.org/g/5zR0Fr; it generates a bsr instruction and a subtract. So I guess FloorLog2 is good enough.

* Added testcases for larger power-of-two divisors and negative dividends.
Attachment #8820830 - Attachment is obsolete: true
Attachment #8822264 - Flags: review?(lhansen)
Comment on attachment 8822264 [details] [diff] [review]
bug1316803_part2_v2_optimize_division_for_constant_rhs.patch

Review of attachment 8822264 [details] [diff] [review]:
-----------------------------------------------------------------

Nice work.

::: js/src/wasm/WasmBaselineCompile.cpp
@@ +1701,5 @@
>          stk_.popBack();
>          return true;
>      }
>  
> +    MOZ_MUST_USE bool popConstPositivePowerOfTwoI32(int32_t& c, uint_fast8_t &power) {

House style is "uint8_fast_t& power", note location of the space character.  Also below.

@@ +1708,5 @@
> +            return false;
> +        c = v.i32val();
> +        uint32_t u = static_cast<uint32_t>(c);
> +        // c=1 is a power-of-two, but we use |power| for shift amounts, and
> +        // some platforms don't support shifts by zero.

I can live with this and you can land it if you like, but consider the following.

For quotient, it would probably be better to test c <= 0 here and special-case c==1 in the code below (where the entire body of the constant-rhs case becomes wrapped by "if (power != 0) { ... }"), as this does not penalize simple code generators that emit e/1, which I can see as somewhat likely.

For remainder, e%1 is zero always, surely, so perhaps again the code below can guard on that and always push zero if power == 0?  This is not an important optimization but just a fallout of changing this routine here to benefit quotient.  The extra complexity is regrettable.

If it were me, I would probably pass the cutoff value for c as a parameter and pass '0' from quotient and '1' from remainder, just to keep things manageable.

Anyway, if you do decide on something fancier here I'd appreciate another look before you land.

@@ +1709,5 @@
> +        c = v.i32val();
> +        uint32_t u = static_cast<uint32_t>(c);
> +        // c=1 is a power-of-two, but we use |power| for shift amounts, and
> +        // some platforms don't support shifts by zero.
> +        if (c <= 1 || !IsPowerOfTwo(u))

Now I'm nitpicking but since u has a single use, why not expand the rhs of the definition for u into the argument of IsPowerOfTwo?  Indeed, doesn't the language provide int32_t -> uint32_t casting for us?  (Ditto below.)

@@ +1722,5 @@
> +        if (v.kind() != Stk::ConstI64)
> +            return false;
> +        c = v.i64val();
> +        uint64_t u = static_cast<uint64_t>(c);
> +        // See comment for popConstPositivePowerOfTwoI32

Might as well copy that comment down here, as it is quite short.
Attachment #8822264 - Flags: review?(lhansen) → review+
Changes v2 => v3
* Adjust whitespace and line length according to style guide.
* Handle power=0 by providing a cutoff and special if-statement for quotient
* Remove temp var for arg to IsPowerOfTwo; but it's a template function so I can't rely on implicit casting - need to provide an exact match.
Attachment #8822264 - Attachment is obsolete: true
Attachment #8822620 - Flags: review?(lhansen)
Comment on attachment 8822620 [details] [diff] [review]
bug1316803_part2_v3_optimize_division_for_constant_rhs.patch

Review of attachment 8822620 [details] [diff] [review]:
-----------------------------------------------------------------

Very nice.  Just a couple of nits to clean up here, I think we're otherwise ready to ship.

::: js/src/wasm/WasmBaselineCompile.cpp
@@ +1703,5 @@
>      }
>  
> +    MOZ_MUST_USE bool popConstPositivePowerOfTwoI32(int32_t& c,
> +                                                    uint_fast8_t& power,
> +                                                    int32_t cutoff = 0) {

Opening brace goes on next line when signature takes up more than one line.  Also below.  And I strongly prefer for the cutoff argument not to have a default value.

@@ +4205,5 @@
>  {
> +    int32_t c;
> +    uint_fast8_t power;
> +    if (popConstPositivePowerOfTwoI32(c, power)) {
> +        RegI32 r = popI32();

You can move pop/push into the if, since pushing what you just popped is a no-op -- except that it may also force the value into a register, so it's really best to just leave it alone.  Also below, several places.
Attachment #8822620 - Flags: review?(lhansen) → review+
(In reply to Lars T Hansen [:lth] from comment #54)
> Comment on attachment 8822620 [details] [diff] [review]
> bug1316803_part2_v3_optimize_division_for_constant_rhs.patch
> 
> @@ +4205,5 @@
> >  {
> > +    int32_t c;
> > +    uint_fast8_t power;
> > +    if (popConstPositivePowerOfTwoI32(c, power)) {
> > +        RegI32 r = popI32();
> 
> You can move pop/push into the if, since pushing what you just popped is a
> no-op -- except that it may also force the value into a register, so it's
> really best to just leave it alone. 

Poor English in my part - what I mean is, it really is best to move pop/push into the if(), leaving the value on the stack alone if the power is zero.  It won't matter in practice but it's more logical code.
Changes v3 => v4
* Fix brace placement
* Replace default param with explicit param
* Move push/pop inside the if statement.
Attachment #8822620 - Attachment is obsolete: true
Attachment #8823138 - Flags: review?(lhansen)
First attempt at optimizing multiply-by-constant.

I replace near-powers-of-two rhs with shl+add, shl or shl+sub. Ran the embenchen benchmark. No speedups.

Alternatives:
* Replace all rhs that has just one or two bits set, with shl+shl+add. E.g. replace 1010, but not 1110. mbx listed the most common multipliers for some wast source files here: https://bugzilla.mozilla.org/show_bug.cgi?id=1275442#c6  (I wonder if there's an already existing tool for analyzing wast/wasm source files like that or if he wrote an ad-hoc script). Many of the most common multipliers there have two bits set.
* Optimize each case for values below some threshold, say 10 or 20, individually.
* Add optimizations for negative multipliers as well, a neg instruction after the shl's+add's.

FWIW, here's a godbolt link where I list what optimizations gcc/clang/icc do https://godbolt.org/g/chLZtN
Comment on attachment 8823138 [details] [diff] [review]
bug1316803_part2_v4_optimize_division_for_constant_rhs.patch

Review of attachment 8823138 [details] [diff] [review]:
-----------------------------------------------------------------

Yeah, ship it.

As a matter of process: once you've got r+ with a request to fix a few straggling issues, it is usually not necessary to ask for review again.  Mostly people just push the patch, or update the patch while setting r+ themselves with a comment saying "Carrying r+", if they are allowed to do that.
Attachment #8823138 - Flags: review?(lhansen) → review+
Keywords: checkin-needed
Pushed by cbook@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/7410f0a2600e
part 2 - optimize division for constant rhs. r=lth
Keywords: checkin-needed
I've added optimizations for rhs with popcount == 2 in addition to rhs close to powers-of-two.

No speedups when running the embenchen benchmarks

So, either we're not calling the mult-with-constant code or that code is not significantly faster than the original.

I'll try lth's patch for annotating optimizations with counters to see how often we do run the mult-with-constant path.

I'll try to write an artifical wast benchmark program as well that does lots of multiplications with constants to see if I do get a speedup when the new code is called often.
Attachment #8823139 - Attachment is obsolete: true
(In reply to Daniel Näslund from comment #61)
> Created attachment 8825516 [details] [diff] [review]
> bug1316803_part3_v1_optimize_mult_for_constant_rhs.patch
> 
> No speedups when running the embenchen benchmarks
> 
> So, either we're not calling the mult-with-constant code or that code is not
> significantly faster than the original.

CPUs are getting really good at multiplication, probably a good thing.

I have an ARM dev board; my range check optimization, which yields little speedup on x86, is much more effective on the simpler ARM.  If I can find some time today I'll try your patch there to see if the same holds for multiplication.  (I saw a couple of big speedups after your division patch.)

It's quite possible that multiplications by constants are not common in current wasm code, the C++ compiler already having strength-reduced them.  That may change once we get other and less sophisticated producers of wasm code.  On balance, let's do the obvious things here -- popcount==2 and near-power-of-two are good -- but let's not overdo it.

--lars
Comment on attachment 8825516 [details] [diff] [review]
bug1316803_part3_v1_optimize_mult_for_constant_rhs.patch

Adding the r? flag to make it easier to find in your review queue.

As I mentioned earlier, I haven't been able to detect any speed ups with this patch on my Skylake i7 laptop. So I guess there are three options: 

1. Find some other, less capable, hardware and verify that this patch speeds up things there. Then ship the patch.
2. Assume that embenchen does a good job of strength reducing multiplication by constant, but that future wasm generators may be less capable. So, ship the patch.
3. Drop the patch.
Attachment #8825516 - Flags: review?(lhansen)
Attached file multiply.js
Simple microbenchmark.

Jetson TK1, release build:

Without the optimization: 562ms
With the optimization: 952ms

I'm pretty sure I got this right, so the optimization represents a significant performance regression.  I don't know why that is, I haven't had time to look any deeper.  The code in the patch looks all right though.

I've experimented briefly with other operands (not just 1*2 as here) without seeing any slowdown from the multiply instruction.

(The optimization does not change the running times of any of the benchmarks on ARM.)
Comment on attachment 8825516 [details] [diff] [review]
bug1316803_part3_v1_optimize_mult_for_constant_rhs.patch

Review of attachment 8825516 [details] [diff] [review]:
-----------------------------------------------------------------

With a fix to the benchmark to avoid the problem I've commented on below, things are getting more interesting.

- the case where we can transform a multiply to a shift is marginally faster than a multiply (520ms vs 560ms)
- the case where the multiply turns into a mov, a shift, and an add is almost 2x slower than multiply (950ms vs 560ms)

In both cases the lhs operands were smallish constants.  It's possible larger constants on both lhs and rhs will change this (no early exits) but I don't know.

Presumably these ratios are massively hardware-dependent, but the Tegra on the Jetson TK1 is not brand new, and our target is fairly high-end hardware, so let's assume it's representative.

I'll clear the r? flag for now.  I think that (a) we can probably land a fix for power-of-two only since that seems like a clear win and (b) benchmark more if we want to do near-power-of-two.

::: js/src/wasm/WasmBaselineCompile.cpp
@@ +1785,5 @@
> +            return false;
> +        uint64_t u = static_cast<uint64_t>(c);
> +        if (IsPowerOfTwo(u - 1))
> +            offset = -1;
> +        else if (IsPowerOfTwo(u))

We need to test IsPowerOfTwo(u) first because of the presumably common case '2', since 2-1==1 is also a power of 2, leading to pretty silly code for that case.
Attachment #8825516 - Flags: review?(lhansen)
That said, given how marginal the win is for the shift, I don't think this particular patch is high priority, in the event you are torn about what to spend time on.
In order to do much better on multiply (notably, better than a multiply instruction) we'll need to descend to the platform level to use LEA on x86 and add-shifted-operand on ARM, to keep the instructions counts down.  If we're going to do that kind of instruction selection we have a long list of opportunities, but that's a different and very low priority bug (which I will open, just to track it).

I think we're done here.
Status: NEW → RESOLVED
Closed: 7 years ago
Keywords: leave-open
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: