Closed
Bug 1275442
Opened 8 years ago
Closed 11 months ago
Optimize imul by small constants.
Categories: Core :: JavaScript Engine: JIT, defect, P5
Tracking: RESOLVED INACTIVE
People
(Reporter: mbx, Unassigned, NeedInfo)
References
(Blocks 1 open bug)
Details
Attachments
(2 files, 3 obsolete files)
58 bytes, text/x-review-board-request
11.17 KB, patch, sunfish: feedback+
C/C++:
===============================
int f(int n) {
  int result = 0;
  for (int i = 0; i < n; i++) {
    result += i * i * i;
  }
  return result;
}

Firefox:
===============================
wasm-function[0]:
  sub rsp, 8            ; 0x000020 48 83 ec 08
  cmp edi, 1            ; 0x000024 83 ff 01
  jge 0x34              ; 0x000027 0f 8d 07 00 00 00
0x00002d:
  xor eax, eax          ; 0x00002d 33 c0
  jmp 0x7e              ; 0x00002f e9 4a 00 00 00
0x000034:
  mov eax, edi          ; 0x000034 8b c7
  add eax, -1           ; 0x000036 83 c0 ff
  mov eax, eax          ; 0x000039 8b c0
  mov ecx, edi          ; 0x00003b 8b cf
  add ecx, -2           ; 0x00003d 83 c1 fe
  mov ecx, ecx          ; 0x000040 8b c9
  imul rax, rcx         ; 0x000042 48 0f af c1
  mov rdx, rax          ; 0x000046 48 8b d0
  shr rax, 1            ; 0x000049 48 d1 e8
  mov eax, eax          ; 0x00004c 8b c0
  imul eax, eax, 7      ; 0x00004e 6b c0 07
  add eax, edi          ; 0x000051 03 c7
  mov ecx, edi          ; 0x000053 8b cf
  add ecx, -3           ; 0x000055 83 c1 fd
  mov ecx, ecx          ; 0x000058 8b c9
  imul rcx, rdx         ; 0x00005a 48 0f af ca
  mov edx, ecx          ; 0x00005e 8b d1
  shl edx, 1            ; 0x000060 d1 e2
  and edx, 0xfffffffc   ; 0x000062 83 e2 fc
  add eax, edx          ; 0x000065 03 c2
  add edi, -4           ; 0x000067 83 c7 fc
  mov edx, edi          ; 0x00006a 8b d7
  imul rdx, rcx         ; 0x00006c 48 0f af d1
  shr rdx, 2            ; 0x000070 48 c1 ea 02
  mov ecx, edx          ; 0x000074 8b ca
  and ecx, 0xfffffffe   ; 0x000076 83 e1 fe
  add eax, ecx          ; 0x000079 03 c1
  add eax, -1           ; 0x00007b 83 c0 ff
0x00007e:               ; from: [0x00002f]
  nop                   ; 0x00007e 66 90
  add rsp, 8            ; 0x000080 48 83 c4 08
  ret                   ; 0x000084 c3

LLVM:
===============================
	.text
	.intel_syntax noprefix
	.file "upload/cbfba0a8e132fe2f920025a52e225237.cpp"
	.globl _Z1fi
	.type _Z1fi,@function
_Z1fi:
	.cfi_startproc
	xor eax, eax
	test edi, edi
	jle .LBB0_2
	lea eax, [rdi - 1]
	lea ecx, [rdi - 2]
	imul rcx, rax
	mov rax, rcx
	shr rax
	lea edx, [8*rax]
	sub edx, eax
	add edx, edi
	lea eax, [rdi - 3]
	add edi, -4
	imul rdi, rax
	imul rdi, rcx
	imul ecx, eax
	and ecx, 2147483646
	lea eax, [rdx + 2*rcx]
	shr rdi, 2
	and edi, -2
	lea eax, [rdi + rax - 1]
.LBB0_2:
	ret
.Lfunc_end0:
	.size _Z1fi, .Lfunc_end0-_Z1fi
	.cfi_endproc
	.ident "clang version 3.9.0 (trunk 270097)"
	.section ".note.GNU-stack","",@progbits
Reporter
Updated•8 years ago
Assignee: nobody → sunfish
Comment 1•8 years ago
At a first readthrough, the main differences are:
- use lea instead of add/sub to eliminate copies and fold in shifts
- use lea+sub instead of multiply by 7
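The two differences called out above come down to simple identities; here is a minimal standalone sketch (not SpiderMonkey code) of what the lea forms compute:

```cpp
#include <cassert>
#include <cstdint>

// lea dst, [src + disp] computes an add into a fresh register without
// clobbering src, so no separate mov (copy) instruction is needed.
int32_t lea_add(int32_t src, int32_t disp) { return src + disp; }

// lea eax, [8*rdi]; sub eax, edi: multiply by 7 as (x << 3) - x,
// replacing the imul eax, eax, 7 that the Firefox output uses.
int32_t mul7(int32_t x) { return (x << 3) - x; }
```

The sub form trades one multiply for a shift folded into lea plus a subtract, which is cheaper on typical x86 cores.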
Reporter
Comment 2•8 years ago
Are these kinds of peephole optimizations something we should look into?
Reporter
Comment 3•8 years ago
I benchmarked this test and it looks like the LLVM code is about 2x faster than what we JIT, so perhaps we should investigate this further. Below is the example I tested:

int f(int n) {
  int result = 0;
  for (int i = 0; i < n; i++) {
    result += i * i * i;
  }
  return result;
}

int main(int argc, char **argv) {
  int s = 0;
  for (int i = 0; i < 100000000; i++) {
    s += f(100000000 + i);
  }
  return s;
}
Comment 4•8 years ago
This is an artificial microbenchmark, so it's not easy to assign it a priority; however, these are generally useful optimizations. Implementing the multiply-by-near-power-of-2 optimizations would be straightforward; I'd be happy to mentor someone and/or review patches for that. Lea pattern-matching is a little more complex. We could do some of it at the MIR level. There's also the case where lea eliminates a copy, which isn't visible until register allocation. In LLVM, for example, the register allocator knows how to convert an add into an lea on demand when it needs to.
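A minimal sketch of the multiply-by-near-power-of-2 rewrite mentioned above (standalone C++, not the actual SpiderMonkey codegen; `__builtin_ctz` assumes GCC/Clang):

```cpp
#include <cassert>
#include <cstdint>

static bool isPow2(uint32_t v) { return v != 0 && (v & (v - 1)) == 0; }

// Multiply by c when c is 2^k - 1 or 2^k + 1: one shift plus one
// add/sub instead of a general imul.
int32_t mulNearPow2(int32_t x, uint32_t c) {
    if (isPow2(c + 1))                          // c == 2^k - 1
        return (x << __builtin_ctz(c + 1)) - x;
    if (isPow2(c - 1))                          // c == 2^k + 1
        return (x << __builtin_ctz(c - 1)) + x;
    return x * (int32_t)c;                      // general case: imul
}
```

For example, c = 7 becomes (x << 3) - x and c = 9 becomes (x << 3) + x; anything else falls through to the plain multiply.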
Reporter
Comment 5•8 years ago
Out of curiosity, this is what LLVM generates for i * [2 .. 20]:

* 2      lea eax, [rdi + rdi]
* 3      lea eax, [rdi + 2*rdi]
* 4      lea eax, [4*rdi]
* 5      lea eax, [rdi + 4*rdi]
* 6      add edi, edi
         lea eax, [rdi + 2*rdi]
* 7      lea eax, [8*rdi]
         sub eax, edi
* 8      lea eax, [8*rdi]
* 9      lea eax, [rdi + 8*rdi]
* 10     add edi, edi
         lea eax, [rdi + 4*rdi]
* 11     imul eax, edi, 11
* 12     shl edi, 2
         lea eax, [rdi + 2*rdi]
* 13, 14 imul eax, edi, 13
* 15     lea eax, [rdi + 4*rdi]
         lea eax, [rax + 2*rax]
* 16     shl edi, 4
* 17     mov eax, edi
         shl eax, 4
         lea eax, [rax + rdi]
* 18     add edi, edi
         lea eax, [rdi + 8*rdi]
* 19     imul eax, edi, 19
* 20     shl edi, 2
         lea eax, [rdi + 4*rdi]

See combineMul in https://raw.githubusercontent.com/llvm-mirror/llvm/master/lib/Target/X86/X86ISelLowering.cpp
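lea supports scale factors of 1, 2, 4 and 8, which is why the table above never needs more than two cheap instructions for these constants. A few of the identities it relies on, checked directly (standalone sketch, not jit code):

```cpp
#include <cassert>
#include <cstdint>

int32_t mul3(int32_t x)  { return x + 2 * x; }                        // lea [rdi + 2*rdi]
int32_t mul5(int32_t x)  { return x + 4 * x; }                        // lea [rdi + 4*rdi]
int32_t mul6(int32_t x)  { int32_t t = x + x; return t + 2 * t; }     // add edi, edi; lea
int32_t mul9(int32_t x)  { return x + 8 * x; }                        // lea [rdi + 8*rdi]
int32_t mul15(int32_t x) { int32_t t = x + 4 * x; return t + 2 * t; } // two chained leas
```

Constants like 11, 13 and 19 don't factor into these patterns within two steps, which is why LLVM keeps the imul for them.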
Reporter
Updated•8 years ago
Summary: WasmExplorer Test Bug: LLVM generates tighter code. → Optimize imul by small constants.
Reporter
Comment 6•8 years ago
For reference, the most common multipliers in AB.wast and BB.wast are:

BB.wast:
 1394 (i32.const 12)
  630 (i32.const 40)
  604 (i32.const 24)
  477 (i32.const 36)
  428 (i32.const 48)
  316 (i32.const 60)
  291 (i32.const 20)
  288 (i32.const 44)
  274 (i32.const 52)
  259 (i32.const 112)
  223 (i32.const 28)
  208 (i32.const 120)
  168 (i32.const 84)
  145 (i32.const 260)
  137 (i32.const 1120)
  133 (i32.const 96)
  122 (i32.const 336)
  114 (i32.const 100)
  110 (i32.const 3)
  108 (i32.const 10)
   91 (i32.const 6)
   91 (i32.const 33)
   89 (i32.const 560)
   68 (i32.const 272)
   66 (i32.const 76)
   66 (i32.const 72)
   53 (i32.const 264)
   52 (i32.const 108)
   51 (i32.const 156)
   48 (i32.const 80)
   46 (i32.const 320)
   41 (i32.const 56)
   40 (i32.const 200)
   33 (i32.const 5)
   30 (i32.const 148)
   23 (i32.const 24584)
   20 (i32.const 1000)
   18 (i32.const 364)
   15 (i32.const 9)
   15 (i32.const 88)
   15 (i32.const 268)
   10 (i32.const 252)

AB.wast:
 4971 (i32.const 12)
 2807 (i32.const 24)
 1682 (i32.const 40)
 1351 (i32.const 48)
 1167 (i32.const 20)
  835 (i32.const 28)
  477 (i32.const 224)
  425 (i32.const 52)
  395 (i32.const 36)
  390 (i32.const 284)
  384 (i32.const 56)
  357 (i32.const 96)
  295 (i32.const 148)
  284 (i32.const 76)
  265 (i32.const 400)
  232 (i32.const 944)
  220 (i32.const 3)
  220 (i32.const 176)
  213 (i32.const 80)
  209 (i32.const 160)
  193 (i32.const 144)
  182 (i32.const 44)
  174 (i32.const 60)
  173 (i32.const 72)
  167 (i32.const 216)
  137 (i32.const 272)
  136 (i32.const 68)
  114 (i32.const 9)
   98 (i32.const 112)
   97 (i32.const 104)
   95 (i32.const 125613361)
   82 (i32.const 1048)
   79 (i32.const 336)
   76 (i32.const 92)
   74 (i32.const 1128)
   73 (i32.const 288)
   71 (i32.const 6)
   71 (i32.const 120)
   66 (i32.const 364)
   65 (i32.const 84)
   61 (i32.const 1120)
   59 (i32.const 240)
   54 (i32.const 960)
   50 (i32.const 192)
   47 (i32.const 164)
   47 (i32.const 10)
   29 (i32.const 1088)
   27 (i32.const 50)
   26 (i32.const 130329821)
   21 (i32.const 190)
   19 (i32.const 108)
   18 (i32.const 196)
   17 (i32.const 468)
   16 (i32.const 33)
   16 (i32.const 292)
   15 (i32.const 1540483477)
   15 (i32.const 140)
   14 (i32.const 16843009)
   14 (i32.const 100)
   11 (i32.const -1640531535)
   10 (i32.const 7919)
   10 (i32.const 486187739)
   10 (i32.const 101)
Reporter
Comment 7•8 years ago
I don't yet handle the +/- powerOfTwo case because that requires a shift + add, and the add needs a copy of the original LHS. I was getting some assertion failures when trying to access lhsCopy.
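The reason a copy is needed: x86 shifts are destructive, so shl overwrites lhs while the trailing add/sub still wants the original value. An illustrative sketch of the register lifetimes (hypothetical names, not the actual jit code):

```cpp
#include <cassert>
#include <cstdint>

// x*9 == (x << 3) + x. The shift destroys x, so a second live copy of
// the original value must exist -- this is what lhsCopy provides in
// lowering so the register allocator keeps it alive across the shift.
int32_t mul9TwoAddress(int32_t x) {
    int32_t lhsCopy = x;   // mov lhsCopy, lhs
    x <<= 3;               // shl lhs, 3  (original lhs is now gone)
    return x + lhsCopy;    // add lhs, lhsCopy
}
```

A plain power-of-two multiply needs no such copy, since the single shl is the entire sequence.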
Attachment #8756737 -
Flags: review?(sunfish)
Reporter
Comment 8•8 years ago
I should add that in micro-benchmarks this gives ~20% speed improvement whenever the optimization kicks in.
Reporter
Comment 9•8 years ago
https://treeherder.mozilla.org/#/jobs?repo=try&revision=29f9ac5539c0
Reporter
Comment 10•8 years ago
Handle 2^x +/- 1 case.
Attachment #8756737 -
Attachment is obsolete: true
Attachment #8756737 -
Flags: review?(sunfish)
Reporter
Updated•8 years ago
Attachment #8757012 -
Flags: review?(sunfish)
Comment 11•8 years ago
Comment on attachment 8756737 [details] [diff] [review]
patch.diff

Review of attachment 8756737 [details] [diff] [review]:
-----------------------------------------------------------------

Overall looks good; just a few concerns.

::: js/src/jit/x64/CodeGenerator-x64.cpp
@@ +427,5 @@
> +    } else if ((constant % 3) == 0) {
> +        a = 3;
> +        b = constant / 3;
> +    }
> +    if (b && (IsPowerOfTwo(b) || b == 3 || b == 5 || b == 9)) {

The |b &&| part here is redundant, because IsPowerOfTwo correctly returns false for zero. (And same in the 32-bit code below.)

@@ +428,5 @@
> +        a = 3;
> +        b = constant / 3;
> +    }
> +    if (b && (IsPowerOfTwo(b) || b == 3 || b == 5 || b == 9)) {
> +        if (IsPowerOfTwo(a)) {

It looks like there are no paths to this point where |a| could be a power of 2. Are there supposed to be more cases above?

@@ +432,5 @@
> +        if (IsPowerOfTwo(a)) {
> +            masm.shlq(Imm32(FloorLog2(a)), lhs);
> +        } else {
> +            Scale scale = ScaleFromElemWidth(a - 1);
> +            masm.leal(Operand(lhs, lhs, scale, 0), lhs);

This is for visitMulI64, so we need to do a 64-bit multiply here. |leal| is the 32-bit instruction; we need |leaq| here (and below).
Attachment #8756737 -
Attachment is obsolete: false
Reporter
Comment 12•8 years ago
Address review comments.
Attachment #8756737 -
Attachment is obsolete: true
Attachment #8757012 -
Attachment is obsolete: true
Attachment #8757012 -
Flags: review?(sunfish)
Attachment #8757427 -
Flags: review?(sunfish)
Reporter
Comment 13•8 years ago
Review commit: https://reviewboard.mozilla.org/r/55972/diff/#index_header See other reviews: https://reviewboard.mozilla.org/r/55972/
Attachment #8757531 -
Flags: review?(sunfish)
Reporter
Updated•8 years ago
Attachment #8757427 -
Attachment is obsolete: true
Attachment #8757427 -
Flags: review?(sunfish)
Reporter
Comment 14•8 years ago
Comment on attachment 8757531 [details] MozReview Request: Bug 1275442 - Optimize imul by small constants. r=sunfish Review request updated; see interdiff: https://reviewboard.mozilla.org/r/55972/diff/1-2/
Comment 15•8 years ago
https://reviewboard.mozilla.org/r/55972/#review53112

::: js/src/jit/x86-shared/Lowering-x86-shared.cpp:199
(Diff revision 1)
>  void
>  LIRGeneratorX86Shared::lowerMulI(MMul* mul, MDefinition* lhs, MDefinition* rhs)
>  {
> -    // Note: If we need a negative zero check, lhs is used twice.
> -    LAllocation lhsCopy = mul->canBeNegativeZero() ? use(lhs) : LAllocation();
> +    // Note: If we need a negative zero check or rhs is constant, lhs is used twice.
> +    LAllocation lhsCopy = (rhs->isConstant() || mul->canBeNegativeZero()) ? use(lhs) : LAllocation();

Requiring a copy of lhs in all cases where rhs is constant could regress performance in the cases that don't need it (an actual power of two, or cases that don't use the lea/shl/add/sub optimization). I agree that it's inconvenient for lowering to have detailed knowledge of what codegen is going to do, but that's difficult to avoid in the current system. Would it be possible to factor out the code that decides what to do for a given constant into a standalone utility routine that can be used from both places?
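A sketch of the standalone utility the reviewer is asking for (hypothetical name and shape): one predicate consulted both by lowering, to decide whether to allocate lhsCopy, and by codegen, to decide which sequence to emit:

```cpp
#include <cassert>
#include <cstdint>

static bool isPow2(uint32_t v) { return v != 0 && (v & (v - 1)) == 0; }

// True only for constants whose lowered sequence (shl, then add/sub of
// the original lhs, i.e. c == 2^k +/- 1) actually reads lhs twice.
// Plain powers of two lower to a single shl and need no copy.
bool mulConstantNeedsLhsCopy(uint32_t c) {
    if (isPow2(c))
        return false;
    return isPow2(c - 1) || isPow2(c + 1);
}
```

With this shared, lowering would request the copy only when `mulConstantNeedsLhsCopy(rhs)` holds (or a negative-zero check is needed), avoiding the blanket `rhs->isConstant()` regression.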
Comment 16•8 years ago
Comment on attachment 8757531 [details] MozReview Request: Bug 1275442 - Optimize imul by small constants. r=sunfish Overall this patch looks good; it's just waiting on one last issue.
Attachment #8757531 -
Flags: review?(sunfish)
Comment 17•8 years ago
mbx, can we get this finished/landed?
Flags: needinfo?(mbebenita)
Priority: -- → P5
Updated•8 years ago
Updated•8 years ago
Flags: needinfo?(bbouvier)
Updated•7 years ago
Flags: needinfo?(bbouvier)
Comment 18•7 years ago
Attempt at updating mbx's patch to the latest HEAD. (I hope mbx is ok with me taking over his patch.) It does not pass tests at the moment.

Changes:
* update the patch to reflect file renames
* add a hack to make the Operand constructor accept 64-bit registers

Question:
In bug 1279248, visitMulI64 was rewritten to be shared between x86 and x64: https://hg.mozilla.org/integration/mozilla-inbound/rev/0c56943e6d0e. It now uses MacroAssembler methods that take Register64 parameters.

The optimization in the original mbx patch relies on the lea instruction, which takes an Operand parameter specifying the addressing mode and addressing parameters.

But Operand does not accept Register64 parameters. Can I just add new constructors to Operand as in this patch? But then, what would I have to do when running on plain x86? If a Register64 is represented as a (low, high) pair of 32-bit registers, then I can't use it as a lea parameter. The "x86 reg64 => (low, high) lowering dance" would surely eat up the gain of using lea for multiplications.

Should I make one version of the code for x64 with optimizations and another one for x86 without them? If so, maybe place them in the separate files CodeGenerator-x64 and CodeGenerator-x86?
Assignee: sunfish → dannas
Attachment #8836856 -
Flags: feedback?(sunfish)
Comment 19•7 years ago
Comment on attachment 8836856 [details] [diff] [review]
bug1275442_WIP_optimize_imul_by_small_constants-attempt-at-rebasing-to-tip.patch

Review of attachment 8836856 [details] [diff] [review]:
-----------------------------------------------------------------

(In reply to Daniel Näslund from comment #18)
> Question:
> In bug 1279248, visitMulI64 was rewritten to be shared between x86 and x64:
> https://hg.mozilla.org/integration/mozilla-inbound/rev/0c56943e6d0e. It now
> uses MacroAssembler methods that take Register64 parameters.
>
> The optimization in the original mbx patch relies on the lea instruction,
> which takes an Operand parameter specifying the addressing mode and
> addressing parameters.
>
> But Operand does not accept Register64 parameters. Can I just add new
> constructors to Operand as in this patch? But then, what would I have to do
> when running on plain x86? If a Register64 is represented as a (low, high)
> pair of 32-bit registers, then I can't use it as a lea parameter. The "x86
> reg64 => (low, high) lowering dance" would surely eat up the gain of using
> lea for multiplications.

Plain x86 doesn't have a 64-bit lea, so unless I've misunderstood what you're doing here, it wouldn't be able to use the same algorithm for 64-bit multiplies anyway (or at least, not without additional cleverness). So it makes sense to me to add an Operand constructor for Register64, as this patch does, and just avoid using it on x86 (for now).

> Should I make one version of the code for x64 with optimizations and another
> one for x86 without them? If so, maybe place them in the separate files
> CodeGenerator-x64 and CodeGenerator-x86?

Since x64 has a 64-bit lea and 32-bit x86 doesn't, and that significantly affects what optimizations can be done, it would make sense to me to place them in separate CodeGenerator-x64 and CodeGenerator-x86 files.
Attachment #8836856 -
Flags: feedback?(sunfish) → feedback+
Updated•3 years ago
Assignee: dannas → nobody
Updated•2 years ago
Severity: normal → S3
Updated•11 months ago
Status: NEW → RESOLVED
Closed: 11 months ago
Resolution: --- → INACTIVE