Closed Bug 1275442 Opened 8 years ago Closed 11 months ago

Optimize imul by small constants.

Categories

(Core :: JavaScript Engine: JIT, defect, P5)

defect

Tracking

()

RESOLVED INACTIVE

People

(Reporter: mbx, Unassigned, NeedInfo)

References

(Blocks 1 open bug)

Details

Attachments

(2 files, 3 obsolete files)

C/C++:
===============================
int f(int n) {
  int result = 0;
  for (int i = 0; i < n; i++) {
    result += i * i * i;
  }
  return result;
}

Firefox:
===============================
wasm-function[0]:
  sub rsp, 8                            ; 0x000020 48 83 ec 08
  cmp edi, 1                            ; 0x000024 83 ff 01
  jge 0x34                              ; 0x000027 0f 8d 07 00 00 00
 0x00002d:                              
  xor eax, eax                          ; 0x00002d 33 c0
  jmp 0x7e                              ; 0x00002f e9 4a 00 00 00
 0x000034:                              
  mov eax, edi                          ; 0x000034 8b c7
  add eax, -1                           ; 0x000036 83 c0 ff
  mov eax, eax                          ; 0x000039 8b c0
  mov ecx, edi                          ; 0x00003b 8b cf
  add ecx, -2                           ; 0x00003d 83 c1 fe
  mov ecx, ecx                          ; 0x000040 8b c9
  imul rax, rcx                         ; 0x000042 48 0f af c1
  mov rdx, rax                          ; 0x000046 48 8b d0
  shr rax, 1                            ; 0x000049 48 d1 e8
  mov eax, eax                          ; 0x00004c 8b c0
  imul eax, eax, 7                      ; 0x00004e 6b c0 07
  add eax, edi                          ; 0x000051 03 c7
  mov ecx, edi                          ; 0x000053 8b cf
  add ecx, -3                           ; 0x000055 83 c1 fd
  mov ecx, ecx                          ; 0x000058 8b c9
  imul rcx, rdx                         ; 0x00005a 48 0f af ca
  mov edx, ecx                          ; 0x00005e 8b d1
  shl edx, 1                            ; 0x000060 d1 e2
  and edx, 0xfffffffc                   ; 0x000062 83 e2 fc
  add eax, edx                          ; 0x000065 03 c2
  add edi, -4                           ; 0x000067 83 c7 fc
  mov edx, edi                          ; 0x00006a 8b d7
  imul rdx, rcx                         ; 0x00006c 48 0f af d1
  shr rdx, 2                            ; 0x000070 48 c1 ea 02
  mov ecx, edx                          ; 0x000074 8b ca
  and ecx, 0xfffffffe                   ; 0x000076 83 e1 fe
  add eax, ecx                          ; 0x000079 03 c1
  add eax, -1                           ; 0x00007b 83 c0 ff
 0x00007e:                              ; 0x00007e from: [0x00002f]
  nop                                   ; 0x00007e 66 90
  add rsp, 8                            ; 0x000080 48 83 c4 08
  ret                                   ; 0x000084 c3



LLVM:
===============================
	.text
	.intel_syntax noprefix
	.file	"upload/cbfba0a8e132fe2f920025a52e225237.cpp"
	.globl	_Z1fi
	.type	_Z1fi,@function
_Z1fi:
	.cfi_startproc
	xor	eax, eax
	test	edi, edi
	jle	.LBB0_2
	lea	eax, [rdi - 1]
	lea	ecx, [rdi - 2]
	imul	rcx, rax
	mov	rax, rcx
	shr	rax
	lea	edx, [8*rax]
	sub	edx, eax
	add	edx, edi
	lea	eax, [rdi - 3]
	add	edi, -4
	imul	rdi, rax
	imul	rdi, rcx
	imul	ecx, eax
	and	ecx, 2147483646
	lea	eax, [rdx + 2*rcx]
	shr	rdi, 2
	and	edi, -2
	lea	eax, [rdi + rax - 1]
.LBB0_2:
	ret
.Lfunc_end0:
	.size	_Z1fi, .Lfunc_end0-_Z1fi
	.cfi_endproc


	.ident	"clang version 3.9.0 (trunk 270097)"
	.section	".note.GNU-stack","",@progbits
Assignee: nobody → sunfish
At a first readthrough, the main differences are:
 - use lea instead of add/sub to eliminate copies and fold in shifts
 - use lea+sub instead of multiply by 7
Are these kinds of peephole optimizations something we should look into?
I benchmarked this test and it looks like the LLVM code is about 2x faster than what we JIT, so perhaps we should investigate this further. Below is the example I tested:

int f(int n) {
  int result = 0;
  for (int i = 0; i < n; i++) {
    result += i * i * i;
  }
  return result;
}

int main(int argc, char **argv) {
  int s = 0;
  for (int i = 0; i < 100000000; i++) {
    s += f(100000000 + i);  
  }
  return s;
}
This is an artificial microbenchmark, so it's not easy to assign it a priority, however these are generally useful optimizations.

Implementing the multiply-by-near-power-of-2 optimizations would be straightforward; I'd be happy to mentor someone and/or review patches for that.

Lea pattern-matching is a little more complex. We could do some of it at the MIR level. There's also the case where it eliminates a copy which isn't visible until register allocation. In LLVM for example, the register allocator knows how to convert an add into an lea on demand when it needs to.
Out of curiosity, this is what LLVM generates for i * [2 .. 20]

* 2

lea	eax, [rdi + rdi]

* 3

lea	eax, [rdi + 2*rdi]

* 4

lea	eax, [4*rdi]

* 5

lea	eax, [rdi + 4*rdi]

* 6

add	edi, edi
lea	eax, [rdi + 2*rdi]

* 7

lea	eax, [8*rdi]
sub	eax, edi

* 8

lea	eax, [8*rdi]

* 9

lea	eax, [rdi + 8*rdi]

* 10

add	edi, edi
lea	eax, [rdi + 4*rdi]

* 11

imul	eax, edi, 11

* 12

shl	edi, 2
lea	eax, [rdi + 2*rdi]

* 13, 14

imul	eax, edi, 13

* 15

lea	eax, [rdi + 4*rdi]
lea	eax, [rax + 2*rax]

* 16

shl	edi, 4

* 17

mov	eax, edi
shl	eax, 4
lea	eax, [rax + rdi]

* 18

add	edi, edi
lea	eax, [rdi + 8*rdi]

* 19

imul	eax, edi, 19

* 20

shl	edi, 2
lea	eax, [rdi + 4*rdi]

See: https://raw.githubusercontent.com/llvm-mirror/llvm/master/lib/Target/X86/X86ISelLowering.cpp in combineMul
Summary: WasmExplorer Test Bug: LLVM generates tighter code. → Optimize imul by small constants.
For reference, the most common multipliers in AB.wast and BB.wast are:

BB.wast:

1394 (i32.const 12)
 630 (i32.const 40)
 604 (i32.const 24)
 477 (i32.const 36)
 428 (i32.const 48)
 316 (i32.const 60)
 291 (i32.const 20)
 288 (i32.const 44)
 274 (i32.const 52)
 259 (i32.const 112)
 223 (i32.const 28)
 208 (i32.const 120)
 168 (i32.const 84)
 145 (i32.const 260)
 137 (i32.const 1120)
 133 (i32.const 96)
 122 (i32.const 336)
 114 (i32.const 100)
 110 (i32.const 3)
 108 (i32.const 10)
  91 (i32.const 6)
  91 (i32.const 33)
  89 (i32.const 560)
  68 (i32.const 272)
  66 (i32.const 76)
  66 (i32.const 72)
  53 (i32.const 264)
  52 (i32.const 108)
  51 (i32.const 156)
  48 (i32.const 80)
  46 (i32.const 320)
  41 (i32.const 56)
  40 (i32.const 200)
  33 (i32.const 5)
  30 (i32.const 148)
  23 (i32.const 24584)
  20 (i32.const 1000)
  18 (i32.const 364)
  15 (i32.const 9)
  15 (i32.const 88)
  15 (i32.const 268)
  10 (i32.const 252)


AB.wast

4971 (i32.const 12)
2807 (i32.const 24)
1682 (i32.const 40)
1351 (i32.const 48)
1167 (i32.const 20)
 835 (i32.const 28)
 477 (i32.const 224)
 425 (i32.const 52)
 395 (i32.const 36)
 390 (i32.const 284)
 384 (i32.const 56)
 357 (i32.const 96)
 295 (i32.const 148)
 284 (i32.const 76)
 265 (i32.const 400)
 232 (i32.const 944)
 220 (i32.const 3)
 220 (i32.const 176)
 213 (i32.const 80)
 209 (i32.const 160)
 193 (i32.const 144)
 182 (i32.const 44)
 174 (i32.const 60)
 173 (i32.const 72)
 167 (i32.const 216)
 137 (i32.const 272)
 136 (i32.const 68)
 114 (i32.const 9)
  98 (i32.const 112)
  97 (i32.const 104)
  95 (i32.const 125613361)
  82 (i32.const 1048)
  79 (i32.const 336)
  76 (i32.const 92)
  74 (i32.const 1128)
  73 (i32.const 288)
  71 (i32.const 6)
  71 (i32.const 120)
  66 (i32.const 364)
  65 (i32.const 84)
  61 (i32.const 1120)
  59 (i32.const 240)
  54 (i32.const 960)
  50 (i32.const 192)
  47 (i32.const 164)
  47 (i32.const 10)
  29 (i32.const 1088)
  27 (i32.const 50)
  26 (i32.const 130329821)
  21 (i32.const 190)
  19 (i32.const 108)
  18 (i32.const 196)
  17 (i32.const 468)
  16 (i32.const 33)
  16 (i32.const 292)
  15 (i32.const 1540483477)
  15 (i32.const 140)
  14 (i32.const 16843009)
  14 (i32.const 100)
  11 (i32.const -1640531535)
  10 (i32.const 7919)
  10 (i32.const 486187739)
  10 (i32.const 101)
Attached patch patch.diff (obsolete) — Splinter Review
I don't yet handle the +/- powerOfTwo case because that requires a shift + add, and the add needs a copy of the original LHS. I was getting some assertions when trying to access lhsCopy.
Attachment #8756737 - Flags: review?(sunfish)
I should add that in micro-benchmarks this gives ~20% speed improvement whenever the optimization kicks in.
Attached patch patch.diff (obsolete) — Splinter Review
Handle 2^x +/- 1 case.
Attachment #8756737 - Attachment is obsolete: true
Attachment #8756737 - Flags: review?(sunfish)
Attachment #8757012 - Flags: review?(sunfish)
Comment on attachment 8756737 [details] [diff] [review]
patch.diff

Review of attachment 8756737 [details] [diff] [review]:
-----------------------------------------------------------------

Overall looks good; just a few concerns.

::: js/src/jit/x64/CodeGenerator-x64.cpp
@@ +427,5 @@
> +            } else if ((constant % 3) == 0) {
> +                a = 3;
> +                b = constant / 3;
> +            }
> +            if (b && (IsPowerOfTwo(b) || b == 3 || b == 5 || b == 9)) {

The |b &&| part here is redundant, because IsPowerOfTwo correctly returns false for zero. (And same in the 32-bit code below).

@@ +428,5 @@
> +                a = 3;
> +                b = constant / 3;
> +            }
> +            if (b && (IsPowerOfTwo(b) || b == 3 || b == 5 || b == 9)) {
> +              if (IsPowerOfTwo(a)) {

It looks like there are no paths to this point where |a| could be a power of 2. Are there supposed to be more cases above?

@@ +432,5 @@
> +              if (IsPowerOfTwo(a)) {
> +                masm.shlq(Imm32(FloorLog2(a)), lhs);
> +              } else {
> +                Scale scale = ScaleFromElemWidth(a - 1);
> +                masm.leal(Operand(lhs, lhs, scale, 0), lhs);

This is for visitMulI64, so we need to do a 64-bit multiply here. |leal| is the 32-bit instruction; we need |leaq| here (and below).
Attachment #8756737 - Attachment is obsolete: false
Attached patch patch.diff (obsolete) — Splinter Review
Address review comments.
Attachment #8756737 - Attachment is obsolete: true
Attachment #8757012 - Attachment is obsolete: true
Attachment #8757012 - Flags: review?(sunfish)
Attachment #8757427 - Flags: review?(sunfish)
Attachment #8757427 - Attachment is obsolete: true
Attachment #8757427 - Flags: review?(sunfish)
Comment on attachment 8757531 [details]
MozReview Request: Bug 1275442 - Optimize imul by small constants. r=sunfish

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/55972/diff/1-2/
https://reviewboard.mozilla.org/r/55972/#review53112

::: js/src/jit/x86-shared/Lowering-x86-shared.cpp:199
(Diff revision 1)
>  
>  void
>  LIRGeneratorX86Shared::lowerMulI(MMul* mul, MDefinition* lhs, MDefinition* rhs)
>  {
> -    // Note: If we need a negative zero check, lhs is used twice.
> -    LAllocation lhsCopy = mul->canBeNegativeZero() ? use(lhs) : LAllocation();
> +    // Note: If we need a negative zero check or rhs is constant, lhs is used twice.
> +    LAllocation lhsCopy = (rhs->isConstant() || mul->canBeNegativeZero()) ? use(lhs) : LAllocation();

Requiring a copy of lhs in all cases where rhs is constant could regress performance on the cases that don't needed (actual power-of-2, or cases that don't use the lea/shl/add/sub optimization).

I agree that it's inconvenient to have lowering have detailed knowledge of what codegen is going to do, but it's difficult to avoid that in the current system. Would it be possible to factor out the code that decides what to do for a given constant into a standalone utility routine that can be used from both places?
Comment on attachment 8757531 [details]
MozReview Request: Bug 1275442 - Optimize imul by small constants. r=sunfish

Overall this patch looks good; it's just waiting on one last issue.
Attachment #8757531 - Flags: review?(sunfish)
mbx, can we get this finished/landed?
Flags: needinfo?(mbebenita)
Priority: -- → P5
Blocks: wasm-perf
No longer blocks: wasm
Flags: needinfo?(bbouvier)
Flags: needinfo?(bbouvier)
Attempt at updating mbx patch to latest HEAD. (I hope mbx is ok with me, taking his patch). Does not pass tests at the moment.

Changes:
* update patch to reflect file renames
* add hack to make the Operand constructor accept 64 bit registers

Question:
In bug 1279248, visitMulI64 was rewritten to be shared between x86 and x64 https://hg.mozilla.org/integration/mozilla-inbound/rev/0c56943e6d0e. It now uses MacroAssembler methods that takes Register64 parameters.

The optimization in the original mbx patch relies on the use of the lea instruction, which uses a Operand parameter for specifying the addressing mode and "adressing parameters". 

But Operand does not accept Register64 parameters. Can I just add new constructors to Operands as in this patch? But then, what would I have to do when running on plain x86? If a Register64 is represented as a (low, high) pair of 32 bit registers then I can't use it as lea parameters. The "x86 reg64 => (low, hight) lowering dance" would surely eat up the gain of using lea for multiplications.

Should I make one version of the code for x64 with optimizations and another one for x86 without those optimizations? If so, maybe place them in separate files CodeGenerator-x64 and CodeGenerator-x86?
Assignee: sunfish → dannas
Attachment #8836856 - Flags: feedback?(sunfish)
Comment on attachment 8836856 [details] [diff] [review]
bug1275442_WIP_optimize_imul_by_small_constants-attempt-at-rebasing-to-tip.patch

Review of attachment 8836856 [details] [diff] [review]:
-----------------------------------------------------------------

(In reply to Daniel Näslund from comment #18)
> Question:
> In bug 1279248, visitMulI64 was rewritten to be shared between x86 and x64
> https://hg.mozilla.org/integration/mozilla-inbound/rev/0c56943e6d0e. It now
> uses MacroAssembler methods that takes Register64 parameters.
>
> The optimization in the original mbx patch relies on the use of the lea
> instruction, which uses a Operand parameter for specifying the addressing
> mode and "adressing parameters". 
> 
> But Operand does not accept Register64 parameters. Can I just add new
> constructors to Operands as in this patch? But then, what would I have to do
> when running on plain x86? If a Register64 is represented as a (low, high)
> pair of 32 bit registers then I can't use it as lea parameters. The "x86
> reg64 => (low, hight) lowering dance" would surely eat up the gain of using
> lea for multiplications.

Plain x86 doesn't have a 64-bit lea, so unless I've misunderstood what
you're doing here, it wouldn't be able to use the same algorithm for 64-bit
multiplies anyway (or at least, not without additional cleverness). So it
makes sense to me to add an Operand constructor for Register64, as this
patch does, and just avoid using it on x86 (for now).

> Should I make one version of the code for x64 with optimizations and another
> one for x86 without those optimizations? If so, maybe place them in separate
> files CodeGenerator-x64 and CodeGenerator-x86?

Since x64 has a 64-bit lea and 32-bit x86 doesn't, and that significantly
affects what optimizations can be done, it would make sense to me to place
them in separate CodeGenerator-x64 and CodeGenerator-x86 files.
Attachment #8836856 - Flags: feedback?(sunfish) → feedback+
See Also: → 1712321
Assignee: dannas → nobody
Severity: normal → S3
Status: NEW → RESOLVED
Closed: 11 months ago
Resolution: --- → INACTIVE
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: