Firefox takes 1.7x time as Chrome in SampleVarianceUpperTriangularMatrix.prototype.costBetween in Perf-Dashboard
Categories
(Core :: JavaScript Engine, enhancement, P3)
People
(Reporter: jlink, Unassigned)
References
(Depends on 1 open bug, Blocks 1 open bug)
Details
(Whiteboard: [sp3])
Attachments
(1 file, 1 obsolete file)
7.00 KB, text/plain
5078 vs 3040 samples (diff: 2038, 1.67x): Firefox profile vs Chrome profile (Code)
Making Firefox as fast as Chrome on this function would reduce its time by 4.3%.
Updated•1 year ago
Comment 1•1 year ago
I captured a local samply profile and looked at the generated assembly, then manually interspersed the IR-ops and asm annotations to get this:
7 Prologue
7 push rbp
mov rbp, rsp
nop
Parameter
Parameter
Parameter
CheckOverRecursed
OsiPoint
MoveGroup [arg:0 -> rdx, x]
mov rdx, qword [rbp + 0x20]
4 BoxNonStrictThis (rdx)
mov r11, -0x2000000000000
xor r11, rdx
mov rax, r11
shr r11, 0x2f
4 jnz 0x48d732
OsiPoint
GuardShape (rax)
mov r11, 0x3a7a35819ee0
cmp qword [rax], r11
jnz 0x48d778
12 LoadFixedSlotAndUnbox (rax)
mov r11, -0x2000000000000
xor r11, qword [rax + 0x18]
mov rcx, r11
shr r11, 0x2f
12 jnz 0x48d77f
GuardShape (rcx)
mov r11, 0x11e42e835840
cmp qword [rcx], r11
jnz 0x48d786
Elements (rcx)
mov rbx, qword [rcx + 0x10]
ArrayLength (rbx)
mov rbx, qword [rcx + 0x10]
mov esi, dword [rbx - 0x4]
test esi, esi
js 0x48d78d
MoveGroup [arg:8 -> r8, x]
mov r8, qword [rbp + 0x28]
85 Unbox (r8)
74 mov r11, r8
shr r11, 0x2f
6 cmp r11d, 0x1fff1
jnz 0x48d794
5 mov edi, r8d
CompareAndBranch (rdi), (rsi)
cmp edi, esi
jge 0x48d67b
2 MoveGroup [rdx -> arg:0, x], [r8 -> arg:8, x], [arg:16 -> r9, x]
2 mov r9, qword [rbp + 0x30]
mov qword [rbp + 0x28], r8
mov qword [rbp + 0x20], rdx
14 Unbox (r9)
1 mov r11, r9
1 shr r11, 0x2f
1 cmp r11d, 0x1fff1
5 jnz 0x48d79b
6 mov esi, r9d
CompareAndBranch (rdi), (rsi)
cmp edi, esi
jnz 0x48d68a
Value
mov rcx, -0x7800000000000
Return (rcx)
jmp 0x48d714
1 MoveGroup [rdx -> arg:0, x], [r8 -> arg:8, x], [r9 -> arg:16, x]
mov qword [rbp + 0x30], r9
mov qword [rbp + 0x28], r8
1 mov qword [rbp + 0x20], rdx
InitializedLength (rbx)
mov edx, dword [rbx - 0xc]
KeepAliveObject (rcx)
BoundsCheck (rdi), (rdx)
cmp edx, edi
jna 0x48d7a2
5 SubI (rsi), (rdi)
1 sub esi, edi
4 jo 0x48d7a9
6 LoadElementAndUnbox (rbx), (rdi)
mov r11, -0x2000000000000
xor r11, qword [rbx + rdi * 8]
5 mov rdx, r11
shr r11, 0x2f
1 jnz 0x48d7b0
KeepAliveObject (rcx)
2 SubI (rsi), (1)
2 sub esi, 0x1
jo 0x48d7b7
1 GuardShape (rdx)
1 mov r11, 0x3a7a35819c60
cmp qword [rdx], r11
jnz 0x48d7bf
17 Int32ToIntPtr (rsi)
17 movsxd rcx, esi
ArrayBufferViewLength (rdx)
mov rbx, qword [rdx + 0x20]
1 BoundsCheck (rcx), (rbx)
1 cmp rbx, rcx
jna 0x48d7c6
8 ArrayBufferViewElements (rdx)
8 mov rax, qword [rdx + 0x30]
37 LoadUnboxedScalar (rax), (rcx)
vmovss xmm0, dword [rax + rcx * 4]
19 vucomiss xmm0, xmm0
18 jnp 0x48d70b
movss xmm0, dword [rip + 0xd7]
9 Float32ToDouble (%xmm0.s)
9 cvtss2sd xmm0, xmm0
11 Box (%xmm0.d)
11 vmovq rcx, xmm0
Return (rcx)
8 Epilogue
8 mov rsp, rbp
pop rbp
ret
InvalidateEpilogue
OOLCode
(These numbers are from only one iteration of the benchmark, but other iterations are substantively the same.)
Observations:
- The hottest instruction in the profile is consistently the mov r11, r8 in the first Unbox, which makes no sense. That instruction is shuffling a value from one register to another. The instruction just before it was loading a value from the stack, which should also be reasonably fast. The instructions following it are all cheap register-to-register operations. The earlier MoveGroup/BoxNonStrictThis pair is doing substantively the same thing (loading a value from the stack and unboxing it), but with a much more reasonable sample count.
- Similarly, a number of other hot instructions don't make much sense. The movsxd rcx, esi for Int32ToIntPtr is just a sign extension; it should be nearly free.
- In general, our code generation here looks pretty good. Our register allocator is maybe shuffling things around a little more than I would like, but there's not a lot of fat in this code.
- The one thing that stands out a little is the vucomiss in LoadUnboxedScalar. After we load a Float32 out of a typed array, we canonicalize it to make sure it has a specific NaN value. The profile seems to identify that as a hot spot, but given that the other hot spots don't seem reliable, I'm not sure how much credence to put into this one.
Markus:
- Are there any known issues with Samply / best practices for interpreting the assignment of samples to individual instructions?
- Would it be a lot of work to produce an interspersed assembly listing like the one above automatically? I think we should have all the information we need.
- I'm putting up a WIP patch to stop canonicalizing the float32 load. I'm not sure if it's safe, but it seems worth checking to see if it's actually a performance hit. Can you run a perf-dashboard comparison report on that patch?
Comment 2•1 year ago
Comment 3•1 year ago (Reporter)
I think that the mov r11, r8 is showing a lot of samples because r8 is coming from memory and that instruction is the point at which the CPU has to stall.
The situation is probably similar with the vucomiss instruction - it's trying to operate on a value that is coming from memory right beforehand so it can't proceed until that load is completed.
Comment 4•1 year ago
(In reply to Justin Link from comment #3)
> I think that the mov r11, r8 is showing a lot of samples because r8 is coming from memory and that instruction is the point at which the CPU has to stall.
Note that the memory location it's coming from is on the stack. The caller just stored the value there, so it should 100% be in L1. That's especially true because we loaded the value right next to it in memory in the previous MoveGroup. I can't think of any reasonable sequence of events that would cause us to stall waiting for a memory load on the second load from a cache line, but not the first (especially when that cache line should already be in L1).
The vucomiss is more likely to be stuck waiting for the load to retire. On the other hand, if that's true, then I'm fresh out of ideas for making this function faster.
Comment 5•1 year ago (Reporter)
You're right that it would seem unlikely to not be in L1 there. Even an L1 cache hit has a slight amount of latency but that's probably not what's happening here.
Can that load from the stack be moved earlier? I'm not seeing any obvious dependency that would prevent that.
Comment 6•1 year ago (Reporter)
Is rbp required to be 64-byte aligned? If it's not, those two loads could be coming from different L1 cache lines too. (So the first one could hit while the second one misses.)
Edit: My math/logic is slightly wrong above. If rbp is pointing to bytes 24-31 of a cache line, then the two loads (one from rbp + 0x20 and the other from rbp + 0x28) will be from different cache lines.
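A quick sketch of that cache-line arithmetic (hypothetical; the 64-byte line size and the specific rbp offset are assumptions used only to illustrate the corrected math):

```javascript
// Hypothetical sketch of the cache-line arithmetic above, assuming 64-byte
// lines. If rbp points at byte 24 of a line, the two argument loads land on
// different lines: rbp+0x20 is byte 56 of line 0, rbp+0x28 is byte 0 of line 1.
const LINE = 64;
const lineIndex = (addr) => Math.floor(addr / LINE);

const rbp = 24; // rbp pointing at bytes 24-31 of a cache line
const firstLoad = lineIndex(rbp + 0x20); // 24 + 32 = byte 56 -> line 0
const secondLoad = lineIndex(rbp + 0x28); // 24 + 40 = byte 64 -> line 1
```

With this alignment the first load can hit a line that is already warm while the second touches a fresh line, matching the hit/miss split speculated above.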
Comment 7•1 year ago
Oops, wrote the first paragraph of this comment last week and then never submitted it.
To be clear: both values here are definitely in L1. They're at the top of the stack. They were stored there just before calling this function. We have touched roughly six memory locations since the caller pushed the arguments. I don't think store forwarding is going to kick in here, but it's close; I can imagine it working on some x86 chips that have special optimizations for the top of the stack. I am 98% confident that the profiler's information here is misleading; I just don't know what the correct interpretation should be.
The thing I came here to add: Jan and I talked about the vucomiss NaN check today. We concluded that we probably need to keep it. By overlaying an Int32Array with a Float32Array you can produce an arbitrary NaN payload, and at least on x86 it looks like the conversion from float to double preserves the NaN payload in the least convenient way:
The double precision output QNaN1 is created from the single precision input QNaN as follows: the sign bit is preserved, the 8-bit exponent FFH is replaced by the 11-bit exponent 7FFH, and the 24-bit significand is extended to a 53-bit significand by appending 29 bits equal to 0.
The high bits of the NaN significand are where the tag lives, so copying the payload over would allow an attacker to craft an arbitrarily tagged Value. The fact that the bottom 29 bits must be 0 makes it a little harder to exploit, but that's the kind of restriction that clever attackers work around all the time.
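The overlay technique can be sketched in script (a hedged illustration: the bit pattern 0x7fc12345 is an arbitrary example of a non-canonical quiet NaN, and engines canonicalize NaNs on typed-array reads precisely so that this payload is never observable as a Value tag from script):

```javascript
// Overlay an Int32Array on a Float32Array's buffer to write an arbitrary
// quiet-NaN payload, as described above.
const buf = new ArrayBuffer(4);
const i32 = new Int32Array(buf);
const f32 = new Float32Array(buf);
i32[0] = 0x7fc12345; // quiet NaN with "attacker"-chosen payload bits
const v = f32[0]; // reads back as NaN

// Widening float32 -> double keeps the sign, replaces exponent FFH with 7FFH,
// and pads the significand with 29 zero bits, so the high significand bits
// (where the Value tag would live) come straight from the float32 payload.
const dv = new DataView(new ArrayBuffer(8));
dv.setFloat64(0, v); // big-endian by default
const exponent = (dv.getUint16(0) >> 4) & 0x7ff; // 0x7ff for any NaN
```

From script the payload itself is canonicalized away; the attack surface is in JIT code that skips the canonicalization, which is why the vucomiss check has to stay.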
So a comparison report on my patch is unlikely to correspond to an achievable win, but it will give us a sense of whether that's the real slowdown here or whether there's anything else.
Comment 8•1 year ago
(In reply to Iain Ireland [:iain] from comment #1)
> - Are there any known issues with Samply / best practices for interpreting the assignment of samples to individual instructions?
What you see in samply should match what you would see with perf / perf annotate. One thing worth trying is to use "precise" sampling with perf: perf record -g -k 1 -e cycles:pp (or cycles:ppp), and then import the perf.data file with samply import perf.data. Does that look more correct?
> - Would it be a lot of work to produce an interspersed assembly listing like the one above automatically? I think we should have all the information we need.
Some work, but not a whole lot.
> - I'm putting up a WIP patch to stop canonicalizing the float32 load. I'm not sure if it's safe, but it seems worth checking to see if it's actually a performance hit. Can you run a perf-dashboard comparison report on that patch?
I've sent off the try pushes; I will make a report later.
Base revision's try run: https://treeherder.mozilla.org/jobs?repo=try&revision=148db5873836ff4f644cf444f61e2940a0a05486
Local revision's try run: https://treeherder.mozilla.org/jobs?repo=try&revision=32d3606d0257a7b4fa6fb7e4085729a7f110d7de
Comment 9•1 year ago
Hmm. The results I get from cycles:ppp look basically the same as the results I was getting from samply.
Frustratingly, I can't attribute this to some sort of systematic skew: looking at the NaN-normalizing instructions, I see samples on the test and the jump, but none on the instruction that loads the normalized NaN to replace a loaded NaN, which is exactly what we'd expect given that none of the input values are NaN.
I've tested a bunch of other perf events, and none of them have shed any light. The ones that I can get data out of all seem to point to the same instruction, which would be reassuring if it weren't nonsensical.
I guess my assumption here is that the out-of-order execution is somehow causing samples to be recorded here instead of for the preceding loads. (That said, I would generally expect those loads to be in L1, because they're either on the stack, or reading the metadata of the costMatrix array, which will hit the same cache line every time we call this function.) Intuitively I would expect the vmovss in this function to be the slowest, since it's reading from a multi-dimensional array and it's not clear that the entire array should always be in cache. Sadly, the profiler insists otherwise.
I don't see any immediate path forward from this information. I might poke at the MoveGroups a little bit; it looks like we're maybe doing some redundant shuffling here.
Comment 10•1 year ago
I think we're spilling a few values to the stack unnecessarily. It's not hot in the profile, but it might still be worth cleaning up. Opened bug 1912154.
Comment 11•1 year ago
Looking at a profile of a cut-down microbenchmark, I may have a partial answer for the mov r11, r8 anomaly.
Here's a profile of the same code running in a tight loop with boring inputs:
1246 push rbp
1 mov rbp, rsp
nop
mov rdx, qword [rbp + 0x20]
510 mov r11, -0x2000000000000
xor r11, rdx
mov rax, r11
shr r11, 0x2f
1125 jnz 0xd0bc
mov r11, 0x1bbc0a567560
cmp qword [rax], r11
jnz 0xd102
mov r11, -0x2000000000000
330 xor r11, qword [rax + 0x18]
mov rcx, r11
shr r11, 0x2f
jnz 0xd109
715 mov r11, 0x1bbc0a5606e0
cmp qword [rcx], r11
jnz 0xd110
1 mov rbx, qword [rcx + 0x10]
mov esi, dword [rbx - 0x4]
1756 test esi, esi
js 0xd117
3 mov r8, qword [rbp + 0x28]
mov r11, r8
shr r11, 0x2f
1785 cmp r11d, 0x1fff1
jnz 0xd11e
mov edi, r8d
cmp edi, esi
jge 0xd005
mov r9, qword [rbp + 0x30]
1738 mov qword [rbp + 0x28], r8
mov qword [rbp + 0x20], rdx
mov r11, r9
shr r11, 0x2f
1696 cmp r11d, 0x1fff1
jnz 0xd125
1 mov esi, r9d
cmp edi, esi
jnz 0xd014
mov rcx, -0x7800000000000
14 jmp 0xd09e
1733 mov qword [rbp + 0x30], r9
mov qword [rbp + 0x28], r8
mov qword [rbp + 0x20], rdx
mov edx, dword [rbx - 0xc]
1707 cmp edx, edi
jna 0xd12c
sub esi, edi
jo 0xd133
mov r11, -0x2000000000000
1731 xor r11, qword [rbx + rdi * 8]
1 mov rdx, r11
shr r11, 0x2f
jnz 0xd13a
1748 sub esi, 0x1
jo 0xd141
1 mov r11, 0x1bbc0a567540
cmp qword [rdx], r11
jnz 0xd149
1711 movsxd rcx, esi
mov rbx, qword [rdx + 0x20]
cmp rbx, rcx
jna 0xd150
mov rax, qword [rdx + 0x30]
1761 vmovss xmm0, dword [rax + rcx * 4]
vucomiss xmm0, xmm0
3 jnp 0xd095
movss xmm0, dword [rip + 0xd7]
1713 cvtss2sd xmm0, xmm0
vmovq rcx, xmm0
19 mov rsp, rbp
1810 pop rbp
1 ret
Across most of this function, we see 1600-1800 samples every 3-5 instructions, which I think implies that we're consistently issuing/retiring multiple instructions per cycle, and the sample is being attributed to the final instruction of the batch. perf stat confirms that the IPC for this code is high (4-5 on my machine).
So if we're seeing a lot of samples for the mov r11, r8 in the browser version, that doesn't necessarily mean anything about that instruction in particular. In the particular profile I annotated, we don't have any samples for the 10 instructions preceding the mov. I think that implies that the block of instructions ending with the mov was being executed in parallel and was consistently retiring in the same place every time. So maybe the high sample count is just a sign that everything right before the mov is running unusually smoothly?
In any case, I think the practical upshot of this is that we currently have three main avenues of investigation:
- If the performance reports that Markus has kicked off come back with a significant improvement from eliminating the NaN canonicalization, then we can chalk this up to a necessary side-effect of our value representation, and look to make easier improvements elsewhere.
- If the register allocation issue we're looking at in bug 1912154 leads to any changes, then I would not expect a huge improvement here specifically, but we might see a small improvement across a lot of different code, so it might still be worth it.
- I can peek at the code V8 generates and see if anything stands out as an obvious improvement.
Comment 12•1 year ago
Sorry it took so long - the comparison report found no significant difference with the patch applied.
The base build has 5066 samples in this function: https://share.firefox.dev/46EzRh0
The fix build has 5087 samples in this function: https://share.firefox.dev/3yz7KmY
There also weren't any significant differences in other functions on sp3.
Updated•1 year ago
Comment 13•1 year ago
I've abandoned the patch, both because it doesn't seem to help and because it's unsound.
I ran the same test in my local (somewhat out of date) copy of d8, dumped the assembly, and did my best to annotate it. Here's my microbenchmark:
let C = {};
let c = new Object(C);
C.costMatrix = [];
for (var i = 0; i < 100; i++) {
C.costMatrix.push(new Float32Array(100 - i));
}
C.costBetween = function(from, to) {
if (from >= this.costMatrix.length || from == to)
return 0;
return this.costMatrix[from][to - from - 1];
}
for (var n = 0; n < 300000; n++) {
for (var i = 0; i < 100; i++) {
for (var j = i+1; j < 100; j++) {
C.costBetween(i, j);
}
}
}
My rough observations:
- V8's value representation generates slightly fewer instructions to unbox a value.
- On the flip side, because V8 boxes its doubles, this function has to heap-allocate its return value unless the float32 can be exactly represented using a SMI.
- In my little microbenchmark, with inlining turned off, the number of samples in costBetween is almost identical between SM and V8, but we're somehow spending way more time in the loops in the caller. It's hard to tell why, because we recompile the caller and the profiler's asm view only seems to show samples for the first compilation, which is not where we spend our time.
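The SMI point above can be made concrete with a small predicate (a sketch under assumptions: the exact SMI range depends on V8's build configuration, so the 2**30 bound and the helper name smiRepresentable are illustrative, not V8's actual implementation):

```javascript
// Hypothetical check for whether a value read from the Float32Array could be
// returned as a SMI (small integer) without heap-allocating a boxed double.
// Note that -0 is excluded: it is not representable as an integer tag.
const smiRepresentable = (x) =>
  Number.isInteger(x) && Object.is(x, x | 0) && Math.abs(x) < 2 ** 30;

const samples = new Float32Array([3, 3.5, -0, 1e20, NaN]);
const results = Array.from(samples, smiRepresentable);
// Only the plain small integer avoids allocation under this model.
```

Anything that fails the predicate (fractions, -0, huge magnitudes, NaN) would force a heap-allocated HeapNumber return under V8's representation, while SpiderMonkey's NaN-boxed Values carry any double inline.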
I'll attach my annotated disassembly.
I'm not sure what exactly we should be doing here, although I do think that I'm starting to see patterns in terms of which code V8 is faster at. Tightening up our unboxing code even a little bit would probably pay off, although it's already pretty optimized given our value representation.
Comment 14•1 year ago
Updated•1 year ago
Updated•1 year ago
Comment 15•10 months ago
With bug 1956655 fixed, is it worth re-evaluating with a fresh profile?