Firefox takes 1.7x time as Chrome in SampleVarianceUpperTriangularMatrix.prototype.costBetween in Perf-Dashboard
Categories
(Core :: JavaScript Engine, enhancement, P3)
People
(Reporter: jlink, Unassigned)
References
(Depends on 1 open bug, Blocks 1 open bug)
Details
(Whiteboard: [sp3])
Attachments
(1 file, 1 obsolete file)
7.00 KB, text/plain
5078 vs 3040 samples (diff: 2038, 1.67x): Firefox profile vs Chrome profile (Code)
Making Firefox as fast as Chrome on this function would reduce its time by 4.3%.
Updated•1 year ago
Comment 1•1 year ago
I captured a local samply profile and looked at the generated assembly, then manually interspersed the IR-ops and asm annotations to get this:
7 Prologue
7 push rbp
mov rbp, rsp
nop
Parameter
Parameter
Parameter
CheckOverRecursed
OsiPoint
MoveGroup [arg:0 -> rdx, x]
mov rdx, qword [rbp + 0x20]
4 BoxNonStrictThis (rdx)
mov r11, -0x2000000000000
xor r11, rdx
mov rax, r11
shr r11, 0x2f
4 jnz 0x48d732
OsiPoint
GuardShape (rax)
mov r11, 0x3a7a35819ee0
cmp qword [rax], r11
jnz 0x48d778
12 LoadFixedSlotAndUnbox (rax)
mov r11, -0x2000000000000
xor r11, qword [rax + 0x18]
mov rcx, r11
shr r11, 0x2f
12 jnz 0x48d77f
GuardShape (rcx)
mov r11, 0x11e42e835840
cmp qword [rcx], r11
jnz 0x48d786
Elements (rcx)
mov rbx, qword [rcx + 0x10]
ArrayLength (rbx)
mov rbx, qword [rcx + 0x10]
mov esi, dword [rbx - 0x4]
test esi, esi
js 0x48d78d
MoveGroup [arg:8 -> r8, x]
mov r8, qword [rbp + 0x28]
85 Unbox (r8)
74 mov r11, r8
shr r11, 0x2f
6 cmp r11d, 0x1fff1
jnz 0x48d794
5 mov edi, r8d
CompareAndBranch (rdi), (rsi)
cmp edi, esi
jge 0x48d67b
2 MoveGroup [rdx -> arg:0, x], [r8 -> arg:8, x], [arg:16 -> r9, x]
2 mov r9, qword [rbp + 0x30]
mov qword [rbp + 0x28], r8
mov qword [rbp + 0x20], rdx
14 Unbox (r9)
1 mov r11, r9
1 shr r11, 0x2f
1 cmp r11d, 0x1fff1
5 jnz 0x48d79b
6 mov esi, r9d
CompareAndBranch (rdi), (rsi)
cmp edi, esi
jnz 0x48d68a
Value
mov rcx, -0x7800000000000
Return (rcx)
jmp 0x48d714
1 MoveGroup [rdx -> arg:0, x], [r8 -> arg:8, x], [r9 -> arg:16, x]
mov qword [rbp + 0x30], r9
mov qword [rbp + 0x28], r8
1 mov qword [rbp + 0x20], rdx
InitializedLength (rbx)
mov edx, dword [rbx - 0xc]
KeepAliveObject (rcx)
BoundsCheck (rdi), (rdx)
cmp edx, edi
jna 0x48d7a2
5 SubI (rsi), (rdi)
1 sub esi, edi
4 jo 0x48d7a9
6 LoadElementAndUnbox (rbx), (rdi)
mov r11, -0x2000000000000
xor r11, qword [rbx + rdi * 8]
5 mov rdx, r11
shr r11, 0x2f
1 jnz 0x48d7b0
KeepAliveObject (rcx)
2 SubI (rsi), (1)
2 sub esi, 0x1
jo 0x48d7b7
1 GuardShape (rdx)
1 mov r11, 0x3a7a35819c60
cmp qword [rdx], r11
jnz 0x48d7bf
17 Int32ToIntPtr (rsi)
17 movsxd rcx, esi
ArrayBufferViewLength (rdx)
mov rbx, qword [rdx + 0x20]
1 BoundsCheck (rcx), (rbx)
1 cmp rbx, rcx
jna 0x48d7c6
8 ArrayBufferViewElements (rdx)
8 mov rax, qword [rdx + 0x30]
37 LoadUnboxedScalar (rax), (rcx)
vmovss xmm0, dword [rax + rcx * 4]
19 vucomiss xmm0, xmm0
18 jnp 0x48d70b
movss xmm0, dword [rip + 0xd7]
9 Float32ToDouble (%xmm0.s)
9 cvtss2sd xmm0, xmm0
11 Box (%xmm0.d)
11 vmovq rcx, xmm0
Return (rcx)
8 Epilogue
8 mov rsp, rbp
pop rbp
ret
InvalidateEpilogue
OOLCode
(These numbers are from only one iteration of the benchmark, but other iterations are substantively the same.)
Observations:
- The hottest instruction in the profile is consistently the mov r11, r8 in the first Unbox, which makes no sense. That instruction is shuffling a value from one register to another. The instruction just before it was loading a value from the stack, which should also be reasonably fast. The instructions following it are all cheap register-to-register operations. The earlier MoveGroup/BoxNonStrictThis pair is doing substantively the same thing (loading a value from the stack and unboxing it), but with a much more reasonable sample count.
- Similarly, a number of other hot instructions don't make much sense. The movsxd rcx, esi for Int32ToIntPtr is just a sign extension; it should be nearly free.
- In general, our code generation here looks pretty good. Our register allocator is maybe shuffling things around a little more than I would like, but there's not a lot of fat in this code.
- The one thing that stands out a little is the vucomiss in LoadUnboxedScalar. After we load a Float32 out of a typed array, we canonicalize it to make sure it has a specific NaN value. The profile seems to identify that as a hot spot, but given that the other hot spots don't seem reliable, I'm not sure how much credence to put into this one.
Markus:
- Are there any known issues with Samply / best practices for interpreting the assignment of samples to individual instructions?
- Would it be a lot of work to produce an interspersed assembly listing like the one above automatically? I think we should have all the information we need.
- I'm putting up a WIP patch to stop canonicalizing the float32 load. I'm not sure if it's safe, but it seems worth checking to see if it's actually a performance hit. Can you run a perf-dashboard comparison report on that patch?
Comment 2•1 year ago
Comment 3•1 year ago (Reporter)
I think that the mov r11, r8 is showing a lot of samples because r8 is coming from memory and that instruction is the point at which the CPU has to stall.
The situation is probably similar with the vucomiss instruction - it's trying to operate on a value that is coming from memory right beforehand so it can't proceed until that load is completed.
Comment 4•1 year ago
(In reply to Justin Link from comment #3)
> I think that the mov r11, r8 is showing a lot of samples because r8 is coming from memory and that instruction is the point at which the CPU has to stall.
Note that the memory location it's coming from is on the stack. The caller just stored the value there, so it should 100% be in L1. That's especially true because we loaded the value right next to it in memory in the previous MoveGroup. I can't think of any reasonable sequence of events that would cause us to stall waiting for a memory load on the second load from a cache line, but not the first (especially when that cache line should already be in L1).
The vucomiss is more likely to be stuck waiting for the load to retire. On the other hand, if that's true, then I'm fresh out of ideas for making this function faster.
Comment 5•1 year ago (Reporter)
You're right that it would seem unlikely to not be in L1 there. Even an L1 cache hit has a slight amount of latency but that's probably not what's happening here.
Can that load from the stack be moved earlier? I'm not seeing any obvious dependency that would prevent that.
Comment 6•1 year ago (Reporter)
Is rbp required to be 64-byte aligned? If it's not, those two loads could be coming from different L1 cache lines too. (So the first one could hit while the second one misses.)
Edit: My math/logic is slightly wrong above. If rbp is pointing to bytes 24-31 of a cache line, then the two loads (one from rbp + 0x20 and the other from rbp + 0x28) will be from different cache lines.
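A quick sketch of that cache-line arithmetic (hypothetical; the 64-byte line size and the specific rbp offset are assumptions used only to illustrate the corrected math):

```javascript
// Hypothetical sketch of the cache-line arithmetic above, assuming 64-byte
// lines. If rbp points at byte 24 of a line, the two argument loads land on
// different lines: rbp+0x20 is byte 56 of line 0, rbp+0x28 is byte 0 of line 1.
const LINE = 64;
const lineIndex = (addr) => Math.floor(addr / LINE);

const rbp = 24; // rbp pointing at bytes 24-31 of a cache line
const firstLoad = lineIndex(rbp + 0x20); // 24 + 32 = byte 56 -> line 0
const secondLoad = lineIndex(rbp + 0x28); // 24 + 40 = byte 64 -> line 1
```

With this alignment the first load can hit a line that is already warm while the second touches a fresh line, matching the hit/miss split speculated above.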
Comment 7•1 year ago
Oops, wrote the first paragraph of this comment last week and then never submitted it.
To be clear: both values here are definitely in L1. They're at the top of the stack. They were stored there just before calling this function. We have touched roughly six memory locations since the caller pushed the arguments. I don't think store forwarding is going to kick in here, but it's close; I can imagine it working on some x86 chips that have special optimizations for the top of the stack. I am 98% confident that the profiler's information here is misleading; I just don't know what the correct interpretation should be.
The thing I came here to add: Jan and I talked about the vucomiss NaN check today. We concluded that we probably need to keep it. By overlaying an Int32Array with a Float32Array you can produce an arbitrary NaN payload, and at least on x86 it looks like the conversion from float to double preserves the NaN payload in the least convenient way:
The double precision output QNaN1 is created from the single precision input QNaN as follows: the sign bit is preserved, the 8-bit exponent FFH is replaced by the 11-bit exponent 7FFH, and the 24-bit significand is extended to a 53-bit significand by appending 29 bits equal to 0.
The high bits of the NaN significand are where the tag lives, so copying the payload over would allow an attacker to craft an arbitrarily tagged Value. The fact that the bottom 29 bits must be 0 makes it a little harder to exploit, but that's the kind of restriction that clever attackers work around all the time.
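The overlay technique can be sketched in script (a hedged illustration: the bit pattern 0x7fc12345 is an arbitrary example of a non-canonical quiet NaN, and engines canonicalize NaNs on typed-array reads precisely so that this payload is never observable as a Value tag from script):

```javascript
// Overlay an Int32Array on a Float32Array's buffer to write an arbitrary
// quiet-NaN payload, as described above.
const buf = new ArrayBuffer(4);
const i32 = new Int32Array(buf);
const f32 = new Float32Array(buf);
i32[0] = 0x7fc12345; // quiet NaN with "attacker"-chosen payload bits
const v = f32[0]; // reads back as NaN

// Widening float32 -> double keeps the sign, replaces exponent FFH with 7FFH,
// and pads the significand with 29 zero bits, so the high significand bits
// (where the Value tag would live) come straight from the float32 payload.
const dv = new DataView(new ArrayBuffer(8));
dv.setFloat64(0, v); // big-endian by default
const exponent = (dv.getUint16(0) >> 4) & 0x7ff; // 0x7ff for any NaN
```

From script the payload itself is canonicalized away; the attack surface is in JIT code that skips the canonicalization, which is why the vucomiss check has to stay.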
So a comparison report on my patch is unlikely to correspond to an achievable win, but it will give us a sense of whether that's the real slowdown here or whether there's anything else.
Comment 8•1 year ago
(In reply to Iain Ireland [:iain] from comment #1)
> - Are there any known issues with Samply / best practices for interpreting the assignment of samples to individual instructions?
What you see in samply should match what you would see with perf / perf annotate. One thing worth trying is to use "precise" sampling with perf: perf record -g -k 1 -e cycles:pp (or cycles:ppp), and then import the perf.data file with samply import perf.data. Does that look more correct?
> - Would it be a lot of work to produce an interspersed assembly listing like the one above automatically? I think we should have all the information we need.
Some work, but not a whole lot.
> - I'm putting up a WIP patch to stop canonicalizing the float32 load. I'm not sure if it's safe, but it seems worth checking to see if it's actually a performance hit. Can you run a perf-dashboard comparison report on that patch?
I've sent off the try pushes; I will make a report later.
Base revision's try run: https://treeherder.mozilla.org/jobs?repo=try&revision=148db5873836ff4f644cf444f61e2940a0a05486
Local revision's try run: https://treeherder.mozilla.org/jobs?repo=try&revision=32d3606d0257a7b4fa6fb7e4085729a7f110d7de
Comment 9•1 year ago
Hmm. The results I get from cycles:ppp look basically the same as the results I was getting from samply.
Frustratingly, I can't attribute this to some sort of systematic skew: looking at the NaN-normalizing instructions, I see samples on the test and the jump, but none on the instruction that loads the normalized NaN to replace a loaded NaN, which is exactly what we'd expect given that none of the input values are NaN.
I've tested a bunch of other perf events, and none of them have shed any light. The ones that I can get data out of all seem to point to the same instruction, which would be reassuring if it weren't nonsensical.
I guess my assumption here is that the out-of-order execution is somehow causing samples to be recorded here instead of for the preceding loads. (That said, I would generally expect those loads to be in L1, because they're either on the stack, or reading the metadata of the costMatrix array, which will hit the same cache line every time we call this function.) Intuitively I would expect the vmovss in this function to be the slowest, since it's reading from a multi-dimensional array and it's not clear that the entire array should always be in cache. Sadly, the profiler insists otherwise.
I don't see any immediate path forward from this information. I might poke at the MoveGroups a little bit; it looks like we're maybe doing some redundant shuffling here.
Comment 10•1 year ago
I think we're spilling a few values to the stack unnecessarily. It's not hot in the profile, but it might still be worth cleaning up. Opened bug 1912154.
Comment 11•1 year ago
Looking at a profile of a cut-down microbenchmark, I may have a partial answer for the mov r11, r8 anomaly.
Here's a profile of the same code running in a tight loop with boring inputs:
1246 push rbp
1 mov rbp, rsp
nop
mov rdx, qword [rbp + 0x20]
510 mov r11, -0x2000000000000
xor r11, rdx
mov rax, r11
shr r11, 0x2f
1125 jnz 0xd0bc
mov r11, 0x1bbc0a567560
cmp qword [rax], r11
jnz 0xd102
mov r11, -0x2000000000000
330 xor r11, qword [rax + 0x18]
mov rcx, r11
shr r11, 0x2f
jnz 0xd109
715 mov r11, 0x1bbc0a5606e0
cmp qword [rcx], r11
jnz 0xd110
1 mov rbx, qword [rcx + 0x10]
mov esi, dword [rbx - 0x4]
1756 test esi, esi
js 0xd117
3 mov r8, qword [rbp + 0x28]
mov r11, r8
shr r11, 0x2f
1785 cmp r11d, 0x1fff1
jnz 0xd11e
mov edi, r8d
cmp edi, esi
jge 0xd005
mov r9, qword [rbp + 0x30]
1738 mov qword [rbp + 0x28], r8
mov qword [rbp + 0x20], rdx
mov r11, r9
shr r11, 0x2f
1696 cmp r11d, 0x1fff1
jnz 0xd125
1 mov esi, r9d
cmp edi, esi
jnz 0xd014
mov rcx, -0x7800000000000
14 jmp 0xd09e
1733 mov qword [rbp + 0x30], r9
mov qword [rbp + 0x28], r8
mov qword [rbp + 0x20], rdx
mov edx, dword [rbx - 0xc]
1707 cmp edx, edi
jna 0xd12c
sub esi, edi
jo 0xd133
mov r11, -0x2000000000000
1731 xor r11, qword [rbx + rdi * 8]
1 mov rdx, r11
shr r11, 0x2f
jnz 0xd13a
1748 sub esi, 0x1
jo 0xd141
1 mov r11, 0x1bbc0a567540
cmp qword [rdx], r11
jnz 0xd149
1711 movsxd rcx, esi
mov rbx, qword [rdx + 0x20]
cmp rbx, rcx
jna 0xd150
mov rax, qword [rdx + 0x30]
1761 vmovss xmm0, dword [rax + rcx * 4]
vucomiss xmm0, xmm0
3 jnp 0xd095
movss xmm0, dword [rip + 0xd7]
1713 cvtss2sd xmm0, xmm0
vmovq rcx, xmm0
19 mov rsp, rbp
1810 pop rbp
1 ret
Across most of this function, we see 1600-1800 samples every 3-5 instructions, which I think implies that we're consistently issuing/retiring multiple instructions per cycle, and the sample is being attributed to the final instruction of the batch. perf stat confirms that the IPC for this code is high (4-5 on my machine).
So if we're seeing a lot of samples for the mov r11, r8 in the browser version, that doesn't necessarily mean anything about that instruction in particular. In the particular profile I annotated, we don't have any samples for the 10 instructions preceding the mov. I think that implies that the block of instructions ending with the mov was being executed in parallel and was consistently retiring in the same place every time. So maybe the high sample count is just a sign that everything right before the mov is running unusually smoothly?
In any case, I think the practical upshot of this is that we currently have three main avenues of investigation:
- If the performance reports that Markus has kicked off come back with a significant improvement from eliminating the NaN canonicalization, then we can chalk this up to a necessary side-effect of our value representation, and look to make easier improvements elsewhere.
- If the register allocation issue we're looking at in bug 1912154 leads to any changes, then I would not expect a huge improvement here specifically, but we might see a small improvement across a lot of different code, so it might still be worth it.
- I can peek at the code V8 generates and see if anything stands out as an obvious improvement.
Comment 12•1 year ago
Sorry it took so long - the comparison report found no significant difference with the patch applied.
The base build has 5066 samples in this function: https://share.firefox.dev/46EzRh0
The fix build has 5087 samples in this function: https://share.firefox.dev/3yz7KmY
There also weren't any significant differences in other functions on sp3.
Updated•1 year ago
Comment 13•1 year ago
I've abandoned the patch, both because it doesn't seem to help and because it's unsound.
I ran the same test in my local (somewhat out of date) copy of d8, dumped the assembly, and did my best to annotate it. Here's my microbenchmark:
let C = {};
let c = new Object(C);
C.costMatrix = [];
for (var i = 0; i < 100; i++) {
C.costMatrix.push(new Float32Array(100 - i));
}
C.costBetween = function(from, to) {
if (from >= this.costMatrix.length || from == to)
return 0;
return this.costMatrix[from][to - from - 1];
}
for (var n = 0; n < 300000; n++) {
for (var i = 0; i < 100; i++) {
for (var j = i+1; j < 100; j++) {
C.costBetween(i, j);
}
}
}
My rough observations:
- V8's value representation generates slightly fewer instructions to unbox a value.
- On the flip side, because V8 boxes its doubles, this function has to heap-allocate its return value unless the float32 can be exactly represented using a SMI.
- In my little microbenchmark, with inlining turned off, the number of samples in costBetween is almost identical between SM and V8, but we're somehow spending way more time in the loops in the caller. It's hard to tell why, because we recompile the caller and the profiler's asm view only seems to show samples for the first compilation, which is not where we spend our time.
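The SMI point above can be made concrete with a small predicate (a sketch under assumptions: the exact SMI range depends on V8's build configuration, so the 2**30 bound and the helper name smiRepresentable are illustrative, not V8's actual implementation):

```javascript
// Hypothetical check for whether a value read from the Float32Array could be
// returned as a SMI (small integer) without heap-allocating a boxed double.
// Note that -0 is excluded: it is not representable as an integer tag.
const smiRepresentable = (x) =>
  Number.isInteger(x) && Object.is(x, x | 0) && Math.abs(x) < 2 ** 30;

const samples = new Float32Array([3, 3.5, -0, 1e20, NaN]);
const results = Array.from(samples, smiRepresentable);
// Only the plain small integer avoids allocation under this model.
```

Anything that fails the predicate (fractions, -0, huge magnitudes, NaN) would force a heap-allocated HeapNumber return under V8's representation, while SpiderMonkey's NaN-boxed Values carry any double inline.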
I'll attach my annotated disassembly.
I'm not sure what exactly we should be doing here, although I do think that I'm starting to see patterns in terms of which code V8 is faster at. Tightening up our unboxing code even a little bit would probably pay off, although it's already pretty optimized given our value representation.
Comment 14•1 year ago
Updated•1 year ago
Updated•1 year ago
Comment 15•10 months ago
With bug 1956655 fixed, is it worth re-evaluating with a fresh profile?