Bug 1844857 (Open) · Opened 2 years ago · Updated 3 months ago

2x slower than Chrome on findOptimalSegmentationInternal (from the Speedometer 3 subtest "Perf-Dashboard")

Categories

(Core :: JavaScript Engine, task, P3)

People

(Reporter: mstange, Unassigned)

References

(Blocks 2 open bugs)

Details

(Whiteboard: [sp3])

Attachments

(2 files, 1 obsolete file)

Attached file testcase (obsolete) —

On this function we're 1.7x slower than Chrome. Fixing the performance difference would improve our score on Perf-Dashboard by 3.2%.

I get the following numbers on the fixed testcase in attachment 9345436 [details]:
Firefox Nightly: ~150ms
Chrome Canary: ~90ms

The time is spent executing Ion-generated code.

Looking into this.

Initial observations:

  1. We're generating two different Ion ICs in cases where the baseline IC is polymorphic. The first is a JSOp::Add where ~90% of the time, one of the operands is undefined and the result is a NaN. We have some coverage for this case, but a) bug 1752504 to fold number + NaN in Warp is still open, and b) we don't transpile the CacheIR to Warp here anyway, because the IC is polymorphic. The BinaryValueCache that we generate is somewhere in the ballpark of 1% of the overall time, so fixing it doesn't seem like a huge win.

  2. Is it intentional that we're adding undefined 90% of the time? It happens in this code:

        var costOfOptimalSegmentationOfLengthK =
            costOfOptimalSegmentationThatEndAtCurrentStart[k];
        var costOfCurrentSegment = costMatrix.costBetween(
          segmentStart,
          segmentEnd
        );
        var totalCost =
            costOfOptimalSegmentationOfLengthK + costOfCurrentSegment;

costOfOptimalSegmentationThatEndAtCurrentStart[k] is frequently undefined, which causes totalCost to be NaN. It looks like the author of this code was trying to prevent this case (noSegmentationOfLenghtKEndsAtCurrentStart), but for whatever reason that doesn't seem to work. Markus: Do you know if this bug is also present in the actual benchmark, or is it an artifact of you pulling it out into a microbenchmark? (The typo in the variable name makes me suspect the former.)

  3. The second Ion IC is a GetElem that is polymorphic between regular array access and typed array access. It looks like it's caused by this line:

        cost[0] = [0]; // The cost of segmenting single value is always 0.

All the other entries in cost are Float32Arrays, but this is a regular array. If I change the line to cost[0] = new Float32Array([0]), the Ion IC goes away. Making that change speeds things up by 5% or so, so this alone is unlikely to account for much of the gap between us and V8.
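A minimal standalone sketch (illustrative, not the actual benchmark code) of how the mixed row types keep the GetElem IC polymorphic:

    // Row 0 is a plain Array; every other row is a Float32Array, so the
    // element access below alternates between two object kinds and the
    // GetElem IC stays polymorphic.
    var cost = [[0]];
    for (var i = 1; i < 100; i++) {
      cost.push(new Float32Array(100));
    }

    var sum = 0;
    for (var i = 0; i < cost.length; i++) {
      sum += cost[i][0];
    }

    // Making row 0 a Float32Array as well keeps the access site monomorphic:
    cost[0] = new Float32Array([0]);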

I'll keep digging.

Flags: needinfo?(mstange.moz)

It's probably an artifact of my testcase reduction work; sorry for giving you a buggy testcase.

Flags: needinfo?(mstange.moz)
Attached file fixed testcase
Attachment #9345125 - Attachment is obsolete: true
Severity: -- → S3
Type: defect → task
Priority: -- → P3

(Oops, wrote this comment yesterday and forgot to save it.)

The fix looks good. I can confirm that the Add is no longer polymorphic.

On the other hand, fixing the bug makes a SetElem polymorphic, for the same reason as the GetElem above. I think it's this one:

cost[segmentEnd][k + 1] = totalCost

Editing the testcase to allocate a Float32Array for cost[0] now reduces the time to execute run on my machine from ~460 ms to ~400 ms. A 10-15% improvement is non-trivial. I don't think we want to change the testcase, since I'm sure this sort of thing happens all the time in real-world code. On the other hand, I don't have any immediate thoughts about what we could do to improve it.
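For reference, here's the SetElem version of the same problem as a minimal sketch (illustrative, not the benchmark itself):

    // The store alternates between a dense-array SetElem and a typed-array
    // SetElem, so the IC is polymorphic for the same reason as the GetElem.
    var cost = [[0], new Float32Array(8), new Float32Array(8)];
    for (var i = 0; i < cost.length; i++) {
      cost[i][0] = 1;
    }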

Another observation from looking at the CacheIR we transpile: there are a couple of hot GetProp.DenseElementHole ICs that repeatedly check the proto chain to make sure that no dense elements have been added. It would be nice if we could somehow be smarter about missing properties/elements. Something about fuses?
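For context on why those guards exist, a small self-contained example: a read from a hole falls through to the prototype chain, so the IC has to verify that no prototype has gained dense elements.

    var a = [1, , 3];           // a[1] is a hole
    console.log(a[1]);          // undefined: no proto supplies element 1
    Array.prototype[1] = 99;    // this is what the proto-chain guard watches for
    console.log(a[1]);          // 99: the hole now reads through to the proto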

This code is a superset of the code that I've been looking at in bug 1911160: it includes costBetween, but also some of the surrounding context. Since we opened this bug, upstream Speedometer changed the cost array to be monomorphic. I'll attach a new shell version of the benchmark.

We continue to be slower on this benchmark. As described in bug 1911160, we seem to be slower in costBetween for reasons that are not obvious even after comparing the assembly the two engines generate for the function. On my laptop, we're about 50% slower with inlining disabled (3650ms vs 2420ms), and 90% slower with inlining enabled (2032ms vs 1072ms).

If I comment out these two lines in tryAttachDenseElementHole, to simulate the results of adding an invalidation fuse that lets us skip checking the proto chain for dense elements, then our numbers improve to 3518ms without inlining (~4% better) / 1890ms with inlining (~8% better). That's not nothing, but it's not a huge win.

This is some of the most Jetstream-ish code in Speedometer. I'm not sure if that's an argument for bumping it down in priority, or for seeing it as an opportunity to improve both sp3 and Jetstream. I would like to tag this as js-perf-next, but I'm not sure what the next steps are, other than staring even more intently at the generated assembly in an attempt to explain why our code is slower.

js-perf-next: After thinking about this some more, I think I can come up with a semi-bounded list of next steps.

  1. I've only looked at this on my personal laptop (Raptor Lake). It would be good to profile this on a variety of other machines to see how consistent the results are. In particular, it would be interesting to see whether the weird spike in samples I see in the second unbox in this comment is a micro-architectural quirk, or whether it recurs on other machines as well.
  2. To the extent that I can point to any lesson from this, it's that we spend a lot of time unboxing values. If we can shave even a cycle or two off of that, it might pay off in sheer volume. A couple of possibilities:
    a) On machines with BMI2, which we already check, we could potentially replace mov rcx, r11; shr r11, 0x2f with shrx rcx, r11, 0x2f to extract the tag with one fewer instruction.
    b) When we're unboxing an Int32 (or other value with a non-pointer payload), the high 4 bytes will only contain the tag. If we're unboxing from memory, we could consider testing the tag in-memory to avoid having to shift, converting something like this:
mov r11, -0x2000000000000
xor r11, qword [rbx + rdi * 8]
mov rdx, r11
shr r11, 0x2f
jnz 0x48d7b0

to something like:

cmp dword [rbx + rdi * 8 + 4], 0xfffe0000
jnz 0x48d7b0
mov dword [rbx + rdi * 8], edx

I'm not sure if the shorter dependency chain outweighs the cost of the additional load, but it might be worth testing. (A runnable sketch of the Value layout this assumes follows the list below.)
  3. This might be a good time for somebody to break out VTune to see if we can identify performance counters (cache misses? branch mispredicts?) where we're doing worse than V8.
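To make the constants in the snippets above concrete, here's a runnable sketch of the boxed-Value layout they assume, using BigInt for the 64-bit arithmetic. The tag value matches the shifted-tag immediate in the disassembly above; treat this as an illustration, not a spec of SpiderMonkey's encoding.

    // 64-bit Value: 17-bit tag in the top bits, payload in the low 47 bits.
    // 0x1fffc << 47 == 0xfffe000000000000, which is the -0x2000000000000
    // immediate loaded in the unboxing sequence above.
    const TAG_SHIFT = 47n;
    const SHIFTED_TAG = 0x1fffcn << TAG_SHIFT;
    console.log(BigInt.asIntN(64, SHIFTED_TAG).toString(16)); // -2000000000000

    // Current unbox: xor with the shifted tag, then check that the top
    // 17 bits are zero (the shr/jnz pair above).
    function unbox(value) {
      const payload = value ^ SHIFTED_TAG;
      if (payload >> TAG_SHIFT !== 0n) throw new Error("wrong tag"); // jnz
      return payload;
    }

    // Proposed check for values with non-pointer payloads: the high 4 bytes
    // are exactly the high half of the shifted tag, so one 32-bit compare
    // (the cmp dword above) can replace the xor/shift.
    function tagMatchesHighDword(value) {
      return (value >> 32n) === (SHIFTED_TAG >> 32n);
    }

    console.log(unbox(SHIFTED_TAG | 0x1234n));               // 4660n
    console.log(tagMatchesHighDword(SHIFTED_TAG | 0x1234n)); // true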

Whiteboard: [sp3] → [sp3][js-perf-next]

Unfortunately, now that I look more closely, there's no encoding of shrx that takes an immediate shift value. It only accepts register inputs.

I played around a bit with gcc/clang in Godbolt to see if they would generate anything better than our current unboxing sequences if I gave them permission to use the latest instructions. The results mostly weren't very interesting: https://godbolt.org/z/TGb77qqav.

A couple of observations:

  1. In the version where we unbox a pointer-sized value from memory, our instruction sequence (our_unboxPtr_m) seems better than what gcc/clang generate (theirs takes 7 instructions vs our 5).
  2. When unboxing a pointer-sized value, instead of clearing the upper bits via xor with the shifted tag, Clang used mov al, 47; bzhi rax, rdi, rax to zero the high bits of the register. Clang presumably thought it was better for some reason, although I will note that using xor can be a small obstacle to speculative execution attacks like Spectre (because unboxing a type-confused Value will result in an invalid pointer with high bits set).
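A quick BigInt sketch of the difference (illustrative): bzhi masks unconditionally, while the xor approach leaves the high bits set if the tag doesn't match.

    // bzhi with an index of 47 keeps the low 47 bits and zeroes the rest:
    const MASK_47 = (1n << 47n) - 1n;
    function unboxViaBzhi(value) {
      // No tag check: a type-confused Value still yields a plausible-looking
      // 47-bit pointer, unlike the xor sequence, which leaves high bits set.
      return value & MASK_47;
    }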

Bug 1804104 improved performance on findOptimalSegmentationInternal a little bit by eliminating a SetElem IC.

Depends on: 1804104

Unmarking this as js-perf-next, since we don't have obvious next steps. Note that bug 1956655 fixed the "weird spike in samples" I mentioned in comment 7.

Whiteboard: [sp3][js-perf-next] → [sp3]