Closed Bug 1568450 Opened 1 year ago Closed 1 year ago

3.69 - 24.35% perf_reftest_singletons bloom-basic-2.html / perf_reftest_singletons bloom-basic.html / tabpaint (windows10-64-shippable-qr, windows7-32-shippable) regression on push 8ba3c1292475b96e2ccb46c3232c929863451ff6

Categories

(Firefox Build System :: General, defect)

Desktop
Windows
defect
Not set

Tracking

(firefox-esr60 unaffected, firefox-esr68 unaffected, firefox68 unaffected, firefox69 unaffected, firefox70 fixed)

RESOLVED FIXED
mozilla70
Tracking Status
firefox-esr60 --- unaffected
firefox-esr68 --- unaffected
firefox68 --- unaffected
firefox69 --- unaffected
firefox70 --- fixed

People

(Reporter: igoldan, Assigned: froydnj)

References

(Blocks 1 open bug, Regression)

Details

(Keywords: perf, regression, talos-regression)

Attachments

(1 file)

Talos has detected a Firefox performance regression from push:

https://hg.mozilla.org/integration/autoland/pushloghtml?fromchange=0fcc92a53bfb5329f7f5117fa8b91af79bf22e3b&tochange=8ba3c1292475b96e2ccb46c3232c929863451ff6

As author of one of the patches included in that push, we need your help to address this regression.

Regressions:

24% perf_reftest_singletons bloom-basic-2.html windows7-32-shippable opt e10s stylo 42.38 -> 52.70
23% perf_reftest_singletons bloom-basic.html windows7-32-shippable opt e10s stylo 43.38 -> 53.54

Improvements:

7% perf_reftest_singletons parent-basic-singleton.html windows7-32-shippable opt e10s stylo 111.11 -> 103.52
2% tp5o_webext windows10-64-shippable opt e10s stylo 157.44 -> 153.68

You can find links to graphs and comparison views for each of the above tests at: https://treeherder.mozilla.org/perf.html#/alerts?id=21991

On the page above you can see an alert for each affected platform as well as a link to a graph showing the history of scores for this test. There is also a link to a treeherder page showing the Talos jobs in a pushlog format.

To learn more about the regressing test(s), please see: https://wiki.mozilla.org/Performance_sheriffing/Talos/Tests

For information on reproducing and debugging the regression, either on try or locally, see: https://wiki.mozilla.org/Performance_sheriffing/Talos/Running

*** Please let us know your plans within 3 business days, or the offending patch(es) will be backed out! ***

Our wiki page outlines the common responses and expectations: https://wiki.mozilla.org/Performance_sheriffing/Talos/RegressionBugsHandling

Product: Testing → Firefox Build System
Flags: needinfo?(nfroyd)

I'm inclined to accept this -- enabling xLTO across platforms is probably more important than regressing two perf tests on win32. bholley as the owner what do you think?

Flags: needinfo?(nfroyd) → needinfo?(bobbyholley)

(In reply to Eric Rahm [:erahm] from comment #1)

I'm inclined to accept this -- enabling xLTO across platforms is probably more important than regressing two perf tests on win32. bholley as the owner what do you think?

bloom-basic measures raw selector matching throughput, and was the microbenchmark I optimized hardest for stylo. I think a 25% regression on that metric deserves some investigation, perhaps resulting in some #[inline(always)] sprinkled within the inner loop. I can't drive this investigation but I can assist, so whoever does that should ping me to get started.

Flags: needinfo?(bobbyholley)

It ought to be relatively straightforward to take two builds and profile the test. bloom-basic.html can run on its own without any talos plumbing.

(In reply to :dmajor from comment #3)

It ought to be relatively straightforward to take two builds and profile the test. bloom-basic.html can run on its own without any talos plumbing.

Yes. There's even a copy online at [1] though I'm not sure if the in-tree tests are still identical.

One important thing to check would be the performance running parallel vs serial. I believe the numbers in talos are for a parallel run, but checking the sequential performance is probably a better starting point. You can trigger sequential execution by ctrl-clicking to load in a background tab or launching with STYLO_THREADS=1 in your env.

[1] https://bholley.net/testcases/style-perf-tests/perf-reftest/bloom-basic.html

I can repro this, serial and parallel, between the two nightlies that hgweb links to in https://hg.mozilla.org/integration/autoland/rev/402b26b7e514ffc7a4c3e5ff289a221d92fd3e38. (Since autoland doesn't publish symbols anymore, I have to go to m-c builds or use nightlies.)

Nice.

So as for some general context here: the basic path of this microbenchmark is to generate a huge number of selectors which can all be trivially rejected with the bloom filter (looking up a precomputed hash in an array).

The rejection happens at [1], so it's possible that's gotten slower somehow. It's also worth noting that I found this test to be very sensitive to the size of the Rule [2] struct, since that's the type in the giant array that we're iterating over when matching, and thus making it smaller increases the number of matches we can perform on a cache line before the next miss. It seems unlikely that xLTO would impact memory layout, but it's certainly an angle to consider.

[1] https://searchfox.org/mozilla-central/rev/38c88cbf4be87dfa0636d15a5d3599a8ea0d1a72/servo/components/selectors/matching.rs#196
[2] https://searchfox.org/mozilla-central/rev/38c88cbf4be87dfa0636d15a5d3599a8ea0d1a72/servo/components/style/stylist.rs#2278

So it all comes down to this one large function (large in terms of codegen, inlined stuff):

xul.dll!style::selector_map::SelectorMap<style::stylist::Rule>::get_matching_rules<style::gecko::wrapper::GeckoElement,mut closure*><itself>, 1260
xul.dll!style::selector_map::SelectorMap<style::stylist::Rule>::get_matching_rules<style::gecko::wrapper::GeckoElement,mut closure*><itself>, 1578

I have a list of hot instruction addresses in each build, I'll try to map them back to disassembly and/or source...

Brace yourself.

In both builds, there's a point in a hot loop where the compiler has inserted some padding in order to align an upcoming instruction. But, for whatever reason, the fast build used a seven-byte nop instruction, and the slow build used seven one-byte nop instructions.

xul!style::selector_map::SelectorMap<style::stylist::Rule>::get_matching_rules<style::gecko::wrapper::GeckoElement,mut closure*>+0x49 [z:\task_1563488614\build\src\servo\components\style\selector_map.rs @ 274]:
10f83589 0f1f8000000000  nop     dword ptr [eax]

became

xul!style::selector_map::SelectorMap<style::stylist::Rule>::get_matching_rules<style::gecko::wrapper::GeckoElement,mut closure*>+0x49 [z:\task_1564486738\build\src\servo\components\style\selector_map.rs @ 274]:
14414359 90              nop
1441435a 90              nop
1441435b 90              nop
1441435c 90              nop
1441435d 90              nop
1441435e 90              nop
1441435f 90              nop

I was super skeptical that this would have any impact on performance, until I hex-edited the nops in a live process, reverting them to the old style, and the regression disappeared!

The single-byte nops are emitted because when doing the code-generation at link time, we've lost some of the information about the target CPU, so the backend thinks the target doesn't support multi-byte nops: https://github.com/llvm/llvm-project/blob/2463239777bfb41f63bb50d3eb931719612ddd0a/llvm/lib/Target/X86/MCTargetDesc/X86AsmBackend.cpp#L341-L347

By default, the linker chooses a "generic" 32-bit CPU to optimize for,
and LLVM's "generic" 32-bit CPU model doesn't include some features that
are helpful for performance on microbenchmarks. We explicitly specify a
CPU model to ensure the model we want is selected.

The patch in comment 10 makes the multi-byte nops come back for the function in question. The function does not appear to be as large as dmajor's numbers in comment 7 (it's only about 470 bytes, at least according to the crashreporter's xul.sym file), though, which I don't understand.

Nice! Your try build looks good to me. The numbers in comment 7 were sample counts, so don't worry about the function length.

143f7529 0f1f8000000000 nop dword ptr [eax]

Attachment #9082721 - Attachment description: Bug 1568450 - explicitly specify a cpu for 32-bit windows linking; r=#build → Bug 1568450 - explicitly specify a cpu for LTO linking on Windows; r=#build
Pushed by nfroyd@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/d1fe64b4ba20
explicitly specify a cpu for LTO linking on Windows; r=dmajor
Status: NEW → RESOLVED
Closed: 1 year ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla70

== Change summary for alert #22296 (as of Tue, 06 Aug 2019 12:04:19 GMT) ==

Improvements:

23% perf_reftest_singletons bloom-basic.html windows7-32-shippable opt e10s stylo 55.25 -> 42.64
21% perf_reftest_singletons bloom-basic-2.html windows7-32-shippable opt e10s stylo 54.48 -> 42.93
20% perf_reftest_singletons bloom-basic.html windows7-32-shippable opt e10s stylo 53.97 -> 43.02

For up to date results, see: https://treeherder.mozilla.org/perf.html#/alerts?id=22296

You need to log in before you can comment on or make changes to this bug.