3.69 - 24.35% perf_reftest_singletons bloom-basic-2.html / perf_reftest_singletons bloom-basic.html / tabpaint (windows10-64-shippable-qr, windows7-32-shippable) regression on push 8ba3c1292475b96e2ccb46c3232c929863451ff6
Categories
(Firefox Build System :: General, defect)
Tracking
(firefox-esr60 unaffected, firefox-esr68 unaffected, firefox68 unaffected, firefox69 unaffected, firefox70 fixed)
Tracking | Status | |
---|---|---|
firefox-esr60 | --- | unaffected |
firefox-esr68 | --- | unaffected |
firefox68 | --- | unaffected |
firefox69 | --- | unaffected |
firefox70 | --- | fixed |
People
(Reporter: igoldan, Assigned: froydnj)
References
(Regression)
Details
(4 keywords)
Attachments
(1 file)
Talos has detected a Firefox performance regression from push:
As author of one of the patches included in that push, we need your help to address this regression.
Regressions:
24% perf_reftest_singletons bloom-basic-2.html windows7-32-shippable opt e10s stylo 42.38 -> 52.70
23% perf_reftest_singletons bloom-basic.html windows7-32-shippable opt e10s stylo 43.38 -> 53.54
Improvements:
7% perf_reftest_singletons parent-basic-singleton.html windows7-32-shippable opt e10s stylo 111.11 -> 103.52
2% tp5o_webext windows10-64-shippable opt e10s stylo 157.44 -> 153.68
You can find links to graphs and comparison views for each of the above tests at: https://treeherder.mozilla.org/perf.html#/alerts?id=21991
On the page above you can see an alert for each affected platform as well as a link to a graph showing the history of scores for this test. There is also a link to a treeherder page showing the Talos jobs in a pushlog format.
To learn more about the regressing test(s), please see: https://wiki.mozilla.org/Performance_sheriffing/Talos/Tests
For information on reproducing and debugging the regression, either on try or locally, see: https://wiki.mozilla.org/Performance_sheriffing/Talos/Running
*** Please let us know your plans within 3 business days, or the offending patch(es) will be backed out! ***
Our wiki page outlines the common responses and expectations: https://wiki.mozilla.org/Performance_sheriffing/Talos/RegressionBugsHandling
Reporter | ||
Updated•5 years ago
|
Reporter | ||
Updated•5 years ago
|
Comment 1•5 years ago
|
||
I'm inclined to accept this -- enabling xLTO across platforms is probably more important than regressing two perf tests on win32. bholley as the owner what do you think?
Comment 2•5 years ago
|
||
(In reply to Eric Rahm [:erahm] from comment #1)
I'm inclined to accept this -- enabling xLTO across platforms is probably more important than regressing two perf tests on win32. bholley as the owner what do you think?
bloom-basic measures raw selector matching throughput, and was the microbenchmark I optimized hardest for stylo. I think a 25% regression on that metric deserves some investigation, perhaps resulting in some #[inline(always)] sprinkled within the inner loop. I can't drive this investigation but I can assist, so whoever does that should ping me to get started.
It ought to be relatively straightforward to take two builds and profile the test. bloom-basic.html can run on its own without any talos plumbing.
Comment 4•5 years ago
|
||
(In reply to :dmajor from comment #3)
It ought to be relatively straightforward to take two builds and profile the test. bloom-basic.html can run on its own without any talos plumbing.
Yes. There's even a copy online at [1] though I'm not sure if the in-tree tests are still identical.
One important thing to check would be the performance running parallel vs serial. I believe the numbers in talos are for a parallel run, but checking the sequential performance is probably a better starting point. You can trigger sequential execution by ctrl-clicking to load in a background tab or launching with STYLO_THREADS=1 in your env.
[1] https://bholley.net/testcases/style-perf-tests/perf-reftest/bloom-basic.html
I can repro this, serial and parallel, between the two nightlies that hgweb links to in https://hg.mozilla.org/integration/autoland/rev/402b26b7e514ffc7a4c3e5ff289a221d92fd3e38. (Since autoland doesn't publish symbols anymore, I have to go to m-c builds or use nightlies.)
Comment 6•5 years ago
•
|
||
Nice.
So as for some general context here: the basic path of this microbenchmark is to generate a huge number of selectors which can all be trivially rejected with the bloom filter (looking up a precomputed hash in an array).
The rejection happens at [1], so it's possible that's gotten slower somehow. It's also worth noting that I found this test to be very sensitive to the size of the Rule [2] struct, since that's the type in the giant array that we're iterating over when matching, and thus making it smaller increases the number of matches we can perform on a cache line before the next miss. It seems unlikely that xLTO would impact memory layout, but it's certainly an angle to consider.
[1] https://searchfox.org/mozilla-central/rev/38c88cbf4be87dfa0636d15a5d3599a8ea0d1a72/servo/components/selectors/matching.rs#196
[2] https://searchfox.org/mozilla-central/rev/38c88cbf4be87dfa0636d15a5d3599a8ea0d1a72/servo/components/style/stylist.rs#2278
So it all comes down to this one large function (large in terms of codegen, inlined stuff):
xul.dll!style::selector_map::SelectorMap<style::stylist::Rule>::get_matching_rules<style::gecko::wrapper::GeckoElement,mut closure*><itself>, 1260
xul.dll!style::selector_map::SelectorMap<style::stylist::Rule>::get_matching_rules<style::gecko::wrapper::GeckoElement,mut closure*><itself>, 1578
I have a list of hot instruction addresses in each build, I'll try to map them back to disassembly and/or source...
Brace yourself.
In both builds, there's a point in a hot loop where the compiler has inserted some padding in order to align an upcoming instruction. But, for whatever reason, the fast build used a seven-byte nop instruction, and the slow build used seven one-byte nop instructions.
xul!style::selector_map::SelectorMap<style::stylist::Rule>::get_matching_rules<style::gecko::wrapper::GeckoElement,mut closure*>+0x49 [z:\task_1563488614\build\src\servo\components\style\selector_map.rs @ 274]:
10f83589 0f1f8000000000 nop dword ptr [eax]
became
xul!style::selector_map::SelectorMap<style::stylist::Rule>::get_matching_rules<style::gecko::wrapper::GeckoElement,mut closure*>+0x49 [z:\task_1564486738\build\src\servo\components\style\selector_map.rs @ 274]:
14414359 90 nop
1441435a 90 nop
1441435b 90 nop
1441435c 90 nop
1441435d 90 nop
1441435e 90 nop
1441435f 90 nop
I was super skeptical that this would have any impact on performance, until I hex-edited the nops in a live process, reverting them to the old style, and the regression disappeared!
The single-byte nops are emitted because when doing the code-generation at link time, we've lost some of the information about the target CPU, so the backend thinks the target doesn't support multi-byte nops: https://github.com/llvm/llvm-project/blob/2463239777bfb41f63bb50d3eb931719612ddd0a/llvm/lib/Target/X86/MCTargetDesc/X86AsmBackend.cpp#L341-L347
Assignee | ||
Comment 10•5 years ago
|
||
By default, the linker chooses a "generic" 32-bit CPU to optimize for,
and LLVM's "generic" 32-bit CPU model doesn't include some features that
are helpful for performance on microbenchmarks. We explicitly specify a
CPU model to ensure the model we want is selected.
Assignee | ||
Comment 11•5 years ago
|
||
The patch in comment 10 makes the multi-byte nops come back for the function in question. The function does not appear to be as large as dmajor's numbers in comment 7 (it's only about 470 bytes, at least according to the crashreporter's xul.sym file), though, which I don't understand.
Comment 12•5 years ago
|
||
Nice! Your try build looks good to me. The numbers in comment 7 were sample counts, so don't worry about the function length.
143f7529 0f1f8000000000 nop dword ptr [eax]
Updated•5 years ago
|
Comment 13•5 years ago
|
||
Comment 14•5 years ago
|
||
bugherder |
Updated•5 years ago
|
Comment 15•5 years ago
|
||
== Change summary for alert #22296 (as of Tue, 06 Aug 2019 12:04:19 GMT) ==
Improvements:
23% perf_reftest_singletons bloom-basic.html windows7-32-shippable opt e10s stylo 55.25 -> 42.64
21% perf_reftest_singletons bloom-basic-2.html windows7-32-shippable opt e10s stylo 54.48 -> 42.93
20% perf_reftest_singletons bloom-basic.html windows7-32-shippable opt e10s stylo 53.97 -> 43.02
For up to date results, see: https://treeherder.mozilla.org/perf.html#/alerts?id=22296
Updated•5 years ago
|
Updated•5 years ago
|
Updated•3 years ago
|
Description
•