5.58 - 196.87% perf_reftest_singletons / tp5o / tsvg_static / tsvgr_opacity (linux64-shippable|-qr, macosx1014-64-shippable, windows10-64-shippable|-qr|7-32-shippable) regression on push 5dd01231a71cfd24ae09123ce27d1c75f6b30e03 (Tue November 26 2019)
Categories
(Testing :: Talos, defect, P3)
Tracking
(firefox70 unaffected, firefox71 unaffected)
People
(Reporter: alexandrui, Unassigned)
References
(Depends on 1 open bug, Regression)
Details
(4 keywords)
Attachments
(1 file)
623.55 KB, image/png
Talos has detected a Firefox performance regression from push 5dd01231a71cfd24ae09123ce27d1c75f6b30e03.
Since you authored one of the patches included in that push, we need your help to address this regression.
Regressions:
197% perf_reftest_singletons slow-selector-1.html linux64-shippable opt e10s stylo 0.03 -> 0.08
188% perf_reftest_singletons only-children-1.html windows7-32-shippable opt e10s stylo 0.03 -> 0.08
169% perf_reftest_singletons slow-selector-1.html windows7-32-shippable opt e10s stylo 0.03 -> 0.08
144% perf_reftest_singletons slow-selector-2.html windows7-32-shippable opt e10s stylo 0.03 -> 0.07
141% perf_reftest_singletons slow-selector-2.html linux64-shippable-qr opt e10s stylo 0.03 -> 0.08
140% perf_reftest_singletons slow-selector-1.html windows10-64-shippable opt e10s stylo 0.03 -> 0.08
139% perf_reftest_singletons only-children-1.html windows10-64-shippable-qr opt e10s stylo 0.03 -> 0.08
135% tp5o windows7-32-shippable opt e10s stylo 116.37 -> 273.18
133% tp5o windows10-64-shippable-qr opt e10s stylo 115.85 -> 270.08
132% perf_reftest_singletons slow-selector-2.html linux64-shippable opt e10s stylo 0.03 -> 0.08
129% perf_reftest_singletons slow-selector-1.html linux64-shippable-qr opt e10s stylo 0.03 -> 0.08
129% perf_reftest_singletons slow-selector-1.html windows10-64-shippable-qr opt e10s stylo 0.03 -> 0.08
129% perf_reftest_singletons slow-selector-2.html windows10-64-shippable-qr opt e10s stylo 0.03 -> 0.08
126% tp5o_webext macosx1014-64-shippable opt e10s stylo 244.02 -> 550.83
125% tp5o windows10-64-shippable opt e10s stylo 120.06 -> 270.44
124% tp5o_webext windows10-64-shippable opt e10s stylo 158.04 -> 354.15
122% tp5o_webext windows10-64-shippable-qr opt e10s stylo 155.21 -> 343.96
121% perf_reftest_singletons only-children-1.html linux64-shippable opt e10s stylo 0.04 -> 0.08
120% tp5o_webext windows10-64-shippable-qr opt e10s stylo 155.54 -> 342.60
118% perf_reftest_singletons only-children-1.html linux64-shippable-qr opt e10s stylo 0.04 -> 0.08
118% tp5o_webext windows7-32-shippable opt e10s stylo 154.33 -> 336.25
115% perf_reftest_singletons display-none-1.html macosx1014-64-shippable opt e10s stylo 0.06 -> 0.13
112% tp5o macosx1014-64-shippable opt e10s stylo 170.86 -> 361.56
110% perf_reftest_singletons slow-selector-2.html macosx1014-64-shippable opt e10s stylo 0.06 -> 0.12
108% perf_reftest_singletons only-children-1.html macosx1014-64-shippable opt e10s stylo 0.06 -> 0.12
107% perf_reftest_singletons display-none-1.html windows7-32-shippable opt e10s stylo 0.04 -> 0.08
107% perf_reftest_singletons slow-selector-2.html windows10-64-shippable opt e10s stylo 0.04 -> 0.08
104% perf_reftest_singletons display-none-1.html linux64-shippable-qr opt e10s stylo 0.04 -> 0.08
104% perf_reftest_singletons display-none-1.html windows10-64-shippable-qr opt e10s stylo 0.04 -> 0.08
104% perf_reftest_singletons only-children-1.html windows10-64-shippable opt e10s stylo 0.04 -> 0.08
104% perf_reftest_singletons slow-selector-1.html macosx1014-64-shippable opt e10s stylo 0.06 -> 0.12
100% perf_reftest_singletons display-none-1.html linux64-shippable opt e10s stylo 0.04 -> 0.08
100% perf_reftest_singletons display-none-1.html windows10-64-shippable opt e10s stylo 0.04 -> 0.08
66% tp5o_webext linux64-shippable-qr opt e10s stylo 170.53 -> 282.40
62% tp5o_webext linux64-shippable opt e10s stylo 171.48 -> 278.09
53% tp5o linux64-shippable opt e10s stylo 130.95 -> 199.72
51% tp5o linux64-shippable-qr opt e10s stylo 130.23 -> 196.47
39% tp5o_webext responsiveness windows10-64-shippable-qr opt e10s stylo 1.57 -> 2.19
39% tp5o_webext responsiveness windows10-64-shippable opt e10s stylo 1.59 -> 2.21
36% tp5o_webext responsiveness windows7-32-shippable opt e10s stylo 1.44 -> 1.97
31% tp5o_webext responsiveness linux64-shippable-qr opt e10s stylo 2.43 -> 3.19
30% tp5o_webext responsiveness linux64-shippable opt e10s stylo 2.02 -> 2.63
18% tsvg_static windows10-64-shippable opt e10s stylo 37.83 -> 44.71
17% tsvg_static windows7-32-shippable opt e10s stylo 43.63 -> 51.14
10% tp5o_scroll windows10-64-shippable-qr opt e10s stylo 1.48 -> 1.62
8% tp5o_scroll linux64-shippable-qr opt e10s stylo 2.09 -> 2.26
8% tsvgr_opacity windows10-64-shippable opt e10s stylo 106.16 -> 114.41
6% tp5o_scroll linux64-shippable opt e10s stylo 1.10 -> 1.16
Improvements:
15% tsvg_static linux64-shippable opt e10s stylo 54.96 -> 46.65
6% tsvg_static macosx1014-64-shippable opt e10s stylo 48.50 -> 45.49
3% tsvg_static linux64-shippable-qr opt e10s stylo 46.43 -> 44.86
You can find links to graphs and comparison views for each of the above tests at: https://treeherder.mozilla.org/perf.html#/alerts?id=24166
On the page above you can see an alert for each affected platform as well as a link to a graph showing the history of scores for this test. There is also a link to a treeherder page showing the Talos jobs in a pushlog format.
To learn more about the regressing test(s), please see: https://wiki.mozilla.org/TestEngineering/Performance/Talos
For information on reproducing and debugging the regression, either on try or locally, see: https://wiki.mozilla.org/TestEngineering/Performance/Talos/Running
*** Please let us know your plans within 3 business days, or the offending patch(es) will be backed out! ***
Our wiki page outlines the common responses and expectations: https://wiki.mozilla.org/TestEngineering/Performance/Talos/RegressionBugsHandling
Comment 1•5 years ago
I am investigating this with Rob Wood. Not sure what exactly needs to be done at this point.
Comment 2•5 years ago
The priority flag is not set for this bug.
:rwood, could you have a look please?
For more information, please visit auto_nag documentation.
Comment 3•5 years ago
(In reply to Benjamin Forehand Jr[:b4hand] from comment #1)
I am investigating this with Rob Wood. Not sure what exactly needs to be done at this point.
Hey Benjamin, did you find any way to make wptserve run a bit faster, at least when it comes to the perf_reftest_singletons test? If not, IMO it is probably fine to accept this as the new baseline, since it's not a true regression; the timings are simply different because of the switch from mozhttpd to wptserve.
Comment 4•5 years ago
I looked but I couldn't see what was slowing it down. I can try and look again on Monday just to be sure. I'll tag myself for an update on Monday.
Comment 5•5 years ago
Hey Rob, I couldn't find any more info on why these regression numbers would happen. If you think it is fine then I'm all for it.
Comment 6•5 years ago
Comment 7•5 years ago
Running a tp5-based Talos test locally (|mach talos-test -a tp5o_webext --enable-webrender|) shows obvious issues, most likely caused by the new HTTP server serving a different MIME type for the .png extension. It opens 125 download dialogs that stay open on top of the content until the test finishes.
Removing the page that opened them (bild.de) from the manifest speeds the test up by 20% on my MBP. This probably differs based on the platform and configuration used. The proper fix for this particular issue would be to ensure that the MIME types match and/or that the test suite does not try to save the files.
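As a hedged illustration (not taken from the actual test harness), the sketch below fetches a PNG from the local test server with Python's standard library and prints the Content-Type it reports; the port and path are hypothetical placeholders, so substitute the values from your own local Talos run.

```python
# Minimal sketch for checking the MIME type served for a PNG.
# The port and path below are hypothetical placeholders; use the values
# printed in your own local Talos log instead.
import http.client

HOST, PORT = "127.0.0.1", 54504                      # placeholder local port
PATH = "/tests/tp5n/bild.de/www.bild.de/logo.png"    # hypothetical PNG path

conn = http.client.HTTPConnection(HOST, PORT, timeout=10)
conn.request("GET", PATH)
resp = conn.getresponse()
# A .png served as anything other than image/png (e.g. application/octet-stream)
# makes Firefox offer a download instead of rendering the image inline.
print(resp.status, resp.getheader("Content-Type"))
conn.close()
```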
Comment 8•5 years ago
Yeah, we fixed this. Are you using the updated pageset?
Comment 9•5 years ago
(In reply to Benjamin Forehand Jr[:b4hand] from comment #8)
Yeah we fixed this, are you using the updated pageset?
I am, or at least I was supposed to. To verify this I removed the tp5n directory from my working copy and let talos fetch the new pages. The download dialogs went away and the score stayed roughly the same.
I also tried this on my second working copy (without removing the tp5n directory this time), and this time the same page (bild.de) times out:
18:44:08 INFO - PID 96127 | __FAILTimeout in tp5n__FAIL
18:44:08 INFO - PID 96127 | __FAILTimeout (3/3) exceeded on http://127.0.0.1:54504/tests/tp5n/bild.de/www.bild.de/index.html__FAIL
18:44:08 INFO - PID 96127 | console.error: "You should not call finishTest without having first initted the Profiler"
It looks like I hit an unrelated bug with pageset update failing.
Comment 10•5 years ago
These are zips with profiles from Treeherder for some of these tests. I didn't get the tsvg tests, but I can if needed. These include tp5o, perf_reftest_singletons, and tp5o_webext.
mozhttpd:
https://drive.google.com/open?id=1HLpCkiWcgr7UICfmzrntHTo83jlaEgeD
wptserve:
https://drive.google.com/open?id=1Jvt7oTD0oVRUawaLabMINf2RTxYrVs2d
Comment 11•5 years ago
(In reply to Benjamin Forehand Jr[:b4hand] from comment #10)
These are zips with profiles from treehearder for some of these tests. I didn't get the tsvg tests, but I can if needed. These will include t5po_webext, reft_test_singletons, and tp5o_webext.
mozhttpd:
https://drive.google.com/open?id=1HLpCkiWcgr7UICfmzrntHTo83jlaEgeD
wptserve:
https://drive.google.com/open?id=1Jvt7oTD0oVRUawaLabMINf2RTxYrVs2d
Thanks Benjamin!
Andrew, would you have some time to have a look at the above profiles? We are wondering why switching the talos webserver from mozhttpd to wptserve has caused such large regressions, and if it is ok to accept the new baselines or if we need to revert this change. Thanks!
Comment 12•5 years ago
Sorry for the delay, although I'm not sure I'll be of much help here.
I've put up profiles from this one:
188% perf_reftest_singletons only-children-1.html windows7-32-shippable opt e10s stylo 0.03 -> 0.08
wptserve (gecko 73)
https://perfht.ml/3bDatv8
mozhttpd (gecko 72)
https://perfht.ml/2OPctXq
But I don't think I can find the issue there; the test is so short that there is only one sample.
This one seems like a good one:
126% tp5o_webext macosx1014-64-shippable opt e10s stylo 244.02 -> 550.83
But I don't see a profile for that in the profile_tp5o_webext_osx.zip from the provided links.
If we can find profiles for a longer running test, that may help.
Comment 13•5 years ago
:b4hand are you still working on this?
Comment 14•5 years ago
(In reply to Dave Hunt [:davehunt] [he/him] ⌚BST from comment #13)
:b4hand are you still working on this?
No, not at the moment.
Comment 15•5 years ago
(In reply to Andrew Creskey [:acreskey] [he/him] from comment #12)
This one seems like a good one:
126% tp5o_webext macosx1014-64-shippable opt e10s stylo 244.02 -> 550.83
But I don't see a profile for that in the profile_tp5o_webext_osx.zip from the provided links.
If we can find profiles for a longer running test, that may help.
Where can this job be found on Treeherder?
Comment 16•5 years ago
Hey James, maybe you have an idea or some feedback. This regression for Talos started when we moved from httpd.js to wptserve. As the numbers show, there is a huge decrease in performance when delivering the pages. Are you aware of any performance bottlenecks in wptserve, or is it just the move to another process? httpd.js runs inside of Firefox, so maybe moving to an external process adds the extra time?
Comment 17•5 years ago
So I don't know off the top of my head, but I can suggest some things to look for:
- What does the network profile look like in each case? Are we fetching more resources simultaneously with httpd.js vs. wptserve? Is httpd.js just faster at producing the responses? (See the rough timing sketch below.)
- What does a profile of wptserve look like? Can we get it running under py-spy or vmprof to see if there are any obvious bottlenecks?
"wptserve is slow" doesn't seem like a particularly controversial proposition, but if we can figure out how to speed it up, that should pay dividends across several test suites.
Comment 18•5 years ago
(In reply to Henrik Skupin (:whimboo) [⌚️UTC+2] from comment #15)
(In reply to Andrew Creskey [:acreskey] [he/him] from comment #12)
This one seems like a good one:
126% tp5o_webext macosx1014-64-shippable opt e10s stylo 244.02 -> 550.83
But I don't see a profile for that in the profile_tp5o_webext_osx.zip from the provided links.
If we can find profiles for a longer running test, that may help.
Where can this job be found on Treeherder?
Henrik, to be honest, I'm not sure.
I was looking into the profiles shared here:
https://bugzilla.mozilla.org/show_bug.cgi?id=1600034#c10
If this work is a priority, I think the approach suggested by James in Comment 17 would be the best.
Comment 19•5 years ago
(In reply to Henrik Skupin (:whimboo) [⌚️UTC+2] from comment #15)
(In reply to Andrew Creskey [:acreskey] [he/him] from comment #12)
This one seems like a good one:
126% tp5o_webext macosx1014-64-shippable opt e10s stylo 244.02 -> 550.83
But I don't see a profile for that in the profile_tp5o_webext_osx.zip from the provided links.
If we can find profiles for a longer running test, that may help.
Where can this job be found on Treeherder?
The original alert from comment 0 is https://treeherder.mozilla.org/perf.html#/alerts?id=24166. From here you can link to a graph for any of the alerts, and then zoom into the highlighted alert. For example https://treeherder.mozilla.org/perf.html#/graphs?highlightAlerts=1&selected=1926850,984162607&series=autoland,1926850,1,1&timerange=31536000&zoom=1574796064638,1574809959298,137.15869748709474,379.8505568343503
Usually, clicking a data point will show a popup with a link to the job. I'm currently not seeing the job link for some reason. You can alternatively click the revision, and via hg.mozilla.org click through to the Treeherder results view. I also tried this myself but no jobs were shown. It's possible that this data has expired. Ionut do you know?
Looking at a more recent push, the following should filter Treeherder to show the jobs responsible for the tp5o_webext test: https://treeherder.mozilla.org/#/jobs?repo=autoland&group_state=expanded&searchStr=t%28g5%29
Comment 20•5 years ago
(In reply to Dave Hunt [:davehunt] [he/him] ⌚BST from comment #19)
[...]
Usually, clicking a data point will show a popup with a link to the job. I'm currently not seeing the job link for some reason. You can alternatively click the revision, and via hg.mozilla.org click through to the Treeherder results view. I also tried this myself but no jobs were shown. It's possible that this data has expired. Ionut do you know?
That's precisely the case. Treeherder jobs older than 4 months are removed from the database.
Thus, Perfherder just isn't able to show them.
Comment 21•5 years ago
(In reply to Andrew Creskey [:acreskey] [he/him] from comment #12)
This one seems like a good one:
126% tp5o_webext macosx1014-64-shippable opt e10s stylo 244.02 -> 550.83
But I don't see a profile for that in the profile_tp5o_webext_osx.zip from the provided links.
If we can find profiles for a longer running test, that may help.
There are lots of profiles in that zip file; they are for individual page loads. How is the number (550.83) computed for this tp5o_webext test suite?
Comment 22•5 years ago
(In reply to Florian Quèze [:florian] from comment #21)
(In reply to Andrew Creskey [:acreskey] [he/him] from comment #12)
This one seems like a good one:
126% tp5o_webext macosx1014-64-shippable opt e10s stylo 244.02 -> 550.83
But I don't see a profile for that in the profile_tp5o_webext_osx.zip from the provided links.
If we can find profiles for a longer running test, that may help.
There are lots of profiles in that zip file; they are for individual page loads. How is the number (550.83) computed for this tp5o_webext test suite?
Florian, unfortunately I don't know; I'm really not familiar with any of the Talos tests.
When looking for the profiles, I was hoping to find some obvious reason why that test's result had doubled, such as waiting on a resource or similar.
Comment 23•5 years ago
I've run tp5o_webext locally, narrowed down to a single site with and without the regressing patch backed out:
- https://share.firefox.dev/2Djirxl was using mozhttpd as the webserver for the test (5dd01231a71cfd24ae09123ce27d1c75f6b30e03 backed out)
- https://share.firefox.dev/2DhHlxe was using wptserve as the webserver for the test
Comment 24•5 years ago
(In reply to Florian Quèze [:florian] (PTO until August 10th) from comment #21)
How is the number (550.83) computed for this tp5o_webext test suite?
I believe this is the time taken for the "load" event to be fired.
(In reply to Andrew Creskey [:acreskey] [he/him] from comment #22)
When looking for the profiles I was hoping to find some obvious reason why that test had doubled in its result -- waiting on a resource or similar.
I've looked over the profiles gathered in comment 23 and whilst the network track looks quite different between them, I'm unsure how to interpret the differences. Do you see anything obvious?
Comment 25•5 years ago
Gijs suggested looking at the HTTP responses of the two webservers to see if there are any differences. I noticed that mozhttpd provided HTTP/1.0 responses with a Content-Length header, but wptserve was providing HTTP/1.1 responses without one. Adding the Content-Length header in wptserve appears to have a positive impact on performance locally. I have submitted a patch in bug 1657445. Note that as Talos uses wptserve from PyPI, we'll need a new release with this fix and then bump the Talos requirements.
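To illustrate the difference (a minimal sketch under assumptions: the ports and path are hypothetical placeholders for local runs, not values from the actual jobs), the snippet below prints the HTTP version, Content-Length, and Connection headers each server returns. One plausible reason the missing header hurts is that an HTTP/1.1 response without Content-Length (and without chunked encoding) must be terminated by closing the connection, which defeats connection reuse for pages with many subresources.

```python
# Illustrative comparison of the two servers' response headers; the ports and
# path are hypothetical placeholders for local mozhttpd and wptserve runs.
import http.client

def describe(host, port, path="/tests/tp5n/bild.de/www.bild.de/index.html"):
    conn = http.client.HTTPConnection(host, port, timeout=10)
    conn.request("GET", path)
    resp = conn.getresponse()
    resp.read()  # drain the body so the connection state stays consistent
    # resp.version is 10 for HTTP/1.0 and 11 for HTTP/1.1
    print(f"{host}:{port} HTTP/{resp.version / 10:.1f}",
          "Content-Length:", resp.getheader("Content-Length"),
          "Connection:", resp.getheader("Connection"))
    conn.close()

describe("127.0.0.1", 8080)    # hypothetical local mozhttpd port
describe("127.0.0.1", 54504)   # hypothetical local wptserve port
```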
Comment 26•5 years ago
(In reply to Dave Hunt [:davehunt] [he/him] ⌚BST from comment #23)
I've run tp5o_webext locally, narrowed down to a single site with and without the regressing patch backed out:
- https://share.firefox.dev/2Djirxl was using mozhttpd as the webserver for the test (5dd01231a71cfd24ae09123ce27d1c75f6b30e03 backed out)
This profile shows duplicated network markers without duration; is this one of our known network marker bugs?
Comment 27•5 years ago
This looks different to our existing bugs but I can't be sure. I think that for most of our existing bugs we lack the end marker, but here we lack the start marker.
I've seen this pattern happening when reloading a page with some long-running requests, which seemed plausible for this case, but it doesn't seem to fit well here. It's difficult to say without knowing the context better!
Also the pattern looks different (and more confusing) for the main process vs the content process.