Closed Bug 1600034 Opened 5 years ago Closed 3 years ago

5.58 - 196.87% perf_reftest_singletons / tp5o / tsvg_static / tsvgr_opacity (linux64-shippable|-qr, macosx1014-64-shippable, windows10-64-shippable|-qr|7-32-shippable) regression on push 5dd01231a71cfd24ae09123ce27d1c75f6b30e03 (Tue November 26 2019)

Categories

(Testing :: Talos, defect, P3)

defect

Tracking

(firefox70 unaffected, firefox71 unaffected)

RESOLVED INCOMPLETE
Tracking Status
firefox70 --- unaffected
firefox71 --- unaffected

People

(Reporter: alexandrui, Unassigned)

References

(Depends on 1 open bug, Regression)

Details

(4 keywords)

Attachments

(1 file)

Talos has detected a Firefox performance regression from push:

https://hg.mozilla.org/integration/autoland/pushloghtml?changeset=5dd01231a71cfd24ae09123ce27d1c75f6b30e03

As the author of one of the patches included in that push, we need your help to address this regression.

Regressions:

197% perf_reftest_singletons slow-selector-1.html linux64-shippable opt e10s stylo 0.03 -> 0.08
188% perf_reftest_singletons only-children-1.html windows7-32-shippable opt e10s stylo 0.03 -> 0.08
169% perf_reftest_singletons slow-selector-1.html windows7-32-shippable opt e10s stylo 0.03 -> 0.08
144% perf_reftest_singletons slow-selector-2.html windows7-32-shippable opt e10s stylo 0.03 -> 0.07
141% perf_reftest_singletons slow-selector-2.html linux64-shippable-qr opt e10s stylo 0.03 -> 0.08
140% perf_reftest_singletons slow-selector-1.html windows10-64-shippable opt e10s stylo 0.03 -> 0.08
139% perf_reftest_singletons only-children-1.html windows10-64-shippable-qr opt e10s stylo 0.03 -> 0.08
135% tp5o windows7-32-shippable opt e10s stylo 116.37 -> 273.18
133% tp5o windows10-64-shippable-qr opt e10s stylo 115.85 -> 270.08
132% perf_reftest_singletons slow-selector-2.html linux64-shippable opt e10s stylo 0.03 -> 0.08
129% perf_reftest_singletons slow-selector-1.html linux64-shippable-qr opt e10s stylo 0.03 -> 0.08
129% perf_reftest_singletons slow-selector-1.html windows10-64-shippable-qr opt e10s stylo 0.03 -> 0.08
129% perf_reftest_singletons slow-selector-2.html windows10-64-shippable-qr opt e10s stylo 0.03 -> 0.08
126% tp5o_webext macosx1014-64-shippable opt e10s stylo 244.02 -> 550.83
125% tp5o windows10-64-shippable opt e10s stylo 120.06 -> 270.44
124% tp5o_webext windows10-64-shippable opt e10s stylo 158.04 -> 354.15
122% tp5o_webext windows10-64-shippable-qr opt e10s stylo 155.21 -> 343.96
121% perf_reftest_singletons only-children-1.html linux64-shippable opt e10s stylo 0.04 -> 0.08
120% tp5o_webext windows10-64-shippable-qr opt e10s stylo 155.54 -> 342.60
118% perf_reftest_singletons only-children-1.html linux64-shippable-qr opt e10s stylo 0.04 -> 0.08
118% tp5o_webext windows7-32-shippable opt e10s stylo 154.33 -> 336.25
115% perf_reftest_singletons display-none-1.html macosx1014-64-shippable opt e10s stylo 0.06 -> 0.13
112% tp5o macosx1014-64-shippable opt e10s stylo 170.86 -> 361.56
110% perf_reftest_singletons slow-selector-2.html macosx1014-64-shippable opt e10s stylo 0.06 -> 0.12
108% perf_reftest_singletons only-children-1.html macosx1014-64-shippable opt e10s stylo 0.06 -> 0.12
107% perf_reftest_singletons display-none-1.html windows7-32-shippable opt e10s stylo 0.04 -> 0.08
107% perf_reftest_singletons slow-selector-2.html windows10-64-shippable opt e10s stylo 0.04 -> 0.08
104% perf_reftest_singletons display-none-1.html linux64-shippable-qr opt e10s stylo 0.04 -> 0.08
104% perf_reftest_singletons display-none-1.html windows10-64-shippable-qr opt e10s stylo 0.04 -> 0.08
104% perf_reftest_singletons only-children-1.html windows10-64-shippable opt e10s stylo 0.04 -> 0.08
104% perf_reftest_singletons slow-selector-1.html macosx1014-64-shippable opt e10s stylo 0.06 -> 0.12
100% perf_reftest_singletons display-none-1.html linux64-shippable opt e10s stylo 0.04 -> 0.08
100% perf_reftest_singletons display-none-1.html windows10-64-shippable opt e10s stylo 0.04 -> 0.08
66% tp5o_webext linux64-shippable-qr opt e10s stylo 170.53 -> 282.40
62% tp5o_webext linux64-shippable opt e10s stylo 171.48 -> 278.09
53% tp5o linux64-shippable opt e10s stylo 130.95 -> 199.72
51% tp5o linux64-shippable-qr opt e10s stylo 130.23 -> 196.47
39% tp5o_webext responsiveness windows10-64-shippable-qr opt e10s stylo 1.57 -> 2.19
39% tp5o_webext responsiveness windows10-64-shippable opt e10s stylo 1.59 -> 2.21
36% tp5o_webext responsiveness windows7-32-shippable opt e10s stylo 1.44 -> 1.97
31% tp5o_webext responsiveness linux64-shippable-qr opt e10s stylo 2.43 -> 3.19
30% tp5o_webext responsiveness linux64-shippable opt e10s stylo 2.02 -> 2.63
18% tsvg_static windows10-64-shippable opt e10s stylo 37.83 -> 44.71
17% tsvg_static windows7-32-shippable opt e10s stylo 43.63 -> 51.14
10% tp5o_scroll windows10-64-shippable-qr opt e10s stylo 1.48 -> 1.62
8% tp5o_scroll linux64-shippable-qr opt e10s stylo 2.09 -> 2.26
8% tsvgr_opacity windows10-64-shippable opt e10s stylo 106.16 -> 114.41
6% tp5o_scroll linux64-shippable opt e10s stylo 1.10 -> 1.16

Improvements:

15% tsvg_static linux64-shippable opt e10s stylo 54.96 -> 46.65
6% tsvg_static macosx1014-64-shippable opt e10s stylo 48.50 -> 45.49
3% tsvg_static linux64-shippable-qr opt e10s stylo 46.43 -> 44.86

You can find links to graphs and comparison views for each of the above tests at: https://treeherder.mozilla.org/perf.html#/alerts?id=24166

On the page above you can see an alert for each affected platform as well as a link to a graph showing the history of scores for this test. There is also a link to a Treeherder page showing the Talos jobs in a pushlog format.

To learn more about the regressing test(s), please see: https://wiki.mozilla.org/TestEngineering/Performance/Talos

For information on reproducing and debugging the regression, either on try or locally, see: https://wiki.mozilla.org/TestEngineering/Performance/Talos/Running

*** Please let us know your plans within 3 business days, or the offending patch(es) will be backed out! ***

Our wiki page outlines the common responses and expectations: https://wiki.mozilla.org/TestEngineering/Performance/Talos/RegressionBugsHandling

Flags: needinfo?(bforehand)
Component: Performance → Talos

I am investigating this with Rob Wood. Not sure what exactly needs to be done at this point.

Flags: needinfo?(bforehand)

The priority flag is not set for this bug.
:rwood, could you have a look please?

For more information, please visit auto_nag documentation.

Flags: needinfo?(rwood)

(In reply to Benjamin Forehand Jr[:b4hand] from comment #1)

I am investigating this with Rob Wood. Not sure what exactly needs to be done at this point.

Hey Benjamin, did you find any way to make wptserve run a bit faster, at least for the perf_reftest_singletons test? If not, IMO it is probably fine to accept this as the new baseline, as it's not a true regression; the timings are just different because of the switch from mozhttpd to wptserve.

Flags: needinfo?(rwood) → needinfo?(bforehand)
Priority: -- → P1

I looked but I couldn't see what was slowing it down. I can try and look again on Monday just to be sure. I'll tag myself for an update on Monday.

Flags: needinfo?(bforehand)

Hey Rob, I couldn't find any more info on why these regression numbers would happen. If you think it is fine then I'm all for it.

Attached image tp5.png

Running a tp5-based Talos test locally (|mach talos-test -a tp5o_webext --enable-webrender|) shows obvious issues, most likely caused by the new HTTP server using a different MIME type for the PNG extension. It opens 125 download dialogs that stay open on top of the content until the test finishes.

Removing the page that opens them (bild.de) from the manifest speeds the test up by 20% on my MBP. This probably differs based on the platform and configuration used. The proper fix for this particular issue would be to ensure that the MIME types match and/or that the test suite does not try to save the files.
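A quick way to sanity-check this, assuming a locally running Talos web server (the host, port, and image path below are hypothetical placeholders), is to request a PNG and inspect the Content-Type it comes back with:

    import http.client

    # Hypothetical host/port/path for a locally running Talos web server.
    HOST, PORT = "127.0.0.1", 8000
    PATH = "/tests/tp5n/placeholder.png"  # placeholder PNG resource from the pageset

    conn = http.client.HTTPConnection(HOST, PORT, timeout=10)
    conn.request("GET", PATH)
    resp = conn.getresponse()
    content_type = resp.getheader("Content-Type", "")
    print(resp.status, resp.reason, "Content-Type:", content_type)
    # A mismatch here (e.g. application/octet-stream instead of image/png)
    # would explain Firefox prompting to download rather than rendering inline.
    assert content_type.startswith("image/png"), "unexpected MIME type for .png"
    conn.close()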

Flags: needinfo?(bforehand)

Yeah, we fixed this. Are you using the updated pageset?

Flags: needinfo?(bforehand)
Flags: needinfo?(mikokm)

(In reply to Benjamin Forehand Jr[:b4hand] from comment #8)

Yeah, we fixed this. Are you using the updated pageset?

I am, or at least I was supposed to. To verify this I removed the tp5n directory from my working copy and let talos fetch the new pages. The download dialogs went away and the score stayed roughly the same.

I also tried this on my second working copy (without removing the tp5n directory this time), and this time the same page (bild.de) times out:

18:44:08     INFO -  PID 96127 | __FAILTimeout in tp5n__FAIL
18:44:08     INFO -  PID 96127 | __FAILTimeout (3/3) exceeded on http://127.0.0.1:54504/tests/tp5n/bild.de/www.bild.de/index.html__FAIL
18:44:08     INFO -  PID 96127 | console.error: "You should not call finishTest without having first initted the Profiler"

It looks like I hit an unrelated bug with the pageset update failing.

Flags: needinfo?(mikokm)

These are zips with profiles from Treeherder for some of these tests. I didn't get the tsvg tests, but I can if needed. These include the tp5o, perf_reftest_singletons, and tp5o_webext tests.

mozhttpd:
https://drive.google.com/open?id=1HLpCkiWcgr7UICfmzrntHTo83jlaEgeD

wptserve:
https://drive.google.com/open?id=1Jvt7oTD0oVRUawaLabMINf2RTxYrVs2d

Assignee: nobody → bforehand
Status: NEW → ASSIGNED

(In reply to Benjamin Forehand Jr[:b4hand] from comment #10)

These are zips with profiles from Treeherder for some of these tests. I didn't get the tsvg tests, but I can if needed. These include the tp5o, perf_reftest_singletons, and tp5o_webext tests.

mozhttpd:
https://drive.google.com/open?id=1HLpCkiWcgr7UICfmzrntHTo83jlaEgeD

wptserve:
https://drive.google.com/open?id=1Jvt7oTD0oVRUawaLabMINf2RTxYrVs2d

Thanks Benjamin!

Andrew, would you have some time to have a look at the above profiles? We are wondering why switching the talos webserver from mozhttpd to wptserve has caused such large regressions, and if it is ok to accept the new baselines or if we need to revert this change. Thanks!

Flags: needinfo?(acreskey)

Sorry for the delay, although I'm not sure I'll be of much help here.

I've put up profiles from this one:
188% perf_reftest_singletons only-children-1.html windows7-32-shippable opt e10s stylo 0.03 -> 0.08

wptserve (gecko 73)
https://perfht.ml/3bDatv8

mozhttpd (gecko 72)
https://perfht.ml/2OPctXq

But I don't think I can find the issue there; the test is so short that there is only one sample.

This one seems like a good one:
126% tp5o_webext macosx1014-64-shippable opt e10s stylo 244.02 -> 550.83

But I don't see a profile for that in the profile_tp5o_webext_osx.zip from the provided links.
If we can find profiles for a longer running test, that may help.

Flags: needinfo?(acreskey)
Version: Version 3 → unspecified

:b4hand are you still working on this?

Flags: needinfo?(bforehand)

(In reply to Dave Hunt [:davehunt] [he/him] ⌚BST from comment #13)

:b4hand are you still working on this?

No, not at the moment.

Flags: needinfo?(bforehand)

(In reply to Andrew Creskey [:acreskey] [he/him] from comment #12)

This one seems like a good one:
126% tp5o_webext macosx1014-64-shippable opt e10s stylo 244.02 -> 550.83

But I don't see a profile for that in the profile_tp5o_webext_osx.zip from the provided links.
If we can find profiles for a longer running test, that may help.

Where can this job be found on Treeherder?

Flags: needinfo?(acreskey)

Hey James, maybe you have an idea or some feedback. This Talos regression started when we moved from httpd.js to wptserve. As the numbers show, there is a huge decrease in performance when delivering the pages. Are you aware of any performance bottlenecks in wptserve, or is it just the move to another process? httpd.js runs inside Firefox, so maybe the out-of-process server adds that extra time?

Flags: needinfo?(james)

So I don't know off the top of my head, but I can suggest some things to look for:

  • What does the network profile look like in each case? Are we fetching more resources simultaneously with httpd.js vs. wptserve? Is httpd.js just faster at producing the responses? (A rough timing sketch follows below.)
  • What does a profile of wptserve look like? Can we get it running under py-spy or vmprof in order to see if there are any obvious bottlenecks?

"wptserve is slow" doesn't seem like a particularly controversial proposition, but if we can figure out how to speed it up, that should pay dividends across several test suites.

Flags: needinfo?(james)

(In reply to Henrik Skupin (:whimboo) [⌚️UTC+2] from comment #15)

(In reply to Andrew Creskey [:acreskey] [he/him] from comment #12)

This one seems like a good one:
126% tp5o_webext macosx1014-64-shippable opt e10s stylo 244.02 -> 550.83

But I don't see a profile for that in the profile_tp5o_webext_osx.zip from the provided links.
If we can find profiles for a longer running test, that may help.

Where can this job be found on Treeherder?

Henrik, to be honest, I'm not sure.
I was looking into the profiles shared here:
https://bugzilla.mozilla.org/show_bug.cgi?id=1600034#c10

If this work is a priority, I think the approach suggested by James in Comment 17 would be the best.

Flags: needinfo?(acreskey)

(In reply to Henrik Skupin (:whimboo) [⌚️UTC+2] from comment #15)

(In reply to Andrew Creskey [:acreskey] [he/him] from comment #12)

This one seems like a good one:
126% tp5o_webext macosx1014-64-shippable opt e10s stylo 244.02 -> 550.83

But I don't see a profile for that in the profile_tp5o_webext_osx.zip from the provided links.
If we can find profiles for a longer running test, that may help.

Where can this job be found on Treeherder?

The original alert from comment 0 is https://treeherder.mozilla.org/perf.html#/alerts?id=24166. From here you can link to a graph for any of the alerts, and then zoom into the highlighted alert. For example https://treeherder.mozilla.org/perf.html#/graphs?highlightAlerts=1&selected=1926850,984162607&series=autoland,1926850,1,1&timerange=31536000&zoom=1574796064638,1574809959298,137.15869748709474,379.8505568343503

Usually, clicking a data point will show a popup with a link to the job. I'm currently not seeing the job link for some reason. You can alternatively click the revision, and via hg.mozilla.org click through to the Treeherder results view. I also tried this myself, but no jobs were shown. It's possible that this data has expired. Ionut, do you know?

Looking at a more recent push, the following should filter Treeherder to show the jobs responsible for the tp5o_webext test: https://treeherder.mozilla.org/#/jobs?repo=autoland&group_state=expanded&searchStr=t%28g5%29

Flags: needinfo?(igoldan)

(In reply to Dave Hunt [:davehunt] [he/him] ⌚BST from comment #19)

[...]
Usually, clicking a data point will show a popup with a link to the job. I'm currently not seeing the job link for some reason. You can alternatively click the revision, and via hg.mozilla.org click through to the Treeherder results view. I also tried this myself, but no jobs were shown. It's possible that this data has expired. Ionut, do you know?

That's precisely the case. Treeherder jobs older than 4 months are removed from the database.
Thus, Perfherder just isn't able to show them.

Flags: needinfo?(igoldan)
Assignee: bforehand → nobody
Status: ASSIGNED → NEW
Priority: P1 → P2

(In reply to Andrew Creskey [:acreskey] [he/him] from comment #12)

This one seems like a good one:
126% tp5o_webext macosx1014-64-shippable opt e10s stylo 244.02 -> 550.83

But I don't see a profile for that in the profile_tp5o_webext_osx.zip from the provided links.
If we can find profiles for a longer running test, that may help.

There are lots of profiles in that zip file; they are for individual page loads. How is the number (550.83) computed for this tp5o_webext test suite?

(In reply to Florian Quèze [:florian] from comment #21)

(In reply to Andrew Creskey [:acreskey] [he/him] from comment #12)

This one seems like a good one:
126% tp5o_webext macosx1014-64-shippable opt e10s stylo 244.02 -> 550.83

But I don't see a profile for that in the profile_tp5o_webext_osx.zip from the provided links.
If we can find profiles for a longer running test, that may help.

There are lots of profiles in that zip file; they are for individual page loads. How is the number (550.83) computed for this tp5o_webext test suite?

Florian, unfortunately I don't know; I'm really not familiar with any of the Talos tests.

When looking for the profiles I was hoping to find some obvious reason why that test had doubled in its result -- waiting on a resource or similar.

I've run tp5o_webext locally, narrowed down to a single site with and without the regressing patch backed out:

(In reply to Florian Quèze [:florian] (PTO until August 10th) from comment #21)

How is the number (550.83) computed for this tp5o_webext test suite?

I believe this is the time taken for the "load" event to be fired.

(In reply to Andrew Creskey [:acreskey] [he/him] from comment #22)

When looking for the profiles I was hoping to find some obvious reason why that test had doubled in its result -- waiting on a resource or similar.

I've looked over the profiles gathered in comment 23 and whilst the network track looks quite different between them, I'm unsure how to interpret the differences. Do you see anything obvious?

Flags: needinfo?(acreskey)

Gijs suggested looking at the HTTP responses of the two web servers to see if there are any differences. I noticed that mozhttpd provided HTTP/1.0 responses with a Content-Length header, but wptserve was providing HTTP/1.1 responses without Content-Length. Adding the Content-Length header in wptserve appears to have a positive impact on performance locally. I have submitted a patch in bug 1657445. Note that as Talos uses wptserve from PyPI, we'll need a new release with this fix and then bump the Talos requirements.
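A minimal way to reproduce this comparison, assuming mozhttpd and wptserve instances running locally on placeholder ports and serving a placeholder pageset path, is to print the HTTP version and Content-Length/Transfer-Encoding headers of each response:

    import http.client

    # Hypothetical local endpoints: mozhttpd on 8000, wptserve on 8001.
    PATH = "/tests/tp5n/bild.de/www.bild.de/index.html"

    def describe_response(label, host, port, path):
        conn = http.client.HTTPConnection(host, port, timeout=10)
        conn.request("GET", path)
        resp = conn.getresponse()
        # resp.version is 10 for HTTP/1.0 and 11 for HTTP/1.1.
        http_version = "HTTP/1.0" if resp.version == 10 else "HTTP/1.1"
        content_length = resp.getheader("Content-Length")
        transfer_encoding = resp.getheader("Transfer-Encoding")
        print(f"{label}: {http_version}, Content-Length={content_length}, "
              f"Transfer-Encoding={transfer_encoding}")
        resp.read()
        conn.close()

    describe_response("mozhttpd", "127.0.0.1", 8000, PATH)
    describe_response("wptserve", "127.0.0.1", 8001, PATH)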

Depends on: 1657445

(In reply to Dave Hunt [:davehunt] [he/him] ⌚BST from comment #23)

I've run tp5o_webext locally, narrowed down to a single site with and without the regressing patch backed out:

This profile shows duplicated network markers without duration; is this one of our known network marker bugs?

Flags: needinfo?(felash)
Flags: needinfo?(acreskey)

This looks different to our existing bugs but I can't be sure. I think that for most of our existing bugs we lack the end marker, but here we lack the start marker.

I've seen this pattern happen when reloading a page with some long-running requests, which seemed plausible for this case, but here it doesn't seem to fit well. It's difficult to say without knowing the context better!

Also the pattern looks different (and more confusing) for the main process vs the content process.

Flags: needinfo?(felash)
Severity: normal → S3
Priority: P2 → P3
Has Regression Range: --- → yes
Status: NEW → RESOLVED
Closed: 3 years ago
Resolution: --- → INCOMPLETE