1522827 - 9.5 - 20.38% raptor-tp6-reddit-firefox (linux64, linux64-qr, windows10-64, windows10-64-qr, windows7-32) regression on push d9cde93070b09b7f78a2d01208ed78cde55db092 (Wed Jan 23 2019)

Reporter

Description

•

6 years ago

Raptor has detected a Firefox performance regression from push:

https://hg.mozilla.org/integration/mozilla-inbound/pushloghtml?fromchange=a02dd51c39da142f3cbe22569483eab33581d305&tochange=d9cde93070b09b7f78a2d01208ed78cde55db092

As author of one of the patches included in that push, we need your help to address this regression.

Regressions:

20% raptor-tp6-reddit-firefox windows10-64-qr opt 854.29 -> 1,028.38
15% raptor-tp6-reddit-firefox windows7-32 opt 829.79 -> 954.27
14% raptor-tp6-reddit-firefox windows10-64 opt 876.29 -> 1,000.94
12% raptor-tp6-reddit-firefox linux64-qr opt 846.29 -> 945.33
10% raptor-tp6-reddit-firefox linux64 opt 811.89 -> 889.00

You can find links to graphs and comparison views for each of the above tests at: https://treeherder.mozilla.org/perf.html#/alerts?id=18849

On the page above you can see an alert for each affected platform as well as a link to a graph showing the history of scores for this test. There is also a link to a Treeherder page showing the Raptor jobs in a pushlog format.

To learn more about the regressing test(s) or reproducing them, please see: https://wiki.mozilla.org/Performance_sheriffing/Raptor

*** Please let us know your plans within 3 business days, or the offending patch(es) will be backed out! ***

Our wiki page outlines the common responses and expectations: https://wiki.mozilla.org/Performance_sheriffing/Talos/RegressionBugsHandling

Ionuț Goldan [:igoldan]

Reporter

Updated

•

6 years ago

Component: General → XPCOM

Product: Testing → Core

Ionuț Goldan [:igoldan]

Reporter

Updated

•

6 years ago

Flags: needinfo?(nfroyd)

Ionuț Goldan [:igoldan]

Reporter

Comment 1

•

6 years ago

These are the Gecko profiles for raptor-tp6-reddit on Windows 7 (OPT builds):

before: https://perf-html.io/from-url/https%3A%2F%2Fqueue.taskcluster.net%2Fv1%2Ftask%2FLj-uNjPuSly2tYpe1N3MTg%2Fruns%2F0%2Fartifacts%2Fpublic%2Ftest_info%2Fprofile_raptor-tp6-reddit-firefox.zip

after: https://perf-html.io/from-url/https%3A%2F%2Fqueue.taskcluster.net%2Fv1%2Ftask%2FKaYn8SuXTZeDIPXWRF63zA%2Fruns%2F0%2Fartifacts%2Fpublic%2Ftest_info%2Fprofile_raptor-tp6-reddit-firefox.zip

For raptor-tp6-reddit on Windows 10 Quantum Render (OPT builds):

before: https://perf-html.io/from-url/https%3A%2F%2Fqueue.taskcluster.net%2Fv1%2Ftask%2FOUuUQNc1TCu8yjmJnfsdkg%2Fruns%2F0%2Fartifacts%2Fpublic%2Ftest_info%2Fprofile_raptor-tp6-reddit-firefox.zip

after: https://perf-html.io/from-url/https%3A%2F%2Fqueue.taskcluster.net%2Fv1%2Ftask%2Fe7SlYQEVRUyyELUui57SZg%2Fruns%2F0%2Fartifacts%2Fpublic%2Ftest_info%2Fprofile_raptor-tp6-reddit-firefox.zip

For raptor-tp6-reddit on Linux 64bit (OPT builds):

before: https://perf-html.io/from-url/https%3A%2F%2Fqueue.taskcluster.net%2Fv1%2Ftask%2FHgNLeH4YQeWeFS0Uu31Hbg%2Fruns%2F0%2Fartifacts%2Fpublic%2Ftest_info%2Fprofile_raptor-tp6-reddit-firefox.zip

after: https://perf-html.io/from-url/https%3A%2F%2Fqueue.taskcluster.net%2Fv1%2Ftask%2FNBLO-aC8TfmhF8qghrjJHg%2Fruns%2F0%2Fartifacts%2Fpublic%2Ftest_info%2Fprofile_raptor-tp6-reddit-firefox.zip

Nathan Froyd [:froydnj]

Comment 2

•

6 years ago

This change makes zero sense to me. The commit in question is only removing code that never gets executed, but somehow that code is responsible for 10% performance? I don't buy it.

acreskey, I know you've been looking at raptor/performance. Do you have any suggestions as to what I should be looking for here?

Flags: needinfo?(nfroyd) → needinfo?(acreskey)

Andrew Creskey [:acreskey]

Comment 3

•

6 years ago

Those numbers look strange for dead code removal.
What about applying the reverse of this patch (i.e. re-adding the Scheduler code), and pushing that to try?

Flags: needinfo?(acreskey)

Ionuț Goldan [:igoldan]

Reporter

Comment 4

•

6 years ago

(In reply to Andrew Creskey from comment #3)

Those numbers look strange for dead code removal.
What about applying the reverse of this patch (i.e. re-adding the Scheduler code), and pushing that to try?

I've pushed this to try. On the Base side (left) there's the dead code removal, on the New there's the backout of bug 1485216:
https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=f4a3e6e29701db51dd414dd903a324fa74517195&newProject=try&newRevision=097b669e8da95cc55c04e437949ab7291f8135b8&framework=10

Andrew Creskey [:acreskey]

Comment 5

•

6 years ago

Thanks :igoldan
Looking into the subtests for windows10-64 it's ttfi that looks to be improved by the backout:
https://treeherder.mozilla.org/perf.html#/comparesubtest?originalProject=try&originalRevision=f4a3e6e29701db51dd414dd903a324fa74517195&newProject=try&newRevision=097b669e8da95cc55c04e437949ab7291f8135b8&originalSignature=1822323&newSignature=1822323&framework=10

And if I have this right, this is the recent history for ttfi for that configuration:
https://treeherder.mozilla.org/perf.html#/graphs?series=try,1822326

It seems on some days that it runs from ~2500 to ~7500 ms?

Ionuț Goldan [:igoldan]

Reporter

Comment 6

•

6 years ago

Any updates on this matter?

Flags: needinfo?(acreskey)

Andrew Creskey [:acreskey]

Comment 7

•

6 years ago

:igoldan, I'm not sure how I can help, but from looking at your rollback it's the ttfi subtest which is regressing.
But ttfi is such a noisy metric that I'm not sure if it's a good one to be concerned about:
https://treeherder.mozilla.org/perf.html#/graphs?series=try,1822326

All I can suggest it to get a few more runs in on that rollback?

Andrew Creskey [:acreskey]

Updated

•

6 years ago

Flags: needinfo?(acreskey)

Ionuț Goldan [:igoldan]

Reporter

Updated

•

6 years ago

Depends on: 1535551

Ionuț Goldan [:igoldan]

Reporter

Comment 8

•

6 years ago

(In reply to Andrew Creskey from comment #7)

:igoldan, I'm not sure how I can help, but from looking at your rollback it's the ttfi subtest which is regressing.
But ttfi is such a noisy metric that I'm not sure if it's a good one to be concerned about.

ttfi used to be a stable metric. Unfortunately, landing of bug 1485216 caused it to become very noisy for a full week. On OSX, ttfi even remained noisy since then.

Somewhere around January 30, a patch landed (don't yet know which) which eliminated most of this noise on Windows & Linux platforms. It also partially brought down ttfi to previous baselines. This confirms that bug 1485216 caused a real regression.

Once we find out what bug re stabilized ttfi, we'll have to see whether they're related somehow.
If they're not, then we need to consider getting back the 10 to 20% raptor losses.

Andrew Creskey [:acreskey]

Comment 9

•

6 years ago

Good investigation :igoldan, let me know if I can help.

Ionuț Goldan [:igoldan]

Reporter

Comment 10

•

6 years ago

(In reply to Ionuț Goldan [:igoldan], Performance Sheriffing from comment #8)

(In reply to Andrew Creskey from comment #7)

:igoldan, I'm not sure how I can help, but from looking at your rollback it's the ttfi subtest which is regressing.
But ttfi is such a noisy metric that I'm not sure if it's a good one to be concerned about.

ttfi used to be a stable metric. Unfortunately, landing of bug 1485216 caused it to become very noisy for a full week. On OSX, ttfi even remained noisy since then.

Somewhere around January 30, a patch landed (don't yet know which) which eliminated most of this noise on Windows & Linux platforms.

I looked up the patch and it is this one. One of bug 1523158, bug 1521786, bug 1506949 made it.

Is any of them related to bug 1485216?

Flags: needinfo?(acreskey)

Ionuț Goldan [:igoldan]

Reporter

Comment 11

•

6 years ago

Olli Pettay says bug 1521786 influenced fcp, while bug 1506949 influenced loadtime.

Bug 1523158 has nothing to do with the Raptor tests from comment 0.

Andrew Creskey [:acreskey]

Comment 12

•

6 years ago

I can see how bug 1521786 could impact ttfi -- we only start looking for ttfi after fcp has occurred.
So making fcp come sooner would give us a larger window for ttfi to be detected.

I'm not 100% sure on bug 1506949 - but if it affects the idle queue then I can see how it could impact ttfi, since ttfi requires 50ms of no activity on the idle queue.

Ionuț Goldan [:igoldan]

Reporter

Updated

•

6 years ago

Comment 13

•

6 years ago

So I'm not very familiar with the event queues, but I think that the removal of the LabeledEventQueue in bug 1485216 may have caused the instability and regression in ttfi.

We now create one event queue for content process main threads, but previously we created LabeledEventQueue, which is a collection of queues.
See:
https://hg.mozilla.org/integration/mozilla-inbound/rev/d9cde93070b0#l21.87

Some description here:
https://hg.mozilla.org/integration/mozilla-inbound/rev/d9cde93070b0#l15.45

Apparently LabeledEventQueue was problematic performance wise.

:igoldan -- would it be useful if I reverted just the LabeledEventQueue change of the commit and tested that?

Ionuț Goldan [:igoldan]

Reporter

Comment 14

•

6 years ago

(In reply to Andrew Creskey from comment #13)

:igoldan -- would it be useful if I reverted just the LabeledEventQueue change of the commit and tested that?

Yes, let's try this out.

Andrew Creskey [:acreskey]

Comment 15

•

6 years ago

Ah, I didn't realize that the LabeledEventQueue depends on the Scheduler.
So I'll put them both back in and we can toggle the main thread creation logic here in different revisions to see if that's the source of the perf difference:
https://searchfox.org/mozilla-central/source/xpcom/threads/nsThreadManager.cpp#226

Andrew Creskey [:acreskey]

Comment 16

•

6 years ago

So my idea of re-adding the LabeledEventQueue in Comment 13 was overly optimistic.
Other dependencies of that system have also been removed -- See Bug 1525031.

I think we should get the input of someone more familiar with the event queue scheduling to see if my Comment 13 makes sense.
:froydnj ?

Flags: needinfo?(acreskey) → needinfo?(nfroyd)

Nathan Froyd [:froydnj]

Comment 17

•

6 years ago

(In reply to Andrew Creskey from comment #16)

So my idea of re-adding the LabeledEventQueue in Comment 13 was overly optimistic.
Other dependencies of that system have also been removed -- See Bug 1525031.

I think we should get the input of someone more familiar with the event queue scheduling to see if my Comment 13 makes sense.
:froydnj ?

LabeledEventQueue was only used if certain preferences were set:

https://hg.mozilla.org/integration/mozilla-inbound/rev/d9cde93070b0#l21.86

and we never set those preferences. Unless the test infrastructure somehow was...?

Flags: needinfo?(nfroyd)

Andrew Creskey [:acreskey]

Comment 18

•

6 years ago

(In reply to Nathan Froyd [:froydnj] from comment #17)

(In reply to Andrew Creskey from comment #16)

So my idea of re-adding the LabeledEventQueue in Comment 13 was overly optimistic.
Other dependencies of that system have also been removed -- See Bug 1525031.

I think we should get the input of someone more familiar with the event queue scheduling to see if my Comment 13 makes sense.
:froydnj ?

LabeledEventQueue was only used if certain preferences were set:

https://hg.mozilla.org/integration/mozilla-inbound/rev/d9cde93070b0#l21.86

and we never set those preferences. Unless the test infrastructure somehow was...?

As far as I can tell, the UseMultipleQueues pref was set, but the codepath is very windy.

The pref is defaulted to true here:
https://hg.mozilla.org/integration/mozilla-inbound/rev/d9cde93070b0#l7.13

Scheduler::GetPrefs() return this pref:
https://hg.mozilla.org/integration/mozilla-inbound/rev/784f80261f91#l2.731

And the result of Scheduler::GetPrefs() is eventually passed into Scheduler::SetPrefs()
https://hg.mozilla.org/integration/mozilla-inbound/rev/d9cde93070b0#l3.32

Nathan Froyd [:froydnj]

Comment 19

•

6 years ago

Ugh, ugh, ugh. So bits of the Quantum DOM work were having a performance effect. I think what we were (inadvertently?) were doing before were grouping runnables related to a particular tab/document/docgroup together, and then attempting to prioritize that group before looking at other things.

I guess we should try putting bug 1525031 and (parts of) bug 1485216 back in. Or we can come up with an alternative mechanism to prioritize events based on their associated tabs.

Andrew Creskey [:acreskey]

Comment 20

•

6 years ago

I know that we treat ttfi equally in the geomean raptor tp6 scores that this regressed.
But the other pageload metrics were not regressed, from what I saw.

It might be worth discussing this with an expert, e.g. :jesup, to get his opinion.

Shifting ttfi while improving architecture might not be a bad tradeoff, particularly if the scheduling in general is going to be given a lot of focus in the near future.

Ionuț Goldan [:igoldan]

Reporter

Comment 21

•

6 years ago

(In reply to Andrew Creskey from comment #20)

It might be worth discussing this with an expert, e.g. :jesup, to get his opinion.

Randell, could you provide some assistance here?

Flags: needinfo?(rjesup)

Randell Jesup [:jesup] (needinfo me)

Comment 22

•

6 years ago

I'm on PTO (sorry I missed this), but...

TTFI is very sensitive to any event that takes 50ms to run; if you push a 50+ms event to later in order to run <50ms events earlier (instead of the other way around), it will delay TTFI by that amount. So changes to ordering, especially in events that happen after "load" can have a big impact on TTFI. The question is what events are getting pushed - if they're events related to the tab that just loaded, that might be the cause. And events related to the just-loaded-tab actually are more important here for user experience - while TTFI/TTI are problematic measurements, they do try to get at an important part of user experience - when a page becomes smoothly responsive.

This is an aspect of Scheduling; I had thought per the removals referenced here that the scheduling code we had was turned off; apparently it wasn't (but is now removed).

We might need/want to reintroduce it, and as part of the ongoing scheduling work/discussions we may need to consider using it to implement some of the scheduling decisions we make. We also need to think how this will work (and if it can) in a Fission world, which may mean dumping the current structure and implementing some master cross-process structure (Quantum scheduling was mostly focused on making multiple tabs play better in the same processes)

NI smaug and bas for visibility for the Scheduling meetings

Flags: needinfo?(rjesup)

Flags: needinfo?(bugs)

Flags: needinfo?(bas)

Dave Hunt [:davehunt] [he/him] ⌚BST

Comment 23

•

5 years ago

It sounds like this regression was primarily related to ttfi. Due to excessive noise and uncertain value we were getting from the ttfi measurement, we disabled it in bug 1536874. Even if this regression was now fixed, we wouldn't see an improvement in these tests. Randell: do you think we can close this bug, or is there value in keeping it open?

Flags: needinfo?(rjesup)

Randell Jesup [:jesup] (needinfo me)

Comment 24

•

5 years ago

There is value - this impacts the scheduling work, and while ttfi is not something we're monitoring currently, it is something (or close to something) we care about; it is an indirect (and problematic) look at a form of jank.

Flags: needinfo?(rjesup)

Olli Pettay [:smaug][bugs@pettay.fi]

Comment 25

•

5 years ago

I'm not sure our TTFI measures jank very accurately. If user was interacting with the page, there would be pending input events and we'd
yield certain slow operations sooner (like reflow and DOM creation).

Flags: needinfo?(bugs)

Dave Hunt [:davehunt] [he/him] ⌚BST

Comment 26

•

5 years ago

At this point we're not measuring TTFI and I don't believe there are immediate plans to reintroduce it. With this in mind, are we going to be able to resolve this regression?

Alexandru Ionescu (needinfo me) [:alexandrui]

Updated

•

5 years ago

status-firefox69: --- → affected

status-firefox70: --- → affected

Ryan VanderMeulen [:RyanVM]

Updated

•

5 years ago

status-firefox69: affected → wontfix

status-firefox70: affected → wontfix

status-firefox71: --- → fix-optional

status-firefox-esr60: --- → unaffected

status-firefox-esr68: --- → wontfix

Dave Hunt [:davehunt] [he/him] ⌚BST

Updated

•

5 years ago

Keywords: perf-alert

Florin Strugariu [:Bebe]

Comment 27

•

5 years ago

Closing this as WONTFIX as we are not running TTFI tests for some time now.

Status: NEW → RESOLVED

Closed: 5 years ago

Resolution: --- → WONTFIX

Bas Schouten (:bas.schouten)

Updated

•

3 years ago

Flags: needinfo?(bas)