Open Bug 1724199 Opened 3 years ago Updated 3 years ago

10.09 - 9.23% sessionrestore_no_auto_restore / sessionrestore_no_auto_restore + 1 more (Windows) regression on Tue August 3 2021

Categories

(Core :: DOM: Navigation, defect, P2)


Tracking


Tracking Status
firefox-esr78 --- unaffected
firefox-esr91 --- unaffected
firefox90 --- unaffected
firefox91 --- unaffected
firefox92 + wontfix
firefox93 + wontfix

People

(Reporter: aesanu, Unassigned)

References

(Regression)

Details


Perfherder has detected a talos performance regression from push e89d64559e5b7cb7191b1a019e45d6a30eb3090f. Since you authored one of the patches included in that push, we need your help to address this regression.

Regressions:

Ratio  Suite                           Test                            Platform                   Options               Absolute values (old vs new)
10%    sessionrestore_no_auto_restore  sessionrestore_no_auto_restore  windows10-64-shippable-qr  e10s stylo webrender  515.54 -> 567.58
9%     sessionrestore                  sessionrestore                  windows10-64-shippable-qr  e10s stylo webrender  461.00 -> 504.08
9%     sessionrestore_no_auto_restore  sessionrestore_no_auto_restore  windows10-64-shippable-qr  e10s stylo webrender  519.00 -> 566.92

Details of the alert can be found in the alert summary, including links to graphs and comparisons for each of the affected tests. Please follow our guide to handling regression bugs and let us know your plans within 3 business days, or the offending patch(es) will be backed out in accordance with our regression policy.

For more information on performance sheriffing please see our FAQ.

Flags: needinfo?(nika)

Set release status flags based on info from the regressing bug 1715773

So looking at the profiles, I'm pretty stumped. For the sessionrestore_no_auto_restore test, it looks like we're consistently reaching the sessionRestored observer notification faster with the patch than without it.

Process start time doesn't get recorded in the profiler. I'm going to try modifying the test to use the "main" timestamp instead, since that does show up in the timeline, to see if we still get a disagreement here.
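(For readers following along, here's a minimal sketch of the measurement change being described, assuming privileged (chrome) JavaScript where the Services global is available. This is an illustration, not the actual talos harness code; the timestamp names come from nsIAppStartup.getStartupInfo().)

  // Sketch: compare anchoring the measurement at process start vs. "main".
  // getStartupInfo() returns Date objects for the recorded startup phases.
  const { process, main, sessionRestored } = Services.startup.getStartupInfo();

  // Roughly what the test reports today (anchored at process start):
  const fromProcessStart = sessionRestored - process;

  // The modified measurement described above (anchored at "main"):
  const fromMain = sessionRestored - main;

  console.log(`process start -> sessionRestored: ${fromProcessStart} ms`);
  console.log(`"main" -> sessionRestored: ${fromMain} ms`);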

My suspicion that the delta is related to an increase in the time between the "process start" and "main" timestamps seems to be confirmed by these two try pushes:

Baseline (no regression): https://treeherder.mozilla.org/jobs?repo=try&selectedTaskRun=ePs0UU_CTfy-XXRL7Uqbzw.0&revision=2127cc729d5cfe518b85a400ab5a9bd6e219ba95

Patch (supposed regression): https://treeherder.mozilla.org/jobs?repo=try&selectedTaskRun=ePs0UU_CTfy-XXRL7Uqbzw.0&revision=947e71f71853d9aa2c86595ab605634abcffc284

The only difference between these pushes is that I modified the talos test to measure the time between "main" and the session having been restored. With that change, the Perfherder compare view shows no significant difference:

https://treeherder.mozilla.org/perfherder/compare?originalProject=try&originalRevision=2127cc729d5cfe518b85a400ab5a9bd6e219ba95&newProject=try&newRevision=947e71f71853d9aa2c86595ab605634abcffc284&framework=1&page=1

The conclusion is kind of strange then... somehow, it seems that bug 1715773 caused an increase in the time between process start and main. Oddly, this increase isn't being detected by ts_paint, which also uses the process start timestamp: https://searchfox.org/mozilla-central/rev/162a8620f126c27de70766808708696260f2256a/testing/talos/talos/startup_test/tspaint_test.html#29

It also isn't detected by startup_about_home_paint, which also ultimately reads a measurement that compares against process start: https://searchfox.org/mozilla-central/rev/162a8620f126c27de70766808708696260f2256a/browser/modules/AboutNewTab.jsm#239-244

Here's ts_paint and startup_about_home_paint graphed with the landing of the supposed regressor highlighted: https://treeherder.mozilla.org/perfherder/graphs?highlightAlerts=1&highlightChangelogData=1&highlightedRevisions=e89d64559e5b&series=autoland,2813231,1,1&series=autoland,2813238,1,1&timerange=1209600

No bump.

So my conclusion is that... somehow, the sessionrestore tests result in a larger gap between process start and main. I don't know how that's possible.
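(For anyone wanting to sanity-check that conclusion locally, a quick sketch, again assuming chrome JavaScript with the Services global available: the pre-main gap can be read directly from the same startup info as above.)

  // Sketch: isolate just the process-start -> "main" gap the conclusion points at.
  const { process, main } = Services.startup.getStartupInfo();
  console.log(`process start -> "main": ${main - process} ms`);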

Nika wonders whether the code changes in https://hg.mozilla.org/integration/autoland/rev/e89d64559e5b7cb7191b1a019e45d6a30eb3090f cause a binary layout change.

The only change that might affect pre-main execution is the removal of unused Atom("preloadedState", "preloadedState") in StaticAtoms.py. But we add and remove static atoms all the time without seeing a regression like this one.

Priority: -- → P2

What's the next step here? Should we consider backing the change out of 92?

Flags: needinfo?(cpeterson)

Hey gerald, we've got an interesting Talos regression here where visibility into what's going on before XUL loads might help. Do you know if it's possible to request Base Profiler profiles from Talos? Maybe there's just an environment variable we can set to get them?

Flags: needinfo?(nika)
Flags: needinfo?(gsquelart)
Flags: needinfo?(cpeterson)

As mconley noted in comment 5, this regression is not seen in the ts_paint or startup_about_home_paint benchmarks. The Perfherder graph below shows sessionrestore_no_auto_restore's test results regressing alongside unchanged results from ts_paint and startup_about_home_paint. The sessionrestore_no_auto_restore results get noisier and shift from ~500 ms to ~600 ms.

https://treeherder.mozilla.org/perfherder/graphs?highlightAlerts=1&highlightChangelogData=1&highlightedRevisions=e89d64559e5b&series=autoland,2813231,1,1&series=autoland,2813238,1,1&series=mozilla-central,2826471,1,1&timerange=2592000

(In reply to Mike Conley (:mconley) (:⚙️) from comment #8)

Hey gerald, we've got an interesting Talos regression here where visibility into what's going on before XUL loads might help. Do you know if it's possible to request Base Profiler profiles from Talos? Maybe there's just an environment variable we can set to get them?

The Base Profiler was deactivated in Talos tests because of intermittent failures; see bug 1648325. I doubt I will be able to work on this any time soon, sorry.
In the meantime, you could try commenting out (with #) this line in your own Talos runs: https://searchfox.org/mozilla-central/rev/d3683dbb252506400c71256ef3994cdbdfb71ada/testing/talos/talos/gecko_profile.py#96 Good luck!

Flags: needinfo?(gsquelart)
Flags: needinfo?(mconley)
Assignee: nobody → mconley

I gathered some profiles from the pre- and post-patch revisions using the Base Profiler. I could only tease out profiles for the sessionrestore tests, and not the sessionrestore_no_auto_restore test. These are the profiles for the cycle 0 instances of those tests:

Base (no regression): https://share.firefox.dev/3DgV0yW
Patch (regression): https://share.firefox.dev/3mu5oNP

A few things jump out at me:

  1. The GPU process seems to take longer to get launched in the regression profile.
  2. Some graphics-related system calls seem to be slower too: NtUserDisplayConfigGetDeviceInfo takes 15 samples in the patch profile when opening the initial window. That system call doesn't even show up in the base profile samples.

Hey nika... does anything meaningful jump out at you from these profiles? Any thoughts on how the patch in bug 1715773 could have possibly affected GPU process startup time?

Flags: needinfo?(mconley) → needinfo?(nika)

To be clear for folks following along, I'm still not convinced that there's a real-world regression here. Note comment 5: when I adjusted the test to measure the time between "main" and session restored, the regression disappeared.

(In reply to Mike Conley (:mconley) (:⚙️) from comment #11)

Base (no regression): https://share.firefox.dev/3DgV0yW
Patch (regression): https://share.firefox.dev/3mu5oNP

In these two tests, with GOOD, the "main" is at 44ms, and with BAD it's at 41ms, which doesn't really line up with the observation that the time between process start and "main" has increased, afaict (or perhaps I'm reading the numbers wrong).

I noticed that the time delta between sessionRestoreInit and sessionRestored is about the same in both the GOOD and BAD profiles, with sessionRestoreInit starting at 194ms in the GOOD profile and 255ms in the BAD profile. That puts about 61ms of the difference in the first few ms of process startup.

Some other probes interestingly showed a similar pattern, with earlyBlankWindowVisible starting at 145ms with GOOD and 205ms with BAD. The earlyBlankFirstPaint probe was also ~60ms earlier in the GOOD profile than in the BAD one.

Comparing the markers during just the earlyBlankFirstPaint marker (https://share.firefox.dev/3sPJSUH, https://share.firefox.dev/3yoR23w), I noticed the BrowserGlue.jsm import took ~60ms longer, and that there is an interesting jump in the timestamps for markers around when "NewThread - VsyncIOThread" and "NewThread - IPC Launch" occur, which roughly lines up with your observation that the GPU process (which is probably the first process to be launched?) is not being launched until slightly later. There's another gap around the "activate" DOMEvent, which fires ~6ms after the "NewThread - TRR Background" marker in the GOOD profile, but 40ms after in the BAD profile.

Hey nika... does anything meaningful jump out at you from these profiles? Any thoughts on how the patch in bug 1715773 could have possibly affected GPU process startup time?

I don't see anything in the profiles which points at how that patch could cause this regression. None of the modified code should even be running at this point: we're barely starting the load of browser.xhtml (I highly doubt we're executing tabbrowser.js yet), and even DocumentLoadListener will have bailed out early due to a non-content BrowsingContext before it got far enough into the process selection logic to hit the removed code. The change to the static atom table still seems like the only plausible cause of the issue.

:mconley, would it be possible to do perf tests between the pre-patch version & a version with the patch applied, but the change to StaticAtoms.py reverted? It might be interesting to see if re-adding that atom changes anything, as I have no idea what else could.

Flags: needinfo?(nika) → needinfo?(mconley)

Sorry for the delay. I pushed a version of the patch that puts the atom back, and it doesn't look like it moved the needle. I'm going to try putting back each part of the patch piece by piece to see which part finally "fixes" things. Once we identify that chunk, I'll throw some markers around it (along the lines of the sketch below) to see how it's involved with the Talos tests.
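(For the record, the marker idea could look roughly like this. ChromeUtils.addProfilerMarker is the real chrome-only API, but runSuspectChunk and the marker name are hypothetical placeholders for whichever chunk of bug 1715773 the bisection lands on.)

  // Hypothetical instrumentation around the chunk being bisected.
  // Cu.now() returns a high-resolution timestamp usable as a marker start time.
  const start = Cu.now();
  runSuspectChunk(); // placeholder for the piece of bug 1715773 being re-added
  ChromeUtils.addProfilerMarker(
    "bug1724199-suspect-chunk",  // marker name shown in the profiler timeline
    { startTime: start },        // makes this an interval marker from start to now
    "timing the re-added chunk"  // free-form text payload
  );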

Status: NEW → ASSIGNED

I'm admitting defeat on this one. I have not had time to figure this out, and it's unlikely I'm going to get to it anytime soon.

I think the refactor in bug 1715773 made sense, and I think we should keep it.

Assignee: mconley → nobody
Status: ASSIGNED → NEW
Flags: needinfo?(mconley)
Has Regression Range: --- → yes