Open Bug 1724199 Opened 3 years ago Updated 3 years ago

10.09 - 9.23% sessionrestore_no_auto_restore / sessionrestore_no_auto_restore + 1 more (Windows) regression on Tue August 3 2021

Categories

(Core :: DOM: Navigation, defect, P2)


Tracking


Tracking Status
firefox-esr78 --- unaffected
firefox-esr91 --- unaffected
firefox90 --- unaffected
firefox91 --- unaffected
firefox92 + wontfix
firefox93 + wontfix

People

(Reporter: aesanu, Unassigned)

References

(Regression)

Details


Perfherder has detected a talos performance regression from push e89d64559e5b7cb7191b1a019e45d6a30eb3090f. Since you authored one of the patches included in that push, we need your help to address this regression.

Regressions:

Ratio  Suite                           Test                            Platform                   Options               Absolute values (old vs new)
10%    sessionrestore_no_auto_restore  sessionrestore_no_auto_restore  windows10-64-shippable-qr  e10s stylo webrender  515.54 -> 567.58
9%     sessionrestore                  sessionrestore                  windows10-64-shippable-qr  e10s stylo webrender  461.00 -> 504.08
9%     sessionrestore_no_auto_restore  sessionrestore_no_auto_restore  windows10-64-shippable-qr  e10s stylo webrender  519.00 -> 566.92

Details of the alert can be found in the alert summary, including links to graphs and comparisons for each of the affected tests. Please follow our guide to handling regression bugs and let us know your plans within 3 business days, or the offending patch(es) will be backed out in accordance with our regression policy.

For more information on performance sheriffing please see our FAQ.

Flags: needinfo?(nika)

Set release status flags based on info from the regressing bug 1715773

So looking at the profiles, I'm pretty stumped. For the sessionrestore_no_auto_restore test, it looks like we're consistently reaching the sessionRestored observer notification faster with the patch than without it.

Process start time doesn't get recorded in the profiler. I'm going to try modifying the test to use the "main" timestamp instead, since that does show up in the timeline, to see if we still get a disagreement here.
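(For readers following along, here's a minimal sketch of the measurement change being described, assuming privileged (chrome) JavaScript where the Services global is available. This is an illustration, not the actual talos harness code; the timestamp names come from nsIAppStartup.getStartupInfo().)

  // Sketch: compare anchoring the measurement at process start vs. "main".
  // getStartupInfo() returns Date objects for the recorded startup phases.
  const { process, main, sessionRestored } = Services.startup.getStartupInfo();

  // Roughly what the test reports today (anchored at process start):
  const fromProcessStart = sessionRestored - process;

  // The modified measurement described above (anchored at "main"):
  const fromMain = sessionRestored - main;

  console.log(`process start -> sessionRestored: ${fromProcessStart} ms`);
  console.log(`"main" -> sessionRestored: ${fromMain} ms`);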

My suspicion that the delta is related to an increase in the time between the "process start" and "main" timestamps seems to be confirmed by these two try pushes:

Baseline (no regression): https://treeherder.mozilla.org/jobs?repo=try&selectedTaskRun=ePs0UU_CTfy-XXRL7Uqbzw.0&revision=2127cc729d5cfe518b85a400ab5a9bd6e219ba95

Patch (supposed regression): https://treeherder.mozilla.org/jobs?repo=try&selectedTaskRun=ePs0UU_CTfy-XXRL7Uqbzw.0&revision=947e71f71853d9aa2c86595ab605634abcffc284

The only difference between these pushes is that I modified the talos test to measure the time between "main" and the session having been restored. With that change, the Perfherder compare view shows no significant difference:

https://treeherder.mozilla.org/perfherder/compare?originalProject=try&originalRevision=2127cc729d5cfe518b85a400ab5a9bd6e219ba95&newProject=try&newRevision=947e71f71853d9aa2c86595ab605634abcffc284&framework=1&page=1

The conclusion is kind of strange then... somehow, it seems that bug 1715773 caused an increase in the time between process start and main. Oddly, this increase isn't being detected by ts_paint, which also uses the process start timestamp: https://searchfox.org/mozilla-central/rev/162a8620f126c27de70766808708696260f2256a/testing/talos/talos/startup_test/tspaint_test.html#29

It also isn't detected by startup_about_home_paint, which also ultimately reads a measurement that compares against process start: https://searchfox.org/mozilla-central/rev/162a8620f126c27de70766808708696260f2256a/browser/modules/AboutNewTab.jsm#239-244

Here's ts_paint and startup_about_home_paint graphed with the landing of the supposed regressor highlighted: https://treeherder.mozilla.org/perfherder/graphs?highlightAlerts=1&highlightChangelogData=1&highlightedRevisions=e89d64559e5b&series=autoland,2813231,1,1&series=autoland,2813238,1,1&timerange=1209600

No bump.

So my conclusion is that... somehow, the sessionrestore tests result in a larger gap between process start and main. I don't know how that's possible.
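(For anyone wanting to sanity-check that conclusion locally, a quick sketch, again assuming chrome JavaScript with the Services global available: the pre-main gap can be read directly from the same startup info as above.)

  // Sketch: isolate just the process-start -> "main" gap the conclusion points at.
  const { process, main } = Services.startup.getStartupInfo();
  console.log(`process start -> "main": ${main - process} ms`);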

Nika wonders whether the code changes in https://hg.mozilla.org/integration/autoland/rev/e89d64559e5b7cb7191b1a019e45d6a30eb3090f cause a binary layout change.

The only change that might affect pre-main execution is the removal of unused Atom("preloadedState", "preloadedState") in StaticAtoms.py. But we add and remove static atoms all the time without seeing a regression like this one.

Priority: -- → P2

What's the next step here? Should we consider backing the change out of 92?

Flags: needinfo?(cpeterson)

Hey gerald, we've got an interesting Talos regression here where visibility into what's going on before XUL loads might help. Do you know if it's possible to request Base Profiler profiles from Talos? Maybe there's just an environment variable we can set to get them?

Flags: needinfo?(nika)
Flags: needinfo?(gsquelart)
Flags: needinfo?(cpeterson)

As mconley noted in comment 5, this regression is not seen in the ts_paint or startup_about_home_paint benchmarks. The Perfherder graph below shows sessionrestore_no_auto_restore's test results regressing alongside unchanged results from ts_paint and startup_about_home_paint. The sessionrestore_no_auto_restore results get noisier and shift from ~500 ms to ~600 ms.

https://treeherder.mozilla.org/perfherder/graphs?highlightAlerts=1&highlightChangelogData=1&highlightedRevisions=e89d64559e5b&series=autoland,2813231,1,1&series=autoland,2813238,1,1&series=mozilla-central,2826471,1,1&timerange=2592000

(In reply to Mike Conley (:mconley) (:⚙️) from comment #8)

Hey gerald, we've got an interesting Talos regression here where visibility into what's going on before XUL loads might help. Do you know if it's possible to request Base Profiler profiles from Talos? Maybe there's just an environment variable we can set to get them?

The Base Profiler was deactivated in Talos tests because of intermittent failures; see bug 1648325. I doubt I will be able to work on this any time soon, sorry.
In the meantime, you could try commenting out (with #) this line in your own Talos runs: https://searchfox.org/mozilla-central/rev/d3683dbb252506400c71256ef3994cdbdfb71ada/testing/talos/talos/gecko_profile.py#96 Good luck!

Flags: needinfo?(gsquelart)
Flags: needinfo?(mconley)
Assignee: nobody → mconley

I gathered some profiles from the pre- and post-patch revisions using the Base Profiler. I could only tease out profiles for the sessionrestore tests, and not the sessionrestore_no_auto_restore test. These are the profiles for the cycle 0 instances of those tests:

Base (no regression): https://share.firefox.dev/3DgV0yW
Patch (regression): https://share.firefox.dev/3mu5oNP

A few things jump out at me:

  1. The GPU process seems to take longer to get launched in the regression profile.
  2. Some graphics-related system calls seem to be slower too: NtUserDisplayConfigGetDeviceInfo takes 15 samples in the patch profile when opening the initial window. That system call doesn't even show up in the base profile samples.

Hey nika... does anything meaningful jump out at you from these profiles? Any thoughts on how the patch in bug 1715773 could have possibly affected GPU process startup time?

Flags: needinfo?(mconley) → needinfo?(nika)

To be clear for folks following along, I'm still not convinced that there's a real-world regression here. Note comment 5: when I adjusted the test to measure the time between "main" and session restored, the regression disappeared.

(In reply to Mike Conley (:mconley) (:⚙️) from comment #11)

Base (no regression): https://share.firefox.dev/3DgV0yW
Patch (regression): https://share.firefox.dev/3mu5oNP

In these two tests, with GOOD, the "main" is at 44ms, and with BAD it's at 41ms, which doesn't really line up with the observation that the time between process start and "main" has increased, afaict (or perhaps I'm reading the numbers wrong).

I noticed that the time delta between sessionRestoreInit and sessionRestored is about the same in both the GOOD and BAD profiles, with sessionRestoreInit starting at 194ms in the GOOD profile and 255ms in the BAD profile. That puts about 61ms of the difference in the first few ms of process startup.

Some other probes interestingly showed a similar pattern, with earlyBlankWindowVisible starting at 145ms with GOOD and 205ms with BAD. The earlyBlankFirstPaint probe was also ~60ms earlier in the GOOD profile than in the BAD one.

Comparing the markers during just the earlyBlankFirstPaint marker (https://share.firefox.dev/3sPJSUH, https://share.firefox.dev/3yoR23w), I noticed the BrowserGlue.jsm import took ~60ms longer, and that there is an interesting jump in the timestamps for markers around when "NewThread - VsyncIOThread" and "NewThread - IPC Launch" occur, which roughly lines up with your observation that the GPU process (which is probably the first process to be launched?) is not being launched until slightly later. There's another gap around the "activate" DOMEvent, which fires ~6ms after the "NewThread - TRR Background" marker in the GOOD profile, but 40ms after in the BAD profile.

Hey nika... does anything meaningful jump out at you from these profiles? Any thoughts on how the patch in bug 1715773 could have possibly affected GPU process startup time?

I don't see anything in the profiles which points at how that patch could cause this regression. None of the modified code should even be running at this point: we're barely starting the load of browser.xhtml (I highly doubt we're executing tabbrowser.js yet), and even DocumentLoadListener will have bailed out early due to a non-content BrowsingContext before it got far enough into the process selection logic to hit the removed code. The change to the static atom table still seems like the only plausible cause of the issue.

:mconley, would it be possible to do perf tests between the pre-patch version & a version with the patch applied, but the change to StaticAtoms.py reverted? It might be interesting to see if re-adding that atom changes anything, as I have no idea what else could.

Flags: needinfo?(nika) → needinfo?(mconley)

Sorry for the delay. I pushed a version of the patch that puts the atom back, and it doesn't look like it moved the needle. I'm going to try putting back each part of the patch piece by piece to see which part finally "fixes" things. Once we identify that chunk, I'll throw some markers around it (along the lines of the sketch below) to see how it's involved with the Talos tests.
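(For the record, the marker idea could look roughly like this. ChromeUtils.addProfilerMarker is the real chrome-only API, but runSuspectChunk and the marker name are hypothetical placeholders for whichever chunk of bug 1715773 the bisection lands on.)

  // Hypothetical instrumentation around the chunk being bisected.
  // Cu.now() returns a high-resolution timestamp usable as a marker start time.
  const start = Cu.now();
  runSuspectChunk(); // placeholder for the piece of bug 1715773 being re-added
  ChromeUtils.addProfilerMarker(
    "bug1724199-suspect-chunk",  // marker name shown in the profiler timeline
    { startTime: start },        // makes this an interval marker from start to now
    "timing the re-added chunk"  // free-form text payload
  );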

Status: NEW → ASSIGNED

I'm admitting defeat on this one. I have not had time to figure this out, and it's unlikely I'm going to get to it anytime soon.

I think the refactor in bug 1715773 made sense, and I think we should keep it.

Assignee: mconley → nobody
Status: ASSIGNED → NEW
Flags: needinfo?(mconley)
Has Regression Range: --- → yes