10.09 - 9.23% sessionrestore_no_auto_restore / sessionrestore_no_auto_restore + 1 more (Windows) regression on Tue August 3 2021
Categories
(Core :: DOM: Navigation, defect, P2)
Tracking
()
Tracking | Status | |
---|---|---|
firefox-esr78 | --- | unaffected |
firefox-esr91 | --- | unaffected |
firefox90 | --- | unaffected |
firefox91 | --- | unaffected |
firefox92 | + | wontfix |
firefox93 | + | wontfix |
People
(Reporter: aesanu, Unassigned)
References
(Regression)
Details
(4 keywords)
Perfherder has detected a talos performance regression from push e89d64559e5b7cb7191b1a019e45d6a30eb3090f. As author of one of the patches included in that push, we need your help to address this regression.
Regressions:
Ratio | Suite | Test | Platform | Options | Absolute values (old vs new) |
---|---|---|---|---|---|
10% | sessionrestore_no_auto_restore | sessionrestore_no_auto_restore | windows10-64-shippable-qr | e10s stylo webrender | 515.54 -> 567.58 |
9% | sessionrestore | sessionrestore | windows10-64-shippable-qr | e10s stylo webrender | 461.00 -> 504.08 |
9% | sessionrestore_no_auto_restore | sessionrestore_no_auto_restore | windows10-64-shippable-qr | e10s stylo webrender | 519.00 -> 566.92 |
Details of the alert can be found in the alert summary, including links to graphs and comparisons for each of the affected tests. Please follow our guide to handling regression bugs and let us know your plans within 3 business days, or the offending patch(es) will be backed out in accordance with our regression policy.
For more information on performance sheriffing please see our FAQ.
Comment 1•3 years ago
|
||
Set release status flags based on info from the regressing bug 1715773
Comment 2•3 years ago
•
|
||
I captured some profiles for these tests on try. Here's a brief review of what the tests measure, followed by profile links:
sessionrestore_no_auto_restore:
This test works by measuring the time from process start to the "sessionRestored" StartupTimeline event (which also shows up in the Marker Chart).
Baseline (no regression): https://profiler.firefox.com/from-url/https%3A%2F%2Ffirefox-ci-tc.services.mozilla.com%2Fapi%2Fqueue%2Fv1%2Ftask%2FFIBZE5rhRMyPYJPulYYIfw%2Fruns%2F0%2Fartifacts%2Fpublic%2Ftest_info%2Fprofile_sessionrestore_no_auto_restore.zip
sessionrestore:
This test works by measuring the time from process start to the time that the last restored tab reported having successfully restored its state.
Baseline (no regression): https://profiler.firefox.com/from-url/https%3A%2F%2Ffirefox-ci-tc.services.mozilla.com%2Fapi%2Fqueue%2Fv1%2Ftask%2FFIBZE5rhRMyPYJPulYYIfw%2Fruns%2F0%2Fartifacts%2Fpublic%2Ftest_info%2Fprofile_sessionrestore.zip
Comment 3•3 years ago
|
||
So looking at the profiles, I'm pretty stumped. For the sessionrestore_no_auto_restore test, it looks like we're consistently reaching the sessionRestored observer notification faster with the patch, than without it.
process start time doesn't get recorded in the profiler. I'm going to try modifying the test to use the "main" timestamp instead, since that shows up in the timeline, to see if we still get disagreement here.
Comment 4•3 years ago
|
||
So, my suspicion that the delta seems to be related to an increase of time between the "process start" and "main" timestamps seems to be confirmed by these two try pushes:
Baseline (no regression): https://treeherder.mozilla.org/jobs?repo=try&selectedTaskRun=ePs0UU_CTfy-XXRL7Uqbzw.0&revision=2127cc729d5cfe518b85a400ab5a9bd6e219ba95
Patch (supposed regression): https://treeherder.mozilla.org/jobs?repo=try&selectedTaskRun=ePs0UU_CTfy-XXRL7Uqbzw.0&revision=947e71f71853d9aa2c86595ab605634abcffc284
The only difference with these pushes is that I modified the talos test to measure the time between "main" and the session having been restored. With that change, comparechooser shows no significant difference:
Comment 5•3 years ago
|
||
The conclusion is kind of strange then... somehow, it seems that bug 1715773 caused an increase of time between process start and main... Oddly, this increase isn't being detected by ts_paint, which also uses the process start timestamp: https://searchfox.org/mozilla-central/rev/162a8620f126c27de70766808708696260f2256a/testing/talos/talos/startup_test/tspaint_test.html#29
It also isn't detected by startup_about_home_paint, which also ultimately reads a measurement that compares against process start: https://searchfox.org/mozilla-central/rev/162a8620f126c27de70766808708696260f2256a/browser/modules/AboutNewTab.jsm#239-244
Here's ts_paint and startup_about_home_paint graphed with the landing of the supposed regressor highlighted: https://treeherder.mozilla.org/perfherder/graphs?highlightAlerts=1&highlightChangelogData=1&highlightedRevisions=e89d64559e5b&series=autoland,2813231,1,1&series=autoland,2813238,1,1&timerange=1209600
No bump.
So my conclusion is that... somehow, the sessionrestore tests result in a larger gap between process start and main. I don't know how that's possible.
Comment 6•3 years ago
|
||
Nika wonders whether the code changes in https://hg.mozilla.org/integration/autoland/rev/e89d64559e5b7cb7191b1a019e45d6a30eb3090f cause a binary layout change.
The only change that might affect pre-main execution is the removal of unused Atom("preloadedState", "preloadedState")
in StaticAtoms.py. But we add and remove static atoms all the time without seeing a regression like this one.
Comment 7•3 years ago
|
||
What's the next step here? Should we consider backing the change out of 92?
Comment 8•3 years ago
|
||
Hey gerald, we've got an interesting Talos regression here where visibility into what's going on before XUL loads might be interesting. Do you know if it's possible to request Base Profiler profiles from Talos? Maybe there's just an environment variable we can set to get them?
Comment 9•3 years ago
|
||
As mconley noted in comment 5, this regression is not seen in the ts_paint or startup_about_home_paint benchmarks. The Perfherder graph below shows sessionrestore_no_auto_restore's test results regressing alongside unchanged test results from ts_paint or startup_about_home_paint. The sessionrestore_no_auto_restore results get noisier and shift from ~500 ms to ~600 ms.
(In reply to Mike Conley (:mconley) (:⚙️) from comment #8)
Talos regression where visibility into what's going on before XUL loads might be interesting. Do you know if it's possible to request Base Profiler profiles from Talos? Maybe there's just an environment variable we can set to get them?
The Base Profiler was deactivated in Talos tests because of intermittents, see bug 1648325. I doubt I will be able to work on this any time soon, sorry.
In the meantime, you could try commenting-out (#
) this line in your own Talos runs: https://searchfox.org/mozilla-central/rev/d3683dbb252506400c71256ef3994cdbdfb71ada/testing/talos/talos/gecko_profile.py#96 Good luck!
Updated•3 years ago
|
Updated•3 years ago
|
Updated•3 years ago
|
Comment 11•3 years ago
|
||
I gathered some profiles from the pre and post patch revisions using the base profiler. I could only tease out profiles for the sessionrestore
tests, and not the sessionrestore_no_autorestore
test. These are the profiles for the cycle 0 instances of those tests:
Base (no regression): https://share.firefox.dev/3DgV0yW
Patch (regression): https://share.firefox.dev/3mu5oNP
A few things jump out at me:
- The GPU process seems to take longer to get launched in the regression profile
- Some graphics-related system calls seem to be slower too..
NtUserDisplayConfigGetDeviceInfo
takes 15 samples in the patch profile when opening the initial window. That system call doesn't even show up in the base profile samples.
Hey nika... does anything meaningful jump out at you from these profiles? Any thoughts on how the patch in bug 1715773 could have possibly affected GPU process startup time?
Comment 12•3 years ago
|
||
To be clear to folks following along, I'm still not convinced that there's a real-world regression here. Note comment 5, and how when I adjusted the test to measure the time between "main" and session restored, the regression disappeared.
Comment 13•3 years ago
|
||
(In reply to Mike Conley (:mconley) (:⚙️) from comment #11)
Base (no regression): https://share.firefox.dev/3DgV0yW
Patch (regression): https://share.firefox.dev/3mu5oNP
In these two tests, with GOOD, the "main" is at 44ms, and with BAD it's at 41ms, which doesn't really line up with the observation that the time between process start and "main" has increased, afaict (or perhaps I'm reading the numbers wrong).
I noticed that the time delta between sessionRestoreInit
and sessionRestored
is about the same in both the GOOD and BAD profiles, with sessionRestoreInit
starting at 194ms with the GOOD profile, and 255ms with the BAD profile. That totals to about 61ms of time difference in the first few ms of process startup.
Some other probes also interestingly had a similar situation, with earlyBlankWindowVisible
starting at 145ms with GOOD, and 205ms with BAD. The earlyBlankFirstPaint
probe was also ~60ms faster in the GOOD situation compared to the BAD one.
Comparing the markers during just the earlyBlankFirstPaint
marker (https://share.firefox.dev/3sPJSUH, https://share.firefox.dev/3yoR23w), I noticed the BrowserGlue.jsm
import took ~60ms longer, and that there is an interesting jump in the time stamp for markers around when "NewThread - VsyncIOThread" and "NewThread - IPC Launch" occur, which lines roughly up with your observation that the GPU process (which is probably the first process to be launched?) is not being launched until slightly later. There's another gap around the "activate" DOMEvent, which fires ~6ms after the "NewThread - TRR Background" marker in the GOOD profile, but 40ms after in the BAD profile.
Hey nika... does anything meaningful jump out at you from these profiles? Any thoughts on how the patch in bug 1715773 could have possibly affected GPU process startup time?
I don't see anything in the profiles which points at how that patch could cause this regression. All of the modified code shouldn't even be running yet at this point. We're barely starting the load of browser.xhtml
(I highly doubt we're executing tabbrowser.js
yet), and even DocumentLoadListener will have bailed out early due to a non-content BrowsingContext before it got far enough in the process selection logic to hit the removed code. The changes with the static atom table still seem like possibly the only cause of the issue.
:mconley, would it be possible to do perf tests between the pre-patch version & a version with the patch applied, but the change to StaticAtoms.py
reverted? It might be interesting to see if re-adding that atom changes anything, as I have no idea what else could.
Updated•3 years ago
|
Comment 14•3 years ago
|
||
Waiting for try results: https://treeherder.mozilla.org/jobs?repo=try&revision=2406f69be8f65c5fe2702adca7a88c149825ffb0
Comment 15•3 years ago
|
||
Sorry for the delay. So I pushed a version of the patch that puts the atom back, and it doesn't look like it moved the needle. I'm going to try putting back each part of the patch piece by piece to see which part finally "fixes" things... Once we identify that chunk, I'll throw some markers around it to see how it's involved with the Talos tests.
Updated•3 years ago
|
Updated•3 years ago
|
Comment 16•3 years ago
|
||
I'm admitting defeat on this one. I have not had time to figure this out, and it's unlikely I'm going to get it anytime soon.
I think the refactor in bug 1715773 made sense, and I think we should keep them.
Updated•3 years ago
|
Description
•