Closed Bug 1675461 Opened 4 years ago Closed 4 years ago

3.26 - 53.4% about_newtab_with_snippets responsiveness / sessionrestore / tp5o (linux64, windows10-64, windows7-32) regression on push 3552fd66bb76eb14aa21b0bda624a30b83dd9dcd (Mon November 2 2020)

Categories

(Testing :: Performance, defect, P3)

Firefox 84


RESOLVED FIXED
85 Branch
Tracking Status
firefox-esr78 --- unaffected
firefox82 --- unaffected
firefox83 --- unaffected
firefox84 --- disabled
firefox85 --- fixed

People

(Reporter: alexandrui, Assigned: Gijs, NeedInfo)


Attachments

(3 files)

Perfherder has detected a talos performance regression from push 3552fd66bb76eb14aa21b0bda624a30b83dd9dcd. Since you authored one of the patches included in that push, we need your help to address this regression.

Regressions:

Ratio Suite Test Platform Options Absolute values (old vs new)
53% tp5o responsiveness windows7-32-shippable e10s stylo 0.72 -> 1.10
36% tp5o responsiveness windows10-64-shippable e10s stylo 0.90 -> 1.23
32% tp5o responsiveness linux64-shippable-qr e10s stylo webrender-sw 0.99 -> 1.30
31% tp5o responsiveness linux64-shippable e10s stylo 0.89 -> 1.17
29% about_newtab_with_snippets responsiveness linux64-shippable-qr e10s stylo webrender-sw 0.10 -> 0.13
27% tp5o_webext responsiveness windows7-32-shippable e10s stylo 1.33 -> 1.69
23% tp5o_webext responsiveness windows10-64-shippable-qr e10s stylo webrender-sw 1.53 -> 1.88
22% tp5o_webext responsiveness windows10-64-shippable e10s stylo 1.54 -> 1.88
22% tp5o_webext responsiveness linux64-shippable-qr e10s stylo webrender 1.82 -> 2.22
12% tp5o_webext responsiveness windows7-32-shippable e10s stylo 1.41 -> 1.59
5% sessionrestore sessionrestore linux64-shippable-qr e10s stylo webrender-sw 521.96 -> 550.25
3% sessionrestore sessionrestore linux64-shippable e10s stylo 513.29 -> 530.00

Improvements:

Ratio Suite Test Platform Options Absolute values (old vs new)
15% tabswitch windows10-64-shippable-qr e10s stylo webrender-sw 10.51 -> 8.93
12% startup_about_home_paint_realworld_webextensions startup_about_home_paint_realworld_webextensions linux64-shippable e10s stylo 612.17 -> 539.33
10% startup_about_home_paint_realworld_webextensions startup_about_home_paint_realworld_webextensions windows10-64-shippable-qr e10s stylo webrender 641.54 -> 575.67

Details of the alert can be found in the alert summary, including links to graphs and comparisons for each of the affected tests. Please follow our guide to handling regression bugs and let us know your plans within 3 business days, or the offending patch(es) will be backed out in accordance with our regression policy.

For more information on performance sheriffing please see our FAQ.

Flags: needinfo?(jaws)

Set release status flags based on info from the regressing bug 1673363

The current hypothesis is that this is about the fact that we'll try to show the toolbar for the about:blank page that loads before the pages loaded by sessionrestore or tp5o, and then hide it again once we load a page. With bug 1675143 / bug 1675147 fixed I'd expect this issue to go away.

Depends on: 1675143

(In reply to :Gijs (he/him) from comment #2)

> The current hypothesis is that this is about the fact that we'll try to show the toolbar for the about:blank page that loads before the pages loaded by sessionrestore or tp5o, and then hide it again once we load a page. With bug 1675143 / bug 1675147 fixed I'd expect this issue to go away.

bug 1674199 was resolved and this doesn't seem fixed from looking at graphs. :-(

Priority: -- → P3

OK, so this isn't making much sense so far.

The responsiveness number is the sum of a series of squared numbers, divided by some large normalization constant.

The wiki says:

The score on this benchmark is proportional to the sum of squares of all event delays that exceed a 20ms threshold. Lower is better.
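
For illustration, a minimal sketch of that scoring scheme in plain JS - not the actual Talos code, which per the description above also divides the sum by a normalization factor:

  // Square every event delay above the 20ms threshold and sum the results.
  // (Hedged sketch; the real metric additionally normalizes the sum.)
  function responsivenessScore(delaysMs, thresholdMs = 20) {
    return delaysMs
      .filter(d => d > thresholdMs)
      .reduce((sum, d) => sum + d * d, 0);
  }

  // Example: only the two delays above 20ms contribute.
  responsivenessScore([5, 18, 21, 30]); // 21*21 + 30*30 = 1341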

It seems that, with the patch, it's mostly the number of these events that has gone up (around 1270 after vs. around 1190 before, on Linux non-WebRender). It's not clear to me what causes this to happen, especially because the entire code change in the regressing patch amounts to https://hg.mozilla.org/mozilla-central/diff/3552fd66bb76eb14aa21b0bda624a30b83dd9dcd/browser/base/content/browser.js . Given that the majority of this is behind an if condition that will be false while the tp5o test runs, that leaves literally just this change:

+        let { currentURI } = gBrowser;
         isVisible = BookmarkingUI.isOnNewTabPage({
-          currentURI: gBrowser.currentURI,
+          currentURI,

I don't see how that could possibly have changed the test outcome by this much. This code is in the onLocationChange path, so I guess the test repeatedly reloads things and so that might be relatively hot (in this automated test). But the magnitude of the change is bizarre. I also don't understand how this would have materially impacted the number of delayed events.
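
(For what it's worth, semantically both forms boil down to the same single property read - the snippet below is just plain JS for illustration, not browser code:

  // Both lines perform one read of the "currentURI" property on gBrowser;
  // any difference would have to come from how the engine executes the
  // destructuring form, not from what it observably does.
  let currentURIExplicit = gBrowser.currentURI;  // old form
  let { currentURI } = gBrowser;                 // new form

so if this change is involved at all, it would presumably be an engine-level effect rather than a behavioural one.)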

The contacts for this test are listed as Jim Mathies and Overholt - that feels out of date, but on the off chance it's not, can you shed any light, or forward to someone who you think would be able to?

(I did double-check and it does seem clear that the regression range is correct, which adds to my puzzlement. I've triggered some gecko profiler jobs on the regressing push and the preceding one, but I suspect that if this is about the number of events delayed by just over 20ms instead of just under 20ms, the profiler overhead will be such that the result won't help us.)

Flags: needinfo?(overholt)

Maybe Bas has an idea?

Flags: needinfo?(overholt) → needinfo?(bas)

(In reply to :Gijs (he/him) from comment #4)

> OK, so this isn't making much sense so far.
>
> The responsiveness number is the sum of a series of squared numbers, divided by some large normalization constant.
>
> The wiki says:
>
> The score on this benchmark is proportional to the sum of squares of all event delays that exceed a 20ms threshold. Lower is better.
>
> It seems that, with the patch, it's mostly the number of these events that has gone up (around 1270 after vs. around 1190 before, on Linux non-WebRender). It's not clear to me what causes this to happen, especially because the entire code change in the regressing patch amounts to https://hg.mozilla.org/mozilla-central/diff/3552fd66bb76eb14aa21b0bda624a30b83dd9dcd/browser/base/content/browser.js . Given that the majority of this is behind an if condition that will be false while the tp5o test runs, that leaves literally just this change:
>
> +        let { currentURI } = gBrowser;
>          isVisible = BookmarkingUI.isOnNewTabPage({
> -          currentURI: gBrowser.currentURI,
> +          currentURI,
>
> I don't see how that could possibly have changed the test outcome by this much. This code is in the onLocationChange path, so I guess the test repeatedly reloads things and so that might be relatively hot (in this automated test). But the magnitude of the change is bizarre. I also don't understand how this would have materially impacted the number of delayed events.
>
> The contacts for this test are listed as Jim Mathies and Overholt - that feels out of date, but on the off chance it's not, can you shed any light, or forward to someone who you think would be able to?
>
> (I did double-check and it does seem clear that the regression range is correct, which adds to my puzzlement. I've triggered some gecko profiler jobs on the regressing push and the preceding one, but I suspect that if this is about the number of events delayed by just over 20ms instead of just under 20ms, the profiler overhead will be such that the result won't help us.)

I honestly have never paid much attention to the responsiveness tests, so I'm going by your description, but a couple of thoughts:

  • If the event delay of a certain event hovers around 20ms, the count could easily grow or shrink quite a bit based on a really small change (see the hypothetical numbers after this list). This seems like a somewhat problematic aspect of the test. As an extreme example, if you had 1000 events that took 19.9ms, you could regress them by 1% and -drastically- regress the test.

  • Another observation is that this change makes the code inside the switch quite a bit more branchy and complicated. It seems possible to me that this could have a disproportionate effect on JS execution time. We know there are numerous performance-cliff effects of this kind when dealing with JS, although this one isn't obvious to me. Ted, do you know if there could be anything to this?
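
To put hypothetical numbers on the first bullet (sketch only; these delays are made up, not measured):

  // 1000 events at 19.9ms contribute nothing to the score, but a 1%
  // slowdown pushes every one of them past the 20ms threshold.
  const before = Array(1000).fill(19.9);
  const after = before.map(d => d * 1.01); // 20.099ms each

  const score = delays =>
    delays.filter(d => d > 20).reduce((sum, d) => sum + d * d, 0);

  score(before); // 0
  score(after);  // 1000 * 20.099^2 ≈ 404,000 (before any normalization)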

Flags: needinfo?(bas) → needinfo?(tcampbell)

I graphed the distribution of individual responsiveness values from before and after the patch (using the linux64 Perfherder graph). There is a minor effect of being close to the 20ms cutoff, but there is a pretty clear 5ms delay. I'll take a quick look at whether there is any weird JIT behaviour, since gBrowser is a pretty exotic object and destructuring is quite complicated under the hood.

Edit: The data is the ~1200 samples within a single run of each.
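
For reference, a rough sketch (not the tooling I actually used) of the kind of bucketing involved, for anyone wanting to redo the comparison from raw per-event delays:

  // Bucket per-event delays into 1ms bins so the before/after
  // distributions can be compared side by side.
  function histogram(delaysMs, binMs = 1) {
    const bins = new Map();
    for (const d of delaysMs) {
      const bin = Math.floor(d / binMs) * binMs;
      bins.set(bin, (bins.get(bin) || 0) + 1);
    }
    return [...bins.entries()].sort((a, b) => a[0] - b[0]);
  }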

So dropping the domContentLoaded if condition fixes the regression, cf https://treeherder.mozilla.org/perfherder/compare?originalProject=try&originalRevision=a79686a94f811cf84a1d9dafa19142a756308db3&newProject=try&newRevision=34b8663d248a25e8745600446d1200e48786401d&framework=1 .

I still don't understand why -- the gBrowserInit.domContentLoaded prop should be set to true relatively early on, so the code in the condition won't run after that, and it seems weird for the code in that block to have an impact on the subsequent pageloads? Unless all the additional/slower events are coming in during that initial time period? That also seems far-fetched given comment #7.

Assignee: nobody → gijskruitbosch+bugs
Status: NEW → ASSIGNED
Flags: needinfo?(jaws)

Oh! It would seem that uriToLoadPromise initially returns about:home in the talos test, for the first window (which looks like it immediately closes or something? I haven't found the source for the tests, so I'm just going by logging from browser.js atm), and then about:blank in a second window (logging shows a different outerWindowID). So I guess we try to show the bookmarks toolbar in the first window only to immediately close that window, which maybe messes something up with the state of Places or something? Still weird, but at least a slightly better sense of why this seemingly innocuous change is causing this outsized response in the test. I'll dig in further after sleep...

Flags: needinfo?(tcampbell)

The root cause here is still eluding me.

We initially open the toolbar in the first window that the test opens, and then that window is closed. We never open the toolbar in the main window used for tp5o.

(base push on recent central)

Disabling the code that does this fixes the regression.

Disabling it and toggling the toolbar on/off a bit later in the test doesn't seem to regress.

Neither does accessing uriToLoadPromise.

Logging shows we successfully clean up the places node and observer when the window is closed.

So the question is why the opening of the toolbar is having an effect at all here.

I just tried whether toggling it on/off earlier makes a difference ( https://treeherder.mozilla.org/jobs?repo=try&revision=107d2d8b0cecc1021353bf3620dc0e40508b9c40 ) and whether toggling it on but not off before closing the window makes a difference ( https://treeherder.mozilla.org/jobs?repo=try&revision=f3f998a64efc6ecb3a69ae616be5dd7aebd6cbb3 )

but it's definitely stab-in-the-dark type stuff - I'm not even really sure when the test extension code runs vs. the initial window. It's even possible that the parent-process delay caused by the main-thread Places access that showing the toolbar entails gives the content process enough time to load and run a ton of about:home stuff, which then eats CPU cycles and leads to reduced responsiveness in the parent. Who knows.

Locally, on Windows, the test is way too noisy for me to reliably reproduce anything, which is hindering my attempts at tracking down the root cause. I expect the same would be true for gecko profiles.

Best guess right now is that some places stuff stays active with the window closed (or the about:home stuff mentioned above runs) and that's causing event delays. Based on the graph in comment #7 I'm guessing the increased delays happen throughout the run, though if they are somehow concentrated in the first few pages (when maybe we're cleaning up the old window?) then maybe that explains things better - that feels like we should be able to check it with the test output, but I'm honestly not 100% sure how to do that.

On the whole, this feels pretty weird. The responsiveness side of the test is inadvertently measuring the cost of opening and immediately closing a window. The event tracer runs from process start - and although we can hack around that in the bookmarks code, really it feels like the test is measuring the wrong thing, and that's what we should fix instead - but I'm not sure how best to do that.

The simplest fix here would likely be changing the test framework part that launches the browser to do so by passing an about:blank URI, instead of loading about:home.

The more "proper" fix would be to have the event tracer start based on some call from the test that it's ready to start testing, and to give the first window plenty of time to clean up after itself.

(In reply to :Gijs (he/him) from comment #10)

> On the whole, this feels pretty weird. The responsiveness side of the test is inadvertently measuring the cost of opening and immediately closing a window. The event tracer runs from process start - and although we can hack around that in the bookmarks code, really it feels like the test is measuring the wrong thing, and that's what we should fix instead - but I'm not sure how best to do that.
>
> The simplest fix here would likely be changing the test framework part that launches the browser to do so by passing an about:blank URI, instead of loading about:home.
>
> The more "proper" fix would be to have the event tracer start based on some call from the test that it's ready to start testing, and to give the first window plenty of time to clean up after itself.

To be explicit, this would fix the numerical regression here, retain the purpose of the responsiveness test, and make its numbers more stable - but obviously it would still be good to understand why exactly this change in the first window has this effect on the test, i.e. why toggling the bookmarks toolbar in that window makes things slower.

Still pretty stuck on this. Marco wrote some patches to fix some of the entanglement between history and bookmarks result nodes and observers in bug 1677408, which are nice but do not appear to fix the regression here.

I've confirmed that commenting out https://searchfox.org/mozilla-central/rev/277ab3925eae21b419167a34624ec0ab518d0c94/browser/base/content/browser-places.js#1159 (which constructs a places result node so we can load bookmarks etc.) fixes this.

I've also tried comparing the MOZ_LOG=mozStorage:5 output to see what SQL queries we're running in the regressing and non-regressing cases. Although I noticed an increase in TRACE_TIME measures (which SQLite outputs to indicate which queries took more than 0ms to complete) for frecency updates (i.e. the UPDATE queries associated with moz_historyvisits_afterinsert_v2_trigger), I'm struggling to work out why - is that just another symptom of whatever is keeping the machine busy, or are such queries, or other Places busy-ness, the cause here?

I'll attach both logs in the hope that either Marco or :asuth can work out what the distinction is between the good and bad case. I didn't see anything completely obvious - there are some additional queries at the beginning of the "bad" run (when we create the toolbar and look for bookmarks in it), but nothing else jumped out at me - though the fact that many queries span multiple lines also means that doing any kind of statistics on the output ("saw query type X N times") is not easily possible.

In both logs I've run the start of the tp5o job, until I noticed the browser stopped loading 163.com. In both logs the initial browser run that initializes the profile is included (and should probably be disregarded, but I left it in, in case it contains vital clues) and takes up a large amount of the log file (about half) because it's where all the tables etc. are created. Finally, in both logs there are a few marker lines that look roughly like this:

23:50:32     INFO -  PID 816025 | Hellohellohello about:home - true

which is where we hit the regressing bit of code. In the "bad" log that code is enabled and will have shown the toolbar (and you can see mozStorage output straight after it); in the "good" log it is disabled and so doesn't trigger showing the toolbar.

Flags: needinfo?(mak)
Flags: needinfo?(bugmail)
Attached file bad-moz-log.txt
Attached file good-moz-log.txt

Is it possible to provide links to the firefox profiler runs mentioned or obtain new profiler runs? Besides being interested in trying the new "comparing interface" teased at https://profiler.firefox.com/, it very much seems like the interesting aspects of any mozStorage involvement would be in terms of the results runnables from the async queries inducing contention on the main thread.

Aside: What is up with all of the "httpdisabledsdisabled" and "httpdisabled" URI schemes that places is being asked about?

Flags: needinfo?(bugmail)

(In reply to Andrew Sutherland [:asuth] (he/him) from comment #16)

> Aside: What is up with all of the "httpdisabledsdisabled" and "httpdisabled" URI schemes that places is being asked about?

Link coloring; to avoid many of these requests we could implement a bloom filter (possibly shared with mobile) in BaseHistory.
These happen on a separate connection anyway, so they should not create contention.
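
For illustration, a generic sketch of the kind of bloom filter lookup being suggested - the names, sizes and hashing here are hypothetical, nothing that exists in BaseHistory today:

  // Tiny bloom filter over visited URLs: a negative answer means
  // "definitely not visited" and lets link coloring skip the real
  // Places lookup; a positive answer may be a false positive and
  // still needs the real lookup.
  class UrlBloomFilter {
    constructor(bits = 1 << 20, hashes = 3) {
      this.bits = bits;
      this.hashes = hashes;
      this.data = new Uint8Array(bits >> 3);
    }
    _hash(str, seed) {
      // FNV-1a style hash, varied per seed.
      let h = (2166136261 ^ seed) >>> 0;
      for (let i = 0; i < str.length; i++) {
        h ^= str.charCodeAt(i);
        h = Math.imul(h, 16777619);
      }
      return (h >>> 0) % this.bits;
    }
    add(url) {
      for (let s = 0; s < this.hashes; s++) {
        const bit = this._hash(url, s);
        this.data[bit >> 3] |= 1 << (bit & 7);
      }
    }
    mightContain(url) {
      for (let s = 0; s < this.hashes; s++) {
        const bit = this._hash(url, s);
        if (!(this.data[bit >> 3] & (1 << (bit & 7)))) return false;
      }
      return true;
    }
  }

The point is only the shape of the check; the real thing would also need persistence and updates on visit/removal.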

The bad log shows us initializing the bookmarks service pretty much immediately, as visible from the bookmark roots creation; in the good log it is initialized quite a bit later. Again, this is likely the effect of showing the bookmarks toolbar. But it's part of the initial run, so it should not affect the next one.
Apart from that there's nothing particularly strange in the first part; from the point of view of database access the two seem equivalent, modulo the early bookmarks initialization.
The bookmarks toolbar may also be adding some JavaScript and GC+CC costs, compared to the good case.

Then we reach the "Running cycle 1/1 for tp5o test" logging, which is likely where we start the measurement. The previous instance has been closed (lots of db closing).
Could there be some tail effect of GC+CC at this point?

Soon there is the first discrepancy, where the MOZ_EVENT_TRACE sample happens much sooner in the good case.
The only things executing anything at that point are cookies, localStorage and indexeddb.
But then we start opening a bunch of connections, because this is a new instance, and those openings seem to just happen at different timings. I/O is unpredictable, so I'm not sure there's much we can do here; as you said, we could add some artificial delay to ensure a more stable disk situation before starting the test.

One improvement I found from these logs is that PlacesExpiration is running a query to calculate limits that is not immediately necessary; I filed bug 1677969 and will look into it. I must also note that PlacesExpiration currently runs on a timer, which is not necessary anymore but requires some rework that we are tracking in bug 1475582. I hope to find some time to look into that in the next days.
I suspect that, due to an additional delay added by the toolbar in the initial run, expiration ends up running in the middle of these responsiveness measurements and can skew them, depending on when it runs.
I suggest trying a run with places.history.expiration.interval_seconds = 3600 and checking whether it makes a difference.
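
If it helps, one way to set that for a local run would be a user.js in the test profile (assuming it gets picked up; the actual Talos pref plumbing may differ):

  // user.js in the profile used for the run - sketch only.
  user_pref("places.history.expiration.interval_seconds", 3600);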

Flags: needinfo?(mak)

Ah, I also see Places telemetry being collected, maybe an effect of bug 1666545 - do we disable telemetry for these tests?
I filed bug 1677977 to avoid collecting Places telemetry from disk if telemetry is disabled; it's not enough to not send it, we should not even collect it, to save on I/O.
Again, skewed timings could mean the first idle now happens at the "wrong" time.

(In reply to Andrew Sutherland [:asuth] (he/him) from comment #16)

> Is it possible to provide links to the firefox profiler runs mentioned or obtain new profiler runs? Besides being interested in trying the new "comparing interface" teased at https://profiler.firefox.com/, it very much seems like the interesting aspects of any mozStorage involvement would be in terms of the results runnables from the async queries inducing contention on the main thread.

Profiling jobs ran against the regressing cset:

Regressor

Previous

But to be honest, I've looked at them and I don't think they're very useful here - the profiler jobs for tp5o load each page 2 times instead of 25, and experience significant overhead from the profiler. Add to that the facts that the regression is really about 1-5ms delays compared to the base, that the profiler resolution is 1ms, and that the profiler splits its results so that each of its N pages * 2 pageloads is represented as a separate "cycle", and it's almost impossible to find something that's actually meaningfully comparable. I asked about adding all the main thread sampling of these profiles together and comparing the result, but we apparently do not have tooling for that.

As for the contention, I don't think that the gecko profiler includes the mozStorage threads by default, so I guess to see that we'd need to manually run the profiler and tell it to sample those threads, or something?

> Aside: What is up with all of the "httpdisabledsdisabled" and "httpdisabled" URI schemes that places is being asked about?

I think it's to avoid prefetches for the links on the page which would interfere with measurements and/or crash the build for making outside network connections or something? It's not 100% clear to me - the original test suite is very old (which you can also see if you run the test locally - the pages are taken from sometime in mid-2013, judging by the headlines on news sites about a US gov't shutdown during the Obama presidency).

(In reply to Marco Bonardo [:mak] from comment #18)

> I suggest trying a run with places.history.expiration.interval_seconds = 3600 and checking whether it makes a difference.

I tried this as well as disabling history entirely, and neither appears to impact this regression (though disabling history globally improves the responsiveness metric a little both with and without this regression, which makes sense as we'll do strictly less work on the whole).

Thank you for the profiles! I agree they don't show a smoking gun, but for my purposes they do at least seem to show that there's no meaningful impact of mozStorage on the main thread of the parent process, or of IndexedDB/friends. That said, the Network tables for the profiles do seem to be showing somewhat different network panel timings, including some bi-modal values and intensely different "Waiting to transmit the response" AKA responseEnd phases in the content processes (likely due to the reflow jank). I wonder if the network proxy thing itself could be experiencing disk I/O contention from Places activity or something like that.

Your general theory in comment 10 that small changes in behavior can result in larger emergent changes in behavior seems quite reasonable to me; for example the 163.com cycle_0 bad profile seems to show an HTML tree flush bisecting two very large reflows whereas the good profile seems to have only a single continuous reflow. Unfortunately, it seems like it's a bigger undertaking to sample across many different profiler runs.

I think this leaves things as just a Places problem-domain issue, so I'll leave it to :mak's expertise.

I believe (and have some data that indicates) bug 1667237 is going to "fix" this (ie the numbers will go back to what they were, perhaps even improve slightly).

It does this by delaying places toolbar initialization until after we paint the window. This also reduces mainthread IO during early startup (postponing it) and reduces flicker a bit because we will show the toolbar (initially empty) and then load bookmarks - instead of first showing a half-size toolbar and then filling it with bookmarks. It also paves the way for asynchronous places view use in the toolbar (which would also populate "lazily", because it'd no longer do sync, mainthread IO). So I'm happy that that is a "good" change.

In terms of this bug, although it is the "right" fix for the sessionrestore regression (which I didn't really look at in as much detail but I'm 99% sure is about the regressing bug making us do the places initialization work sooner), it's arguably a wallpaper-type fix for the responsiveness one, where we've never really gotten to the bottom of why the places initialization had a long-term impact on responsiveness, despite a number of very clever people looking (thank you, everyone!). However, there are pretty limited avenues to do more here with current tooling, and I'm not convinced our time is well-spent digging into that very narrow problem more than we already have - if we wanted to look at responsiveness, I think e.g.:

  1. updating the pagesets from 2013 to something more recent/representative
  2. getting better profiling working for these runs (higher granularity and/or aggregating multiple runs)
  3. optimizing our JS code that responds to pageloads (e.g. bug 1674511, investigating moving reader mode to something that doesn't involve the main thread and DOM so much), or taking that code out of the critical path
  4. making more realistic tests (we're effectively reloading 24 of the 25 times we load each page, so the URL, security and bookmark state don't really change. We can optimize the frontend code for that, but that's not very realistic.)

would be better avenues. But I think that's best not done here, and perhaps not even on this test - in fact, perhaps we should write a raptor- or browsertime-based equivalent and then retire this test, and use our newer responsiveness measures, retiring the old event-tracer-based code used by tp5o in favour of the main-thread event loop delay measurements that I believe Randell wrote. I'm pretty sure the raptor code covers the "load high profile pages" type work here, though perhaps its pageset could be expanded using some of the types of sites in this set (thinking especially of heavy news front pages)?
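
To make that last suggestion a bit more concrete, the kind of measurement I mean is roughly this (generic sketch, not Randell's code or the existing Talos/Raptor implementation):

  // Sample main-thread event-loop delay by scheduling a timer and
  // measuring how late it actually fires.
  function watchEventLoopDelay(intervalMs = 50, onSample = console.log) {
    let expected = performance.now() + intervalMs;
    setTimeout(function tick() {
      const now = performance.now();
      onSample(Math.max(0, now - expected)); // lateness ~= event-loop delay
      expected = now + intervalMs;
      setTimeout(tick, intervalMs);
    }, intervalMs);
  }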

Bas, how do you feel about this, and what's the best way for me to integrate those ideas into our 2021 plans (where I know we're thinking about the responsiveness problem in general) - do I just file bugs? Something else?

Depends on: 1667237
Flags: needinfo?(bas)

Oh, and the regressing code here is disabled in late beta, and the change in bug 1667237 was a tiny bit tricky, so I'm not currently planning on uplifting that. Instead, I think we can mark 84 as "disabled" here once we hit late beta. We'll run a small experiment on 84 to evaluate the new bookmarking UX but I don't think the impact of this perf issue will be sufficient to necessitate uplift of 1667237.

Going to mark this fixed, but keep the needinfo for Bas for comment #23.

Status: ASSIGNED → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED
Target Milestone: --- → 85 Branch