Closed Bug 1855274 Opened 2 years ago Closed 2 years ago

1.95 - 1.66% tsvgx / tsvgx (Linux) regression on Wed September 13 2023

Categories

(Toolkit :: Places, defect)

Tracking

RESOLVED WONTFIX
Tracking Status
firefox-esr115 --- unaffected
firefox118 --- unaffected
firefox119 --- wontfix
firefox120 --- wontfix

People

(Reporter: aglavic, Unassigned)

References

(Regression)

Details

Perfherder has detected a talos performance regression from push 3020232880d7d6afcce9b96e853dd23a46af5a25. As author of one of the patches included in that push, we need your help to address this regression.

Regressions:

Ratio Test Platform Options Absolute values (old vs new)
2% tsvgx linux1804-64-shippable-qr e10s fission stylo webrender-sw 268.62 -> 273.87
2% tsvgx linux1804-64-shippable-qr e10s fission stylo webrender-sw 266.20 -> 270.63

Details of the alert can be found in the alert summary, including links to graphs and comparisons for each of the affected tests. Please follow our guide to handling regression bugs and let us know your plans within 3 business days, or the offending patch(es) may be backed out in accordance with our regression policy.

If you need the profiling jobs you can trigger them yourself from treeherder job view or ask a sheriff to do that for you.

For more information on performance sheriffing please see our FAQ.

Flags: needinfo?(mak)

I apologize for this very late update on the regression; it is my fault, I incorrectly classified the culprit commit.
You can disregard "let us know your plans within 3 business days, or the offending patch(es) may be backed out" given that we gave very late notice of this being an issue; let us know when you can.

Set release status flags based on info from the regressing bug 1846781

So, I suppose the test is visiting pages, and those visits end up triggering recalculations in history that, being asynchronous tasks, may run during the measurement.

Maybe history should be disabled when running these tests? I don't know much about tsvgx.

Flags: needinfo?(mak) → needinfo?(dholbert)

(In reply to Marco Bonardo [:mak] from comment #3)

So, I suppose the test is visiting pages, and those visits end up triggering recalculations in history that, being asynchronous tasks, may run during the measurement.

They don't visit many pages; just the tests themselves (the various .xml files in https://searchfox.org/mozilla-central/source/testing/talos/talos/tests/svgx which are really SVG files). For each file there, we use "ASAP mode" (unlimited frame rate) to see how many frames we can render, and we use that as a measurement of rendering performance.

I imagine this async history recalculation might be stealing some CPU cycles away that we were previously using to squeeze out a few more frames of rendering. So it manifests as a performance regression, since we're not rendering as many frames.
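
(For intuition, the measurement is conceptually just counting how many frames we can paint in a fixed window; here's a minimal sketch using requestAnimationFrame — not the actual talos harness code, just an illustration of what ASAP mode lets us measure once the frame-rate cap is removed:)

function countFrames(durationMs) {
  return new Promise(resolve => {
    let frames = 0;
    const start = performance.now();
    const tick = now => {
      frames++;
      if (now - start < durationMs) {
        requestAnimationFrame(tick);
      } else {
        resolve(frames);
      }
    };
    requestAnimationFrame(tick);
  });
}

// e.g. log how many frames we managed to paint in one second of the SVG test:
countFrames(1000).then(frames => console.log("frames:", frames));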

Flags: needinfo?(dholbert)

I think it'd be fine to disable history for tsvgx, but it's also maybe notable that this manifested as a performance regression for async history-related bookkeeping on a testsuite that's not really touching history much at all. (Though, maybe we're doing the same amount of history-bookkeeping-work as before and it just gets scheduled sooner?)

In case it's useful, I clicked the button on treeherder to generate some profiler runs, for the base and new pushes. Maybe those might reveal something interesting. But also, probably fine to just turn off history to reduce background noise and focus on performance of the code that we're interested in here.

If we do opt to turn off history, what's the appropriate pref/etc to do that?

(Alternately, is there a more-targeted pref that we could use to just turn off this async task, or force it to be executed synchronously, or scheduled differently [either immediately or far in the future]?)

Flags: needinfo?(mak)

(In reply to Daniel Holbert [:dholbert] from comment #5)

In case it's useful, I clicked the button on treeherder to generate some profiler runs, for the base and new pushes. Maybe those might reveal something interesting. But also, probably fine to just turn off history to reduce background noise and focus on performance of the code that we're interested in here.

(following up: I clicked around in some of the profiles, focusing on subtests hixie-005.xml and hixie-006.xml which perfherder shows as being most-affected, and I'm not seeing any samples that mention Places or History (no results when I type those into the filter bar in the "call tree" view, focusing the 127.0.0.1 track and the Parent Process track). But it's also not guaranteed that this regression would be caught in a sample there, because our profiles only have 30 to 40 samples total for each thread here, for these subtests. So a hypothetical few-percent-regression wouldn't necessarily be big enough to be detected with a sample, unless we get lucky.)

(In reply to Daniel Holbert [:dholbert] from comment #7)

(following up: I clicked around in some of the profiles, focusing on subtests hixie-005.xml and hixie-006.xml which perfherder shows as being most-affected, and I'm not seeing any samples that mention Places or History

I looked through some of the "after" runs, the ones with larger times. In hixie-004, from the Markers table I found:
Runnable - places::NotifyRankingChanged - priority: Normal (4)
Runnable - places::NotifyManyVisitsObservers - priority: Normal (4)

What I suspect is that because add-visit queries now don't recalculate origins frecency immediately through triggers (but delay it to a later idle), these runnables may happen a little bit earlier and steal CPU cycles from the test.

Another interesting finding is this hixie-005 profile with the same runnables plus
NotifyObservers - places-metadata-updated
This is a sign of "interactions" being stored, though I think nothing changed from this point of view. I must investigate whether interactions will respect the history-disabled prefs I'll point out later; I'll file a bug.

My overall hypothesis is a time shift of work that would normally have happened a bit later.
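
(To make the time shift concrete — this is a hypothetical sketch, with made-up function names and ChromeUtils.idleDispatch standing in for whatever scheduling the Places code actually uses:)

function onVisitAdded(visit) {
  recordVisit(visit); // the cheap, immediate part of storing the visit

  // Before the change: the origin frecency recalculation ran right here,
  // via SQL triggers, so it was over before the test resumed rendering.
  // After the change: it's deferred to an idle callback...
  ChromeUtils.idleDispatch(() => recalculateOriginFrecency(visit.origin));
  // ...and in an ASAP-mode test that keeps the CPU saturated, that "idle"
  // slot can end up landing inside the measurement window.
}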

(In reply to Daniel Holbert [:dholbert] from comment #6)

If we do opt to turn off history, what's the appropriate pref/etc to do that?

(Alternately, is there a more-targeted pref that we could use to just turn off this async task, or force it to be executed synchronously, or scheduled differently [either immediately or far in the future]?)

Setting places.history.enabled to false would disable history completely.
Or, before the load, one can set browser.setAttribute("disableglobalhistory", "true"); or browsingContext.useGlobalHistory = false;
There's no pref to disable the async recalculation, as it's necessary for the correct functionality of history and the address bar.
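
(Roughly, from chrome-privileged setup code it would look like the sketch below; the pref and attribute names are the ones above, while the surrounding variables are just placeholders:)

// Disable history globally for the whole run:
Services.prefs.setBoolPref("places.history.enabled", false);

// Or per-browser, before the load, where `browser` is the <xul:browser>
// element the test drives:
browser.setAttribute("disableglobalhistory", "true");
// or equivalently on its browsing context:
browser.browsingContext.useGlobalHistory = false;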

So we could either disable global history (and I'm filing a bug to investigate interactions in that case) to clean up the measurement a bit, or just declare this a false positive and establish a new baseline.
I don't see any evidence of us doing more expensive work.

Flags: needinfo?(mak) → needinfo?(dholbert)

(In reply to Marco Bonardo [:mak] from comment #8)

I looked through some of the "after" runs [...] I found:
[...]
What I suspect is that because add-visit queries now don't recalculate origins frecency immediately through triggers (but delay it to a later idle), these runnables may happen a little bit earlier and steal CPU cycles from the test.

Awesome, thanks for looking -- I'm glad you were able to find those to validate our theories about the explanation here, and I'm glad that they don't look particularly expensive (except maybe from the perspective of an "ASAP-mode" CPU-hogging perf-test).

So we could either disable global history [...] to clean up the measurement a bit

I'm going with that -- otherwise this will be a source of noise in this performance test, depending on whether or not the runnable gets scheduled during the test vs. later. I've posted a patch in bug 1855852 to turn off history for tsvgx. Let's get that landed and then see if the graphs improve (presumably so).

I don't see any evidence of us doing more expensive work

Great. If that changes, let's track that in a followup, but otherwise I think bug 1855852 is all we need here. (And we may want to extend that to other "ASAP-mode" talos tests to reduce possible noise in them as well, but I'm just starting with tsvgx for now to validate that it's useful.)

Depends on: 1855852
Flags: needinfo?(dholbert)

Hmm, at first glance it doesn't look like the graph has moved much since bug 1855852 disabled history. And also, it does look like this is a very real regression in the measurements here.

Here's the graph for the last 60 days:
https://treeherder.mozilla.org/perfherder/graphs?highlightAlerts=1&highlightChangelogData=1&highlightCommonAlerts=0&series=autoland,4767645,1,1&timerange=5184000

Before September 13th, we frequently had samples in the range of 261 (you can easily see these, since their dots overlap the horizontal gray line for y=260). After September 13th, we haven't had any such samples; the lowest seems to be ~266 (excluding a few extreme outliers).

I think the lack-of-an-improvement from landing bug 1855852 indicates that history probably isn't involved, unless there's some additional chunk of work that we haven't yet identified that remains even with the history pref disabled.

I wonder if one of the other nearby commits was the real culprit here?

This seems to be the "guilty region" of the graph:
https://treeherder.mozilla.org/perfherder/graphs?highlightAlerts=1&highlightChangelogData=1&highlightCommonAlerts=0&series=autoland,4767645,1,1&timerange=7776000&zoom=1694602984891,1694625075077,256.983434047769,286.9858440372385

It looks like commit a34e138c35adb is tentatively "good" and d1bd96b0987ce is tentatively "bad". That yields this pushlog:
https://hg.mozilla.org/integration/autoland/pushloghtml?fromchange=a3a13413c12ce4f41369dc4206b798e444e2a4df&tochange=d1bd96b0987ce25496ef588de9c8f54bb72027bd

I've spammed some retriggers in there to get more data about where we changed from good to bad; I think those will appear as dots on the graph that I linked above (if that graph URL is a permalink in terms of time-range, which I think it is).

Further retriggers/tests seem to indicate that this was indeed a regression from the places change (bug 1846781) here, but we incur the same regression if I simply disable history entirely, as a patch layered on the base commit.

So my working theory is that there's some fiddly runnable (or some batch of work) that, before mak's patch, we tended to schedule at a "better" time (not during the talos test's measurements), but that now tends to be scheduled at the same time as our talos tests (more often than before, at least). And this is a task that's not controlled by places.history.enabled, though perhaps it should be.

One clue: in a local profile of talos tsvgx with history disabled, I see the parent process spending some time running JS script with this backtrace, while our content process is partway through the talos test:

#updateDatabase/< [resource:///modules/Interactions.sys.mjs:730:12]
executeBeforeShutdown [resource://gre/modules/Sqlite.sys.mjs:509:23]
executeBeforeShutdown [resource://gre/modules/Sqlite.sys.mjs:1739:23]
withConnectionWrapper [resource://gre/modules/PlacesUtils.sys.mjs:1462:29]
InterpretGeneratorResume [self-hosted:1469:33]

https://share.firefox.dev/3tnVwL1

That PlacesUtils/Sqlite "updateDatabase" task seems like something we shouldn't need to do if history is disabled. Mak, do you know what might be going on there, and is that a task we can opt out of?

(Not sure to-what-extent that explains the performance difference here, but I'm looking for all options to reduce places-related noise during talos here, to understand what's going on.)

Flags: needinfo?(mak)

Hmm, looks like browser.places.interactions.enabled might be the way to get rid of the JS noise that I'm seeing in my profile from comment 13. I'll try that and see if it improves anything here.
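
(The local experiment is just flipping that pref before the run — same mechanism as the places.history.enabled pref mak mentioned above; how talos itself injects prefs is a detail I'm glossing over here:)

// Check the current value, then turn off Places "interactions" bookkeeping:
console.log(Services.prefs.getBoolPref("browser.places.interactions.enabled"));
Services.prefs.setBoolPref("browser.places.interactions.enabled", false);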

Looks like mak landed a patch that has the same effect as toggling that pref off, here:
https://hg.mozilla.org/integration/autoland/rev/13ee6ecc1f427ec1b553b9542c298d34643d26aa

I generated a bunch of tsvgx talos runs before/after that commit, and that data confirms that it seems to have helped:
https://treeherder.mozilla.org/perfherder/compare?originalProject=autoland&newProject=autoland&newRevision=99a4c5ae7bcd8ac307ad9a448b8b29d1f6c7809e&framework=1&originalRevision=350b696788a174ceeba0e05ac3ef2be7ddd7c5af&page=1&filter=tsvgx

With the current data (39 runs before, 41 runs after), it shows a 1.39% improvement in tsvgx; and the subtest view shows a ~3% improvement in 5.xml and 6.xml, the ones that were most-affected (possibly just the ones that are at risk of overlapping a lazily-scheduled task?), per comment 7.

(The values might change slightly as more data accumulates, since I'm going to trigger 10-20 more runs to smooth out the impact of a few outlier data-points. But I don't suspect they'll change much.)

We can watch the charts for a few more days, but I think it's likely we can consider this fixed. (To the extent that there's any remaining regression here associated with this places change, it's small enough not to worry about, and pretty-clearly not actually something that's impacting the paint speed that this test is generally trying to measure.)

Good job with the additional investigation! I agree with your theory, and indeed that bug fix should help here. I also have another patch in Bug 1856903 that should improve startup time (or rather, restore it to what it was, since it regressed in 115). I don't think it affects these tests directly, but it may shift timings.

Flags: needinfo?(mak)

:mak, is any further action required here? Wondering if there is anything else expected, or whether you will resolve this.
Bug 1856903 was uplifted to Fx119.

Flags: needinfo?(mak)

With bug 1855645, Bug 1856903 and Bug 1855852 fixed, I don't think there's anything else we can do here; we didn't measure any meaningful additional work.

AFAIK this is still open because Daniel wanted to monitor the results for a few days.
From the graphs I see a tiny increase still persisting, but considering that there has been a timing shift and that some of the jobs are marked as "affected by infra changes", I doubt we can figure out more.

Flags: needinfo?(mak)

Yeah, there's still[1] a regression that persists here, but I'm ~OK considering it a "new baseline" rather than an actual regression in SVG rendering performance, due to the nature of the regressing change (which was related to incidental/unrelated history/bookmarks-DB background work).

If we can, it would be great to understand what the persisting "regression" is coming from, since it's likely some background non-rendering-related "noise" that would be good to eliminate here. But that's not a showstopper/blocker in any sense.

[1]
The graphs do still show a regression here, for the suite as a whole and also specifically for 5.xml and 6.xml:
https://treeherder.mozilla.org/perfherder/graphs?highlightedRevisions=b38476822c61&highlightedRevisions=3020232880d7&series=autoland%2Cdd9a390b32a8214a896e9e39022a1cdf9d95ebd4%2C1%2C1&timerange=5184000
https://treeherder.mozilla.org/perfherder/graphs?highlightedRevisions=b38476822c61&highlightedRevisions=3020232880d7&series=autoland%2C954a85f545e7fcef7efe5f682284614f4b69479c%2C1%2C1&timerange=5184000

Having said that, there's also a more-recent improvement in 7.xml shown in the graph here, around the end of September, possibly due to bug 1855852:
https://treeherder.mozilla.org/perfherder/graphs?highlightedRevisions=b38476822c61&highlightedRevisions=3020232880d7&series=autoland%2Ce1d0bf4add90e21c5b5cbcb382d0d5e7e50cf630%2C1%2C1&timerange=5184000

(mak, feel free to wontfix, or leave-open-for-further-investigation, or whatever you feel is appropriate at this point given the above. I probably won't be spending more time on profiling/diagnosis here, personally.)

The severity field is not set for this bug.
:mak, could you have a look please?

For more information, please visit BugBot documentation.

Flags: needinfo?(mak)

I think it's fine to consider it WONTFIX and call it a new baseline, even though in practice we addressed all the issues we could reasonably find.

Thank you a lot for the help, Daniel.

Status: NEW → RESOLVED
Closed: 2 years ago
Flags: needinfo?(mak)
Resolution: --- → WONTFIX