ts_paint_webext regression due to a performance cliff (tripped in the second instance by separating out opensearch initialization)
Categories: Core :: Performance, defect, P3
Performance Impact: low

Tracking | Status
---|---
firefox-esr68 | unaffected
firefox-esr78 | unaffected
firefox78 | unaffected
firefox79 | unaffected
firefox80 | wontfix
firefox81 | fix-optional

People: Reporter: Gijs, Unassigned
References: Regression
Keywords: perf:startup, regression
Bear with me for the following story.
Bug 1648639 moved some code and caused us to load PanelMultiView.jsm sooner than we used to. This tripped a regression of ts_paint_webext on Windows (non-qr), which was filed as bug 1650404.
A patch that fixed that regression (by removing the code that required this import) landed a bit over a week after the regressing patch. However, it didn't move the regressed ts_paint_webext needle.
I verified via trypushes that the patch worked (i.e. it did move the needle) against the cset where the regression occurred, on try.
I then bisected the range where the patch worked and where it actually landed (and didn't work). Most of that history is in bug 1650404.
The last two trypushes I did (not listed in that bug) were:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=6f5fd43c2e942fdfc56ccec383ed166df8eb1e45&selectedTaskRun=fYHsoM8PQj-k46bkLA-Fhw.0
(ts_paint_webext numbers are around the ~295ms mark)
and
https://treeherder.mozilla.org/#/jobs?repo=try&revision=63678c5268cc8912a2f6faf77d2ca555ae9b7c68&selectedTaskRun=Io-QT3NKQgCpapRVM2QEVA.0
(ts_paint_webext numbers are around the ~307ms mark)
They both have the patch from bug 1650404 and are based on https://hg.mozilla.org/integration/autoland/rev/bb6e164924381be127634741162f98542365c3ab and https://hg.mozilla.org/integration/autoland/rev/fa7ca9bd0ba95ad797939fed1819e045499e103f , respectively, which leaves the pushlog https://hg.mozilla.org/integration/autoland/pushloghtml?fromchange=bb6e164924381be127634741162f98542365c3ab&tochange=fa7ca9bd0ba95ad797939fed1819e045499e103f , which points to bug 1649426 .
Clearly, there is some kind of performance cliff that first the patch in bug 1648639 and then the patch in bug 1649426 both tripped. Unfortunately, at least in bug 1650404, we couldn't figure out anything useful from profiles, so I'm honestly not sure what kind of perf cliff it is. Guesses include it somehow being paint cycles, though the difference is around 10ms rather than 16ms (though I could just about believe that being noise), or something GC-based - but those are just that: guesses.
I do know that if you look at the graph, these numbers have long been around the 307ms mark, and then briefly dipped below it as a result of the patch in bug 1642465 (visible in the graph around June 28 - June 30th). Unfortunately, regression alerts get filed with a delay, and then there's another delay while we have to cook up a fix for the regression, by which time other patches have also landed.
So I'm not sure what to do at this point. We can try and chase this particular regressor, but I half expect that if I push-to-try (as I soon shall) from current m-c with the patches for bug 1649426 and bug 1637744 reverted, it still won't touch the numbers - as bug 1649426 landed 8 days ago, and so the chances are pretty good that something else has regressed the same numbers since.
It also appears to me that if there is a perf cliff, the 10ms or so difference is possibly correlated with the machine or other factors, i.e. the regressing bugs are getting "blamed" for a 3-4% regression, but it's not like they actually added a full 10ms of work to anything, and we don't know if the 10ms is an accurate representation everywhere, or just on our non-qr Windows talos machines...
On the other hand, it would be nice to at least understand the perf cliff better: if we understood what it was, perhaps we could come up with a more reliable fix than playing whack-a-mole with all the patches tripping over said cliff.
Florian, I don't suppose you have any idea how we could meaningfully proceed here? Could we somehow run the job and not collect profiles but only collect markers, and see if that gets us a better idea of where the 10ms gap is?
Comment 1 (Reporter) • 4 years ago
I swear I set a needinfo for comment #0, but clearly it didn't work...
Comment 2 (Reporter) • 4 years ago
baseline: https://treeherder.mozilla.org/#/jobs?repo=try&revision=184539ac4d4c93cfff64f0993a1fe269dc04240e
bunch of backouts (there's been a lot of changes to the search service recently :-( ): https://treeherder.mozilla.org/#/jobs?repo=try&revision=28a7c01c82d9e7ee1011764f24edb52b198a7877
Comment 3 (Reporter) • 4 years ago
(In reply to :Gijs (he/him) from comment #2)
> baseline: https://treeherder.mozilla.org/#/jobs?repo=try&revision=184539ac4d4c93cfff64f0993a1fe269dc04240e
> bunch of backouts (there's been a lot of changes to the search service recently :-( ): https://treeherder.mozilla.org/#/jobs?repo=try&revision=28a7c01c82d9e7ee1011764f24edb52b198a7877
As I suspected, these backouts were insufficient to bring numbers back down to what they used to be, so in addition to bug 1649426, there must be one or more other bugs that would have caused the same regression.
Comment 4 • 4 years ago
Set release status flags based on info from the regressing bug 1649426
Comment 5 • 4 years ago
(In reply to :Gijs (he/him) from comment #0)
> Florian, I don't suppose you have any idea how we could meaningfully proceed here? Could we somehow run the job and not collect profiles but only collect markers, and see if that gets us a better idea of where the 10ms gap is?
At https://searchfox.org/mozilla-central/rev/af5cff556a9482d7aa2267a82f2ccfaf18e797e9/testing/talos/talos/gecko_profile.py#85-91 you could add the MOZ_PROFILER_STARTUP_FEATURES="stackwalk,js,nostacksampling" environment variable to collect startup profiles without samples (i.e. markers only).
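A minimal sketch of what that tweak could look like. The two environment variable names and the feature list are from the comment above; the `env` dict standing in for the browser environment that talos builds up is an assumption for illustration, not the actual code in gecko_profile.py:

```python
# Sketch: enable startup profiling in markers-only mode.
# "nostacksampling" keeps the profiler's marker recording but disables
# stack sampling, which should reduce profiler overhead.
env = {}  # hypothetical stand-in for the browser environment talos assembles
env["MOZ_PROFILER_STARTUP"] = "1"
env["MOZ_PROFILER_STARTUP_FEATURES"] = "stackwalk,js,nostacksampling"

# The feature string is comma-separated; confirm sampling is disabled.
features = env["MOZ_PROFILER_STARTUP_FEATURES"].split(",")
print("nostacksampling" in features)  # True
```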
10ms is so small that it could be just a difference in the order of pieces of code running asynchronously, vs the refresh driver tick.
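For scale (my arithmetic, not from the comments above): one refresh-driver tick at a typical 60 Hz display rate is roughly 16.7ms, so a paint that just misses a tick slips by a whole tick, which is why timing noise can show up in ~16ms increments rather than the observed ~10ms:

```python
# One refresh-driver tick at a 60 Hz display rate.
tick_ms = 1000 / 60
print(round(tick_ms, 1))  # 16.7
```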
Comment 6 • 4 years ago
Gijs, Florian, what is the next step on this bug? Thanks
Comment 7 (Reporter) • 4 years ago
(In reply to Pascal Chevrel:pascalc from comment #6)
> Gijs, Florian, what is the next step on this bug? Thanks
I don't know. This regression is not caused by someone adding a specific bit of code that in and of itself takes 10-15ms to run, and even if we fixed this particular instance of tripping the perf cliff it wouldn't help, as there is at least one (and there are likely many) other such change (cf. comment 3). It's also likely that this change on the infra where we run this test doesn't reflect tripping the perf cliff for all our users, as people are likely on different sides of this perf cliff depending on the hardware and environment in/on which Firefox runs. So I think we don't have a hope of getting the perf metric back to the "low" value by chasing the individual regressors, and from that perspective closing this is OK.
However... I think it would be valuable if we could understand what trips things into the higher and lower values. Perhaps the suggestion from Florian in comment #5 could help with this. If we understood this better, we could make the performance test more stable/reliable, and potentially reap bigger performance benefits (e.g. if we find out that right now showing about:home in the webext situation is blocked on webext (or other) things that shouldn't block it).
I started
https://treeherder.mozilla.org/#/jobs?repo=try&revision=6f67dd5ba99a123652ad30ba1054b1dc9416cd4b (base)
https://treeherder.mozilla.org/#/jobs?repo=try&revision=4a2c2f1fd92dec9029ecf7786f6edf14a04caab4 ("regression")
Assuming the non-profile jobs still bear out the difference, I'll trigger a profiling job on these; hopefully, unlike the "full" profiles, these will continue to show the difference. If not... well, I'm starting to run out of ideas.
Comment 8 (Reporter) • 4 years ago
So the profiler in markers-only mode still has a significant impact (around 30-40ms) which continues to make it hard to see what the deal is. :-(
Florian, any other ideas left?
Comment 9 • 4 years ago
Unfortunately, I don't have any other ideas.
Comment 10 • 4 years ago
The severity field is not set for this bug.
:daleharvey, could you have a look please?
For more information, please visit auto_nag documentation.
Comment 11 • 4 years ago
Marking as fxperf:p2. Going to bring this up in group triage, since it seems like this could benefit from a group discussion and a plan for next steps.
Comment 12 • 4 years ago
Given that the cause here doesn't seem to be the search code but various combinations of code triggering something, I'm going to move this across to general.
Comment 13 • 4 years ago
The Bugbug bot thinks this bug should belong to the 'Core::JavaScript: GC' component, and is moving the bug to that component. Please revert this change in case you think the bot is wrong.