Open Bug 1652273 · Opened 4 years ago · Updated 3 years ago

ts_paint_webext regression due to a performance cliff (tripped in the second instance by separating out opensearch initialization)

Categories

(Core :: Performance, defect, P3)

Desktop
Windows 10
defect

Tracking


Performance Impact low
Tracking Status
firefox-esr68 --- unaffected
firefox-esr78 --- unaffected
firefox78 --- unaffected
firefox79 --- unaffected
firefox80 --- wontfix
firefox81 --- fix-optional

People

(Reporter: Gijs, Unassigned)

References

(Regression)

Details

(Keywords: perf:startup, regression)

Bear with me for the following story.

bug 1648639 moved some code and caused us to load PanelMultiView.jsm sooner than we used to. This tripped a regression of ts_paint_webext on Windows (non-qr). This was filed as bug 1650404.

A patch that fixed that regression (by removing the code that required this import) landed a bit over a week after the regressing patch. However, it didn't move the regressed ts_paint_webext needle.

I verified via trypushes that the patch worked (i.e. did move the needle) when applied against the cset where the regression occurred.

I then bisected the range between the cset where the patch worked and the point where it actually landed (and didn't work). Most of that history is in bug 1650404.

The last two trypushes I did (not listed in that bug) were:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=6f5fd43c2e942fdfc56ccec383ed166df8eb1e45&selectedTaskRun=fYHsoM8PQj-k46bkLA-Fhw.0
(ts_paint_webext numbers are around the ~295ms mark)

and

https://treeherder.mozilla.org/#/jobs?repo=try&revision=63678c5268cc8912a2f6faf77d2ca555ae9b7c68&selectedTaskRun=Io-QT3NKQgCpapRVM2QEVA.0
(ts_paint_webext numbers are around the ~307ms mark)

They both have the patch from bug 1650404 and are based on https://hg.mozilla.org/integration/autoland/rev/bb6e164924381be127634741162f98542365c3ab and https://hg.mozilla.org/integration/autoland/rev/fa7ca9bd0ba95ad797939fed1819e045499e103f, respectively. That leaves the pushlog https://hg.mozilla.org/integration/autoland/pushloghtml?fromchange=bb6e164924381be127634741162f98542365c3ab&tochange=fa7ca9bd0ba95ad797939fed1819e045499e103f, which points to bug 1649426.

Clearly, there is some kind of performance cliff that first the patch in bug 1648639 and then the patch in bug 1649426 both tripped. Unfortunately, at least in bug 1650404, we couldn't figure out anything useful from profiles, so I'm honestly not sure what kind of perf cliff it is. Guesses include it somehow being paint cycles, though the difference is around 10ms rather than the ~16.7ms of a 60Hz frame (though I could just about believe that being noise), or something GC-based, but those are just that: guesses.

I do know that if you look at the graph, these numbers have long been around the 307ms mark, and then briefly dipped below it as a result of the patch in bug 1642465 (visible in the graph around June 28 - June 30th). Unfortunately, regression alerts get filed with a delay, and then there's another delay while we have to cook up a fix for the regression, by which time other patches have also landed.

So I'm not sure what to do at this point. We can try to chase this particular regressor, but I half expect that if I push to try (as I soon shall) from current m-c with the patches for bug 1649426 and bug 1637744 reverted, it still won't touch the numbers: bug 1649426 landed 8 days ago, so the chances are pretty good that something else has regressed the same numbers since.

It also appears to me that if there is a perf cliff, the 10ms or so difference is possibly correlated with the machine or other factors. That is, the regressing bugs are getting "blamed" for a 3-4% regression, but it's not like they actually added a full 10ms of work to anything, and we don't know whether the 10ms is an accurate representation everywhere, or just on our non-qr Windows talos machines...

On the other hand, it would be nice to at least understand the perf cliff better: if we understood what it was, perhaps we could come up with a more reliable fix than playing whack-a-mole with every patch that trips over said cliff.

Florian, I don't suppose you have any idea how we could meaningfully proceed here? Could we somehow run the job and not collect profiles but only collect markers, and see if that gets us a better idea of where the 10ms gap is?

I swear I set a needinfo for comment #0, but clearly it didn't work...

Flags: needinfo?(florian)

(In reply to :Gijs (he/him) from comment #2)

> baseline: https://treeherder.mozilla.org/#/jobs?repo=try&revision=184539ac4d4c93cfff64f0993a1fe269dc04240e

> bunch of backouts (there have been a lot of changes to the search service recently :-( ): https://treeherder.mozilla.org/#/jobs?repo=try&revision=28a7c01c82d9e7ee1011764f24edb52b198a7877

As I suspected, these backouts were insufficient to bring numbers back down to what they used to be, so in addition to bug 1649426, there must be one or more other bugs that would have caused the same regression.

Set release status flags based on info from the regressing bug 1649426

(In reply to :Gijs (he/him) from comment #0)

> Florian, I don't suppose you have any idea how we could meaningfully proceed here? Could we somehow run the job and not collect profiles but only collect markers, and see if that gets us a better idea of where the 10ms gap is?

At https://searchfox.org/mozilla-central/rev/af5cff556a9482d7aa2267a82f2ccfaf18e797e9/testing/talos/talos/gecko_profile.py#85-91 you could add the MOZ_PROFILER_STARTUP_FEATURES="stackwalk,js,nostacksampling" environment variable to collect startup profiles without samples (i.e. markers only).
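For concreteness, a minimal sketch of the kind of change being suggested; the method name and the pre-existing line shown here are assumptions for illustration, not the exact code at that revision:

    # Sketch only: assumes gecko_profile.py sets the startup-profiler
    # environment variables in a helper roughly like this one.
    def update_env(self, env):
        # Start the Gecko profiler as soon as the browser launches.
        env["MOZ_PROFILER_STARTUP"] = "1"
        # Added line: "nostacksampling" makes the startup profile record
        # markers only (no stack samples), keeping profiler overhead low.
        env["MOZ_PROFILER_STARTUP_FEATURES"] = "stackwalk,js,nostacksampling"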

10ms is so small that it could be just a difference in the ordering of pieces of code running asynchronously relative to the refresh driver tick.

Flags: needinfo?(florian)

Gijs, Florian, what is the next step on this bug? Thanks

Flags: needinfo?(gijskruitbosch+bugs)
Whiteboard: [fxperf]

(In reply to Pascal Chevrel:pascalc from comment #6)

> Gijs, Florian, what is the next step on this bug? Thanks

I don't know. This regression isn't caused by someone adding a specific bit of code that in and of itself takes 10-15ms to run, and even if we fixed this particular instance of tripping the perf cliff it wouldn't help, as there is at least one other such change (and likely many more; cf. comment 3). It's also likely that this change on the infra where we run this test doesn't reflect tripping the perf cliff for all our users, as people will be on different sides of the cliff depending on the hardware and environment in/on which Firefox runs. So I think we don't have a hope of getting the perf metric back to the "low" value by chasing the individual regressors, and from that perspective closing this is OK.

However... I think it would be valuable if we could understand what trips things into the higher and lower values. Perhaps the suggestion from Florian in comment #5 could help with this. If we understood this better, we could make the performance test more stable/reliable, and potentially reap bigger performance benefits (e.g. if we find out that right now showing about:home in the webext situation is blocked on webext (or other) things that shouldn't block it).

I started

https://treeherder.mozilla.org/#/jobs?repo=try&revision=6f67dd5ba99a123652ad30ba1054b1dc9416cd4b (base)
https://treeherder.mozilla.org/#/jobs?repo=try&revision=4a2c2f1fd92dec9029ecf7786f6edf14a04caab4 ("regression")

Assuming the non-profile jobs still show the difference, I'll trigger a profiling job on each of these and hope that, unlike the "full" profiles, the markers-only profiles continue to show it. If not... well, I'm starting to run out of ideas.

So the profiler in markers-only mode still has a significant impact (around 30-40ms), which continues to make it hard to see what is going on. :-(

Florian, any other ideas left?

Flags: needinfo?(gijskruitbosch+bugs)

Unfortunately, I don't have any other ideas.

The severity field is not set for this bug.
:daleharvey, could you have a look please?

For more information, please visit auto_nag documentation.

Flags: needinfo?(dharvey)
Severity: -- → S4
Flags: needinfo?(dharvey)
Priority: -- → P3

Marking as fxperf:p2. Going to bring this up in group triage, since it seems like this could benefit from a group discussion and a plan for next steps.

Whiteboard: [fxperf] → [fxperf:p2]

Given that the cause here doesn't seem to be the search code but various combinations of code triggering something, I'm going to move this across to general.

Component: Search → General

The Bugbug bot thinks this bug should belong to the 'Core::JavaScript: GC' component, and is moving the bug to that component. Please revert this change in case you think the bot is wrong.

Component: General → JavaScript: GC
Product: Firefox → Core
Component: JavaScript: GC → Performance
Has Regression Range: --- → yes
Performance Impact: --- → P3
Keywords: perf:startup
Whiteboard: [fxperf:p2]