<a class="header-button" href="https://bugzilla.mozilla.org/home" title="Go to home page"> Bugzilla

BugBot [:suhaib / :marco/ :calixte]

Assignee

Comment 1

•

4 years ago

I am looking into this, but it should be backed out if I don't come up with something in the next day or two (in my opinion; you can take it out earlier if it's causing problems.)

Comment 2

•

4 years ago

Set release status flags based on info from the regressing bug 1692308

status-firefox87: --- → unaffected

status-firefox88: --- → affected

status-firefox-esr78: --- → unaffected

Comment 3

•

4 years ago

I think we can go ahead with the backout here.

Flags: needinfo?(fstrugariu)

Reporter

Comment 4

•

4 years ago

:RyanVM yep we can back this out

Flags: needinfo?(fstrugariu)

Updated

•

4 years ago

tracking-firefox88: --- → +

tracking-firefox89: --- → +

Comment 5

•

4 years ago

backout

Bug 1692308 backed out from autoland and Beta (for 88.0b5).
https://hg.mozilla.org/integration/autoland/rev/67a0bd10d0f819b17feccfb5577c9bc213ddb265
https://hg.mozilla.org/releases/mozilla-beta/rev/c8f878c8ba1d

Assignee: nobody → sphink

Status: NEW → RESOLVED

Closed: 4 years ago

status-firefox88: affected → fixed

status-firefox89: affected → fixed

Flags: needinfo?(sphink)

Resolution: --- → FIXED

Emilio Cobos Álvarez (:emilio)

Reporter

Comment 6

•

4 years ago

== Change summary for alert #29550 (as of Fri, 02 Apr 2021 05:03:55 GMT) ==

Regressions:

Ratio	Suite	Test	Platform	Options	Absolute values (old vs new)
15%	booking	fcp	android-hw-g5-7-0-arm7-api-16-shippable	nocondprof warm webrender	494.58 -> 569.54
15%	booking	fnbpaint	android-hw-g5-7-0-arm7-api-16-shippable	nocondprof warm webrender	501.62 -> 575.92
15%	booking	fnbpaint	android-hw-g5-7-0-arm7-api-16-shippable	nocondprof warm webrender	502.02 -> 576.00
12%	booking	FirstVisualChange	android-hw-g5-7-0-arm7-api-16-shippable	nocondprof warm webrender	565.42 -> 634.00
11%	booking	dcf	android-hw-g5-7-0-arm7-api-16-shippable	nocondprof warm webrender	1,065.29 -> 1,177.96
7%	amazon	LastVisualChange	android-hw-g5-7-0-arm7-api-16-shippable	nocondprof warm webrender	1,847.10 -> 1,968.25
4%	reddit	loadtime	linux1804-64-shippable-qr	nocondprof warm webrender	2,036.19 -> 2,114.71

Improvements:

Ratio	Suite	Test	Platform	Options	Absolute values (old vs new)
22%	booking	PerceptualSpeedIndex	android-hw-g5-7-0-arm7-api-16-shippable	nocondprof warm webrender	1,013.88 -> 795.25

For up to date results, see: https://treeherder.mozilla.org/perfherder/alerts?id=29550

Comment 7

•

4 years ago

I'm confused, doesn't that point to mostly progressions? Was the imgur regression in comment 0 missclassified given we didn't see the relevant progression when reverting?

Flags: needinfo?(sphink)

Flags: needinfo?(ryanvm)

Flags: needinfo?(fstrugariu)

Assignee

Comment 8

•

4 years ago

I'm not sure if I can fully answer that, but:

comment 0 had 5 regressions and 4 improvements, but the magnitude of the regressions was much larger
this was meant to be a nearly pure refactoring patch, so just about any change is unexpected
when I manually pushed the osx imgur jobs, I consistently saw the regression (of about the same magnitude)
the regression is one of those annoying ones where the results are and have always been bimodal, but before the patch it was rare to be in the slow mode and after the patch it was rare to be in the fast mode :(

I'm still intending to look into this, since it's holding up some other changes (and not just mine). But I got a little sidetracked into making it easy to push marker-only profiling runs, which I thought would be helpful for diagnosing this. ;-)

Flags: needinfo?(sphink)

Assignee

Comment 9

•

4 years ago

I do agree that comment 6 implies that it might be better to leave it in at this point, rather than holding up stuff on my investigation. I was certainly not expecting those results.

Comment 10

•

4 years ago

Bebe, did we see any change in the numbers on Beta after this landed?

Flags: needinfo?(ryanvm)

Reporter

Comment 11

•

4 years ago

== Change summary for alert #29555 (as of Fri, 02 Apr 2021 12:54:49 GMT) ==

This is the beta alert:

Regressions:

Ratio	Suite	Test	Platform	Options	Absolute values (old vs new)
116%	imgur	FirstVisualChange	macosx1015-64-shippable-qr	nocondprof warm webrender	83.33 -> 180.00
13%	imgur	SpeedIndex	macosx1015-64-shippable-qr	nocondprof warm webrender	649.72 -> 734.42
11%	imgur	PerceptualSpeedIndex	macosx1015-64-shippable-qr	nocondprof warm webrender	753.92 -> 839.25

Improvements:

Ratio	Suite	Test	Platform	Options	Absolute values (old vs new)
53%	imgur	fcp	macosx1015-64-shippable-qr	nocondprof warm webrender	145.96 -> 68.17
30%	imgur		macosx1015-64-shippable-qr	nocondprof warm webrender	234.07 -> 164.78

For up to date results, see: https://treeherder.mozilla.org/perfherder/alerts?id=29555

Flags: needinfo?(fstrugariu) → needinfo?(ryanvm)

Comment 12

•

4 years ago

I'm so lost. On Nightly we only changed mostly Android numbers but on Beta it was all macOS? Not gonna lie, I'm having a really difficult time trusting the reliability of these test suites right now.

Flags: needinfo?(ryanvm)

https://treeherder.mozilla.org/perfherder/graphs?highlightAlerts=1&highlightChangelogData=0&highlightedRevisions=b6b786aa47c5fc504aa2a8cbc5f510091626e1ac&series=mozilla-beta,2955667,1,13&series=autoland,2891214,1,13&series=autoland,3341809,1,13&timerange=2592000&zoom=1615285015325,1617891945658,374.94444444444446,671.5555555555555

Reporter

Comment 13

•

4 years ago

Looking at the graph for booking on android we can see that the regression hit autoland and was backed out
on beta we can only fin the backout commit and not the original commit

on osx imgur we can see the data is there and we saw the backout regression on beta
https://treeherder.mozilla.org/perfherder/graphs?highlightAlerts=1&highlightChangelogData=1&series=autoland,3341232,1,13&series=autoland,3186474,1,13&series=mozilla-beta,3337152,1,13&series=mozilla-beta,2907877,1,13&timerange=2592000

As mac suffered a higher magnitude regression on beta should't we focus on that
Also is there any reason we did not seen the initial commit on beta?

Are there any beta flags that can hide this regression on andorid?

Flags: needinfo?(ryanvm)

Dave Hunt [:davehunt] [he/him] ⌚BST

Comment 14

•

4 years ago

(In reply to Florin Strugariu [:Bebe] (needinfo me) from comment #13)

As mac suffered a higher magnitude regression on beta should't we focus on that

I'm just trying to understand why we're seeing different effects on different platforms depending on which branch we're on. My concerns aren't about the priority of one platform vs. another as much as trying to understand the inconsistencies we're seeing.

Also is there any reason we did not seen the initial commit on beta?

The regressing bug landed on m-c during the Fx88 development cycle, so presumably it came to Beta on merge day when 88 migrated there.

Are there any beta flags that can hide this regression on andorid?

Not that I'm aware of, hence my confusion. Also, I don't know why we wouldn't have seen the macOS regressions on autoland either.

Flags: needinfo?(ryanvm)

Comment 15

•

4 years ago

(In reply to Ryan VanderMeulen [:RyanVM] from comment #14)

Not that I'm aware of, hence my confusion. Also, I don't know why we wouldn't have seen the macOS regressions on autoland either.

I suspect this is due to the changes in the macOS platform (bug 1694185).

I'm having a really difficult time trusting the reliability of these test suites right now.

Is this because we didn't see the change on mozilla-beta for Android? I agree this is worth investigating but I don't think it's concerning enough to question the reliability of our test suites. :bebe could you file a bug to investigate autoland Android changes reaching beta? We should determine if this is an isolated incident.

Flags: needinfo?(fstrugariu)

Paul Bone [:pbone]

Comment 16

•

4 years ago

I've spent some time looking at this.

Before: https://share.firefox.dev/2P2WUPk
After: https://share.firefox.dev/3edApQs

I hope that when I click the permalink button in the profiler I get a link that gives the view of what I'm zoomed in to.

The first non-blank paint is definitely slower in "After". The difference as far as I can tell is that there's extra JS code compiled and executed before the paint, I think that this code is normally executed after this paint. This is my first time diving into our task scheduler code, and the part that I don't understand is that both Before and After profiles create a LOW_PRIORITY_ONE_SHOT timer, the 2nd one does it indirectly through the IdleTaskRunner. There's something I'm missing. How could these timers that look the same (or maybe some other difference I haven't noticed) cause different tasks to be scheduled? (NI: Bas)

This is shuffling around of some tasks, so maybe it doesn't actually make an important difference to performance? Is it worth landing Bug 1692308 anyway? (NI: sfink).

Thanks.

Flags: needinfo?(sphink)

Flags: needinfo?(bas)

Assignee

Comment 17

•

4 years ago

I stared at this for a while today too. As pbone describes, the patch really looks like it ought to keep the timing identical. It creates the same timer as before, with the same priority.

Here are my sample profiles.

Before (unpatched): https://share.firefox.dev/3dCsdKt
After (patched): https://share.firefox.dev/3tD0tef

Those are zoomed into the beginning portion of the time, roughly aligned. They show enough time for the unpatched version to finish the first nonblank paint, but the patched version is still going.

Differences I noticed (validated with another pair of profiles to remove some of the chance artifacts:

Both have a common section where they're doing roughly the same thing in terms of flushing/reflowing/painting, until the unpatched version does a reflow followed by SetNeedStyleFlush in "Task nsHtml5SVGLoadDispatcher". Then the next refresh tick does the paint that finishes the FirstNonBlankPaint. The patched version also does an SVGLoadDispatcher SetNeedsStyleFlush, but much later, well after the patched version is already done. Both do other SetNeedStyleFlushes.
Shortly before the above, the unpatched (fast) version gets "SVGLoad" and "overflow" DOMEvents, associated with the same URL. There are 3 SVGLoad runnables that fire at this time. None of these show up within the same time window in the patched version.
The unpatched (fast) profile fires a RunWillPaintObservers runnable right around the time of the FirstNonBlankPaint
The patched (slow) version compiles https://s.imgur.com/min/gallery.js?1591637802 offthread, runs it, and finishes much sooner than the unpatched.
The patched version does several more mainthread inline script compiles+runs

I don't know what all this means. I'm kind of out of my depth here. If I were to make up a story, it would be that the patched version is too busy running some scripts to bother loading the SVG, and that's the first thing drawn in the unpatched version (to make it a NonBlankPaint)?

I doubt that in the end this matters much for performance, so it would probably be best to land and take the (mysterious) hits on osx and Android. I would still like to understand what's going on, given that it doesn't change the set of timers as far as I know. I'm going to hope that Bas's more experienced eye can see what's going on. The difference is quite reliable, even if it is a result of chance reshuffling.

If there is some difference in scheduling decisions, is there a good place I could instrument the scheduling code with markers? Like, "we had 3 tasks to choose from and picked X".

Paul Bone [:pbone]

Comment 18

•

4 years ago

mccr8 suggested we also NI smaug as he's done some work with scheduling too.

Flags: needinfo?(bugs)

Olli Pettay [:smaug][bugs@pettay.fi]

Assignee

Comment 19

•

4 years ago

(In reply to Ryan VanderMeulen [:RyanVM] from comment #14)

(In reply to Florin Strugariu [:Bebe] (needinfo me) from comment #13)

Are there any beta flags that can hide this regression on andorid?

Not that I'm aware of, hence my confusion. Also, I don't know why we wouldn't have seen the macOS regressions on autoland either.

Note that there are some (different) Android improvements from the backout on Beta, on the order of 5%. See the reddit & espn results at https://treeherder.mozilla.org/perfherder/alerts?id=29555

Flags: needinfo?(sphink)

Comment 20

•

4 years ago

I commented in https://phabricator.services.mozilla.com/D104881 and matrix. The regressing patch does change behavior at least in one case. It might lead to running full GC more often.

Flags: needinfo?(bugs)