Closed Bug 1769195 Opened 2 years ago Closed 2 years ago

38.74% reddit-billgates-ama.members ContentfulSpeedIndex (Linux) regression on Thu May 5 2022

Categories

(Core :: Graphics: ImageLib, defect)

Firefox 102
defect

Tracking

RESOLVED FIXED
102 Branch
Tracking Status
firefox-esr91 --- unaffected
firefox100 --- unaffected
firefox101 --- unaffected
firefox102 --- fixed

People

(Reporter: alexandrui, Assigned: tnikkel)

References

(Regression)

Details

(Keywords: perf, perf-alert, regression)

Attachments

(1 file)

Perfherder has detected a browsertime performance regression from push 846e7307e1a3894c84993f4d96d178de6917681e. Since you authored one of the patches included in that push, we need your help to address this regression.

Regressions:

Ratio | Test | Platform | Options | Absolute values (old vs new)
39% | reddit-billgates-ama.members ContentfulSpeedIndex | linux1804-64-shippable-qr | cold fission webrender | 299.73 -> 415.83

Improvements:

Ratio | Test | Platform | Options | Absolute values (old vs new)
81% | facebook-nav.marketplace LastVisualChange | macosx1015-64-shippable-qr | cold fission webrender | 6,238.33 -> 1,206.67
80% | facebook-nav.marketplace LastVisualChange | linux1804-64-shippable-qr | cold fission webrender | 6,321.67 -> 1,233.33
80% | facebook-nav.marketplace LastVisualChange | macosx1015-64-shippable-qr | cold fission webrender | 6,240.00 -> 1,230.00
7% | facebook-nav.marketplace ContentfulSpeedIndex | linux1804-64-shippable-qr | cold fission webrender | 1,125.75 -> 1,049.29
5% | facebook-nav.marketplace SpeedIndex | linux1804-64-shippable-qr | cold fission webrender | 1,140.27 -> 1,080.83

Details of the alert can be found in the alert summary, including links to graphs and comparisons for each of the affected tests. Please follow our guide to handling regression bugs and let us know your plans within 3 business days, or the offending patch(es) will be backed out in accordance with our regression policy.

If you need the profiling jobs you can trigger them yourself from treeherder job view or ask a sheriff to do that for you.

For more information on performance sheriffing please see our FAQ.

Flags: needinfo?(tnikkel)

Set release status flags based on info from the regressing bug 1231622

Flags: needinfo?(tnikkel) → needinfo?(aionescu)

I suspect this is mis-attributed. I pushed current trunk to try, and then current trunk with bug 1231622 backed out, and retriggered the job in question 5 times. Trunk averaged 436; with the backout, 435.4. There's an infra change marker right before the jump in the graph; maybe that's related?

I also downloaded the browsertime results tgz from before and after the change to look at the videos and see if anything was going on. I didn't notice anything related to this specific bug, but I did notice what seems to be a bug in how we calculate these figures.

It looks like we run the test 10 times. In each video the previous page is showing first, then an "orange div" covers the page. In 2 of the 10 videos the page we are measuring then appears in its very early load state (call this case A), while in the other 8 videos we briefly show the previous page again before switching to the very early load stages of the page in question (call this case B). This difference changes when we determine various events to have occurred: in case A, first visual change happens when a significant amount of content is visible on the page; in case B, first visual change happens when we switch from the previous page to the new page with almost nothing drawn. It affects other events too, like what's visible at SpeedIndex. Not sure if this is known and/or if there is someone/some team that is interested in this finding.
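To make the case A / case B difference concrete, here's a toy model of how a first-visual-change style metric falls out of per-frame differences. This is not browsertime's actual code; baseline, diff, and the threshold are all stand-ins.

def first_visual_change(frames, baseline, diff, threshold=0.01):
    # frames: list of (timestamp_ms, image) pairs sampled from the video.
    # Returns the timestamp of the first frame whose difference from the
    # baseline frame exceeds the threshold.
    for timestamp, image in frames:
        if diff(baseline, image) > threshold:
            return timestamp
    return None

# Case A: orange div -> early load of the page being measured. The first
# frame that trips the threshold already shows a meaningful amount of content.
# Case B: orange div -> brief flash of the previous page -> page being
# measured. The switch away from the previous page trips the threshold while
# almost nothing of the new page is drawn, so FirstVisualChange (and
# everything keyed off it, like what's visible at SpeedIndex) is measured
# from a different point than in case A.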

I'm pretty sure that at one point bug 1763643 improved this, and the backout of bug 1766333 would have caused the reversion to the mean.

The graph is very noisy and, despite the magnitude of the change, the regression is hard to make out around that point. I retriggered f0fda878f51a5 to check the behavior of the infra. If it's not infra, we can close this.

Flags: needinfo?(aionescu)

After doing a bunch of retriggers on autoland it's pretty clear that the graph does go up exactly at bug 1231622.

However, when I push various things to try I only get the higher numbers. Things I've pushed to try: current trunk, current trunk with my patch backed out, and hg update <revid> where revid is one of various revisions well before my changeset landed. So it's impossible to investigate this via try; somehow it gets different numbers than what autoland gets. And I'm comparing against jobs triggered on autoland at the same time as the pushes, so infra changes shouldn't be a factor.

I've done a deep dive into the downloadable browsertime json/videos to try to understand what is going on. In addition to the problem I noted in comment 2, I've also noticed that a page load that is faster in all respects (all visual milestones are reached sooner) can get a larger ContentfulSpeedIndex. In detail, the visual load of the page in question happens in 5 discrete chunks: the skeleton page first shows up, the bg image is drawn, the title image is drawn, the skeleton page goes away, and the actual page content shows up.

Comparing two page loads; each number is the timestamp in ms at which that load hits the milestone:

Milestone                  Load A  Load B
skeleton UI shows up          200     160
bg image drawn                320     320
title image drawn             320     320
skeleton page goes away       480     440
page content shows up         920     840

As you can see, page load B is always as fast as or faster than page load A; however, page load A scores a ContentfulSpeedIndex of 284 while page load B scores 345.

Digging into the browsertime json file, we can find a map from timestamp to ContentfulSpeedIndex percent complete, which looks like it is used to compute the final score. At timestamp 160ms page load A is 52 percent complete (note that page load A hasn't even reached its first visual milestone at that point, but perhaps 160ms just got rounded up to 200ms, so we'll let it pass). Page load B, on the other hand, doesn't even hit 52 percent at its second-to-last visual milestone (skeleton page goes away); it only surpasses 52 percent when it hits 99 percent at 760ms, just before the page content fully shows up at 840ms.

This is not an isolated example. So my trust in this metric is not very high.
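For reference, here is a back-of-the-envelope sketch of how a SpeedIndex-style score is typically derived from that timestamp -> percent-complete map, assuming the usual "area above the completeness curve" definition. The curves below are made up (they are not the actual data from this run), but they have roughly the shape described above and show how an early jump in estimated "contentful" completeness lowers the score even when every visual milestone lands later.

def speed_index(progress):
    # progress: sorted list of (timestamp_ms, percent_complete) samples,
    # ending at 100. The score is the area above the completeness curve:
    # for each interval, (1 - completeness at start of interval) * length.
    score = 0.0
    prev_t, prev_p = 0, 0.0
    for t, p in progress:
        score += (1.0 - prev_p / 100.0) * (t - prev_t)
        prev_t, prev_p = t, p
    return score

# Made-up completeness curves: load A is judged ~52% "contentful" very
# early, load B stays low until just before the content appears.
load_a = [(200, 52), (320, 60), (480, 70), (920, 100)]
load_b = [(160, 10), (320, 20), (440, 30), (760, 99), (840, 100)]

print(speed_index(load_a))  # ~454: scores better (lower) than load B...
print(speed_index(load_b))  # ~625: ...even though load B reaches every
                            # visual milestone at the same time or sooner.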

The draw will be pointless, and it regresses one perf metric.
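For readers following along, a minimal sketch of the guard the patch describes, written in Python purely for brevity; the real change is C++ in ImageLib/layout, and every name below is hypothetical rather than an actual Gecko API.

def maybe_draw_background_image(image, target_region, context):
    # Hypothetical stand-in for the background-image drawing path.
    if image.decoded_pixel_count() == 0:
        # Nothing has been decoded yet, so a partial draw would paint
        # nothing useful; skip the pointless work.
        return
    context.draw(image, image.decoded_region().intersect(target_region))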

Assignee: nobody → tnikkel
Status: NEW → ASSIGNED
Pushed by tnikkel@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/eda72c9d12f1 Don't bother to try to do a partial draw of a background image if we haven't decoded any pixels. r=aosmond
Status: ASSIGNED → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED
Target Milestone: --- → 102 Branch
Regressions: 1770464

(In reply to Pulsebot from comment #7)

Pushed by tnikkel@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/eda72c9d12f1
Don't bother to try to do a partial draw of a background image if we haven't
decoded any pixels. r=aosmond

== Change summary for alert #34197 (as of Sun, 22 May 2022 00:27:03 GMT) ==

Regressions:

Ratio | Test | Platform | Options | Absolute values (old vs new)
424% | facebook-nav.marketplace LastVisualChange | macosx1015-64-shippable-qr | cold fission webrender | 1,193.33 -> 6,253.33
407% | facebook-nav.marketplace LastVisualChange | linux1804-64-shippable-qr | cold fission webrender | 1,250.00 -> 6,331.67
12% | outlook ContentfulSpeedIndex | windows10-64-shippable-qr | fission warm webrender | 870.75 -> 976.67
6% | facebook-nav.marketplace SpeedIndex | linux1804-64-shippable-qr | cold fission webrender | 1,082.38 -> 1,150.88

For up to date results, see: https://treeherder.mozilla.org/perfherder/alerts?id=34197

:alexandrui did you mean to needinfo anyone on the last comment? did the last patch cause further regressions, or are the latest alerts from the original bug?

The huge facebook-nav.marketplace LastVisualChange changes are just the tests going back to normal (the first comment here shows the reverse change), but the tests seem buggy; these patches shouldn't be having that kind of impact on a properly calibrated test.

The facebook-nav.marketplace SpeedIndex is also the test going back to normal (reverse change is in comment 0).

Looking at the graph for outlook ContentfulSpeedIndex, it looks like bug 1231622 caused an improvement equal to the regression here when it landed.

So everything is back to normal here as far as I can tell.

Why these two patches move the numbers here at all I'm not sure; I suspect the tests aren't well calibrated or something.

(In reply to Dave Hunt [:davehunt] [he/him] ⌚GMT from comment #10)

:alexandrui did you mean to needinfo anyone on the last comment? did the last patch cause further regressions, or are the latest alerts from the original bug?

nope, it's just a regression fix.

I looked into the facebook-nav.marketplace LastVisualChange changes that happened here because I was investigating something similar (bug 1771977). When LastVisualChange is around 1 second, it's because we never draw the little chat overlay icon in the bottom right (or we draw it after the browsertime analysis is complete). When LastVisualChange is 6 seconds, we waited until that little chat overlay icon was drawn.
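A toy model of why that makes the metric so bimodal; again, this is not browsertime's actual code, and diff and the threshold are stand-ins.

def last_visual_change(frames, diff, threshold=0.01):
    # frames: list of (timestamp_ms, image) pairs from the recording.
    # Returns the timestamp of the last frame that visibly differs from
    # the frame before it, i.e. the last time anything on screen changed.
    last = frames[0][0]
    for (_, prev), (t, cur) in zip(frames, frames[1:]):
        if diff(prev, cur) > threshold:
            last = t
    return last

# If the chat overlay icon gets painted while the recording is still being
# analyzed, it is the final visible change and LastVisualChange lands around
# 6 s. If it never gets painted (or only after analysis is done), the last
# change is the main content finishing around 1 s, which is why the metric
# clusters around two very different values.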

See Also: → 1773020