Closed Bug 1717198 Opened 3 years ago Closed 3 years ago

44 - 2.97% twitch FirstVisualChange / tumblr PerceptualSpeedIndex + 7 more (Linux) regression on Tue June 15 2021

Categories

(Firefox Build System :: Toolchains, defect)

defect

Tracking

(firefox-esr78 unaffected, firefox89 unaffected, firefox90 unaffected, firefox91 affected)

RESOLVED WONTFIX
Tracking Status
firefox-esr78 --- unaffected
firefox89 --- unaffected
firefox90 --- unaffected
firefox91 --- affected

People

(Reporter: Bebe, Assigned: mhentges)

References

(Regression)

Details

(Keywords: perf, perf-alert, regression)

Attachments

(1 obsolete file)

Perfherder has detected a browsertime performance regression from push cd54d705a55dcb037a99b25d89475fba2bafe795. Since you authored one of the patches included in that push, we need your help to address this regression.

Regressions:

Ratio Suite Test Platform Options Absolute values (old vs new)
44% twitch FirstVisualChange linux1804-64-shippable cold 83.33 -> 120.00
44% twitch FirstVisualChange linux1804-64-shippable cold 83.33 -> 120.00
18% twitter FirstVisualChange linux1804-64-shippable cold 203.33 -> 240.00
14% reddit FirstVisualChange linux1804-64-shippable-qr cold webrender 280.00 -> 320.00
11% netflix FirstVisualChange linux1804-64-shippable-qr cold webrender 251.67 -> 280.00
5% twitter fcp linux1804-64-shippable cold 209.75 -> 220.96
4% reddit SpeedIndex linux1804-64-shippable-qr cold webrender 457.46 -> 477.25
4% reddit fnbpaint linux1804-64-shippable-qr cold webrender 271.46 -> 282.71
3% tumblr PerceptualSpeedIndex linux1804-64-shippable cold 727.42 -> 749.00

Improvements:

Ratio Suite Test Platform Options Absolute values (old vs new)
17% netflix fcp linux1804-64-shippable-qr warm webrender 95.62 -> 79.46
17% netflix fnbpaint linux1804-64-shippable-qr warm webrender 96.85 -> 80.58
6% tumblr fnbpaint linux1804-64-shippable-qr warm webrender 280.60 -> 264.21
5% youtube fnbpaint linux1804-64-shippable-qr warm webrender 256.29 -> 243.42
5% twitter fcp linux1804-64-shippable warm 188.18 -> 178.88
... ... ... ... ... ...
2% microsoft loadtime linux1804-64-shippable-qr warm webrender 336.90 -> 329.33

Details of the alert can be found in the alert summary, including links to graphs and comparisons for each of the affected tests. Please follow our guide to handling regression bugs and let us know your plans within 3 business days, or the offending patch(es) will be backed out in accordance with our regression policy.

For more information on performance sheriffing please see our FAQ.

Flags: needinfo?(dmosedale)

Set release status flags based on info from the regressing bug 1690377

FWIW, the chances of these regressions really having been caused by the NodeJS upgrade strike me as pretty low.

Unfortunately, I'm out on PTO, so I'm going to have to depend on some folks here.

First off, my sense is that backing this out should be avoided if at all possible, because we've just asked all the developers to upgrade their toolchains to Node 12, and since --enable-bootstrap was landed after this patch, if we back it out, everybody will have to downgrade by running mach bootstrap manually, which sounds exceedingly painful. mhentges, would you agree? If so, can you help prevent that from happening?
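
For context, the manual step every developer would otherwise need is roughly the following (just a sketch; the exact flag spelling is an assumption to double-check against mach bootstrap --help):

  # Re-run bootstrap so the local toolchains match the backed-out revision;
  # --application-choice keeps the run non-interactive.
  ./mach bootstrap --application-choice browser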

The main thing that the NodeJS upgrade would affect in terms of what gets built on the build machines, as far as I know, is the bundling of devtools. That strikes me as quite unlikely to have caused the perf regressions linked here, but maybe there's something I don't know about how the devtools code works. Honza, what are your thoughts on this?

One possibility that springs to mind is if Perfherder itself uses the in-tree Node toolchain (e.g., a Node script does the testing and timing using WebDriver or Puppeteer or something), and somehow changing versions of Node changed that timing. The chances of this strike me as fairly low, but I've been wrong before! :Bebe, would you please reach out to a Perfherder owner and ask them whether that theory is even slightly plausible, or if they have any better suggestions of things to look into?

Mardak, Standard8, can you guys think of anything that could explain this?

mhentges, would you be willing to take this bug in my absence and drive it?

Thanks to all for any and all help/input.

Flags: needinfo?(standard8)
Flags: needinfo?(odvarko)
Flags: needinfo?(mhentges)
Flags: needinfo?(fstrugariu)
Flags: needinfo?(edilee)
Flags: needinfo?(dmosedale)

since --enable-bootstrap was landed after this patch

Only for mac right now.

everybody will have to downgrade by running mach bootstrap manually

Actually, with --enable-bootstrap, the downgrade would be automatic. The pain would be when --enable-bootstrap is not used.
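
For reference, a minimal sketch of what enabling that looks like in a developer's mozconfig (assuming a standard mozconfig setup):

  # Let configure fetch and manage toolchain artifacts (including node)
  # automatically, instead of requiring a manual `mach bootstrap` after
  # toolchain bumps.
  ac_add_options --enable-bootstrap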

One possibility that springs to mind is if Perfherder itself uses the in-tree Node toolchain (e.g., a Node script does the testing and timing using WebDriver or Puppeteer or something), and somehow changing versions of Node changed that timing.

An easy way to check if it's the node version for tests would be to change the fetches definitions for tests to use the node 10 toolchains instead of node 12.
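
A rough sketch of how one might locate those definitions (the paths and toolchain names here are assumptions to verify in-tree):

  # Find where test tasks fetch the node toolchain, then point the node 12
  # reference (e.g. something like linux64-node-12) back at the node 10
  # toolchain for a try push.
  grep -rn "node-1" taskcluster/ | grep -iE "fetch|toolchain"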

I can't think of anything else apart from the possible effect of test harnesses.

Flags: needinfo?(standard8)

I am not seeing anything related to DevTools in the alert summary
https://treeherder.mozilla.org/perfherder/alerts?id=30404
Julian, is there anything I am missing?

Honza

Flags: needinfo?(odvarko) → needinfo?(jdescottes)

At runtime, nothing in DevTools or Remote uses node. However, the harness itself uses node, I think (https://searchfox.org/mozilla-central/source/testing/raptor/browsertime). If there's any performance regression linked to updating node, I'd guess it's related to the harness?

Regarding bundling, node is used to bundle the DevTools debugger UI, but that bundle is only loaded if you open DevTools, which is not the case here.

Flags: needinfo?(jdescottes)

:Honza, :jdescottes: I was just imagining some exceedingly unlikely world where the bundling step generated some bit of code that got run at startup, despite the fact that devtools itself doesn't start at that point, and somehow the version of node affected what got generated into the bundle. In other words, I was grasping at straws. :-)

(In reply to Dan Mosedale (:dmose, :dmosedale) from comment #8)

:Honza, :jdescottes: I was just imagining some exceedingly unlikely world where the bundling step generated some bit of code that got run at startup, despite the fact that devtools itself doesn't start at that point, and somehow the version of node affected what got generated into the bundle. In other words, I was grasping at straws. :-)

That's understandable; DevTools has a bad track record of slowing down browser startup unexpectedly :)
FWIW, we mentioned this to the rest of the devtools peers, in case we missed something.

(In reply to Mike Hommey [:glandium] from comment #3)

since --enable-bootstrap was landed after this patch

Only for mac right now.

It sounds like we should definitely try to avoid a backout.

(In reply to Mike Hommey [:glandium] from comment #4)

An easy way to check if it's the node version for tests would be to change the fetches definitions for tests to use the node 10 toolchains instead of node 12.

I've put up a simple patch which should make perftest run using node-10 on Linux. To test it, I ran:

  ./mach try fuzzy --no-artifact --rebuild 3

and then I selected the jobs that matched linux-browsertime. Unfortunately, this keeps failing with connection errors, almost certainly because I'm on a flight with a very flaky internet connection.

I think the next step is to run two try builds like that, one with my patch, and one without, and compare the jobs. I've been using https://wiki.mozilla.org/TestEngineering/Performance/RunningTests as my playbook, as I haven't done this before.
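
A sketch of what those two pushes could look like (the fuzzy query string here is an assumption, not the exact selection used):

  # Push once from plain central and once with the node-10 patch applied;
  # -q preselects the Linux browsertime jobs so both pushes run the same set,
  # and the two pushes can then be compared in Perfherder's Compare view.
  ./mach try fuzzy --no-artifact -q "'browsertime 'linux"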

I'll be the unlucky soul who picks up this ticket :)
I've taken Dan's patch and have pushed the following to try:

  • All linux-browsertime on central (2ad6d79): treeherder
  • All linux-browsertime on D118387: treeherder

I'll keep an eye on this and continue based on results.

Assignee: nobody → mhentges
Status: NEW → ASSIGNED
Flags: needinfo?(mhentges)

Comparison is here.
Note that the twitch FirstVisualChange tests did not have the same level of regression (I searched for "twitch FirstVisualChange" to spot them in the big comparison list).

However, I'm seeing a lot of "jitter" here with all sorts of results that don't match the alert as reported.
I'm going to do a second try run off central to compare against - if it also significantly varies from the original central run, then I'll have increased confidence that this NodeJS change didn't cause the perf regression.

OK, I have some solid evidence that this regression may not be due to the NodeJS change.
Take a look at this comparison between ea659615bc03dde68a1fd9f486b4b7b6cb4450e8 (central + try push revision) and 6c5f580a8d5f7c39cb36ecfb8464e0d1e4dd5b5a (central + a try push revision made at a different time). They vary pretty greatly (amazon ContentfulSpeedIndex opt cold has an 11% regression?) despite the fact that they have the same code.

I'm going to bug :sparky to see how I can wrangle these numbers.

Flags: needinfo?(gmierz2)

removing the ni? and waiting for :sparky for more information

Flags: needinfo?(fstrugariu)

(In reply to Mitchell Hentges [:mhentges] 🦀 from comment #14)

OK, I have some solid evidence that this regression may not be due to the NodeJS change.
Take a look at this comparison between ea659615bc03dde68a1fd9f486b4b7b6cb4450e8 (central + try push revision) and 6c5f580a8d5f7c39cb36ecfb8464e0d1e4dd5b5a (central + a try push revision made at a different time). They vary pretty greatly (amazon ContentfulSpeedIndex opt cold has an 11% regression?) despite the fact that they have the same code.

I'm going to bug :sparky to see how I can wrangle these numbers.

:mhentges, the reason your comparison numbers are bad is that the vismet tasks all ran on the same data. I noticed that you used --rebuild, which doesn't work for browsertime and vismet. You'll need to use something like the add-new-jobs action or the retrigger action to add more trials.

Flags: needinfo?(gmierz2)

Ok, sending these again, but this time without --rebuild.
My command for each of these was ./mach try fuzzy --no-artifact

  • All linux-browsertime on central (70be81): treeherder
  • All linux-browsertime on central (70be81), again: treeherder
  • All linux-browsertime on central + D118387: treeherder

(also clearing obsolete NI for edilee)

Flags: needinfo?(edilee)

The results are in!

:sparky, do these results and my conclusions seem accurate to you? Can we mark this ticket as invalid?

Flags: needinfo?(gmierz2)

I would accept these changes and resolve this as WONTFIX rather than INVALID: the metric changes are valid, but we shouldn't be concerned about changes caused by infrastructure changes.

The differences seem to be caused by changes in the variance of the pageload - the variance decreased, so we're seeing improvements in the pageload numbers.

Flags: needinfo?(gmierz2)

Works for me, thanks :) 👍

Status: ASSIGNED → RESOLVED
Closed: 3 years ago
Resolution: --- → WONTFIX
Attachment #9228175 - Attachment is obsolete: true
Has Regression Range: --- → yes