Closed Bug 1858236 Opened 1 year ago Closed 10 months ago

Resolve issues with chrome release on m2 osx speedometer 3 / sp3 tests

Categories

(Testing :: Raptor, defect, P2)

Tracking

(firefox122 fixed)

RESOLVED FIXED
122 Branch
Tracking Status
firefox122 --- fixed

People

(Reporter: sparky, Assigned: aglavic)

References

(Depends on 1 open bug, Blocks 1 open bug)

Details

(Keywords: perf-alert, Whiteboard: [fxp])

Attachments

(1 file)

This bug is for resolving the issues with chrome release on m2 osx speedometer 3 tests. Currently they take a very long time to complete, and are timing out: https://treeherder.mozilla.org/jobs?repo=mozilla-central&tier=1%2C2%2C3&searchStr=speedometer3%2Cchrome%2C1300&revision=6404412771ea15ef1c719a515dd1369360fb8d4d

Blocks: 1809667
Depends on: 1858504
Blocks: 1858666
Assignee: nobody → aglavic

Closed the duplicate bug and wanted to copy over my initial relevant comment:

Chrome is taking a long time to complete tasks on osx13 M2s. This bug is to understand exactly why that is occurring.

For context, on Firefox there have been no timeouts on osx1300 for speedometer3 in the past 2 months.
Speedometer3 takes 50% longer to run than Speedometer on Firefox.

On Chrome, speedometer3 has completed only 2 times this entire October (2/11), and it takes 30+ minutes.

What I see is that regular speedometer takes ~20 min, so 50% more puts speedometer3 at ~30 min as well.
This seems to be an issue with Chrome on the Macs rather than with speedometer or the Macs themselves.

Speedometer takes:

  • 5 minutes and 40 to 50 seconds per iteration on macosx1300 for Chrome
  • 30 seconds per iteration on macosx1300 for Firefox
Duplicate of this bug: 1859836

I profiled the Chrome runs and found something that looked odd to me.
This job on treeherder failed to generate perfherder data but did produce a profiled run. The profiled run took about 10 min (from 16:31:51 to 16:42:44), and the resulting profile lasts around 10 min and 25 seconds. In this run speedometer was run 5 times, which might explain why speedometer and speedometer3 are taking so much longer to run: we may be running speedometer 5 times within a single speedometer iteration, so instead of 5 runs we could be running 25.
I think this may be the case because when I look at other speedometer tasks, like this one on treeherder, and investigate the profile, it shows a much more reasonable 30 seconds for a speedometer profiled run.

If we approximate the following:

  • a single run of speedometer to take 30 seconds(like it does on firefox for macosx1300)
  • Startup(ie everything from the first line of the log to the first run) to take about 2.5 minutes
  • Teardown(ie everything from last run data output to final line) to take 45 seconds
  • And 5+1 runs(plus 1 for profiled)

The speedometer run should take about 30*6+150+45 = 375 seconds, or 6 min 15 seconds, which matches up with similar jobs on perfherder. If we are instead running 5x the number of runs, not only do we have to multiply by 5, but we also need to add 30 seconds for each repeated post-startup-delay, which is set to 30 seconds by default. This changes the above math to (30+30)*6*5+150+45 = 1995 seconds, or 33.25 minutes, which is suspiciously close to the actual time recorded for speedometer on chrome on the mac1300s, which is 32 min.
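To make the estimate above easier to re-check, here is a minimal sketch of the same arithmetic. The function name and parameters are made up for illustration and are not part of Raptor; the inputs are just the approximations listed above (30 second runs, 150 seconds of startup, 45 seconds of teardown, 5+1 runs).

```python
def estimated_job_seconds(run_s, startup_s, teardown_s, runs=6,
                          repeat_factor=1, post_startup_delay_s=0):
    """Rough wall-clock estimate for one job (hypothetical helper, not Raptor code).

    run_s: time for a single suite run
    runs: measured runs plus the profiled run (5 + 1)
    repeat_factor: how many times each run is suspected to repeat
    post_startup_delay_s: delay paid before every repeated run
    """
    per_run = run_s + post_startup_delay_s
    return per_run * runs * repeat_factor + startup_s + teardown_s


# Expected case: each run happens once, no extra delay counted.
baseline = estimated_job_seconds(30, 150, 45)

# Suspected case: every run repeats 5 times and each repeat pays the 30 s delay.
suspected = estimated_job_seconds(30, 150, 45, repeat_factor=5,
                                  post_startup_delay_s=30)

print(baseline, baseline / 60)    # seconds, minutes
print(suspected, suspected / 60)  # seconds, minutes
```

The baseline comes out at 375 seconds and the suspected case at 1995 seconds (33.25 minutes), the same figures as in the comment.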

Applying the same logic to speedometer3, gives:

  • a single run of speedometer3 takes 1 minute(like it does on firefox for macosx1300)
  • Startup(ie everything from the first line of the log to the first run) to take about 1:45 minutes
  • Teardown(ie everything from last run data output to final line) to take 30 seconds
  • And 5+1 runs(plus 1 for profiled)

gives 60*6+105+30 = 495 seconds or 8:15 minutes, an approximation slightly under the expected completion time of 9 min

Applying the 5x and extra 30 second rule from my above comment we get:
(60+30)*6*5+105+30 = 2835 seconds, or about 47:15 min, which is far from the actual 67 min but closer than any current reasonable expectation

I also found the following: if you look at the speedometer profiled runs in chrome, you can see the markers indicating the start of the tests repeat, and if you zoom in you can see the page reloading and beginning the test suite again.

My theory was wrong. I am re-assessing and working with mstange to see what is going on.

After going through a profile with mstange, it doesn't appear to be stuck on any one particular task.
Rather, it appears that each individual test is just taking a long time. I have filed a relops ticket and requested that they investigate whether something is wrong with chrome that makes it take so long.

Our best theory for what's going on here is that we're somehow accidentally running the Intel build of Chrome.

We vnced into a machine and found the following:
Chrome is being run as x86, its parent process (the chromedriver) is arm64, and the chromedriver's parent process (node) is Intel
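For anyone who wants to check this without VNC, here is a rough sketch of one way to tell whether a process is running under Rosetta. This is not what we did on the machine. It assumes the macOS `sysctl.proc_translated` key, which reports 1 for a translated process and 0 for a native one, and it only describes the process that runs the script (and the `sysctl` child it spawns), not an arbitrary pid. The helper names are made up.

```python
import platform
import subprocess


def describe_translation(proc_translated_output: str) -> str:
    """Interpret the text printed by `sysctl -n sysctl.proc_translated`."""
    value = proc_translated_output.strip()
    if value == "1":
        return "translated"  # x86_64 code running under Rosetta
    if value == "0":
        return "native"
    return "unknown"  # key missing, e.g. not macOS or an Intel Mac


def current_process_status() -> str:
    """Ask sysctl whether the calling process is translated (macOS only)."""
    try:
        result = subprocess.run(
            ["sysctl", "-n", "sysctl.proc_translated"],
            capture_output=True, text=True, check=False,
        )
    except OSError:
        return "unknown"  # sysctl binary not available
    if result.returncode != 0:
        return "unknown"
    return describe_translation(result.stdout)


if __name__ == "__main__":
    # platform.machine() reports the architecture this interpreter runs as.
    print(platform.machine(), current_process_status())
```

Run from the same environment that launches node, a "translated" result would point at the x86 path described above.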

A successful try run was completed yesterday, patch incoming to resolve the issue

We were experiencing an issue of chrome on M2 chips taking 5-7 times longer than firefox on M2 chips
After debugging it was discovered the chrome universal binary was running as x86 under rosetta
The chromedriver version was arm, but the node process running the chromedriver was x86
This patch changes node to run as arm instead of x86, which looks to have slipped through the initial setup of the M2s

Attachment #9365216 - Attachment description: Bug 1858236 - Slow Chrome speedometer and speedometer3 completion on M2s. r?#perftest → Bug 1858236 - Change node version to arm64 on M2 macs
Attachment #9365216 - Attachment description: Bug 1858236 - Change node version to arm64 on M2 macs → Bug 1858236 - Change node architecture to arm64 on M2 macs

:glandium I believe it should be being used now? Try from the patch https://firefox-ci-tc.services.mozilla.com/tasks/cLLWlBojQca7MA56W0iDHQ has the osx aarch64 node as dependency

Attachment #9365216 - Attachment description: Bug 1858236 - Change node architecture to arm64 on M2 macs → Bug 1858236 - Slow Chrome speedometer and speedometer3 completion on M2s. r?#perftest
Attachment #9365216 - Attachment description: Bug 1858236 - Slow Chrome speedometer and speedometer3 completion on M2s. r?#perftest → Bug 1858236 - Change node architecture to arm64 on M2 macs. r?#perftest
Depends on: 1866703
Pushed by aglavic@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/ab3f06ef0efa Change node architecture to arm64 on M2 macs. r=perftest-reviewers,taskgraph-reviewers,kshampur,jmaher
Status: NEW → RESOLVED
Closed: 10 months ago
Resolution: --- → FIXED
Target Milestone: --- → 122 Branch
See Also: → 1848400

(In reply to Pulsebot from comment #15)

Pushed by aglavic@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/ab3f06ef0efa
Change node architecture to arm64 on M2 macs.
r=perftest-reviewers,taskgraph-reviewers,kshampur,jmaher

== Change summary for alert #40445 (as of Tue, 28 Nov 2023 22:28:13 GMT) ==

Improvements:

Ratio  Test          Platform                    Options            Absolute values (old vs new)  Performance Profiles
87%    speedometer   macosx1300-64-shippable-qr  fission webrender  227.46 -> 424.43
83%    speedometer3  macosx1300-64-shippable-qr  fission webrender  15.78 -> 28.95                Before/After
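
As a sanity check on the Ratio column, the sketch below recomputes the relative change from the old and new absolute values. It assumes the ratio is a plain percent change, (new - old) / old; that reproduces the two rows above but is not taken from Perfherder's code.

```python
def percent_change(old: float, new: float) -> float:
    """Relative change from old to new, in percent (assumed formula)."""
    return (new - old) / old * 100.0


# Absolute values copied from the alert rows above.
rows = {
    "speedometer": (227.46, 424.43),
    "speedometer3": (15.78, 28.95),
}

for test, (old, new) in rows.items():
    print(f"{test}: {round(percent_change(old, new))}%")
```

Speedometer scores are higher-is-better, so these increases are improvements.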

For up to date results, see: https://treeherder.mozilla.org/perfherder/alerts?id=40445

Keywords: perf-alert