Resolve issues with chrome release on m2 osx speedometer 3 / sp3 tests
Categories
(Testing :: Raptor, defect, P2)
Tracking
(firefox122 fixed)
 | Tracking | Status |
---|---|---|
firefox122 | --- | fixed |
People
(Reporter: sparky, Assigned: aglavic)
References
(Depends on 1 open bug, Blocks 1 open bug)
Details
(Keywords: perf-alert, Whiteboard: [fxp])
Attachments
(1 file)
This bug is for resolving the issues with chrome release on m2 osx speedometer 3 tests. Currently they take a very long time to complete, and are timing out: https://treeherder.mozilla.org/jobs?repo=mozilla-central&tier=1%2C2%2C3&searchStr=speedometer3%2Cchrome%2C1300&revision=6404412771ea15ef1c719a515dd1369360fb8d4d
Comment 1 • 11 months ago (Assignee)
Closed the duplicate bug and wanted to copy over my initial relevant comment:
Chrome is taking a very long time to complete tasks on the osx13 M2s; this bug is to understand exactly why that is occurring.
For context, on Firefox there have been no speedometer3 timeouts on osx1300 in the past 2 months.
On Firefox, Speedometer3 takes 50% longer to run than Speedometer.
This month (October), speedometer3 has completed only 2 times (2/11), and it takes 30+ minutes.
What I see is that regular speedometer takes ~20 min, so 50% more would put speedometer3 at ~30 min as well.
This seems to be an issue with Chrome on the Macs rather than with speedometer or the Macs themselves.
Speedometer takes:
- 5 minutes and 40 to 50 seconds per iteration on macosx1300 for Chrome
- 30 seconds per iteration on macosx1300 for Firefox
Comment 3 • 11 months ago (Assignee)
I profiled the Chrome runs and found something that looked odd to me.
This job on treeherder failed to generate perfherder data but did generate a profiled run. The profiled run took about 10 min (from 16:31:51 to 16:42:44), and the resulting profile lasts around 10 min and 25 seconds. In this run speedometer was run 5 times, which might explain why speedometer and speedometer3 are taking so much longer to run: we may be running speedometer 5 times in a single speedometer iteration, so instead of 5 runs we could be running 25.
I think this may be the case because when I look at other speedometer tasks, like this one on treeherder, and investigate the profile, it shows a much more reasonable 30 seconds for a speedometer profiled run.
Comment 4 • 11 months ago (Assignee)
If we approximate the following:
- a single run of speedometer takes 30 seconds (like it does on Firefox for macosx1300)
- startup (i.e. everything from the first line of the log to the first run) takes about 2.5 minutes
- teardown (i.e. everything from the last run's data output to the final line) takes 45 seconds
- there are 5+1 runs (plus 1 for profiled)
then the speedometer run should take about 30*6+150+45 = 375 seconds, or 6 min 15 seconds, which matches the similar jobs on perfherder. Running 5x the number of runs means we not only have to multiply by 5 but also need to add 30 seconds for each repeated post-startup-delay, which is set to 30 seconds by default. This changes the math above to (30+30)*6*5+150+45 = 1995 seconds, or 33.25 minutes, which is suspiciously close to the actual 32 min recorded for speedometer on Chrome on the mac1300s.
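The estimate above can be reproduced with a few lines of Python. This is only a sketch of the arithmetic in this comment: the function name and its parameters are made up for illustration and are not part of Raptor or any harness code.

```python
def estimate_job_seconds(run_s, startup_s, teardown_s, runs=6, repeats=1, delay_s=0):
    """Rough wall-clock estimate for one job, in seconds.

    run_s:      time for a single benchmark run
    runs:       number of runs per job (5 regular + 1 profiled)
    repeats:    hypothetical factor for the suspected 5x repetition
    delay_s:    extra post-startup delay paid on every repeated run
    """
    return (run_s + delay_s) * runs * repeats + startup_s + teardown_s


# Expected case: 6 runs of 30 s, 150 s startup, 45 s teardown.
expected = estimate_job_seconds(30, 150, 45)

# Suspected case: every run repeated 5 times, each paying a 30 s delay.
suspected = estimate_job_seconds(30, 150, 45, repeats=5, delay_s=30)

print(expected, expected / 60)    # 375 6.25
print(suspected, suspected / 60)  # 1995 33.25
```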
Comment 5 • 11 months ago (Assignee)
Applying the same logic to speedometer3 gives:
- a single run of speedometer3 takes 1 minute (like it does on Firefox for macosx1300)
- startup (i.e. everything from the first line of the log to the first run) takes about 1:45 minutes
- teardown (i.e. everything from the last run's data output to the final line) takes 30 seconds
- there are 5+1 runs (plus 1 for profiled)
This gives 60*6+105+30 = 495 seconds, or 8:15 minutes, an approximation slightly under the expected completion time of 9 min.
Applying the 5x and extra-30-second rule from my comment above, we get:
(60+30)*6*5+120+30 = 2850 seconds, or 47:30 min, which is far from the actual 67 min but closer than any current reasonable expectation.
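As a quick check of the speedometer3 figures, the two expressions can be evaluated and converted to minutes and seconds. This is a sketch that uses the numbers exactly as written in this comment (including the 120-second startup in the second expression); the helper name is made up for illustration.

```python
def to_min_sec(seconds):
    """Format a duration in seconds as M:SS."""
    minutes, secs = divmod(int(seconds), 60)
    return f"{minutes}:{secs:02d}"


run_s, runs = 60, 6  # 1 minute per run, 5 regular runs + 1 profiled

baseline = run_s * runs + 105 + 30             # expected job time
inflated = (run_s + 30) * runs * 5 + 120 + 30  # 5x runs, +30 s delay each
observed = 67 * 60                             # actual time seen on Chrome

print(baseline, to_min_sec(baseline))  # 495 8:15
print(inflated, to_min_sec(inflated))  # 2850 47:30
print(to_min_sec(observed - inflated)) # 19:30 still unexplained
```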
Comment 6 • 11 months ago (Assignee)
Also found the following: if you look at the speedometer profiled runs in Chrome, you can see the markers indicating the start of the tests repeat, and if you zoom in you can see the page reloading and beginning the test suite again.
Comment 7 • 11 months ago (Assignee)
My theory was wrong. I am re-assessing and working with mstange to see what is going on.
Comment 8 • 11 months ago (Assignee)
After going through a profile with mstange, it doesn't appear to be stuck on any one particular task.
Rather, it appears that each individual test is just taking a long time. I have filed a relops ticket and requested that they investigate whether something is wrong with Chrome that is making it take so long.
Comment 9 • 10 months ago
Our best theory for what's going on here is that we're somehow accidentally running the Intel build of Chrome.
Comment 10 • 10 months ago (Assignee)
We VNCed into a machine and found the following:
Chrome is being run as x86; its parent process (the chromedriver) is arm64, and the chromedriver's parent process (node) is Intel.
Comment 11 • 10 months ago (Assignee)
A successful try run was completed yesterday; a patch to resolve the issue is incoming.
Comment 12 • 10 months ago (Assignee)
We were experiencing an issue where Chrome on M2 chips took 5-7 times longer than Firefox on M2 chips.
After debugging, it was discovered that the Chrome universal binary was running as x86 under Rosetta.
The chromedriver was arm, but the node process running the chromedriver was x86.
This patch changes node to run as arm rather than x86, which looks to have slipped through the initial setup of the M2s.
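For anyone wanting to check this kind of thing without VNCing into a machine, a small script along these lines could report whether a process is running natively or under Rosetta. This is a sketch, not what was used in this bug: the function names are made up, and it assumes that the `sysctl` child process inherits the translation state of the Python process that launches it. `sysctl.proc_translated` is the macOS key that reports 1 for a translated process and 0 for a native one; it is missing on Intel Macs and on other platforms, which the code treats as "unavailable".

```python
import platform
import subprocess


def classify_process(machine, proc_translated):
    """Label a process from its machine type and Rosetta flag.

    machine:         e.g. platform.machine() ("arm64" or "x86_64")
    proc_translated: "1" under Rosetta, "0" when native, None if unavailable
    """
    if proc_translated == "1":
        return "x86_64 under Rosetta"
    if machine == "arm64":
        return "native arm64"
    if machine == "x86_64":
        return "native x86_64"
    return f"unknown ({machine})"


def read_proc_translated():
    """Return the sysctl.proc_translated value, or None if it cannot be read."""
    try:
        out = subprocess.run(
            ["sysctl", "-n", "sysctl.proc_translated"],
            capture_output=True, text=True, check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        # Key or binary not present (Intel Mac, Linux, Windows, ...).
        return None
    return out.stdout.strip()


if __name__ == "__main__":
    print(classify_process(platform.machine(), read_proc_translated()))
```

Running the same check from inside each layer of the launch chain (node, chromedriver, the browser) would show where the switch to x86 happens.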
Comment 13 • 10 months ago
Why are these not using node from toolchains? https://searchfox.org/mozilla-central/rev/5e0c42441a93d7a2307d3baa49d3a15f553d6757/taskcluster/ci/toolchain/node.yml#50
Comment 14 • 10 months ago
:glandium I believe it is being used now. The try run from the patch, https://firefox-ci-tc.services.mozilla.com/tasks/cLLWlBojQca7MA56W0iDHQ, has the osx aarch64 node as a dependency.
Comment 15 • 10 months ago
Comment 16 • 10 months ago
bugherder
Comment 17 • 10 months ago
(In reply to Pulsebot from comment #15)
Pushed by aglavic@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/ab3f06ef0efa
Change node architecture to arm64 on M2 macs.
r=perftest-reviewers,taskgraph-reviewers,kshampur,jmaher
== Change summary for alert #40445 (as of Tue, 28 Nov 2023 22:28:13 GMT) ==
Improvements:
Ratio | Test | Platform | Options | Absolute values (old vs new) | Performance Profiles |
---|---|---|---|---|---|
87% | speedometer | macosx1300-64-shippable-qr | fission webrender | 227.46 -> 424.43 | |
83% | speedometer3 | macosx1300-64-shippable-qr | fission webrender | 15.78 -> 28.95 | Before/After |
For up to date results, see: https://treeherder.mozilla.org/perfherder/alerts?id=40445