[meta] Resolve performance test issues on Ubuntu 24.04 CI tests
Categories
(Testing :: Performance, defect, P1)
Tracking
(Not tracked)
People
(Reporter: kshampur, Assigned: kshampur)
References
(Depends on 5 open bugs, Blocks 1 open bug)
Details
(Keywords: meta, Whiteboard: [fxp])
Attachments
(3 files)
On the Linux 24.04 machines there are a few issues. Example push:
https://treeherder.mozilla.org/jobs?repo=try&tier=1%2C2%2C3&revision=9cf93b1fc0033b21075d62ac4ec52dd507c304e0
Some talos tests perma-fail, and for browsertime there are some silent failures (a pop-up banner appears in some videos), just to name a couple of things to start with...
Going to add information to this bug as I go along.
Comment 1•5 months ago (Assignee)
This banner's message, https://searchfox.org/mozilla-central/rev/270c20e4b063d80ce71f029b4adc4ba03a12edc0/toolkit/locales/en-US/toolkit/updates/elevation.ftl#24, is present in the recording.
Comment 2•5 months ago (Assignee)
I added a pref to this try run: https://treeherder.mozilla.org/jobs?repo=try&revision=947a3d196c87c4229706c96e6d0b0c5bb995c1cc
i.e. user_pref("security.sandbox.warn_unprivileged_namespaces", false); in the raptor user.js, and the banner seems to be gone.
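For reference, a minimal sketch of how that pref can be added locally (the exact profile path in the tree is an assumption on my part):

    # hedged sketch: append the sandbox-warning pref to the raptor profile
    # (testing/profiles/raptor/user.js is assumed to be the right file)
    echo 'user_pref("security.sandbox.warn_unprivileged_namespaces", false);' >> testing/profiles/raptor/user.js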
Comment 3•5 months ago (Assignee)
Additional update: for the pop-up banner, following the recommendations here https://support.mozilla.org/en-US/kb/linux-security-warning?as=u&utm_source=inproduct
it seems adding this AppArmor config removes the pop-up (only tested on an Ubuntu VM on my Mac), so this means we don't have to set the pref.
:aerickson is looking into making it more dynamic in CI (I had hard-coded the path in the AppArmor config to the mozilla-central build).
Locally, the damp-inspector test seemed to pass (it perma-fails on CI), so maybe this was the cause of the failure?
I forgot to check whether it failed prior to adding that AppArmor config, so I am going to spin up another VM to test that (for some reason the AppArmor config seems to persist indefinitely...)
Comment 4•5 months ago (Assignee)
(Quoting comment 3:)
locally, damp-inspector test seemed to pass (perma fails on CI) so maybe this was the cause of failure?
I forgot to check if it failed prior to adding that apparmor config so I am going to spin up another VM to test that (for some reason it seems the apparmor config persists indefinitely...)
So damp-inspector did pass on the VM; at the moment I'm unable to reproduce that talos perma-fail.
Comment 5•4 months ago (Assignee)
:aerickson is going to implement that AppArmor config on the machine, and we will re-verify the browsertime videos after.
For some reason I am unable to reproduce the talos failures on my VM, perhaps because it is aarch64? In the meantime I am getting set up with SSH access to one of the CI machines.
Comment 6•4 months ago (Assignee)
As pointed out by Kershaw, we also need to factor in the t-linux-netperf-1804 machines. It seems they were configured differently to create proper networking conditions. So we'll have to also consider a 24.04 equivalent of the netperf machines.
Comment 7•4 months ago (Assignee)
I've SSH'd into one of the machines, and weirdly a talos test (damp-inspector) that perma-fails on CI passes when I run it there. Very weird; will continue investigating.
Comment 8•4 months ago (Assignee)
Unable to replicate the damp-inspector failure; however, it seems I can sometimes replicate the damp-webconsole failure, so I will look into that one, hopefully with a solution that fixes all the talos failures... (at least the damp ones)
Comment 9•4 months ago (Assignee)
Okay, interesting: the webconsole test passed after I applied the AppArmor config. Hard to say if this is the fix for the other tests, but regardless we should deploy the AppArmor config, as it also fixes Raptor.
Comment 10•4 months ago (Assignee)
Something like this would work:

# This profile allows everything and only exists to give the
# application a name instead of having the label "unconfined"
abi <abi/4.0>,
include <tunables/global>

profile firefox-local /home/cltbld/tasks/task_*/build/application/firefox/firefox flags=(unconfined) {
  userns,

  # Site-specific additions and overrides. See local/README for details.
  include if exists <local/firefox>
}

(the task_* wildcard is there since the task_<id> changes each time in CI)
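For reference, a rough sketch of how this profile could be put in place manually on a machine (the /etc/apparmor.d/ location and the reload commands are assumptions here, not the actual puppet-managed deployment):

    # sketch: install the profile and reload AppArmor
    sudo cp firefox-local /etc/apparmor.d/firefox-local
    sudo systemctl restart apparmor.service
    # or, assuming apparmor_parser is available (it is on stock Ubuntu 24.04), reload just this profile
    sudo apparmor_parser -r /etc/apparmor.d/firefox-local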
Comment 11•4 months ago (Assignee)
oh and great we can ignore the tabswitch failures as well, we don't even run it and infact intend to remove it from selection in Bug 1980619
Comment 12•4 months ago (Assignee)
chrome talos test also seems to pass now on the ssh'd machine. This is promising. Now we just need to try all of it out on an actual CI run
Comment 13•4 months ago
(In reply to Kash Shampur [:kshampur] ⌚EST from comment #10)
something like this would work: [the firefox-local AppArmor profile quoted from comment 10]
This file is now being delivered to hosts in our small test pool (4 hosts) that are running the PR branch (https://github.com/mozilla-platform-ops/ronin_puppet/pull/908/).
Comment 14•4 months ago (Assignee)
Unfortunately it didn't fix the talos tests, which is bizarre since it worked when I SSH'd in...
But looking at the video artifacts of the raptor-browsertime tests, you no longer see the pop-up banner, as expected: https://treeherder.mozilla.org/jobs?repo=try&tier=1%2C2%2C3&revision=4446ecb89b20f3bda913fe1b65371a6b12623662
Comment 15•4 months ago (Assignee)
Comment 16•4 months ago
(In reply to Kash Shampur [:kshampur] ⌚EST from comment #14)
unfortunately it didn't fix the talos tests, which is bizarre since it worked when I SSH'd in...
but looking at the video artifacts of raptor-browsertime tests, you no longer see the popup banner as expected https://treeherder.mozilla.org/jobs?repo=try&tier=1%2C2%2C3&revision=4446ecb89b20f3bda913fe1b65371a6b12623662
Hm, I guess we're making progress. :)
How/why is the test failing? It's running too slowly?
Comment 17•4 months ago (Assignee)
(In reply to Andrew Erickson [:aerickson] from comment #16)
(In reply to Kash Shampur [:kshampur] ⌚EST from comment #14)
unfortunately it didn't fix the talos tests, which is bizarre since it worked when I SSH'd in...
but looking at the video artifacts of raptor-browsertime tests, you no longer see the popup banner as expected https://treeherder.mozilla.org/jobs?repo=try&tier=1%2C2%2C3&revision=4446ecb89b20f3bda913fe1b65371a6b12623662
Hm, I guess we're making progress. :)
How/why is the test failing? It's running too slowly?
It gets to a point and just hangs, consistently at these parts of the tests:
- in the damp-inspector test it hangs here: https://treeherder.mozilla.org/logviewer?job_id=528329071&repo=try&task=PgX05nvnSAyYorZRlfkXkw.0&lineNumber=3894
- in damp-webconsole it hangs here: https://treeherder.mozilla.org/logviewer?job_id=528350855&repo=try&task=CpeKEin6Sl6IYddrCkG51Q.0&lineNumber=2070
As mentioned, it was odd to me that it was passing while I was SSH'd into ms-239. Since the AppArmor update I haven't retried; I should have time tomorrow to re-investigate on the machine and try to reproduce the failure.
Comment 18•3 months ago (Assignee)
Quick update by harness/framework, mostly about what is failing.

Perftest (failing):
- perftest-linux-ml-perf-autofill
- perftest-linux-ml-summarizer-perf
Unclear on the following ones; however, looking at the "similar jobs" tab, they either don't run on cron or have been perma-failing for a while, so we can probably ignore them:
- perftest-linux-controlled
- perftest-linux-http3
- perftest-linux-perfstats
- perftest-linux-try-xpcshell
- perftest-linux-webpagetest-chrome

Raptor:
- a bunch of chrome tasks and all CaR tasks are failing
- otherwise the firefox tp6 tests look good
- (ignore tp7 tests)
- the Godot and js2 benchmarks fail, but only on chrome

IndexedDB:
seems to be all good:
https://treeherder.mozilla.org/jobs?repo=try&tier=1%2C2%2C3&searchStr=indexeddb&revision=5640ee2919a26a5c1009d9300267738b1ed11a5e

Upload (netperf machines):
all good except for CaR

Custom tests:
Throttled is perma-failing, but I checked the tree and it doesn't run on cron and has been perma-failing for a while

AWSY

Talos:
as previously mentioned, the following fail:
- damp-inspector
- damp-webconsole
- chromez

So in summary:
- CaR: all failing (possibly a chromedriver issue)
- Chrome: multiple failures across Raptor pageload + benchmarks
- Talos: 3 failures (2 damp + chromez)
- Perftest: 2 failing ML tests
Comment 19•3 months ago (Assignee)
CaR may be failing due to this as well: https://ubuntu.com/blog/ubuntu-23-10-restricted-unprivileged-user-namespaces
so we may need an AppArmor config for chromium-as-release.
Comment 20•3 months ago (Assignee)
This method seems to work, but it assumes there is an (old?) version of chrome already installed. That may not be the case on the netperf machines.
Comment 21•3 months ago (Assignee)
AppArmor config that should work for CaR, in /etc/apparmor.d/chrome-local:

abi <abi/4.0>,
include <tunables/global>

profile chrome-local /home/cltbld/tasks/task_*/fetches/chromium/Default/chrome flags=(unconfined) {
  userns,

  # Site-specific additions and overrides. See local/README for details.
  include if exists <local/chrome>
}

The naming shouldn't matter, I think... so I chose chrome-local to be consistent with firefox-local. It seemed to work on machine 239 after creating this file and running sudo systemctl restart apparmor.service.
This method probably makes more sense than the environment variable approach, both for consistency with Fx and because the netperf machines wouldn't/shouldn't have (regular) chrome installed.
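As a sanity check, a short sketch of how one could confirm the profiles are loaded and attached (commands assumed to be available on a stock Ubuntu 24.04 install; not part of the actual rollout):

    # list loaded profiles and check that chrome-local / firefox-local show up
    sudo aa-status | grep -E 'chrome-local|firefox-local'
    # confirm a running chrome process carries the profile label
    ps -eo pid,comm,label | grep chrome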
Comment 22•3 months ago
(In reply to Kash Shampur [:kshampur] ⌚EST from comment #21)
app armor config that should work for CaR: [the chrome-local profile quoted from comment 21]
Added to the 24.04 talos and netperf configs.
Added to the 24.04 talos and netperf configs.
Andy
Comment 23•3 months ago (Assignee)
Many CaR tasks work now thanks to the new configs.
Pageload is odd, though; the screen goes black sometimes... (happens on chrome as well)
Comment 24•3 months ago (Assignee)
I can repro the black screens on CaR on the SSH machine
Comment 25•3 months ago (Assignee)
Actually, the black screen recordings happen even on existing chromium and chrome tests in CI, so the failure is something else.
We could consider disabling both on Linux for the time being, since they don't even provide value right now on 18.04.
Comment 26•3 months ago (Assignee)
To re-summarize the state of things: all good except
- raptor:
  - chrome + CaR pageload (which we should probably disable, since almost all the recordings are just black screens)
  - godot benchmark on Chrome
- talos:
  - damp-inspector
  - damp-webconsole
  - chromez
- perftest:
  - ml-sumperf
  - ml-perf-autofill
Unable to repro the talos failures; going to try the ML ones next.
Comment 27•3 months ago (Assignee)
After discussing with :sparky and :aerickson, we'll partially roll out 24.04 to some machines (how many to convert is TBD).
For this, I will:
- make a list of tests that seem to be good on 24.04
- make a perfcompare comparing 18.04 vs. 24.04 (see the sketch below)
- anything that looks weird (high variance/bimodal/etc.) should probably stay on 18.04
- everything else (assuming the new baseline values make sense and are stable) should move to 24.04
This way any tests having issues on 24.04 just continue running on 18.04 and get investigated separately, and everything else can run on 24.04.
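For the perfcompare step, a hedged sketch of the kind of push I mean (mach try perf generates a base and a new push and prints a perf.compare link; the --show-all flag is assumed to be available to select arbitrary tasks rather than the predefined categories):

    # sketch: create base/new try pushes for perfcompare
    ./mach try perf --show-all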
Comment 28•3 months ago (Assignee)
(In reply to Kash Shampur [:kshampur] ⌚EST from comment #26)
resummarize state of things
all good except
- raptor:
  - chrome + CaR pageload (which we should probably disable since almost all the recordings are just black screens)
  - godot benchmark on Chrome
- talos:
  - damp-inspector
  - damp-webconsole
  - chromez
- perftest:
  - ml-sumperf
  - ml-perf-autofill

Actually, it looks like I already summarized it here, and I believe this is still correct: these tests are having issues on 24.04, should be investigated later, and in the meantime remain running on a subset of 18.04 machines.
So that means I will just make a perfcompare of all the other remaining tests, excluding these ones.
Comment 29•3 months ago
Could you put a 60-day window (or something reasonable) on fixing the remaining tests, or losing coverage? This would allow for reducing the time we have to split the pool between 18.04 and 24.04. I know we need to let the transition ride the trains.
Comment 30•3 months ago (Assignee)
Compare link (in progress though; some tests never got picked up, so I'll have to retrigger):
https://perf.compare/compare-results?baseRev=20b8bdb89e450aee867f2f5219653ac63e26ba61&baseRepo=try&newRev=463119152cc7fdf34707fe7cb7a119b138ccf90d&newRepo=try&framework=1&filter_confidence=high
(In reply to Joel Maher ( :jmaher ) (UTC -8) (PTO back normal Nov 17) from comment #29)
could you put a 60 day window (or something reasonable) on fixing the remaining tests, or losing coverage. This would allow for reducing the time we have to split the pool between 18.04 and 24.04. I know we need to let the transition ride the trains.
Hi Joel, could you clarify what you mean by "or losing coverage"? i.e. if the remaining "bad" tests are not fixed within the 60-day window, do you mean we just completely disable them (so that coverage is lost)?
Regarding the transition period: for the tests that do work, do we still want the (typical?) 14 days of having the "good" tests run on both 18.04 and 24.04 before completely swapping over to 24.04? (And after transitioning those tests, the "bad" tests remain on 18.04 for the remaining previously mentioned window?)
Comment 31•2 months ago (Assignee)
Alright, looks like most tasks have enough data... so re-reviewing the perfcompare:
- awsy: has some improvements, but nothing that seems alarming
- browsertime: has regressions only in buzzfeed on firefox; the graphs look okay (in that they were bimodal before and kind of look the same-ish still)
- mozperftest: improvement in a cloudflare test, nothing too weird when looking at the distribution graph
- talos: nothing here
- js-bench: I realize I never looked into this... submitted a task for that now. This isn't our harness, but it might be good to have a Try run on hand
I did notice some retriggers timed out, so I'm just going to wait on those to finish, but so far things are looking good!
Based on what Joel said, it probably makes sense to do a 50/50 or 60/40 split at first for 2 weeks, and then 80/20 after that for the ~60-day period (let me think about this more...)
Comment 32•2 months ago (Assignee)
Gah, the chromedriver fetches got stale? The custom-car retriggers aren't working, so I'll push another one just for the chrom* apps...
Comment 33•2 months ago (Assignee)
For the jsshell bench, it's unclear how its environment is set up... it's unable to find the js fetch for both base and new:
FileNotFoundError: [Errno 2] No such file or directory: '/home/cltbld/tasks/task_176176062647205/fetches/js'
but I see it passing in tree...
Well, jsshell was kind of missed in the original consideration... so in the interest of time, we'll keep that in the 18.04 pool and add it to the tests to investigate separately.
Comment 34•2 months ago (Assignee)
(In reply to Kash Shampur [:kshampur] ⌚EST from comment #32)
gah, the chromedriver fetches got stale? the custom-car retriggers arent working, so I'll push another one just for chrom* apps...
For chrome and CaR, all green, with some improvements to make note of:
https://perf.compare/compare-results?baseRev=6289fe5b1d6d47a79e1f517f0a345e7ac0a81b53&baseRepo=try&newRev=ec27885aa1f5031245a683982e915c7b2529f4c5&newRepo=try&framework=13&filter_confidence=none%2Chigh
Comment 35•2 months ago (Assignee)
I was going to skip the jsshell thing, but it turns out I just needed to pass --no-artifact.
So that is running in the meantime: https://perf.compare/compare-results?baseRev=58e326781b63224d53c932166131fded33c207a5&baseRepo=try&newRev=1c0285d32fbdc670ef28d22afd1937c0efe3b684&newRepo=try&framework=11
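For context, a hedged sketch of the kind of push this refers to (the fuzzy query below is a made-up example, not the exact one used; --no-artifact forces a compiled build instead of an artifact build):

    # sketch only: select the jsshell bench jobs and force a non-artifact build
    ./mach try fuzzy --no-artifact -q "'js-bench"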
Comment 36•2 months ago (Assignee)
Okay, just re-consolidating a proposed transition plan into one comment...

Part 1 (first ~2 weeks): 50/50 split
Roughly half the machines stay on 18.04, half move to 24.04. TBD how this might affect queue times...?
For this, write a (temporary) patch with taskcluster transforms to target machines for a given test list.
Note: the small subset of netperf machines may suffer, since there were only a few to begin with and now that will be split into an even smaller pool temporarily...

Part 2 (next 2-4 weeks): 80/20 split (24.04/18.04)
Start shifting most machines to 24.04, but still keep a small portion on 18.04, either to see if more regressions need to be looked at and/or to run the problematic tests on the smaller pool. At this point, ideally all "good" tests run only on 24.04, and only problematic tests stay on 18.04.

Part 3 (remaining 30-60 days): 95/5 or 90/10 split
Keep a few 18.04 machines around only for known-problem tests or a small daily sample. At the 60-day mark, disable 18.04 completely, file follow-ups for anything still broken, and accept temporary data loss.

It is possible we just consolidate Part 2 and Part 3 into one thing as well, since most tests are greened up. The only unknown is how problematic the new baseline values/regressions might be.
Comment 37•2 months ago
given this plan, when would part 1 start?
Comment 38•2 months ago (Assignee)
(In reply to Joel Maher ( :jmaher ) (UTC -8) (PTO back normal Nov 17) from comment #37)
given this plan, when would part 1 start?
In the process as of very recently. :aerickson is starting to get some more machines up and I'll begin writing a patch to target those machines
Comment 39•2 months ago
We were able to take over the Windows moonshot hardware, so we have plenty of hosts. We won't need to reduce the 18.04 pool.
I have ~50 hosts online (https://firefox-ci-tc.services.mozilla.com/provisioners/releng-hardware/worker-types/gecko-t-linux-talos-2404) and will shoot for 90 total for the pool (by Dec 5, possibly next week).
Comment 40•2 months ago
Have the non-hardware tests been looked at? awsy? talos-xperf?
Comment 41•2 months ago (Assignee)
(In reply to Joel Maher ( :jmaher ) (UTC -8) from comment #40)
has the non hardware tests been looked at? awsy? talos-xperf?
What qualifies as talos-xperf? According to this, xperf only runs on Windows: https://firefox-source-docs.mozilla.org/testing/perfdocs/talos.html#xperf
I have an earlier Try push with various talos & awsy jobs, but the artifacts are expired by now. Despite the 18.04 name on the job, it was pushed to a 24.04 pool with a worker override (sketch below).
I will at some point soon update the patch in Bug 2002063 with newer Try runs.
I've actually just re-pushed awsy and talos, so let me see if they still pass.
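For reference, a hedged sketch of what such a worker-override push can look like (the worker alias on the left-hand side is an assumption; the 24.04 pool name is the one from comment 39):

    # sketch: run the existing jobs against the 24.04 hardware pool instead of the default
    ./mach try fuzzy -q "'awsy" \
      --worker-override="t-linux-talos=releng-hardware/gecko-t-linux-talos-2404"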
Comment 42•2 months ago
the awsy here are on ubuntu 18.04:
https://treeherder.mozilla.org/jobs?repo=try&searchStr=awsy&revision=75847ce4655e9bea7cc63c15a9e22efd61aba827
Comment 43•2 months ago (Assignee)
(In reply to Joel Maher ( :jmaher ) (UTC -8) from comment #42)
the awsy here are on ubuntu 18.04:
https://treeherder.mozilla.org/jobs?repo=try&searchStr=awsy&revision=75847ce4655e9bea7cc63c15a9e22efd61aba827
I am confused. There are both 18 and 24 jobs there, and looking at the logs seems to suggest 24.04 as well?
Comment 44•2 months ago
thanks for correcting me; looks good.
Comment 45•1 month ago
We're at ~75 hosts in the pool. Still planning on delivering 90 hosts.
I don't think we'll be able to get more done before January (due to people being out on PTO), unless we really need them (please let me know).
Comment 46•1 month ago (Assignee)
(In reply to Andrew Erickson [:aerickson] from comment #45)
We're at ~75 hosts in the pool. Still planning on delivering 90 hosts.
I don't think we'll be able to get more done before January (due to people being out on PTO), unless we really need them (please let me know).
That should be plenty for now, thank you :aerickson!