Open Bug 1983694 Opened 5 months ago Updated 26 days ago

[meta] Resolve performance test issues on Ubuntu 24.04 CI tests

Categories

(Testing :: Performance, defect, P1)

defect

Tracking

(Not tracked)

People

(Reporter: kshampur, Assigned: kshampur)

References

(Depends on 5 open bugs, Blocks 1 open bug)

Details

(Keywords: meta, Whiteboard: [fxp])

Attachments

(3 files)

On the Linux 24.04 machines there are a few issues. Example push:
https://treeherder.mozilla.org/jobs?repo=try&tier=1%2C2%2C3&revision=9cf93b1fc0033b21075d62ac4ec52dd507c304e0

Some Talos tests perma-fail, and for browsertime there are some silent failures (a pop-up banner appears in some videos).

Just to name a couple of things to start with...

Going to add information to this bug as I go along.

Attached video 9-original.mp4

I added a pref to this try run https://treeherder.mozilla.org/jobs?repo=try&revision=947a3d196c87c4229706c96e6d0b0c5bb995c1cc

e.g. setting user_pref("security.sandbox.warn_unprivileged_namespaces", false); in the Raptor user.js, and the banner seems to be gone.
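
For local reproduction, this is all that needs to go into the test profile's user.js (a minimal sketch; the pref name is the one from the try run above, and the comment is just illustrative):

// user.js: suppress the Ubuntu 24.04 unprivileged-user-namespaces warning banner
user_pref("security.sandbox.warn_unprivileged_namespaces", false);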

Summary: Resolve performance tests issues on Ubuntu 2404 CI tests → Resolve performance test issues on Ubuntu 24.04 CI tests

Additional update: for the pop-up banner, following the recommendations here: https://support.mozilla.org/en-US/kb/linux-security-warning?as=u&utm_source=inproduct

it seems adding the suggested AppArmor config removes the pop-up (only tested on an Ubuntu VM on my Mac), so this means we don't have to set the pref.

:aerickson is looking into making it more dynamic in CI (I had hard-coded the path in the AppArmor config to the mozilla-central build).

Locally, the damp-inspector test seemed to pass (it perma-fails on CI), so maybe this was the cause of the failure?
I forgot to check whether it failed prior to adding that AppArmor config, so I am going to spin up another VM to test that (for some reason the AppArmor config seems to persist indefinitely...).

So damp-inspector did pass on the VM. At the moment I'm unable to reproduce that Talos perma-fail.

Depends on: 1983988

:aerickson is going to implement that AppArmor config on the machine, and we will re-verify the browsertime videos after.

For some reason I am unable to reproduce the Talos failures on my VM, perhaps due to it being aarch64? In the meantime I am getting set up with SSH access to one of the CI machines.

As pointed out by Kershaw, we also need to factor in the t-linux-netperf-1804 machines. It seems they were configured differently to create the proper networking conditions, so we'll also have to consider a 24.04 equivalent of the netperf machines.

I've SSH'd into one of the machines, and weirdly a Talos test (damp-inspector) that perma-fails when run on CI is passing when I run it there. Very weird. Will continue investigating.

Unable to replicate the damp-inspector failure; however, it seems I am sometimes able to replicate the damp-webconsole failure, so I will proceed to look into that one, hopefully with a solution that fixes all Talos failures... (at least the damp ones).

Okay, interesting: the webconsole test passed after I applied the AppArmor config. Hard to say if this is the fix for the other tests, but regardless we should deploy the AppArmor config as it also fixes Raptor.

Something like this would work:

# This profile allows everything and only exists to give the
# application a name instead of having the label "unconfined"
abi <abi/4.0>,
include <tunables/global>
profile firefox-local /home/cltbld/tasks/task_*/build/application/firefox/firefox flags=(unconfined) {
  userns,

  # Site-specific additions and overrides. See local/README for details.
  include if exists <local/firefox>
}

The wildcard is needed since the task_<id> changes each time in CI.
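
For reference, installing this by hand on a host looks roughly like the following (a sketch; the /etc/apparmor.d/firefox-local destination and file name are assumptions):

# sketch: install the profile and reload AppArmor so it takes effect
sudo cp firefox-local /etc/apparmor.d/firefox-local
sudo systemctl restart apparmor.service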

Oh, and great: we can ignore the tabswitch failures as well; we don't even run it and in fact intend to remove it from selection in Bug 1980619.

The chrome Talos test also seems to pass now on the SSH'd machine. This is promising. Now we just need to try all of it out on an actual CI run.

(In reply to Kash Shampur [:kshampur] ⌚EST from comment #10)

Something like this would work:

# This profile allows everything and only exists to give the
# application a name instead of having the label "unconfined"
abi <abi/4.0>,
include <tunables/global>
profile firefox-local /home/cltbld/tasks/task_*/build/application/firefox/firefox flags=(unconfined) {
  userns,

  # Site-specific additions and overrides. See local/README for details.
  include if exists <local/firefox>
}

The wildcard is needed since the task_<id> changes each time in CI.

This file is now being delivered to hosts in our small test pool (4 hosts) that are running the PR branch (https://github.com/mozilla-platform-ops/ronin_puppet/pull/908/).

Unfortunately it didn't fix the Talos tests, which is bizarre since it worked when I SSH'd in...

But looking at the video artifacts of the raptor-browsertime tests, you no longer see the pop-up banner, as expected: https://treeherder.mozilla.org/jobs?repo=try&tier=1%2C2%2C3&revision=4446ecb89b20f3bda913fe1b65371a6b12623662

(In reply to Kash Shampur [:kshampur] ⌚EST from comment #14)

Unfortunately it didn't fix the Talos tests, which is bizarre since it worked when I SSH'd in...

But looking at the video artifacts of the raptor-browsertime tests, you no longer see the pop-up banner, as expected: https://treeherder.mozilla.org/jobs?repo=try&tier=1%2C2%2C3&revision=4446ecb89b20f3bda913fe1b65371a6b12623662

Hm, I guess we're making progress. :)

How/why is the test failing? It's running too slowly?

(In reply to Andrew Erickson [:aerickson] from comment #16)

(In reply to Kash Shampur [:kshampur] ⌚EST from comment #14)

Unfortunately it didn't fix the Talos tests, which is bizarre since it worked when I SSH'd in...

But looking at the video artifacts of the raptor-browsertime tests, you no longer see the pop-up banner, as expected: https://treeherder.mozilla.org/jobs?repo=try&tier=1%2C2%2C3&revision=4446ecb89b20f3bda913fe1b65371a6b12623662

Hm, I guess we're making progress. :)

How/why is the test failing? It's running too slowly?

It gets to a point and just hangs, consistently at these parts of the tests:

In the damp-inspector test it hangs here: https://treeherder.mozilla.org/logviewer?job_id=528329071&repo=try&task=PgX05nvnSAyYorZRlfkXkw.0&lineNumber=3894

and in damp-webconsole it hangs here: https://treeherder.mozilla.org/logviewer?job_id=528350855&repo=try&task=CpeKEin6Sl6IYddrCkG51Q.0&lineNumber=2070

As mentioned, it was odd to me that it was passing while I was SSH'd into ms-239. Since the AppArmor update I haven't retried; I should have time tomorrow to re-investigate on the machine and reproduce the failure.

Quick update by harness/framework, mostly about what is failing:

Perftest:

  • perftest-linux-ml-perf-autofill
  • perftest-linux-ml-summarizer-perf

Unclear on the following ones; however, looking at the similar-jobs tab, they either don't run on cron or have been perma-failing for a while, so we can probably ignore them:

  • perftest-linux-controlled
  • perftest-linux-http3
  • perftest-linux-perfstats
  • perftest-linux-try-xpcshell
  • perftest-linux-webpagetest-chrome

Raptor:

https://treeherder.mozilla.org/jobs?repo=try&tier=1%2C2%2C3&revision=0acd08a6370e3edb0210ce1c11fbc5be8bef09f5&searchStr=tp
tp6:

  • a bunch of Chrome and all CaR tasks are failing
  • otherwise, the Firefox tp6 tests seem good
  • (ignore tp7 tests)

Benchmark:
https://treeherder.mozilla.org/jobs?repo=try&tier=1%2C2%2C3&revision=0acd08a6370e3edb0210ce1c11fbc5be8bef09f5&searchStr=benchmark

  • Godot
  • js2
    (^ but only on chrome)

IndexedDB
seems to be all good:
https://treeherder.mozilla.org/jobs?repo=try&tier=1%2C2%2C3&searchStr=indexeddb&revision=5640ee2919a26a5c1009d9300267738b1ed11a5e

Upload (netperf machines)
all good except for CaR

Custom tests:
Throttled is perma-failing, but I checked the tree and it doesn't run on cron and has been perma-failing for a while.

AWSY

all good https://treeherder.mozilla.org/jobs?repo=try&tier=1%2C2%2C3&searchStr=awsy&revision=eec95859eba01fc1a7e31253b8a4ed997b2d4835

Talos

As previously mentioned, the following fail:

  • damp-inspector
  • damp-webconsole
  • chromez

So, in summary...

  • CaR: all failing (possibly chromedriver issue)
  • Chrome: multiple failures across Raptor pageload + benchmarks
  • Talos: 3 failures (2 damp + chromez)
  • Perftest: 2 failing ML tests

CaR may be failing due to this as well https://ubuntu.com/blog/ubuntu-23-10-restricted-unprivileged-user-namespaces

So we may need an AppArmor config for chromium-as-release as well.
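
A quick way to check whether that restriction is active on a given host is the sysctl below (a sketch; on 24.04 it should report 1 when unprivileged user namespaces are restricted):

# 1 = unprivileged user namespaces restricted (the 23.10+/24.04 default)
sysctl kernel.apparmor_restrict_unprivileged_userns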

This method seems to work:

https://chromium.googlesource.com/chromium/src/+/main/docs/security/apparmor-userns-restrictions.md#option-3_the-safest-way

But that assumes there is an (old?) version of Chrome already installed, which may not be the case on the netperf machines.

AppArmor config that should work for CaR:

/etc/apparmor.d/chrome-local

abi <abi/4.0>,
include <tunables/global>
profile chrome-local /home/cltbld/tasks/task_*/fetches/chromium/Default/chrome flags=(unconfined) {
  userns,

  # Site-specific additions and overrides. See local/README for details.
  include if exists <local/chrome>
}

Naming shouldn't matter, I think... so I chose chrome-local to be consistent with firefox-local. It seemed to work on machine 239 after creating this file and running sudo systemctl restart apparmor.service.

This method probably makes more sense than the environment variable approach, both for consistency with Fx and because the netperf machines wouldn't/shouldn't have (regular) Chrome installed.
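
As an alternative to restarting the whole service, just this one profile can be reloaded and then checked (a sketch using standard AppArmor tooling):

# reload only the new chrome-local profile, then confirm it is loaded
sudo apparmor_parser -r /etc/apparmor.d/chrome-local
sudo aa-status | grep chrome-local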

(In reply to Kash Shampur [:kshampur] ⌚EST from comment #21)

AppArmor config that should work for CaR:

/etc/apparmor.d/chrome-local

abi <abi/4.0>,
include <tunables/global>
profile chrome-local /home/cltbld/tasks/task_*/fetches/chromium/Default/chrome flags=(unconfined) {
  userns,

  # Site-specific additions and overrides. See local/README for details.
  include if exists <local/chrome>
}

Naming shouldn't matter, I think... so I chose chrome-local to be consistent with firefox-local. It seemed to work on machine 239 after creating this file and running sudo systemctl restart apparmor.service.

This method probably makes more sense than the environment variable approach, both for consistency with Fx and because the netperf machines wouldn't/shouldn't have (regular) Chrome installed.

Added to the 24.04 talos and netperf configs.

Andy

https://treeherder.mozilla.org/jobs?repo=try&tier=1%2C2%2C3&searchStr=m-car&revision=63e79b8f87369add3eec04194bb5539fdfc325e2&selectedTaskRun=NkCsBCw7Sz2c2q4NTi0cKw.0

Many CaR tasks work now thanks to the new configs.

Pageload is odd; the screen goes black sometimes... (happens on Chrome as well).

I can repro the black screens on CaR on the SSH'd machine.

Actually, the black-screen recordings happen even on existing Chromium and Chrome tests in CI, so the failure is something else.
We could consider disabling both on Linux for the time being since they don't even provide value right now on 18.04.

To re-summarize the state of things:

All good except:

  • raptor:
    • Chrome + CaR pageload (which we should probably disable since almost all the recordings are just black screens)
    • the Godot benchmark on Chrome
  • talos:
    • damp-inspector
    • damp-webconsole
    • chromez
  • perftest:
    • ml-sumperf
    • ml-perf-autofill

Unable to repro the Talos failures; going to try the ML ones.

After discussing with :sparky and :aerickson, we'll partially roll out 24.04 to some machines (how many to convert is TBD).

For this, I will:

  • make a list of tests that seem to be good on 24.04
  • make a perfcompare comparing 18.04 and 24.04
    • anything that looks weird (high variance/bimodal/etc.) should probably stay on 18.04
    • everything else (assuming the new baseline values make sense and are stable) should be on 24.04

This way, any tests having issues running on 24.04 just continue running on 18.04 and get investigated separately. Everything else can run on 24.04.

(In reply to Kash Shampur [:kshampur] ⌚EST from comment #26)

To re-summarize the state of things:

All good except:

  • raptor:
    • Chrome + CaR pageload (which we should probably disable since almost all the recordings are just black screens)
    • the Godot benchmark on Chrome
  • talos:
    • damp-inspector
    • damp-webconsole
    • chromez
  • perftest:
    • ml-sumperf
    • ml-perf-autofill

Actually, it looks like I already summarized it here, and I believe this is still correct. So these tests are having issues on 24.04 and should be investigated later; in the meantime they remain running on a subset of 18.04 machines.

So that means I will just make a perfcompare of all the remaining tests, excluding these ones.

Could you put a 60-day window (or something reasonable) on fixing the remaining tests, or losing coverage? This would allow for reducing the time we have to split the pool between 18.04 and 24.04. I know we need to let the transition ride the trains.

Compare link (in progress though, and some tests never got picked up so I'll have to retrigger):
https://perf.compare/compare-results?baseRev=20b8bdb89e450aee867f2f5219653ac63e26ba61&baseRepo=try&newRev=463119152cc7fdf34707fe7cb7a119b138ccf90d&newRepo=try&framework=1&filter_confidence=high

(In reply to Joel Maher ( :jmaher ) (UTC -8) (PTO back normal Nov 17) from comment #29)

Could you put a 60-day window (or something reasonable) on fixing the remaining tests, or losing coverage? This would allow for reducing the time we have to split the pool between 18.04 and 24.04. I know we need to let the transition ride the trains.

Hi Joel, could you clarify what you mean by "or losing coverage"? I.e., if the remaining "bad" tests are not fixed in the 60-day window, do you mean we just completely disable them (and so that coverage is lost)?

Regarding the transition period: for the tests that do work, do we still want to do the (typical?) 14 days of having the "good" tests run on both 18.04 and 24.04 before completely swapping them over to 24.04? (And after transitioning those tests, the "bad" tests remain on 18.04 for the previously mentioned window?)

Alright, it looks like most tasks have enough data... so, re-reviewing the perfcompare:

awsy: has some improvements but nothing that seems alarming

browsertime: has regressions only in buzzfeed on Firefox; the graphs look okay (in that they were bimodal before and still look roughly the same).

mozperftest: improvement in a cloudflare test, nothing too weird when looking at the distribution graph

talos: nothing here

js-bench: I realize I never looked into this... submitted a task for that now. This isn't our harness, but it might be good to have a Try run on hand.


I did notice some retriggers timed out, so I'm just going to wait on those to finish, but so far things are looking good!

Based on what Joel said, it probably makes sense to do a 50/50 or 60/40 split at first for 2 weeks, and then 80/20 after that for the ~60-day period (let me think about this more...).

Gah, the chromedriver fetches got stale? The custom-car retriggers aren't working, so I'll push another one just for the chrom* apps...

For the jsshell bench, it's unclear how its environment is set up... it's unable to find the js fetch for both base and new:
FileNotFoundError: [Errno 2] No such file or directory: '/home/cltbld/tasks/task_176176062647205/fetches/js'

But I see it passing in the tree...

Well, jsshell was kind of missed in the original consideration... so in the interest of time we'll keep it in the 18.04 pool and add it to the set of tests to investigate separately.

(In reply to Kash Shampur [:kshampur] ⌚EST from comment #32)

Gah, the chromedriver fetches got stale? The custom-car retriggers aren't working, so I'll push another one just for the chrom* apps...

For Chrome and CaR, all green and some improvements to make note of:
https://perf.compare/compare-results?baseRev=6289fe5b1d6d47a79e1f517f0a345e7ac0a81b53&baseRepo=try&newRev=ec27885aa1f5031245a683982e915c7b2529f4c5&newRepo=try&framework=13&filter_confidence=none%2Chigh

I was going to skip the jsshell thing, but it turns out I just needed to pass --no-artifact.
So that is running in the meantime: https://perf.compare/compare-results?baseRev=58e326781b63224d53c932166131fded33c207a5&baseRepo=try&newRev=1c0285d32fbdc670ef28d22afd1937c0efe3b684&newRepo=try&framework=11
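
For reference, the retrigger just needed a compiled build, i.e. something along these lines (a sketch; the exact fuzzy query is an assumption):

# push the jsshell bench jobs with a non-artifact (compiled) build
./mach try fuzzy --no-artifact -q "'js-bench"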

Okay, just re-consolidating a proposed transition plan into one comment...

Part 1 (first ~2 weeks): 50/50 split

Roughly half the machines stay on 18.04, half move to 24.04. TBD how this might affect queue times...
For this, write a (temporary) patch with taskcluster transforms to target machines for a given test list.

Note: the small subset of netperf machines may suffer since there were only a few to begin with, and now that will be split into an even smaller pool temporarily...

Part 2 (next 2-4 weeks): 80/20 split (24.04/18.04)

Start shifting most machines to 24.04, but still keep a small portion on 18.04, either to see if more regressions need to be looked at and/or to run the problematic tests on the smaller pool. At this point, ideally all "good" tests only run on 24.04 and only problematic tests stay on 18.04.

Part 3 (remaining 30-60 days): 95/5 or 90/10 split

Keep a few 18.04 machines around only for known-problem tests or a small daily sample. At the 60-day mark, disable 18.04 completely, file follow-ups for anything still broken, and accept temporary data loss.


It is possible we just consolidate Part 2 and Part 3 into one phase as well, since most tests are greened up. The only unknown is how problematic the new baseline values/regressions might be.

Given this plan, when would Part 1 start?

(In reply to Joel Maher ( :jmaher ) (UTC -8) (PTO back normal Nov 17) from comment #37)

Given this plan, when would Part 1 start?

It's in process as of very recently: :aerickson is starting to get some more machines up, and I'll begin writing a patch to target those machines.

We were able to take over the Windows moonshot hardware, so we have plenty of hosts. We won't need to reduce the 18.04 pool.

I have ~50 hosts online (https://firefox-ci-tc.services.mozilla.com/provisioners/releng-hardware/worker-types/gecko-t-linux-talos-2404) and will shoot for 90 total for the pool (by Dec 5, possibly next week).

Depends on: 2002063

Have the non-hardware tests been looked at? awsy? talos-xperf?

(In reply to Joel Maher ( :jmaher ) (UTC -8) from comment #40)

Have the non-hardware tests been looked at? awsy? talos-xperf?

What qualifies as talos-xperf? According to this, xperf only runs on Windows: https://firefox-source-docs.mozilla.org/testing/perfdocs/talos.html#xperf

I have an earlier Try push with various talos & awsy jobs, but the artifacts have expired by now. Despite the 18.04 name on the job, it was pushed to a 24.04 pool with a worker override.

I will at some point soon update the patch in Bug 2002063 with newer Try runs.

I've actually just re-pushed awsy and talos, so let me see if they still pass.

(In reply to Joel Maher ( :jmaher ) (UTC -8) from comment #42)

the awsy here are on Ubuntu 18.04:
https://treeherder.mozilla.org/jobs?repo=try&searchStr=awsy&revision=75847ce4655e9bea7cc63c15a9e22efd61aba827

I am confused; there is both 18.04 and 24.04 there, and looking at the logs seems to suggest 24.04 as well?

Thanks for correcting me; looks good.

We're at ~75 hosts in the pool. Still planning on delivering 90 hosts.

I don't think we'll be able to get more done before January (due to people being out on PTO), unless we really need them (please let me know).

(In reply to Andrew Erickson [:aerickson] from comment #45)

We're at ~75 hosts in the pool. Still planning on delivering 90 hosts.

I don't think we'll be able to get more done before January (due to people being out on PTO), unless we really need them (please let me know).

That should be plenty for now, thank you :aerickson!

Keywords: meta
Summary: Resolve performance test issues on Ubuntu 24.04 CI tests → [meta] Resolve performance test issues on Ubuntu 24.04 CI tests
Depends on: 2008057
Depends on: 2008058
Depends on: 2008059
