Open Bug 1983694 Opened 5 months ago Updated 26 days ago

[meta] Resolve performance test issues on Ubuntu 24.04 CI tests

Categories

(Testing :: Performance, defect, P1)

defect

Tracking

(Not tracked)

People

(Reporter: kshampur, Assigned: kshampur)

References

(Depends on 5 open bugs, Blocks 1 open bug)

Details

(Keywords: meta, Whiteboard: [fxp])

Attachments

(3 files)

On the Linux 24.04 machines there are a few issues. Example push:
https://treeherder.mozilla.org/jobs?repo=try&tier=1%2C2%2C3&revision=9cf93b1fc0033b21075d62ac4ec52dd507c304e0

Some Talos tests perma-fail, and for browsertime there are some silent failures (a pop-up banner appears in some videos).

Just to name a couple of things to start with...

Going to add information to this bug as I go along.

Attached video 9-original.mp4

I added a pref to this try run https://treeherder.mozilla.org/jobs?repo=try&revision=947a3d196c87c4229706c96e6d0b0c5bb995c1cc

e.g. setting user_pref("security.sandbox.warn_unprivileged_namespaces", false); in the Raptor user.js, and the banner seems to be gone.
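
For local reproduction, this is all that needs to go into the test profile's user.js (a minimal sketch; the pref name is the one from the try run above, and the comment is just illustrative):

// user.js: suppress the Ubuntu 24.04 unprivileged-user-namespaces warning banner
user_pref("security.sandbox.warn_unprivileged_namespaces", false);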

Summary: Resolve performance tests issues on Ubuntu 2404 CI tests → Resolve performance test issues on Ubuntu 24.04 CI tests

Additional update: for the pop-up banner, following the recommendations here: https://support.mozilla.org/en-US/kb/linux-security-warning?as=u&utm_source=inproduct

it seems adding the suggested AppArmor config removes the pop-up (only tested on an Ubuntu VM on my Mac), so this means we don't have to set the pref.

:aerickson is looking into making it more dynamic in CI (I had hard-coded the path in the AppArmor config to the mozilla-central build).

Locally, the damp-inspector test seemed to pass (it perma-fails on CI), so maybe this was the cause of the failure?
I forgot to check whether it failed prior to adding that AppArmor config, so I am going to spin up another VM to test that (for some reason the AppArmor config seems to persist indefinitely...).

So damp-inspector did pass on the VM. At the moment I'm unable to reproduce that Talos perma-fail.

Depends on: 1983988

:aerickson is going to implement that AppArmor config on the machine, and we will re-verify the browsertime videos after.

For some reason I am unable to reproduce the Talos failures on my VM, perhaps due to it being aarch64? In the meantime I am getting set up with SSH access to one of the CI machines.

As pointed out by Kershaw, we also need to factor in the t-linux-netperf-1804 machines. It seems they were configured differently to create the proper networking conditions, so we'll also have to consider a 24.04 equivalent of the netperf machines.

I've SSH'd into one of the machines, and weirdly a Talos test (damp-inspector) that perma-fails when run on CI is passing when I run it there. Very weird. Will continue investigating.

Unable to replicate the damp-inspector failure; however, it seems I am sometimes able to replicate the damp-webconsole failure, so I will proceed to look into that one, hopefully with a solution that fixes all Talos failures... (at least the damp ones).

Okay, interesting: the webconsole test passed after I applied the AppArmor config. Hard to say if this is the fix for the other tests, but regardless we should deploy the AppArmor config as it also fixes Raptor.

Something like this would work:

# This profile allows everything and only exists to give the
# application a name instead of having the label "unconfined"
abi <abi/4.0>,
include <tunables/global>
profile firefox-local /home/cltbld/tasks/task_*/build/application/firefox/firefox flags=(unconfined) {
  userns,

  # Site-specific additions and overrides. See local/README for details.
  include if exists <local/firefox>
}

The wildcard is needed since the task_<id> changes each time in CI.
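
For reference, installing this by hand on a host looks roughly like the following (a sketch; the /etc/apparmor.d/firefox-local destination and file name are assumptions):

# sketch: install the profile and reload AppArmor so it takes effect
sudo cp firefox-local /etc/apparmor.d/firefox-local
sudo systemctl restart apparmor.service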

Oh, and great: we can ignore the tabswitch failures as well; we don't even run it and in fact intend to remove it from selection in Bug 1980619.

The chrome Talos test also seems to pass now on the SSH'd machine. This is promising. Now we just need to try all of it out on an actual CI run.

(In reply to Kash Shampur [:kshampur] ⌚EST from comment #10)

Something like this would work:

# This profile allows everything and only exists to give the
# application a name instead of having the label "unconfined"
abi <abi/4.0>,
include <tunables/global>
profile firefox-local /home/cltbld/tasks/task_*/build/application/firefox/firefox flags=(unconfined) {
  userns,

  # Site-specific additions and overrides. See local/README for details.
  include if exists <local/firefox>
}

The wildcard is needed since the task_<id> changes each time in CI.

This file is now being delivered to hosts in our small test pool (4 hosts) that are running the PR branch (https://github.com/mozilla-platform-ops/ronin_puppet/pull/908/).

Unfortunately it didn't fix the Talos tests, which is bizarre since it worked when I SSH'd in...

But looking at the video artifacts of the raptor-browsertime tests, you no longer see the pop-up banner, as expected: https://treeherder.mozilla.org/jobs?repo=try&tier=1%2C2%2C3&revision=4446ecb89b20f3bda913fe1b65371a6b12623662

(In reply to Kash Shampur [:kshampur] ⌚EST from comment #14)

Unfortunately it didn't fix the Talos tests, which is bizarre since it worked when I SSH'd in...

But looking at the video artifacts of the raptor-browsertime tests, you no longer see the pop-up banner, as expected: https://treeherder.mozilla.org/jobs?repo=try&tier=1%2C2%2C3&revision=4446ecb89b20f3bda913fe1b65371a6b12623662

Hm, I guess we're making progress. :)

How/why is the test failing? It's running too slowly?

(In reply to Andrew Erickson [:aerickson] from comment #16)

(In reply to Kash Shampur [:kshampur] ⌚EST from comment #14)

Unfortunately it didn't fix the Talos tests, which is bizarre since it worked when I SSH'd in...

But looking at the video artifacts of the raptor-browsertime tests, you no longer see the pop-up banner, as expected: https://treeherder.mozilla.org/jobs?repo=try&tier=1%2C2%2C3&revision=4446ecb89b20f3bda913fe1b65371a6b12623662

Hm, I guess we're making progress. :)

How/why is the test failing? It's running too slowly?

It gets to a point and just hangs, consistently at these parts of the tests:

In the damp-inspector test it hangs here: https://treeherder.mozilla.org/logviewer?job_id=528329071&repo=try&task=PgX05nvnSAyYorZRlfkXkw.0&lineNumber=3894

and in damp-webconsole it hangs here: https://treeherder.mozilla.org/logviewer?job_id=528350855&repo=try&task=CpeKEin6Sl6IYddrCkG51Q.0&lineNumber=2070

As mentioned, it was odd to me that it was passing while I was SSH'd into ms-239. Since the AppArmor update I haven't retried; I should have time tomorrow to re-investigate on the machine and reproduce the failure.

Quick update by harness/framework, mostly about what is failing:

Perftest:

  • perftest-linux-ml-perf-autofill
  • perftest-linux-ml-summarizer-perf

Unclear on the following ones; however, looking at the similar-jobs tab, they either don't run on cron or have been perma-failing for a while, so we can probably ignore them:

  • perftest-linux-controlled
  • perftest-linux-http3
  • perftest-linux-perfstats
  • perftest-linux-try-xpcshell
  • perftest-linux-webpagetest-chrome

Raptor:

https://treeherder.mozilla.org/jobs?repo=try&tier=1%2C2%2C3&revision=0acd08a6370e3edb0210ce1c11fbc5be8bef09f5&searchStr=tp
tp6:

  • a bunch of Chrome and all CaR tasks are failing
  • otherwise, the Firefox tp6 tests seem good
  • (ignore tp7 tests)

Benchmark:
https://treeherder.mozilla.org/jobs?repo=try&tier=1%2C2%2C3&revision=0acd08a6370e3edb0210ce1c11fbc5be8bef09f5&searchStr=benchmark

  • Godot
  • js2
    (^ but only on chrome)

IndexedDB
seems to be all good:
https://treeherder.mozilla.org/jobs?repo=try&tier=1%2C2%2C3&searchStr=indexeddb&revision=5640ee2919a26a5c1009d9300267738b1ed11a5e

Upload (netperf machines)
all good except for CaR

Custom tests:
Throttled is perma-failing, but I checked the tree and it doesn't run on cron and has been perma-failing for a while.

AWSY

all good https://treeherder.mozilla.org/jobs?repo=try&tier=1%2C2%2C3&searchStr=awsy&revision=eec95859eba01fc1a7e31253b8a4ed997b2d4835

Talos

As previously mentioned, the following fail:

  • damp-inspector
  • damp-webconsole
  • chromez

So, in summary...

  • CaR: all failing (possibly chromedriver issue)
  • Chrome: multiple failures across Raptor pageload + benchmarks
  • Talos: 3 failures (2 damp + chromez)
  • Perftest: 2 failing ML tests

CaR may be failing due to this as well https://ubuntu.com/blog/ubuntu-23-10-restricted-unprivileged-user-namespaces

So we may need an AppArmor config for chromium-as-release as well.
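
A quick way to check whether that restriction is active on a given host is the sysctl below (a sketch; on 24.04 it should report 1 when unprivileged user namespaces are restricted):

# 1 = unprivileged user namespaces restricted (the 23.10+/24.04 default)
sysctl kernel.apparmor_restrict_unprivileged_userns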

This method seems to work:

https://chromium.googlesource.com/chromium/src/+/main/docs/security/apparmor-userns-restrictions.md#option-3_the-safest-way

But that assumes there is an (old?) version of Chrome already installed, which may not be the case on the netperf machines.

AppArmor config that should work for CaR:

/etc/apparmor.d/chrome-local

abi <abi/4.0>,
include <tunables/global>
profile chrome-local /home/cltbld/tasks/task_*/fetches/chromium/Default/chrome flags=(unconfined) {
  userns,

  # Site-specific additions and overrides. See local/README for details.
  include if exists <local/chrome>
}

Naming shouldn't matter, I think... so I chose chrome-local to be consistent with firefox-local. It seemed to work on machine 239 after creating this file and running sudo systemctl restart apparmor.service.

This method probably makes more sense than the environment variable approach, both for consistency with Fx and because the netperf machines wouldn't/shouldn't have (regular) Chrome installed.
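
As an alternative to restarting the whole service, just this one profile can be reloaded and then checked (a sketch using standard AppArmor tooling):

# reload only the new chrome-local profile, then confirm it is loaded
sudo apparmor_parser -r /etc/apparmor.d/chrome-local
sudo aa-status | grep chrome-local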

(In reply to Kash Shampur [:kshampur] ⌚EST from comment #21)

AppArmor config that should work for CaR:

/etc/apparmor.d/chrome-local

abi <abi/4.0>,
include <tunables/global>
profile chrome-local /home/cltbld/tasks/task_*/fetches/chromium/Default/chrome flags=(unconfined) {
  userns,

  # Site-specific additions and overrides. See local/README for details.
  include if exists <local/chrome>
}

Naming shouldn't matter, I think... so I chose chrome-local to be consistent with firefox-local. It seemed to work on machine 239 after creating this file and running sudo systemctl restart apparmor.service.

This method probably makes more sense than the environment variable approach, both for consistency with Fx and because the netperf machines wouldn't/shouldn't have (regular) Chrome installed.

Added to the 24.04 talos and netperf configs.

Andy

https://treeherder.mozilla.org/jobs?repo=try&tier=1%2C2%2C3&searchStr=m-car&revision=63e79b8f87369add3eec04194bb5539fdfc325e2&selectedTaskRun=NkCsBCw7Sz2c2q4NTi0cKw.0

Many CaR tasks work now thanks to the new configs.

Pageload is odd; the screen goes black sometimes... (happens on Chrome as well).

I can repro the black screens on CaR on the SSH'd machine.

Actually, the black-screen recordings happen even on existing Chromium and Chrome tests in CI, so the failure is something else.
We could consider disabling both on Linux for the time being since they don't even provide value right now on 18.04.

To re-summarize the state of things:

All good except:

  • raptor:
    • Chrome + CaR pageload (which we should probably disable since almost all the recordings are just black screens)
    • the Godot benchmark on Chrome
  • talos:
    • damp-inspector
    • damp-webconsole
    • chromez
  • perftest:
    • ml-sumperf
    • ml-perf-autofill

Unable to repro the Talos failures; going to try the ML ones.

After discussing with :sparky and :aerickson, we'll partially roll out 24.04 to some machines (how many to convert is TBD).

For this, I will:

  • make a list of tests that seem to be good on 24.04
  • make a perfcompare comparing 18.04 and 24.04
    • anything that looks weird (high variance/bimodal/etc.) should probably stay on 18.04
    • everything else (assuming the new baseline values make sense and are stable) should be on 24.04

This way, any tests having issues running on 24.04 just continue running on 18.04 and get investigated separately. Everything else can run on 24.04.

(In reply to Kash Shampur [:kshampur] ⌚EST from comment #26)

To re-summarize the state of things:

All good except:

  • raptor:
    • Chrome + CaR pageload (which we should probably disable since almost all the recordings are just black screens)
    • the Godot benchmark on Chrome
  • talos:
    • damp-inspector
    • damp-webconsole
    • chromez
  • perftest:
    • ml-sumperf
    • ml-perf-autofill

Actually, it looks like I already summarized it here, and I believe this is still correct. So these tests are having issues on 24.04 and should be investigated later; in the meantime they remain running on a subset of 18.04 machines.

So that means I will just make a perfcompare of all the remaining tests, excluding these ones.

Could you put a 60-day window (or something reasonable) on fixing the remaining tests, or losing coverage? This would allow for reducing the time we have to split the pool between 18.04 and 24.04. I know we need to let the transition ride the trains.

Compare link (in progress though, and some tests never got picked up so I'll have to retrigger):
https://perf.compare/compare-results?baseRev=20b8bdb89e450aee867f2f5219653ac63e26ba61&baseRepo=try&newRev=463119152cc7fdf34707fe7cb7a119b138ccf90d&newRepo=try&framework=1&filter_confidence=high

(In reply to Joel Maher ( :jmaher ) (UTC -8) (PTO back normal Nov 17) from comment #29)

Could you put a 60-day window (or something reasonable) on fixing the remaining tests, or losing coverage? This would allow for reducing the time we have to split the pool between 18.04 and 24.04. I know we need to let the transition ride the trains.

Hi Joel, could you clarify what you mean by "or losing coverage"? I.e., if the remaining "bad" tests are not fixed in the 60-day window, do you mean we just completely disable them (and so that coverage is lost)?

Regarding the transition period: for the tests that do work, do we still want to do the (typical?) 14 days of having the "good" tests run on both 18.04 and 24.04 before completely swapping them over to 24.04? (And after transitioning those tests, the "bad" tests remain on 18.04 for the previously mentioned window?)

Alright, it looks like most tasks have enough data... so, re-reviewing the perfcompare:

awsy: has some improvements but nothing that seems alarming

browsertime: has regressions only in buzzfeed on Firefox; the graphs look okay (in that they were bimodal before and still look roughly the same).

mozperftest: improvement in a cloudflare test, nothing too weird when looking at the distribution graph

talos: nothing here

js-bench: I realize I never looked into this... submitted a task for that now. This isn't our harness, but it might be good to have a Try run on hand.


I did notice some retriggers timed out, so I'm just going to wait on those to finish, but so far things are looking good!

Based on what Joel said, it probably makes sense to do a 50/50 or 60/40 split at first for 2 weeks, and then 80/20 after that for the ~60-day period (let me think about this more...).

Gah, the chromedriver fetches got stale? The custom-car retriggers aren't working, so I'll push another one just for the chrom* apps...

For the jsshell bench, it's unclear how its environment is set up... it's unable to find the js fetch for both base and new:
FileNotFoundError: [Errno 2] No such file or directory: '/home/cltbld/tasks/task_176176062647205/fetches/js'

But I see it passing in the tree...

Well, jsshell was kind of missed in the original consideration... so in the interest of time we'll keep it in the 18.04 pool and add it to the set of tests to investigate separately.

(In reply to Kash Shampur [:kshampur] ⌚EST from comment #32)

Gah, the chromedriver fetches got stale? The custom-car retriggers aren't working, so I'll push another one just for the chrom* apps...

For Chrome and CaR, all green and some improvements to make note of:
https://perf.compare/compare-results?baseRev=6289fe5b1d6d47a79e1f517f0a345e7ac0a81b53&baseRepo=try&newRev=ec27885aa1f5031245a683982e915c7b2529f4c5&newRepo=try&framework=13&filter_confidence=none%2Chigh

I was going to skip the jsshell thing, but it turns out I just needed to pass --no-artifact.
So that is running in the meantime: https://perf.compare/compare-results?baseRev=58e326781b63224d53c932166131fded33c207a5&baseRepo=try&newRev=1c0285d32fbdc670ef28d22afd1937c0efe3b684&newRepo=try&framework=11
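
For reference, the retrigger just needed a compiled build, i.e. something along these lines (a sketch; the exact fuzzy query is an assumption):

# push the jsshell bench jobs with a non-artifact (compiled) build
./mach try fuzzy --no-artifact -q "'js-bench"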

Okay, just re-consolidating a proposed transition plan into one comment...

Part 1 (first ~2 weeks): 50/50 split

Roughly half the machines stay on 18.04, half move to 24.04. TBD how this might affect queue times...
For this, write a (temporary) patch with taskcluster transforms to target machines for a given test list.

Note: the small subset of netperf machines may suffer since there were only a few to begin with, and now that will be split into an even smaller pool temporarily...

Part 2 (next 2-4 weeks): 80/20 split (24.04/18.04)

Start shifting most machines to 24.04, but still keep a small portion on 18.04, either to see if more regressions need to be looked at and/or to run the problematic tests on the smaller pool. At this point, ideally all "good" tests only run on 24.04 and only problematic tests stay on 18.04.

Part 3 (remaining 30-60 days): 95/5 or 90/10 split

Keep a few 18.04 machines around only for known-problem tests or a small daily sample. At the 60-day mark, disable 18.04 completely, file follow-ups for anything still broken, and accept temporary data loss.


It is possible we just consolidate Part 2 and Part 3 into one phase as well, since most tests are greened up. The only unknown is how problematic the new baseline values/regressions might be.

Given this plan, when would Part 1 start?

(In reply to Joel Maher ( :jmaher ) (UTC -8) (PTO back normal Nov 17) from comment #37)

Given this plan, when would Part 1 start?

It's in process as of very recently: :aerickson is starting to get some more machines up, and I'll begin writing a patch to target those machines.

We were able to take over the Windows moonshot hardware, so we have plenty of hosts. We won't need to reduce the 18.04 pool.

I have ~50 hosts online (https://firefox-ci-tc.services.mozilla.com/provisioners/releng-hardware/worker-types/gecko-t-linux-talos-2404) and will shoot for 90 total for the pool (by Dec 5, possibly next week).

Depends on: 2002063

Have the non-hardware tests been looked at? awsy? talos-xperf?

(In reply to Joel Maher ( :jmaher ) (UTC -8) from comment #40)

Have the non-hardware tests been looked at? awsy? talos-xperf?

What qualifies as talos-xperf? According to this, xperf only runs on Windows: https://firefox-source-docs.mozilla.org/testing/perfdocs/talos.html#xperf

I have an earlier Try push with various talos & awsy jobs, but the artifacts have expired by now. Despite the 18.04 name on the job, it was pushed to a 24.04 pool with a worker override.

I will at some point soon update the patch in Bug 2002063 with newer Try runs.

I've actually just re-pushed awsy and talos, so let me see if they still pass.

(In reply to Joel Maher ( :jmaher ) (UTC -8) from comment #42)

the awsy here are on Ubuntu 18.04:
https://treeherder.mozilla.org/jobs?repo=try&searchStr=awsy&revision=75847ce4655e9bea7cc63c15a9e22efd61aba827

I am confused; there is both 18.04 and 24.04 there, and looking at the logs seems to suggest 24.04 as well?

Thanks for correcting me; looks good.

We're at ~75 hosts in the pool. Still planning on delivering 90 hosts.

I don't think we'll be able to get more done before January (due to people being out on PTO), unless we really need them (please let me know).

(In reply to Andrew Erickson [:aerickson] from comment #45)

We're at ~75 hosts in the pool. Still planning on delivering 90 hosts.

I don't think we'll be able to get more done before January (due to people being out on PTO), unless we really need them (please let me know).

That should be plenty for now, thank you :aerickson!

Keywords: meta
Summary: Resolve performance test issues on Ubuntu 24.04 CI tests → [meta] Resolve performance test issues on Ubuntu 24.04 CI tests
Depends on: 2008057
Depends on: 2008058
Depends on: 2008059
