Open Bug 1814051 · Opened 2 years ago · Updated 10 months ago

aarch64 workers of the NSS continuous integration are failing (since August 2021)

Categories

(Infrastructure & Operations :: RelOps: General, defect, P2)

Tracking

(Not tracked)

People

(Reporter: beurdouche, Unassigned)

References

(Depends on 1 open bug, Blocks 1 open bug)

Details

The aarch64 workers of the NSS continuous integration stopped running altogether in August 2021:

https://treeherder.mozilla.org/jobs?repo=nss-try&selectedTaskRun=furyowVbTMCxpj10wP02qg.0

Hey Ben, do you know why these tasks were set up to run with a static worker-pool initially? It looks like we run our Firefox Linux aarch64 tasks in GCP; do you think transitioning these tasks over there would be feasible?

Generally, using a cloud-based pool will be a lot less painful than a statically managed one. But maybe there's a reason it was set up this way?

Flags: needinfo?(bbeurdouche)

Is there a list of workers that are supposed to be reachable that are not?
Do we know if these are hardware?

(In reply to Andrew Halberstadt [:ahal] from comment #1)

Hey Ben, do you know why these tasks were set up to run with a static worker-pool initially? It looks like we run our Firefox Linux aarch64 tasks in GCP; do you think transitioning these tasks over there would be feasible?

Generally, using a cloud-based pool will be a lot less painful than a statically managed one. But maybe there's a reason it was set up this way?

Hi Andrew, since some of our other workers already run in GCP, it would definitely be good to have the aarch64 workers set up the same way.
I don't think those were static, as they were based on the docker images in https://hg.mozilla.org/projects/nss/file/tip/automation/taskcluster/docker-aarch64.
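For context, a docker-worker based task in our task graph mostly just names a worker pool and a docker image, so retargeting the tasks at a GCP-backed pool should largely be a matter of changing which pool they point at. A rough sketch, not the actual extend.js contents; every identifier below is hypothetical:

// Illustrative sketch only -- not the actual extend.js contents.
// A docker-worker aarch64 build task roughly needs a worker pool and an image;
// pointing it at a GCP-backed pool would mostly mean changing these two fields.
const aarch64DockerTask = {
  provisioner: "nss-1",                // hypothetical provisioner
  workerType: "linux-aarch64-gcp",     // hypothetical GCP-backed docker-worker pool
  image: "nssdev/nss-aarch64:latest",  // hypothetical image built from automation/taskcluster/docker-aarch64/
  platform: "aarch64",
};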

Flags: needinfo?(bbeurdouche)

(In reply to Michelle Goossens [:masterwayz] from comment #2)

Is there a list of workers that are supposed to be reachable that are not?
Do we know if these are hardware?

Unfortunately I don't have much more information on this. I was pretty sure this was deployed on AWS hardware instances, but then I found https://hg.mozilla.org/projects/nss/file/tip/automation/taskcluster/graph/src/extend.js#l287, which seems to indicate these were run on dedicated machines like our macOS instances. I might be wrong though.
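For what it's worth, the distinction should be visible in the worker-pool definition: a pool backed by dedicated machines uses worker-manager's static provider (workers are registered by hand, like the macOS pools), while a cloud pool uses a provider that spawns instances from an image on demand. A rough sketch of the two flavours, with hypothetical pool names and provider IDs that are not taken from the real config:

// Rough sketch of the two worker-pool flavours being discussed; names are hypothetical.
const staticPool = {
  workerPoolId: "nss-t/nss-aarch64",  // hypothetical: backed by pre-registered hardware/VMs
  providerId: "static",               // worker-manager provider for manually registered workers
};
const gcpPool = {
  workerPoolId: "nss-3/linux-aarch64-gcp",  // hypothetical
  providerId: "fxci-level3-gcp",            // hypothetical cloud provider; instances created from an image
};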

I have spotted one EC2 instance in an AWS account that is labelled nss-static-aarch64, so I assume that is the instance. It seems to be a regular EC2 instance.

For posterity, I filed https://mozilla-hub.atlassian.net/browse/RELENG-1023.

Initially I was thinking we could use the same pool of workers that the Gecko Linux aarch64 tasks are using, but it turns out that pool is actually x86_64 and the builds are cross-compiled.

This means we'd either need to block on https://mozilla-hub.atlassian.net/browse/RELOPS-265, or get the NSS builds to also cross-compile.
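For the second option, a cross-compiled build would keep running on the existing x86_64 pool and only the target architecture would change; the tests would presumably still need native arm64 workers to execute. A rough sketch, assuming nss's build.sh accepts a --target flag for the architecture (worth verifying against the actual script); the flag and pool names below are assumptions:

// Rough sketch of the cross-compile option; flag name and pool name are assumptions.
const aarch64CrossBuild = {
  workerType: "b-linux-gcp",          // hypothetical existing x86_64 docker-worker pool
  platform: "aarch64",                // artifacts are arm64 even though the worker is x86_64
  command: ["/bin/bash", "-c", "nss/build.sh --target=arm64"],  // flag name assumed, not verified
};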

Blocks: nss-ci
Severity: -- → S4
Priority: -- → P2
Depends on: linux-arm64-ci

Actually, that last RELOPS ticket seems solved now... :)

Flags: needinfo?(ahal)

Looks like RELOPS-265 was repurposed, and https://mozilla-hub.atlassian.net/browse/RELOPS-686 is required here instead.

Flags: needinfo?(ahal)

Cross-posting from https://bugzilla.mozilla.org/show_bug.cgi?id=1677963.

Headless multiuser generic-worker (g-w) images are not currently possible, but here are arm64 images that should work.

monopacker-ubuntu-2204-wayland-arm64:
  fxci-level1-gcp: projects/taskcluster-imaging/global/images/gw-fxci-gcp-l1-arm64-gui-googlecompute-2024-02-14t21-16-11z
  fxci-level3-gcp: projects/fxci-production-level3-workers/global/images/gw-fxci-gcp-l3-arm64-gui-googlecompute-2024-02-15t21-29-13z
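Note that these are generic-worker images rather than docker-worker images, so tasks pointed at a pool built from them would run their commands directly on the worker instead of inside a docker image. A rough sketch of what that could look like on the NSS side; the pool, provisioner, and script paths below are hypothetical:

// Rough sketch only; pool, provisioner, and script paths are hypothetical.
const aarch64GenericWorkerTask = {
  provisioner: "nss-3",                 // hypothetical
  workerType: "b-linux-arm64-gcp",      // hypothetical pool built from the gw-fxci-gcp-l3-arm64 image above
  command: [["/bin/bash", "-c", "nss/automation/taskcluster/scripts/build_gyp.sh"]],  // generic-worker takes a list of argv arrays
};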