Closed Bug 1772714 Opened 3 years ago Closed 3 years ago

Create GCP pools for testing with new testing compatible image 20220603

Categories

(Release Engineering :: Firefox-CI Administration, task)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: masterwayz, Assigned: masterwayz)

Attachments

(2 files)

No description provided.
Summary: Update GCP pools to new testing compatible image 20220603 → Create GCP pools for testing with new testing compatible image 20220603

Hi,

Per https://bugzilla.mozilla.org/show_bug.cgi?id=1752375#c21, now that we appear to have an image that works with testing, I'd like to know how we should set up the testing pools.
I see that for Linux on AWS we have gecko-t/t-linux* pools. Should we do the same for GCP, and if so, where should these go? (Also, I see only one image here, but the AWS pools have both a default and a trusted image.)

Flags: needinfo?(dhouse)
Flags: needinfo?(ahal)

I was confused with the CoT update request. So :ahal is right we don't need new CoT on the tester image. And I think the same image can be used by all of the pools that need docker-worker.
I don't know why there is a default and trusted image in aws.

:jmaher do you know why there is a default and trusted image for docker-worker in aws and is that needed in gcp?

Flags: needinfo?(dhouse) → needinfo?(jmaher)

I have no idea. I would assume either different worker levels or for different projects. Is there a history we could dig into to see why there are two types?

Flags: needinfo?(jmaher)

Maybe there is something in taskcluster bugs or github. I'll search around a bit.

(In reply to :dhouse from comment #2)

I was confused with the CoT update request. So :ahal is right we don't need new CoT on the tester image. And I think the same image can be used by all of the pools that need docker-worker.
I don't know why there is a default and trusted image in aws.

:jmaher do you know why there is a default and trusted image for docker-worker in aws and is that needed in gcp?

We definitely use the trusted docker-worker image for builds in both AWS and GCP. Both of those need a new 2022 CoT key; the AWS trusted build docker-worker AMI has the 2021 CoT key, and the GCP one has the 2020 CoT key currently.

It looks like these tasks are using the gecko-3/t-linux-xlarge trusted pool:
https://firefox-ci-tc.services.mozilla.com/tasks/HTO0lz1-R2-rhAEwQaK0Hg

As Aki found, Chain of Trust might not be strictly necessary for them though as their artifacts don't get used by release tasks (only builds) and so there's likely nothing actually verifying them.

But that said, why are those using the test pool rather than the build pool in the first place? Let's just roll with the untrusted pool for now. Then we can either migrate these tasks to the build pool, or stand up a trusted test pool later when there's nothing more important left to do.

Flags: needinfo?(ahal)

(In reply to Andrew Halberstadt [:ahal] from comment #6)

It looks like these tasks are using the gecko-3/t-linux-xlarge trusted pool:
https://firefox-ci-tc.services.mozilla.com/tasks/HTO0lz1-R2-rhAEwQaK0Hg

I'm not sure where those came from. We had set up the gcp builds under gecko-{1,3}/b-linux-gcp (https://hg.mozilla.org/ci/ci-configuration/file/tip/worker-pools.yml#l313). They are in the gcp project "fxci-production-level3-workers", and I had expected that all level3 workers (build, test, or anything else) would be in that project.

:masterwayz I think the new gcp test pools could look like the aws one (https://hg.mozilla.org/ci/ci-configuration/file/tip/worker-pools.yml#l906) with a "-gcp" suffix, as on the build pools (and with a gcp config like our testing json instead of the aws config ;).
Do you want to submit the ci-config change for it? Or shall I make a draft of it?
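
A minimal sketch of the naming convention proposed above, assuming a GCP test pool id is simply the AWS pool id plus a "-gcp" suffix. The helper name gcp_pool_id is hypothetical and is not part of ci-configuration; the real change is made in worker-pools.yml.

```python
def gcp_pool_id(aws_pool_id: str, suffix: str = "-gcp") -> str:
    """Return the GCP counterpart of an AWS worker pool id.

    Assumption: pool ids look like "<provisioner>/<worker-type>" and the
    GCP pool differs only by a trailing suffix, as on the build pools.
    """
    provisioner, sep, worker_type = aws_pool_id.partition("/")
    if not sep or not provisioner or not worker_type:
        raise ValueError(
            f"expected '<provisioner>/<worker-type>', got {aws_pool_id!r}"
        )
    if worker_type.endswith(suffix):
        # Already carries the suffix, so leave it unchanged.
        return aws_pool_id
    return f"{provisioner}/{worker_type}{suffix}"


if __name__ == "__main__":
    # Pool ids taken from this thread, used here only as examples.
    for pool in ("gecko-t/t-linux-xlarge", "gecko-3/t-linux-xlarge"):
        print(pool, "->", gcp_pool_id(pool))
```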

Flags: needinfo?(mgoossens)

I'll make it, thanks for your help!

Flags: needinfo?(mgoossens)
Pushed by mgoossens@mozilla.com: https://hg.mozilla.org/ci/ci-configuration/rev/b91d6f16bc9f Create GCP pools for testing with new testing compatible image 20220603 r=releng-reviewers,gbrown,ahal
Pushed by ahalberstadt@mozilla.com: https://hg.mozilla.org/ci/ci-configuration/rev/816b715dcc40 Increase size of GCP testing pools to fit the image, r=MasterWayZ

This is done.

Status: ASSIGNED → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED

@masterwayz, do you know if this has the updated worker cert?

See https://bugzilla.mozilla.org/show_bug.cgi?id=1764751#c7

Thanks!

Does this have the updated worker cert? See above.

Flags: needinfo?(dhouse)

:pete, thanks for following up on this! I didn't update the cert yet on these. Can you get the new cert to me? Or I'll wait and get it from ahal/aki when they're back from pto.

Flags: needinfo?(dhouse) → needinfo?(pmoore)

Hey Dave, no worries, you're welcome! :-)

The certs can be found in attachment 9276708 [details] and attachment 9276709 [details]. The first is the *.taskcluster-worker.net cert, the second is the digicert intermediate cert. Let me know if you need anything else.

Pete

Flags: needinfo?(pmoore)
See Also: → 1768865
Flags: needinfo?(dhouse)

I updated the gcp docker-worker tester image with the new livelog key (I replaced the existing file and confirmed its location in the config), matching https://github.com/mozilla/community-tc-config/blob/d2b36727db19bba379953d54042a3569a5898368/imagesets/docker-worker/bootstrap.sh :
"projects/taskcluster-imaging/global/images/docker-worker-gcp-u1404-2022-08-09"

I ran some tests on instances using it and tried loading the livelog, which failed (https://firefox-ci-tc.services.mozilla.com/worker-manager/gecko-t%2Flinux-xlarge-gcp).

I verified we have firewall rules to allow ingress on tcp ports 32768-65535 from cidr 0.0.0.0/0 (matching aws community) to the fxci-production-level1 and fxci-production-test gcp projects. I also tested with all tcp/udp ports fully open, and loading a livelog still failed.

I think I'll next try running an instance and checking the port that livelog is listening on, and seeing if I can get a direct connection to it.
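
A rough sketch of that kind of check, assuming only the port range from the firewall rule above. The helper names in_livelog_range and can_connect are hypothetical; this only tests plain TCP reachability and says nothing about TLS or the livelog access token.

```python
import socket

# Port range allowed by the firewall rule described above (inclusive).
LIVELOG_PORT_RANGE = range(32768, 65536)


def in_livelog_range(port: int) -> bool:
    """True if the port falls inside the range the firewall rule opens."""
    return port in LIVELOG_PORT_RANGE


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Try a plain TCP connection and report whether it succeeded."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False
```

For example, can_connect("<instance public ip>", 32769) would show whether the listening port is reachable from outside before looking at certificates.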

I checked which port livelog was listening on for a running instance: it was 32769, and it responded. But I didn't have the access_key, and I haven't yet found where it is set.

:masterwayz could you test this image to see if livelog works now?
"projects/taskcluster-imaging/global/images/docker-worker-gcp-u1404-2022-08-09"

Flags: needinfo?(dhouse) → needinfo?(mgoossens)

I found the active livelog access_token (it changes per task/instance) and a direct load with the public ip of a gcp tester does work (tested a prod one, with the old livelog key, and a -staging one with the new livelog key).

prod:
https://35.222.122.118:32769/log/actualkeyherewiththisendIOeQ
new image tests:
https://104.198.100.49:32768/log/actualkeyherewiththisendKgmg
https://34.105.90.246:32768/log/actualkeyherewiththisendAraw

I think it picks the port itself, since the instances I checked differ (32768 vs 32769).

I tested on a (new-ssl-key staging) level1 builder, and it had livelog running on port 32769
I think the port is okay in the full range 32768-65535
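
The direct load described above can be sketched as follows. This assumes the URL shape shown in the examples (https://<ip>:<port>/log/<access_token>); the function names are hypothetical. Hostname checking is turned off because the certificate is issued for a hostname rather than the raw IP, so this is only suitable for a manual test.

```python
import ssl
import urllib.request


def livelog_url(ip: str, port: int, access_token: str) -> str:
    """Build a direct livelog URL in the shape used in this thread."""
    return f"https://{ip}:{port}/log/{access_token}"


def fetch_livelog_head(url: str, max_bytes: int = 1024,
                       timeout: float = 5.0) -> bytes:
    """Fetch the first bytes of a live log over HTTPS.

    The certificate chain is still verified, but the hostname check is
    skipped since we connect by IP rather than by the name on the cert.
    """
    ctx = ssl.create_default_context()
    ctx.check_hostname = False
    with urllib.request.urlopen(url, timeout=timeout, context=ctx) as resp:
        return resp.read(max_bytes)
```

A chain verification failure from fetch_livelog_head would point at the certificate files on the image, while a timeout would point back at the firewall or the port.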

How can we best test it? Make a temp. pool and then just run a couple of jobs on it to see if it keeps working?

Flags: needinfo?(mgoossens)

I'm reopening as we have several reports that certificates have expired. I'm not sure if the problem is with this bug or bug 1764751.

@dhouse, see comment 18, I think this still requires action.

Status: RESOLVED → REOPENED
Flags: needinfo?(dhouse)
Resolution: FIXED → ---

Hey @dhouse, just checking in to see if this can be corrected? What is the target date of having the new certificates?

(In reply to Michelle Goossens [:masterwayz] from comment #23)

How can we best test it? Make a temp. pool and then just run a couple of jobs on it to see if it keeps working?

Yeah, that's the simplest.

:dhouse has some tools that will clone jobs from one queue to another at https://github.com/mozilla-platform-ops/relops_infra_as_code/tree/master/docker/stager/stager, but I'm not 100% sure how to use them.

(In reply to Rosanna from comment #25)

Hey @dhouse, just checking in to see if this can be corrected? What is the target date of having the new certificates?

I think :dhouse has created a new image with the updated certificates, but was waiting on testing.

:dhouse is out on PTO until Monday.

r+ livelog on gcp has not worked before, but we'll see if we get it working with the new certificate. I had confirmed that the port range is open to all sources, and that a direct ip:port load of live logs works (accepting that the cert is for the firefoxci hostname, not the ip).

Flags: needinfo?(dhouse)

(In reply to Andrew Erickson [:aerickson] from comment #26)

(In reply to Michelle Goossens [:masterwayz] from comment #23)

How can we best test it? Make a temp. pool and then just run a couple of jobs on it to see if it keeps working?

Yeah, that's the simplest.

:dhouse has some tools that will clone jobs from one queue to another [...]

The only problem with the "stager" I use for testing is that you have to manually track task ids, because there are no commits and so the tasks will not appear in treeherder/perfherder.
So using mach's --worker-suffix with a test pool is better, because it gives treeherder-comparable results.

Status: REOPENED → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED