Create GCP pools for testing with new testing compatible image 20220603
Categories
(Release Engineering :: Firefox-CI Administration, task)
Tracking
(Not tracked)
People
(Reporter: masterwayz, Assigned: masterwayz)
Attachments
(2 files)
Comment 1•3 years ago (Assignee)
Hi,
Per https://bugzilla.mozilla.org/show_bug.cgi?id=1752375#c21, now that we appear to have an image that works with testing, I'd like to know how we should set up the testing pools.
I see that for Linux on AWS we have gecko-t/t-linux* pools. Should we do the same for GCP and, if so, where should these go? (Also, I see only one image, but on the AWS pools I see both a default and a trusted image.)
I was confused with the CoT update request. So :ahal is right we don't need new CoT on the tester image. And I think the same image can be used by all of the pools that need docker-worker.
I don't know why there is a default and trusted image in aws.
:jmaher do you know why there is a default and trusted image for docker-worker in aws and is that needed in gcp?
Comment 3•3 years ago
I have no idea. I would assume either different worker levels or for different projects. Is there a history we could dig into to see why there are two types?
Maybe there is something in taskcluster bugs or github. I'll search around a bit.
Comment 5•3 years ago
(In reply to :dhouse from comment #2)
I was confused with the CoT update request. So :ahal is right we don't need new CoT on the tester image. And I think the same image can be used by all of the pools that need docker-worker.
I don't know why there is a default and trusted image in aws. :jmaher do you know why there is a default and trusted image for docker-worker in aws and is that needed in gcp?
We definitely use the trusted docker-worker image for builds in both AWS and GCP. Both of those need a new 2022 CoT key; the AWS trusted build docker-worker AMI has the 2021 CoT key, and the GCP one has the 2020 CoT key currently.
Comment 6•3 years ago
It looks like these tasks are using the gecko-3/t-linux-xlarge trusted pool:
https://firefox-ci-tc.services.mozilla.com/tasks/HTO0lz1-R2-rhAEwQaK0Hg
As Aki found, Chain of Trust might not be strictly necessary for them, though, since their artifacts don't get used by release tasks (only build artifacts do), so there's likely nothing actually verifying them.
But that said, why are those using the test pool rather than the build pool in the first place? Let's just roll with the untrusted pool for now. Then we can either migrate these tasks to the build pool, or stand up a trusted test pool later when there's nothing more important left to do.
(In reply to Andrew Halberstadt [:ahal] from comment #6)
It looks like these tasks are using the gecko-3/t-linux-xlarge trusted pool:
https://firefox-ci-tc.services.mozilla.com/tasks/HTO0lz1-R2-rhAEwQaK0Hg
I'm not sure where those came from. We had set up the gcp builds under gecko-{1,3}/b-linux-gcp (https://hg.mozilla.org/ci/ci-configuration/file/tip/worker-pools.yml#l313). They are in the gcp project "fxci-production-level3-workers", and I had expected that all level3 workers (build, test, or anything else) would be in that project.
:masterwayz I think the new gcp test pools could look like the aws ones (https://hg.mozilla.org/ci/ci-configuration/file/tip/worker-pools.yml#l906), with a "-gcp" suffix like the one on the build pools (and a gcp config like our testing json instead of the aws config ;).
Do you want to submit the ci-config change for it? Or shall I make a draft of it?
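For illustration only, the "-gcp" suffix naming suggested above could be sketched like this. The helper name gcp_pool_id and the pool IDs passed to it are hypothetical examples, not the actual ci-configuration code or the final pool names:

```python
def gcp_pool_id(aws_pool_id: str, suffix: str = "-gcp") -> str:
    """Return a GCP-style worker pool ID by appending a suffix to an AWS pool ID.

    Pool IDs are assumed to look like "<provisioner>/<worker-type>".
    """
    provisioner, sep, worker_type = aws_pool_id.partition("/")
    if not sep or not provisioner or not worker_type:
        raise ValueError(
            f"expected '<provisioner>/<worker-type>', got {aws_pool_id!r}"
        )
    if worker_type.endswith(suffix):
        # Already carries the suffix, so leave it unchanged.
        return aws_pool_id
    return f"{provisioner}/{worker_type}{suffix}"


if __name__ == "__main__":
    # Illustrative inputs only; the real pool list lives in worker-pools.yml.
    for pool in ("gecko-1/b-linux", "gecko-t/t-linux-xlarge"):
        print(pool, "->", gcp_pool_id(pool))
```

This only shows the naming; the per-pool gcp config (image, machine type, project) would still need to be written by hand.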
Comment 14•3 years ago (Assignee)
This is done.
Comment 15•3 years ago
@masterwayz, do you know if this has the updated worker cert?
See https://bugzilla.mozilla.org/show_bug.cgi?id=1764751#c7
Thanks!
Comment 16•3 years ago (Assignee)
Does this have the updated worker cert? See above
Comment 17•3 years ago
:pete, thanks for following up on this! I didn't update the cert yet on these. Can you get the new cert to me? Or I'll wait and get it from ahal/aki when they're back from pto.
Comment 18•3 years ago
Hey Dave, no worries, you're welcome! :-)
The certs can be found in attachment 9276708 and attachment 9276709. The first is the *.taskcluster-worker.net cert; the second is the DigiCert intermediate cert. Let me know if you need anything else.
Pete
Comment 19•3 years ago
I updated the gcp docker-worker tester image with the new livelog key (I replaced the existing file and confirmed its location in the config), matching https://github.com/mozilla/community-tc-config/blob/d2b36727db19bba379953d54042a3569a5898368/imagesets/docker-worker/bootstrap.sh:
"projects/taskcluster-imaging/global/images/docker-worker-gcp-u1404-2022-08-09"
I ran some tests through instances using it and tried loading the livelog, but it failed (https://firefox-ci-tc.services.mozilla.com/worker-manager/gecko-t%2Flinux-xlarge-gcp).
I verified that we have firewall rules allowing ingress on TCP ports 32768-65535 from CIDR 0.0.0.0/0 (matching aws community) to the fxci-production-level1 and fxci-production-test gcp projects. I also tested with all tcp/udp ports fully open, and loading a livelog still failed.
I think I'll next try running an instance and checking the port that livelog is listening on, and seeing if I can get a direct connection to it.
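A minimal sketch of that kind of check, using only the Python standard library: one helper confirms a port falls inside the 32768-65535 range the firewall rules allow, and another attempts a plain TCP connection. The function names are made up for this example, and the host and port you pass in would be whatever the running instance actually reports.

```python
import socket

# Inclusive range 32768-65535, matching the firewall rule described above.
LIVELOG_PORT_RANGE = range(32768, 65536)


def port_in_firewall_range(port: int) -> bool:
    """True if the port falls inside the range the ingress rule allows."""
    return port in LIVELOG_PORT_RANGE


def tcp_port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Try a plain TCP connect; True means something accepted the connection."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

A successful connect only shows that the port is reachable; it says nothing about whether the TLS certificate or the livelog access token are valid.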
Comment 20•3 years ago
I checked which port livelog was listening on for a running instance: it was 32769. It responded, but I didn't have the access_key, and I haven't yet found where it is set.
:masterwayz could you test this image to see if livelog works now?
"projects/taskcluster-imaging/global/images/docker-worker-gcp-u1404-2022-08-09"
Comment 21•3 years ago
I found the active livelog access_token (it changes per task/instance) and a direct load with the public ip of a gcp tester does work (tested a prod one, with the old livelog key, and a -staging one with the new livelog key).
prod:
https://35.222.122.118:32769/log/actualkeyherewiththisendIOeQ
new image tests:
https://104.198.100.49:32768/log/actualkeyherewiththisendKgmg
https://34.105.90.246:32768/log/actualkeyherewiththisendAraw
I think it picks the port itself; the port differed (32768 vs 32769) across the instances I checked.
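The direct-load URLs above all follow the same https://<ip>:<port>/log/<token> shape. A small sketch for building and picking apart such URLs follows; the helper names are hypothetical, and the token is whatever the running task currently exposes:

```python
from urllib.parse import urlsplit

LOG_PREFIX = "/log/"


def livelog_url(ip: str, port: int, token: str) -> str:
    """Build a direct livelog URL in the shape used in this comment."""
    return f"https://{ip}:{port}{LOG_PREFIX}{token}"


def parse_livelog_url(url: str) -> tuple[str, int, str]:
    """Split a direct livelog URL into (host, port, token)."""
    parts = urlsplit(url)
    if (
        parts.scheme != "https"
        or parts.hostname is None
        or parts.port is None
        or not parts.path.startswith(LOG_PREFIX)
    ):
        raise ValueError(f"not a direct livelog URL: {url!r}")
    return parts.hostname, parts.port, parts.path[len(LOG_PREFIX):]
```

Parsing out the port makes it easy to compare instances, for example to confirm that each one sits inside the 32768-65535 range.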
Comment 22•3 years ago
I tested on a (new-ssl-key staging) level1 builder, and it had livelog running on port 32769.
I think any port in the full 32768-65535 range is okay.
Comment 23•3 years ago (Assignee)
How can we best test it? Make a temp. pool and then just run a couple of jobs on it to see if it keeps working?
Comment 24•3 years ago
I'm reopening as we have several reports that certificates have expired. I'm not sure if the problem is with this bug or bug 1764751.
@dhouse, see comment 18, I think this still requires action.
Comment 25•3 years ago
Hey @dhouse, just checking in to see if this can be corrected? What is the target date of having the new certificates?
Comment 26•3 years ago
(In reply to Michelle Goossens [:masterwayz] from comment #23)
How can we best test it? Make a temp. pool and then just run a couple of jobs on it to see if it keeps working?
Yeah, that's the simplest.
:dhouse has some tools that will clone jobs from one queue to another at https://github.com/mozilla-platform-ops/relops_infra_as_code/tree/master/docker/stager/stager, but I'm not 100% sure how to use them.
(In reply to Rosanna from comment #25)
Hey @dhouse, just checking in to see if this can be corrected? What is the target date of having the new certificates?
I think :dhouse has created a new image with the updated certificates, but was waiting on testing.
:dhouse is out on PTO until Monday.
Comment 27•3 years ago
r+. Livelog on gcp has not worked before, but we'll see if we get it working with the new certificate. I had confirmed that the port range is allowed from all sources, and that a direct ip:port load of live logs works (accepting that the cert is not for the ip but for the firefoxci hostname).
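To illustrate why a direct ip:port load triggers a certificate name warning, here is a simplified sketch of wildcard matching for a cert such as the *.taskcluster-worker.net one from comment 18. This is a teaching example and not the full validation that TLS libraries perform, and the hostname in the usage lines is made up:

```python
import ipaddress


def is_ip_literal(host: str) -> bool:
    """True if host parses as an IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False


def wildcard_cert_matches(pattern: str, host: str) -> bool:
    """Simplified name check: '*' may stand for exactly one left-most DNS label.

    Wildcard DNS names never match IP address literals.
    """
    if is_ip_literal(host):
        return False
    p_labels = pattern.lower().rstrip(".").split(".")
    h_labels = host.lower().rstrip(".").split(".")
    if len(p_labels) != len(h_labels):
        return False
    if p_labels[0] == "*":
        return bool(h_labels[0]) and p_labels[1:] == h_labels[1:]
    return p_labels == h_labels


if __name__ == "__main__":
    pattern = "*.taskcluster-worker.net"
    # Hypothetical worker hostname, used only for illustration.
    print(wildcard_cert_matches(pattern, "example.taskcluster-worker.net"))
    # A bare public IP, as used in the direct-load tests.
    print(wildcard_cert_matches(pattern, "35.222.122.118"))
```

So a browser or client loading the log by IP has to accept the mismatch manually, while a load through a matching hostname would not.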
Comment 28•3 years ago
(In reply to Andrew Erickson [:aerickson] from comment #26)
(In reply to Michelle Goossens [:masterwayz] from comment #23)
How can we best test it? Make a temp. pool and then just run a couple of jobs on it to see if it keeps working?
Yeah, that's the simplest.
:dhouse has some tools that will clone jobs from one queue to another [...]
The only problem with the "stager" I use for testing is that you have to track task IDs manually: there are no commits, so the tasks will not appear in treeherder/perfherder.
So using mach's --worker-suffix for a test pool is better, since it gives results that are comparable in treeherder.