Setup a separate mac mini pool for PGO builds
Categories
(Infrastructure & Operations :: RelOps: Posix OS, task)
Tracking
(Not tracked)
People
(Reporter: dividehex, Assigned: dividehex)
References
Details
User Story
Work summary:
[X] [bug 1562289] Netops to setup a new tier3 network segment (vlan)
[X] [bug 1563338] Netops to move ports of dedicated PGO minis to new vlan
[X] [bug 1563333] IT storage virtualization to setup new bsdpy virtual hosts, nic attached to new vlan
[X] [bug 1563340] Setup tier3 bsdpy service on bsdpy virtual hosts
[X] [bug 1563357] Allocate and move 4 additional mac minis for deploystudio to new vlan
[X] Setup and configure deploystudio minis plus dedicated netboot image
[x] Fix existing taskcluster credentials to not grant access to tier-3 workers
[x] Obtain new taskcluster credentials solely for PGO tier3
[x] Acquire and set up cot keys on tier-3 workers. (was: Generate chain-of-trust keys for tier-3 workers and add to https://github.com/mozilla-releng/scriptworker/blob/master/scriptworker/constants.py#L110)
[x] [bug 1561956] Setup pgo tier3 puppet role and profile with multiuser generic worker
[x] Re-image and provision PGO mac mini pool
Attachments
(4 files)
We need to set up a separate dedicated pool of mac minis in mdc1 and mdc2 in order to exec osx PGO builds (tier 3). Unfortunately, these will come from our existing pool of testers, whose pending count already spikes on a regular basis.
Let's start by determining how big of a pool we need and how many we can sacrifice from the existing tester pool. Then we'll select a group, add a node def for them and re-image.
Joel, can you help determine how many minis to allocate from the current testers pool?
See: bug 1515415
Assignee | ||
Updated•6 years ago
Comment 1•6 years ago
The estimate from bug 1515415 comment 7 is that we need to run a task that will take 5 minutes on average, an average of 500 times in a day.
Comment 2•6 years ago
That is 42 hours of compute time, so the minimum is 2 machines. I would say 3 to be safe (averages out to 17 hours of compute time/day); my target is 20 hours/day to account for reboots, upgrades, timed-out tasks, etc.
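As a rough check of the sizing above, here is a minimal sketch of the arithmetic: 5 minutes per task times 500 tasks per day, divided by the usable hours per machine. The function name `machines_needed` is illustrative, and the 24-hour and 20-hour figures stand for a fully utilized machine and the 20 hours/day target respectively.

```shell
#!/bin/sh
# machines_needed MINUTES_PER_TASK TASKS_PER_DAY USABLE_HOURS_PER_MACHINE
# Prints the smallest whole number of machines that covers the daily load.
# (Hypothetical helper, not part of any existing tooling.)
machines_needed() {
    total_minutes=$(( $1 * $2 ))
    per_machine_minutes=$(( $3 * 60 ))
    # Integer ceiling division: round up so a partial machine counts as one.
    echo $(( (total_minutes + per_machine_minutes - 1) / per_machine_minutes ))
}

machines_needed 5 500 24   # prints 2 (bare minimum at 24 hours/day)
machines_needed 5 500 20   # prints 3 (with the 20 hours/day target)
```

The total load is 2500 minutes, or just under 42 hours, which is why two machines are the floor and the 20-hour target pushes the count to three.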
Comment 3•6 years ago
Hi Jake, is comment 2 feasible? Do you have an estimate of when we will be able to do this?
Comment 4•6 years ago
Additionally, make sure that other levels can't access them.
Assignee | ||
Comment 5•6 years ago
Assignee | ||
Comment 6•6 years ago
I've reallocated and re-imaged 4 macs into the new gecko-3-t-osx-1010 worker type. Thanks :tomprince for setting up the scopes!
https://tools.taskcluster.net/provisioners/releng-hardware/worker-types/gecko-3-t-osx-1010
:cmanchester, :mshal, is there anything else needed here with regards to the workers?
Comment 7•6 years ago
Thank you, this does seem to be what we need! I'm noticing that testing the straightforward patch for this fails on try due to missing scopes. Tom, should we just make a transform swap the new worker type for the regular test worker when running these on try?
Comment 8•6 years ago
:chmanchester I'm working on a patch to abstract this; there are some other places where we want to vary the worker-type, and it doesn't make sense to encode the exact type in every kind.
Comment 9•6 years ago
Hey Jake,
We now have support for running tasks as unique OS users on macOS in generic-worker.
Any questions, let me know!
Thanks.
Assignee | ||
Comment 10•6 years ago
(In reply to Pete Moore [:pmoore][:pete] from comment #9)
> Hey Jake,
> We now have support for running tasks as unique OS users on macOS in generic-worker.
> Any questions, let me know!
> Thanks.
Thanks Pete. I'll work on getting the multiuser gw client into puppet and onto the workers.
Assignee | ||
Comment 12•6 years ago
Assignee | ||
Comment 13•6 years ago
Updated user story: Deploystudio setup completed. Re-imaging infra is in place.
Assignee | ||
Comment 14•6 years ago
Since we are migrating to Mojave, the new workerType will be gecko-3-t-osx-1014
Assignee | ||
Comment 15•6 years ago
NI :tomprince for:
[ ] Fix existing taskcluster credentials to not grant access to tier-3 workers
[ ] Obtain new taskcluster credentials solely for PGO tier3
Comment 16•6 years ago
I've created https://tools.taskcluster.net/auth/clients/project%2Freleng%2Freleng-hardware%2Fgecko-3-t-osx-1014 for the workers; you can reset the access token, to get a fresh token to put in puppet.
I've gone through all the roles and clients, and removed the unbounded releng-hardware/* scope from the existing role (and cleaned up a bunch of other unrelated scopes of the same type that were too broad).
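For context on why the unbounded scope mattered: in Taskcluster, a granted scope ending in `*` satisfies any required scope that starts with the text before the `*`. A minimal sketch of that matching rule follows. The function name `scope_satisfies` and the broad `queue:claim-work:releng-hardware/*` string are illustrative assumptions, not copied from the actual roles; the required scope is the one named in the worker logs later in this bug.

```shell
#!/bin/sh
# scope_satisfies GRANTED REQUIRED
# Returns 0 if GRANTED covers REQUIRED: either the two are identical, or
# GRANTED ends in "*" and everything before the "*" is a prefix of REQUIRED.
# (Hypothetical helper that mirrors the trailing-star rule only.)
scope_satisfies() {
    granted=$1
    required=$2
    case $granted in
        *\*)
            prefix=${granted%\*}
            case $required in
                "$prefix"*) return 0 ;;
            esac
            return 1
            ;;
    esac
    [ "$granted" = "$required" ]
}

# Illustrative scope strings (assumptions for the example):
if scope_satisfies 'queue:claim-work:releng-hardware/*' \
                   'queue:claim-work:releng-hardware/gecko-3-t-osx-1010'; then
    echo "broad scope can claim tier-3 work"
fi
```

With the wildcard removed, a client only holding scopes for other worker types no longer matches the tier-3 worker type, which is the intended isolation.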
Updated•6 years ago
Comment 17•5 years ago
Is there something more needed with the scopes? ("queue:claim-work:releng-hardware/gecko-3-t-osx-1010" ?)
The t-yosemite-r7-{472,236} machines have repeated log entries like:
Oct 25 11:27:11 t-yosemite-r7-472.test.releng.mdc1.mozilla.com generic-worker Response Body:
Oct 25 11:27:11 t-yosemite-r7-472.test.releng.mdc1.mozilla.com generic-worker {
Oct 25 11:27:11 t-yosemite-r7-472.test.releng.mdc1.mozilla.com generic-worker "code": "InsufficientScopes",
Oct 25 11:27:11 t-yosemite-r7-472.test.releng.mdc1.mozilla.com generic-worker "message": "Client ID project/releng/generic-worker/os-x/production does not have sufficient scopes and is missing the following scopes:\n\n```\nqueue:claim-work:releng-hardware/gecko-3-t-osx-1010\n```\n\nThis request requires the client to satisfy the following scope expression:\n\n```\n{\n \"AllOf\": [\n \"queue:claim-work:releng-hardware/gecko-3-t-osx-1010\",\n \"queue:worker-id:mdc1/t-yosemite-r7-472\"\n ]\n}\n```\n\n---\n\n* method: claimWork\n* errorCode: InsufficientScopes\n* statusCode: 403\n* time: 2019-10-25T18:27:10.950Z",
Oct 25 11:27:11 t-yosemite-r7-472.test.releng.mdc1.mozilla.com generic-worker "requestInfo": {
Oct 25 11:27:11 t-yosemite-r7-472.test.releng.mdc1.mozilla.com generic-worker "method": "claimWork",
Oct 25 11:27:11 t-yosemite-r7-472.test.releng.mdc1.mozilla.com generic-worker "params": {
Oct 25 11:27:11 t-yosemite-r7-472.test.releng.mdc1.mozilla.com generic-worker "provisionerId": "releng-hardware",
Oct 25 11:27:11 t-yosemite-r7-472.test.releng.mdc1.mozilla.com generic-worker "workerType": "gecko-3-t-osx-1010"
Oct 25 11:27:11 t-yosemite-r7-472.test.releng.mdc1.mozilla.com generic-worker },
Oct 25 11:27:11 t-yosemite-r7-472.test.releng.mdc1.mozilla.com generic-worker "payload": {
Oct 25 11:27:11 t-yosemite-r7-472.test.releng.mdc1.mozilla.com generic-worker "tasks": 1,
Oct 25 11:27:11 t-yosemite-r7-472.test.releng.mdc1.mozilla.com generic-worker "workerGroup": "mdc1",
Oct 25 11:27:11 t-yosemite-r7-472.test.releng.mdc1.mozilla.com generic-worker "workerId": "t-yosemite-r7-472"
Oct 25 11:27:11 t-yosemite-r7-472.test.releng.mdc1.mozilla.com generic-worker },
Oct 25 11:27:11 t-yosemite-r7-472.test.releng.mdc1.mozilla.com generic-worker "time": "2019-10-25T18:27:10.950Z"
Oct 25 11:27:11 t-yosemite-r7-472.test.releng.mdc1.mozilla.com generic-worker }
Oct 25 11:27:11 t-yosemite-r7-472.test.releng.mdc1.mozilla.com generic-worker }
Oct 25 11:27:11 t-yosemite-r7-472.test.releng.mdc1.mozilla.com generic-worker Attempts: 1
Oct 25 11:27:11 t-yosemite-r7-472.test.releng.mdc1.mozilla.com generic-worker (Permanent) HTTP response code 403
Oct 25 11:27:11 t-yosemite-r7-472.test.releng.mdc1.mozilla.com generic-worker HTTP/1.1 403 Forbidden#012#011
Comment 18•5 years ago
(In reply to Dave House [:dhouse] from comment #17)
> Is there something more needed with the scopes? ("queue:claim-work:releng-hardware/gecko-3-t-osx-1010" ?)
> The t-yosemite-r7-{472,236} machines have repeated log entries like:
> [...]
I've silenced 472 and 236 in papertrail for 1 month (expecting we'll resolve the scope issue by then). They each had 2.7+gb of logs for this month (mostly the repeated claimwork failure).
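To gauge how much of a log is the repeated claimWork failure, a small sketch like the following could count `InsufficientScopes` responses per host. It assumes syslog-style lines like the ones in comment 17, where the hostname is the fourth whitespace-separated field; the function name `count_scope_failures` and the log path in the usage comment are hypothetical.

```shell
#!/bin/sh
# count_scope_failures FILE...
# Prints "<count> <host>" for every host that logged an InsufficientScopes
# response, most frequent first. Assumes the hostname is field 4, as in
# "Oct 25 11:27:11 <host> generic-worker ...".
count_scope_failures() {
    awk '/"code": "InsufficientScopes"/ { n[$4]++ }
         END { for (h in n) print n[h], h }' "$@" | sort -rn
}

# Usage (hypothetical path to an exported log):
#   count_scope_failures /tmp/generic-worker-export.log
```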
Comment 19•5 years ago
Dragos will be testing the latest multi-user updates/dev for macos pgo. I'll update the pgo pool to run on the new taskcluster-ci (needs new secrets created and put into each machine -- deploystudio and reimage/place all of the pgo workers).
Comment 20•5 years ago
(In reply to Dave House [:dhouse] from comment #19)
> Dragos will be testing the latest multi-user updates/dev for macos pgo. I'll update the pgo pool to run on the new taskcluster-ci (needs new secrets created and put into each machine -- deploystudio and reimage/place all of the pgo workers).
Dragos, these aren't ready yet. I'm sorry but there is more needed than I had anticipated. You could try running in the regular staging pool, and then we can move over to these when they're ready (?). #471 I couldn't reach (I tried cycling the power for it also, but still no ping so I'll need to ask the datacenter techs to check it).
Here are the machines for the pgo pool:
t-mojave-r7-235.tier3.releng.mdc2.mozilla.com
t-mojave-r7-236.tier3.releng.mdc2.mozilla.com
t-mojave-r7-471.tier3.releng.mdc1.mozilla.com
t-mojave-r7-472.tier3.releng.mdc1.mozilla.com
old tc:
client: project/releng/releng-hardware/gecko-3-t-osx-1014
scopes: assume:worker-type:releng-hardware/gecko-3-t-osx-1014
new tc:
client: project/releng/generic-worker/datacenter-gecko-3-t-osx
scopes: assume:worker-type:releng-hardware/gecko-3-t-osx*
I updated the secrets on the deploystudio servers and on #236 and tried bootstrapping it (with https://raw.githubusercontent.com${PUPPET_FORK}/${PUPPET_BRANCH}/provisioners/macos/bootstrap_mojave.sh)
There is additional setup still needed in deploystudio (workflows and scripts; I netbooted #472 against install3, but it didn't run -- no logs were created, so I think it didn't reach the server), and some puppet changes (it looks like the firewall needs to be adjusted -- running a bootstrap terminated when the pf.mozilla.anchors/rules.conf was applied).
Comment 21•5 years ago
Dragos, here is the pgo pool with one worker:
https://firefox-ci-tc.services.mozilla.com/provisioners/releng-hardware/worker-types/gecko-3-t-osx-1014
I had to re-run the bootstrap on 235 and with the second puppet run the firewall didn't cut me off and homebrew installed.
I've also kicked off the netboot of 236 and 472. I had forgotten to netboot against the bsdpy hosts instead of the install hosts. The workflows are not updated, and so when they complete I'll run the bootstrap/puppet to pin them to my puppet branch:
# check pre-reqs
xcode-select -p
git --version
puppet --version
# if anything missing, install it. then proceed:
echo gecko_3_t_osx_1014 > /etc/puppet_role
curl --silent -L -O https://raw.githubusercontent.com/davehouse/ronin_puppet/bug1530732_l3-on-ffci/provisioners/macos/bootstrap_mojave.sh
bash bootstrap_mojave.sh
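The pre-req step above could be wrapped so the bootstrap only proceeds when every tool is present. This is a minimal sketch: the function name `check_prereqs` is hypothetical, and it only confirms that each tool is on the PATH, which is a weaker check than `xcode-select -p` (that command also confirms a developer directory is selected).

```shell
#!/bin/sh
# check_prereqs TOOL...
# Reports every tool that is not found on PATH and returns non-zero if any
# are missing. (Hypothetical helper; presence on PATH only.)
check_prereqs() {
    missing=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Usage on a worker, before running the bootstrap:
#   check_prereqs xcode-select git puppet && bash bootstrap_mojave.sh
```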
Comment 22•5 years ago
Based on https://bugzilla.mozilla.org/show_bug.cgi?id=1561956#c72, is this deployed and ready? Are we clear to use those workers on production repos (and therefore land bug 1528374)? Thanks!
Comment 23•5 years ago
(In reply to Chris Manchester (:chmanchester) from comment #22)
> Based on https://bugzilla.mozilla.org/show_bug.cgi?id=1561956#c72, is this deployed and ready? Are we clear to use those workers on production repos (and therefore land bug 1528374)? Thanks!
It is not ready yet. There are still some problems being worked through. Dragos has the multiuser generic-worker running on the pgo pool, but there are some tests still failing after the move from single to multi-user (if I understand correctly).
Comment 24•5 years ago
(In reply to Dave House [:dhouse] from comment #23)
> (In reply to Chris Manchester (:chmanchester) from comment #22)
> > Based on https://bugzilla.mozilla.org/show_bug.cgi?id=1561956#c72, is this deployed and ready? Are we clear to use those workers on production repos (and therefore land bug 1528374)? Thanks!
> It is not ready yet. There are still some problems being worked through. Dragos has the multiuser generic-worker running on the pgo pool, but there are some tests still failing after the move from single to multi-user (if I understand correctly).
I'm not sure which tests are being referred to there. From the perspective of macOS pgo we need a handful of machines to run a single job type that hasn't landed yet (it will be added in bug 1528374, the patches from which I've been using to validate the work in bug 1561956), so I don't see tests as blocking us, unless I've misunderstood something here.
Comment 25•5 years ago
(In reply to Chris Manchester (:chmanchester) from comment #24)
> (In reply to Dave House [:dhouse] from comment #23)
> > (In reply to Chris Manchester (:chmanchester) from comment #22)
> > > Based on https://bugzilla.mozilla.org/show_bug.cgi?id=1561956#c72, is this deployed and ready? Are we clear to use those workers on production repos (and therefore land bug 1528374)? Thanks!
> > It is not ready yet. There are still some problems being worked through. Dragos has the multiuser generic-worker running on the pgo pool, but there are some tests still failing after the move from single to multi-user (if I understand correctly).
> I'm not sure which tests are being referred to there. From the perspective of macOS pgo we need a handful of machines to run a single job type that hasn't landed yet (it will be added in bug 1528374, the patches from which I've been using to validate the work in bug 1561956), so I don't see tests as blocking us, unless I've misunderstood something here.
Dragos, is the pgo pool ready? Are the problems with non-pgo tests?
Comment 26•5 years ago
(In reply to Dave House [:dhouse] from comment #25)
> Dragos, is the pgo pool ready? Are the problems with non-pgo tests?
Chris, I confirmed with Dragos that the pgo pool is ready for you to use and the problems are not with the pgo tasks. Let Dragos/me know if you see any problems as these go into production.
Comment 27•5 years ago
The user story has a few tasks not completed:
[ ] Generate chain-of-trust keys for tier-3 workers and add to https://github.com/mozilla-releng/scriptworker/blob/master/scriptworker/constants.py#L110
[ ] [bug 1561956] Setup pgo tier3 puppet role and profile with multiuser generic worker
[ ] Re-image and provision PGO mac mini pool
tprince asked about the first task, cot key, in #firefox-ci and in discussion with aki decided to use the existing generic-worker key.
The other two are complete (for pgo) now.
Comment 28•5 years ago
Changed the remaining relops task for macos pgo:
[ ] Acquire and set up cot keys on tier-3 workers. (was: Generate chain-of-trust keys for tier-3 workers and add to https://github.com/mozilla-releng/scriptworker/blob/master/scriptworker/constants.py#L110)
Comment 29•5 years ago
Grenade got the cot key to me.
This mana doc shows a process for adding the cot keys to ami's: https://mana.mozilla.org/wiki/pages/viewpage.action?pageId=34014655
And it mentions putting the public keys in the repo like https://github.com/mozilla-releng/cot-gpg-keys/tree/master/generic-worker/2018-11-13_2019-05-13_gecko-3-b-win2012
Comment 30•5 years ago
(In reply to Dave House [:dhouse] from comment #29)
> Grenade got the cot key to me.
> This mana doc shows a process for adding the cot keys to ami's: https://mana.mozilla.org/wiki/pages/viewpage.action?pageId=34014655
> And it mentions putting the public keys in the repo like https://github.com/mozilla-releng/cot-gpg-keys/tree/master/generic-worker/2018-11-13_2019-05-13_gecko-3-b-win2012
The doc appears to be out of date: https://mana.mozilla.org/wiki/display/RelEng/How+to+redploy+AMIs+and+update+CoT+keys (I had linked to the wrong mana page).
github.com/mozilla-releng/cot-gpg-keys is marked with "This repository has been archived by the owner. It is now read-only."
When it came up in #firefox-ci:
1110 tom.princ+| IIRC, generic-worker has a command that will generate the key. For level-3 workers, it should be generated and provisioned securely, and then it needs to be added to
https://github.com/mozilla-releng/scriptworker/blob/master/src/scriptworker/constants.py#L83 (cc: @aki)
That has the correct key matching the windows l3 key:
https://github.com/mozilla-releng/scriptworker/blob/master/src/scriptworker/constants.py#L83
Checking on the pgo pool, the config has a key set:
/etc/generic-worker/config
"ed25519SigningKeyLocation": "/var/local/generic-worker/generic-worker.ed25519.signing.key",
This generated key for the testers is made in puppet:
https://github.com/mozilla-platform-ops/ronin_puppet/blob/f87120c523cac0c9a35d30dff9cfc660a59e6ef0/modules/generic_worker/manifests/init.pp#L48
This is still in place for the multi-user update to generic-worker: https://github.com/mozilla-platform-ops/ronin_puppet/pull/75/files#diff-fb9e83eb4fe169b0d24255d1dd4a5180R47
The key file is preserved, not generated or replaced, if one exists.
The key is created as root with 600 perms (confirmed generic-worker readme shows creating the key as root for multi-user generic-worker: https://github.com/taskcluster/generic-worker/blob/a9c922fd727f720eafaf4fa7b030f761744e8fa6/README.md#macos---multiusersimple-build)
So we can place it during provisioning and expect it not to be changed, and the root-owner + 600 permissions are correct.
I placed the key manually onto each of t-mojave-r7-{235,236,472}.tier3 workers and rebooted them (no active tasks were running).
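The behavior described above (an existing key file is preserved, and a new one gets 600 permissions) can be sketched as follows. This is not the puppet code: the function name `place_key_if_absent` and the source path in the usage comment are hypothetical, and root ownership only results if the function is run as root.

```shell
#!/bin/sh
# place_key_if_absent SRC DEST
# Copies SRC to DEST with mode 600 unless DEST already exists, in which
# case DEST is left untouched. (Hypothetical helper for illustration.)
place_key_if_absent() {
    src=$1
    dest=$2
    if [ -e "$dest" ]; then
        echo "keeping existing key at $dest" >&2
        return 0
    fi
    # install copies the file and sets its mode in one step; when run as
    # root, the new file is owned by root.
    install -m 600 "$src" "$dest"
}

# Usage as root on a worker (source path is an assumption):
#   place_key_if_absent /path/to/provisioned.key \
#       /var/local/generic-worker/generic-worker.ed25519.signing.key
```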
Comment 31•5 years ago
I placed the key into the vault on install1/install3 for tier3 provisioning, under generic_worker.gecko_3_t_osx_1014.ed25519_signing_key
Comment 32•5 years ago
These have been running for the past day or so without issues -- thanks for your work on this!
Assignee | ||
Updated•5 years ago