Closed Bug 1530732 Opened 5 years ago Closed 4 years ago

Setup a separate mac mini pool for PGO builds

Categories

(Infrastructure & Operations :: RelOps: Posix OS, task)

task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: dividehex, Assigned: dividehex)

References

Details

User Story

Work summary:
[X] [bug 1562289] Netops to setup a new tier3 network segment (vlan)
[X] [bug 1563338] Netops to move ports of dedicated PGO minis to new vlan
[X] [bug 1563333] IT storage virtualization to setup new bsdpy virtual hosts, nic attached to new vlan
[X] [bug 1563340] Setup tier3 bsdpy service on bsdpy virtual hosts
[X] [bug 1563357] Allocate and move 4 additional mac minis for deploystudio to new vlan
[X] Setup and configure deploystudio minis plus dedicated netboot image
[x] Fix existing taskcluster credentials to not grant access to tier-3 workers
[x] Obtain new taskcluster credentials solely for PGO tier3
[x] Acquire and set up cot keys on tier-3 workers. (was: Generate chain-of-trust keys for tier-3 workers and add to https://github.com/mozilla-releng/scriptworker/blob/master/scriptworker/constants.py#L110)
[x] [bug 1561956] Setup pgo tier3 puppet role and profile with multiuser generic worker
[x] Re-image and provision PGO mac mini pool

Attachments

(4 files)

We need to set up a separate, dedicated pool of mac minis in mdc1 and mdc2 to run osx PGO builds (tier 3). Unfortunately, these will come from our existing pool of testers, whose pending count already spikes on a regular basis.

Let's start by determining how big a pool we need and how many minis we can sacrifice from the existing tester pool. Then we'll select a group, add a node def for them, and re-image.

Joel, can you help determine how many minis to allocate from the current testers pool?

See: bug 1515415

Flags: needinfo?(jmaher)

The estimate from bug 1515415 comment 7 is that we need to run a task that will take 5 minutes on average, an average of 500 times in a day.

That is about 42 hours of compute time per day. The minimum is 2 machines; I would say 3 to be safe (which averages out to about 14 hours of compute time per machine per day). My target is at most 20 hours/day per machine, to account for reboots, upgrades, timed-out tasks, etc.
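
The sizing estimate above can be reproduced with a little arithmetic. This is a minimal sketch, assuming the figures from the comment (5 minutes per task, 500 tasks per day) and treating the 20 hours/day target as a per-machine cap; the function name `pool_size` is hypothetical.

```python
import math


def pool_size(tasks_per_day, minutes_per_task, max_busy_hours_per_machine):
    """Return (total_hours, machines_needed, hours_per_machine)."""
    # Total compute demand per day, in hours.
    total_hours = tasks_per_day * minutes_per_task / 60
    # Smallest whole number of machines that keeps each one under the cap.
    machines = math.ceil(total_hours / max_busy_hours_per_machine)
    return total_hours, machines, total_hours / machines


total, machines, per_machine = pool_size(500, 5, 20)
print(f"{total:.1f} h/day total, {machines} machines, {per_machine:.1f} h/day each")
```

With a 24-hour cap the same function gives 2 machines (the stated minimum); with the 20-hour target it gives 3.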

Flags: needinfo?(jmaher)

Hi Jake, is comment 2 feasible? Do you have an estimate of when we will be able to do this?

Flags: needinfo?(jwatkins)

Additionally, make sure that other levels can't access them.

Attached file GitHub Pull Request
Flags: needinfo?(jwatkins)

I've reallocated and re-imaged 4 macs into the new gecko-3-t-osx-1010 worker type. Thanks :tomprince for setting up the scopes!

https://tools.taskcluster.net/provisioners/releng-hardware/worker-types/gecko-3-t-osx-1010

:cmanchester, :mshal, is there anything else needed here with regards to the workers?

Flags: needinfo?(mshal)
Flags: needinfo?(cmanchester)

Thank you, this does seem to be what we need! I'm noticing that testing the straightforward patch for this fails on try due to missing scopes. Tom, should we just make a transform that swaps the new worker type for the regular test worker when running these on try?

Flags: needinfo?(mshal)
Flags: needinfo?(mozilla)
Flags: needinfo?(cmanchester)

:chmanchester I'm working on a patch to abstract this; there are some other places where we want to vary the worker-type, and it doesn't make sense to encode the exact type in every kind.

Flags: needinfo?(mozilla)

Hey Jake,

We now have support for running tasks as unique OS users on macOS in generic-worker.

Any questions, let me know!

Thanks.

Flags: needinfo?(jwatkins)

(In reply to Pete Moore [:pmoore][:pete] from comment #9)

Hey Jake,

We now have support for running tasks as unique OS users on macOS in generic-worker.

Any questions, let me know!

Thanks.

Thanks Pete. I'll work on getting the multiuser gw client into puppet and onto the workers.

Flags: needinfo?(jwatkins)
User Story: (updated)
Depends on: 1562289

(added a couple of more steps)

User Story: (updated)
Depends on: 1563333
User Story: (updated)
Depends on: 1563338
User Story: (updated)
Depends on: 1563340
User Story: (updated)
Depends on: 1563357
User Story: (updated)
User Story: (updated)
Depends on: 1566159
Depends on: 1568299
User Story: (updated)

Updated user story: Deploystudio setup completed. Re-imaging infra is in place.

User Story: (updated)

Since we are migrating to Mojave, the new workerType will be gecko-3-t-osx-1014

User Story: (updated)
Depends on: 1561956

NI :tomprince for:
[ ] Fix existing taskcluster credentials to not grant access to tier-3 workers
[ ] Obtain new taskcluster credentials solely for PGO tier3

Flags: needinfo?(mozilla)

I've created https://tools.taskcluster.net/auth/clients/project%2Freleng%2Freleng-hardware%2Fgecko-3-t-osx-1014 for the workers; you can reset the access token, to get a fresh token to put in puppet.

I've gone through all the roles and clients, and removed the unbounded releng-hardware/* scope from the existing role (and cleaned up a bunch of other unrelated scopes of the same type that were too broad).

Flags: needinfo?(mozilla)
User Story: (updated)

Is there something more needed with the scopes? ("queue:claim-work:releng-hardware/gecko-3-t-osx-1010" ?)
The t-yosemite-r7-{472,236} machines have repeated log entries like:

Oct 25 11:27:11 t-yosemite-r7-472.test.releng.mdc1.mozilla.com generic-worker Response Body:
Oct 25 11:27:11 t-yosemite-r7-472.test.releng.mdc1.mozilla.com generic-worker {
Oct 25 11:27:11 t-yosemite-r7-472.test.releng.mdc1.mozilla.com generic-worker   "code": "InsufficientScopes",
Oct 25 11:27:11 t-yosemite-r7-472.test.releng.mdc1.mozilla.com generic-worker   "message": "Client ID project/releng/generic-worker/os-x/production does not have sufficient scopes and is missing the following scopes:\n\n```\nqueue:claim-work:releng-hardware/gecko-3-t-osx-1010\n```\n\nThis request requires the client to satisfy the following scope expression:\n\n```\n{\n  \"AllOf\": [\n    \"queue:claim-work:releng-hardware/gecko-3-t-osx-1010\",\n    \"queue:worker-id:mdc1/t-yosemite-r7-472\"\n  ]\n}\n```\n\n---\n\n* method:     claimWork\n* errorCode:  InsufficientScopes\n* statusCode: 403\n* time:       2019-10-25T18:27:10.950Z",
Oct 25 11:27:11 t-yosemite-r7-472.test.releng.mdc1.mozilla.com generic-worker   "requestInfo": {
Oct 25 11:27:11 t-yosemite-r7-472.test.releng.mdc1.mozilla.com generic-worker     "method": "claimWork",
Oct 25 11:27:11 t-yosemite-r7-472.test.releng.mdc1.mozilla.com generic-worker     "params": {
Oct 25 11:27:11 t-yosemite-r7-472.test.releng.mdc1.mozilla.com generic-worker       "provisionerId": "releng-hardware",
Oct 25 11:27:11 t-yosemite-r7-472.test.releng.mdc1.mozilla.com generic-worker       "workerType": "gecko-3-t-osx-1010"
Oct 25 11:27:11 t-yosemite-r7-472.test.releng.mdc1.mozilla.com generic-worker     },
Oct 25 11:27:11 t-yosemite-r7-472.test.releng.mdc1.mozilla.com generic-worker     "payload": {
Oct 25 11:27:11 t-yosemite-r7-472.test.releng.mdc1.mozilla.com generic-worker       "tasks": 1,
Oct 25 11:27:11 t-yosemite-r7-472.test.releng.mdc1.mozilla.com generic-worker       "workerGroup": "mdc1",
Oct 25 11:27:11 t-yosemite-r7-472.test.releng.mdc1.mozilla.com generic-worker       "workerId": "t-yosemite-r7-472"
Oct 25 11:27:11 t-yosemite-r7-472.test.releng.mdc1.mozilla.com generic-worker     },
Oct 25 11:27:11 t-yosemite-r7-472.test.releng.mdc1.mozilla.com generic-worker     "time": "2019-10-25T18:27:10.950Z"
Oct 25 11:27:11 t-yosemite-r7-472.test.releng.mdc1.mozilla.com generic-worker   }
Oct 25 11:27:11 t-yosemite-r7-472.test.releng.mdc1.mozilla.com generic-worker }
Oct 25 11:27:11 t-yosemite-r7-472.test.releng.mdc1.mozilla.com generic-worker Attempts: 1
Oct 25 11:27:11 t-yosemite-r7-472.test.releng.mdc1.mozilla.com generic-worker (Permanent) HTTP response code 403
Oct 25 11:27:11 t-yosemite-r7-472.test.releng.mdc1.mozilla.com generic-worker HTTP/1.1 403 Forbidden#012#011
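
The 403 in the log above comes from the `AllOf` scope expression not being satisfied by the client's scopes. A minimal sketch of that check follows, assuming the usual Taskcluster rule that a held scope ending in `*` satisfies any scope with that prefix. It ignores role expansion (`assume:` scopes), and the held scope lists are hypothetical, not the real contents of the client.

```python
def scope_satisfied(required, held):
    """True if a held scope equals `required` or is a `*`-terminated prefix of it."""
    for scope in held:
        if scope == required:
            return True
        if scope.endswith("*") and required.startswith(scope[:-1]):
            return True
    return False


def expression_satisfied(expr, held):
    """Evaluate a scope expression made of strings, AllOf lists, and AnyOf lists."""
    if isinstance(expr, str):
        return scope_satisfied(expr, held)
    if "AllOf" in expr:
        return all(expression_satisfied(e, held) for e in expr["AllOf"])
    if "AnyOf" in expr:
        return any(expression_satisfied(e, held) for e in expr["AnyOf"])
    raise ValueError(f"unsupported scope expression: {expr!r}")


# The expression is copied from the log entry above.
required = {
    "AllOf": [
        "queue:claim-work:releng-hardware/gecko-3-t-osx-1010",
        "queue:worker-id:mdc1/t-yosemite-r7-472",
    ]
}

# Hypothetical client scopes: the worker-id scope is covered, claim-work is not.
held_before = ["queue:worker-id:mdc1/*"]
# Hypothetical fix: additionally hold the missing claim-work scope.
held_after = held_before + ["queue:claim-work:releng-hardware/gecko-3-t-osx-1010"]

print(expression_satisfied(required, held_before))
print(expression_satisfied(required, held_after))
```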

(In reply to Dave House [:dhouse] from comment #17)

Is there something more needed with the scopes? ("queue:claim-work:releng-hardware/gecko-3-t-osx-1010" ?)
The t-yosemite-r7-{472,236} machines have repeated log entries like:


I've silenced 472 and 236 in papertrail for 1 month (expecting we'll resolve the scope issue by then). They each had 2.7+ GB of logs for this month (mostly the repeated claimWork failure).

Dragos will be testing the latest multi-user updates/dev for macos pgo. I'll update the pgo pool to run on the new taskcluster-ci (needs new secrets created and put into each machine -- deploystudio and reimage/place all of the pgo workers).

(In reply to Dave House [:dhouse] from comment #19)

Dragos will be testing the latest multi-user updates/dev for macos pgo. I'll update the pgo pool to run on the new taskcluster-ci (needs new secrets created and put into each machine -- deploystudio and reimage/place all of the pgo workers).

Dragos, these aren't ready yet. I'm sorry, but more is needed than I had anticipated. You could try running in the regular staging pool, and then we can move over to these when they're ready (?). I couldn't reach #471 (I also tried cycling its power, but there is still no ping, so I'll need to ask the datacenter techs to check it).
Here are the machines for the pgo pool:
t-mojave-r7-235.tier3.releng.mdc2.mozilla.com
t-mojave-r7-236.tier3.releng.mdc2.mozilla.com
t-mojave-r7-471.tier3.releng.mdc1.mozilla.com
t-mojave-r7-472.tier3.releng.mdc1.mozilla.com

old tc:
client: project/releng/releng-hardware/gecko-3-t-osx-1014
scopes: assume:worker-type:releng-hardware/gecko-3-t-osx-1014

new tc:
client: project/releng/generic-worker/datacenter-gecko-3-t-osx
scopes: assume:worker-type:releng-hardware/gecko-3-t-osx*

I updated the secrets on the deploystudio servers and on #236 and tried bootstrapping it (with https://raw.githubusercontent.com${PUPPET_FORK}/${PUPPET_BRANCH}/provisioners/macos/bootstrap_mojave.sh)
Additional setup is still needed in deploystudio (workflows and scripts; I netbooted #472 against install3, but it didn't run, and no logs were created, so I think it didn't reach the server), along with some puppet changes (it looks like the firewall needs to be adjusted: running a bootstrap terminated when the pf.mozilla.anchors/rules.conf was applied).

Flags: needinfo?(dcrisan)

Dragos, here is the pgo pool with one worker:
https://firefox-ci-tc.services.mozilla.com/provisioners/releng-hardware/worker-types/gecko-3-t-osx-1014
I had to re-run the bootstrap on 235 and with the second puppet run the firewall didn't cut me off and homebrew installed.

I've also kicked off the netboot of 236 and 472. I had forgotten to netboot against the bsdpy hosts instead of the install hosts. The workflows are not updated, so when they complete I'll run the bootstrap/puppet to pin them to my puppet branch:

# check pre-reqs
xcode-select -p
git --version
puppet --version
# if anything missing, install it. then proceed:
echo gecko_3_t_osx_1014 > /etc/puppet_role
curl --silent -L -O https://raw.githubusercontent.com/davehouse/ronin_puppet/bug1530732_l3-on-ffci/provisioners/macos/bootstrap_mojave.sh
bash bootstrap_mojave.sh
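
The "check pre-reqs" step above could also be scripted. This is a minimal sketch, assuming it is enough to confirm that each tool is on the PATH; it does not replicate `xcode-select -p`, which checks that a developer directory is actually selected. The function name `missing_tools` is hypothetical.

```python
import shutil


def missing_tools(tools, which=shutil.which):
    """Return the tools from `tools` that cannot be found on the PATH."""
    return [tool for tool in tools if which(tool) is None]


# Tool names are taken from the pre-req commands above.
missing = missing_tools(["xcode-select", "git", "puppet"])
if missing:
    print("install before bootstrapping:", ", ".join(missing))
else:
    print("all pre-reqs found; proceed with bootstrap_mojave.sh")
```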
Flags: needinfo?(dcrisan)

Based on https://bugzilla.mozilla.org/show_bug.cgi?id=1561956#c72, is this deployed and ready? Are we clear to use those workers on production repos (and therefore land bug 1528374)? Thanks!

Flags: needinfo?(dhouse)

(In reply to Chris Manchester (:chmanchester) from comment #22)

Based on https://bugzilla.mozilla.org/show_bug.cgi?id=1561956#c72, is this deployed and ready? Are we clear to use those workers on production repos (and therefore land bug 1528374)? Thanks!

It is not ready yet. There are still some problems being worked through. Dragos has the multiuser generic-worker running on the pgo pool, but there are some tests still failing after the move from single to multi-user (if I understand correctly).

Flags: needinfo?(dhouse)

(In reply to Dave House [:dhouse] from comment #23)

(In reply to Chris Manchester (:chmanchester) from comment #22)

Based on https://bugzilla.mozilla.org/show_bug.cgi?id=1561956#c72, is this deployed and ready? Are we clear to use those workers on production repos (and therefore land bug 1528374)? Thanks!

It is not ready yet. There are still some problems being worked through. Dragos has the multiuser generic-worker running on the pgo pool, but there are some tests still failing after the move from single to multi-user (if I understand correctly).

I'm not sure which tests are being referred to there. From the perspective of macOS pgo we need a handful of machines to run a single job type that hasn't landed yet (it will be added in bug 1528374, the patches from which I've been using to validate the work in bug 1561956), so I don't see tests as blocking us, unless I've misunderstood something here.

(In reply to Chris Manchester (:chmanchester) from comment #24)

(In reply to Dave House [:dhouse] from comment #23)

(In reply to Chris Manchester (:chmanchester) from comment #22)

Based on https://bugzilla.mozilla.org/show_bug.cgi?id=1561956#c72, is this deployed and ready? Are we clear to use those workers on production repos (and therefore land bug 1528374)? Thanks!

It is not ready yet. There are still some problems being worked through. Dragos has the multiuser generic-worker running on the pgo pool, but there are some tests still failing after the move from single to multi-user (if I understand correctly).

I'm not sure which tests are being referred to there. From the perspective of macOS pgo we need a handful of machines to run a single job type that hasn't landed yet (it will be added in bug 1528374, the patches from which I've been using to validate the work in bug 1561956), so I don't see tests as blocking us, unless I've misunderstood something here.

Dragos, is the pgo pool ready? Are the problems with non-pgo tests?

Flags: needinfo?(dcrisan)

(In reply to Dave House [:dhouse] from comment #25)

Dragos, is the pgo pool ready? Are the problems with non-pgo tests?

Chris, I confirmed with Dragos that the pgo pool is ready for you to use and the problems are not with the pgo tasks. Let Dragos/me know if you see any problems as these go into production.

Flags: needinfo?(dcrisan) → needinfo?(cmanchester)

The user story has a few tasks not completed:
[ ] Generate chain-of-trust keys for tier-3 workers and add to https://github.com/mozilla-releng/scriptworker/blob/master/scriptworker/constants.py#L110
[ ] [bug 1561956] Setup pgo tier3 puppet role and profile with multiuser generic worker
[ ] Re-image and provision PGO mac mini pool

tprince asked about the first task (the cot key) in #firefox-ci and, in discussion with aki, decided to use the existing generic-worker key.
The other two are now complete (for pgo).

User Story: (updated)
User Story: (updated)

Changed the remaining relops task for macos pgo:
[ ] Acquire and set up cot keys on tier-3 workers. (was: Generate chain-of-trust keys for tier-3 workers and add to https://github.com/mozilla-releng/scriptworker/blob/master/scriptworker/constants.py#L110)

Grenade got the cot key to me.
This mana doc shows a process for adding the cot keys to AMIs: https://mana.mozilla.org/wiki/pages/viewpage.action?pageId=34014655
And it mentions putting the public keys in the repo like https://github.com/mozilla-releng/cot-gpg-keys/tree/master/generic-worker/2018-11-13_2019-05-13_gecko-3-b-win2012

(In reply to Dave House [:dhouse] from comment #29)

Grenade got the cot key to me.
This mana doc shows a process for adding the cot keys to AMIs: https://mana.mozilla.org/wiki/pages/viewpage.action?pageId=34014655
And it mentions putting the public keys in the repo like https://github.com/mozilla-releng/cot-gpg-keys/tree/master/generic-worker/2018-11-13_2019-05-13_gecko-3-b-win2012

The doc appears to be out of date: https://mana.mozilla.org/wiki/display/RelEng/How+to+redploy+AMIs+and+update+CoT+keys (I had linked to the wrong mana page).
github.com/mozilla-releng/cot-gpg-keys is marked with "This repository has been archived by the owner. It is now read-only."

when it came up in #firefox-ci:

1110 tom.princ+| IIRC, generic-worker has a command that will generate the key. For level-3 workers, it should be generated and provisioned securely, and then it needs to be added to
https://github.com/mozilla-releng/scriptworker/blob/master/src/scriptworker/constants.py#L83 (cc: @aki)

That has the correct key matching the windows l3 key:
https://github.com/mozilla-releng/scriptworker/blob/master/src/scriptworker/constants.py#L83

Checking on the pgo pool, the config has a key set:
/etc/generic-worker/config

  "ed25519SigningKeyLocation": "/var/local/generic-worker/generic-worker.ed25519.signing.key",

This generated key for the testers is made in puppet:
https://github.com/mozilla-platform-ops/ronin_puppet/blob/f87120c523cac0c9a35d30dff9cfc660a59e6ef0/modules/generic_worker/manifests/init.pp#L48

This is still in place for the multi-user update to generic-worker: https://github.com/mozilla-platform-ops/ronin_puppet/pull/75/files#diff-fb9e83eb4fe169b0d24255d1dd4a5180R47
The key file is preserved, not generated or replaced, if one exists.
The key is created as root with 600 perms (I confirmed that the generic-worker readme shows creating the key as root for multi-user generic-worker: https://github.com/taskcluster/generic-worker/blob/a9c922fd727f720eafaf4fa7b030f761744e8fa6/README.md#macos---multiusersimple-build).
So we can place it during provisioning and expect it not to be changed, and the root owner + 600 permissions are correct.
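
The ownership and permission expectations described above (root-owned, mode 600) can be verified on a worker after the key is placed. This is a minimal sketch; the function name `key_file_problems` is hypothetical, and the path is the `ed25519SigningKeyLocation` value quoted from the config earlier in this comment.

```python
import os
import stat


def key_file_problems(path, expected_uid=0):
    """Return a list of problems with the key file; an empty list means it looks right."""
    try:
        info = os.stat(path)
    except FileNotFoundError:
        return [f"{path} does not exist"]
    problems = []
    # Compare only the permission bits, ignoring the file-type bits.
    mode = stat.S_IMODE(info.st_mode)
    if mode != 0o600:
        problems.append(f"mode is {mode:o}, expected 600")
    # uid 0 is root; pass a different uid when checking a test file.
    if info.st_uid != expected_uid:
        problems.append(f"owner uid is {info.st_uid}, expected {expected_uid}")
    return problems


key_path = "/var/local/generic-worker/generic-worker.ed25519.signing.key"
for problem in key_file_problems(key_path):
    print(problem)
```

Because puppet preserves an existing key file, a check like this only needs to run once after provisioning, or after any manual placement.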

I placed the key manually onto each of the t-mojave-r7-{235,236,472}.tier3 workers and rebooted them (no active tasks were running).

User Story: (updated)

I placed the key into the vault on install1/install3 for tier3 provisioning, under generic_worker.gecko_3_t_osx_1014.ed25519_signing_key

These have been running for the past day or so without issues -- thanks for your work on this!

Flags: needinfo?(cmanchester)
Status: NEW → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED
Blocks: 1667441
Blocks: 1628333