Closed Bug 1524592 Opened 5 years ago Closed 5 years ago

Upgrade to taskcluster-proxy 5.1.0 and generic-worker 13.0.2 on OCC worker types

Product/Component: Infrastructure & Operations :: RelOps: OpenCloudConfig
Type: task
Priority: Not set
Severity: normal
Tracking: Not tracked
Status: RESOLVED FIXED
Reporter: pmoore
Assignee: grenade
Blocks: 1 open bug
Attachments: 5 files, 1 obsolete file

There are some gecko changes planned which require taskcluster-proxy 5.1.0 and generic-worker 12.0.0:

  • bug 1492618: g-w 12.0.0 is needed for ed25519 chain of trust signing keys
  • bug 1508381: removing hardcoding of TASKCLUSTER_ROOT_URL and TASKCLUSTER_PROXY_URL from task definitions
Depends on: 1518913

Either you can generate the new level 3 ed25519 keypair and send me the public key portion, or I can generate a keypair and send you the private key portion... either way works for me.

i'm currently clearing the current backlog of pull requests using the beta worker types.
i should be able to get those workers testing gw 12 and tc-p 5.1.0 tomorrow.

Attached file GitHub Pull Request

this pr to the cot-gpg-keys repo rotates the gecko-3-b-win2012 cot key and changes the key algorithm to eddsa/Ed25519

Attachment #9042403 - Flags: review?(aki)
  • the two links above are to try pushes that use the beta worker types
  • the beta worker types were built this morning from the occ beta branch using
    • generic-worker 12.0.0
    • taskcluster-proxy 5.1.0
Comment on attachment 9042403 [details] [review]
GitHub Pull Request

We'll need the base64 encoded key format like `generic-worker new-ed25519-keypair` gives us, and the pubkey will go into the scriptworker configs for now. (This is gated on my ed25519 PR for scriptworker; just send me the pubkey string for now.) In the future we may have a public json or yaml file with all the known valid public keys that we vendor into scriptworker; other, non-scriptworker tools could use this config to verify downloads. cot-gpg-keys should hopefully be retired in the next 6 weeks or so.

As mentioned in comment 1, I can generate a keypair or you can, and we can pass the key in. Or we can grab the pubkey from the generation logs or by deriving it from the private key.
Attachment #9042403 - Flags: review?(aki) → review-
Attached file genEd25519.go

Should be able to run this via `go run genEd25519.go`. It writes the private key to ./ed25519-privkey and the public key to stdout.

Attached file getPubKey.go

Should be able to run this via `go run getPubKey.go`. It reads the base64-encoded privkey string from ./ed25519-privkey and writes the base64-encoded pubkey to stdout.
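For readers without access to the attachments, here is a minimal sketch of the two operations described above (generating a keypair, and deriving the pubkey from the private key), using only the Go standard library. It is not the content of genEd25519.go or getPubKey.go. The function names are hypothetical, and storing the 32-byte seed as the private part is an assumption; the sketch also accepts a full 64-byte private key in case that is what the file holds.

```go
package main

import (
	"crypto/ed25519"
	"crypto/rand"
	"encoding/base64"
	"fmt"
	"os"
	"strings"
)

// generateKeypair returns the base64-encoded 32-byte seed (private part) and
// the base64-encoded public key. Keeping only the seed is an assumption of
// this sketch, not something the thread confirms.
func generateKeypair() (privB64, pubB64 string, err error) {
	pub, priv, err := ed25519.GenerateKey(rand.Reader)
	if err != nil {
		return "", "", err
	}
	return base64.StdEncoding.EncodeToString(priv.Seed()),
		base64.StdEncoding.EncodeToString(pub), nil
}

// publicKeyFromPrivate derives the base64-encoded public key from a
// base64-encoded private key (either a 32-byte seed or a 64-byte key).
func publicKeyFromPrivate(privB64 string) (string, error) {
	raw, err := base64.StdEncoding.DecodeString(strings.TrimSpace(privB64))
	if err != nil {
		return "", err
	}
	var priv ed25519.PrivateKey
	switch len(raw) {
	case ed25519.SeedSize:
		priv = ed25519.NewKeyFromSeed(raw)
	case ed25519.PrivateKeySize:
		priv = ed25519.PrivateKey(raw)
	default:
		return "", fmt.Errorf("unexpected private key length %d", len(raw))
	}
	pub := priv.Public().(ed25519.PublicKey)
	return base64.StdEncoding.EncodeToString(pub), nil
}

func main() {
	privB64, pubB64, err := generateKeypair()
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	// Round trip: the derived public key should equal the generated one.
	derived, err := publicKeyFromPrivate(privB64)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println(pubB64)
	fmt.Println(derived == pubB64)
}
```

The sketch prints the public key rather than writing anything to disk; persisting the private part (for example to ./ed25519-privkey with restrictive permissions) is left to the caller.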

Attachment #9042613 - Attachment mime type: application/octet-stream → text/plain
Attachment #9042610 - Attachment mime type: application/octet-stream → text/plain
Summary: Upgrade to taskcluster-proxy 5.1.0 and generic-worker 12.0.0 on OCC worker types → Upgrade to taskcluster-proxy 5.1.0 and generic-worker 13.0.2 on OCC worker types
See Also: → 1527970

aki, new public key for gecko-3-b-win2012 is:
6UPrVTyw0EPQV7bCEMXo+5jNR4clbK55JWG74bBJHZQ=

should i make a pr for the cot repo or are we doing it differently now?

Flags: needinfo?(aki)

Thanks! No, for now this will go into the scriptworker ed25519 PR.
Once we retire gpg cot we can retire the cot-gpg-keys repo.

Flags: needinfo?(aki)

Thanks Rob!

Important deployment note

The old workers will be routinely checking .secrets.generic-worker.config.deploymentId to see whether it has changed and they should therefore shut down.

However, the new deployment updates the .userData.genericWorker.config.deploymentId property instead. Removing the contents of the secrets section isn't enough to trigger the old workers to shut down.

Therefore, the old workers won't notice the new deployment, so you'll need to either manually update the old deploymentId (.secrets.generic-worker.config.deploymentId) or just kill the old workers, if it is important that they don't stay around too long.

Credit to :SimonSapin for diagnosing this issue! He hit it when updating the servo-win2016 worker type.

Of course this only applies to the very first upgrade. Once the workers are running v13, future upgrades won't be affected, as the workers will be checking .userData.genericWorker.config.deploymentId and the deployments will be updating .userData.genericWorker.config.deploymentId. The issue occurs only when crossing the v13 boundary.
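To make the mismatch concrete, here is a small sketch of the two lookups. The two property paths are the ones named above; everything else (the helper names, the sample worker type definition, the "old-id"/"new-id" values, and the rule that a missing value counts as "no change") is an assumption made for illustration, chosen to be consistent with the observation that emptying the secrets section does not trigger a shutdown. It is not generic-worker's actual code.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// lookup walks a decoded JSON object along path and returns the string found
// there, or "" if any step is missing or not a string.
func lookup(doc map[string]interface{}, path ...string) string {
	var cur interface{} = doc
	for _, key := range path {
		m, ok := cur.(map[string]interface{})
		if !ok {
			return ""
		}
		cur = m[key]
	}
	s, _ := cur.(string)
	return s
}

// Path checked by workers older than v13.
func oldWorkerDeploymentID(def map[string]interface{}) string {
	return lookup(def, "secrets", "generic-worker", "config", "deploymentId")
}

// Path checked by v13 workers, and the one new deployments update.
func newWorkerDeploymentID(def map[string]interface{}) string {
	return lookup(def, "userData", "genericWorker", "config", "deploymentId")
}

// shouldShutDown models the check (assumption): a worker stops only when it
// sees a non-empty deploymentId that differs from the one it started with.
func shouldShutDown(startedWith, seen string) bool {
	return seen != "" && seen != startedWith
}

func mustParse(s string) map[string]interface{} {
	var def map[string]interface{}
	if err := json.Unmarshal([]byte(s), &def); err != nil {
		panic(err)
	}
	return def
}

// Hypothetical worker type definition after a v13-style deployment: the
// secrets section has been emptied and the ID lives under userData.
const afterDeploy = `{
  "secrets": {},
  "userData": {"genericWorker": {"config": {"deploymentId": "new-id"}}}
}`

func main() {
	def := mustParse(afterDeploy)
	// A pre-v13 worker that started with "old-id" still reads the secrets path.
	fmt.Println(shouldShutDown("old-id", oldWorkerDeploymentID(def))) // false
	// A v13 worker reading the userData path would notice the change.
	fmt.Println(shouldShutDown("old-id", newWorkerDeploymentID(def))) // true
}
```

Under these assumptions, writing the new ID to the old secrets path as well is what makes the pre-v13 check fire, which matches the manual workaround described above.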

Thanks!

Flags: needinfo?(rthijssen)
Flags: needinfo?(rthijssen)
Attached file GitHub Pull Request
Assignee: nobody → rthijssen
Attachment #9044888 - Flags: review?(mcornmesser)
Comment on attachment 9044888 [details] [review]
GitHub Pull Request

jmaher: in the couple of pushes i've tested gw 13, there are some test failures i was hoping you could comment on:
- windows 10 (opt & debug)
  jittest: fails consistently:
  [1](https://treeherder.mozilla.org/#/jobs?repo=try&revision=8c96605f57e701e130b92a62649750ae33396811&selectedJob=228429884)
  [2](https://treeherder.mozilla.org/#/jobs?repo=try&revision=84d83d08d74b15632e9bd236d58ae4ad8ea274a2&selectedJob=229189370)
- windows 7 (opt only)
  [jsreftest 1](https://treeherder.mozilla.org/#/jobs?repo=try&revision=84d83d08d74b15632e9bd236d58ae4ad8ea274a2&selectedJob=229196459): failed on latest rebase only
  [worked on thursday's rebase](https://treeherder.mozilla.org/#/jobs?repo=try&revision=8c96605f57e701e130b92a62649750ae33396811&selectedJob=228441346&group_state=expanded)

are these known failures? are you happy for us to proceed with the windows worker updates regardless of these two failures? for the record, i don't think it's related to the gw upgrade, but if they're not known failures, it may be related to the issue we had in bug 1527970 and i can investigate further.
Attachment #9044888 - Flags: feedback?(jmaher)

we currently fail to run jittest on windows10 and it isn't run in general (we switched this off in place of spidermonkey, but this was an accident).

the jsreftest is a new issue for me, I see it broken on mozilla-central:
https://treeherder.mozilla.org/#/jobs?repo=mozilla-central&tier=1%2C2%2C3&searchStr=jsreftest%2Cwindows7-32&fromchange=b29c87add05f735b250612ca2444103652750091

so the gw upgrade doesn't induce anything new.

Attachment #9044888 - Flags: review?(mcornmesser) → review+

gw update deployment in progress...

planned deployment order is:

  • 08:00 UTC: gecko-t-win10-64-hw, gecko-t-win10-64-ux
  • 09:00 UTC: gecko-t-win10-64, gecko-t-win10-64-gpu, gecko-t-win7-32, gecko-t-win7-32-gpu
  • 10:00 UTC: gecko-1-b-win2012, gecko-2-b-win2012, gecko-3-b-win2012

timings will change if anything doesn't go smoothly or if we rollback at some stage due to issues.

there are issues with missing configuration settings on gecko-t-win10-64-hw & gecko-t-win10-64-ux. i have patched these and am waiting to see if the patch succeeds.

deployment to gecko-t-win10-64, gecko-t-win10-64-gpu, gecko-t-win7-32, gecko-t-win7-32-gpu, gecko-1-b-win2012, gecko-2-b-win2012, gecko-3-b-win2012 is delayed until the issues are resolved.

Attachment #9044888 - Flags: feedback?(jmaher) → feedback+

rolled back gw to 10.11.2 and tc-proxy to 4.1 on win 10 hardware. will reattempt the upgrade tomorrow if i can fix the config issues.

Doh. Let me know if you need a hand with anything.

Blocks: 1375195
Depends on: 1530636

i've had to revert gecko-t-win10-64-gpu as something is wrong with gw configuration on ec2 instances as well.

papertrail shows this:

Feb 26 15:35:29 i-0096aed52a1cb4c8e.gecko-t-win10-64-gpu.use1.mozilla.com generic-worker: Making system call GetProfilesDirectoryW with args: [0 C042021358] 
Feb 26 15:35:29 i-0096aed52a1cb4c8e.gecko-t-win10-64-gpu.use1.mozilla.com generic-worker:   Result: 0 0 The data area passed to a system call is too small. 
Feb 26 15:35:29 i-0096aed52a1cb4c8e.gecko-t-win10-64-gpu.use1.mozilla.com generic-worker: Making system call GetProfilesDirectoryW with args: [C042017E20 C042021358] 
Feb 26 15:35:29 i-0096aed52a1cb4c8e.gecko-t-win10-64-gpu.use1.mozilla.com generic-worker:   Result: 1 0 The operation completed successfully. 
Feb 26 15:35:29 i-0096aed52a1cb4c8e.gecko-t-win10-64-gpu.use1.mozilla.com generic-worker: Loading generic-worker config file 'C:\generic-worker\generic-worker.config'... 
Feb 26 15:35:29 i-0096aed52a1cb4c8e.gecko-t-win10-64-gpu.use1.mozilla.com generic-worker: Error loading configuration: open C:\generic-worker\generic-worker.config: The system cannot find the file specified. 

i thought that gw was generating its own configuration files when we run generic-worker.exe run --configure-for-aws but obviously something isn't right.

Comment on attachment 9046706 [details] [diff] [review]
Github Pull Request for OpenCloudConfig (added --configure-for-aws)

Review of attachment 9046706 [details] [diff] [review]:
-----------------------------------------------------------------

thanks, i'm retesting on beta presently.
Attachment #9046706 - Attachment is patch: true
Attachment #9046706 - Attachment mime type: text/x-github-pull-request → text/plain
Attachment #9046706 - Flags: review?(rthijssen) → review+

On reflection, I don't think that was the issue. This log line does not include --configure-for-aws:

Feb 26 14:23:52 i-0096aed52a1cb4c8e.gecko-t-win10-64-gpu.use1.mozilla.com generic-worker-service: C:\generic-worker>.\generic-worker.exe run  1>.\generic-worker.log 2>&1 

That suggests C:\generic-worker\run-generic-worker.bat didn't get replaced with the OCC version of run-generic-worker.bat on gecko-t-win10-64-gpu.

The PR from comment 30 does no harm. It isn't strictly needed, but it is still desirable because it causes C:\generic-worker\run-generic-worker.bat to be generated correctly.

In OCC, we replace C:\generic-worker\run-generic-worker.bat in GenericWorkerStateWait with this file, so applying the PR now will not really impact the final machine images. However, it is a little more explicit and informative to include --configure-for-aws at installation time. If at some point we choose not to patch the generated C:\generic-worker\run-generic-worker.bat with the OCC version, having the PR already landed would mean that the script runs generic-worker with valid parameters. So it feels cleaner to include this change set now, for future-proofing.

In conclusion, it appears the failure was that GenericWorkerStateWait didn't run or failed for some reason on gecko-t-win10-64-gpu.
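The log-line check used in this comment can be scripted when there are many instances to look at. This is only a sketch: the function name is hypothetical, and it assumes the service logs the command line as whitespace-separated tokens in the same shape as the papertrail line quoted in this comment.

```go
package main

import (
	"fmt"
	"strings"
)

// launchedWithFlag reports whether a logged command line invokes
// generic-worker.exe and, if it does, whether flag appears as a separate
// argument after the executable.
func launchedWithFlag(logLine, flag string) (invoked, hasFlag bool) {
	fields := strings.Fields(logLine)
	for i, f := range fields {
		if strings.HasSuffix(f, "generic-worker.exe") {
			invoked = true
			for _, arg := range fields[i+1:] {
				if arg == flag {
					hasFlag = true
				}
			}
		}
	}
	return
}

func main() {
	// Command portion of the log line from gecko-t-win10-64-gpu.
	line := `generic-worker-service: C:\generic-worker>.\generic-worker.exe run  1>.\generic-worker.log 2>&1`
	invoked, hasFlag := launchedWithFlag(line, "--configure-for-aws")
	fmt.Println(invoked, hasFlag) // true false
}
```

A result of `true false` corresponds to the situation diagnosed here: generic-worker was started, but without --configure-for-aws.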

(In reply to Rob Thijssen [:grenade (EET)] from comment #31)

> Comment on attachment 9046706 [details] [diff] [review]
> Github Pull Request for OpenCloudConfig (added --configure-for-aws)
>
> Review of attachment 9046706 [details] [diff] [review]:
>
> thanks, i'm retesting on beta presently.

Just a heads up that the first push accidentally included gecko-t-win7-32-hw (which doesn't run in AWS) so I removed the changes to it and force pushed...

windows 7 rollout is complete. amis are live and running gw 13 now

Any luck with the builds? I don't see the new ed25519 cot artifacts on autoland, which I'm guessing means they're running older AMIs.

(In reply to Aki Sasaki [:aki] from comment #35)

> Any luck with the builds? I don't see the new ed25519 cot artifacts on autoland, which I'm guessing means they're running older AMIs.

windows 10 is in progress now. 2012 will follow...

windows 10 rollout is complete. amis are live and running gw 13 now

gecko-1-b-win2012 & gecko-2-b-win2012 gw 13.0.2 / tc-proxy 5.1.0 upgrade in progress...

gecko-1-b-win2012 & gecko-2-b-win2012 rollout is complete. amis are live and running gw 13 now

gecko-3-b-win2012, gecko-3-b-win2012-c4 & gecko-3-b-win2012-c5 (and non-prod stragglers) updates will proceed tomorrow morning (08:00 GMT)

gecko-3-b-win2012, gecko-3-b-win2012-c4 & gecko-3-b-win2012-c5 rollout is complete. amis are live and running gw 13 now

We're getting close:

aws-provisioner-v1/win2012r2-cu:                                           generic-worker 13.0.2
aws-provisioner-v1/gecko-2-b-win2012:                                      generic-worker 13.0.2
aws-provisioner-v1/servo-win2016-staging:                                  generic-worker 13.0.2
aws-provisioner-v1/gecko-t-win10-64-cu:                                    generic-worker 13.0.2
aws-provisioner-v1/nss-win2012r2:                                          generic-worker 13.0.2
aws-provisioner-v1/deepspeech-win:                                         generic-worker 13.0.3
aws-provisioner-v1/gecko-t-win10-64:                                       generic-worker 13.0.2
aws-provisioner-v1/gecko-t-win7-32-gpu-b:                                  generic-worker 13.0.2
aws-provisioner-v1/gecko-t-win7-32-beta:                                   generic-worker 13.0.2
aws-provisioner-v1/gecko-3-b-win2012-c5:                                   generic-worker 13.0.2
aws-provisioner-v1/gecko-t-win7-32:                                        generic-worker 13.0.2
aws-provisioner-v1/gecko-t-win10-64-beta:                                  generic-worker 13.0.2
aws-provisioner-v1/win2012r2:                                              generic-worker 13.0.2
aws-provisioner-v1/gecko-3-b-win2012-c4:                                   generic-worker 13.0.2
aws-provisioner-v1/gecko-t-win10-64-gpu-a:                                 generic-worker 11.1.0
aws-provisioner-v1/gecko-3-b-win2012:                                      generic-worker 13.0.2
aws-provisioner-v1/gecko-t-win7-32-gpu:                                    generic-worker 13.0.2
aws-provisioner-v1/gecko-t-win7-32-cu:                                     generic-worker 13.0.2
aws-provisioner-v1/gecko-1-b-win2012:                                      generic-worker 13.0.2
aws-provisioner-v1/gecko-t-win10-64-gpu:                                   generic-worker 13.0.2
aws-provisioner-v1/gecko-t-win10-64-gpu-b:                                 generic-worker 13.0.2
aws-provisioner-v1/gecko-1-b-win2012-beta:                                 generic-worker 13.0.2
aws-provisioner-v1/servo-win2016:                                          generic-worker 13.0.2
aws-provisioner-v1/nss-win2012r2-new:                                      generic-worker 13.0.2
aws-provisioner-v1/gecko-t-win10-64-alpha:                                 generic-worker 11.1.0

So I think the only ones that are still needed are:

aws-provisioner-v1/gecko-t-win10-64-gpu-a:                                 generic-worker 11.1.0
aws-provisioner-v1/gecko-t-win10-64-alpha:                                 generic-worker 11.1.0
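The "still needed" list can be derived mechanically from a report like the one above. Here is a rough sketch, assuming each line has the form `<provisioner>/<workerType>: generic-worker <major.minor.patch>`; the function names are hypothetical and this is not how the report itself was produced.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseVersion turns "13.0.2" into [13 0 2].
func parseVersion(s string) ([3]int, error) {
	var v [3]int
	parts := strings.Split(s, ".")
	if len(parts) != 3 {
		return v, fmt.Errorf("want major.minor.patch, got %q", s)
	}
	for i, p := range parts {
		n, err := strconv.Atoi(p)
		if err != nil {
			return v, err
		}
		v[i] = n
	}
	return v, nil
}

// less reports whether version a is older than version b.
func less(a, b [3]int) bool {
	for i := range a {
		if a[i] != b[i] {
			return a[i] < b[i]
		}
	}
	return false
}

// outdated returns the worker types in report whose generic-worker version is
// below minimum, in the order they appear.
func outdated(report, minimum string) ([]string, error) {
	minV, err := parseVersion(minimum)
	if err != nil {
		return nil, err
	}
	var out []string
	for _, line := range strings.Split(report, "\n") {
		line = strings.TrimSpace(line)
		if line == "" {
			continue
		}
		parts := strings.SplitN(line, ":", 2)
		if len(parts) != 2 {
			return nil, fmt.Errorf("no ':' in line %q", line)
		}
		fields := strings.Fields(parts[1])
		if len(fields) != 2 || fields[0] != "generic-worker" {
			return nil, fmt.Errorf("unexpected line %q", line)
		}
		v, err := parseVersion(fields[1])
		if err != nil {
			return nil, err
		}
		if less(v, minV) {
			out = append(out, parts[0])
		}
	}
	return out, nil
}

func main() {
	// A short excerpt of the report above.
	report := `
aws-provisioner-v1/deepspeech-win:                 generic-worker 13.0.3
aws-provisioner-v1/gecko-t-win10-64-gpu-a:         generic-worker 11.1.0
aws-provisioner-v1/gecko-3-b-win2012:              generic-worker 13.0.2
aws-provisioner-v1/gecko-t-win10-64-alpha:         generic-worker 11.1.0
`
	names, err := outdated(report, "13.0.2")
	if err != nil {
		panic(err)
	}
	for _, n := range names {
		fmt.Println(n)
	}
}
```

Versions are compared numerically per component rather than as strings, so 13.0.10 would correctly sort after 13.0.2.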

Hey Rob,

Are you ok to upgrade these last two?

aws-provisioner-v1/gecko-t-win10-64-gpu-a:                                 generic-worker 11.1.0
aws-provisioner-v1/gecko-t-win10-64-alpha:                                 generic-worker 11.1.0

Also, any idea what might be wrong here?

Thanks!

Flags: needinfo?(rthijssen)

(In reply to Pete Moore [:pmoore][:pete] from comment #42)

> Hey Rob,
>
> Are you ok to upgrade these last two?
>
> aws-provisioner-v1/gecko-t-win10-64-gpu-a:                                 generic-worker 11.1.0
> aws-provisioner-v1/gecko-t-win10-64-alpha:                                 generic-worker 11.1.0

I'm deploying these in https://github.com/mozilla-releng/OpenCloudConfig/commit/5ce58ec41b0923af79c3e8a005a4d908dba040f8

> Also, any idea what might be wrong here?

I created bug 1533402 for this. It turned out to be an issue in the archiver library:

Depends on: 1533402

Deploying generic-worker 13.0.4 to gecko-*-win*-{b,beta,cu} worker types in https://tools.taskcluster.net/groups/ZYTaoWLCSQu8qg4BIJzZ3g

(In reply to Pete Moore [:pmoore][:pete] from comment #44)

> Deploying generic-worker 13.0.4 to gecko-*-win*-{b,beta,cu} worker types in https://tools.taskcluster.net/groups/ZYTaoWLCSQu8qg4BIJzZ3g

Due to a bug, that deployment didn't go well (it deployed generic-worker 13.0.2 instead of generic-worker 13.0.4), so I've triggered another deployment in https://tools.taskcluster.net/groups/KTuN0josS6SDidd9QQYLvQ

(In reply to Pete Moore [:pmoore][:pete] from comment #46)

> Try push with generic-worker 13.0.4:
>
> https://treeherder.mozilla.org/#/jobs?repo=try&revision=0d5a0411f4070cb119ce8c4b48fb714e972844d4

Hey Rob,

This try push with 13.0.4 looks fine to me, do you see any reason for us not to update?

Note, the primary reason to upgrade was to fix gecko-t-win7-32-cu, used by generic-worker CI, when mounting archives that contained hard links - so we don't really gain anything by upgrading the gecko worker types, other than just keeping them up-to-date.

So I'll let you make the call if you think it is worth it or not.

Thanks!

no objections from me, looks good.

Flags: needinfo?(rthijssen)
Attachment #9050411 - Flags: review?(rthijssen)

I'm going to mark this as resolved, and put 13.0.2 -> 13.0.4 in a separate bug, as we're already on 13.0.2 or higher on all the worker types now.

Attachment #9050411 - Flags: review?(rthijssen)
Attachment #9050411 - Attachment is obsolete: true
Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED