Closed
Bug 1443589
Opened 7 years ago
Closed 6 years ago
upgrade generic worker to 10.8.5 on Win 10 hardware
Categories
(Infrastructure & Operations :: RelOps: General, task)
Infrastructure & Operations
RelOps: General
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: markco, Assigned: markco)
References
Details
Attachments
(6 files)
No description provided.
Assignee
Updated•7 years ago
Assignee: relops → mcornmesser
Comment 1•7 years ago
Any ETA for this?
Assignee
Comment 3•7 years ago
There is a test pool of 25 nodes up. I am hoping to expand it early next week.
Flags: needinfo?(mcornmesser)
Assignee
Updated•7 years ago
Summary: upgrade generic worker to 10.6.0 on Win 10 hardware → upgrade generic worker to 10.x.x (match versions on AWS testers) on Win 10 hardware
Assignee
Updated•7 years ago
Assignee
Comment 4•7 years ago
I am going to bump this up on my priority list for this week.
A few weeks ago I was able to get GenericWorker 10.8.1 installed and running on hardware. However, once I put it out in the wild it began to burn tasks. At the time the failures were test failures, not GenericWorker errors. When this happened I did not have the time to really dive into it.
I have t-w1064-ms-077 spinning up using https://github.com/mozilla-platform-ops/OpenCloudConfig , which will install 10.8.1.
I am using the old SCL3-puppet provisioner, so that I can manually retrigger tests and point them at this particular machine.
I am hoping to either have this resolved or gather enough info so I can pull in help during the work week.
Comment 5•7 years ago
++
Thanks Mark.
Assignee
Comment 6•7 years ago
A copy of the conf file being used
Assignee
Comment 7•7 years ago
Ran a small handful of random hardware tests yesterday, with one test that had previously passed now failing.
I am going to run through all the hardware tests today and will open additional bugs for failing tests.
Other notes:
I did not need to use the scl3-puppet provisioner. Worker types gecko-t-* will work with the releng-hardware provisioner.
With all the tests failing the last time I attempted to test the updated worker, it may have been local network or DNS issues that affected the tests.
Assignee
Comment 8•7 years ago
There were no apparent issues with Talos Tests. All tests were green:
test-windows7-32/opt-talos-chrome-e10s https://tools.taskcluster.net/groups/WVXB-YEpSsWA4K6n24G7ag/tasks/WVXB-YEpSsWA4K6n24G7ag/details
test-windows7-32/opt-talos-dromaeojs-e10s https://tools.taskcluster.net/groups/d-XtOWDMS-yKyVIhQa4Xwg/tasks/d-XtOWDMS-yKyVIhQa4Xwg/details
test-windows7-32/opt-talos-damp-e10s https://tools.taskcluster.net/groups/KER9T-WdQCqbZ0cmIJWIVA/tasks/KER9T-WdQCqbZ0cmIJWIVA/details
test-windows7-32/opt-talos-g4-e10s https://tools.taskcluster.net/groups/JDdMx41VRSqSY2l-c9YnCQ/tasks/JDdMx41VRSqSY2l-c9YnCQ/details
test-windows7-32/opt-talos-g5-e10s https://tools.taskcluster.net/groups/YleIIJM0TRCesQF6DMffmQ/tasks/YleIIJM0TRCesQF6DMffmQ/details
test-windows7-32/opt-talos-h1-e10s https://tools.taskcluster.net/groups/TABdIBQkSCCrnfuUowsg5g/tasks/TABdIBQkSCCrnfuUowsg5g/details
test-windows7-32/opt-talos-perf-reftest-e10s https://tools.taskcluster.net/groups/MCXk-5MQSe-3kfcyYrDTDQ/tasks/MCXk-5MQSe-3kfcyYrDTDQ/details
test-windows7-32/opt-talos-perf-reftest-singletons-e10s https://tools.taskcluster.net/groups/fw5utBIkQNec_8haDpBGaw/tasks/fw5utBIkQNec_8haDpBGaw/details
test-windows7-32/opt-talos-speedometer-e10s https://tools.taskcluster.net/groups/Aqk1ydjJRsyICRkPZ1ihFw/tasks/Aqk1ydjJRsyICRkPZ1ihFw/details
test-windows7-32/opt-talos-tp5o-e10s https://tools.taskcluster.net/groups/ekz1VHmET9iOrQtvu3_bFQ/tasks/ekz1VHmET9iOrQtvu3_bFQ/details
test-windows7-32/opt-talos-tp6-e10s https://tools.taskcluster.net/groups/EtFOerjYRz6FpnNDxhsWXw/tasks/EtFOerjYRz6FpnNDxhsWXw/details
test-windows7-32/opt-talos-tps-e10s https://tools.taskcluster.net/groups/VJtq2fmQTM2kV-9DKmOLDw/tasks/VJtq2fmQTM2kV-9DKmOLDw/details
test-windows10-64/opt-talos-chrome-e10s https://tools.taskcluster.net/groups/Qwm0kL87Q82dWyYQmWGa-w/tasks/Qwm0kL87Q82dWyYQmWGa-w/details
test-windows10-64/opt-talos-dromaeojs-e10s https://tools.taskcluster.net/groups/f7camATqTMWoiViB7HEGlA/tasks/f7camATqTMWoiViB7HEGlA/details
test-windows10-64/opt-talos-damp-e10s https://tools.taskcluster.net/groups/KcdAenecRsyi_77BtnTGNw/tasks/KcdAenecRsyi_77BtnTGNw/details
test-windows10-64/opt-talos-g1-e10s https://tools.taskcluster.net/groups/Rc-q9rmxTHa1Hb8-fb7cFg/tasks/Rc-q9rmxTHa1Hb8-fb7cFg/details
test-windows10-64/opt-talos-g4-e10s https://tools.taskcluster.net/groups/HTwW_FXDR2uQrUWWUprosA/tasks/HTwW_FXDR2uQrUWWUprosA/details
test-windows10-64/opt-talos-g5-e10s https://tools.taskcluster.net/groups/ULh04Dd9QkuvUrrc3DBU0A/tasks/ULh04Dd9QkuvUrrc3DBU0A/details
test-windows10-64/opt-talos-h1-e10s https://tools.taskcluster.net/groups/ACwZbgrXRAiNWk5UwE5LLw/tasks/ACwZbgrXRAiNWk5UwE5LLw/details
test-windows10-64/opt-talos-perf-reftest-e10s https://tools.taskcluster.net/groups/ae68QD0zRRmVSfuM_rgFbw/tasks/ae68QD0zRRmVSfuM_rgFbw/details
test-windows10-64/opt-talos-perf-reftest-singletons-e10s https://tools.taskcluster.net/groups/UdjKpx_tQKuEbfGhieDbHQ/tasks/UdjKpx_tQKuEbfGhieDbHQ/details
test-windows10-64/opt-talos-speedometer-e10s https://tools.taskcluster.net/groups/fg0NhJUoRhyBenVUJs-LAg/tasks/fg0NhJUoRhyBenVUJs-LAg/details
test-windows10-64/opt-talos-tp5o-e10s https://tools.taskcluster.net/groups/W46xJWCkSL-7kdQdvbj_sA/tasks/W46xJWCkSL-7kdQdvbj_sA/details
test-windows10-64/opt-talos-tp6-e10s https://tools.taskcluster.net/groups/IohJ4iuzQNio0q1fsMRtzA/tasks/IohJ4iuzQNio0q1fsMRtzA/details
test-windows10-64/opt-talos-tps-e10s https://tools.taskcluster.net/groups/JPDkd_bWQtCqJkx3QcsoKw/tasks/JPDkd_bWQtCqJkx3QcsoKw/details
Assignee
Comment 9•7 years ago
Reftest as well:
test-windows10-64/opt-reftest-e10s-1 https://tools.taskcluster.net/groups/PkIubl-8R1eRyvrsqfJuQg/tasks/PkIubl-8R1eRyvrsqfJuQg/details
test-windows10-64/opt-reftest-e10s-2 https://tools.taskcluster.net/groups/V4NTRYHxTE-OrLeGT9Gx2Q/tasks/V4NTRYHxTE-OrLeGT9Gx2Q/details
Assignee
Comment 10•7 years ago
I am setting up 2 nodes, ms-135 and ms-81, to run as the gecko-t-win10-64-hw worker type with GenericWorker 10.8.1. I am not too concerned with the functionality of the worker as much as the extended life of the nodes. After a mass of tests are run on these nodes, I will go back through and look at the state of each node concerning disk space and scheduled tasks. After this I will update the OCC testing repo and roll out a 5-machine test pool.
For the testing repo I am going to include functions similar to those covered in https://bugzilla.mozilla.org/show_bug.cgi?id=1451837. This includes disk space management and datacenter location decisions.
The next sticky part is figuring out a method by which we can do an incremental rollout of the upgraded worker. I am thinking about creating a second Win 10 OCC manifest and creating a flag during the initial deployment that OCC can check when deciding which manifest to use.
Assignee
Comment 11•7 years ago
The 2 test nodes are up, picking up tasks, and passing tests.
Comment 12•7 years ago
(In reply to Mark Cornmesser [:markco] from comment #11)
> The 2 test nodes are up, picking up tasks, and passing tests.
Great news!
Assignee
Comment 13•7 years ago
To upgrade to 10.8.4:
There is a new manifest and a conditional clause looking for a GW 10 flag file from deployment. The reason for this is so we can do an incremental upgrade without breaking the entire hardware pool.
This also incorporates hard disk cleanup from Bug 1451837, as well as removal of KTS from Bug 1454759.
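The flag-file check behind the incremental rollout can be sketched roughly as follows. This is a bash illustration only: the real logic lives in OCC's PowerShell/batch tooling, and the flag-file path is an assumption; the two manifest names are the ones that exist in the OCC repo.

```shell
#!/bin/bash
# Sketch of incremental-rollout manifest selection: pick the GW 10
# manifest only if a flag file was dropped during initial deployment,
# so un-flagged nodes keep using the existing GW 8 manifest.
select_manifest() {
  local flag_file="$1"
  if [ -f "$flag_file" ]; then
    echo "gecko-t-win10-64-hw-GW10.json"   # upgraded worker pool
  else
    echo "gecko-t-win10-64-hw.json"        # existing GW 8 pool
  fi
}

# The flag path below is a hypothetical example, not OCC's actual path.
select_manifest "/c/dsc/gw10.flag"
```

The point of the design is that re-imaging a node with the flag opts it into the new manifest, while every other node is untouched, so a bad manifest cannot take out the whole hardware pool at once.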
Attachment #8985274 - Flags: feedback?(rthijssen)
Comment 14•7 years ago
Comment on attachment 8985274 [details]
OCC win 10 hw GenericWorker upgrade
r+, looks good to me.
in function hw-DiskManage, there's a reboot that references the lock file. the path to the lock file will need to be passed into the function as a parameter.
here's an example:
https://github.com/mozilla-releng/OpenCloudConfig/blob/5963cda/userdata/rundsc.ps1#L420
Attachment #8985274 - Flags: feedback?(rthijssen) → feedback+
Assignee
Comment 15•7 years ago
Bumping GW version to 10.8.5.
Assignee
Comment 16•7 years ago
Moved back to 10.8.4
Assignee
Comment 17•7 years ago
Verbally R+ by grenade.
Holding off on merging until current issues are resolved.
Attachment #8985473 - Flags: review+
Comment 18•7 years ago
(In reply to Mark Cornmesser [:markco] from comment #16)
> Moved back to 10.8.4
10.8.5 should be fine - the production Windows issues from today are unrelated.
Flags: needinfo?(mcornmesser)
Assignee
Comment 19•7 years ago
Rgr. I have bumped the test pool to 10.8.5. If all looks good tomorrow I will land the patch. The nodes will need to be reimaged for this to take effect.
Flags: needinfo?(mcornmesser)
Assignee
Comment 20•7 years ago
Patch landed.
Assignee
Comment 21•7 years ago
CiDuty will start on a rolling install.
Comment 22•7 years ago
:apop will be handling the re-images.
I have CCed the ciduty@m.c Bugzilla account (under the same email; you can even NeedInfo it!) so we can all have better visibility into the process.
Adrian will come back with updates, as he has them.
Comment 23•7 years ago
reimaged the following moonshots:
T-W1064-MS-016, T-W1064-MS-017, T-W1064-MS-018, T-W1064-MS-019, T-W1064-MS-020, T-W1064-MS-021, T-W1064-MS-022, T-W1064-MS-023, T-W1064-MS-024, T-W1064-MS-025, T-W1064-MS-026, T-W1064-MS-027, T-W1064-MS-028, T-W1064-MS-029, T-W1064-MS-030, T-W1064-MS-031, T-W1064-MS-032, T-W1064-MS-035, T-W1064-MS-036, T-W1064-MS-037, T-W1064-MS-038, T-W1064-MS-039, T-W1064-MS-040, T-W1064-MS-041, T-W1064-MS-042, T-W1064-MS-043, T-W1064-MS-044, T-W1064-MS-045.
Comment 24•7 years ago
Today we re-imaged the following moonshots:
T-W1064-MS-{061..090}
T-W1064-MS-{106..135}
T-W1064-MS-{151..171}
Comment 25•7 years ago
\o/
Thanks guys!
Comment 26•7 years ago
(In reply to Adrian Pop from comment #23)
> reimaged the following moonshots:
>
> T-W1064-MS-016, T-W1064-MS-017, T-W1064-MS-018, T-W1064-MS-019,
> T-W1064-MS-020, T-W1064-MS-021, T-W1064-MS-022, T-W1064-MS-023,
> T-W1064-MS-024, T-W1064-MS-025, T-W1064-MS-026, T-W1064-MS-027,
> T-W1064-MS-028, T-W1064-MS-029, T-W1064-MS-030, T-W1064-MS-031,
> T-W1064-MS-032, T-W1064-MS-035, T-W1064-MS-036, T-W1064-MS-037,
> T-W1064-MS-038, T-W1064-MS-039, T-W1064-MS-040, T-W1064-MS-041,
> T-W1064-MS-042, T-W1064-MS-043, T-W1064-MS-044, T-W1064-MS-045.
Looking at one of these at random, it doesn't appear to be taking jobs:
https://tools.taskcluster.net/provisioners/releng-hardware/worker-types/gecko-t-win10-64-hw/workers/mdc1/T-W1064-MS-045
Comment 27•7 years ago
Seems to be some infinite looping in:
https://github.com/mozilla-releng/OpenCloudConfig/blob/master/userdata/Configuration/GenericWorker/run-hw-generic-worker-8-and-reboot.bat#L3-L7
Let's please call off the rollout until this is solved.
Flags: needinfo?(riman)
Flags: needinfo?(apop)
Comment 28•7 years ago
Assignee
Comment 29•7 years ago
On some of the nodes we hit an issue where generic-worker exited and the wrapper script rebooted the node before OCC cleared the in-progress.lock file. When the node came back up, the rundsc PowerShell script exited without running.
The odd bit is that this is not happening across the board.
I have added this to the wrapper script: https://github.com/mozilla-releng/OpenCloudConfig/commit/19fac03565089b5c8ef626ebde4be2dc8e684d58
Whenever GenericWorker exits, the lock file will be deleted if it exists. Additionally, if the manifest hasn't applied within 15 minutes, the lock file will be deleted and the machine will reboot.
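The lock-file cleanup described in this comment can be sketched like this. This is a bash stand-in for the Windows batch wrapper; the lock-file path and the worker invocation are assumptions for illustration.

```shell
#!/bin/bash
# Sketch of the wrapper-script fix: whenever the worker exits, remove
# the OCC in-progress lock if a reboot raced ahead of OCC clearing it,
# so rundsc is not skipped on the next boot. Paths are hypothetical.
LOCK_FILE="${LOCK_FILE:-/c/dsc/in-progress.lock}"

cleanup_lock() {
  # delete the stale lock left behind by an early reboot
  if [ -f "$LOCK_FILE" ]; then
    rm -f "$LOCK_FILE"
    echo "removed stale lock: $LOCK_FILE"
  fi
}

run_worker() {
  # assumed invocation; the real wrapper is run-hw-generic-worker-*.bat
  generic-worker run --config /c/generic-worker/config.json
  cleanup_lock   # always clear the lock after the worker exits
}
```

Without the `cleanup_lock` step, a reboot issued while the lock still exists leaves the node in the broken state described above, where rundsc.ps1 exits without running.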
Assignee
Comment 30•7 years ago
Asked CiDuty in #ciduty not to start additional installs until the above patch is tested.
Assignee
Comment 31•7 years ago
This seems to have affected about 20% of the newly installed machines.
The ones that were functioning continue to function after picking up the new wrapper script.
I am running a fresh install on ms-021. If it installs through and picks up multiple tasks, I will reinstall the other affected nodes on chassis 1. If there is no issue on those, I will ask CiDuty to resume installing and go through and identify the other affected nodes.
Assignee
Comment 32•7 years ago
I have dropped the reboot after the manifest hasn't completed within 15 minutes; it was causing an additional loop on its own. I am planning on adding this back later with Bug 1470016.
ms-021 successfully installed and picked up and passed multiple tests. With the exception of ms-038 and ms-035, which have other issues, nodes ms-016 through ms-045 are up and running. I am going to let these sit for a while and run. If there are no issues I will ask CiDuty to resume the rollout.
I suspect the issue began during the initial run(s) of OCC on the nodes, possibly right after the creation of the first task user, because the last exit code of GenericWorker in the log was a 67. I suspect rundsc.ps1 was waiting on something when the wrapper script issued a reboot.
Assignee
Comment 33•7 years ago
So far things are looking OK. I am going to take a look at the nodes mentioned above in the morning PDT.
Comment 34•7 years ago
(In reply to Mark Cornmesser [:markco] from comment #19)
> Rgr. I have bumped the test pool to 10.8.5. If all looks good tomorrow I
> will land the patch. The nodes will need to be reimaged for this to take
> affect.
Looking at e.g. https://taskcluster-artifacts.net/TeZ1Sq4eSjC5k-NgAnH5Ww/0/public/logs/live_backing.log it looks like the new machines are running 10.8.4 not 10.8.5.
Looking at userdata/Manifest/gecko-t-win10-64-hw-GW10.json I see that version 10.8.5 is specified, but I can see the sha256 is for version 10.8.4.
It looks like the following (unreviewed) commit introduced the issue since it updated the version number but did not update the tooltool hash:
https://github.com/mozilla-releng/OpenCloudConfig/commit/1e953d46810a2a86c648b30f1f40f526795c1c90
I'm happy to review any patches to OCC.
Comment 35•7 years ago
s/sha256/sha512/
Comment 36•7 years ago
Found the proper version of the generic worker in tooltool (10.8.5), and updated the userdata/Manifest/gecko-t-win10-64-hw-GW10.json file.
Created pull request
https://github.com/bccrisan/OpenCloudConfig/commit/c02c09c5b07a2a9f40c7378dea3eff4ff597708b
Please review and merge it if it's ok.
Comment 37•7 years ago
(In reply to Bogdan Crisan [:bcrisan] (UTC +3, EEST) from comment #36)
> Found the proper version of the generic worker in tooltool (10.8.5), and
> updated the userdata/Manifest/gecko-t-win10-64-hw-GW10.json file.
>
> Created pull request
> https://github.com/bccrisan/OpenCloudConfig/commit/c02c09c5b07a2a9f40c7378dea3eff4ff597708b
>
>
> Please review and merge it if it's ok.
That looks like a commit to master rather than a PR. :-)
But I would have r+'d it as the diff is correct. Thanks.
Comment 38•7 years ago
There is also a PR: https://github.com/mozilla-releng/OpenCloudConfig/pull/156 ;)
Comment 39•7 years ago
(In reply to Attila Craciun [:arny] from comment #38)
> There is also a PR:
> https://github.com/mozilla-releng/OpenCloudConfig/pull/156 ;)
Ah indeed! r+'d in github. Many thanks.
Comment 40•7 years ago
This came in from puppet (~30 minutes ago):
> Thu Jun 21 08:49:06 -0700 2018 /Stage[main]/Generic_worker/Exec[create gpg key]/returns (err): change from notrun to 0 failed: Could not find command 'generic-worker'
as a report for t-yosemite-r7-256.
Did you guys start working on the OS X generic-worker?
Assignee
Comment 41•7 years ago
(In reply to Bogdan Crisan [:bcrisan] (UTC +3, EEST) from comment #40)
> This came in the puppet (~30 minutes)
>
> > Thu Jun 21 08:49:06 -0700 2018 /Stage[main]/Generic_worker/Exec[create gpg key]/returns (err): change from notrun to 0 failed: Could not find command 'generic-worker'
>
> as a report for t-yosemite-r7-256.
>
> Did you guys started working on OSX generic worker?
dhouse:^
Flags: needinfo?(dhouse)
Comment 42•7 years ago
The generic-worker on OS X was updated yesterday. Looking into Taskcluster, this host does not appear in the pool: https://tools.taskcluster.net/provisioners/releng-hardware/worker-types/gecko-t-osx-1010
Assignee
Comment 43•7 years ago
I went back through and checked the nodes on chassis 1 (16 - 45), and the looping issue did not reoccur. However, there were 3 nodes that went unresponsive.
Chassis 2 had 3 nodes that were stuck in the loop, but I also found a high percentage of unresponsive nodes there. This is an issue we have been hitting (Bug 1452133). I am going to continue to go through the newly installed nodes and identify issues, reinstall them to get them back into the pool, and continue to dive into this.
Flags: needinfo?(dhouse)
Assignee
Comment 44•7 years ago
Among the remaining nodes that had the updated version of GW, I found 30 nodes in an unresponsive or possibly shut-down state, which is a much higher frequency than with the older version. It may be unrelated, but I am going to continue to monitor this, with a possibility of rolling all but the chassis 1 nodes back to GW 8.
The looping issue seems to have gone away on the newly installed nodes.
Assignee
Comment 45•7 years ago
Going through the unresponsive nodes, the nodes were locked up, not shut down. According to the last reported logs they locked up at different points: when rundsc.ps1 issued a reboot, when the wrapper script issued a reboot, and while waiting for a task. Other nodes had not installed correctly: either the node was continuing to try to PXE boot, or continuing to try to PXE boot over IPv6. Most likely it is not related, but the number of affected GW 10 workers was concerning. I am going to continue to monitor.
Assignee
Comment 46•7 years ago
For now I am going to hold off on completing the rollout until Monday next week.
Assignee
Comment 47•7 years ago
Some of the nodes that were unresponsive actually had failed to reboot. I have opened up Bug 1470338 to track those.
Assignee
Comment 48•7 years ago
In the last 20 hours, 15 out of 100 nodes went from good to unable to pick up tasks.
(16-45) Unresponsive: 20, 24, 32, 39. Failed to reboot: 23, 42.
(61-90) Unresponsive: 72. Failed to reboot: 75, 85.
(106-135) Unresponsive: 107. Failed to reboot: 115, 118, 126.
(151-171) Failed to reboot: 154, 155, 170.
The unresponsive nodes seem to be failing either differently, or at an accelerated rate, compared to the previous known issue. This is now being tracked in Bug 1470507.
Assignee
Comment 49•7 years ago
CiDuty: Could you all reinstall all nodes except ms-016 - ms-045 to the old task sequence? We will resume the upgrade after Bug 1470507 and Bug 1470338 are resolved.
Flags: needinfo?(riman)
Flags: needinfo?(ciduty)
Flags: needinfo?(apop)
Assignee
Comment 51•7 years ago
I am going to move to using this repo while troubleshooting: https://github.com/mozilla-platform-ops/OpenCloudConfig .
Comment 52•7 years ago
I've re-imaged T-W1064-MS-061..115 to the old task sequence
Comment 53•7 years ago
I've re-imaged the rest of them to old sequence: T-W1064-MS-116..171.
Except no. 130, which seems to have some issue: it is not getting past boot and shows a blue screen saying we encountered an error.
All the re-imaged workers (except 130) are alive and completing jobs.
Comment 54•7 years ago
(In reply to Radu Iman[:riman] from comment #50)
> Yes we can. I'm going to start right now.
(In reply to Zsolt Fay [:zsoltfay] from comment #53)
> I've re-imaged the rest of them to old sequence: T-W1064-MS-116..171.
> Except no. 130 it seems to have some issue not getting past boot and showing
> a blue screen saying we encountered an error.
>
> All the re-imaged workers (except 130) are well alive and completing jobs.
Thank you ciduty for your help with reimaging.
Assignee
Comment 55•7 years ago
There were multiple complications between OCC and GenericWorker on the last attempt at the upgrade. There are 2 files that the wrapper script looks for before starting up GW: one signals the end of the OCC manifest being applied (EndOfManifest.semaphore), and one signals the end of rundsc.ps1 (task-claim-state.valid). In both cases the wrapper script would fail to delete the file because of file locks. Because those were not being removed, the wrapper script would jump straight to starting GW. rundsc.ps1 would then exit once it detected the GW process. This led to a state where OCC was never fully applied.
I have added multiple catches to delete these files if they exist when GW exits or when there is an early exit of rundsc.ps1. OCC would then apply fully on each boot. However, the validation of the GW install would fail and OCC would then install GW again. GW would then try to start using the default wrapper script and fail:
"CommandsReturn": [
{
"Command": "C:\\generic-worker\\generic-worker.exe",
"Arguments": [
"--version"
],
"Match": "generic-worker 10.8.5"
While the command actually returns:
C:\Users\task_1531104133>c:\generic-worker\generic-worker.exe --version
2018/07/09 04:57:37 Making system call GetProfilesDirectoryW with args: [0 C0423C56D8]
2018/07/09 04:57:37 Result: 0 0 The data area passed to a system call is too small.
2018/07/09 04:57:37 Making system call GetProfilesDirectoryW with args: [C0423F5D00 C0423C56D8]
2018/07/09 04:57:37 Result: 1 0 The operation completed successfully.
generic-worker 10.8.5 [ revision: https://github.com/taskcluster/generic-worker/commits/034c836cfd18ebe8b7fb6dbfabccea4bcd0fa1f6 ]
For the time being I have removed this validation piece.
In addition, the user init script (https://github.com/mozilla-releng/OpenCloudConfig/blob/master/userdata/Configuration/GenericWorker/task-user-init-win10.cmd) has been added to the hw configuration.
I have added additional logging to the wrapper script, looking at the DSC dir after GW exits, as well as an attempt at dumping a full list of running tasks when a node fails to reboot; but I suspect the reboot issue was caused by unfinished portions of OCC.
Also, to get more info, the generic-worker-service.log is being sent to Papertrail.
My hope is that there will be less issues because OCC is fully completing, and if not the additional logging will give us an idea of the possible cause.
Assignee
Comment 56•7 years ago
Rob: I still need to clean it up a bit, but do you have any feedback? Any ideas on validation of the GW install?
Attachment #8990654 - Flags: feedback?(rthijssen)
Comment 57•7 years ago
i'm not sure how we can validate the gw version without running `generic-worker.exe --version`
have i understood correctly that calling gw with the version flag causes gw to do more than just return the version?
Assignee
Comment 58•7 years ago
(In reply to Rob Thijssen (:grenade UTC+2) from comment #57)
> i'm not sure how we can validate the gw version without running
> `generic-worker.exe --version`
> have i understood correctly that calling gw with the version flag causes gw
> to do more than just return the version?
Yes.
C:\Users\task_1531104133>c:\generic-worker\generic-worker.exe --version
2018/07/09 04:57:37 Making system call GetProfilesDirectoryW with args: [0 C0423C56D8]
2018/07/09 04:57:37 Result: 0 0 The data area passed to a system call is too small.
2018/07/09 04:57:37 Making system call GetProfilesDirectoryW with args: [C0423F5D00 C0423C56D8]
2018/07/09 04:57:37 Result: 1 0 The operation completed successfully.
generic-worker 10.8.5 [ revision: https://github.com/taskcluster/generic-worker/commits/034c836cfd18ebe8b7fb6dbfabccea4bcd0fa1f6 ]
Assignee
Comment 59•7 years ago
I have added this to the testing repo: https://github.com/mozilla-platform-ops/OpenCloudConfig/commit/91b6cba9a76d184a4b2a48c3a6d49fffb604b636
If one of the flag files does not exist and PowerShell is not running, clear the lock file and flag files if they exist, and reboot to try the process again.
This sprang from https://bugzilla.mozilla.org/show_bug.cgi?id=1474678 . There seems to be an intermittent issue with the Mellanox network drivers on startup that prevents OCC from reaching externally sourced files.
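The recovery check in that commit can be sketched roughly as follows. This is a bash illustration of the batch wrapper's behavior; the file paths and the `pgrep` process check are assumptions (on the real Windows nodes the script checks for a running PowerShell process and issues `shutdown /r`).

```shell
#!/bin/bash
# Sketch of the flag-file recovery check: if the end-of-manifest flag
# is missing and no OCC PowerShell run is in flight, clear the state
# files and reboot so the whole OCC run is retried from scratch.
needs_retry() {
  # hypothetical semaphore path passed in by the caller
  local semaphore="$1"
  [ ! -f "$semaphore" ] && ! pgrep -x powershell >/dev/null 2>&1
}

clear_state_and_reboot() {
  rm -f "$@"   # lock and flag files, if any exist
  echo "state cleared; rebooting to retry OCC run"   # 'shutdown /r' on Windows
}

if needs_retry "/c/dsc/EndOfManifest.semaphore"; then
  clear_state_and_reboot "/c/dsc/in-progress.lock" "/c/dsc/task-claim-state.valid"
fi
```

The retry-by-reboot approach matters here because the Mellanox driver issue is intermittent: a second boot usually gives OCC a working network and lets it fetch its externally sourced files.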
Comment 60•7 years ago
Comment on attachment 8990654 [details]
Improvement on hw gw 10 configurations
pmoore:
up until now we have used a validation in occ that looks for a version string in the output from `generic-worker.exe --version` in the form of a complete line that matches: "generic-worker 10.8.5"
eg: https://github.com/mozilla-releng/OpenCloudConfig/blob/bdbb7ea/userdata/Manifest/gecko-1-b-win2012.json#L1126
it seems that at some point the output from the version command changed to include a link to the git hash the version was built from. this change means that dsc assumes that the correct version of gw is not installed and goes on to reinstall gw on every occ run.
i will patch occ to allow for the extra information in the output from the version command.
in future, a heads up that a change like this is being introduced, would be appreciated and will help us to avoid bustage.
also, is it intentional that there is so much output from the version command? eg (https://tools.taskcluster.net/groups/SdP2Gz9zS8-6F5ROWG-RUQ/tasks/SdP2Gz9zS8-6F5ROWG-RUQ/runs/0/logs/public%2Flogs%2Flive.log):
Z:\task_1531296888>C:\generic-worker\generic-worker.exe --version
2018/07/11 08:23:31 Making system call GetProfilesDirectoryW with args: [0 C0423D76E8]
2018/07/11 08:23:31 Result: 0 0 The data area passed to a system call is too small.
2018/07/11 08:23:31 Making system call GetProfilesDirectoryW with args: [C042405C40 C0423D76E8]
2018/07/11 08:23:31 Result: 1 0 The operation completed successfully.
generic-worker 10.8.5 [ revision: https://github.com/taskcluster/generic-worker/commits/034c836cfd18ebe8b7fb6dbfabccea4bcd0fa1f6 ]
is it necessary to make those system calls in order to determine the version? also the output seems superfluous to what a user would expect when requesting the version.
Flags: needinfo?(pmoore)
Attachment #8990654 - Flags: feedback?(rthijssen) → feedback+
Comment 61•7 years ago
OCC is now patched to handle the extra line content.
https://github.com/mozilla-releng/OpenCloudConfig/commit/462d9fd
the implementation now includes a *like* comparison so the validation for the generic worker version now looks like this:
"Like": "generic-worker 10.8.5 *"
instead of:
"Match": "generic-worker 10.8.5"
Comment 62•7 years ago
pmoore: in light of the switch from "Match" to "Like", it's possible that the regex at https://github.com/petemoore/myscrapbook/blob/master/upgrade-gw-betas-cu.sh#L55 may need adaptation. maybe not, but worth checking...
Comment 63•7 years ago
(In reply to Rob Thijssen (:grenade UTC+2) from comment #60)
> Comment on attachment 8990654 [details]
> Improvement on hw gw 10 configurations
>
> pmoore:
>
> up until now we have used a validation in occ that looks for a version
> string in the output from `generic-worker.exe --version` in the form of a
> complete line that matches: "generic-worker 10.8.5"
> eg:
> https://github.com/mozilla-releng/OpenCloudConfig/blob/bdbb7ea/userdata/
> Manifest/gecko-1-b-win2012.json#L1126
>
> it seems that at some point the output from the version command changed to
> include a link to the git hash the version was built from. this change means
> that dsc assumes that the correct version of gw is not installed and goes on
> to reinstall gw on every occ run.
>
> i will patch occ to allow for the extra information in the output from the
> version command.
>
> in future, a heads up that a change like this is being introduced, would be
> appreciated and will help us to avoid bustage.
Apologies for this, I had forgotten that OCC uses this, indeed I should have given you a heads up.
Another option could be to check e.g. the SHA256 of the binary, rather than the output of `generic-worker --version`. This has the advantage that it helps defend against an attack whereby an attacker replaces generic-worker.exe with a different binary that fakes the "version" output but does something malicious when called with the "run" target.
However, I will make sure I notify in future if there are any changes, that was a careless oversight of mine.
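The hash-based validation suggested above could look roughly like this (a sketch only; the pinned digest below is a placeholder, not the real generic-worker 10.8.5 hash, and OCC would implement this in PowerShell rather than bash):

```shell
#!/bin/bash
# Sketch of validating the worker binary by SHA256 digest instead of
# trusting its --version output, which a malicious replacement binary
# could fake while misbehaving when run for real.
validate_sha256() {
  local binary="$1" expected="$2"
  local actual
  actual=$(sha256sum "$binary" 2>/dev/null | awk '{print $1}')
  [ "$actual" = "$expected" ]
}

# Placeholder digest for illustration; pin the real release digest here.
EXPECTED_SHA256="0000000000000000000000000000000000000000000000000000000000000000"
validate_sha256 "/c/generic-worker/generic-worker.exe" "$EXPECTED_SHA256" \
  || echo "generic-worker binary failed hash validation; reinstall needed"
```

A side benefit is that a hash comparison is immune to cosmetic changes in the version output, which is exactly what broke the "Match" validation here.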
>
> also, is it intentional that there is so much output from the version
> command? eg
> (https://tools.taskcluster.net/groups/SdP2Gz9zS8-6F5ROWG-RUQ/tasks/
> SdP2Gz9zS8-6F5ROWG-RUQ/runs/0/logs/public%2Flogs%2Flive.log):
>
> Z:\task_1531296888>C:\generic-worker\generic-worker.exe --version
> 2018/07/11 08:23:31 Making system call GetProfilesDirectoryW with args: [0
> C0423D76E8]
> 2018/07/11 08:23:31 Result: 0 0 The data area passed to a system call is
> too small.
> 2018/07/11 08:23:31 Making system call GetProfilesDirectoryW with args:
> [C042405C40 C0423D76E8]
> 2018/07/11 08:23:31 Result: 1 0 The operation completed successfully.
> generic-worker 10.8.5 [ revision:
> https://github.com/taskcluster/generic-worker/commits/
> 034c836cfd18ebe8b7fb6dbfabccea4bcd0fa1f6 ]
>
> is it necessary to make those system calls in order to determine the
> version? also the output seems superfluous to what a user would expect when
> requesting the version.
Unfortunately this is a limitation of the docopt-go library we are using. The command line parser requires that the help text is passed into the method call that parses the command arguments. We need to parse the arguments to see that --help is called, and the help text for the command includes docs for the parameter "tasksDir". The default value of this property is system dependent, and depends on the Profiles Directory on the system. In other words, the --help output needs to know where the system Profiles Directory is in order to provide the --help text, and the parser needs the full help text before it parses the command line arguments.
The reason we log all system calls is because it is very useful for troubleshooting when there are failures, as the go/c boundary is a potential source of failure. I believe the output is sent to standard error rather than standard out, so if standard error is disabled, it won't be shown, but I see that it is less than ideal. However, those system calls are needed so it is also kind of useful to see that they are made.
Flags: needinfo?(pmoore)
Comment 64•7 years ago
(In reply to Rob Thijssen (:grenade UTC+2) from comment #62)
> pmoore: in light of the switch from "Match" to "Like", it's possible that
> the regex at
> https://github.com/petemoore/myscrapbook/blob/master/upgrade-gw-betas-cu.
> sh#L55 may need adaptation. maybe not, but worth checking...
I think it is ok. Thanks for the heads up!
Comment 65•7 years ago
(In reply to Rob Thijssen (:grenade UTC+2) from comment #60)
> it seems that at some point the output from the version command changed to
> include a link to the git hash the version was built from. this change means
> that dsc assumes that the correct version of gw is not installed and goes on
> to reinstall gw on every occ run.
Hey Rob,
I remember we had a couple of changes in the pipeline to help with these types of issues, but I can't find the bug numbers right now. Two things that would help in this situation are:
1) we only apply OCC manifests during AMI creation (so we wouldn't repeatedly reapply every worker run)
2) if a validation step fails (such as validating the g-w version number) the AMI creation process should fail
With both of these safeguards, I think we would stop this issue earlier, and it couldn't get rolled out to production. Can you confirm if it is still intended to implement these changes?
Thanks in advance.
Flags: needinfo?(rthijssen)
Comment 66•7 years ago
It looks like generic-worker 8.3.0 is running in production on Windows 10 hardware. See this task from today:
https://tools.taskcluster.net/groups/Vbh4i4X1Qm-jcEtWb2TnDQ/tasks/AU4drFJNQM6SN7_lQaOx-w/runs/0/logs/public%2Flogs%2Flive.log#L11
This is a release from April 2017.
Also the PR from comment 38 still needs landing, although I'm not sure it will have much effect since I suspect the wrong manifest is getting used.
It looks like there are two:
* https://github.com/mozilla-releng/OpenCloudConfig/blob/master/userdata/Manifest/gecko-t-win10-64-hw.json
* https://github.com/mozilla-releng/OpenCloudConfig/blob/master/userdata/Manifest/gecko-t-win10-64-hw-GW10.json
I suspect the second one is the one that should be used in production, but at least for the log link above, it looks like maybe the first one is getting used.
Flags: needinfo?(mcornmesser)
Comment 67•7 years ago
(In reply to Pete Moore [:pmoore][:pete] from comment #66)
> It looks like generic-worker 8.3.0 is running in production on Windows 10
> hardware.
Yes, that's the entire point of this bug, to get us onto gw10. We attempted it once, as part of attempting to work around some of the issues we were having on the Moonshot hardware, but found issues with running gw10 on hardware.
Assignee
Comment 68•7 years ago
Currently generic-worker 10.8.5 is only on ms-016 through ms-045. Up until last week we were hitting the issues mentioned in comment 55, which was preventing a full deployment. Two issues are still blocking a full deployment: bug 1470338, in which gw 10 nodes occasionally fail to reboot, and bug 1474678, which seems to be a network issue that has only affected the gw 10 nodes. The latter may also just be related to the chassis the nodes are in. I am also a bit concerned about bug 1474729, with some nodes seemingly running 2 tasks at once.
I am hoping to be able to do a wide deployment the week of the 23rd.
Flags: needinfo?(mcornmesser)
Assignee | ||
Updated•7 years ago
Comment 69•7 years ago
(In reply to Pete Moore [:pmoore][:pete] from comment #65)
> Can you confirm if it is still intended to implement these changes?
yes
Flags: needinfo?(rthijssen)
Comment 70•7 years ago
(In reply to Mark Cornmesser [:markco] from comment #68)
> There are two issues that are still preventing a full
> deployment bug 1470338 which is gw 10 nodes occasional fail to reboot and
> bug 1474678 which seems to be a network issue that has only affected the gw
> 10 nodes.
Bug 1470338: Indeed this looks like an OCC issue, probably :grenade is the best person to help with that.
Bug 1474678: This looks to be due to bug 1372172 - something in the win10 image is causing the machine to freeze/reboot. I've added my analysis to the bug.
Neither of these issues are related to the generic-worker version.
Assignee
Comment 71•7 years ago
Some of these issues may be related to https://bugzilla.mozilla.org/show_bug.cgi?id=1475258, which is beginning to look like a possible hardware failure.
Assignee
Comment 72•6 years ago
Current status.
I have seen no issues with the functionality of generic-worker 10 over the last month, except that some cleanup is not happening: task user home directories are not being removed, and there is one OneDrive scheduled task created for each user that never gets deleted.
We are trying to get the overall environment to stabilize before deploying the current version of generic-worker. In the test pools, chassis 1 in both mdc1 and mdc2, the rundsc errors have been resolved and the Mellanox network drivers have been updated. However, we are seeing an issue where a node occasionally cannot reach external resources. When this happens OCC fails trying to reach those resources, and generic-worker is unable to talk to Taskcluster. When this happens rundsc.ps1 is never downloaded, which leads to OCC not running again after reboots and generic-worker never receiving the flags it needs to start. We are hoping that a firmware upgrade will prevent this issue.
Before upgrading across the board we are blocked on two things. One is the complete cleanup of the task directories and old scheduled tasks. The other, which is kind of a soft blocker, is the issue of being unable to reach external resources on boot. We may want to upgrade even if the latter is still happening; I am going to evaluate that this week.
Updated•6 years ago
Blocks: T-W1064-MS-066
Updated•6 years ago
Blocks: T-W1064-MS-087
Updated•6 years ago
No longer blocks: T-W1064-MS-087
Updated•6 years ago
No longer blocks: T-W1064-MS-066
Assignee
Comment 73•6 years ago
To address the issue where external resources are not available, I have added two catches. During deployment, before the first OCC run, there is now a scheduled task created to check for the existence of the rundsc.ps1 script:
if (!(Test-Path C:\dsc\rundsc.ps1)) {
  (New-Object Net.WebClient).DownloadFile(("https://raw.githubusercontent.com/markcor/OpenCloudConfig/master/userdata/rundsc.ps1?{0}" -f [Guid]::NewGuid()), 'C:\dsc\rundsc.ps1')
  while (!(Test-Path "C:\dsc\rundsc.ps1")) { Start-Sleep 10 }
  Remove-Item -Path c:\dsc\in-progress.lock -force -ErrorAction SilentlyContinue
  shutdown @('-r', '-t', '0', '-c', 'Rundsc.ps1 did not exist; restarting', '-f')
}
If it does not exist, the node will download it and reboot. I have also added a step to rundsc.ps1 so that if the node can't reach GitHub, the lock file is deleted and the node reboots:
if ($locationType -eq 'DataCenter') {
  if (!(Test-Connection github.com -quiet)) {
    Remove-Item -Path $lock -force -ErrorAction SilentlyContinue
    shutdown @('-r', '-t', '0', '-c', 'reboot; external resources are not available', '-f', '-d', '4:5') | Out-File -filePath $logFile -append
  }
}
This seems to have kept the nodes from sitting up without running generic-worker.
To address the issue of the OneDrive scheduled task created for every user, I have added a script to the deployment that removes OneDrive and all its components:
Import-Module -DisableNameChecking $PSScriptRoot\..\lib\force-mkdir.psm1
Import-Module -DisableNameChecking $PSScriptRoot\..\lib\take-own.psm1
echo "73 OneDrive process and explorer"
taskkill.exe /F /IM "OneDrive.exe"
taskkill.exe /F /IM "explorer.exe"
echo "Remove OneDrive"
if (Test-Path "$env:systemroot\System32\OneDriveSetup.exe") {
  & "$env:systemroot\System32\OneDriveSetup.exe" /uninstall
}
if (Test-Path "$env:systemroot\SysWOW64\OneDriveSetup.exe") {
  & "$env:systemroot\SysWOW64\OneDriveSetup.exe" /uninstall
}
echo "Disable OneDrive via Group Policies"
force-mkdir "HKLM:\SOFTWARE\Wow6432Node\Policies\Microsoft\Windows\OneDrive"
sp "HKLM:\SOFTWARE\Wow6432Node\Policies\Microsoft\Windows\OneDrive" "DisableFileSyncNGSC" 1
echo "Removing OneDrive leftovers trash"
rm -Recurse -Force -ErrorAction SilentlyContinue "$env:localappdata\Microsoft\OneDrive"
rm -Recurse -Force -ErrorAction SilentlyContinue "$env:programdata\Microsoft OneDrive"
rm -Recurse -Force -ErrorAction SilentlyContinue "C:\OneDriveTemp"
echo "Remove Onedrive from explorer sidebar"
New-PSDrive -PSProvider "Registry" -Root "HKEY_CLASSES_ROOT" -Name "HKCR"
mkdir -Force "HKCR:\CLSID\{018D5C66-4533-4307-9B53-224DE2ED1FE6}"
sp "HKCR:\CLSID\{018D5C66-4533-4307-9B53-224DE2ED1FE6}" "System.IsPinnedToNameSpaceTree" 0
mkdir -Force "HKCR:\Wow6432Node\CLSID\{018D5C66-4533-4307-9B53-224DE2ED1FE6}"
sp "HKCR:\Wow6432Node\CLSID\{018D5C66-4533-4307-9B53-224DE2ED1FE6}" "System.IsPinnedToNameSpaceTree" 0
Remove-PSDrive "HKCR"
echo "Removing run option for new users"
reg load "hku\Default" "C:\Users\Default\NTUSER.DAT"
reg delete "HKEY_USERS\Default\SOFTWARE\Microsoft\Windows\CurrentVersion\Run" /v "OneDriveSetup" /f
reg unload "hku\Default"
echo "Removing startmenu junk entry"
rm -Force -ErrorAction SilentlyContinue "$env:userprofile\AppData\Roaming\Microsoft\Windows\Start Menu\Programs\OneDrive.lnk"
echo "Restarting explorer..."
start "explorer.exe"
echo "Wait for EX reload.."
sleep 15
echo "Removing additional OneDrive leftovers"
foreach ($item in (ls "$env:WinDir\WinSxS\*onedrive*")) {
  Takeown-Folder $item.FullName
  rm -Recurse -Force $item.FullName
}
To address the task user home directories that have been left behind, I am trying to add an OCC function to remove those directories once they are a day old. Once that is addressed I will submit a PR for review for generic-worker 10.8.5. Shortly after, I will look at getting the Win 10 hardware nodes to the most current version.
Assignee
Comment 74•6 years ago
I have added:
if (Test-Path C:\dsc\GW10.semaphore) {
  $currenttaskuser = Get-LocalUser -Name task*
  $currenttaskname = $currenttaskuser.Name
  Get-ChildItem "c:\Users" -Exclude "$currenttaskname", "Default", "Administrator", "Public" | Remove-Item -Force -Recurse
}
This addresses the task user home directories: it finds the current task user and then deletes the other task user directories.
Assignee
Comment 75•6 years ago
See comment 73 and comment 74.
I will update the source repo before merging.
Attachment #9002623 - Flags: review?(rthijssen)
Attachment #9002623 - Flags: feedback?(pmoore)
Comment 76•6 years ago
Comment on attachment 9002623 [details] [review]
Add support for generic-worker 10 on hardware
lgtm. some comments in gh review regarding references to forked repo.
Attachment #9002623 - Flags: review?(rthijssen) → review+
Comment 77•6 years ago
(In reply to Mark Cornmesser [:markco] from comment #72)
> Current status.
>
> I have seen no issues if with the functionality of generic-worker 10 over
> the last month. Except for some clean up is not happening. Task user home
> directories are not being removed, and there is one drive schedule task for
> each user that has been created and never gets deleted.
Please can you provide example logs showing user directory deletion failing / not happening, config settings used, links to failed tasks and/or links to papertrail worker logs etc to support this claim?
What are the scheduled tasks? generic-worker 10 doesn't create any scheduled tasks. Are you sure you are running the correct version?
>
> We are trying to get the overall environment to stabilize before deploying
> the current version of generic-worker. In the test pools, chassis 1 in both
> mdc 1 and 2, rundsc errors have been resolved and the Mellanox network
> drivers have been updated. However, we are seeing an issues where node
> occasionally can not get to external resources. When this happens OCC fails
> on trying to get to those resources. Also generic-worker is unable to talk
> to taskcluster. When this happens rundsc.ps1 is never downloaded. This leads
> to oCC not running again after reboots and generic-worker never receiving
> the flags it needs to start. We are hoping that a firmware upgrade will
> prevent this issue from happening.
>
> Before upgrading across the board we are blocked on two things. The complete
> cleanup of the task dirs and old schedule tasks. The other, which is kind of
> a soft blocker, is the issue of unable to get to external resources on boot.
> We may want upgrade even if the later is happening. I am going to evaluate
> that this week.
As above, generic-worker 10 doesn't create any scheduled tasks - please provide details of what the scheduled tasks are.
Also please provide links to worker logs or task logs or screenshots, or any evidence to support your claims. Thanks! This will help people to support you.
Flags: needinfo?(mcornmesser)
Comment 78•6 years ago
(In reply to Mark Cornmesser [:markco] from comment #74)
> I have added :
>
> if (Test-Path C:\dsc\GW10.semaphore) {
> $currenttaskuser = get-localuser -name task*
> $currenttaskname = $currenttaskuser.name
> Get-ChildItem "c:\Users" -exclude "$currenttaskname", "Default",
> "Administrator","Public" | Remove-Item -Force -Recurse
> }
>
> To address the task user home directories. This will find the current user
> and then delete the other task user directories.
This shouldn't be necessary. If there is an underlying problem, it should be addressed where the problem is rather than building a workaround on top. Adding more workarounds will lead to an increasingly chaotic and unmaintainable system.
It is not useful or sufficient to claim that something isn't working. Claims in bugs /always/ need to be backed up with evidence (worker logs / task logs / config settings / screenshots / console dumps / whatever...); otherwise they are just hearsay, serve little purpose, and nobody can support you.
Many thanks.
Assignee
Comment 79•6 years ago
From ms-016, which was reimaged about 3 hours ago:
C:\windows\system32>dir c:\Users
Volume in drive C is Windows
Volume Serial Number is C898-9E3D
Directory of c:\Users
08/21/2018 02:35 PM <DIR> .
08/21/2018 02:35 PM <DIR> ..
08/20/2018 10:32 PM <DIR> Administrator
08/20/2018 09:06 PM <DIR> Public
08/21/2018 11:49 AM <DIR> task_1534850768
08/21/2018 01:06 PM <DIR> task_1534851983
08/21/2018 02:39 PM <DIR> task_1534856554
08/21/2018 02:39 PM <DIR> task_1534862131
0 File(s) 0 bytes
8 Dir(s) 41,465,036,800 bytes free
https://papertrailapp.com/groups/1141234/events?focus=968569269854035972&selected=968569269854035972
Aug 21 07:35:31 T-W1064-MS-016.mdc1.mozilla.com generic-worker: Generic worker ran successfully (exit code 67) rebooting #015
Aug 21 07:35:31 T-W1064-MS-016.mdc1.mozilla.com generic-worker: 2018/08/21 14:35:30 Trying to remove directory 'C:\Users\task_1534850768' via os.RemoveAll(path) call as GenericWorker user...#015
Aug 21 07:35:31 T-W1064-MS-016.mdc1.mozilla.com generic-worker: 2018/08/21 14:35:30 WARNING: could not delete directory 'C:\Users\task_1534850768' with os.RemoveAll(path) method#015
Aug 21 07:35:31 T-W1064-MS-016.mdc1.mozilla.com generic-worker: 2018/08/21 14:35:30 remove C:\Users\task_1534850768\AppData\Local\Microsoft\Windows\Application Shortcuts: Access is denied.#015
Aug 21 07:35:31 T-W1064-MS-016.mdc1.mozilla.com generic-worker: 2018/08/21 14:35:30 Trying to remove directory 'C:\Users\task_1534850768' via del command as GenericWorker user...#015
Aug 21 07:35:31 T-W1064-MS-016.mdc1.mozilla.com generic-worker: 2018/08/21 14:35:30 Running command: 'cmd' '/c' 'del' '/s' '/q' '/f' 'C:\Users\task_1534850768'#015
Aug 21 07:35:31 T-W1064-MS-016.mdc1.mozilla.com generic-worker: 2018/08/21 14:35:30 Trying to remove directory 'C:\Users\task_1534851983' via os.RemoveAll(path) call as GenericWorker user...#015
Aug 21 07:35:31 T-W1064-MS-016.mdc1.mozilla.com generic-worker: 2018/08/21 14:35:30 WARNING: could not delete directory 'C:\Users\task_1534851983' with os.RemoveAll(path) method#015
Aug 21 07:35:31 T-W1064-MS-016.mdc1.mozilla.com generic-worker: 2018/08/21 14:35:30 remove C:\Users\task_1534851983\AppData\Local\Microsoft\Windows\Application Shortcuts: Access is denied.#015
Aug 21 07:35:31 T-W1064-MS-016.mdc1.mozilla.com generic-worker: 2018/08/21 14:35:30 Trying to remove directory 'C:\Users\task_1534851983' via del command as GenericWorker user...#015
Aug 21 07:35:31 T-W1064-MS-016.mdc1.mozilla.com generic-worker: 2018/08/21 14:35:30 Running command: 'cmd' '/c' 'del' '/s' '/q' '/f' 'C:\Users\task_1534851983'#015
Aug 21 07:35:31 T-W1064-MS-016.mdc1.mozilla.com generic-worker: 2018/08/21 14:35:31 Trying to remove directory 'C:\Users\task_1534850768' via os.RemoveAll(path) call as GenericWorker user...#015
Aug 21 07:35:31 T-W1064-MS-016.mdc1.mozilla.com generic-worker: 2018/08/21 14:35:31 WARNING: could not delete directory 'C:\Users\task_1534850768' with os.RemoveAll(path) method#015
Aug 21 07:35:31 T-W1064-MS-016.mdc1.mozilla.com generic-worker: 2018/08/21 14:35:31 remove C:\Users\task_1534850768\AppData\Local\Microsoft\Windows\Application Shortcuts: Access is denied.#015
Aug 21 07:35:31 T-W1064-MS-016.mdc1.mozilla.com generic-worker: 2018/08/21 14:35:31 Trying to remove directory 'C:\Users\task_1534850768' via del command as GenericWorker user...#015
Aug 21 07:35:31 T-W1064-MS-016.mdc1.mozilla.com generic-worker: 2018/08/21 14:35:31 Running command: 'cmd' '/c' 'del' '/s' '/q' '/f' 'C:\Users\task_1534850768'#015
Aug 21 07:35:31 T-W1064-MS-016.mdc1.mozilla.com generic-worker: 2018/08/21 14:35:31 Trying to remove directory 'C:\Users\task_1534851983' via os.RemoveAll(path) call as GenericWorker user...#015
Aug 21 07:35:31 T-W1064-MS-016.mdc1.mozilla.com generic-worker: 2018/08/21 14:35:31 WARNING: could not delete directory 'C:\Users\task_1534851983' with os.RemoveAll(path) method#015
Aug 21 07:35:31 T-W1064-MS-016.mdc1.mozilla.com generic-worker: 2018/08/21 14:35:31 remove C:\Users\task_1534851983\AppData\Local\Microsoft\Windows\Application Shortcuts: Access is denied.#015
Aug 21 07:35:31 T-W1064-MS-016.mdc1.mozilla.com generic-worker: 2018/08/21 14:35:31 Trying to remove directory 'C:\Users\task_1534851983' via del command as GenericWorker user...#015
Aug 21 07:35:31 T-W1064-MS-016.mdc1.mozilla.com generic-worker: 2018/08/21 14:35:31 Running command: 'cmd' '/c' 'del' '/s' '/q' '/f' 'C:\Users\task_1534851983'#015
Aug 21 07:35:31 T-W1064-MS-016.mdc1.mozilla.com generic-worker: 2018/08/21 14:35:31 Looking for existing task users to delete...#015
Aug 21 07:35:31 T-W1064-MS-016.mdc1.mozilla.com generic-worker: 2018/08/21 14:35:31 Resolved 29 tasks in total so far.#015
Aug 21 07:35:31 T-W1064-MS-016.mdc1.mozilla.com generic-worker: 2018/08/21 14:35:31 Creating Windows user task_1534862131...#015
Aug 21 07:35:31 T-W1064-MS-016.mdc1.mozilla.com generic-worker: 2018/08/21 14:35:31 Running command: 'net' 'user' 'task_1534862131' 'pWd0_zqP1tkx5DZ1BYhJfgdTpbljc' '/add' '/expires:never' '/passwordchg:no' '/y'#015
Aug 21 07:35:31 T-W1064-MS-016.mdc1.mozilla.com generic-worker: 2018/08/21 14:35:31 Created new OS user!#015
As for the scheduled task: Windows creates a scheduled task for each user to auto-update OneDrive (https://onedrive.live.com/about/en-us/). I currently have it disabled on all the generic-worker 10 nodes. I will reinstall one this morning without disabling OneDrive.
Assignee
Comment 80•6 years ago
Actually I had reimaged a node just for this early this morning. From MS-318
C:\windows\system32>schtasks /query | grep one
OneDrive Standalone Update Task-S-1-5-21 8/22/2018 5:41:17 AM Ready
OneDrive Standalone Update Task-S-1-5-21 8/22/2018 10:01:26 AM Ready
OneDrive Standalone Update Task-S-1-5-21 8/22/2018 5:59:17 AM Ready
OneDrive Standalone Update Task-S-1-5-21 8/22/2018 11:05:10 PM Ready
OneDrive Standalone Update Task-S-1-5-21 8/22/2018 6:26:05 PM Ready
OneDrive Standalone Update Task-S-1-5-21 8/22/2018 4:34:24 AM Ready
OneDrive Standalone Update Task-S-1-5-21 8/22/2018 2:09:21 PM Ready
Flags: needinfo?(mcornmesser)
Assignee
Comment 81•6 years ago
pmoore: Do you think I should hold off from merging this pull request?
Flags: needinfo?(pmoore)
Comment 82•6 years ago
(In reply to Mark Cornmesser [:markco] from comment #81)
> pmoore: Do you think I should hold off from merging this pull request?
I think we need to test the worker in a staging pool that isn't taking production jobs. There are two problems with the current approach:
1) it impacts real user pushes
2) it isn't possible to direct tasks to specific test machines - any worker with the given provisionerId/workerType name can take a job, so when you submit a task you don't know if it will get run by a generic-worker 8.3.0 instance or a generic-worker 10.8.4 worker.
If we can set up a dedicated staging pool, it will be easier to troubleshoot what the issues are. For example, I'm curious if the reboots are working between tasks, so I'd like to submit a task to look at the uptime of a given worker, to confirm it rebooted between the current task and the previous task. This will be possible if we have a staging pool with a different workerType configuration setting (like :dragrom set up for macOS workers).
Another thing worth trying is to remove one of those directories as an Administrator, and see whether you are able to. The directory deletion is done by the Windows service, which runs as the LocalSystem account.
I can also add some additional logging to the worker to say why it can't delete the directories, if that would be helpful.
Let's set up a staging pool before merging this.
Flags: needinfo?(pmoore)
Comment 83•6 years ago
(In reply to Pete Moore [:pmoore][:pete] from comment #82)
> I can also add some additional logging to the worker to say why it can't
> delete the directories, if that would be helpful.
It looks like this is already there, e.g.:
remove C:\Users\task_1534851983\AppData\Local\Microsoft\Windows\Application Shortcuts: Access is denied.
What does `icacls "C:\Users\task_1534851983\AppData\Local\Microsoft\Windows\Application Shortcuts"` report?
I'm curious what created this directory, and who has permission to delete it / modify its access settings.
Comment 84•6 years ago
Note: if the os.RemoveAll(path) call to the Go standard library fails to delete the directory, the worker falls back to the built-in del command, running as the LocalSystem user, namely:
> cmd /c del /s /q /f "<path>"
Maybe we should run the following if the delete fails[1]:
> icacls "<path>" /t /grant:r LocalSystem:(OI)(CI)F
and then try again.
I suspect LocalSystem account and the task user account don't have permission to delete the "Application Shortcuts" folder that is getting created, so we might need to explicitly grant permission recursively to all files/folders to do this in case of failure. This could be an expensive operation so I propose to only do this in the failure case, and then to retry the delete again if the icacls command is successful.
I am still curious what the DACLs are on those "Application Shortcuts" folders though.
----
[1] https://docs.microsoft.com/en-us/windows-server/administration/windows-commands/icacls
Comment 85•6 years ago
If you can grant me access to one of the machines, I would be happy to investigate further. Alternatively if you're happy to set up a staging pool, I can also troubleshoot from e.g. an interactive loaner (I can't get an interactive loaner without a staging pool, as an 8.3.0 worker is likely to consume the task).
Thanks!
Flags: needinfo?(mcornmesser)
Assignee
Comment 86•6 years ago
Here is what the icacls command returned for a directory that could not be deleted:
C:\windows\system32>icacls "C:\Users\task_1534854599\AppData\Local\Microsoft\Windows\Application Shortcuts"
C:\Users\task_1534854599\AppData\Local\Microsoft\Windows\Application Shortcuts NT AUTHORITY\SYSTEM:(I)(OI)(CI)(F)
BUILTIN\Administrators:(I)(OI)(CI)(F)
S-1-5-21-1770456216-2325375451-2193181373-1002:(I)(OI)(CI)(F)
T-W1064-MS-318\task_1534875459:(I)(OI)(CI)(F)
Successfully processed 1 files; Failed processing 0 files
I have set up a staging pool of 5 nodes, ms-320 through ms-324. These nodes have a workerType of gecko-t-win10-64-hbeta, and RDP has been enabled on them. However, when a task is retriggered using this worker type (and gecko-t-win10-64-hs), the task is picked up and quickly returns as an exception. The reason is malformed-payload. The only value that changed in the generic-worker config file was the workerType.
Flags: needinfo?(mcornmesser)
Assignee
Comment 87•6 years ago
Comment 88•6 years ago
for debugging, here are some links to the delete failure logs.
- on hardware instances:
https://papertrailapp.com/groups/1958653/events?q=program%3Ageneric-worker%20%22WARNING%3A%20could%20not%20delete%20directory%22
- on ec2 instances:
https://papertrailapp.com/groups/2488493/events?q=program%3Ageneric-worker%20%22WARNING%3A%20could%20not%20delete%20directory%22
the failures occur on all windows worker types (hardware, ec2, testers, builders)
Comment 89•6 years ago
(In reply to Mark Cornmesser [:markco] from comment #87)
> For reference:
> https://tools.taskcluster.net/provisioners/releng-hardware/worker-types/
> gecko-t-win10-64-hbeta
Looking at one of the malformed-payload exceptions, the log says:
> [taskcluster:error] [mounts] task.dependencies needs to include LpA26q4QSiOPOyP0Bdh6Hw since one or more of its artifacts are mounted
See https://tools.taskcluster.net/groups/Cs0lJfSQTS-kB_VDLf-4KA/tasks/Cs0lJfSQTS-kB_VDLf-4KA/runs/0/logs/public%2Flogs%2Flive.log#L26
You just need to add that task as a dependency of the current task. See "dependencies" in https://docs.taskcluster.net/docs/reference/platform/taskcluster-queue/references/api#request-payload page.
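For illustration, a minimal fragment of a task definition that mounts an artifact from that task and declares the dependency might look like the following; the artifact name, directory, and format here are hypothetical, and all other task fields are elided:

```json
{
  "dependencies": ["LpA26q4QSiOPOyP0Bdh6Hw"],
  "payload": {
    "mounts": [
      {
        "content": {
          "taskId": "LpA26q4QSiOPOyP0Bdh6Hw",
          "artifact": "public/build/target.zip"
        },
        "directory": "build",
        "format": "zip"
      }
    ]
  }
}
```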
Comment 90•6 years ago
Mark, did we decide that bug 1433854 isn't blocking this one, as it isn't a showstopper?
Can you check the other dependencies, and say if they really are hard dependencies or not? I'm wondering if we can upgrade to generic-worker 10, to unblock all the downstream bugs, and deal with any open issues afterwards. If they aren't critical issues, that would be my preference, but I do see there is quite a long list of dependencies, just not sure if they really are hard blockers or not.
Thanks!
Flags: needinfo?(mcornmesser)
Assignee
Comment 91•6 years ago
Generic-worker 8 is no longer in use in the datacenter; Windows nodes are now using 10.8.5. I am going to open a separate bug to further upgrade generic-worker, but I may wait on bug 1433854 being resolved before upgrading.
Status: NEW → RESOLVED
Closed: 6 years ago
Flags: needinfo?(mcornmesser)
Resolution: --- → FIXED
Summary: upgrade generic worker to 10.x.x (match versions on AWS testers) on Win 10 hardware → upgrade generic worker to 10.8.5 on Win 10 hardware
Updated•6 years ago
Attachment #9002623 - Flags: feedback?(pmoore)