Closed Bug 1443589 Opened 7 years ago Closed 6 years ago

upgrade generic worker to 10.8.5 on Win 10 hardware

Categories

(Infrastructure & Operations :: RelOps: General, task)

task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: markco, Assigned: markco)

References

Details

Attachments

(6 files)

No description provided.
Assignee: relops → mcornmesser
Blocks: 1431161
^
Flags: needinfo?(mcornmesser)
There is a test pool of 25 nodes up. I am hoping to expand it early next week.
Flags: needinfo?(mcornmesser)
Summary: upgrade generic worker to 10.6.0 on Win 10 hardware → upgrade generic worker to 10.x.x (match versions on AWS testers) on Win 10 hardware
Depends on: 1451832, 1451835, 1451837
I am going to bump this up on my priority list for this week. A few weeks ago I was able to get GenericWorker 10.8.1 installed and running on hardware. However, once I put it out in the wild it began to burn tasks. At the time these were test failures, not GenericWorker errors, and I did not have the time to really dive into it. I have t-w1064-ms-077 spinning up using https://github.com/mozilla-platform-ops/OpenCloudConfig , which will install 10.8.1. I am using the old SCL3-puppet provisioner so that I can manually retrigger tests and point them at this particular machine. I am hoping to either have this resolved or gather enough info so I can pull in help during the work week.
++ Thanks Mark.
Attached file gen_worker_10_conf.txt
A copy of the conf file being used
Ran a small handful of random hardware tests yesterday, with one test that had previously passed now failing. I am going to run through all the hardware tests today and will open additional bugs for failing tests. Other notes: I did not need to use the scl3-puppet provisioner; worker types gecko-t-* will work with the releng-hardware provisioner. As for all the tests failing the last time I attempted to test the updated worker, it may have been local network or DNS issues that affected the tests.
There were no apparent issues with Talos tests. All tests were green:

test-windows7-32/opt-talos-chrome-e10s https://tools.taskcluster.net/groups/WVXB-YEpSsWA4K6n24G7ag/tasks/WVXB-YEpSsWA4K6n24G7ag/details
test-windows7-32/opt-talos-dromaeojs-e10s https://tools.taskcluster.net/groups/d-XtOWDMS-yKyVIhQa4Xwg/tasks/d-XtOWDMS-yKyVIhQa4Xwg/details
test-windows7-32/opt-talos-damp-e10s https://tools.taskcluster.net/groups/KER9T-WdQCqbZ0cmIJWIVA/tasks/KER9T-WdQCqbZ0cmIJWIVA/details
test-windows7-32/opt-talos-g4-e10s https://tools.taskcluster.net/groups/JDdMx41VRSqSY2l-c9YnCQ/tasks/JDdMx41VRSqSY2l-c9YnCQ/details
test-windows7-32/opt-talos-g5-e10s https://tools.taskcluster.net/groups/YleIIJM0TRCesQF6DMffmQ/tasks/YleIIJM0TRCesQF6DMffmQ/details
test-windows7-32/opt-talos-h1-e10s https://tools.taskcluster.net/groups/TABdIBQkSCCrnfuUowsg5g/tasks/TABdIBQkSCCrnfuUowsg5g/details
test-windows7-32/opt-talos-perf-reftest-e10s https://tools.taskcluster.net/groups/MCXk-5MQSe-3kfcyYrDTDQ/tasks/MCXk-5MQSe-3kfcyYrDTDQ/details
test-windows7-32/opt-talos-perf-reftest-singletons-e10s https://tools.taskcluster.net/groups/fw5utBIkQNec_8haDpBGaw/tasks/fw5utBIkQNec_8haDpBGaw/details
test-windows7-32/opt-talos-speedometer-e10s https://tools.taskcluster.net/groups/Aqk1ydjJRsyICRkPZ1ihFw/tasks/Aqk1ydjJRsyICRkPZ1ihFw/details
test-windows7-32/opt-talos-tp5o-e10s https://tools.taskcluster.net/groups/ekz1VHmET9iOrQtvu3_bFQ/tasks/ekz1VHmET9iOrQtvu3_bFQ/details
test-windows7-32/opt-talos-tp6-e10s https://tools.taskcluster.net/groups/EtFOerjYRz6FpnNDxhsWXw/tasks/EtFOerjYRz6FpnNDxhsWXw/details
test-windows7-32/opt-talos-tps-e10s https://tools.taskcluster.net/groups/VJtq2fmQTM2kV-9DKmOLDw/tasks/VJtq2fmQTM2kV-9DKmOLDw/details
test-windows10-64/opt-talos-chrome-e10s https://tools.taskcluster.net/groups/Qwm0kL87Q82dWyYQmWGa-w/tasks/Qwm0kL87Q82dWyYQmWGa-w/details
test-windows10-64/opt-talos-dromaeojs-e10s https://tools.taskcluster.net/groups/f7camATqTMWoiViB7HEGlA/tasks/f7camATqTMWoiViB7HEGlA/details
test-windows10-64/opt-talos-damp-e10s https://tools.taskcluster.net/groups/KcdAenecRsyi_77BtnTGNw/tasks/KcdAenecRsyi_77BtnTGNw/details
test-windows10-64/opt-talos-g1-e10s https://tools.taskcluster.net/groups/Rc-q9rmxTHa1Hb8-fb7cFg/tasks/Rc-q9rmxTHa1Hb8-fb7cFg/details
test-windows10-64/opt-talos-g4-e10s https://tools.taskcluster.net/groups/HTwW_FXDR2uQrUWWUprosA/tasks/HTwW_FXDR2uQrUWWUprosA/details
test-windows10-64/opt-talos-g5-e10s https://tools.taskcluster.net/groups/ULh04Dd9QkuvUrrc3DBU0A/tasks/ULh04Dd9QkuvUrrc3DBU0A/details
test-windows10-64/opt-talos-h1-e10s https://tools.taskcluster.net/groups/ACwZbgrXRAiNWk5UwE5LLw/tasks/ACwZbgrXRAiNWk5UwE5LLw/details
test-windows10-64/opt-talos-perf-reftest-e10s https://tools.taskcluster.net/groups/ae68QD0zRRmVSfuM_rgFbw/tasks/ae68QD0zRRmVSfuM_rgFbw/details
test-windows10-64/opt-talos-perf-reftest-singletons-e10s https://tools.taskcluster.net/groups/UdjKpx_tQKuEbfGhieDbHQ/tasks/UdjKpx_tQKuEbfGhieDbHQ/details
test-windows10-64/opt-talos-speedometer-e10s https://tools.taskcluster.net/groups/fg0NhJUoRhyBenVUJs-LAg/tasks/fg0NhJUoRhyBenVUJs-LAg/details
test-windows10-64/opt-talos-tp5o-e10s https://tools.taskcluster.net/groups/W46xJWCkSL-7kdQdvbj_sA/tasks/W46xJWCkSL-7kdQdvbj_sA/details
test-windows10-64/opt-talos-tp6-e10s https://tools.taskcluster.net/groups/IohJ4iuzQNio0q1fsMRtzA/tasks/IohJ4iuzQNio0q1fsMRtzA/details
test-windows10-64/opt-talos-tps-e10s https://tools.taskcluster.net/groups/JPDkd_bWQtCqJkx3QcsoKw/tasks/JPDkd_bWQtCqJkx3QcsoKw/details
I am setting up 2 nodes, ms-135 and ms-81, to run as the gecko-t-win10-64-hw worker type with GenericWorker 10.8.1. I am not as concerned with the functionality of the worker as with the extended life of the nodes. After a mass of tests are run on these nodes I will go back through and look at the state of each node with respect to disk space and scheduled tasks. After this I will update the OCC testing repo and roll out a 5-machine test pool. For the testing repo I am going to include functions similar to those covered in https://bugzilla.mozilla.org/show_bug.cgi?id=1451837. This includes disk space management and datacenter location decisions. The next sticky part is figuring out a method by which we can do an incremental rollout of the upgraded worker. I am thinking about creating a second Win 10 OCC manifest and creating a flag during the initial deployment that OCC can check when deciding which manifest to use.
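For illustration, here is a minimal sketch of that flag-based manifest selection; the flag file name and manifest paths are borrowed from later comments in this bug and are illustrative, not the actual OCC implementation:

# Sketch only: choose the GW10 manifest when a flag dropped at deployment time exists.
$flag = 'C:\dsc\GW10.semaphore'   # hypothetical flag file created by the deployment task sequence
if (Test-Path $flag) {
  $manifest = 'userdata/Manifest/gecko-t-win10-64-hw-GW10.json'
} else {
  $manifest = 'userdata/Manifest/gecko-t-win10-64-hw.json'
}
Write-Output "Applying OCC manifest: $manifest"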
The 2 test nodes are up, picking up tasks, and passing tests.
(In reply to Mark Cornmesser [:markco] from comment #11) > The 2 test nodes are up, picking up tasks, and passing tests. Great news!
To upgrade to 10.8.4, there is a new manifest and a conditional clause looking for a gw 10 file from deployment. The reason for this is so we can do an incremental upgrade without breaking the entire hardware pool. This also incorporates hard disk cleanup from Bug 1451837, as well as removal of KTS from Bug 1454759.
Attachment #8985274 - Flags: feedback?(rthijssen)
Comment on attachment 8985274 [details] OCC win 10 hw GenericWorker upgrade r+, looks good to me. in function hw-DiskManage, there's a reboot that references the lock file. the path to the lock file will need to be passed into the function as a parameter. here's an example: https://github.com/mozilla-releng/OpenCloudConfig/blob/5963cda/userdata/rundsc.ps1#L420
Attachment #8985274 - Flags: feedback?(rthijssen) → feedback+
Bumping GW version to 10.8.5.
Moved back to 10.8.4
Verbally R+ by grenade. Holding off on merging until current issues are resolved.
Attachment #8985473 - Flags: review+
(In reply to Mark Cornmesser [:markco] from comment #16) > Moved back to 10.8.4 10.8.5 should be fine - the production Windows issues from today are unrelated.
Flags: needinfo?(mcornmesser)
Rgr. I have bumped the test pool to 10.8.5. If all looks good tomorrow I will land the patch. The nodes will need to be reimaged for this to take effect.
Flags: needinfo?(mcornmesser)
Patch landed.
Ciduty will start on a rolling install.
:apop will be handling the re-images. I have CCed the ciduty@m.c bugzilla account (under the same email, you can even NeedInfo it!) so we can all have better visibility into the process. Adrian will come back with updates as he has them.
reimaged the following moonshots: T-W1064-MS-016, T-W1064-MS-017, T-W1064-MS-018, T-W1064-MS-019, T-W1064-MS-020, T-W1064-MS-021, T-W1064-MS-022, T-W1064-MS-023, T-W1064-MS-024, T-W1064-MS-025, T-W1064-MS-026, T-W1064-MS-027, T-W1064-MS-028, T-W1064-MS-029, T-W1064-MS-030, T-W1064-MS-031, T-W1064-MS-032, T-W1064-MS-035, T-W1064-MS-036, T-W1064-MS-037, T-W1064-MS-038, T-W1064-MS-039, T-W1064-MS-040, T-W1064-MS-041, T-W1064-MS-042, T-W1064-MS-043, T-W1064-MS-044, T-W1064-MS-045.
Today we re-imaged the following moonshots: T-W1064-MS-{ 061.. to 090 } T-W1064-MS-{ 106.. to 135 } T-W1064-MS-{ 151.. to 171 }
\o/ Thanks guys!
(In reply to Adrian Pop from comment #23)
> reimaged the following moonshots:
>
> T-W1064-MS-016, T-W1064-MS-017, T-W1064-MS-018, T-W1064-MS-019,
> T-W1064-MS-020, T-W1064-MS-021, T-W1064-MS-022, T-W1064-MS-023,
> T-W1064-MS-024, T-W1064-MS-025, T-W1064-MS-026, T-W1064-MS-027,
> T-W1064-MS-028, T-W1064-MS-029, T-W1064-MS-030, T-W1064-MS-031,
> T-W1064-MS-032, T-W1064-MS-035, T-W1064-MS-036, T-W1064-MS-037,
> T-W1064-MS-038, T-W1064-MS-039, T-W1064-MS-040, T-W1064-MS-041,
> T-W1064-MS-042, T-W1064-MS-043, T-W1064-MS-044, T-W1064-MS-045.

Looking at one of these at random, it doesn't appear to be taking jobs: https://tools.taskcluster.net/provisioners/releng-hardware/worker-types/gecko-t-win10-64-hw/workers/mdc1/T-W1064-MS-045
Flags: needinfo?(riman)
Flags: needinfo?(apop)
On some of the nodes we hit an issue where generic-worker exited and the wrapper script rebooted the node before OCC cleared the in-progress.lock file. When the node came back up, the rundsc PowerShell script exited without running. The odd bit is that this is not happening across the board. I have added this to the wrapper script: https://github.com/mozilla-releng/OpenCloudConfig/commit/19fac03565089b5c8ef626ebde4be2dc8e684d58 Whenever GenericWorker exits, the lock file will be deleted if it exists. Additionally, if the manifest hasn't applied within 15 minutes, the lock file will be deleted and the machine will reboot.
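A minimal sketch of the wrapper-script behavior described above, assuming the lock and semaphore file names that appear later in this bug; the structure is illustrative, not the code in the linked commit (and the 15-minute reboot was dropped again shortly afterwards, see the next comments):

# Sketch only: wait for OCC's semaphore, give up after 15 minutes,
# and always clear the lock file after generic-worker exits.
$lock      = 'C:\dsc\in-progress.lock'
$semaphore = 'C:\dsc\EndOfManifest.semaphore'
$started   = Get-Date
while (!(Test-Path $semaphore)) {
  if (((Get-Date) - $started).TotalMinutes -gt 15) {
    Remove-Item -Path $lock -Force -ErrorAction SilentlyContinue
    shutdown @('-r', '-t', '0', '-c', 'OCC manifest did not complete; rebooting', '-f')
    break
  }
  Start-Sleep -Seconds 30
}
# ... start generic-worker here; once it exits, clean up the lock so rundsc.ps1 can run on the next boot ...
if (Test-Path $lock) {
  Remove-Item -Path $lock -Force -ErrorAction SilentlyContinue
}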
Asked CiDuty in #ciduty to not start additional installs until the above patch is tested.
This seems to have affected about 20% of the newly installed machines. The ones that were functioning continue to function after picking up the new wrapper script. I am running a fresh install on ms-021. If it installs through and picks up multiple tasks I will reinstall the other affected nodes on chassis 1. If there are no issues on those I will ask CiDuty to resume installing, and I will go through and identify the other affected nodes.
I have dropped the reboot that fired when the manifest hadn't completed within 15 minutes; it was causing an additional loop on its own. I am planning on adding this back later with Bug 1470016. Ms-021 successfully installed, then picked up and passed multiple tests. With the exception of ms-038 and ms-035, which have other issues, nodes ms-016 through ms-045 are up and running. I am going to let these sit for a while and run. If there are no issues I will ask CiDuty to resume the rollout. I suspect the issue began during the initial run(s) of OCC on the nodes, possibly right after the creation of the first task user, because the last exit code of GenericWorker in the log was a 67. I suspect rundsc.ps1 was waiting on something when the wrapper script issued a reboot.
So far things are looking OK. I am going to take a look at the nodes mentioned above in the morning PDT.
(In reply to Mark Cornmesser [:markco] from comment #19)
> Rgr. I have bumped the test pool to 10.8.5. If all looks good tomorrow I
> will land the patch. The nodes will need to be reimaged for this to take
> effect.

Looking at e.g. https://taskcluster-artifacts.net/TeZ1Sq4eSjC5k-NgAnH5Ww/0/public/logs/live_backing.log it looks like the new machines are running 10.8.4, not 10.8.5.

Looking at userdata/Manifest/gecko-t-win10-64-hw-GW10.json I see that version 10.8.5 is specified, but I can see the sha256 is for version 10.8.4.

It looks like the following (unreviewed) commit introduced the issue since it updated the version number but did not update the tooltool hash: https://github.com/mozilla-releng/OpenCloudConfig/commit/1e953d46810a2a86c648b30f1f40f526795c1c90

I'm happy to review any patches to OCC.
s/sha256/sha512/
Found the proper version of the generic worker in tooltool (10.8.5), and updated the userdata/Manifest/gecko-t-win10-64-hw-GW10.json file. Created pull request https://github.com/bccrisan/OpenCloudConfig/commit/c02c09c5b07a2a9f40c7378dea3eff4ff597708b Please review and merge it if it's ok.
(In reply to Bogdan Crisan [:bcrisan] (UTC +3, EEST) from comment #36) > Found the proper version of the generic worker in tooltool (10.8.5), and > updated the userdata/Manifest/gecko-t-win10-64-hw-GW10.json file. > > Created pull request > https://github.com/bccrisan/OpenCloudConfig/commit/c02c09c5b07a2a9f40c7378dea3eff4ff597708b > > > Please review and merge it if it's ok. That looks like a commit to master rather than a PR. :-) But I would have r+'d it as the diff is correct. Thanks.
(In reply to Attila Craciun [:arny] from comment #38) > There is also a PR: > https://github.com/mozilla-releng/OpenCloudConfig/pull/156 ;) Ah indeed! r+'d in github. Many thanks.
This came in from puppet (~30 minutes ago):

> Thu Jun 21 08:49:06 -0700 2018 /Stage[main]/Generic_worker/Exec[create gpg key]/returns (err): change from notrun to 0 failed: Could not find command 'generic-worker'

as a report for t-yosemite-r7-256.

Did you guys start working on the OSX generic worker?
(In reply to Bogdan Crisan [:bcrisan] (UTC +3, EEST) from comment #40)
> This came in from puppet (~30 minutes ago):
>
> > Thu Jun 21 08:49:06 -0700 2018 /Stage[main]/Generic_worker/Exec[create gpg key]/returns (err): change from notrun to 0 failed: Could not find command 'generic-worker'
>
> as a report for t-yosemite-r7-256.
>
> Did you guys start working on the OSX generic worker?

dhouse: ^
Flags: needinfo?(dhouse)
The generic-worker on OSX was updated yesterday. Looking into Taskcluster, this host does not appear in the pool: https://tools.taskcluster.net/provisioners/releng-hardware/worker-types/gecko-t-osx-1010
I went back through and checked the nodes on chassis 1 (16-45), and the looping issue did not reoccur. However, there were 3 nodes that went unresponsive. Chassis 2 had 3 nodes that were stuck in the loop, but there I found a high percentage of unresponsive nodes. This is an issue we have been hitting (Bug 1452133). I am going to continue to go through the newly installed nodes, identify issues, reinstall to get them back into the pool, and continue to dive into this.
Flags: needinfo?(dhouse)
Among the remaining nodes that had the updated version of GW, I found 30 nodes that were in an unresponsive or possibly shut-down state, which is a much higher frequency than with the older version. It may be unrelated, but I am going to continue to monitor this, with the possibility of rolling all but the chassis 1 nodes back to GW 8. The looping issue seems to have gone away on the newly installed nodes.
Going through the unresponsive nodes, the nodes were locked up, not shut down. According to the last reported logs they locked up at different points: when rundsc.ps1 issued a reboot, when the wrapper script issued a reboot, and while waiting for a task. Other nodes had not installed correctly; either the node kept trying to PXE boot, or kept trying to PXE boot over IPv6. Most likely it is not related, but the number of affected GW 10 workers was concerning. I am going to continue to monitor.
For now, I am going to hold off on completing the rollout until Monday of next week.
Depends on: 1470338
Some of the nodes that were unresponsive actually had failed to reboot. I have opened up Bug 1470338 to track those.
In the last 20 hours, 15 out of 100 nodes went from good to not able to pick up tasks.

(16-45): Unresponsive: 20, 24, 32, 39. Failed to reboot: 23, 42.
(61-90): Unresponsive: 72. Failed to reboot: 75, 85.
(106-135): Unresponsive: 107. Failed to reboot: 115, 118, 126.
(151-171): Failed to reboot: 154, 155, 170.

The unresponsive nodes seem to be either a different issue or an accelerated rate compared to the previously known issue. This is now being tracked in Bug 1470507.
CiDuty: Could you all reinstall all nodes except ms-016 - ms-045 to the old task sequence? We will resume the upgrade after Bug 1470507 and Bug 1470338 are resolved.
Flags: needinfo?(riman)
Flags: needinfo?(ciduty)
Flags: needinfo?(apop)
Yes we can. I'm going to start right now.
Flags: needinfo?(ciduty)
I am going to move to using this repo while troubleshooting: https://github.com/mozilla-platform-ops/OpenCloudConfig .
I've re-imaged T-W1064-MS-061..115 to the old task sequence
I've re-imaged the rest of them to the old sequence: T-W1064-MS-116..171, except no. 130, which seems to have some issue: it is not getting past boot and shows a blue screen saying it encountered an error. All the re-imaged workers (except 130) are alive and completing jobs.
(In reply to Radu Iman[:riman] from comment #50)
> Yes we can. I'm going to start right now.

(In reply to Zsolt Fay [:zsoltfay] from comment #53)
> I've re-imaged the rest of them to the old sequence: T-W1064-MS-116..171,
> except no. 130, which seems to have some issue: it is not getting past boot
> and shows a blue screen saying it encountered an error.
>
> All the re-imaged workers (except 130) are alive and completing jobs.

Thank you ciduty for your help with reimaging.
There were multiple complications between OCC and GenericWorker on that last attempt at the upgrade.

There are 2 files that the wrapper script looks for before starting up GW. One signals the end of the OCC manifest being applied (EndOfManifest.semaphore), and one signals the end of rundsc.ps1 (task-claim-state.valid). In both cases the wrapper script would fail to delete the file because of file locks. Because those were not being removed, the wrapper script would jump straight to starting GW. Rundsc.ps1 would then exit once it detected the GW process. This led to a state where OCC was never fully applied. I have added multiple catches to delete these files if they exist when GW exits or when there is an early exit of rundsc.ps1.

OCC would then apply fully on each boot. However, the validation of the GW install would fail and OCC would then install GW again. GW would then try to start using the default wrapper script and fail:

"CommandsReturn": [
  {
    "Command": "C:\\generic-worker\\generic-worker.exe",
    "Arguments": [
      "--version"
    ],
    "Match": "generic-worker 10.8.5"

while the command actually returns:

C:\Users\task_1531104133>c:\generic-worker\generic-worker.exe --version
2018/07/09 04:57:37 Making system call GetProfilesDirectoryW with args: [0 C0423C56D8]
2018/07/09 04:57:37 Result: 0 0 The data area passed to a system call is too small.
2018/07/09 04:57:37 Making system call GetProfilesDirectoryW with args: [C0423F5D00 C0423C56D8]
2018/07/09 04:57:37 Result: 1 0 The operation completed successfully.
generic-worker 10.8.5 [ revision: https://github.com/taskcluster/generic-worker/commits/034c836cfd18ebe8b7fb6dbfabccea4bcd0fa1f6 ]

For the time being I have removed this validation piece. In addition, the task user init script (https://github.com/mozilla-releng/OpenCloudConfig/blob/master/userdata/Configuration/GenericWorker/task-user-init-win10.cmd) has been added to the hw configuration.

I have added additional logging to the wrapper script looking at the DSC dir after GW exits, as well as an attempt at dumping the full list of running tasks when a node fails to reboot, though I suspect the reboot issue was caused by unfinished portions of OCC. Also, to get more info, the generic-worker-service.log is being sent to papertrail.

My hope is that there will be fewer issues because OCC is fully completing, and if not, the additional logging will give us an idea of the possible cause.
Rob: I still need to clean it up a bit, but do you have any feedback? Any ideas on validation of the GW install?
Attachment #8990654 - Flags: feedback?(rthijssen)
i'm not sure how we can validate the gw version without running `generic-worker.exe --version`
have i understood correctly that calling gw with the version flag causes gw to do more than just return the version?
(In reply to Rob Thijssen (:grenade UTC+2) from comment #57)
> i'm not sure how we can validate the gw version without running
> `generic-worker.exe --version`
> have i understood correctly that calling gw with the version flag causes gw
> to do more than just return the version?

Yes.

C:\Users\task_1531104133>c:\generic-worker\generic-worker.exe --version
2018/07/09 04:57:37 Making system call GetProfilesDirectoryW with args: [0 C0423C56D8]
2018/07/09 04:57:37 Result: 0 0 The data area passed to a system call is too small.
2018/07/09 04:57:37 Making system call GetProfilesDirectoryW with args: [C0423F5D00 C0423C56D8]
2018/07/09 04:57:37 Result: 1 0 The operation completed successfully.
generic-worker 10.8.5 [ revision: https://github.com/taskcluster/generic-worker/commits/034c836cfd18ebe8b7fb6dbfabccea4bcd0fa1f6 ]
I have added this to the testing repo: https://github.com/mozilla-platform-ops/OpenCloudConfig/commit/91b6cba9a76d184a4b2a48c3a6d49fffb604b636 If one of the flag files does not exist and PowerShell is not running, clear the lock file and flag files if they exist, and reboot to try the process again. This sprung from https://bugzilla.mozilla.org/show_bug.cgi?id=1474678 . There seems to be an intermittent issue with the Mellanox network drivers on startup that prevented OCC from getting to externally sourced files.
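A minimal sketch of the check described above, using the flag and lock file names from the comments above; the details are illustrative, not the code in the linked commit:

# Sketch only: if a flag file is missing and no other PowerShell (OCC/rundsc.ps1) is running,
# clear the lock and flag files and reboot so the whole process can start over.
$flags = @('C:\dsc\EndOfManifest.semaphore', 'C:\dsc\task-claim-state.valid')
$lock  = 'C:\dsc\in-progress.lock'
$missingFlag = $flags | Where-Object { !(Test-Path $_) }
$psRunning   = Get-Process -Name 'powershell' -ErrorAction SilentlyContinue |
  Where-Object { $_.Id -ne $PID }   # ignore this script's own process
if ($missingFlag -and -not $psRunning) {
  Remove-Item -Path ($flags + $lock) -Force -ErrorAction SilentlyContinue
  shutdown @('-r', '-t', '0', '-c', 'flag file missing; retrying OCC', '-f')
}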
Comment on attachment 8990654 [details] Improvement on hw gw 10 configurations

pmoore:

up until now we have used a validation in occ that looks for a version string in the output from `generic-worker.exe --version` in the form of a complete line that matches: "generic-worker 10.8.5"
eg: https://github.com/mozilla-releng/OpenCloudConfig/blob/bdbb7ea/userdata/Manifest/gecko-1-b-win2012.json#L1126

it seems that at some point the output from the version command changed to include a link to the git hash the version was built from. this change means that dsc assumes that the correct version of gw is not installed and goes on to reinstall gw on every occ run.

i will patch occ to allow for the extra information in the output from the version command.

in future, a heads up that a change like this is being introduced, would be appreciated and will help us to avoid bustage.

also, is it intentional that there is so much output from the version command? eg (https://tools.taskcluster.net/groups/SdP2Gz9zS8-6F5ROWG-RUQ/tasks/SdP2Gz9zS8-6F5ROWG-RUQ/runs/0/logs/public%2Flogs%2Flive.log):

Z:\task_1531296888>C:\generic-worker\generic-worker.exe --version
2018/07/11 08:23:31 Making system call GetProfilesDirectoryW with args: [0 C0423D76E8]
2018/07/11 08:23:31 Result: 0 0 The data area passed to a system call is too small.
2018/07/11 08:23:31 Making system call GetProfilesDirectoryW with args: [C042405C40 C0423D76E8]
2018/07/11 08:23:31 Result: 1 0 The operation completed successfully.
generic-worker 10.8.5 [ revision: https://github.com/taskcluster/generic-worker/commits/034c836cfd18ebe8b7fb6dbfabccea4bcd0fa1f6 ]

is it necessary to make those system calls in order to determine the version? also the output seems superfluous to what a user would expect when requesting the version.
Flags: needinfo?(pmoore)
Attachment #8990654 - Flags: feedback?(rthijssen) → feedback+
OCC is now patched to handle the extra line content: https://github.com/mozilla-releng/OpenCloudConfig/commit/462d9fd

the implementation now includes a *like* comparison so the validation for the generic worker version now looks like this:

"Like": "generic-worker 10.8.5 *"

instead of:

"Match": "generic-worker 10.8.5"
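For illustration only, a rough PowerShell equivalent of why the wildcard comparison now passes where the exact match failed; the variable handling here is an assumption, not the actual OCC DSC code:

# Sketch only: exact match fails against the new output; wildcard -like succeeds.
$versionLine = (& 'C:\generic-worker\generic-worker.exe' --version) |
  Where-Object { $_ -like 'generic-worker *' } | Select-Object -First 1
$versionLine -eq   'generic-worker 10.8.5'     # false: the line now ends with a revision link
$versionLine -like 'generic-worker 10.8.5 *'   # true:  the wildcard tolerates the trailing revision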
pmoore: in light of the switch from "Match" to "Like", it's possible that the regex at https://github.com/petemoore/myscrapbook/blob/master/upgrade-gw-betas-cu.sh#L55 may need adaptation. maybe not, but worth checking...
(In reply to Rob Thijssen (:grenade UTC+2) from comment #60)
> Comment on attachment 8990654 [details]
> Improvement on hw gw 10 configurations
>
> pmoore:
>
> up until now we have used a validation in occ that looks for a version
> string in the output from `generic-worker.exe --version` in the form of a
> complete line that matches: "generic-worker 10.8.5"
> eg:
> https://github.com/mozilla-releng/OpenCloudConfig/blob/bdbb7ea/userdata/Manifest/gecko-1-b-win2012.json#L1126
>
> it seems that at some point the output from the version command changed to
> include a link to the git hash the version was built from. this change means
> that dsc assumes that the correct version of gw is not installed and goes on
> to reinstall gw on every occ run.
>
> i will patch occ to allow for the extra information in the output from the
> version command.
>
> in future, a heads up that a change like this is being introduced, would be
> appreciated and will help us to avoid bustage.

Apologies for this, I had forgotten that OCC uses this, indeed I should have given you a heads up.

Another option could be to check e.g. the SHA256 of the binary, rather than the output of `generic-worker --version`. This has the advantage that it helps defend against an attack whereby an attacker replaces generic-worker.exe with a different binary that fakes the "version" output but does something malicious when called with the "run" target.

However, I will make sure I notify in future if there are any changes, that was a careless oversight of mine.

> also, is it intentional that there is so much output from the version
> command? eg
> (https://tools.taskcluster.net/groups/SdP2Gz9zS8-6F5ROWG-RUQ/tasks/SdP2Gz9zS8-6F5ROWG-RUQ/runs/0/logs/public%2Flogs%2Flive.log):
>
> Z:\task_1531296888>C:\generic-worker\generic-worker.exe --version
> 2018/07/11 08:23:31 Making system call GetProfilesDirectoryW with args: [0 C0423D76E8]
> 2018/07/11 08:23:31 Result: 0 0 The data area passed to a system call is too small.
> 2018/07/11 08:23:31 Making system call GetProfilesDirectoryW with args: [C042405C40 C0423D76E8]
> 2018/07/11 08:23:31 Result: 1 0 The operation completed successfully.
> generic-worker 10.8.5 [ revision: https://github.com/taskcluster/generic-worker/commits/034c836cfd18ebe8b7fb6dbfabccea4bcd0fa1f6 ]
>
> is it necessary to make those system calls in order to determine the
> version? also the output seems superfluous to what a user would expect when
> requesting the version.

Unfortunately this is a limitation of the docopt-go library we are using. The command line parser requires that the help text is passed into the method call that parses the command arguments. We need to parse the arguments to see that --help is called, and the help text for the command includes docs for the parameter "tasksDir". The default value of this property is system dependent, and depends on the Profiles Directory on the system. In other words, the --help output needs to know where the system Profiles Directory is in order to provide the --help text, and the parser needs the full help text before it parses the command line arguments.

The reason we log all system calls is because it is very useful for troubleshooting when there are failures, as the go/c boundary is a potential source of failure. I believe the output is sent to standard error rather than standard out, so if standard error is disabled, it won't be shown, but I see that it is less than ideal.
However, those system calls are needed so it is also kind of useful to see that they are made.
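A minimal sketch of the SHA256-based validation suggested above, assuming the expected digest is known from tooltool; the hash value below is a placeholder, not the real digest:

# Sketch only: validate the installed binary by its hash instead of by --version output.
$expectedSha256 = '<sha256 of the known-good generic-worker 10.8.5 binary>'   # placeholder
$actualSha256 = (Get-FileHash -Path 'C:\generic-worker\generic-worker.exe' -Algorithm SHA256).Hash
if ($actualSha256 -ne $expectedSha256) {
  Write-Output 'generic-worker binary does not match the expected hash; reinstall required'
}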
Flags: needinfo?(pmoore)
(In reply to Rob Thijssen (:grenade UTC+2) from comment #62) > pmoore: in light of the switch from "Match" to "Like", it's possible that > the regex at > https://github.com/petemoore/myscrapbook/blob/master/upgrade-gw-betas-cu. > sh#L55 may need adaptation. maybe not, but worth checking... I think it is ok. Thanks for the heads up!
(In reply to Rob Thijssen (:grenade UTC+2) from comment #60)
> it seems that at some point the output from the version command changed to
> include a link to the git hash the version was built from. this change means
> that dsc assumes that the correct version of gw is not installed and goes on
> to reinstall gw on every occ run.

Hey Rob,

I remember we had a couple of changes in the pipeline to help with these types of issues, but I can't find the bug numbers right now. Two things that would help in this situation are:

1) we only apply OCC manifests during AMI creation (so we wouldn't repeatedly reapply every worker run)
2) if a validation step fails (such as validating the g-w version number) the AMI creation process should fail

With both of these safeguards, I think we would stop this issue earlier, and it couldn't get rolled out to production.

Can you confirm if it is still intended to implement these changes? Thanks in advance.
Flags: needinfo?(rthijssen)
It looks like generic-worker 8.3.0 is running in production on Windows 10 hardware. See this task from today: https://tools.taskcluster.net/groups/Vbh4i4X1Qm-jcEtWb2TnDQ/tasks/AU4drFJNQM6SN7_lQaOx-w/runs/0/logs/public%2Flogs%2Flive.log#L11 This is a release from April 2017.

Also the PR from comment 38 still needs landing, although I'm not sure it will have much effect since I suspect the wrong manifest is getting used. It looks like there are two:

* https://github.com/mozilla-releng/OpenCloudConfig/blob/master/userdata/Manifest/gecko-t-win10-64-hw.json
* https://github.com/mozilla-releng/OpenCloudConfig/blob/master/userdata/Manifest/gecko-t-win10-64-hw-GW10.json

I suspect the second one is the one that should be used in production, but at least for the log link above, it looks like maybe the first one is getting used.
Flags: needinfo?(mcornmesser)
(In reply to Pete Moore [:pmoore][:pete] from comment #66)
> It looks like generic-worker 8.3.0 is running in production on Windows 10
> hardware.

Yes, that's the entire point of this bug, to get us onto gw10. We attempted it once, as part of attempting to work around some of the issues we were having on the Moonshot hardware, but found issues with running gw10 on hardware.
Currently generic-worker 10.8.5 is only on ms-016 through ms-045. Up until last week we were hitting the issues mentioned in comment 55, which were preventing a full deployment. There are two issues that are still preventing a full deployment: bug 1470338, where gw 10 nodes occasionally fail to reboot, and bug 1474678, which seems to be a network issue that has only affected the gw 10 nodes. The latter may also just be related to the chassis the nodes are in. I am also a bit concerned about bug 1474729, with some nodes seemingly running 2 tasks at once. I am hoping to be able to do a wide deployment the week of the 23rd.
Flags: needinfo?(mcornmesser)
Depends on: 1474678, 1474729
(In reply to Pete Moore [:pmoore][:pete] from comment #65)
> Can you confirm if it is still intended to implement these changes?

yes
Flags: needinfo?(rthijssen)
(In reply to Mark Cornmesser [:markco] from comment #68)
> There are two issues that are still preventing a full deployment: bug
> 1470338, where gw 10 nodes occasionally fail to reboot, and bug 1474678,
> which seems to be a network issue that has only affected the gw 10 nodes.

Bug 1470338: Indeed this looks like an OCC issue; probably :grenade is the best person to help with that.

Bug 1474678: This looks to be due to bug 1372172 - something in the win10 image is causing the machine to freeze/reboot. I've added my analysis to the bug.

Neither of these issues is related to the generic-worker version.
Some of these issues may be related to https://bugzilla.mozilla.org/show_bug.cgi?id=1475258, which is beginning to look like a possible hardware failure.
Current status:

I have seen no issues with the functionality of generic-worker 10 over the last month, except that some cleanup is not happening. Task user home directories are not being removed, and there is a OneDrive scheduled task for each user that has been created and never gets deleted.

We are trying to get the overall environment to stabilize before deploying the current version of generic-worker. In the test pools, chassis 1 in both mdc1 and mdc2, the rundsc errors have been resolved and the Mellanox network drivers have been updated. However, we are seeing an issue where a node occasionally cannot get to external resources. When this happens OCC fails trying to get to those resources, and generic-worker is unable to talk to Taskcluster. When this happens rundsc.ps1 is never downloaded. This leads to OCC not running again after reboots and generic-worker never receiving the flags it needs to start. We are hoping that a firmware upgrade will prevent this issue from happening.

Before upgrading across the board we are blocked on two things. One is the complete cleanup of the task dirs and old scheduled tasks. The other, which is kind of a soft blocker, is the issue of being unable to get to external resources on boot. We may want to upgrade even if the latter is happening. I am going to evaluate that this week.
No longer blocks: T-W1064-MS-087
No longer blocks: T-W1064-MS-066
To address the issue where external resources are not available, I have added two catches.

During deployment, before the first OCC run, there is now a scheduled task created to check for the existence of the rundsc.ps1 script:

if (!(Test-Path C:\dsc\rundsc.ps1)) {
  (New-Object Net.WebClient).DownloadFile(("https://raw.githubusercontent.com/markcor/OpenCloudConfig/master/userdata/rundsc.ps1?{0}" -f [Guid]::NewGuid()), 'C:\dsc\rundsc.ps1')
  while (!(Test-Path "C:\dsc\rundsc.ps1")) {
    Start-Sleep 10
  }
  Remove-Item -Path c:\dsc\in-progress.lock -force -ErrorAction SilentlyContinue
  shutdown @('-r', '-t', '0', '-c', 'Rundsc.ps1 did not exists; Restarting', '-f')
}

If the script does not exist, the node will download it and reboot. I have also added a step to rundsc.ps1 so that if the node can't reach GitHub, the lock file is deleted and the node reboots:

if ($locationType -eq 'DataCenter') {
  if (!(Test-Connection github.com -quiet)) {
    Remove-Item -Path $lock -force -ErrorAction SilentlyContinue
    shutdown @('-r', '-t', '0', '-c', 'reboot; external resources are not available', '-f', '-d', '4:5') | Out-File -filePath $logFile -append
  }
}

This seems to have kept the nodes from being up but not running generic-worker.

To address the issue of a OneDrive scheduled task being created for every user, I have added a script to the deployment that removes OneDrive and all its components:

Import-Module -DisableNameChecking $PSScriptRoot\..\lib\force-mkdir.psm1
Import-Module -DisableNameChecking $PSScriptRoot\..\lib\take-own.psm1

echo "73 OneDrive process and explorer"
taskkill.exe /F /IM "OneDrive.exe"
taskkill.exe /F /IM "explorer.exe"

echo "Remove OneDrive"
if (Test-Path "$env:systemroot\System32\OneDriveSetup.exe") {
  & "$env:systemroot\System32\OneDriveSetup.exe" /uninstall
}
if (Test-Path "$env:systemroot\SysWOW64\OneDriveSetup.exe") {
  & "$env:systemroot\SysWOW64\OneDriveSetup.exe" /uninstall
}

echo "Disable OneDrive via Group Policies"
force-mkdir "HKLM:\SOFTWARE\Wow6432Node\Policies\Microsoft\Windows\OneDrive"
sp "HKLM:\SOFTWARE\Wow6432Node\Policies\Microsoft\Windows\OneDrive" "DisableFileSyncNGSC" 1

echo "Removing OneDrive leftovers trash"
rm -Recurse -Force -ErrorAction SilentlyContinue "$env:localappdata\Microsoft\OneDrive"
rm -Recurse -Force -ErrorAction SilentlyContinue "$env:programdata\Microsoft OneDrive"
rm -Recurse -Force -ErrorAction SilentlyContinue "C:\OneDriveTemp"

echo "Remove Onedrive from explorer sidebar"
New-PSDrive -PSProvider "Registry" -Root "HKEY_CLASSES_ROOT" -Name "HKCR"
mkdir -Force "HKCR:\CLSID\{018D5C66-4533-4307-9B53-224DE2ED1FE6}"
sp "HKCR:\CLSID\{018D5C66-4533-4307-9B53-224DE2ED1FE6}" "System.IsPinnedToNameSpaceTree" 0
mkdir -Force "HKCR:\Wow6432Node\CLSID\{018D5C66-4533-4307-9B53-224DE2ED1FE6}"
sp "HKCR:\Wow6432Node\CLSID\{018D5C66-4533-4307-9B53-224DE2ED1FE6}" "System.IsPinnedToNameSpaceTree" 0
Remove-PSDrive "HKCR"

echo "Removing run option for new users"
reg load "hku\Default" "C:\Users\Default\NTUSER.DAT"
reg delete "HKEY_USERS\Default\SOFTWARE\Microsoft\Windows\CurrentVersion\Run" /v "OneDriveSetup" /f
reg unload "hku\Default"

echo "Removing startmenu junk entry"
rm -Force -ErrorAction SilentlyContinue "$env:userprofile\AppData\Roaming\Microsoft\Windows\Start Menu\Programs\OneDrive.lnk"

echo "Restarting explorer..."
start "explorer.exe"

echo "Wait for EX reload.."
sleep 15

echo "Removing additional OneDrive leftovers"
foreach ($item in (ls "$env:WinDir\WinSxS\*onedrive*")) {
  Takeown-Folder $item.FullName
  rm -Recurse -Force $item.FullName
}

To address the task user home directories that have been left behind, I am trying to add an OCC function to remove those directories once they hit a day old. Once that is addressed I will submit a PR for review for generic-worker 10.8.5. Shortly after, I will look at getting the Win 10 hardware nodes to the most current version.
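For reference, a minimal sketch of what such a cleanup function might look like; the one-day threshold and the use of LastWriteTime are assumptions, not the actual OCC function:

# Sketch only: remove task user home directories that are more than a day old.
$cutoff = (Get-Date).AddDays(-1)
Get-ChildItem 'C:\Users' -Directory -Filter 'task_*' |
  Where-Object { $_.LastWriteTime -lt $cutoff } |
  Remove-Item -Recurse -Force -ErrorAction SilentlyContinue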
I have added:

if (Test-Path C:\dsc\GW10.semaphore) {
  $currenttaskuser = get-localuser -name task*
  $currenttaskname = $currenttaskuser.name
  Get-ChildItem "c:\Users" -exclude "$currenttaskname", "Default", "Administrator", "Public" | Remove-Item -Force -Recurse
}

to address the task user home directories. This will find the current user and then delete the other task user directories.
See comment 73 and comment 74. I will update the source repo before merging.
Attachment #9002623 - Flags: review?(rthijssen)
Attachment #9002623 - Flags: feedback?(pmoore)
Blocks: 1484870
Comment on attachment 9002623 [details] [review] Add support for generic-worker 10 on hardware lgtm. some comments in gh review regarding references to forked repo.
Attachment #9002623 - Flags: review?(rthijssen) → review+
(In reply to Mark Cornmesser [:markco] from comment #72)
> Current status:
>
> I have seen no issues with the functionality of generic-worker 10 over the
> last month, except that some cleanup is not happening. Task user home
> directories are not being removed, and there is a OneDrive scheduled task
> for each user that has been created and never gets deleted.

Please can you provide example logs showing user directory deletion failing / not happening, config settings used, links to failed tasks and/or links to papertrail worker logs etc. to support this claim?

What are the scheduled tasks? generic-worker 10 doesn't create any scheduled tasks. Are you sure you are running the correct version?

> We are trying to get the overall environment to stabilize before deploying
> the current version of generic-worker. In the test pools, chassis 1 in both
> mdc1 and mdc2, the rundsc errors have been resolved and the Mellanox network
> drivers have been updated. However, we are seeing an issue where a node
> occasionally cannot get to external resources. When this happens OCC fails
> trying to get to those resources, and generic-worker is unable to talk to
> Taskcluster. When this happens rundsc.ps1 is never downloaded. This leads to
> OCC not running again after reboots and generic-worker never receiving the
> flags it needs to start. We are hoping that a firmware upgrade will prevent
> this issue from happening.
>
> Before upgrading across the board we are blocked on two things. One is the
> complete cleanup of the task dirs and old scheduled tasks. The other, which
> is kind of a soft blocker, is the issue of being unable to get to external
> resources on boot. We may want to upgrade even if the latter is happening.
> I am going to evaluate that this week.

As above, generic-worker 10 doesn't create any scheduled tasks - please provide details of what the scheduled tasks are. Also please provide links to worker logs or task logs or screenshots, or any evidence to support your claims.

Thanks! This will help people to support you.
Flags: needinfo?(mcornmesser)
(In reply to Mark Cornmesser [:markco] from comment #74)
> I have added:
>
> if (Test-Path C:\dsc\GW10.semaphore) {
>   $currenttaskuser = get-localuser -name task*
>   $currenttaskname = $currenttaskuser.name
>   Get-ChildItem "c:\Users" -exclude "$currenttaskname", "Default", "Administrator", "Public" | Remove-Item -Force -Recurse
> }
>
> to address the task user home directories. This will find the current user
> and then delete the other task user directories.

This shouldn't be necessary. If there is an underlying problem, it should be addressed where the problem is rather than building a workaround on top. Adding more workarounds will lead to an increasingly chaotic and unmaintainable system.

It is not useful or sufficient to claim that something isn't working. Claims in bugs /always/ need to be backed up with evidence (worker logs / task logs / config settings / screenshots / console dumps / whatever ....) - otherwise they are just hearsay, serve little purpose, and nobody can support you.

Many thanks.
From ms-016, which was reimaged about 3 hours ago:

C:\windows\system32>dir c:\Users
 Volume in drive C is Windows
 Volume Serial Number is C898-9E3D

 Directory of c:\Users

08/21/2018  02:35 PM    <DIR>          .
08/21/2018  02:35 PM    <DIR>          ..
08/20/2018  10:32 PM    <DIR>          Administrator
08/20/2018  09:06 PM    <DIR>          Public
08/21/2018  11:49 AM    <DIR>          task_1534850768
08/21/2018  01:06 PM    <DIR>          task_1534851983
08/21/2018  02:39 PM    <DIR>          task_1534856554
08/21/2018  02:39 PM    <DIR>          task_1534862131
               0 File(s)              0 bytes
               8 Dir(s)  41,465,036,800 bytes free

https://papertrailapp.com/groups/1141234/events?focus=968569269854035972&selected=968569269854035972

Aug 21 07:35:31 T-W1064-MS-016.mdc1.mozilla.com generic-worker: Generic worker ran successfully (exit code 67) rebooting
Aug 21 07:35:31 T-W1064-MS-016.mdc1.mozilla.com generic-worker: 2018/08/21 14:35:30 Trying to remove directory 'C:\Users\task_1534850768' via os.RemoveAll(path) call as GenericWorker user...
Aug 21 07:35:31 T-W1064-MS-016.mdc1.mozilla.com generic-worker: 2018/08/21 14:35:30 WARNING: could not delete directory 'C:\Users\task_1534850768' with os.RemoveAll(path) method
Aug 21 07:35:31 T-W1064-MS-016.mdc1.mozilla.com generic-worker: 2018/08/21 14:35:30 remove C:\Users\task_1534850768\AppData\Local\Microsoft\Windows\Application Shortcuts: Access is denied.
Aug 21 07:35:31 T-W1064-MS-016.mdc1.mozilla.com generic-worker: 2018/08/21 14:35:30 Trying to remove directory 'C:\Users\task_1534850768' via del command as GenericWorker user...
Aug 21 07:35:31 T-W1064-MS-016.mdc1.mozilla.com generic-worker: 2018/08/21 14:35:30 Running command: 'cmd' '/c' 'del' '/s' '/q' '/f' 'C:\Users\task_1534850768'
Aug 21 07:35:31 T-W1064-MS-016.mdc1.mozilla.com generic-worker: 2018/08/21 14:35:30 Trying to remove directory 'C:\Users\task_1534851983' via os.RemoveAll(path) call as GenericWorker user...
Aug 21 07:35:31 T-W1064-MS-016.mdc1.mozilla.com generic-worker: 2018/08/21 14:35:30 WARNING: could not delete directory 'C:\Users\task_1534851983' with os.RemoveAll(path) method
Aug 21 07:35:31 T-W1064-MS-016.mdc1.mozilla.com generic-worker: 2018/08/21 14:35:30 remove C:\Users\task_1534851983\AppData\Local\Microsoft\Windows\Application Shortcuts: Access is denied.
Aug 21 07:35:31 T-W1064-MS-016.mdc1.mozilla.com generic-worker: 2018/08/21 14:35:30 Trying to remove directory 'C:\Users\task_1534851983' via del command as GenericWorker user...
Aug 21 07:35:31 T-W1064-MS-016.mdc1.mozilla.com generic-worker: 2018/08/21 14:35:30 Running command: 'cmd' '/c' 'del' '/s' '/q' '/f' 'C:\Users\task_1534851983'
Aug 21 07:35:31 T-W1064-MS-016.mdc1.mozilla.com generic-worker: 2018/08/21 14:35:31 Trying to remove directory 'C:\Users\task_1534850768' via os.RemoveAll(path) call as GenericWorker user...
Aug 21 07:35:31 T-W1064-MS-016.mdc1.mozilla.com generic-worker: 2018/08/21 14:35:31 WARNING: could not delete directory 'C:\Users\task_1534850768' with os.RemoveAll(path) method
Aug 21 07:35:31 T-W1064-MS-016.mdc1.mozilla.com generic-worker: 2018/08/21 14:35:31 remove C:\Users\task_1534850768\AppData\Local\Microsoft\Windows\Application Shortcuts: Access is denied.
Aug 21 07:35:31 T-W1064-MS-016.mdc1.mozilla.com generic-worker: 2018/08/21 14:35:31 Trying to remove directory 'C:\Users\task_1534850768' via del command as GenericWorker user...
Aug 21 07:35:31 T-W1064-MS-016.mdc1.mozilla.com generic-worker: 2018/08/21 14:35:31 Running command: 'cmd' '/c' 'del' '/s' '/q' '/f' 'C:\Users\task_1534850768'
Aug 21 07:35:31 T-W1064-MS-016.mdc1.mozilla.com generic-worker: 2018/08/21 14:35:31 Trying to remove directory 'C:\Users\task_1534851983' via os.RemoveAll(path) call as GenericWorker user...
Aug 21 07:35:31 T-W1064-MS-016.mdc1.mozilla.com generic-worker: 2018/08/21 14:35:31 WARNING: could not delete directory 'C:\Users\task_1534851983' with os.RemoveAll(path) method
Aug 21 07:35:31 T-W1064-MS-016.mdc1.mozilla.com generic-worker: 2018/08/21 14:35:31 remove C:\Users\task_1534851983\AppData\Local\Microsoft\Windows\Application Shortcuts: Access is denied.
Aug 21 07:35:31 T-W1064-MS-016.mdc1.mozilla.com generic-worker: 2018/08/21 14:35:31 Trying to remove directory 'C:\Users\task_1534851983' via del command as GenericWorker user...
Aug 21 07:35:31 T-W1064-MS-016.mdc1.mozilla.com generic-worker: 2018/08/21 14:35:31 Running command: 'cmd' '/c' 'del' '/s' '/q' '/f' 'C:\Users\task_1534851983'
Aug 21 07:35:31 T-W1064-MS-016.mdc1.mozilla.com generic-worker: 2018/08/21 14:35:31 Looking for existing task users to delete...
Aug 21 07:35:31 T-W1064-MS-016.mdc1.mozilla.com generic-worker: 2018/08/21 14:35:31 Resolved 29 tasks in total so far.
Aug 21 07:35:31 T-W1064-MS-016.mdc1.mozilla.com generic-worker: 2018/08/21 14:35:31 Creating Windows user task_1534862131...
Aug 21 07:35:31 T-W1064-MS-016.mdc1.mozilla.com generic-worker: 2018/08/21 14:35:31 Running command: 'net' 'user' 'task_1534862131' 'pWd0_zqP1tkx5DZ1BYhJfgdTpbljc' '/add' '/expires:never' '/passwordchg:no' '/y'
Aug 21 07:35:31 T-W1064-MS-016.mdc1.mozilla.com generic-worker: 2018/08/21 14:35:31 Created new OS user!

As far as the scheduled task goes: Windows creates a scheduled task for each user to auto-update OneDrive (https://onedrive.live.com/about/en-us/). I have currently disabled it on all the generic-worker 10 nodes. I will reinstall one this morning without disabling OneDrive.
Attached image onedrive.jpg
Actually, I had reimaged a node just for this early this morning. From MS-318:

C:\windows\system32>schtasks /query | grep one
OneDrive Standalone Update Task-S-1-5-21    8/22/2018 5:41:17 AM   Ready
OneDrive Standalone Update Task-S-1-5-21    8/22/2018 10:01:26 AM  Ready
OneDrive Standalone Update Task-S-1-5-21    8/22/2018 5:59:17 AM   Ready
OneDrive Standalone Update Task-S-1-5-21    8/22/2018 11:05:10 PM  Ready
OneDrive Standalone Update Task-S-1-5-21    8/22/2018 6:26:05 PM   Ready
OneDrive Standalone Update Task-S-1-5-21    8/22/2018 4:34:24 AM   Ready
OneDrive Standalone Update Task-S-1-5-21    8/22/2018 2:09:21 PM   Ready
Flags: needinfo?(mcornmesser)
pmoore: Do you think I should hold off from merging this pull request?
Flags: needinfo?(pmoore)
(In reply to Mark Cornmesser [:markco] from comment #81)
> pmoore: Do you think I should hold off from merging this pull request?

I think we need to test the worker in a staging pool that isn't taking production jobs. There are two problems with the current approach:

1) it impacts real user pushes
2) it isn't possible to direct tasks to specific test machines - any worker with the given provisionerId/workerType name can take a job, so when you submit a task you don't know if it will get run by a generic-worker 8.3.0 instance or a generic-worker 10.8.4 worker.

If we can set up a dedicated staging pool, it will be easier to troubleshoot what the issues are. For example, I'm curious if the reboots are working between tasks, so I'd like to submit a task to look at the uptime of a given worker, to confirm it rebooted between the current task and the previous task. This will be possible if we have a staging pool with a different workerType configuration setting (like :dragrom set up for macOS workers).

Another thing worth trying is to try to remove one of those directories as an Administrator, and see whether you are able to. The directory deletion is done by the windows service, which is a LocalSystem account.

I can also add some additional logging to the worker to say why it can't delete the directories, if that would be helpful.

Let's set up a staging pool before merging this.
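For reference, a minimal sketch of the kind of uptime check such a task could run; this is just an illustration, not an existing task definition:

# Sketch only: report when the worker last booted, to confirm it rebooted between tasks.
$os = Get-CimInstance Win32_OperatingSystem
$uptime = (Get-Date) - $os.LastBootUpTime
Write-Output ("Last boot: {0}; uptime: {1}" -f $os.LastBootUpTime, $uptime)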
Flags: needinfo?(pmoore)
(In reply to Pete Moore [:pmoore][:pete] from comment #82)
> I can also add some additional logging to the worker to say why it can't
> delete the directories, if that would be helpful.

It looks like this is already there, e.g.:

remove C:\Users\task_1534851983\AppData\Local\Microsoft\Windows\Application Shortcuts: Access is denied.

What does `icacls "C:\Users\task_1534851983\AppData\Local\Microsoft\Windows\Application Shortcuts"` report?

I'm curious what created this directory, and who has permission to delete it / modify its access settings.
Note, if the os.RemoveAll(path) call to the go standard library fails to delete the directory, the worker falls back to using the del cmd built-in, running as the LocalSystem user, namely:

> cmd /c del /s /q /f "<path>"

Maybe we should run the following if the delete fails[1]:

> icacls "<path>" /t /grant:r LocalSystem:(OI)(CI)F

and then try again. I suspect the LocalSystem account and the task user account don't have permission to delete the "Application Shortcuts" folder that is getting created, so in case of failure we might need to explicitly grant permission recursively to all files/folders. This could be an expensive operation so I propose to only do this in the failure case, and then to retry the delete again if the icacls command is successful.

I am still curious what the DACLs are on those "Application Shortcuts" folders though.

----

[1] https://docs.microsoft.com/en-us/windows-server/administration/windows-commands/icacls
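A minimal sketch of the grant-then-retry fallback described above, expressed in PowerShell for illustration; whether generic-worker retries this way is an assumption, and icacls resolves the LocalSystem account under the name "SYSTEM":

# Sketch only: if deleting a task directory fails, grant LocalSystem full recursive
# control and retry the delete.
$path = 'C:\Users\task_1534851983'   # example task user directory from the logs above
cmd /c del /s /q /f "$path"
if (Get-ChildItem -Path $path -Recurse -Force -ErrorAction SilentlyContinue) {
  icacls "$path" /t /grant:r "SYSTEM:(OI)(CI)F"
  cmd /c del /s /q /f "$path"
}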
If you can grant me access to one of the machines, I would be happy to investigate further. Alternatively if you're happy to set up a staging pool, I can also troubleshoot from e.g. an interactive loaner (I can't get an interactive loaner without a staging pool, as an 8.3.0 worker is likely to consume the task). Thanks!
Flags: needinfo?(mcornmesser)
Here is what was returned from the icacls command on a directory that could not be deleted:

C:\windows\system32>icacls "C:\Users\task_1534854599\AppData\Local\Microsoft\Windows\Application Shortcuts"
C:\Users\task_1534854599\AppData\Local\Microsoft\Windows\Application Shortcuts
  NT AUTHORITY\SYSTEM:(I)(OI)(CI)(F)
  BUILTIN\Administrators:(I)(OI)(CI)(F)
  S-1-5-21-1770456216-2325375451-2193181373-1002:(I)(OI)(CI)(F)
  T-W1064-MS-318\task_1534875459:(I)(OI)(CI)(F)

Successfully processed 1 files; Failed processing 0 files

I have set up a staging pool of 5 nodes, ms-320 through ms-324. These nodes have a workerType of gecko-t-win10-64-hbeta and RDP has been enabled on them. However, when a task is retriggered using this worker type (and gecko-t-win10-64-hs), the task is picked up and quickly returns as an exception. The reason is malformed-payload. The only value in the generic-worker config file that changed was the workerType.
Flags: needinfo?(mcornmesser)
for debugging, here's some links to the delete failure logs:

- on hardware instances: https://papertrailapp.com/groups/1958653/events?q=program%3Ageneric-worker%20%22WARNING%3A%20could%20not%20delete%20directory%22
- on ec2 instances: https://papertrailapp.com/groups/2488493/events?q=program%3Ageneric-worker%20%22WARNING%3A%20could%20not%20delete%20directory%22

the failures occur on all windows worker types (hardware, ec2, testers, builders)
(In reply to Mark Cornmesser [:markco] from comment #87)
> For reference:
> https://tools.taskcluster.net/provisioners/releng-hardware/worker-types/gecko-t-win10-64-hbeta

Looking at one of the malformed-payload exceptions, the log says:

> [taskcluster:error] [mounts] task.dependencies needs to include LpA26q4QSiOPOyP0Bdh6Hw since one or more of its artifacts are mounted

See https://tools.taskcluster.net/groups/Cs0lJfSQTS-kB_VDLf-4KA/tasks/Cs0lJfSQTS-kB_VDLf-4KA/runs/0/logs/public%2Flogs%2Flive.log#L26

You just need to add that task as a dependency of the current task. See "dependencies" on the https://docs.taskcluster.net/docs/reference/platform/taskcluster-queue/references/api#request-payload page.
See Also: → 1433854
Mark, did we decide that bug 1433854 isn't blocking this one, as it isn't a showstopper? Can you check the other dependencies, and say if they really are hard dependencies or not? I'm wondering if we can upgrade to generic-worker 10, to unblock all the downstream bugs, and deal with any open issues afterwards. If they aren't critical issues, that would be my preference, but I do see there is quite a long list of dependencies, just not sure if they really are hard blockers or not. Thanks!
Flags: needinfo?(mcornmesser)
No longer depends on: 1451835
No longer depends on: 1451832
No longer depends on: 1433854
Generic-worker 8 is no longer in use in the datacenter. Windows nodes are now using 10.8.5. I am going to open a separate bug to further upgrade generic-worker, but I may wait on bug 1433854 being resolved before upgrading.
Status: NEW → RESOLVED
Closed: 6 years ago
Flags: needinfo?(mcornmesser)
Resolution: --- → FIXED
Summary: upgrade generic worker to 10.x.x (match versions on AWS testers) on Win 10 hardware → upgrade generic worker to 10.8.5 on Win 10 hardware
No longer depends on: 1474678
Attachment #9002623 - Flags: feedback?(pmoore)
See Also: → 1488195