Closed
Bug 1443589
Opened 7 years ago
Closed 6 years ago
upgrade generic worker to 10.8.5 on Win 10 hardware
Categories
(Infrastructure & Operations :: RelOps: General, task)
Infrastructure & Operations
RelOps: General
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: markco, Assigned: markco)
References
Details
Attachments
(6 files)
No description provided.
Assignee
Updated•7 years ago
Assignee: relops → mcornmesser
Comment 1•7 years ago
Any ETA for this?
Assignee
Comment 3•7 years ago
There is a test pool of 25 nodes up. I am hoping to expand it early next week.
Flags: needinfo?(mcornmesser)
Assignee
Updated•7 years ago
Summary: upgrade generic worker to 10.6.0 on Win 10 hardware → upgrade generic worker to 10.x.x (match versions on AWS testers) on Win 10 hardware
Assignee
Updated•7 years ago
Assignee
Comment 4•7 years ago
I am going to bump this up on my priority list for this week.
A few weeks ago I was able to get GenericWorker 10.8.1 installed and running on hardware. However, once I put it out in the wild it began to burn tasks. At the time the failures were test failures, not GenericWorker errors. When this happened I did not have the time to really dive into it.
I have t-w1064-ms-077 spinning up using https://github.com/mozilla-platform-ops/OpenCloudConfig , which will install 10.8.1.
I am using the old SCL3-puppet provisioner, so that I can manually retrigger tests and point them at this particular machine.
I am hoping to either have this resolved or gather enough info so I can pull in help during the work week.
Comment 5•7 years ago
++
Thanks Mark.
Assignee
Comment 6•7 years ago
A copy of the conf file being used
Assignee
Comment 7•7 years ago
Ran a small handful of random hardware tests yesterday, with one test that had previously passed now failing.
I am going to run through all the hardware tests today and will open additional bugs for failing tests.
Other notes:
I did not need to use the scl3-puppet provisioner. Worker types gecko-t-* will work with the releng-hardware provisioner.
With all the tests failing the last time I attempted to test the updated worker, it may have been local network or DNS issues that affected the tests.
Assignee
Comment 8•7 years ago
There were no apparent issues with Talos Tests. All tests were green:
test-windows7-32/opt-talos-chrome-e10s https://tools.taskcluster.net/groups/WVXB-YEpSsWA4K6n24G7ag/tasks/WVXB-YEpSsWA4K6n24G7ag/details
test-windows7-32/opt-talos-dromaeojs-e10s https://tools.taskcluster.net/groups/d-XtOWDMS-yKyVIhQa4Xwg/tasks/d-XtOWDMS-yKyVIhQa4Xwg/details
test-windows7-32/opt-talos-damp-e10s https://tools.taskcluster.net/groups/KER9T-WdQCqbZ0cmIJWIVA/tasks/KER9T-WdQCqbZ0cmIJWIVA/details
test-windows7-32/opt-talos-g4-e10s https://tools.taskcluster.net/groups/JDdMx41VRSqSY2l-c9YnCQ/tasks/JDdMx41VRSqSY2l-c9YnCQ/details
test-windows7-32/opt-talos-g5-e10s https://tools.taskcluster.net/groups/YleIIJM0TRCesQF6DMffmQ/tasks/YleIIJM0TRCesQF6DMffmQ/details
test-windows7-32/opt-talos-h1-e10s https://tools.taskcluster.net/groups/TABdIBQkSCCrnfuUowsg5g/tasks/TABdIBQkSCCrnfuUowsg5g/details
test-windows7-32/opt-talos-perf-reftest-e10s https://tools.taskcluster.net/groups/MCXk-5MQSe-3kfcyYrDTDQ/tasks/MCXk-5MQSe-3kfcyYrDTDQ/details
test-windows7-32/opt-talos-perf-reftest-singletons-e10s https://tools.taskcluster.net/groups/fw5utBIkQNec_8haDpBGaw/tasks/fw5utBIkQNec_8haDpBGaw/details
test-windows7-32/opt-talos-speedometer-e10s https://tools.taskcluster.net/groups/Aqk1ydjJRsyICRkPZ1ihFw/tasks/Aqk1ydjJRsyICRkPZ1ihFw/details
test-windows7-32/opt-talos-tp5o-e10s https://tools.taskcluster.net/groups/ekz1VHmET9iOrQtvu3_bFQ/tasks/ekz1VHmET9iOrQtvu3_bFQ/details
test-windows7-32/opt-talos-tp6-e10s https://tools.taskcluster.net/groups/EtFOerjYRz6FpnNDxhsWXw/tasks/EtFOerjYRz6FpnNDxhsWXw/details
test-windows7-32/opt-talos-tps-e10s https://tools.taskcluster.net/groups/VJtq2fmQTM2kV-9DKmOLDw/tasks/VJtq2fmQTM2kV-9DKmOLDw/details
test-windows10-64/opt-talos-chrome-e10s https://tools.taskcluster.net/groups/Qwm0kL87Q82dWyYQmWGa-w/tasks/Qwm0kL87Q82dWyYQmWGa-w/details
test-windows10-64/opt-talos-dromaeojs-e10s https://tools.taskcluster.net/groups/f7camATqTMWoiViB7HEGlA/tasks/f7camATqTMWoiViB7HEGlA/details
test-windows10-64/opt-talos-damp-e10s https://tools.taskcluster.net/groups/KcdAenecRsyi_77BtnTGNw/tasks/KcdAenecRsyi_77BtnTGNw/details
test-windows10-64/opt-talos-g1-e10s https://tools.taskcluster.net/groups/Rc-q9rmxTHa1Hb8-fb7cFg/tasks/Rc-q9rmxTHa1Hb8-fb7cFg/details
test-windows10-64/opt-talos-g4-e10s https://tools.taskcluster.net/groups/HTwW_FXDR2uQrUWWUprosA/tasks/HTwW_FXDR2uQrUWWUprosA/details
test-windows10-64/opt-talos-g5-e10s https://tools.taskcluster.net/groups/ULh04Dd9QkuvUrrc3DBU0A/tasks/ULh04Dd9QkuvUrrc3DBU0A/details
test-windows10-64/opt-talos-h1-e10s https://tools.taskcluster.net/groups/ACwZbgrXRAiNWk5UwE5LLw/tasks/ACwZbgrXRAiNWk5UwE5LLw/details
test-windows10-64/opt-talos-perf-reftest-e10s https://tools.taskcluster.net/groups/ae68QD0zRRmVSfuM_rgFbw/tasks/ae68QD0zRRmVSfuM_rgFbw/details
test-windows10-64/opt-talos-perf-reftest-singletons-e10s https://tools.taskcluster.net/groups/UdjKpx_tQKuEbfGhieDbHQ/tasks/UdjKpx_tQKuEbfGhieDbHQ/details
test-windows10-64/opt-talos-speedometer-e10s https://tools.taskcluster.net/groups/fg0NhJUoRhyBenVUJs-LAg/tasks/fg0NhJUoRhyBenVUJs-LAg/details
test-windows10-64/opt-talos-tp5o-e10s https://tools.taskcluster.net/groups/W46xJWCkSL-7kdQdvbj_sA/tasks/W46xJWCkSL-7kdQdvbj_sA/details
test-windows10-64/opt-talos-tp6-e10s https://tools.taskcluster.net/groups/IohJ4iuzQNio0q1fsMRtzA/tasks/IohJ4iuzQNio0q1fsMRtzA/details
test-windows10-64/opt-talos-tps-e10s https://tools.taskcluster.net/groups/JPDkd_bWQtCqJkx3QcsoKw/tasks/JPDkd_bWQtCqJkx3QcsoKw/details
Assignee
Comment 9•7 years ago
Reftest as well:
test-windows10-64/opt-reftest-e10s-1 https://tools.taskcluster.net/groups/PkIubl-8R1eRyvrsqfJuQg/tasks/PkIubl-8R1eRyvrsqfJuQg/details
test-windows10-64/opt-reftest-e10s-2 https://tools.taskcluster.net/groups/V4NTRYHxTE-OrLeGT9Gx2Q/tasks/V4NTRYHxTE-OrLeGT9Gx2Q/details
Assignee
Comment 10•7 years ago
I am setting up 2 nodes, ms-135 and ms-81, to run as the gecko-t-win10-64-hw worker type with GenericWorker 10.8.1. I am not too concerned with the functionality of the worker as much as the extended life of the nodes. After a mass of tests are run on these nodes, I will go back through and look at the state of each node concerning disk space and scheduled tasks. After this I will update the OCC testing repo and roll out a 5-machine test pool.
For the testing repo I am going to include functions similar to those covered in https://bugzilla.mozilla.org/show_bug.cgi?id=1451837. This includes disk space management and datacenter location decisions.
The next sticky part is figuring out a method by which we can do an incremental rollout of the upgraded worker. I am thinking about creating a second Win 10 OCC manifest and creating a flag during the initial deployment that OCC can check when deciding which manifest to use.
Assignee
Comment 11•7 years ago
The 2 test nodes are up, picking up tasks, and passing tests.
Comment 12•7 years ago
(In reply to Mark Cornmesser [:markco] from comment #11)
> The 2 test nodes are up, picking up tasks, and passing tests.
Great news!
Assignee
Comment 13•7 years ago
To upgrade to 10.8.4:
There is a new manifest and a conditional clause looking for a GW 10 flag file from deployment. The reason for this is so we can do an incremental upgrade without breaking the entire hardware pool.
This also incorporates hard disk cleanup from Bug 1451837, as well as removal of KTS from Bug 1454759.
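The flag-file check behind the incremental rollout can be sketched roughly as follows. This is a bash illustration only: the real logic lives in OCC's PowerShell/batch tooling, and the flag-file path is an assumption; the two manifest names are the ones that exist in the OCC repo.

```shell
#!/bin/bash
# Sketch of incremental-rollout manifest selection: pick the GW 10
# manifest only if a flag file was dropped during initial deployment,
# so un-flagged nodes keep using the existing GW 8 manifest.
select_manifest() {
  local flag_file="$1"
  if [ -f "$flag_file" ]; then
    echo "gecko-t-win10-64-hw-GW10.json"   # upgraded worker pool
  else
    echo "gecko-t-win10-64-hw.json"        # existing GW 8 pool
  fi
}

# The flag path below is a hypothetical example, not OCC's actual path.
select_manifest "/c/dsc/gw10.flag"
```

The point of the design is that re-imaging a node with the flag opts it into the new manifest, while every other node is untouched, so a bad manifest cannot take out the whole hardware pool at once.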
Attachment #8985274 - Flags: feedback?(rthijssen)
Comment 14•7 years ago
Comment on attachment 8985274 [details]
OCC win 10 hw GenericWorker upgrade
r+, looks good to me.
in function hw-DiskManage, there's a reboot that references the lock file. the path to the lock file will need to be passed into the function as a parameter.
here's an example:
https://github.com/mozilla-releng/OpenCloudConfig/blob/5963cda/userdata/rundsc.ps1#L420
Attachment #8985274 - Flags: feedback?(rthijssen) → feedback+
Assignee
Comment 15•7 years ago
Bumping GW version to 10.8.5.
Assignee
Comment 16•7 years ago
Moved back to 10.8.4
Assignee
Comment 17•7 years ago
Verbally R+ by grenade.
Holding off on merging until current issues are resolved.
Attachment #8985473 - Flags: review+
Comment 18•7 years ago
(In reply to Mark Cornmesser [:markco] from comment #16)
> Moved back to 10.8.4
10.8.5 should be fine - the production Windows issues from today are unrelated.
Flags: needinfo?(mcornmesser)
Assignee
Comment 19•7 years ago
Rgr. I have bumped the test pool to 10.8.5. If all looks good tomorrow I will land the patch. The nodes will need to be reimaged for this to take effect.
Flags: needinfo?(mcornmesser)
Assignee
Comment 20•7 years ago
Patch landed.
Assignee
Comment 21•7 years ago
CiDuty will start on a rolling install.
Comment 22•7 years ago
:apop will be handling the re-images.
I have CCed the ciduty@m.c Bugzilla account (under the same email; you can even NeedInfo it!) so we can all have better visibility into the process.
Adrian will come back with updates, as he has them.
Comment 23•7 years ago
reimaged the following moonshots:
T-W1064-MS-016, T-W1064-MS-017, T-W1064-MS-018, T-W1064-MS-019, T-W1064-MS-020, T-W1064-MS-021, T-W1064-MS-022, T-W1064-MS-023, T-W1064-MS-024, T-W1064-MS-025, T-W1064-MS-026, T-W1064-MS-027, T-W1064-MS-028, T-W1064-MS-029, T-W1064-MS-030, T-W1064-MS-031, T-W1064-MS-032, T-W1064-MS-035, T-W1064-MS-036, T-W1064-MS-037, T-W1064-MS-038, T-W1064-MS-039, T-W1064-MS-040, T-W1064-MS-041, T-W1064-MS-042, T-W1064-MS-043, T-W1064-MS-044, T-W1064-MS-045.
Comment 24•7 years ago
Today we re-imaged the following moonshots:
T-W1064-MS-{061..090}
T-W1064-MS-{106..135}
T-W1064-MS-{151..171}
Comment 25•7 years ago
\o/
Thanks guys!
Comment 26•7 years ago
(In reply to Adrian Pop from comment #23)
> reimaged the following moonshots:
>
> T-W1064-MS-016, T-W1064-MS-017, T-W1064-MS-018, T-W1064-MS-019,
> T-W1064-MS-020, T-W1064-MS-021, T-W1064-MS-022, T-W1064-MS-023,
> T-W1064-MS-024, T-W1064-MS-025, T-W1064-MS-026, T-W1064-MS-027,
> T-W1064-MS-028, T-W1064-MS-029, T-W1064-MS-030, T-W1064-MS-031,
> T-W1064-MS-032, T-W1064-MS-035, T-W1064-MS-036, T-W1064-MS-037,
> T-W1064-MS-038, T-W1064-MS-039, T-W1064-MS-040, T-W1064-MS-041,
> T-W1064-MS-042, T-W1064-MS-043, T-W1064-MS-044, T-W1064-MS-045.
Looking at one of these at random, it doesn't appear to be taking jobs:
https://tools.taskcluster.net/provisioners/releng-hardware/worker-types/gecko-t-win10-64-hw/workers/mdc1/T-W1064-MS-045
Comment 27•7 years ago
Seems to be some infinite looping in:
https://github.com/mozilla-releng/OpenCloudConfig/blob/master/userdata/Configuration/GenericWorker/run-hw-generic-worker-8-and-reboot.bat#L3-L7
Let's please call off the rollout until this is solved.
Flags: needinfo?(riman)
Flags: needinfo?(apop)
Comment 28•7 years ago
Assignee
Comment 29•7 years ago
On some of the nodes we hit an issue where generic-worker exited and the wrapper script rebooted the node before OCC cleared the in-progress.lock file. When the node came back up, the rundsc PowerShell script exited without running.
The odd bit is that this is not happening across the board.
I have added this to the wrapper script: https://github.com/mozilla-releng/OpenCloudConfig/commit/19fac03565089b5c8ef626ebde4be2dc8e684d58
Whenever GenericWorker exits, the lock file will be deleted if it exists. Additionally, if the manifest hasn't applied within 15 minutes, the lock file will be deleted and the machine will reboot.
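The lock-file cleanup described in this comment can be sketched like this. This is a bash stand-in for the Windows batch wrapper; the lock-file path and the worker invocation are assumptions for illustration.

```shell
#!/bin/bash
# Sketch of the wrapper-script fix: whenever the worker exits, remove
# the OCC in-progress lock if a reboot raced ahead of OCC clearing it,
# so rundsc is not skipped on the next boot. Paths are hypothetical.
LOCK_FILE="${LOCK_FILE:-/c/dsc/in-progress.lock}"

cleanup_lock() {
  # delete the stale lock left behind by an early reboot
  if [ -f "$LOCK_FILE" ]; then
    rm -f "$LOCK_FILE"
    echo "removed stale lock: $LOCK_FILE"
  fi
}

run_worker() {
  # assumed invocation; the real wrapper is run-hw-generic-worker-*.bat
  generic-worker run --config /c/generic-worker/config.json
  cleanup_lock   # always clear the lock after the worker exits
}
```

Without the `cleanup_lock` step, a reboot issued while the lock still exists leaves the node in the broken state described above, where rundsc.ps1 exits without running.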
Assignee
Comment 30•7 years ago
Asked CiDuty in #ciduty not to start additional installs until the above patch is tested.
Assignee
Comment 31•7 years ago
This seems to have affected about 20% of the newly installed machines.
The ones that were functioning continue to function after picking up the new wrapper script.
I am running a fresh install on ms-021. If it installs through and picks up multiple tasks, I will reinstall the other affected nodes on chassis 1. If there is no issue on those, I will ask CiDuty to resume installing and go through and identify the other affected nodes.
Assignee
Comment 32•7 years ago
I have dropped the reboot after the manifest hasn't completed within 15 minutes; it was causing an additional loop on its own. I am planning on adding this back later with Bug 1470016.
ms-021 successfully installed and picked up and passed multiple tests. With the exception of ms-038 and ms-035, which have other issues, nodes ms-016 through ms-045 are up and running. I am going to let these sit for a while and run. If there are no issues I will ask CiDuty to resume the rollout.
I suspect the issue began during the initial run(s) of OCC on the nodes, possibly right after the creation of the first task user, because the last exit code of GenericWorker in the log was a 67. I suspect rundsc.ps1 was waiting on something when the wrapper script issued a reboot.
Assignee
Comment 33•7 years ago
So far things are looking OK. I am going to take a look at the nodes mentioned above in the morning PDT.
Comment 34•7 years ago
(In reply to Mark Cornmesser [:markco] from comment #19)
> Rgr. I have bumped the test pool to 10.8.5. If all looks good tomorrow I
> will land the patch. The nodes will need to be reimaged for this to take
> affect.
Looking at e.g. https://taskcluster-artifacts.net/TeZ1Sq4eSjC5k-NgAnH5Ww/0/public/logs/live_backing.log it looks like the new machines are running 10.8.4 not 10.8.5.
Looking at userdata/Manifest/gecko-t-win10-64-hw-GW10.json I see that version 10.8.5 is specified, but I can see the sha256 is for version 10.8.4.
It looks like the following (unreviewed) commit introduced the issue since it updated the version number but did not update the tooltool hash:
https://github.com/mozilla-releng/OpenCloudConfig/commit/1e953d46810a2a86c648b30f1f40f526795c1c90
I'm happy to review any patches to OCC.
Comment 35•7 years ago
s/sha256/sha512/
Comment 36•7 years ago
Found the proper version of the generic worker in tooltool (10.8.5), and updated the userdata/Manifest/gecko-t-win10-64-hw-GW10.json file.
Created pull request
https://github.com/bccrisan/OpenCloudConfig/commit/c02c09c5b07a2a9f40c7378dea3eff4ff597708b
Please review and merge it if it's ok.
Comment 37•7 years ago
(In reply to Bogdan Crisan [:bcrisan] (UTC +3, EEST) from comment #36)
> Found the proper version of the generic worker in tooltool (10.8.5), and
> updated the userdata/Manifest/gecko-t-win10-64-hw-GW10.json file.
>
> Created pull request
> https://github.com/bccrisan/OpenCloudConfig/commit/c02c09c5b07a2a9f40c7378dea3eff4ff597708b
>
>
> Please review and merge it if it's ok.
That looks like a commit to master rather than a PR. :-)
But I would have r+'d it as the diff is correct. Thanks.
Comment 38•7 years ago
There is also a PR: https://github.com/mozilla-releng/OpenCloudConfig/pull/156 ;)
Comment 39•7 years ago
(In reply to Attila Craciun [:arny] from comment #38)
> There is also a PR:
> https://github.com/mozilla-releng/OpenCloudConfig/pull/156 ;)
Ah indeed! r+'d in github. Many thanks.
Comment 40•7 years ago
This came in from puppet (~30 minutes ago):
> Thu Jun 21 08:49:06 -0700 2018 /Stage[main]/Generic_worker/Exec[create gpg key]/returns (err): change from notrun to 0 failed: Could not find command 'generic-worker'
as a report for t-yosemite-r7-256.
Did you guys start working on the OS X generic-worker?
Assignee
Comment 41•7 years ago
(In reply to Bogdan Crisan [:bcrisan] (UTC +3, EEST) from comment #40)
> This came in the puppet (~30 minutes)
>
> > Thu Jun 21 08:49:06 -0700 2018 /Stage[main]/Generic_worker/Exec[create gpg key]/returns (err): change from notrun to 0 failed: Could not find command 'generic-worker'
>
> as a report for t-yosemite-r7-256.
>
> Did you guys started working on OSX generic worker?
dhouse:^
Flags: needinfo?(dhouse)
Comment 42•7 years ago
The generic-worker on OS X was updated yesterday. Looking into Taskcluster, this host does not appear in the pool: https://tools.taskcluster.net/provisioners/releng-hardware/worker-types/gecko-t-osx-1010
Assignee
Comment 43•7 years ago
I went back through and checked the nodes on chassis 1 (16 - 45), and the looping issue did not reoccur. However, there were 3 nodes that went unresponsive.
Chassis 2 had 3 nodes that were stuck in the loop, but I also found a high percentage of unresponsive nodes there. This is an issue we have been hitting (Bug 1452133). I am going to continue to go through the newly installed nodes and identify issues, reinstall them to get them back into the pool, and continue to dive into this.
Flags: needinfo?(dhouse)
Assignee
Comment 44•7 years ago
Among the remaining nodes that had the updated version of GW, I found 30 nodes in an unresponsive or possibly shut-down state, which is a much higher frequency than with the older version. It may be unrelated, but I am going to continue to monitor this, with a possibility of rolling all but the chassis 1 nodes back to GW 8.
The looping issue seems to have gone away on the newly installed nodes.
Assignee
Comment 45•7 years ago
Going through the unresponsive nodes, the nodes were locked up, not shut down. According to the last reported logs they locked up at different points: when rundsc.ps1 issued a reboot, when the wrapper script issued a reboot, and while waiting for a task. Other nodes had not installed correctly: either the node was continuing to try to PXE boot, or continuing to try to PXE boot over IPv6. Most likely it is not related, but the number of affected GW 10 workers was concerning. I am going to continue to monitor.
Assignee
Comment 46•7 years ago
For now I am going to hold off on completing the rollout until Monday next week.
Assignee
Comment 47•7 years ago
Some of the nodes that were unresponsive actually had failed to reboot. I have opened up Bug 1470338 to track those.
Assignee
Comment 48•7 years ago
In the last 20 hours, 15 out of 100 nodes went from good to unable to pick up tasks.
(16-45) Unresponsive: 20, 24, 32, 39. Failed to reboot: 23, 42.
(61-90) Unresponsive: 72. Failed to reboot: 75, 85.
(106-135) Unresponsive: 107. Failed to reboot: 115, 118, 126.
(151-171) Failed to reboot: 154, 155, 170.
The unresponsive nodes seem to be failing either differently, or at an accelerated rate, compared to the previous known issue. This is now being tracked in Bug 1470507.
Assignee
Comment 49•7 years ago
CiDuty: Could you all reinstall all nodes except ms-016 - ms-045 to the old task sequence? We will resume the upgrade after Bug 1470507 and Bug 1470338 are resolved.
Flags: needinfo?(riman)
Flags: needinfo?(ciduty)
Flags: needinfo?(apop)
Assignee
Comment 51•7 years ago
I am going to move to using this repo while troubleshooting: https://github.com/mozilla-platform-ops/OpenCloudConfig .
Comment 52•7 years ago
I've re-imaged T-W1064-MS-061..115 to the old task sequence
Comment 53•7 years ago
I've re-imaged the rest of them to old sequence: T-W1064-MS-116..171.
Except no. 130, which seems to have some issue: it is not getting past boot and shows a blue screen saying we encountered an error.
All the re-imaged workers (except 130) are alive and completing jobs.
Comment 54•7 years ago
(In reply to Radu Iman[:riman] from comment #50)
> Yes we can. I'm going to start right now.
(In reply to Zsolt Fay [:zsoltfay] from comment #53)
> I've re-imaged the rest of them to old sequence: T-W1064-MS-116..171.
> Except no. 130 it seems to have some issue not getting past boot and showing
> a blue screen saying we encountered an error.
>
> All the re-imaged workers (except 130) are well alive and completing jobs.
Thank you ciduty for your help with reimaging.
Assignee
Comment 55•7 years ago
There were multiple complications between OCC and GenericWorker on the last attempt at the upgrade. There are 2 files that the wrapper script looks for before starting up GW: one signals the end of the OCC manifest being applied (EndOfManifest.semaphore), and one signals the end of rundsc.ps1 (task-claim-state.valid). In both cases the wrapper script would fail to delete the file because of file locks. Because those were not being removed, the wrapper script would jump straight to starting GW. rundsc.ps1 would then exit once it detected the GW process. This led to a state where OCC was never fully applied.
I have added multiple catches to delete these files if they exist when GW exits or when there is an early exit of rundsc.ps1. OCC would then apply fully on each boot. However, the validation of the GW install would fail and OCC would then install GW again. GW would then try to start using the default wrapper script and fail:
"CommandsReturn": [
{
"Command": "C:\\generic-worker\\generic-worker.exe",
"Arguments": [
"--version"
],
"Match": "generic-worker 10.8.5"
While the command actually returns:
C:\Users\task_1531104133>c:\generic-worker\generic-worker.exe --version
2018/07/09 04:57:37 Making system call GetProfilesDirectoryW with args: [0 C0423C56D8]
2018/07/09 04:57:37 Result: 0 0 The data area passed to a system call is too small.
2018/07/09 04:57:37 Making system call GetProfilesDirectoryW with args: [C0423F5D00 C0423C56D8]
2018/07/09 04:57:37 Result: 1 0 The operation completed successfully.
generic-worker 10.8.5 [ revision: https://github.com/taskcluster/generic-worker/commits/034c836cfd18ebe8b7fb6dbfabccea4bcd0fa1f6 ]
For the time being I have removed this validation piece.
In addition, the user init script (https://github.com/mozilla-releng/OpenCloudConfig/blob/master/userdata/Configuration/GenericWorker/task-user-init-win10.cmd) has been added to the hw configuration.
I have added additional logging to the wrapper script, looking at the DSC dir after GW exits, as well as an attempt at dumping a full list of running tasks when a node fails to reboot; but I suspect the reboot issue was caused by unfinished portions of OCC.
Also, to get more info, the generic-worker-service.log is being sent to Papertrail.
My hope is that there will be less issues because OCC is fully completing, and if not the additional logging will give us an idea of the possible cause.
Assignee
Comment 56•7 years ago
Rob: I still need to clean it up a bit, but do you have any feedback? Any ideas on validation of the GW install?
Attachment #8990654 - Flags: feedback?(rthijssen)
Comment 57•7 years ago
i'm not sure how we can validate the gw version without running `generic-worker.exe --version`
have i understood correctly that calling gw with the version flag causes gw to do more than just return the version?
Assignee
Comment 58•7 years ago
(In reply to Rob Thijssen (:grenade UTC+2) from comment #57)
> i'm not sure how we can validate the gw version without running
> `generic-worker.exe --version`
> have i understood correctly that calling gw with the version flag causes gw
> to do more than just return the version?
Yes.
C:\Users\task_1531104133>c:\generic-worker\generic-worker.exe --version
2018/07/09 04:57:37 Making system call GetProfilesDirectoryW with args: [0 C0423C56D8]
2018/07/09 04:57:37 Result: 0 0 The data area passed to a system call is too small.
2018/07/09 04:57:37 Making system call GetProfilesDirectoryW with args: [C0423F5D00 C0423C56D8]
2018/07/09 04:57:37 Result: 1 0 The operation completed successfully.
generic-worker 10.8.5 [ revision: https://github.com/taskcluster/generic-worker/commits/034c836cfd18ebe8b7fb6dbfabccea4bcd0fa1f6 ]
Assignee
Comment 59•7 years ago
I have added this to the testing repo: https://github.com/mozilla-platform-ops/OpenCloudConfig/commit/91b6cba9a76d184a4b2a48c3a6d49fffb604b636
If one of the flag files does not exist and PowerShell is not running, clear the lock file and flag files if they exist, and reboot to try the process again.
This sprang from https://bugzilla.mozilla.org/show_bug.cgi?id=1474678 . There seems to be an intermittent issue with the Mellanox network drivers on startup that prevents OCC from reaching externally sourced files.
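The recovery check in that commit can be sketched roughly as follows. This is a bash illustration of the batch wrapper's behavior; the file paths and the `pgrep` process check are assumptions (on the real Windows nodes the script checks for a running PowerShell process and issues `shutdown /r`).

```shell
#!/bin/bash
# Sketch of the flag-file recovery check: if the end-of-manifest flag
# is missing and no OCC PowerShell run is in flight, clear the state
# files and reboot so the whole OCC run is retried from scratch.
needs_retry() {
  # hypothetical semaphore path passed in by the caller
  local semaphore="$1"
  [ ! -f "$semaphore" ] && ! pgrep -x powershell >/dev/null 2>&1
}

clear_state_and_reboot() {
  rm -f "$@"   # lock and flag files, if any exist
  echo "state cleared; rebooting to retry OCC run"   # 'shutdown /r' on Windows
}

if needs_retry "/c/dsc/EndOfManifest.semaphore"; then
  clear_state_and_reboot "/c/dsc/in-progress.lock" "/c/dsc/task-claim-state.valid"
fi
```

The retry-by-reboot approach matters here because the Mellanox driver issue is intermittent: a second boot usually gives OCC a working network and lets it fetch its externally sourced files.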
Comment 60•7 years ago
Comment on attachment 8990654 [details]
Improvement on hw gw 10 configurations
pmoore:
up until now we have used a validation in occ that looks for a version string in the output from `generic-worker.exe --version` in the form of a complete line that matches: "generic-worker 10.8.5"
eg: https://github.com/mozilla-releng/OpenCloudConfig/blob/bdbb7ea/userdata/Manifest/gecko-1-b-win2012.json#L1126
it seems that at some point the output from the version command changed to include a link to the git hash the version was built from. this change means that dsc assumes that the correct version of gw is not installed and goes on to reinstall gw on every occ run.
i will patch occ to allow for the extra information in the output from the version command.
in future, a heads up that a change like this is being introduced, would be appreciated and will help us to avoid bustage.
also, is it intentional that there is so much output from the version command? eg (https://tools.taskcluster.net/groups/SdP2Gz9zS8-6F5ROWG-RUQ/tasks/SdP2Gz9zS8-6F5ROWG-RUQ/runs/0/logs/public%2Flogs%2Flive.log):
Z:\task_1531296888>C:\generic-worker\generic-worker.exe --version
2018/07/11 08:23:31 Making system call GetProfilesDirectoryW with args: [0 C0423D76E8]
2018/07/11 08:23:31 Result: 0 0 The data area passed to a system call is too small.
2018/07/11 08:23:31 Making system call GetProfilesDirectoryW with args: [C042405C40 C0423D76E8]
2018/07/11 08:23:31 Result: 1 0 The operation completed successfully.
generic-worker 10.8.5 [ revision: https://github.com/taskcluster/generic-worker/commits/034c836cfd18ebe8b7fb6dbfabccea4bcd0fa1f6 ]
is it necessary to make those system calls in order to determine the version? also the output seems superfluous to what a user would expect when requesting the version.
Flags: needinfo?(pmoore)
Attachment #8990654 - Flags: feedback?(rthijssen) → feedback+
Comment 61•7 years ago
OCC is now patched to handle the extra line content.
https://github.com/mozilla-releng/OpenCloudConfig/commit/462d9fd
the implementation now includes a *like* comparison so the validation for the generic worker version now looks like this:
"Like": "generic-worker 10.8.5 *"
instead of:
"Match": "generic-worker 10.8.5"
Comment 62•7 years ago
pmoore: in light of the switch from "Match" to "Like", it's possible that the regex at https://github.com/petemoore/myscrapbook/blob/master/upgrade-gw-betas-cu.sh#L55 may need adaptation. maybe not, but worth checking...
Comment 63•7 years ago
(In reply to Rob Thijssen (:grenade UTC+2) from comment #60)
> Comment on attachment 8990654 [details]
> Improvement on hw gw 10 configurations
>
> pmoore:
>
> up until now we have used a validation in occ that looks for a version
> string in the output from `generic-worker.exe --version` in the form of a
> complete line that matches: "generic-worker 10.8.5"
> eg:
> https://github.com/mozilla-releng/OpenCloudConfig/blob/bdbb7ea/userdata/
> Manifest/gecko-1-b-win2012.json#L1126
>
> it seems that at some point the output from the version command changed to
> include a link to the git hash the version was built from. this change means
> that dsc assumes that the correct version of gw is not installed and goes on
> to reinstall gw on every occ run.
>
> i will patch occ to allow for the extra information in the output from the
> version command.
>
> in future, a heads up that a change like this is being introduced, would be
> appreciated and will help us to avoid bustage.
Apologies for this, I had forgotten that OCC uses this, indeed I should have given you a heads up.
Another option could be to check e.g. the SHA256 of the binary, rather than the output of `generic-worker --version`. This has the advantage that it helps defend against an attack whereby an attacker replaces generic-worker.exe with a different binary that fakes the "version" output but does something malicious when called with the "run" target.
However, I will make sure I notify in future if there are any changes, that was a careless oversight of mine.
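The hash-based validation suggested above could look roughly like this (a sketch only; the pinned digest below is a placeholder, not the real generic-worker 10.8.5 hash, and OCC would implement this in PowerShell rather than bash):

```shell
#!/bin/bash
# Sketch of validating the worker binary by SHA256 digest instead of
# trusting its --version output, which a malicious replacement binary
# could fake while misbehaving when run for real.
validate_sha256() {
  local binary="$1" expected="$2"
  local actual
  actual=$(sha256sum "$binary" 2>/dev/null | awk '{print $1}')
  [ "$actual" = "$expected" ]
}

# Placeholder digest for illustration; pin the real release digest here.
EXPECTED_SHA256="0000000000000000000000000000000000000000000000000000000000000000"
validate_sha256 "/c/generic-worker/generic-worker.exe" "$EXPECTED_SHA256" \
  || echo "generic-worker binary failed hash validation; reinstall needed"
```

A side benefit is that a hash comparison is immune to cosmetic changes in the version output, which is exactly what broke the "Match" validation here.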
>
> also, is it intentional that there is so much output from the version
> command? eg
> (https://tools.taskcluster.net/groups/SdP2Gz9zS8-6F5ROWG-RUQ/tasks/
> SdP2Gz9zS8-6F5ROWG-RUQ/runs/0/logs/public%2Flogs%2Flive.log):
>
> Z:\task_1531296888>C:\generic-worker\generic-worker.exe --version
> 2018/07/11 08:23:31 Making system call GetProfilesDirectoryW with args: [0
> C0423D76E8]
> 2018/07/11 08:23:31 Result: 0 0 The data area passed to a system call is
> too small.
> 2018/07/11 08:23:31 Making system call GetProfilesDirectoryW with args:
> [C042405C40 C0423D76E8]
> 2018/07/11 08:23:31 Result: 1 0 The operation completed successfully.
> generic-worker 10.8.5 [ revision:
> https://github.com/taskcluster/generic-worker/commits/
> 034c836cfd18ebe8b7fb6dbfabccea4bcd0fa1f6 ]
>
> is it necessary to make those system calls in order to determine the
> version? also the output seems superfluous to what a user would expect when
> requesting the version.
Unfortunately this is a limitation of the docopt-go library we are using. The command line parser requires that the help text is passed into the method call that parses the command arguments. We need to parse the arguments to see that --help is called, and the help text for the command includes docs for the parameter "tasksDir". The default value of this property is system dependent, and depends on the Profiles Directory on the system. In other words, the --help output needs to know where the system Profiles Directory is in order to provide the --help text, and the parser needs the full help text before it parses the command line arguments.
The reason we log all system calls is because it is very useful for troubleshooting when there are failures, as the go/c boundary is a potential source of failure. I believe the output is sent to standard error rather than standard out, so if standard error is disabled, it won't be shown, but I see that it is less than ideal. However, those system calls are needed so it is also kind of useful to see that they are made.
Flags: needinfo?(pmoore)
Comment 64•7 years ago
(In reply to Rob Thijssen (:grenade UTC+2) from comment #62)
> pmoore: in light of the switch from "Match" to "Like", it's possible that
> the regex at
> https://github.com/petemoore/myscrapbook/blob/master/upgrade-gw-betas-cu.
> sh#L55 may need adaptation. maybe not, but worth checking...
I think it is ok. Thanks for the heads up!
Comment 65•7 years ago
(In reply to Rob Thijssen (:grenade UTC+2) from comment #60)
> it seems that at some point the output from the version command changed to
> include a link to the git hash the version was built from. this change means
> that dsc assumes that the correct version of gw is not installed and goes on
> to reinstall gw on every occ run.
Hey Rob,
I remember we had a couple of changes in the pipeline to help with these types of issues, but I can't find the bug numbers right now. Two things that would help in this situation are:
1) we only apply OCC manifests during AMI creation (so we wouldn't repeatedly reapply every worker run)
2) if a validation step fails (such as validating the g-w version number) the AMI creation process should fail
With both of these safeguards, I think we would stop this issue earlier, and it couldn't get rolled out to production. Can you confirm if it is still intended to implement these changes?
Thanks in advance.
Flags: needinfo?(rthijssen)
Comment 66•7 years ago
It looks like generic-worker 8.3.0 is running in production on Windows 10 hardware. See this task from today:
https://tools.taskcluster.net/groups/Vbh4i4X1Qm-jcEtWb2TnDQ/tasks/AU4drFJNQM6SN7_lQaOx-w/runs/0/logs/public%2Flogs%2Flive.log#L11
This is a release from April 2017.
Also the PR from comment 38 still needs landing, although I'm not sure it will have much effect since I suspect the wrong manifest is getting used.
It looks like there are two:
* https://github.com/mozilla-releng/OpenCloudConfig/blob/master/userdata/Manifest/gecko-t-win10-64-hw.json
* https://github.com/mozilla-releng/OpenCloudConfig/blob/master/userdata/Manifest/gecko-t-win10-64-hw-GW10.json
I suspect the second one is the one that should be used in production, but at least for the log link above, it looks like maybe the first one is getting used.
Flags: needinfo?(mcornmesser)
Comment 67•7 years ago
(In reply to Pete Moore [:pmoore][:pete] from comment #66)
> It looks like generic-worker 8.3.0 is running in production on Windows 10
> hardware.
Yes, that's the entire point of this bug, to get us onto gw10. We attempted it once, as part of attempting to work around some of the issues we were having on the Moonshot hardware, but found issues with running gw10 on hardware.
Assignee
Comment 68•7 years ago
Currently generic-worker 10.8.5 is only on ms-016 through ms-045. Up until last week we were hitting the issues mentioned in comment 55, which was preventing a full deployment. Two issues are still blocking a full deployment: bug 1470338, in which gw 10 nodes occasionally fail to reboot, and bug 1474678, which seems to be a network issue that has only affected the gw 10 nodes. The latter may also just be related to the chassis the nodes are in. I am also a bit concerned about bug 1474729, with some nodes seemingly running 2 tasks at once.
I am hoping to be able to do a wide deployment the week of the 23rd.
Flags: needinfo?(mcornmesser)
Assignee | ||
Updated•7 years ago
Comment 69•7 years ago
(In reply to Pete Moore [:pmoore][:pete] from comment #65)
> Can you confirm if it is still intended to implement these changes?
yes
Flags: needinfo?(rthijssen)
Comment 70•7 years ago
(In reply to Mark Cornmesser [:markco] from comment #68)
> There are two issues that are still preventing a full
> deployment bug 1470338 which is gw 10 nodes occasional fail to reboot and
> bug 1474678 which seems to be a network issue that has only affected the gw
> 10 nodes.
Bug 1470338: Indeed this looks like an OCC issue, probably :grenade is the best person to help with that.
Bug 1474678: This looks to be due to bug 1372172 - something in the win10 image is causing the machine to freeze/reboot. I've added my analysis to the bug.
Neither of these issues are related to the generic-worker version.
Assignee
Comment 71•7 years ago
Some of these issues may be related to https://bugzilla.mozilla.org/show_bug.cgi?id=1475258, which is beginning to look like a possible hardware failure.
Assignee
Comment 72•6 years ago
Current status.
I have seen no issues with the functionality of generic-worker 10 over the last month, except that some cleanup is not happening: task user home directories are not being removed, and there is one OneDrive scheduled task created for each user that never gets deleted.
We are trying to get the overall environment to stabilize before deploying the current version of generic-worker. In the test pools, chassis 1 in both mdc1 and mdc2, the rundsc errors have been resolved and the Mellanox network drivers have been updated. However, we are seeing an issue where a node occasionally cannot reach external resources. When this happens OCC fails trying to reach those resources, and generic-worker is unable to talk to Taskcluster. When this happens rundsc.ps1 is never downloaded, which leads to OCC not running again after reboots and generic-worker never receiving the flags it needs to start. We are hoping that a firmware upgrade will prevent this issue.
Before upgrading across the board we are blocked on two things. One is the complete cleanup of the task directories and old scheduled tasks. The other, which is kind of a soft blocker, is the issue of being unable to reach external resources on boot. We may want to upgrade even if the latter is still happening; I am going to evaluate that this week.
Updated•6 years ago
Blocks: T-W1064-MS-066
Updated•6 years ago
Blocks: T-W1064-MS-087
Updated•6 years ago
No longer blocks: T-W1064-MS-087
Updated•6 years ago
No longer blocks: T-W1064-MS-066
Assignee
Comment 73•6 years ago
To address the issue where external resources are not available, I have added two catches. During deployment, before the first OCC run, there is now a scheduled task created to check for the existence of the rundsc.ps1 script:
if (!(Test-Path C:\dsc\rundsc.ps1)) {
  (New-Object Net.WebClient).DownloadFile(("https://raw.githubusercontent.com/markcor/OpenCloudConfig/master/userdata/rundsc.ps1?{0}" -f [Guid]::NewGuid()), 'C:\dsc\rundsc.ps1')
  while (!(Test-Path "C:\dsc\rundsc.ps1")) { Start-Sleep 10 }
  Remove-Item -Path c:\dsc\in-progress.lock -force -ErrorAction SilentlyContinue
  shutdown @('-r', '-t', '0', '-c', 'Rundsc.ps1 did not exist; restarting', '-f')
}
If it does not exist, the node will download it and reboot. I have also added a step to rundsc.ps1 so that if the node can't reach GitHub, the lock file is deleted and the node reboots:
if ($locationType -eq 'DataCenter') {
  if (!(Test-Connection github.com -quiet)) {
    Remove-Item -Path $lock -force -ErrorAction SilentlyContinue
    shutdown @('-r', '-t', '0', '-c', 'reboot; external resources are not available', '-f', '-d', '4:5') | Out-File -filePath $logFile -append
  }
}
This seems to have kept the nodes from sitting up without running generic-worker.
To address the issue of the OneDrive scheduled task created for every user, I have added a script to the deployment that removes OneDrive and all its components:
Import-Module -DisableNameChecking $PSScriptRoot\..\lib\force-mkdir.psm1
Import-Module -DisableNameChecking $PSScriptRoot\..\lib\take-own.psm1
echo "73 OneDrive process and explorer"
taskkill.exe /F /IM "OneDrive.exe"
taskkill.exe /F /IM "explorer.exe"
echo "Remove OneDrive"
if (Test-Path "$env:systemroot\System32\OneDriveSetup.exe") {
  & "$env:systemroot\System32\OneDriveSetup.exe" /uninstall
}
if (Test-Path "$env:systemroot\SysWOW64\OneDriveSetup.exe") {
  & "$env:systemroot\SysWOW64\OneDriveSetup.exe" /uninstall
}
echo "Disable OneDrive via Group Policies"
force-mkdir "HKLM:\SOFTWARE\Wow6432Node\Policies\Microsoft\Windows\OneDrive"
sp "HKLM:\SOFTWARE\Wow6432Node\Policies\Microsoft\Windows\OneDrive" "DisableFileSyncNGSC" 1
echo "Removing OneDrive leftovers trash"
rm -Recurse -Force -ErrorAction SilentlyContinue "$env:localappdata\Microsoft\OneDrive"
rm -Recurse -Force -ErrorAction SilentlyContinue "$env:programdata\Microsoft OneDrive"
rm -Recurse -Force -ErrorAction SilentlyContinue "C:\OneDriveTemp"
echo "Remove Onedrive from explorer sidebar"
New-PSDrive -PSProvider "Registry" -Root "HKEY_CLASSES_ROOT" -Name "HKCR"
mkdir -Force "HKCR:\CLSID\{018D5C66-4533-4307-9B53-224DE2ED1FE6}"
sp "HKCR:\CLSID\{018D5C66-4533-4307-9B53-224DE2ED1FE6}" "System.IsPinnedToNameSpaceTree" 0
mkdir -Force "HKCR:\Wow6432Node\CLSID\{018D5C66-4533-4307-9B53-224DE2ED1FE6}"
sp "HKCR:\Wow6432Node\CLSID\{018D5C66-4533-4307-9B53-224DE2ED1FE6}" "System.IsPinnedToNameSpaceTree" 0
Remove-PSDrive "HKCR"
echo "Removing run option for new users"
reg load "hku\Default" "C:\Users\Default\NTUSER.DAT"
reg delete "HKEY_USERS\Default\SOFTWARE\Microsoft\Windows\CurrentVersion\Run" /v "OneDriveSetup" /f
reg unload "hku\Default"
echo "Removing startmenu junk entry"
rm -Force -ErrorAction SilentlyContinue "$env:userprofile\AppData\Roaming\Microsoft\Windows\Start Menu\Programs\OneDrive.lnk"
echo "Restarting explorer..."
start "explorer.exe"
echo "Wait for EX reload.."
sleep 15
echo "Removing additional OneDrive leftovers"
foreach ($item in (ls "$env:WinDir\WinSxS\*onedrive*")) {
  Takeown-Folder $item.FullName
  rm -Recurse -Force $item.FullName
}
To address the task user home directories that have been left behind, I am trying to add an OCC function to remove those directories once they are a day old. Once that is addressed I will submit a PR for review for generic-worker 10.8.5. Shortly after, I will look at getting the Win 10 hardware nodes to the most current version.
Assignee
Comment 74•6 years ago
I have added:
if (Test-Path C:\dsc\GW10.semaphore) {
  $currenttaskuser = Get-LocalUser -Name task*
  $currenttaskname = $currenttaskuser.Name
  Get-ChildItem "c:\Users" -Exclude "$currenttaskname", "Default", "Administrator", "Public" | Remove-Item -Force -Recurse
}
This addresses the task user home directories: it finds the current task user and then deletes the other task user directories.
Assignee
Comment 75•6 years ago
See comment 73 and comment 74.
I will update the source repo before merging.
Attachment #9002623 - Flags: review?(rthijssen)
Attachment #9002623 - Flags: feedback?(pmoore)
Comment 76•6 years ago
Comment on attachment 9002623 [details] [review]
Add support for generic-worker 10 on hardware
lgtm. some comments in gh review regarding references to forked repo.
Attachment #9002623 - Flags: review?(rthijssen) → review+
Comment 77•6 years ago
(In reply to Mark Cornmesser [:markco] from comment #72)
> Current status.
>
> I have seen no issues if with the functionality of generic-worker 10 over
> the last month. Except for some clean up is not happening. Task user home
> directories are not being removed, and there is one drive schedule task for
> each user that has been created and never gets deleted.
Please can you provide example logs showing user directory deletion failing / not happening, config settings used, links to failed tasks and/or links to papertrail worker logs etc to support this claim?
What are the scheduled tasks? generic-worker 10 doesn't create any scheduled tasks. Are you sure you are running the correct version?
>
> We are trying to get the overall environment to stabilize before deploying
> the current version of generic-worker. In the test pools, chassis 1 in both
> mdc 1 and 2, rundsc errors have been resolved and the Mellanox network
> drivers have been updated. However, we are seeing an issues where node
> occasionally can not get to external resources. When this happens OCC fails
> on trying to get to those resources. Also generic-worker is unable to talk
> to taskcluster. When this happens rundsc.ps1 is never downloaded. This leads
> to oCC not running again after reboots and generic-worker never receiving
> the flags it needs to start. We are hoping that a firmware upgrade will
> prevent this issue from happening.
>
> Before upgrading across the board we are blocked on two things. The complete
> cleanup of the task dirs and old schedule tasks. The other, which is kind of
> a soft blocker, is the issue of unable to get to external resources on boot.
> We may want upgrade even if the later is happening. I am going to evaluate
> that this week.
As above, generic-worker 10 doesn't create any scheduled tasks - please provide details of what the scheduled tasks are.
Also please provide links to worker logs or task logs or screenshots, or any evidence to support your claims. Thanks! This will help people to support you.
Flags: needinfo?(mcornmesser)
Comment 78•6 years ago
(In reply to Mark Cornmesser [:markco] from comment #74)
> I have added :
>
> if (Test-Path C:\dsc\GW10.semaphore) {
> $currenttaskuser = get-localuser -name task*
> $currenttaskname = $currenttaskuser.name
> Get-ChildItem "c:\Users" -exclude "$currenttaskname", "Default",
> "Administrator","Public" | Remove-Item -Force -Recurse
> }
>
> To address the task user home directories. This will find the current user
> and then delete the other task user directories.
This shouldn't be necessary. If there is an underlying problem, it should be addressed where the problem is rather than building a workaround on top. Adding more workarounds will lead to an increasingly chaotic and unmaintainable system.
It is not useful or sufficient to claim that something isn't working. Claims in bugs /always/ need to be backed up with evidence (worker logs / task logs / config settings / screenshots / console dumps / whatever...); otherwise they are just hearsay, serve little purpose, and nobody can support you.
Many thanks.
Assignee
Comment 79•6 years ago
From ms-016, which was reimaged about 3 hours ago:
C:\windows\system32>dir c:\Users
Volume in drive C is Windows
Volume Serial Number is C898-9E3D
Directory of c:\Users
08/21/2018 02:35 PM <DIR> .
08/21/2018 02:35 PM <DIR> ..
08/20/2018 10:32 PM <DIR> Administrator
08/20/2018 09:06 PM <DIR> Public
08/21/2018 11:49 AM <DIR> task_1534850768
08/21/2018 01:06 PM <DIR> task_1534851983
08/21/2018 02:39 PM <DIR> task_1534856554
08/21/2018 02:39 PM <DIR> task_1534862131
0 File(s) 0 bytes
8 Dir(s) 41,465,036,800 bytes free
https://papertrailapp.com/groups/1141234/events?focus=968569269854035972&selected=968569269854035972
Aug 21 07:35:31 T-W1064-MS-016.mdc1.mozilla.com generic-worker: Generic worker ran successfully (exit code 67) rebooting #015
Aug 21 07:35:31 T-W1064-MS-016.mdc1.mozilla.com generic-worker: 2018/08/21 14:35:30 Trying to remove directory 'C:\Users\task_1534850768' via os.RemoveAll(path) call as GenericWorker user...#015
Aug 21 07:35:31 T-W1064-MS-016.mdc1.mozilla.com generic-worker: 2018/08/21 14:35:30 WARNING: could not delete directory 'C:\Users\task_1534850768' with os.RemoveAll(path) method#015
Aug 21 07:35:31 T-W1064-MS-016.mdc1.mozilla.com generic-worker: 2018/08/21 14:35:30 remove C:\Users\task_1534850768\AppData\Local\Microsoft\Windows\Application Shortcuts: Access is denied.#015
Aug 21 07:35:31 T-W1064-MS-016.mdc1.mozilla.com generic-worker: 2018/08/21 14:35:30 Trying to remove directory 'C:\Users\task_1534850768' via del command as GenericWorker user...#015
Aug 21 07:35:31 T-W1064-MS-016.mdc1.mozilla.com generic-worker: 2018/08/21 14:35:30 Running command: 'cmd' '/c' 'del' '/s' '/q' '/f' 'C:\Users\task_1534850768'#015
Aug 21 07:35:31 T-W1064-MS-016.mdc1.mozilla.com generic-worker: 2018/08/21 14:35:30 Trying to remove directory 'C:\Users\task_1534851983' via os.RemoveAll(path) call as GenericWorker user...#015
Aug 21 07:35:31 T-W1064-MS-016.mdc1.mozilla.com generic-worker: 2018/08/21 14:35:30 WARNING: could not delete directory 'C:\Users\task_1534851983' with os.RemoveAll(path) method#015
Aug 21 07:35:31 T-W1064-MS-016.mdc1.mozilla.com generic-worker: 2018/08/21 14:35:30 remove C:\Users\task_1534851983\AppData\Local\Microsoft\Windows\Application Shortcuts: Access is denied.#015
Aug 21 07:35:31 T-W1064-MS-016.mdc1.mozilla.com generic-worker: 2018/08/21 14:35:30 Trying to remove directory 'C:\Users\task_1534851983' via del command as GenericWorker user...#015
Aug 21 07:35:31 T-W1064-MS-016.mdc1.mozilla.com generic-worker: 2018/08/21 14:35:30 Running command: 'cmd' '/c' 'del' '/s' '/q' '/f' 'C:\Users\task_1534851983'#015
Aug 21 07:35:31 T-W1064-MS-016.mdc1.mozilla.com generic-worker: 2018/08/21 14:35:31 Trying to remove directory 'C:\Users\task_1534850768' via os.RemoveAll(path) call as GenericWorker user...#015
Aug 21 07:35:31 T-W1064-MS-016.mdc1.mozilla.com generic-worker: 2018/08/21 14:35:31 WARNING: could not delete directory 'C:\Users\task_1534850768' with os.RemoveAll(path) method#015
Aug 21 07:35:31 T-W1064-MS-016.mdc1.mozilla.com generic-worker: 2018/08/21 14:35:31 remove C:\Users\task_1534850768\AppData\Local\Microsoft\Windows\Application Shortcuts: Access is denied.#015
Aug 21 07:35:31 T-W1064-MS-016.mdc1.mozilla.com generic-worker: 2018/08/21 14:35:31 Trying to remove directory 'C:\Users\task_1534850768' via del command as GenericWorker user...#015
Aug 21 07:35:31 T-W1064-MS-016.mdc1.mozilla.com generic-worker: 2018/08/21 14:35:31 Running command: 'cmd' '/c' 'del' '/s' '/q' '/f' 'C:\Users\task_1534850768'#015
Aug 21 07:35:31 T-W1064-MS-016.mdc1.mozilla.com generic-worker: 2018/08/21 14:35:31 Trying to remove directory 'C:\Users\task_1534851983' via os.RemoveAll(path) call as GenericWorker user...#015
Aug 21 07:35:31 T-W1064-MS-016.mdc1.mozilla.com generic-worker: 2018/08/21 14:35:31 WARNING: could not delete directory 'C:\Users\task_1534851983' with os.RemoveAll(path) method#015
Aug 21 07:35:31 T-W1064-MS-016.mdc1.mozilla.com generic-worker: 2018/08/21 14:35:31 remove C:\Users\task_1534851983\AppData\Local\Microsoft\Windows\Application Shortcuts: Access is denied.#015
Aug 21 07:35:31 T-W1064-MS-016.mdc1.mozilla.com generic-worker: 2018/08/21 14:35:31 Trying to remove directory 'C:\Users\task_1534851983' via del command as GenericWorker user...#015
Aug 21 07:35:31 T-W1064-MS-016.mdc1.mozilla.com generic-worker: 2018/08/21 14:35:31 Running command: 'cmd' '/c' 'del' '/s' '/q' '/f' 'C:\Users\task_1534851983'#015
Aug 21 07:35:31 T-W1064-MS-016.mdc1.mozilla.com generic-worker: 2018/08/21 14:35:31 Looking for existing task users to delete...#015
Aug 21 07:35:31 T-W1064-MS-016.mdc1.mozilla.com generic-worker: 2018/08/21 14:35:31 Resolved 29 tasks in total so far.#015
Aug 21 07:35:31 T-W1064-MS-016.mdc1.mozilla.com generic-worker: 2018/08/21 14:35:31 Creating Windows user task_1534862131...#015
Aug 21 07:35:31 T-W1064-MS-016.mdc1.mozilla.com generic-worker: 2018/08/21 14:35:31 Running command: 'net' 'user' 'task_1534862131' 'pWd0_zqP1tkx5DZ1BYhJfgdTpbljc' '/add' '/expires:never' '/passwordchg:no' '/y'#015
Aug 21 07:35:31 T-W1064-MS-016.mdc1.mozilla.com generic-worker: 2018/08/21 14:35:31 Created new OS user!#015
As for the scheduled task: Windows creates a scheduled task for each user to auto-update OneDrive (https://onedrive.live.com/about/en-us/). I currently have it disabled on all the generic-worker 10 nodes. I will reinstall one this morning without disabling OneDrive.
Assignee
Comment 80•6 years ago
Actually I had reimaged a node just for this early this morning. From MS-318
C:\windows\system32>schtasks /query | grep one
OneDrive Standalone Update Task-S-1-5-21 8/22/2018 5:41:17 AM Ready
OneDrive Standalone Update Task-S-1-5-21 8/22/2018 10:01:26 AM Ready
OneDrive Standalone Update Task-S-1-5-21 8/22/2018 5:59:17 AM Ready
OneDrive Standalone Update Task-S-1-5-21 8/22/2018 11:05:10 PM Ready
OneDrive Standalone Update Task-S-1-5-21 8/22/2018 6:26:05 PM Ready
OneDrive Standalone Update Task-S-1-5-21 8/22/2018 4:34:24 AM Ready
OneDrive Standalone Update Task-S-1-5-21 8/22/2018 2:09:21 PM Ready
Flags: needinfo?(mcornmesser)
Assignee
Comment 81•6 years ago
pmoore: Do you think I should hold off from merging this pull request?
Flags: needinfo?(pmoore)
Comment 82•6 years ago
(In reply to Mark Cornmesser [:markco] from comment #81)
> pmoore: Do you think I should hold off from merging this pull request?
I think we need to test the worker in a staging pool that isn't taking production jobs. There are two problems with the current approach:
1) it impacts real user pushes
2) it isn't possible to direct tasks to specific test machines - any worker with the given provisionerId/workerType name can take a job, so when you submit a task you don't know if it will get run by a generic-worker 8.3.0 instance or a generic-worker 10.8.4 worker.
If we can set up a dedicated staging pool, it will be easier to troubleshoot what the issues are. For example, I'm curious if the reboots are working between tasks, so I'd like to submit a task to look at the uptime of a given worker, to confirm it rebooted between the current task and the previous task. This will be possible if we have a staging pool with a different workerType configuration setting (like :dragrom set up for macOS workers).
Another thing worth trying is to remove one of those directories as an Administrator, and see whether you are able to. The directory deletion is done by the Windows service, which runs as the LocalSystem account.
I can also add some additional logging to the worker to say why it can't delete the directories, if that would be helpful.
Let's set up a staging pool before merging this.
Flags: needinfo?(pmoore)
Comment 83•6 years ago
(In reply to Pete Moore [:pmoore][:pete] from comment #82)
> I can also add some additional logging to the worker to say why it can't
> delete the directories, if that would be helpful.
It looks like this is already there, e.g.:
remove C:\Users\task_1534851983\AppData\Local\Microsoft\Windows\Application Shortcuts: Access is denied.
What does `icacls "C:\Users\task_1534851983\AppData\Local\Microsoft\Windows\Application Shortcuts"` report?
I'm curious what created this directory, and who has permission to delete it / modify its access settings.
Comment 84•6 years ago
Note: if the os.RemoveAll(path) call to the Go standard library fails to delete the directory, the worker falls back to the built-in del command, running as the LocalSystem user, namely:
> cmd /c del /s /q /f "<path>"
Maybe we should run the following if the delete fails[1]:
> icacls "<path>" /t /grant:r LocalSystem:(OI)(CI)F
and then try again.
I suspect LocalSystem account and the task user account don't have permission to delete the "Application Shortcuts" folder that is getting created, so we might need to explicitly grant permission recursively to all files/folders to do this in case of failure. This could be an expensive operation so I propose to only do this in the failure case, and then to retry the delete again if the icacls command is successful.
I am still curious what the DACLs are on those "Application Shortcuts" folders though.
----
[1] https://docs.microsoft.com/en-us/windows-server/administration/windows-commands/icacls
Comment 85•6 years ago
If you can grant me access to one of the machines, I would be happy to investigate further. Alternatively if you're happy to set up a staging pool, I can also troubleshoot from e.g. an interactive loaner (I can't get an interactive loaner without a staging pool, as an 8.3.0 worker is likely to consume the task).
Thanks!
Flags: needinfo?(mcornmesser)
Assignee
Comment 86•6 years ago
Here is what the icacls command returned for a directory that could not be deleted:
C:\windows\system32>icacls "C:\Users\task_1534854599\AppData\Local\Microsoft\Windows\Application Shortcuts"
C:\Users\task_1534854599\AppData\Local\Microsoft\Windows\Application Shortcuts NT AUTHORITY\SYSTEM:(I)(OI)(CI)(F)
BUILTIN\Administrators:(I)(OI)(CI)(F)
S-1-5-21-1770456216-2325375451-2193181373-1002:(I)(OI)(CI)(F)
T-W1064-MS-318\task_1534875459:(I)(OI)(CI)(F)
Successfully processed 1 files; Failed processing 0 files
I have set up a staging pool of 5 nodes, ms-320 through ms-324. These nodes have a workerType of gecko-t-win10-64-hbeta, and RDP has been enabled on them. However, when a task is retriggered using this worker type (and gecko-t-win10-64-hs), the task is picked up and quickly returns as an exception. The reason is malformed-payload. The only value that changed in the generic-worker config file was the workerType.
Flags: needinfo?(mcornmesser)
Assignee
Comment 87•6 years ago
Comment 88•6 years ago
for debugging, here are some links to the delete failure logs.
- on hardware instances:
https://papertrailapp.com/groups/1958653/events?q=program%3Ageneric-worker%20%22WARNING%3A%20could%20not%20delete%20directory%22
- on ec2 instances:
https://papertrailapp.com/groups/2488493/events?q=program%3Ageneric-worker%20%22WARNING%3A%20could%20not%20delete%20directory%22
the failures occur on all windows worker types (hardware, ec2, testers, builders)
Comment 89•6 years ago
(In reply to Mark Cornmesser [:markco] from comment #87)
> For reference:
> https://tools.taskcluster.net/provisioners/releng-hardware/worker-types/
> gecko-t-win10-64-hbeta
Looking at one of the malformed-payload exceptions, the log says:
> [taskcluster:error] [mounts] task.dependencies needs to include LpA26q4QSiOPOyP0Bdh6Hw since one or more of its artifacts are mounted
See https://tools.taskcluster.net/groups/Cs0lJfSQTS-kB_VDLf-4KA/tasks/Cs0lJfSQTS-kB_VDLf-4KA/runs/0/logs/public%2Flogs%2Flive.log#L26
You just need to add that task as a dependency of the current task. See "dependencies" in https://docs.taskcluster.net/docs/reference/platform/taskcluster-queue/references/api#request-payload page.
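For illustration, a minimal fragment of a task definition that mounts an artifact from that task and declares the dependency might look like the following; the artifact name, directory, and format here are hypothetical, and all other task fields are elided:

```json
{
  "dependencies": ["LpA26q4QSiOPOyP0Bdh6Hw"],
  "payload": {
    "mounts": [
      {
        "content": {
          "taskId": "LpA26q4QSiOPOyP0Bdh6Hw",
          "artifact": "public/build/target.zip"
        },
        "directory": "build",
        "format": "zip"
      }
    ]
  }
}
```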
Comment 90•6 years ago
Mark, did we decide that bug 1433854 isn't blocking this one, as it isn't a showstopper?
Can you check the other dependencies, and say if they really are hard dependencies or not? I'm wondering if we can upgrade to generic-worker 10, to unblock all the downstream bugs, and deal with any open issues afterwards. If they aren't critical issues, that would be my preference, but I do see there is quite a long list of dependencies, just not sure if they really are hard blockers or not.
Thanks!
Flags: needinfo?(mcornmesser)
Assignee
Comment 91•6 years ago
Generic-worker 8 is no longer in use in the datacenter; Windows nodes are now using 10.8.5. I am going to open a separate bug to further upgrade generic-worker, but I may wait on bug 1433854 being resolved before upgrading.
Status: NEW → RESOLVED
Closed: 6 years ago
Flags: needinfo?(mcornmesser)
Resolution: --- → FIXED
Summary: upgrade generic worker to 10.x.x (match versions on AWS testers) on Win 10 hardware → upgrade generic worker to 10.8.5 on Win 10 hardware
Updated•6 years ago
Attachment #9002623 - Flags: feedback?(pmoore)