Closed Bug 1470338 Opened 7 years ago Closed 6 years ago

Windows 10 GenericWorker 10 nodes occasionally fail to reboot

Categories

(Infrastructure & Operations :: RelOps: OpenCloudConfig, task)

task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: markco, Assigned: markco)

References

Details

No description provided.
Assignee: nobody → mcornmesser
I have come across a handful of nodes that are up and running, but GenericWorker is not running. Generic worker is finishing and exiting with exit code 67. Examples below comes from T-w1064-ms-042. Last entries in the GenericWorker log: 2018/06/21 23:23:56 Making system call SHSetKnownFolderPath with args: [D4C870 0 77C C042290280] 2018/06/21 23:23:56 Result: 0 0 The system could not find the environment option that was entered. 2018/06/21 23:23:56 Making system call SHGetKnownFolderPath with args: [D4C870 8000 77C C0422C2178] 2018/06/21 23:23:56 Result: 0 0 Cannot create a file when that file already exists. 2018/06/21 23:23:56 Making system call CoTaskMemFree with args: [17C0680] 2018/06/21 23:23:56 Result: 1 0 The operation completed successfully. 2018/06/21 23:23:56 Exiting worker so it can reboot... 2018/06/21 23:23:56 Saving file file-caches.json (absolute path: C:\generic-worker\file-caches.json) 2018/06/21 23:23:56 Saving file directory-caches.json (absolute path: C:\generic-worker\directory-caches.json) 2018/06/21 23:23:56 Exiting worker with exit code 67 Also within at this time or with in a 7 seconds 3 user directories were created: drwxr-xr-x 15 Administrator Administrators 8192 Jun 21 23:23 task_1529616203 drwxr-xr-x 30 Administrator Administrators 8192 Jun 21 23:16 task_1529617819 drwxr-xr-x 22 Administrator Administrators 4096 Jun 21 23:23 task_1529623436 Current user is task_1529617819 The papertrail link from the same time. https://papertrailapp.com/groups/1958653/events?focus=946596505102483460&q=ms-042&selected=946596505102483460 The reoccurring errors after the GW exit doesn't not seem to be unique to the problem nodes. I do not believe we hit this behavior with GW 8. For now I am going to let ms-042 and ms-023 be and reinstall any other machines affected like this. pmoore: any thoughts?
Flags: needinfo?(pmoore)
Blocks: 1443589
This looks suspicious: Resource Id is [Script]FirewallRule_ICMPv4In and Source Info is C:\windows\TEMP\xDynamicConfig.ps1::583::9::Script. Error Message is PowerShell DSC resource MSFT_ScriptResource failed to execute Set-TargetResource functionality with error message: Error formatting a string: Index (zero based) must be greater than or equal to zero and less than the size of the argument list.. .#015 This seems to break DSC From userdata/xDynamicConfig.ps1 I see: 583 Script ('FirewallRule_{0}' -f $item.ComponentName) { From userdata/Manifest/gecko-t-win10-64-hw-GW10.json I see: 1224 { 1225 "ComponentName": "ICMPv4In", 1226 "ComponentType": "FirewallRule", 1227 "Protocol": "ICMPv4", 1228 "Direction": "Inbound", 1229 "Action": "Allow" 1230 }, I see a silimar issue here with MSFT_ScriptResource, no idea if it is related: https://github.com/PowerShell/DscResources/issues/316 I also spotted that this line isn't writing to C:\generic-worker\generic-worker.log: https://github.com/mozilla-releng/OpenCloudConfig/blob/cbb61fd372ea8c36bd8e040024aef794eadc4a72/userdata/Configuration/GenericWorker/run-hw-generic-worker-10-and-reboot.bat#L54 The other lines that write to the log file have `>> C:\generic-worker\generic-worker.log`. I'd suggest adding some more logging between these lines: https://github.com/mozilla-releng/OpenCloudConfig/blob/cbb61fd372ea8c36bd8e040024aef794eadc4a72/userdata/Configuration/GenericWorker/run-hw-generic-worker-10-and-reboot.bat#L49-L57
Flags: needinfo?(pmoore)
I don't believe those errors are unique to the machines that are failing to reboot. I am going to work on resolving those to see if they have an affect or not.
Currently using this repo: https://github.com/mozilla-platform-ops/OpenCloudConfig Removed the ping firewall declarations from manifest, so the error will be gone for now. I am not quite understanding what is causing the error. Since the error is about a string replacement and the variable that is replacing it has a value. After the initial reboot the wrapper script added the following: echo Generic worker ran successfully (exit code %GW_EXIT_CODE%) rebooting >> C:\generic-worker\generic-worker.log if exist C:\generic-worker\rebootcount.txt del /Q /F C:\generic-worker\rebootcount.txt if exist C:\DSC\in-progress.lock del /Q /F C:\DSC\in-progress.lock shutdown /r /t 0 /f /c "Rebooting as generic worker ran successfully" >> C:\generic-worker\generic-worker.log set /a minutes=0 set /a reboots=1 :WaitingOnReboot echo "Generic worker has exited. Waiting on reboot. %minutes% minutes since command." >> C:\generic-worker\generic-worker.log ping -n 60 127.0.0.1 1>/nul if %minutes% GRT 10 Goto InfoGather set /a %minutes%=%minutes%=1 Goto WaitingOnReboot exit :InfoGather echo dumping tasklsit to C:\generic-worker\tasklist*.txt >> C:\generic-worker\generic-worker.log tasklist >> C:\generic-worker\tasklist%time:~-5%.txt set /a minutes=0 set /a %reboots%+1 echo "Attempting reboot again. %reboots% attempted." >> C:\generic-worker\generic-worker.log shutdown /r /t 0 /f /c "Attemptng additional reboot." >> C:\generic-worker\generic-worker.log Goto WaitingOnReboot Added the redirect to get the exit code into logging. Added a redirect to to hopefully log the output of the shutdown command. Added if 10 minutes pass since reboot dump task list and attempt additional reboot. Testing a fresh install nowon ms-016. If it works out I will reinstall ms-017 - ms-045 pointing them to this repo. Also looking into adding commands to :InfoGather to get more info when a machine hits this state.
Though OCC was not running at the time GenericWorker is running, LCM (Local Configuration Manager) , the under the hood part of DSC, was continuing to run. LCM was had a ConfigurationMode setting of ApplyandMonitor and ConfigurationModeFrequencyMins setting of 15 minutes. LCM was checking the DSC mof file and attempted to reapply it every 15 minutes. I am currently testing on ms-016 to change this. I have added the following to he rundsc.ps1: if ($locationType -eq 'DataCenter') { hw-DiskManage [DSCLocalConfigurationManager()] configuration LCMConfig { Node localhost { Settings { ConfigurationMode = 'ApplyOnly' ConfigurationModeFrequencyMins = '600' } } } LCMconfig Set-DscLocalConfigurationManager -path LCMconfig } Being pulled from: https://github.com/markcor/OpenCloudConfig/blob/master/userdata/rundsc.ps1 If it does not have any ill effects I will update the rundsc.ps1 for the other nodes running gw 10.
I was unsuccessful in trying to change the LCM setting. However, I think the issue may have been cause by OCC not fully completing and possible delaying a reboot or holding something open. More details here: https://bugzilla.mozilla.org/show_bug.cgi?id=1443589#c55
There was no failure to reboot. The node would reboot and then have no network connectivity. Updating the Mellanox network drivers had greatly reduced this behavior.
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.