Closed
Bug 1470338
Opened 7 years ago
Closed 6 years ago
Windows 10 GenericWorker 10 nodes occasionally fail to reboot
Categories
(Infrastructure & Operations :: RelOps: OpenCloudConfig, task)
Infrastructure & Operations
RelOps: OpenCloudConfig
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: markco, Assigned: markco)
References
Details
No description provided.
Assignee | ||
Updated•7 years ago
|
Assignee: nobody → mcornmesser
Assignee | ||
Comment 1•7 years ago
|
||
I have come across a handful of nodes that are up and running, but GenericWorker is not running. Generic worker is finishing and exiting with exit code 67. Examples below comes from T-w1064-ms-042.
Last entries in the GenericWorker log:
2018/06/21 23:23:56 Making system call SHSetKnownFolderPath with args: [D4C870 0 77C C042290280]
2018/06/21 23:23:56 Result: 0 0 The system could not find the environment option that was entered.
2018/06/21 23:23:56 Making system call SHGetKnownFolderPath with args: [D4C870 8000 77C C0422C2178]
2018/06/21 23:23:56 Result: 0 0 Cannot create a file when that file already exists.
2018/06/21 23:23:56 Making system call CoTaskMemFree with args: [17C0680]
2018/06/21 23:23:56 Result: 1 0 The operation completed successfully.
2018/06/21 23:23:56 Exiting worker so it can reboot...
2018/06/21 23:23:56 Saving file file-caches.json (absolute path: C:\generic-worker\file-caches.json)
2018/06/21 23:23:56 Saving file directory-caches.json (absolute path: C:\generic-worker\directory-caches.json)
2018/06/21 23:23:56 Exiting worker with exit code 67
Also within at this time or with in a 7 seconds 3 user directories were created:
drwxr-xr-x 15 Administrator Administrators 8192 Jun 21 23:23 task_1529616203
drwxr-xr-x 30 Administrator Administrators 8192 Jun 21 23:16 task_1529617819
drwxr-xr-x 22 Administrator Administrators 4096 Jun 21 23:23 task_1529623436
Current user is task_1529617819
The papertrail link from the same time.
https://papertrailapp.com/groups/1958653/events?focus=946596505102483460&q=ms-042&selected=946596505102483460
The reoccurring errors after the GW exit doesn't not seem to be unique to the problem nodes.
I do not believe we hit this behavior with GW 8.
For now I am going to let ms-042 and ms-023 be and reinstall any other machines affected like this.
pmoore: any thoughts?
Flags: needinfo?(pmoore)
Comment 2•7 years ago
|
||
This looks suspicious:
Resource Id is [Script]FirewallRule_ICMPv4In and Source Info is C:\windows\TEMP\xDynamicConfig.ps1::583::9::Script. Error Message is PowerShell DSC resource MSFT_ScriptResource failed to execute Set-TargetResource functionality with error message: Error formatting a string: Index (zero based) must be greater than or equal to zero and less than the size of the argument list.. .#015
This seems to break DSC
From userdata/xDynamicConfig.ps1 I see:
583 Script ('FirewallRule_{0}' -f $item.ComponentName) {
From userdata/Manifest/gecko-t-win10-64-hw-GW10.json I see:
1224 {
1225 "ComponentName": "ICMPv4In",
1226 "ComponentType": "FirewallRule",
1227 "Protocol": "ICMPv4",
1228 "Direction": "Inbound",
1229 "Action": "Allow"
1230 },
I see a silimar issue here with MSFT_ScriptResource, no idea if it is related:
https://github.com/PowerShell/DscResources/issues/316
I also spotted that this line isn't writing to C:\generic-worker\generic-worker.log: https://github.com/mozilla-releng/OpenCloudConfig/blob/cbb61fd372ea8c36bd8e040024aef794eadc4a72/userdata/Configuration/GenericWorker/run-hw-generic-worker-10-and-reboot.bat#L54
The other lines that write to the log file have `>> C:\generic-worker\generic-worker.log`.
I'd suggest adding some more logging between these lines:
https://github.com/mozilla-releng/OpenCloudConfig/blob/cbb61fd372ea8c36bd8e040024aef794eadc4a72/userdata/Configuration/GenericWorker/run-hw-generic-worker-10-and-reboot.bat#L49-L57
Flags: needinfo?(pmoore)
Assignee | ||
Comment 3•7 years ago
|
||
I don't believe those errors are unique to the machines that are failing to reboot. I am going to work on resolving those to see if they have an affect or not.
Assignee | ||
Comment 4•7 years ago
|
||
Currently using this repo: https://github.com/mozilla-platform-ops/OpenCloudConfig
Removed the ping firewall declarations from manifest, so the error will be gone for now. I am not quite understanding what is causing the error. Since the error is about a string replacement and the variable that is replacing it has a value.
After the initial reboot the wrapper script added the following:
echo Generic worker ran successfully (exit code %GW_EXIT_CODE%) rebooting >> C:\generic-worker\generic-worker.log
if exist C:\generic-worker\rebootcount.txt del /Q /F C:\generic-worker\rebootcount.txt
if exist C:\DSC\in-progress.lock del /Q /F C:\DSC\in-progress.lock
shutdown /r /t 0 /f /c "Rebooting as generic worker ran successfully" >> C:\generic-worker\generic-worker.log
set /a minutes=0
set /a reboots=1
:WaitingOnReboot
echo "Generic worker has exited. Waiting on reboot. %minutes% minutes since command." >> C:\generic-worker\generic-worker.log
ping -n 60 127.0.0.1 1>/nul
if %minutes% GRT 10 Goto InfoGather
set /a %minutes%=%minutes%=1
Goto WaitingOnReboot
exit
:InfoGather
echo dumping tasklsit to C:\generic-worker\tasklist*.txt >> C:\generic-worker\generic-worker.log
tasklist >> C:\generic-worker\tasklist%time:~-5%.txt
set /a minutes=0
set /a %reboots%+1
echo "Attempting reboot again. %reboots% attempted." >> C:\generic-worker\generic-worker.log
shutdown /r /t 0 /f /c "Attemptng additional reboot." >> C:\generic-worker\generic-worker.log
Goto WaitingOnReboot
Added the redirect to get the exit code into logging.
Added a redirect to to hopefully log the output of the shutdown command.
Added if 10 minutes pass since reboot dump task list and attempt additional reboot.
Testing a fresh install nowon ms-016. If it works out I will reinstall ms-017 - ms-045 pointing them to this repo. Also looking into adding commands to :InfoGather to get more info when a machine hits this state.
Assignee | ||
Comment 5•7 years ago
|
||
Though OCC was not running at the time GenericWorker is running, LCM (Local Configuration Manager) , the under the hood part of DSC, was continuing to run. LCM was had a ConfigurationMode setting of ApplyandMonitor and ConfigurationModeFrequencyMins setting of 15 minutes. LCM was checking the DSC mof file and attempted to reapply it every 15 minutes.
I am currently testing on ms-016 to change this. I have added the following to he rundsc.ps1:
if ($locationType -eq 'DataCenter') {
hw-DiskManage
[DSCLocalConfigurationManager()]
configuration LCMConfig
{
Node localhost
{
Settings
{
ConfigurationMode = 'ApplyOnly'
ConfigurationModeFrequencyMins = '600'
}
}
}
LCMconfig
Set-DscLocalConfigurationManager -path LCMconfig
}
Being pulled from: https://github.com/markcor/OpenCloudConfig/blob/master/userdata/rundsc.ps1
If it does not have any ill effects I will update the rundsc.ps1 for the other nodes running gw 10.
Assignee | ||
Comment 6•7 years ago
|
||
I was unsuccessful in trying to change the LCM setting. However, I think the issue may have been cause by OCC not fully completing and possible delaying a reboot or holding something open. More details here: https://bugzilla.mozilla.org/show_bug.cgi?id=1443589#c55
Assignee | ||
Comment 7•6 years ago
|
||
There was no failure to reboot. The node would reboot and then have no network connectivity. Updating the Mellanox network drivers had greatly reduced this behavior.
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
You need to log in
before you can comment on or make changes to this bug.
Description
•