Up to 20 minutes in between tasks for 10-64-hw workers
Categories
(Infrastructure & Operations :: RelOps: Windows OS, task)
Tracking
(Not tracked)
People
(Reporter: markco, Assigned: markco)
References
Details
Attachments
(2 files)
Assignee | ||
Comment 1•5 years ago
|
||
The longer a Windows Moonshot hardware worker up thge time in between tasks increase significantly.
For example ms-080 is an worker that has been up for 2 months:
C:\Users\task_1565882064\Desktop\New folder>systeminfo|find /i "original"
Original Install Date: 6/11/2019, 2:43:39 PM
The time between generic-worker exit and when it is ready for a new tasks is just under 25 minutes.
Aug 15 08:32:38 T-W1064-MS-080.mdc1.mozilla.com-1 generic-worker: Exiting worker with exit code 67
Aug 15 08:57:23 T-W1064-MS-080.mdc1.mozilla.com-1 generic-worker: Running command: 'net' 'user' 'task_1565875491' '/delete' The command completed successfully. #015
Aug 15 08:57:23 T-W1064-MS-080.mdc1.mozilla.com-1 generic-worker: Disk available: 227034537984 bytes#015
Aug 15 08:57:45 T-W1064-MS-080.mdc1.mozilla.com-1 generic-worker: No task claimed. Idle for 1m26.9475791s (will exit if no task claimed in 1h58m33.0524209s).#015
For comparision ms-151 was redeployed yesterday, and the time between exit and ready for new task is under 3 minutes.
From https://papertrailapp.com/groups/1141234/events?focus=1098673257613271057&selected=1098673257613271057
Aug 15 08:02:19 T-W1064-MS-151.mdc1.mozilla.com-1 generic-worker: Exiting worker with exit code 67#015
Aug 15 08:04:53 T-W1064-MS-151.mdc1.mozilla.com-1 generic-worker: Looking for existing task users to delete...#015
Aug 15 08:04:53 T-W1064-MS-151.mdc1.mozilla.com-1 generic-worker: Attempting to remove Windows user task_1565875111...#015
Aug 15 08:04:53 T-W1064-MS-151.mdc1.mozilla.com-1 generic-worker: Running command: 'net' 'user' 'task_1565875111' '/delete' The command completed successfully. #0
Assignee | ||
Comment 2•5 years ago
|
||
Using ms-080 to gather data and troubleshoot
Regarding the start up task and Puppet.
The generic-worker exit and node reboot:
Aug 15 08:32:39 T-W1064-MS-080.mdc1.mozilla.com-1 generic-worker-wrapper: "Generic worker has created a new user successfully (exit code 67)" #015
Aug 15 08:32:39 T-W1064-MS-080.mdc1.mozilla.com-1 User32: The process C:\windows\system32\shutdown.exe (T-W1064-MS-080) has initiated the restart of computer T-W1064-MS-080 on behalf of user NT AUTHORITY\SYSTEM for the following reason: No title for this reason could be found Reason Code: 0x800000ff Shutdown Type: restart Comment: New user created; rebooting#015
Aug 15 08:32:44 T-W1064-MS-080.mdc1.mozilla.com-1 nxlog: WARNING received a system shutdown request#015
The MaintainSystem starts up just over an minute later:
Aug 15 08:33:54 T-W1064-MS-080.mdc1.mozilla.com-1 MaintainSystem: Run-MaintainSystem :: begin - 2019-08-15T15:33:53.6325494Z#015
Puppet starts 22 seconds later:
Aug 15 08:34:16 T-W1064-MS-080.mdc1.mozilla.com-1 MaintainSystem: Puppet-Run :: Initiating Puppet apply
Commands run by Puppet timeout:
Aug 15 08:34:17 T-W1064-MS-080.mdc1.mozilla.com-1 puppet-run: 2019-08-15 14:57:41 +0000 /Stage[main]/Win_firewall::Allow_ping/Windows::Firewall_rule[ICMP Ping]/Exec[windows-firewall-ICMP Ping]/unless (err): Check "C:\WINDOWS\system32\netsh.exe advfirewall firewall show rule name="ICMP Ping"" exceeded timeout#015
Aug 15 08:34:17 T-W1064-MS-080.mdc1.mozilla.com-1 puppet-run: 2019-08-15 15:02:41 +0000 /Stage[main]/Win_nxlog::Fw_exception/Windows_firewall::Exception[nxlog]/Exec[set rule papertrail 1]/unless (err): Check "C:\windows\system32\netsh.exe advfirewall firewall show rule name="papertrail 1"" exceeded timeout#015
Aug 15 08:34:17 T-W1064-MS-080.mdc1.mozilla.com-1 puppet-run: 2019-08-15 15:07:42 +0000 /Stage[main]/Win_kms::Fw_exception/Windows_firewall::Exception[KMS_in]/Exec[set rule Allow KMS in]/unless (err): Check "C:\windows\system32\netsh.exe advfirewall firewall show rule name="Allow KMS in"" exceeded timeout#015
Puppet completes 22 minutes later:
Aug 15 08:56:17 T-W1064-MS-080.mdc1.mozilla.com-1 MaintainSystem: Puppet-Run :: end - 2019-08-15T15:56:16.2279898Z#015
For comparision tthe newly deployed machine runs Pupept in 40 seconds. This would be teh same configuration check and apply.
Aug 15 06:25:11 T-W1064-MS-151.mdc1.mozilla.com-1 MaintainSystem: Puppet-Run :: Initiating Puppet apply
Aug 15 06:25:51 T-W1064-MS-151.mdc1.mozilla.com-1 MaintainSystem: Puppet-Run :: Puppet apply successful#015
I suspect with the command run by Puppet timing out that node is in a state in which the system is busy and/or certain process are not available.
Assignee | ||
Comment 3•5 years ago
|
||
During the time span between start up and the ready for a task the following event appears numerous time in the application event viewer:
Fault bucket , type 0
Event Name: AppxDeploymentFailureBlue
Response: Not available
Cab Id: 0
Problem signature:
P1: 800705AA
P2: 16
P3: 3
P4: windows.repositoryExtension
P5: 10.0.15063.0
P6: 5401
P7:
P8:
P9:
P10:
Attached files:
\?\C:\ProgramData\Microsoft\Windows\WER\Temp\WER5E53.tmp.WERInternalMetadata.xml
These files may be available here:
C:\ProgramData\Microsoft\Windows\WER\ReportQueue\NonCritical_800705AA_5fda5324563d4262d6b3ffc2163fabc72ef1d80_00000000_11bdf1ce
Analysis symbol:
Rechecking for solution: 0
Report Id: 3b7fa83c-5129-409b-b846-d313c2060b58
Report Status: 528484
Hashed bucket:
The debug message level is information. It occurs multiple times per second. The only difference between the message the path that follows "These files may be available here:"
Assignee | ||
Comment 4•5 years ago
|
||
This is what is on the display during the time between start up and the MaintainSystem script finishing.
Assignee | ||
Comment 5•5 years ago
|
||
Logs below start here: https://papertrailapp.com/groups/1141234/events?focus=1098808615248224273&q=ms-080&selected=1098808615248224273 . It is a bit confusing there because there is some old logs before the previous reboot got transmitted after the next boot.
On this node I have disabled the MaintainSchedule task and removed the flag file delete from the generic-worker wrapper script.
The user init script is still running, and I added a check for an additional registry key under HKLM (disregard this query for now).
The logging starts here:
Aug 15 17:00:11 T-W1064-MS-080.mdc1.mozilla.com-1 nxlog: INFO input file 'C:/generic-worker/generic-worker.log' was truncated, restarting from the beginning#015
Aug 15 17:00:11 T-W1064-MS-080.mdc1.mozilla.com-1 nxlog: INFO nxlog-ce-2.10.2150 started#015
Aug 15 17:00:11 T-W1064-MS-080.mdc1.mozilla.com-1 nxlog: INFO input file 'C:/generic-worker/generic-worker-wrapper.log' was truncated, restarting from the beginning#015
Aug 15 17:00:11 T-W1064-MS-080.mdc1.mozilla.com-1 nxlog: INFO connecting to log-aggregator.srv.releng.mdc1.mozilla.com:514#015
Aug 15 17:00:11 T-W1064-MS-080.mdc1.mozilla.com-1 generic-worker: 1 file(s) copied. #015
The user init script begins outputting:
Aug 15 17:00:17 T-W1064-MS-080.mdc1.mozilla.com-1 generic-worker: C:\Users\task_1565912075>goto CheckForStuckRects3 #015
Aug 15 17:00:17 T-W1064-MS-080.mdc1.mozilla.com-1 generic-worker: C:\Users\task_1565912075>echo Wait for registry setting to exist before changing it... Wait for registry setting to exist before changing it... #015
Aug 15 17:00:17 T-W1064-MS-080.mdc1.mozilla.com-1 generic-worker: C:\Users\task_1565912075>reg query HKLM\SOFTWARE\Mozilla\ronin_puppet\workertype /ve ERROR: The system was unable to find the specified registry key or value. #015
Aug 15 17:00:17 T-W1064-MS-080.mdc1.mozilla.com-1 generic-worker: C:\Users\task_1565912075>reg query HKCU\Software\Microsoft\Windows\CurrentVersion\Explorer\StuckRects3 /ve ERROR: The system was unable to find the specified registry key or value.
The user init scripts loops for about 23 minutes. Until:
Aug 15 17:23:51 T-W1064-MS-080.mdc1.mozilla.com-1 generic-worker: C:\Users\task_1565912075>echo Setting visual effects Setting visual effects #015
Aug 15 17:23:51 T-W1064-MS-080.mdc1.mozilla.com-1 generic-worker: C:\Users\task_1565912075>reg add "HKCU\SOFTWARE\Microsoft\Windows\CurrentVersion\Explorer\VisualEffects" /v "VisualFXSetting" /t REG_DWORD /d 1 /f The operation completed successfully. #015
Aug 15 17:23:51 T-W1064-MS-080.mdc1.mozilla.com-1 generic-worker: C:\Users\task_1565912075>echo Completed task user initialisation. Completed task user initialisation.#015
Aug 15 17:23:51 T-W1064-MS-080.mdc1.mozilla.com-1 generic-worker: Exit Code: 0 User Time: 0s Kernel Time: 46.875ms Wall Time: 23m35.9239723s Result: SUCCEEDED#015
Aug 15 17:23:51 T-W1064-MS-080.mdc1.mozilla.com-1 generic-worker: Creating Windows user task_1565913613...#015
It seems the task users registry hive is not accessible during this time.
Assignee | ||
Comment 6•5 years ago
|
||
I corrected the other reg query. It appears that other parts of the registry is accessible after startup:
Aug 15 17:59:28 T-W1064-MS-080.mdc1.mozilla.com-1 generic-worker: C:\Users\task_1565913613>reg query HKLM\SOFTWARE\Mozilla\ronin_puppet\ /ve HKEY_LOCAL_MACHINE\SOFTWARE\Mozilla\ronin_puppet (Default) REG_SZ (value not set)
Assignee | ||
Comment 7•5 years ago
|
||
I continued to look at this through out the week. This does seem to be a progress issue the longer the deployment lives.
For instance a node last week that was around 20 minutes is now near 26 minutes in between tasks:
https://papertrailapp.com/groups/1141234/events?focus=1101653746712211466&q=ms-325&selected=1101653746712211466
Aug 23 13:25:43 T-W1064-MS-325.mdc2.mozilla.com-1 generic-worker: Exiting worker with exit code 68#015
Aug 23 13:25:43 T-W1064-MS-325.mdc2.mozilla.com-1 generic-worker-wrapper: "Generic worker has reached idle timeout set in config file, win_generic_worker::hw_config::idle_timeout seconds (exit code 68)" #015
Aug 23 13:25:43 T-W1064-MS-325.mdc2.mozilla.com-1 generic-worker-wrapper: "Generic-worker panic! Issue with enviroment or worker bug. (exit code 68)" #015
Aug 23 13:25:43 T-W1064-MS-325.mdc2.mozilla.com-1 generic-worker-wrapper: "For more information on the exit codes run C:\generic-worker\generic-worker.exe --help or go to https://github.com/taskcluster/generic-worker" #015
Aug 23 13:25:44 T-W1064-MS-325.mdc2.mozilla.com-1 User32: The process C:\windows\system32\shutdown.exe (T-W1064-MS-325) has initiated the restart of computer T-W1064-MS-325 on behalf of user NT AUTHORITY\SYSTEM for the following reason: No title for this reason could be found Reason Code: 0x800000ff Shutdown Type: restart Comment: Generic worker has reached idle timeout; rebooting#015
https://papertrailapp.com/groups/1141234/events?focus=1101660119273467935&selected=1101660119273467935
Aug 23 13:51:02 T-W1064-MS-325.mdc2.mozilla.com-1 generic-worker: C:\Users\task_1566584723>echo Setting visual effects Setting visual effects #015
Aug 23 13:51:02 T-W1064-MS-325.mdc2.mozilla.com-1 generic-worker: C:\Users\task_1566584723>reg add "HKCU\SOFTWARE\Microsoft\Windows\CurrentVersion\Explorer\VisualEffects" /v "VisualFXSetting" /t REG_DWORD /d 1 /f The operation completed successfully. #015
Aug 23 13:51:02 T-W1064-MS-325.mdc2.mozilla.com-1 generic-worker: C:\Users\task_1566584723>echo Completed task user initialisation. Completed task user initialisation.#015
Aug 23 13:51:02 T-W1064-MS-325.mdc2.mozilla.com-1 generic-worker: Exit Code: 0 User Time: 0s Kernel Time: 0s Wall Time: 53.0421912s Result: SUCCEEDED#015
Aug 23 13:51:02 T-W1064-MS-325.mdc2.mozilla.com-1 generic-worker: Creating Windows user task_1566593408...#015
Aug 23 13:51:02 T-W1064-MS-325.mdc2.mozilla.com-1 generic-worker: Running command: 'net' 'user' 'task_1566593408' 'pWd0_vkSN7awkNSlhyc2fyKNRMm8T' '/add' '/expires:never' '/passwordchg:no' '/y'#015
Aug 23 13:51:02 T-W1064-MS-325.mdc2.mozilla.com-1 generic-worker: Created new OS user!#015
Aug 23 13:51:02 T-W1064-MS-325.mdc2.mozilla.com-1 generic-worker: Running command: 'wmic' 'useraccount' 'where' 'name='task_1566593408'' 'set' 'passwordexpires=false' Updating property(s) of '\T-W1064-MS-325\ROOT\CIMV2:Win32_UserAccount.Domain="T-W1064-MS-325",Name="task_1566593408"' Property(s) update successful. #015
Aug 23 13:51:02 T-W1064-MS-325.mdc2.mozilla.com-1 generic-worker: Running command: 'net' 'localgroup' 'Remote Desktop Users' '/add' 'task_1566593408' The command completed successfully. #015
Aug 23 13:51:02 T-W1064-MS-325.mdc2.mozilla.com-1 generic-worker: Making system call LogonUserW with args: [C04233B0A0 C0421FD7D0 C0424A9F40 2 0 C0421FD7D8]#015
The screenshot is from this machine. The only area I have found that continuous grows is the sub keys of HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows NT\CurrentVersion\ProfileList. It appears to be a profile listed for each of the task users that has been created with the first starting at 1001 in the final field.
Pmoore: Any ideas or suggestions on where to look or what to try? If possible could we jump on a video chat and troubleshoot this?
If I can identify the cause I might be able to address this from the maintainsystem schedule task or Puppet.
Comment 8•5 years ago
|
||
Hi Mark,
Sure let's hop on vidyo to go through it together. Can you send me a meeting request? Thanks!
Pete
Assignee | ||
Comment 9•5 years ago
|
||
Sent a request for 2019-09-12.
I will set aside a few workers for us to look at prior to the meeting,
Assignee | ||
Comment 10•5 years ago
|
||
I have set aside ms-063, ms-065, ms-070, and ms-074 for troubleshooting.
Assignee | ||
Updated•5 years ago
|
Description
•