Closed Bug 1324540 Opened 7 years ago Closed 7 years ago

Generic Worker not starting up after reboot on win7

Categories

(Taskcluster :: Workers, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: pmoore, Unassigned)

Details

Generic worker runs ok on win7, but after a task completes, and it reboots, it looks like the worker isn't starting up, e.g. from https://papertrailapp.com/systems/601379563/events:


Dec 19 21:04:56 i-0fb5817db32a6908b.gecko-t-win7-32.use1.mozilla.com USER32: The process C:\windows\system32\shutdown.exe (I-0FB5817DB32A6) has initiated the restart of computer I-0FB5817DB32A6 on behalf of user I-0FB5817DB32A6\GenericWorker for the following reason: No title for this reason could be found   Reason Code: 0x800000ff   Shutdown Type: restart   Comment: Rebooting as generic worker ran successfully 
Dec 19 21:05:39 i-0fb5817db32a6908b.gecko-t-win7-32.use1.mozilla.com Apache_Service: The Apache service named  reported the following error:  >>> Syntax error on line 179 of C:/Program Files/Apache Software Foundation/Apache2.2/conf/httpd.conf:     . 
Dec 19 21:05:39 i-0fb5817db32a6908b.gecko-t-win7-32.use1.mozilla.com Apache_Service: The Apache service named  reported the following error:  >>> DocumentRoot must be a directory     . 
Dec 19 21:05:39 i-0fb5817db32a6908b.gecko-t-win7-32.use1.mozilla.com Service_Control_Manager: The Apache2.2 service terminated with service-specific error Incorrect function.. 
Dec 19 21:05:40 i-0fb5817db32a6908b.gecko-t-win7-32.use1.mozilla.com Microsoft-Windows-WMI: Event filter with query "SELECT * FROM __InstanceModificationEvent WITHIN 60 WHERE TargetInstance ISA "Win32_Processor" AND TargetInstance.LoadPercentage > 99" could not be reactivated in namespace "//./root/CIMV2" because of error 0x80041003. Events cannot be delivered through this filter until the problem is corrected. 
Dec 19 21:05:41 i-0fb5817db32a6908b.gecko-t-win7-32.use1.mozilla.com Microsoft-Windows-CAPI2: Failed extract of third-party root list from auto update cab at: <http://www.download.windowsupdate.com/msdownload/update/v3/static/trustedr/en/authrootstl.cab> with error: A required certificate is not within its validity period when verifying against the current system clock or the timestamp in the signed file.  . 
Dec 19 21:07:10 i-0fb5817db32a6908b.gecko-t-win7-32.use1.mozilla.com HaltOnIdle: Is-Running :: generic-worker is not running. 
Dec 19 21:07:10 i-0fb5817db32a6908b.gecko-t-win7-32.use1.mozilla.com HaltOnIdle: Is-Running :: OpenCloudConfig is not running. 
Dec 19 21:07:10 i-0fb5817db32a6908b.gecko-t-win7-32.use1.mozilla.com HaltOnIdle: Get-Uptime :: last boot: 12/19/2016 8:05:21 PM; uptime: 00:01:47.0292000. 
Dec 19 21:07:10 i-0fb5817db32a6908b.gecko-t-win7-32.use1.mozilla.com HaltOnIdle: instance failed some validity checks and will be retested shortly. 
Dec 19 21:07:41 i-0fb5817db32a6908b.gecko-t-win7-32.use1.mozilla.com Microsoft-Windows-DSC: The DscTimer is going to invoke the consistency check for task of type REBOOT. 
Dec 19 21:07:41 i-0fb5817db32a6908b.gecko-t-win7-32.use1.mozilla.com Microsoft-Windows-DSC: The DscTimer is running LCM method PerformRequiredConfigurationChecks with the flag set to 2. 
Dec 19 21:07:41 i-0fb5817db32a6908b.gecko-t-win7-32.use1.mozilla.com Microsoft-Windows-DSC: The local configuration manager started. 
Dec 19 21:07:41 i-0fb5817db32a6908b.gecko-t-win7-32.use1.mozilla.com Microsoft-Windows-DSC: Job {C9BA27C1-C626-11E6-8FA7-12D204CE79EC} :   Operation Consistency Check or Pull started by user sid S-1-5-20 from computer NULL. 
Dec 19 21:07:41 i-0fb5817db32a6908b.gecko-t-win7-32.use1.mozilla.com Microsoft-Windows-DSC: The DscTimer is going to start the consistency timer for the first time with value 15 minutes. 
Dec 19 21:07:41 i-0fb5817db32a6908b.gecko-t-win7-32.use1.mozilla.com Microsoft-Windows-DSC: The DscTimer successfully created the consistency timer. 
Dec 19 21:07:41 i-0fb5817db32a6908b.gecko-t-win7-32.use1.mozilla.com Microsoft-Windows-DSC: The DscTimer is going to start the refresh timer for the first time with value 30 minutes. 
Dec 19 21:07:41 i-0fb5817db32a6908b.gecko-t-win7-32.use1.mozilla.com Microsoft-Windows-DSC: The DscTimer successfully created the refresh timer. 
Dec 19 21:07:41 i-0fb5817db32a6908b.gecko-t-win7-32.use1.mozilla.com Microsoft-Windows-DSC: Job {C9BA27C1-C626-11E6-8FA7-12D204CE79EC} :   Running consistency engine. 
Dec 19 21:07:41 i-0fb5817db32a6908b.gecko-t-win7-32.use1.mozilla.com Microsoft-Windows-DSC: Job {C9BA27C1-C626-11E6-8FA7-12D204CE79EC} :   Consistency Engine did not find a current or pending configuration to apply 
Dec 19 21:07:41 i-0fb5817db32a6908b.gecko-t-win7-32.use1.mozilla.com Microsoft-Windows-DSC: Job {C9BA27C1-C626-11E6-8FA7-12D204CE79EC} :   Consistency engine was run successfully. 
Dec 19 21:07:41 i-0fb5817db32a6908b.gecko-t-win7-32.use1.mozilla.com Microsoft-Windows-DSC: Job {C9BA27C1-C626-11E6-8FA7-12D204CE79EC} :   Job runs under the following LCM setting.   ConfigurationMode: ApplyAndMonitor   ConfigurationModeFrequencyMins: 15   RefreshMode: PUSH   RefreshFrequencyMins: 30   RebootNodeIfNeeded: NONE   DebugMode: False 
Dec 19 21:07:41 i-0fb5817db32a6908b.gecko-t-win7-32.use1.mozilla.com Microsoft-Windows-DSC: Job {C9BA27C1-C626-11E6-8FA7-12D204CE79EC} :   Operation Consistency Check or Pull completed successfully. 
Dec 19 21:07:41 i-0fb5817db32a6908b.gecko-t-win7-32.use1.mozilla.com Microsoft-Windows-DSC: The DscTimer has successfully run LCM method PerformRequiredConfigurationChecks with flag 2. 
Dec 19 21:08:42 i-0fb5817db32a6908b.gecko-t-win7-32.use1.mozilla.com Microsoft-Windows-DSC: The local configuration manager was shut down. 
Dec 19 21:09:05 i-0fb5817db32a6908b.gecko-t-win7-32.use1.mozilla.com HaltOnIdle: Is-Running :: generic-worker is not running. 
Dec 19 21:09:05 i-0fb5817db32a6908b.gecko-t-win7-32.use1.mozilla.com HaltOnIdle: Is-Running :: OpenCloudConfig is not running. 
Dec 19 21:09:05 i-0fb5817db32a6908b.gecko-t-win7-32.use1.mozilla.com HaltOnIdle: Get-Uptime :: last boot: 12/19/2016 8:05:21 PM; uptime: 00:03:41.3616000. 
Dec 19 21:09:05 i-0fb5817db32a6908b.gecko-t-win7-32.use1.mozilla.com HaltOnIdle: instance failed some validity checks and will be retested shortly. 
Dec 19 21:11:05 i-0fb5817db32a6908b.gecko-t-win7-32.use1.mozilla.com HaltOnIdle: Is-Running :: generic-worker is not running. 
Dec 19 21:11:05 i-0fb5817db32a6908b.gecko-t-win7-32.use1.mozilla.com HaltOnIdle: Is-Running :: OpenCloudConfig is not running. 
Dec 19 21:11:05 i-0fb5817db32a6908b.gecko-t-win7-32.use1.mozilla.com HaltOnIdle: Get-Uptime :: last boot: 12/19/2016 8:05:21 PM; uptime: 00:05:41.3412000. 
Dec 19 21:11:05 i-0fb5817db32a6908b.gecko-t-win7-32.use1.mozilla.com HaltOnIdle: instance failed validity check and will be halted. 
Dec 19 21:11:05 i-0fb5817db32a6908b.gecko-t-win7-32.use1.mozilla.com USER32: The process C:\windows\system32\shutdown.exe (I-0FB5817DB32A6) has initiated the shutdown of computer I-0FB5817DB32A6 on behalf of user NT AUTHORITY\SYSTEM for the following reason: Application: Maintenance (Planned)   Reason Code: 0x80040001   Shutdown Type: shutdown   Comment: HaltOnIdle :: instance failed validity checks 


We see above from the "uptime" message, that the reboot did take place, so it looks like for some reason the worker doesn't start up correctly afterwards.

It also doesn't appear to log anything after the reboot, which suggests a problem with either the run-generic-worker.bat script getting executed, or exiting before it starts generic-worker.exe.
Rob,

I'm wondering if after the reboot, C:\dsc\task-claim-state.valid doesn't get created, so the worker doesn't ever start up, waiting for this file.

Can you confirm if it would be recreated after a reboot?

Thanks!
Flags: needinfo?(rthijssen)
no it won't exist. it's created by dsc but deleted by that bat file.

when dsc detects that generic worker is running, it deletes all dsc related jobs and triggers to prevent dsc from running again. it can no longer create the file since it will never run again on that instance.
Flags: needinfo?(rthijssen)
It looks like it is still a problem:

https://papertrailapp.com/groups/853883/events?q=%3A%3A+instance+failed+validity+checks&q_id=15587143

The commit in comment 4 looks good to me, I'm not sure what is happening here.
when i load that papertrail search it looks like the last validity check failure was yesterday afternoon (Dec 19 15:22:05) which looks correct to me (no more terminations since then). is this what you're seeing too?
I see some also from today :(

https://papertrailapp.com/systems/601769973/events?highlight=747643510680899631&focus=747643510680899631

Dec 20 00:15:26 i-044abd4887a18905a.gecko-t-win7-32.use1.mozilla.com USER32: The process C:\windows\system32\shutdown.exe (I-044ABD4887A18) has initiated the restart of computer I-044ABD4887A18 on behalf of user I-044ABD4887A18\GenericWorker for the following reason: No title for this reason could be found   Reason Code: 0x800000ff   Shutdown Type: restart   Comment: Rebooting as generic worker ran successfully 
Dec 20 00:16:05 i-044abd4887a18905a.gecko-t-win7-32.use1.mozilla.com Apache_Service: The Apache service named  reported the following error:  >>> Syntax error on line 179 of C:/Program Files/Apache Software Foundation/Apache2.2/conf/httpd.conf:     . 
Dec 20 00:16:05 i-044abd4887a18905a.gecko-t-win7-32.use1.mozilla.com Apache_Service: The Apache service named  reported the following error:  >>> DocumentRoot must be a directory     . 
Dec 20 00:16:07 i-044abd4887a18905a.gecko-t-win7-32.use1.mozilla.com Service_Control_Manager: The Apache2.2 service terminated with service-specific error Incorrect function.. 
Dec 20 00:16:07 i-044abd4887a18905a.gecko-t-win7-32.use1.mozilla.com Microsoft-Windows-WMI: Event filter with query "SELECT * FROM __InstanceModificationEvent WITHIN 60 WHERE TargetInstance ISA "Win32_Processor" AND TargetInstance.LoadPercentage > 99" could not be reactivated in namespace "//./root/CIMV2" because of error 0x80041003. Events cannot be delivered through this filter until the problem is corrected. 
Dec 20 00:16:09 i-044abd4887a18905a.gecko-t-win7-32.use1.mozilla.com Microsoft-Windows-CAPI2: Failed extract of third-party root list from auto update cab at: <http://www.download.windowsupdate.com/msdownload/update/v3/static/trustedr/en/authrootstl.cab> with error: A required certificate is not within its validity period when verifying against the current system clock or the timestamp in the signed file.  . 
Dec 20 00:18:09 i-044abd4887a18905a.gecko-t-win7-32.use1.mozilla.com Microsoft-Windows-DSC: The DscTimer is going to invoke the consistency check for task of type REBOOT. 
Dec 20 00:18:09 i-044abd4887a18905a.gecko-t-win7-32.use1.mozilla.com Microsoft-Windows-DSC: The DscTimer is running LCM method PerformRequiredConfigurationChecks with the flag set to 2. 
Dec 20 00:18:09 i-044abd4887a18905a.gecko-t-win7-32.use1.mozilla.com Microsoft-Windows-DSC: The local configuration manager started. 
Dec 20 00:18:09 i-044abd4887a18905a.gecko-t-win7-32.use1.mozilla.com Microsoft-Windows-DSC: Job {64E71311-C641-11E6-90E0-0A76F039C68E} :   Operation Consistency Check or Pull started by user sid S-1-5-20 from computer NULL. 
Dec 20 00:18:09 i-044abd4887a18905a.gecko-t-win7-32.use1.mozilla.com Microsoft-Windows-DSC: The DscTimer is going to start the consistency timer for the first time with value 15 minutes. 
Dec 20 00:18:09 i-044abd4887a18905a.gecko-t-win7-32.use1.mozilla.com Microsoft-Windows-DSC: The DscTimer successfully created the consistency timer. 
Dec 20 00:18:09 i-044abd4887a18905a.gecko-t-win7-32.use1.mozilla.com Microsoft-Windows-DSC: The DscTimer is going to start the refresh timer for the first time with value 30 minutes. 
Dec 20 00:18:09 i-044abd4887a18905a.gecko-t-win7-32.use1.mozilla.com Microsoft-Windows-DSC: The DscTimer successfully created the refresh timer. 
Dec 20 00:18:09 i-044abd4887a18905a.gecko-t-win7-32.use1.mozilla.com Microsoft-Windows-DSC: Job {64E71311-C641-11E6-90E0-0A76F039C68E} :   Running consistency engine. 
Dec 20 00:18:09 i-044abd4887a18905a.gecko-t-win7-32.use1.mozilla.com Microsoft-Windows-DSC: Job {64E71311-C641-11E6-90E0-0A76F039C68E} :   Consistency Engine did not find a current or pending configuration to apply 
Dec 20 00:18:09 i-044abd4887a18905a.gecko-t-win7-32.use1.mozilla.com Microsoft-Windows-DSC: Job {64E71311-C641-11E6-90E0-0A76F039C68E} :   Consistency engine was run successfully. 
Dec 20 00:18:09 i-044abd4887a18905a.gecko-t-win7-32.use1.mozilla.com Microsoft-Windows-DSC: Job {64E71311-C641-11E6-90E0-0A76F039C68E} :   Job runs under the following LCM setting.   ConfigurationMode: ApplyAndMonitor   ConfigurationModeFrequencyMins: 15   RefreshMode: PUSH   RefreshFrequencyMins: 30   RebootNodeIfNeeded: NONE   DebugMode: False 
Dec 20 00:18:09 i-044abd4887a18905a.gecko-t-win7-32.use1.mozilla.com Microsoft-Windows-DSC: Job {64E71311-C641-11E6-90E0-0A76F039C68E} :   Operation Consistency Check or Pull completed successfully. 
Dec 20 00:18:09 i-044abd4887a18905a.gecko-t-win7-32.use1.mozilla.com Microsoft-Windows-DSC: The DscTimer has successfully run LCM method PerformRequiredConfigurationChecks with flag 2. 
Dec 20 00:18:10 i-044abd4887a18905a.gecko-t-win7-32.use1.mozilla.com HaltOnIdle: Is-Running :: generic-worker is not running. 
Dec 20 00:18:10 i-044abd4887a18905a.gecko-t-win7-32.use1.mozilla.com HaltOnIdle: Is-Running :: OpenCloudConfig is not running. 
Dec 20 00:18:10 i-044abd4887a18905a.gecko-t-win7-32.use1.mozilla.com HaltOnIdle: Get-Uptime :: last boot: 12/19/2016 11:15:48 PM; uptime: 00:02:19.4682000. 
Dec 20 00:18:10 i-044abd4887a18905a.gecko-t-win7-32.use1.mozilla.com HaltOnIdle: instance failed some validity checks and will be retested shortly. 
Dec 20 00:19:08 i-044abd4887a18905a.gecko-t-win7-32.use1.mozilla.com Microsoft-Windows-DSC: The local configuration manager was shut down. 
Dec 20 00:20:04 i-044abd4887a18905a.gecko-t-win7-32.use1.mozilla.com HaltOnIdle: Is-Running :: generic-worker is not running. 
Dec 20 00:20:04 i-044abd4887a18905a.gecko-t-win7-32.use1.mozilla.com HaltOnIdle: Is-Running :: OpenCloudConfig is not running. 
Dec 20 00:20:04 i-044abd4887a18905a.gecko-t-win7-32.use1.mozilla.com HaltOnIdle: Get-Uptime :: last boot: 12/19/2016 11:15:48 PM; uptime: 00:04:14.3778000. 
Dec 20 00:20:04 i-044abd4887a18905a.gecko-t-win7-32.use1.mozilla.com HaltOnIdle: instance failed some validity checks and will be retested shortly. 
Dec 20 00:22:05 i-044abd4887a18905a.gecko-t-win7-32.use1.mozilla.com HaltOnIdle: Is-Running :: generic-worker is not running. 
Dec 20 00:22:05 i-044abd4887a18905a.gecko-t-win7-32.use1.mozilla.com HaltOnIdle: Is-Running :: OpenCloudConfig is not running. 
Dec 20 00:22:05 i-044abd4887a18905a.gecko-t-win7-32.use1.mozilla.com HaltOnIdle: Get-Uptime :: last boot: 12/19/2016 11:15:48 PM; uptime: 00:06:14.3730000. 
Dec 20 00:22:05 i-044abd4887a18905a.gecko-t-win7-32.use1.mozilla.com HaltOnIdle: instance failed validity check and will be halted. 
Dec 20 00:22:05 i-044abd4887a18905a.gecko-t-win7-32.use1.mozilla.com USER32: The process C:\windows\system32\shutdown.exe (I-044ABD4887A18) has initiated the shutdown of computer I-044ABD4887A18 on behalf of user NT AUTHORITY\SYSTEM for the following reason: Application: Maintenance (Planned)   Reason Code: 0x80040001   Shutdown Type: shutdown   Comment: HaltOnIdle :: instance failed validity checks
This was immediately after running

https://tools.taskcluster.net/task-inspector/#KPyhlwBeQ1eG4IbXBB_Sfg/0

This is the instance data for that task:

[taskcluster 2016-12-19T22:30:49.122Z] Worker Type (gecko-t-win7-32) settings:
[taskcluster 2016-12-19T22:30:49.199Z]   {
[taskcluster 2016-12-19T22:30:49.199Z]     "aws": {
[taskcluster 2016-12-19T22:30:49.199Z]       "ami-id": "ami-64151f73",
[taskcluster 2016-12-19T22:30:49.199Z]       "availability-zone": "us-east-1d",
[taskcluster 2016-12-19T22:30:49.199Z]       "instance-id": "i-044abd4887a18905a",
[taskcluster 2016-12-19T22:30:49.199Z]       "instance-type": "c4.2xlarge",
[taskcluster 2016-12-19T22:30:49.199Z]       "local-ipv4": "172.31.24.9",
[taskcluster 2016-12-19T22:30:49.199Z]       "public-hostname": "ec2-54-91-101-0.compute-1.amazonaws.com",
[taskcluster 2016-12-19T22:30:49.199Z]       "public-ipv4": "54.91.101.0"
[taskcluster 2016-12-19T22:30:49.199Z]     },
[taskcluster 2016-12-19T22:30:49.199Z]     "generic-worker": {
[taskcluster 2016-12-19T22:30:49.199Z]       "go-arch": "386",
[taskcluster 2016-12-19T22:30:49.200Z]       "go-os": "windows",
[taskcluster 2016-12-19T22:30:49.200Z]       "go-version": "go1.7.3",
[taskcluster 2016-12-19T22:30:49.200Z]       "release": "https://github.com/taskcluster/generic-worker/releases/tag/v7.2.1",
[taskcluster 2016-12-19T22:30:49.200Z]       "version": "7.2.1"
[taskcluster 2016-12-19T22:30:49.200Z]     },
[taskcluster 2016-12-19T22:30:49.200Z]     "machine-setup": {
[taskcluster 2016-12-19T22:30:49.200Z]       "ami-created": "2016-12-19 13:48:56.819Z",
[taskcluster 2016-12-19T22:30:49.200Z]       "manifest": "https://github.com/mozilla-releng/OpenCloudConfig/blob/c6a78c766f1a4903b8509191dea23ddb0a345534/userdata/Manifest/gecko-t-win7-32.json"
[taskcluster 2016-12-19T22:30:49.200Z]     }
[taskcluster 2016-12-19T22:30:49.200Z]   }


Looking at that commit ("https://github.com/mozilla-releng/OpenCloudConfig/blob/c6a78c766f1a4903b8509191dea23ddb0a345534/userdata/Manifest/gecko-t-win7-32.json") - I see it is the one immediately prior to your change - so it seems this was just a stale worker.

So I think we're good! Sorry for the noise!
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Component: Generic-Worker → Workers
You need to log in before you can comment on or make changes to this bug.