Closed Bug 1493759 Opened 6 years ago Closed 6 years ago

Windows 10 hardware nodes generic-worker wrapper script seems to terminate

Categories

(Infrastructure & Operations :: RelOps: OpenCloudConfig, task)

task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: markco, Assigned: markco)

References

Details

Attachments

(2 files)

It looks like the script starts and then stops around 75 seconds later (based on the timeout here https://github.com/mozilla-releng/OpenCloudConfig/blob/6c3afdc8298ec84f874504ba6fc815639c8f2306/userdata/Configuration/GenericWorker/run-hw-generic-worker-10-and-reboot.bat#L3 ) From ms-153: https://papertrailapp.com/groups/1141234/events?focus=980871647563243575&selected=980871647563243575 Sep 24 04:19:30 T-W1064-MS-153.mdc1.mozilla.com generic-worker: Checking for manifest completetion #015 Sep 24 06:20:47 T-W1064-MS-153.mdc1.mozilla.com generic-worker: 1 file(s) copied.#015 Sep 24 06:20:53 T-W1064-MS-153.mdc1.mozilla.com generic-worker: Checking for manifest completetion #015 Sep 24 06:20:58 T-W1064-MS-153.mdc1.mozilla.com generic-worker: Checking for manifest completetion #015 Sep 24 06:21:03 T-W1064-MS-153.mdc1.mozilla.com generic-worker: Checking for manifest completetion #015 Sep 24 06:21:09 T-W1064-MS-153.mdc1.mozilla.com generic-worker: Checking for manifest completetion #015 Sep 24 06:21:14 T-W1064-MS-153.mdc1.mozilla.com generic-worker: Checking for manifest completetion #015 Sep 24 06:21:19 T-W1064-MS-153.mdc1.mozilla.com generic-worker: Checking for manifest completetion #015 Sep 24 06:21:24 T-W1064-MS-153.mdc1.mozilla.com generic-worker: Checking for manifest completetion #015 Sep 24 06:21:29 T-W1064-MS-153.mdc1.mozilla.com generic-worker: Checking for manifest completetion #015 Sep 24 06:21:34 T-W1064-MS-153.mdc1.mozilla.com generic-worker: Checking for manifest completetion #015 Sep 24 06:21:39 T-W1064-MS-153.mdc1.mozilla.com generic-worker: Checking for manifest completetion #015 Sep 24 06:21:44 T-W1064-MS-153.mdc1.mozilla.com generic-worker: Checking for manifest completetion #015 Sep 24 06:21:49 T-W1064-MS-153.mdc1.mozilla.com generic-worker: Checking for manifest completetion #015 Sep 24 06:21:54 T-W1064-MS-153.mdc1.mozilla.com generic-worker: Checking for manifest completetion #015 Sep 24 06:21:59 T-W1064-MS-153.mdc1.mozilla.com generic-worker: Checking for manifest completetion #015 Sep 24 06:22:04 T-W1064-MS-153.mdc1.mozilla.com generic-worker: Checking for manifest completetion #015 Sep 24 08:23:20 T-W1064-MS-153.mdc1.mozilla.com generic-worker: 1 file(s) copied.#015 Sep 24 08:23:26 T-W1064-MS-153.mdc1.mozilla.com generic-worker: Checking for manifest completetion #015 Sep 24 08:23:31 T-W1064-MS-153.mdc1.mozilla.com generic-worker: Checking for manifest completetion #015 Sep 24 08:23:36 T-W1064-MS-153.mdc1.mozilla.com generic-worker: Checking for manifest completetion #015 Sep 24 08:23:41 T-W1064-MS-153.mdc1.mozilla.com generic-worker: Checking for manifest completetion #015 Sep 24 08:23:47 T-W1064-MS-153.mdc1.mozilla.com generic-worker: Checking for manifest completetion #015 Sep 24 08:23:52 T-W1064-MS-153.mdc1.mozilla.com generic-worker: Checking for manifest completetion #015 Sep 24 08:23:57 T-W1064-MS-153.mdc1.mozilla.com generic-worker: Checking for manifest completetion #015 Sep 24 08:24:02 T-W1064-MS-153.mdc1.mozilla.com generic-worker: Checking for manifest completetion #015 Sep 24 08:24:07 T-W1064-MS-153.mdc1.mozilla.com generic-worker: Checking for manifest completetion #015 Sep 24 08:24:12 T-W1064-MS-153.mdc1.mozilla.com generic-worker: Checking for manifest completetion #015 Sep 24 08:24:17 T-W1064-MS-153.mdc1.mozilla.com generic-worker: Checking for manifest completetion #015 Sep 24 08:24:22 T-W1064-MS-153.mdc1.mozilla.com generic-worker: Checking for manifest completetion #015 Sep 24 08:24:27 T-W1064-MS-153.mdc1.mozilla.com generic-worker: Checking for manifest completetion #015 Sep 24 08:24:32 T-W1064-MS-153.mdc1.mozilla.com generic-worker: Checking for manifest completetion #015 Sep 24 08:24:37 T-W1064-MS-153.mdc1.mozilla.com generic-worker: Checking for manifest completetion #015 Sep 24 10:19:09 T-W1064-MS-153.mdc1.mozilla.com generic-worker: 1 file(s) copied.#015 Sep 24 10:19:09 T-W1064-MS-153.mdc1.mozilla.com generic-worker: Checking for manifest completetion #015 Sep 24 10:19:09 T-W1064-MS-153.mdc1.mozilla.com generic-worker: Checking for manifest completetion #015 Sep 24 10:19:13 T-W1064-MS-153.mdc1.mozilla.com generic-worker: Checking for manifest completetion #015 Sep 24 10:19:19 T-W1064-MS-153.mdc1.mozilla.com generic-worker: Checking for manifest completetion #015 Sep 24 10:19:24 T-W1064-MS-153.mdc1.mozilla.com generic-worker: Checking for manifest completetion #015 Sep 24 10:19:29 T-W1064-MS-153.mdc1.mozilla.com generic-worker: Checking for manifest completetion #015 Sep 24 10:19:34 T-W1064-MS-153.mdc1.mozilla.com generic-worker: Checking for manifest completetion #015 Sep 24 10:19:39 T-W1064-MS-153.mdc1.mozilla.com generic-worker: Checking for manifest completetion #015 Sep 24 10:19:44 T-W1064-MS-153.mdc1.mozilla.com generic-worker: Checking for manifest completetion #015 Sep 24 10:19:49 T-W1064-MS-153.mdc1.mozilla.com generic-worker: Checking for manifest completetion #015 Sep 24 10:19:54 T-W1064-MS-153.mdc1.mozilla.com generic-worker: Checking for manifest completetion #015 Sep 24 10:19:59 T-W1064-MS-153.mdc1.mozilla.com generic-worker: Checking for manifest completetion #015 Sep 24 10:20:04 T-W1064-MS-153.mdc1.mozilla.com generic-worker: Checking for manifest completetion #015 Sep 24 10:20:09 T-W1064-MS-153.mdc1.mozilla.com generic-worker: Checking for manifest completetion #015 Sep 24 10:20:14 T-W1064-MS-153.mdc1.mozilla.com generic-worker: Checking for manifest completetion #015 OCC runs and ends but appears to be missing portions: Sep 24 10:19:09 T-W1064-MS-153.mdc1.mozilla.com OpenCloudConfig: Set-DefaultStrongCryptography :: begin#015 Sep 24 10:19:09 T-W1064-MS-153.mdc1.mozilla.com OpenCloudConfig: Set-DefaultStrongCryptography :: CLRVersion: 4.0.30319.42000, PSVersion: 5.1.15063.0#015 Sep 24 10:19:09 T-W1064-MS-153.mdc1.mozilla.com OpenCloudConfig: Set-DefaultStrongCryptography :: SecurityProtocol: Tls, Tls11, Tls12#015 Sep 24 10:19:09 T-W1064-MS-153.mdc1.mozilla.com OpenCloudConfig: Set-DefaultStrongCryptography :: Detected TLS v1.2 in security protocol support list#015 Sep 24 10:19:09 T-W1064-MS-153.mdc1.mozilla.com OpenCloudConfig: Set-DefaultStrongCryptography :: Detected registry setting to use strong cryptography on 64 bit .Net Framework#015 Sep 24 10:19:09 T-W1064-MS-153.mdc1.mozilla.com OpenCloudConfig: Set-DefaultStrongCryptography :: Detected registry setting to use strong cryptography on 32 bit .Net Framework#015 Sep 24 10:19:09 T-W1064-MS-153.mdc1.mozilla.com OpenCloudConfig: Set-DefaultStrongCryptography :: SecurityProtocol: Tls, Tls11, Tls12#015 Sep 24 10:19:09 T-W1064-MS-153.mdc1.mozilla.com OpenCloudConfig: Set-DefaultStrongCryptography :: end#015 Sep 24 10:19:09 T-W1064-MS-153.mdc1.mozilla.com OpenCloudConfig: Enabling Windows update service#015 Sep 24 10:19:09 T-W1064-MS-153.mdc1.mozilla.com OpenCloudConfig: hw-DiskManage :: path: C:\windows\SoftwareDistribution\Download\*, deleted.#015 Sep 24 10:19:09 T-W1064-MS-153.mdc1.mozilla.com OpenCloudConfig: hw-DiskManage :: path: C:\ProgramData\Package Cache, deleted.#015 Sep 24 10:19:09 T-W1064-MS-153.mdc1.mozilla.com OpenCloudConfig: Current free space of drive C 34523 MB#015 Sep 24 10:19:09 T-W1064-MS-153.mdc1.mozilla.com OpenCloudConfig: Current free space percentage of drive C 55%#015 Sep 24 10:19:09 T-W1064-MS-153.mdc1.mozilla.com OpenCloudConfig: userdata run starting.#015 Sep 24 10:19:09 T-W1064-MS-153.mdc1.mozilla.com OpenCloudConfig: system timezone set to UTC.#015 Sep 24 10:19:09 T-W1064-MS-153.mdc1.mozilla.com OpenCloudConfig: system clock synchronised.#015 Sep 24 10:19:09 T-W1064-MS-153.mdc1.mozilla.com OpenCloudConfig: Create-ScheduledPowershellTask :: begin#015 Sep 24 10:19:11 T-W1064-MS-153.mdc1.mozilla.com OpenCloudConfig: Create-ScheduledPowershellTask :: scheduled task: PrepLoaner deleted.#015 Sep 24 10:19:11 T-W1064-MS-153.mdc1.mozilla.com OpenCloudConfig: Create-ScheduledPowershellTask :: C:\dsc\PrepLoaner.ps1 deleted.#015 Sep 24 10:19:11 T-W1064-MS-153.mdc1.mozilla.com OpenCloudConfig: Create-ScheduledPowershellTask :: C:\dsc\PrepLoaner.ps1 downloaded from https://raw.githubusercontent.com/mozilla-releng/OpenCloudConfig/master/userdata/PrepLoaner.ps1?1645ab71-6071-46cf-87bd-ddca6c8b894e.#015 Sep 24 10:19:11 T-W1064-MS-153.mdc1.mozilla.com OpenCloudConfig: Create-ScheduledPowershellTask :: scheduled task: PrepLoaner created.#015 Sep 24 10:19:11 T-W1064-MS-153.mdc1.mozilla.com OpenCloudConfig: Create-ScheduledPowershellTask :: end#015 However the flags files that wrapper script are being created (time in papertrail are PDT and local machine is UTC) C:\dsc>dir Volume in drive C is Windows Volume Serial Number is 6465-AA44 Directory of C:\dsc 09/24/2018 05:21 PM <DIR> . 09/24/2018 05:21 PM <DIR> .. 09/14/2018 02:10 AM 29,184 configmymonitor.exe 09/14/2018 12:52 AM <DIR> DeviceManagment 09/24/2018 05:21 PM 1 EndOfManifest.semaphore 03/27/2018 02:39 AM 2,649 graphic_reg.ps1 06/11/2018 06:36 PM 0 GW10.semaphore 07/30/2018 08:54 PM 448 hw-startup-check.ps1 09/24/2018 05:19 PM 0 in-progress.lock 08/21/2018 05:27 PM 2,122 location.ps1 09/14/2018 02:03 AM <DIR> MozillaMaintenance 08/23/2018 06:02 PM 2,020 OCC_1st_runw10.bat 08/16/2018 07:59 PM 2,209 onedrive.ps1 09/24/2018 05:19 PM 20,621 PrepLoaner.ps1 09/24/2018 06:37 AM 73,456 rundsc.ps1 09/14/2018 02:00 AM <DIR> semaphore 03/20/2018 05:29 AM 67 ultravnc.ini 10/24/2017 05:01 AM 173 vncserver.inf 13 File(s) 132,950 bytes 5 Dir(s) 34,493,997,056 bytes free
Blocks: 1490398
> However the flags files that wrapper script are being created However the flags files that the wrapper script is waiting on are being created.
This node was working as expected earlier: Sep 24 00:11:48 T-W1064-MS-153.mdc1.mozilla.com generic-worker: 2018/09/24 07:11:47 Exiting worker with exit code 67#015 Sep 24 00:11:57 T-W1064-MS-153.mdc1.mozilla.com generic-worker: "Generic worker has exited. Waiting on reboot. 0 minutes since command." #015 Sep 24 00:13:13 T-W1064-MS-153.mdc1.mozilla.com generic-worker: ^CTerminate batch job (Y/N)? 1 file(s) copied.#015 Sep 24 00:13:15 T-W1064-MS-153.mdc1.mozilla.com generic-worker: Checking for manifest completetion #015 And OCC was completing successfully before this time as well: https://papertrailapp.com/groups/1141234/events?focus=980770682264784898&q=ms-153+AND+OpenCloudConfig&selected=980770682264784898 Sep 23 23:37:28 T-W1064-MS-153.mdc1.mozilla.com generic-worker: "manifest": "https://github.com/markcor/OpenCloudConfig/blob/master/userdata/Manifest/gecko-t-win10-64-hw.json"#015 Sep 23 23:39:33 T-W1064-MS-153.mdc1.mozilla.com OpenCloudConfig: log archive C:\log\20180924063430.userdata-run.zip created.#015 Sep 23 23:39:34 T-W1064-MS-153.mdc1.mozilla.com OpenCloudConfig: generic-worker installation detected.#015 Sep 23 23:39:34 T-W1064-MS-153.mdc1.mozilla.com OpenCloudConfig: waiting for generic-worker process to start.#015 Sep 23 23:39:34 T-W1064-MS-153.mdc1.mozilla.com generic-worker: "manifest": "https://github.com/markcor/OpenCloudConfig/blob/master/userdata/Manifest/gecko-t-win10-64-hw.json"#015 Sep 23 23:39:35 T-W1064-MS-153.mdc1.mozilla.com OpenCloudConfig: generic-worker running process detected 756 ms after task-claim-state.valid flag set.#015 Sep 23 23:39:35 T-W1064-MS-153.mdc1.mozilla.com OpenCloudConfig: process priority for generic worker altered from Normal to AboveNormal.#015 Sep 23 23:39:35 T-W1064-MS-153.mdc1.mozilla.com OpenCloudConfig: userdata run completed#015 Sep 24 at 12:11 AM Sep 24 00:13:13 T-W1064-MS-153.mdc1.mozilla.com OpenCloudConfig: Set-DefaultStrongCryptography :: begin#015
In process of troubleshooting I deleted: the lock file flag files the gpg key all old user directories There was also multiple reboots. The wrapper script then completed and run the generic-worker exe command. I will need to find another node exhibiting this behavior to try to square down what is causing the issue.
i've found a candidate for where this problem could be coming from. gecko-t-win10-64-gpu-b instances running a new ami (still based on 1703) started to exhibit a similar behaviour today. i noticed some messages in the pt logs appearing half way through the dsc run which say: >> TerminatingError(): "The pipeline has been stopped." a little googling suggested that a microsoft update to a powershell component changed the way that output is handled in a powershell transcript. we have a validation called FilesContain which is used as a validator for some components like the MozillaBuild install which requires that a file at C:\mozilla-build\VERSION contains the correct version number of MozillaBuild. the validation has been working fine for a couple years but today on gecko-t-win10-64-gpu-b when the version file does not exist, the validation triggers an output that isn't handled well and causes the transcript code to throw the TerminatingError mentioned above. i've modified the code to explicitly handle missing files as well as the already handled files-contain or files-don't-contain statuses and i've seen no further TerminatingError exceptions since patching (https://github.com/mozilla-releng/OpenCloudConfig/commit/d32d034). i will continue to monitor over the weekend. let me know if you start to see the hardware instances fall into line as well. they should automatically update the offending module (OCC-Validate.psm1) when xDynamicConfig.ps1 runs between tasks. hopefully it was the same issue affecting win 10 hardware as we saw in gecko-t-win10-64-gpu-b.
Depends on: T-W1064-MS-116
Found a few workers that aren't in TC, re-imaging them changes nothing in their state and the logs remain the same. T-W1064-MS-159 < rebooted, reimaged, GW is not running T-W1064-MS-211 < reimaged, GW is not running T-W1064-MS-214 < reimaged, GW is not running T-W1064-MS-478 < reimaged, GW is not running T-W1064-MS-543 < reimaged, GW is not running T-W1064-MS-581 < reimaged, GW is not running T-W1064-MS-589 < reimaged, GW is not running All of the above share a similarity in the logs. None have GW running. Spotted these log entries in a few of them: https://papertrailapp.com/systems/1730894031/events?focus=984350109993177146&selected=984350109993177146
The following workers do not take up tasks or are missing from TC. I've rebooted and re-imaged each of them, but nothing changes. T-W1064-MS-{016, 041, 066, 069, 076, 077, 111, 120, 129, 152, 159, 164, 172, 176, 177, 199, 202, 211, 222, 263, 319, 410, 422, 478, 543, 562, 589, 596} These logs can be seen on each machines. generic worker: Oct 05 20:21:46 T-W1064-MS-016.mdc1.mozilla.com OpenCloudConfig: generic-worker installation detected.#015 Oct 05 20:21:46 T-W1064-MS-016.mdc1.mozilla.com OpenCloudConfig: waiting for generic-worker process to start.#015 Oct 05 20:22:15 T-W1064-MS-016.mdc1.mozilla.com Microsoft-Windows-DSC: The local configuration manager was shut down.#015 mlx: Oct 05 20:40:15 T-W1064-MS-129.mdc1.mozilla.com mlx4eth63: Mellanox ConnectX-3 Pro Ethernet Adapter #2 device detected that the link connected to port 2 is down. This can occur if the physical link is disconnected or damaged, or if the other end-port is down.#015
> generic worker: > Oct 05 20:21:46 T-W1064-MS-016.mdc1.mozilla.com OpenCloudConfig: > generic-worker installation detected.#015 > Oct 05 20:21:46 T-W1064-MS-016.mdc1.mozilla.com OpenCloudConfig: waiting for > generic-worker process to start.#015 > Oct 05 20:22:15 T-W1064-MS-016.mdc1.mozilla.com Microsoft-Windows-DSC: The > local configuration manager was shut down.#015 This looks to be the main cause for machines that ciduty finds in a broken state > mlx: > Oct 05 20:40:15 T-W1064-MS-129.mdc1.mozilla.com mlx4eth63: Mellanox > ConnectX-3 Pro Ethernet Adapter #2 device detected that the link connected > to port 2 is down. > This can occur if the physical link is disconnected or damaged, or if the > other end-port is down.#015 As discussed with markco, this is expected as moonshots have 2 ports and the second one isn't used.
Attached file GitHub Pull Request
If generic-worker.exe is not running within 30 minutes of startup reboot the machine. This doesn't fix the problem but provides a quicker recovery time once this problem is hit.
Attachment #9016344 - Flags: review?(rthijssen)
Comment on attachment 9016344 [details] [review] GitHub Pull Request lgtm
Attachment #9016344 - Flags: review?(rthijssen) → review+
Attached image ms-017.jpg
I was tailing the logs for ms-017. I noticed that it appeared to stop running OCC and generic-worker wrapper script. This appeared in the logs around the same time: Oct 11 10:57:03 T-W1064-MS-017.mdc1.mozilla.com Service_Control_Manager: The Function Discovery Resource Publication service hung on starting.#015 Oct 11 10:57:03 T-W1064-MS-017.mdc1.mozilla.com Service_Control_Manager: The HomeGroup Provider service depends on the Function Discovery Resource Publication service which failed to start because of the following error: After starting, the service hung in a start-pending state.#015 Oct 11 10:57:11 T-W1064-MS-017.mdc1.mozilla.com Service_Control_Manager: 2018-10-11 17:57:10 T-W1064-MS-017.mdc1.mozilla.com ERROR 7023 [The description for EventID 7023 from source Service Control Manager cannot be found: The description string for parameter https://papertrailapp.com/groups/1141234/events?focus=987101768829800481&selected=987101768829800481
During my shift, I have found some machines that didn't took jobs for many hours from 7 h until a day. I have rebooted the following machines : 024, 027, 042, 074, 085, 086, 121, 127, 179, 217, 220, 221, 223, 241, 242, 245, 251, 262, 281, 417, 455, 499, 517, 524, 559, 569, 597 After a while I've checked them and still most of them didn't got jobs, so, I've reimaged them.
Post upgrade to generic-worker 10.11.2 this behavior has seen to gone away. However there is still a similar behavior with rundsc.ps1 not completing (Bug 1499801).
See Also: → 1499801
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: