Closed
Bug 1493759
Opened 6 years ago
Closed 6 years ago
Windows 10 hardware nodes generic-worker wrapper script seems to terminate
Categories
(Infrastructure & Operations :: RelOps: OpenCloudConfig, task)
Infrastructure & Operations
RelOps: OpenCloudConfig
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: markco, Assigned: markco)
References
Details
Attachments
(2 files)
It looks like the script starts and then stops around 75 seconds later (based on the timeout here https://github.com/mozilla-releng/OpenCloudConfig/blob/6c3afdc8298ec84f874504ba6fc815639c8f2306/userdata/Configuration/GenericWorker/run-hw-generic-worker-10-and-reboot.bat#L3 )
From ms-153:
https://papertrailapp.com/groups/1141234/events?focus=980871647563243575&selected=980871647563243575
Sep 24 04:19:30 T-W1064-MS-153.mdc1.mozilla.com generic-worker: Checking for manifest completetion #015
Sep 24 06:20:47 T-W1064-MS-153.mdc1.mozilla.com generic-worker: 1 file(s) copied.#015
Sep 24 06:20:53 T-W1064-MS-153.mdc1.mozilla.com generic-worker: Checking for manifest completetion #015
Sep 24 06:20:58 T-W1064-MS-153.mdc1.mozilla.com generic-worker: Checking for manifest completetion #015
Sep 24 06:21:03 T-W1064-MS-153.mdc1.mozilla.com generic-worker: Checking for manifest completetion #015
Sep 24 06:21:09 T-W1064-MS-153.mdc1.mozilla.com generic-worker: Checking for manifest completetion #015
Sep 24 06:21:14 T-W1064-MS-153.mdc1.mozilla.com generic-worker: Checking for manifest completetion #015
Sep 24 06:21:19 T-W1064-MS-153.mdc1.mozilla.com generic-worker: Checking for manifest completetion #015
Sep 24 06:21:24 T-W1064-MS-153.mdc1.mozilla.com generic-worker: Checking for manifest completetion #015
Sep 24 06:21:29 T-W1064-MS-153.mdc1.mozilla.com generic-worker: Checking for manifest completetion #015
Sep 24 06:21:34 T-W1064-MS-153.mdc1.mozilla.com generic-worker: Checking for manifest completetion #015
Sep 24 06:21:39 T-W1064-MS-153.mdc1.mozilla.com generic-worker: Checking for manifest completetion #015
Sep 24 06:21:44 T-W1064-MS-153.mdc1.mozilla.com generic-worker: Checking for manifest completetion #015
Sep 24 06:21:49 T-W1064-MS-153.mdc1.mozilla.com generic-worker: Checking for manifest completetion #015
Sep 24 06:21:54 T-W1064-MS-153.mdc1.mozilla.com generic-worker: Checking for manifest completetion #015
Sep 24 06:21:59 T-W1064-MS-153.mdc1.mozilla.com generic-worker: Checking for manifest completetion #015
Sep 24 06:22:04 T-W1064-MS-153.mdc1.mozilla.com generic-worker: Checking for manifest completetion #015
Sep 24 08:23:20 T-W1064-MS-153.mdc1.mozilla.com generic-worker: 1 file(s) copied.#015
Sep 24 08:23:26 T-W1064-MS-153.mdc1.mozilla.com generic-worker: Checking for manifest completetion #015
Sep 24 08:23:31 T-W1064-MS-153.mdc1.mozilla.com generic-worker: Checking for manifest completetion #015
Sep 24 08:23:36 T-W1064-MS-153.mdc1.mozilla.com generic-worker: Checking for manifest completetion #015
Sep 24 08:23:41 T-W1064-MS-153.mdc1.mozilla.com generic-worker: Checking for manifest completetion #015
Sep 24 08:23:47 T-W1064-MS-153.mdc1.mozilla.com generic-worker: Checking for manifest completetion #015
Sep 24 08:23:52 T-W1064-MS-153.mdc1.mozilla.com generic-worker: Checking for manifest completetion #015
Sep 24 08:23:57 T-W1064-MS-153.mdc1.mozilla.com generic-worker: Checking for manifest completetion #015
Sep 24 08:24:02 T-W1064-MS-153.mdc1.mozilla.com generic-worker: Checking for manifest completetion #015
Sep 24 08:24:07 T-W1064-MS-153.mdc1.mozilla.com generic-worker: Checking for manifest completetion #015
Sep 24 08:24:12 T-W1064-MS-153.mdc1.mozilla.com generic-worker: Checking for manifest completetion #015
Sep 24 08:24:17 T-W1064-MS-153.mdc1.mozilla.com generic-worker: Checking for manifest completetion #015
Sep 24 08:24:22 T-W1064-MS-153.mdc1.mozilla.com generic-worker: Checking for manifest completetion #015
Sep 24 08:24:27 T-W1064-MS-153.mdc1.mozilla.com generic-worker: Checking for manifest completetion #015
Sep 24 08:24:32 T-W1064-MS-153.mdc1.mozilla.com generic-worker: Checking for manifest completetion #015
Sep 24 08:24:37 T-W1064-MS-153.mdc1.mozilla.com generic-worker: Checking for manifest completetion #015
Sep 24 10:19:09 T-W1064-MS-153.mdc1.mozilla.com generic-worker: 1 file(s) copied.#015
Sep 24 10:19:09 T-W1064-MS-153.mdc1.mozilla.com generic-worker: Checking for manifest completetion #015
Sep 24 10:19:09 T-W1064-MS-153.mdc1.mozilla.com generic-worker: Checking for manifest completetion #015
Sep 24 10:19:13 T-W1064-MS-153.mdc1.mozilla.com generic-worker: Checking for manifest completetion #015
Sep 24 10:19:19 T-W1064-MS-153.mdc1.mozilla.com generic-worker: Checking for manifest completetion #015
Sep 24 10:19:24 T-W1064-MS-153.mdc1.mozilla.com generic-worker: Checking for manifest completetion #015
Sep 24 10:19:29 T-W1064-MS-153.mdc1.mozilla.com generic-worker: Checking for manifest completetion #015
Sep 24 10:19:34 T-W1064-MS-153.mdc1.mozilla.com generic-worker: Checking for manifest completetion #015
Sep 24 10:19:39 T-W1064-MS-153.mdc1.mozilla.com generic-worker: Checking for manifest completetion #015
Sep 24 10:19:44 T-W1064-MS-153.mdc1.mozilla.com generic-worker: Checking for manifest completetion #015
Sep 24 10:19:49 T-W1064-MS-153.mdc1.mozilla.com generic-worker: Checking for manifest completetion #015
Sep 24 10:19:54 T-W1064-MS-153.mdc1.mozilla.com generic-worker: Checking for manifest completetion #015
Sep 24 10:19:59 T-W1064-MS-153.mdc1.mozilla.com generic-worker: Checking for manifest completetion #015
Sep 24 10:20:04 T-W1064-MS-153.mdc1.mozilla.com generic-worker: Checking for manifest completetion #015
Sep 24 10:20:09 T-W1064-MS-153.mdc1.mozilla.com generic-worker: Checking for manifest completetion #015
Sep 24 10:20:14 T-W1064-MS-153.mdc1.mozilla.com generic-worker: Checking for manifest completetion #015
OCC runs and ends but appears to be missing portions:
Sep 24 10:19:09 T-W1064-MS-153.mdc1.mozilla.com OpenCloudConfig: Set-DefaultStrongCryptography :: begin#015
Sep 24 10:19:09 T-W1064-MS-153.mdc1.mozilla.com OpenCloudConfig: Set-DefaultStrongCryptography :: CLRVersion: 4.0.30319.42000, PSVersion: 5.1.15063.0#015
Sep 24 10:19:09 T-W1064-MS-153.mdc1.mozilla.com OpenCloudConfig: Set-DefaultStrongCryptography :: SecurityProtocol: Tls, Tls11, Tls12#015
Sep 24 10:19:09 T-W1064-MS-153.mdc1.mozilla.com OpenCloudConfig: Set-DefaultStrongCryptography :: Detected TLS v1.2 in security protocol support list#015
Sep 24 10:19:09 T-W1064-MS-153.mdc1.mozilla.com OpenCloudConfig: Set-DefaultStrongCryptography :: Detected registry setting to use strong cryptography on 64 bit .Net Framework#015
Sep 24 10:19:09 T-W1064-MS-153.mdc1.mozilla.com OpenCloudConfig: Set-DefaultStrongCryptography :: Detected registry setting to use strong cryptography on 32 bit .Net Framework#015
Sep 24 10:19:09 T-W1064-MS-153.mdc1.mozilla.com OpenCloudConfig: Set-DefaultStrongCryptography :: SecurityProtocol: Tls, Tls11, Tls12#015
Sep 24 10:19:09 T-W1064-MS-153.mdc1.mozilla.com OpenCloudConfig: Set-DefaultStrongCryptography :: end#015
Sep 24 10:19:09 T-W1064-MS-153.mdc1.mozilla.com OpenCloudConfig: Enabling Windows update service#015
Sep 24 10:19:09 T-W1064-MS-153.mdc1.mozilla.com OpenCloudConfig: hw-DiskManage :: path: C:\windows\SoftwareDistribution\Download\*, deleted.#015
Sep 24 10:19:09 T-W1064-MS-153.mdc1.mozilla.com OpenCloudConfig: hw-DiskManage :: path: C:\ProgramData\Package Cache, deleted.#015
Sep 24 10:19:09 T-W1064-MS-153.mdc1.mozilla.com OpenCloudConfig: Current free space of drive C 34523 MB#015
Sep 24 10:19:09 T-W1064-MS-153.mdc1.mozilla.com OpenCloudConfig: Current free space percentage of drive C 55%#015
Sep 24 10:19:09 T-W1064-MS-153.mdc1.mozilla.com OpenCloudConfig: userdata run starting.#015
Sep 24 10:19:09 T-W1064-MS-153.mdc1.mozilla.com OpenCloudConfig: system timezone set to UTC.#015
Sep 24 10:19:09 T-W1064-MS-153.mdc1.mozilla.com OpenCloudConfig: system clock synchronised.#015
Sep 24 10:19:09 T-W1064-MS-153.mdc1.mozilla.com OpenCloudConfig: Create-ScheduledPowershellTask :: begin#015
Sep 24 10:19:11 T-W1064-MS-153.mdc1.mozilla.com OpenCloudConfig: Create-ScheduledPowershellTask :: scheduled task: PrepLoaner deleted.#015
Sep 24 10:19:11 T-W1064-MS-153.mdc1.mozilla.com OpenCloudConfig: Create-ScheduledPowershellTask :: C:\dsc\PrepLoaner.ps1 deleted.#015
Sep 24 10:19:11 T-W1064-MS-153.mdc1.mozilla.com OpenCloudConfig: Create-ScheduledPowershellTask :: C:\dsc\PrepLoaner.ps1 downloaded from https://raw.githubusercontent.com/mozilla-releng/OpenCloudConfig/master/userdata/PrepLoaner.ps1?1645ab71-6071-46cf-87bd-ddca6c8b894e.#015
Sep 24 10:19:11 T-W1064-MS-153.mdc1.mozilla.com OpenCloudConfig: Create-ScheduledPowershellTask :: scheduled task: PrepLoaner created.#015
Sep 24 10:19:11 T-W1064-MS-153.mdc1.mozilla.com OpenCloudConfig: Create-ScheduledPowershellTask :: end#015
However the flags files that wrapper script are being created (time in papertrail are PDT and local machine is UTC)
C:\dsc>dir
Volume in drive C is Windows
Volume Serial Number is 6465-AA44
Directory of C:\dsc
09/24/2018 05:21 PM <DIR> .
09/24/2018 05:21 PM <DIR> ..
09/14/2018 02:10 AM 29,184 configmymonitor.exe
09/14/2018 12:52 AM <DIR> DeviceManagment
09/24/2018 05:21 PM 1 EndOfManifest.semaphore
03/27/2018 02:39 AM 2,649 graphic_reg.ps1
06/11/2018 06:36 PM 0 GW10.semaphore
07/30/2018 08:54 PM 448 hw-startup-check.ps1
09/24/2018 05:19 PM 0 in-progress.lock
08/21/2018 05:27 PM 2,122 location.ps1
09/14/2018 02:03 AM <DIR> MozillaMaintenance
08/23/2018 06:02 PM 2,020 OCC_1st_runw10.bat
08/16/2018 07:59 PM 2,209 onedrive.ps1
09/24/2018 05:19 PM 20,621 PrepLoaner.ps1
09/24/2018 06:37 AM 73,456 rundsc.ps1
09/14/2018 02:00 AM <DIR> semaphore
03/20/2018 05:29 AM 67 ultravnc.ini
10/24/2017 05:01 AM 173 vncserver.inf
13 File(s) 132,950 bytes
5 Dir(s) 34,493,997,056 bytes free
Assignee | ||
Comment 1•6 years ago
|
||
> However the flags files that wrapper script are being created
However the flags files that the wrapper script is waiting on are being created.
Assignee | ||
Comment 2•6 years ago
|
||
This node was working as expected earlier:
Sep 24 00:11:48 T-W1064-MS-153.mdc1.mozilla.com generic-worker: 2018/09/24 07:11:47 Exiting worker with exit code 67#015
Sep 24 00:11:57 T-W1064-MS-153.mdc1.mozilla.com generic-worker: "Generic worker has exited. Waiting on reboot. 0 minutes since command." #015
Sep 24 00:13:13 T-W1064-MS-153.mdc1.mozilla.com generic-worker: ^CTerminate batch job (Y/N)? 1 file(s) copied.#015
Sep 24 00:13:15 T-W1064-MS-153.mdc1.mozilla.com generic-worker: Checking for manifest completetion #015
And OCC was completing successfully before this time as well:
https://papertrailapp.com/groups/1141234/events?focus=980770682264784898&q=ms-153+AND+OpenCloudConfig&selected=980770682264784898
Sep 23 23:37:28 T-W1064-MS-153.mdc1.mozilla.com generic-worker: "manifest": "https://github.com/markcor/OpenCloudConfig/blob/master/userdata/Manifest/gecko-t-win10-64-hw.json"#015
Sep 23 23:39:33 T-W1064-MS-153.mdc1.mozilla.com OpenCloudConfig: log archive C:\log\20180924063430.userdata-run.zip created.#015
Sep 23 23:39:34 T-W1064-MS-153.mdc1.mozilla.com OpenCloudConfig: generic-worker installation detected.#015
Sep 23 23:39:34 T-W1064-MS-153.mdc1.mozilla.com OpenCloudConfig: waiting for generic-worker process to start.#015
Sep 23 23:39:34 T-W1064-MS-153.mdc1.mozilla.com generic-worker: "manifest": "https://github.com/markcor/OpenCloudConfig/blob/master/userdata/Manifest/gecko-t-win10-64-hw.json"#015
Sep 23 23:39:35 T-W1064-MS-153.mdc1.mozilla.com OpenCloudConfig: generic-worker running process detected 756 ms after task-claim-state.valid flag set.#015
Sep 23 23:39:35 T-W1064-MS-153.mdc1.mozilla.com OpenCloudConfig: process priority for generic worker altered from Normal to AboveNormal.#015
Sep 23 23:39:35 T-W1064-MS-153.mdc1.mozilla.com OpenCloudConfig: userdata run completed#015
Sep 24 at 12:11 AM
Sep 24 00:13:13 T-W1064-MS-153.mdc1.mozilla.com OpenCloudConfig: Set-DefaultStrongCryptography :: begin#015
Assignee | ||
Comment 3•6 years ago
|
||
In process of troubleshooting I deleted:
the lock file
flag files
the gpg key
all old user directories
There was also multiple reboots. The wrapper script then completed and run the generic-worker exe command. I will need to find another node exhibiting this behavior to try to square down what is causing the issue.
Assignee | ||
Comment 4•6 years ago
|
||
The task user init script is not completing either, https://github.com/mozilla-releng/OpenCloudConfig/blob/master/userdata/Configuration/GenericWorker/task-user-init-win10.cmd. The task bar is not hidden.
Comment 5•6 years ago
|
||
i've found a candidate for where this problem could be coming from.
gecko-t-win10-64-gpu-b instances running a new ami (still based on 1703) started to exhibit a similar behaviour today. i noticed some messages in the pt logs appearing half way through the dsc run which say:
>> TerminatingError(): "The pipeline has been stopped."
a little googling suggested that a microsoft update to a powershell component changed the way that output is handled in a powershell transcript.
we have a validation called FilesContain which is used as a validator for some components like the MozillaBuild install which requires that a file at C:\mozilla-build\VERSION contains the correct version number of MozillaBuild.
the validation has been working fine for a couple years but today on gecko-t-win10-64-gpu-b when the version file does not exist, the validation triggers an output that isn't handled well and causes the transcript code to throw the TerminatingError mentioned above.
i've modified the code to explicitly handle missing files as well as the already handled files-contain or files-don't-contain statuses and i've seen no further TerminatingError exceptions since patching (https://github.com/mozilla-releng/OpenCloudConfig/commit/d32d034).
i will continue to monitor over the weekend. let me know if you start to see the hardware instances fall into line as well. they should automatically update the offending module (OCC-Validate.psm1) when xDynamicConfig.ps1 runs between tasks.
hopefully it was the same issue affecting win 10 hardware as we saw in gecko-t-win10-64-gpu-b.
Assignee | ||
Updated•6 years ago
|
Depends on: T-W1064-MS-116
Comment 6•6 years ago
|
||
Found a few workers that aren't in TC, re-imaging them changes nothing in their state and the logs remain the same.
T-W1064-MS-159 < rebooted, reimaged, GW is not running
T-W1064-MS-211 < reimaged, GW is not running
T-W1064-MS-214 < reimaged, GW is not running
T-W1064-MS-478 < reimaged, GW is not running
T-W1064-MS-543 < reimaged, GW is not running
T-W1064-MS-581 < reimaged, GW is not running
T-W1064-MS-589 < reimaged, GW is not running
All of the above share a similarity in the logs. None have GW running. Spotted these log entries in a few of them:
https://papertrailapp.com/systems/1730894031/events?focus=984350109993177146&selected=984350109993177146
Comment 7•6 years ago
|
||
The following workers do not take up tasks or are missing from TC. I've rebooted and re-imaged each of them, but nothing changes.
T-W1064-MS-{016, 041, 066, 069, 076, 077, 111, 120, 129, 152, 159, 164, 172, 176, 177, 199, 202, 211, 222, 263, 319, 410, 422, 478, 543, 562, 589, 596}
These logs can be seen on each machines.
generic worker:
Oct 05 20:21:46 T-W1064-MS-016.mdc1.mozilla.com OpenCloudConfig: generic-worker installation detected.#015
Oct 05 20:21:46 T-W1064-MS-016.mdc1.mozilla.com OpenCloudConfig: waiting for generic-worker process to start.#015
Oct 05 20:22:15 T-W1064-MS-016.mdc1.mozilla.com Microsoft-Windows-DSC: The local configuration manager was shut down.#015
mlx:
Oct 05 20:40:15 T-W1064-MS-129.mdc1.mozilla.com mlx4eth63: Mellanox ConnectX-3 Pro Ethernet Adapter #2 device detected that the link connected to port 2 is down.
This can occur if the physical link is disconnected or damaged, or if the other end-port is down.#015
Comment 8•6 years ago
|
||
> generic worker:
> Oct 05 20:21:46 T-W1064-MS-016.mdc1.mozilla.com OpenCloudConfig:
> generic-worker installation detected.#015
> Oct 05 20:21:46 T-W1064-MS-016.mdc1.mozilla.com OpenCloudConfig: waiting for
> generic-worker process to start.#015
> Oct 05 20:22:15 T-W1064-MS-016.mdc1.mozilla.com Microsoft-Windows-DSC: The
> local configuration manager was shut down.#015
This looks to be the main cause for machines that ciduty finds in a broken state
> mlx:
> Oct 05 20:40:15 T-W1064-MS-129.mdc1.mozilla.com mlx4eth63: Mellanox
> ConnectX-3 Pro Ethernet Adapter #2 device detected that the link connected
> to port 2 is down.
> This can occur if the physical link is disconnected or damaged, or if the
> other end-port is down.#015
As discussed with markco, this is expected as moonshots have 2 ports and the second one isn't used.
Assignee | ||
Comment 9•6 years ago
|
||
If generic-worker.exe is not running within 30 minutes of startup reboot the machine. This doesn't fix the problem but provides a quicker recovery time once this problem is hit.
Assignee | ||
Updated•6 years ago
|
Attachment #9016344 -
Flags: review?(rthijssen)
Comment 10•6 years ago
|
||
Comment on attachment 9016344 [details] [review]
GitHub Pull Request
lgtm
Attachment #9016344 -
Flags: review?(rthijssen) → review+
Assignee | ||
Comment 11•6 years ago
|
||
I was tailing the logs for ms-017. I noticed that it appeared to stop running OCC and generic-worker wrapper script. This appeared in the logs around the same time:
Oct 11 10:57:03 T-W1064-MS-017.mdc1.mozilla.com Service_Control_Manager: The Function Discovery Resource Publication service hung on starting.#015
Oct 11 10:57:03 T-W1064-MS-017.mdc1.mozilla.com Service_Control_Manager: The HomeGroup Provider service depends on the Function Discovery Resource Publication service which failed to start because of the following error: After starting, the service hung in a start-pending state.#015
Oct 11 10:57:11 T-W1064-MS-017.mdc1.mozilla.com Service_Control_Manager: 2018-10-11 17:57:10 T-W1064-MS-017.mdc1.mozilla.com ERROR 7023 [The description for EventID 7023 from source Service Control Manager cannot be found: The description string for parameter
https://papertrailapp.com/groups/1141234/events?focus=987101768829800481&selected=987101768829800481
Comment 12•6 years ago
|
||
During my shift, I have found some machines that didn't took jobs for many hours from 7 h until a day. I have rebooted the following machines :
024, 027, 042, 074, 085, 086, 121, 127, 179, 217, 220, 221, 223, 241, 242, 245, 251, 262, 281, 417, 455, 499, 517, 524, 559, 569, 597
After a while I've checked them and still most of them didn't got jobs, so, I've reimaged them.
Assignee | ||
Comment 13•6 years ago
|
||
Post upgrade to generic-worker 10.11.2 this behavior has seen to gone away. However there is still a similar behavior with rundsc.ps1 not completing (Bug 1499801).
See Also: → 1499801
Assignee | ||
Updated•6 years ago
|
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
You need to log in
before you can comment on or make changes to this bug.
Description
•