Closed Bug 1496526 Opened 6 years ago Closed 6 years ago

Generic-worker service is not being installed on newly deployed Windows moonshot nodes

Categories

(Infrastructure & Operations :: RelOps: OpenCloudConfig, task)

task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: markco, Assigned: markco)

Attachments

(1 file, 1 obsolete file)

https://bugzilla.mozilla.org/show_bug.cgi?id=1493759#c6

(In reply to Zsolt Fay [:zfay] from comment #6)
> Found a few workers that aren't in TC, re-imaging them changes nothing in
> their state and the logs remain the same.
> T-W1064-MS-159 < rebooted, reimaged, GW is not running
> T-W1064-MS-211 < reimaged, GW is not running
> T-W1064-MS-214 < reimaged, GW is not running
> T-W1064-MS-478 < reimaged, GW is not running
> T-W1064-MS-543 < reimaged, GW is not running
> T-W1064-MS-581 < reimaged, GW is not running
> T-W1064-MS-589 < reimaged, GW is not running
>
> All of the above share a similarity in the logs. None have GW running.
> Spotted these log entries in a few of them:
> https://papertrailapp.com/systems/1730894031/events?focus=984350109993177146&selected=984350109993177146

I did a fresh install on ms-016. The Papertrail log starts here:
https://papertrailapp.com/groups/1141234/events?focus=984562529676201988&q=ms-016&selected=984562529676201988

The node deployed and ran OCC, but it keeps rebooting with this message:

Oct 04 11:12:24 T-W1064-MS-016.mdc1.mozilla.com User32: The process C:\windows\system32\shutdown.exe (T-W1064-MS-016) has initiated the restart of computer T-W1064-MS-016 on behalf of user NT AUTHORITY\SYSTEM for the following reason: Application: Unresponsive Reason Code: 0x40005 Shutdown Type: restart Comment: reboot to rouse the generic worker#015

The generic-worker service is not present:

C:\Users\Administrator>sc queryex type= service state= all | find /i "generic"

C:\Users\Administrator>

And nssm had not been extracted to C:\

C:\>dir
 Volume in drive C is Windows
 Volume Serial Number is 50AC-80E8

 Directory of C:\

10/04/2018  04:37 PM    <DIR>          builds
10/04/2018  06:14 PM    <DIR>          dsc
10/04/2018  05:47 PM    <DIR>          generic-worker
10/04/2018  05:47 PM    <DIR>          hg-shared
10/04/2018  05:47 PM    <SYMLINKD>     home [C:\Users]
10/04/2018  04:38 PM    <DIR>          Intel
10/04/2018  06:29 PM    <DIR>          log
10/04/2018  05:47 PM    <DIR>          mozilla-build
03/18/2017  09:03 PM    <DIR>          PerfLogs
10/04/2018  05:47 PM    <DIR>          pip-cache
10/04/2018  05:47 PM    <DIR>          ProcessExplorer
10/04/2018  05:47 PM    <DIR>          ProcessMonitor
10/04/2018  05:51 PM    <DIR>          Program Files
10/04/2018  05:49 PM    <DIR>          Program Files (x86)
10/04/2018  05:47 PM    <DIR>          tooltool-cache
10/04/2018  04:34 PM    <DIR>          Users
10/04/2018  05:46 PM    <DIR>          Windows
               0 File(s)              0 bytes
              17 Dir(s)  44,975,190,016 bytes free

The logs show:

Oct 04 11:14:10 T-W1064-MS-016.mdc1.mozilla.com dsc-run: VERBOSE: [T-W1064-MS-016]: [[Script]CommandRun_NSSMInstall] Performing the operation "Set-TargetResource" on target "Executing the SetScript with the user supplied credential".#015
Oct 04 11:14:10 T-W1064-MS-016.mdc1.mozilla.com dsc-run: VERBOSE: [T-W1064-MS-016]: LCM: [ End Set ] [[Script]CommandRun_NSSMInstall] in 0.0100 seconds.#015
Oct 04 11:14:10 T-W1064-MS-016.mdc1.mozilla.com dsc-run: PowerShell DSC resource MSFT_ScriptResource failed to execute Set-TargetResource functionality with error message: #015
Oct 04 11:14:10 T-W1064-MS-016.mdc1.mozilla.com dsc-run: This command cannot be run due to the error: The system cannot find the file specified. #015
Oct 04 11:14:10 T-W1064-MS-016.mdc1.mozilla.com dsc-run: + CategoryInfo : InvalidOperation: (:) [], CimException#015
Oct 04 11:14:10 T-W1064-MS-016.mdc1.mozilla.com dsc-run: + FullyQualifiedErrorId : ProviderOperationExecutionFailure#015
Oct 04 11:14:10 T-W1064-MS-016.mdc1.mozilla.com dsc-run: + PSComputerName : localhost#015
Oct 04 11:14:10 T-W1064-MS-016.mdc1.mozilla.com dsc-run: PowerShell DSC resource MSFT_ScriptResource failed to execute#015
Oct 04 11:14:10 T-W1064-MS-016.mdc1.mozilla.com dsc-run: Set-TargetResource functionality with error message: This command cannot be#015
Oct 04 11:14:10 T-W1064-MS-016.mdc1.mozilla.com dsc-run: run due to the error: The system cannot find the file specified.#015
Oct 04 11:14:10 T-W1064-MS-016.mdc1.mozilla.com dsc-run: + CategoryInfo : InvalidOperation: (:) [], CimException#015
Oct 04 11:14:10 T-W1064-MS-016.mdc1.mozilla.com dsc-run: + FullyQualifiedErrorId : ProviderOperationExecutionFailure#015
Oct 04 11:14:10 T-W1064-MS-016.mdc1.mozilla.com dsc-run: + PSComputerName : localhost#015
Oct 04 11:14:10 T-W1064-MS-016.mdc1.mozilla.com dsc-run: #015

But the file is downloaded:

Oct 04 11:14:10 T-W1064-MS-016.mdc1.mozilla.com dsc-run: VERBOSE: [T-W1064-MS-016]: [[Log]Log_FileDownload_NSSMDownload] FileDownload: NSSMDownload, completed
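In the excerpt above, the failing resource is named on the `LCM: [ End Set ]` line, and the `failed to execute Set-TargetResource` message follows it. A minimal Python sketch of scanning an exported log for that pattern is shown here. It assumes the lines come from a single host in chronological order and that the failure message always follows the `End Set` line of the resource it belongs to, which holds for the excerpts in this bug but is not guaranteed in general. The function name `failed_resources` is hypothetical.

```python
import re

# Matches e.g. "LCM: [ End Set ] [[Script]CommandRun_NSSMInstall]".
# Whitespace inside the brackets is matched loosely because log exports
# may collapse or pad the spacing.
END_SET = re.compile(r"\[\s*End\s+Set\s*\]\s+\[\[(\w+)\](\w+)\]")
FAILURE = "failed to execute Set-TargetResource"


def failed_resources(lines):
    """Return DSC resource names whose End Set line is followed by a failure.

    Assumption: lines belong to one host and are in chronological order.
    """
    failed = []
    last_resource = None
    for line in lines:
        match = END_SET.search(line)
        if match:
            # Remember the most recent resource that finished its Set phase.
            last_resource = match.group(2)
        elif FAILURE in line and last_resource and last_resource not in failed:
            failed.append(last_resource)
    return failed
```

Run against a saved copy of the dsc-run output for one node, this would list each resource (for example `CommandRun_NSSMInstall`) once, which makes it easier to see whether several installs fail for the same reason.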
Assignee: nobody → mcornmesser
There are multiple errors caused by a file or path not being found:

Oct 04 21:40:17 T-W1064-MS-211.mdc1.mozilla.com dsc-run: VERBOSE: [T-W1064-MS-211]: LCM: [ End Set ] [[Script]CommandRun_OpenSshUnzip] in 0.0100 seconds.#015
Oct 04 21:40:17 T-W1064-MS-211.mdc1.mozilla.com dsc-run: PowerShell DSC resource MSFT_ScriptResource failed to execute Set-TargetResource functionality with error message: #015
Oct 04 21:40:17 T-W1064-MS-211.mdc1.mozilla.com dsc-run: This command cannot be run due to the error: The system cannot find the file specified. #015
Oct 04 21:40:17 T-W1064-MS-211.mdc1.mozilla.com dsc-run: + CategoryInfo : InvalidOperation: (:) [], CimException#015

I also found an error specific to OCC-Validate:

Oct 04 21:40:18 T-W1064-MS-543.mdc2.mozilla.com dsc-run: WARNING: [T-W1064-MS-543]: [[Script]InstallSupportingModules] The names of some imported commands from the module 'OCC-Validate' include unapproved verbs that might make them less discoverable. To find the commands with unapproved verbs, run the Import-Module command again with the Verbose parameter. For a list of approved verbs, type Get-Verb.#015
Oct 04 21:40:18 T-W1064-MS-543.mdc2.mozilla.com dsc-run: VERBOSE: [T-W1064-MS-543]: [[Script]InstallSupportingModules] The 'Log-Validation' command in the OCC-Validate' module was imported, but because its name does not include an approved verb, it might be difficult to find. For a list of approved verbs, type Get-Verb.#015
Oct 04 21:40:18 T-W1064-MS-543.mdc2.mozilla.com dsc-run: VERBOSE: [T-W1064-MS-543]: [[Script]InstallSupportingModules] Importing function 'Log-Validation'.#015
Oct 04 21:40:18 T-W1064-MS-543.mdc2.mozilla.com dsc-run: VERBOSE: [T-W1064-MS-543]: [[Script]InstallSupportingModules] The 'Validate-All' command in the OCC-Validate' module was imported, but because its name does not include an approved verb, it might be difficult to find. For a list of approved verbs, type Get-Verb.#015
Oct 04 21:40:18 T-W1064-MS-543.mdc2.mozilla.com dsc-run: VERBOSE: [T-W1064-MS-543]: [[Script]InstallSupportingModules] Importing function 'Validate-All'.#015
Oct 04 21:40:18 T-W1064-MS-543.mdc2.mozilla.com dsc-run: VERBOSE: [T-W1064-MS-543]: [[Script]InstallSupportingModules] The 'Validate-CommandsReturnOrNotRequested' command in the OCC-Validate' module was imported, but because its name does not include an approved verb, it might be difficult to find. For a list of approved verbs, type Get-Verb.#015
Oct 04 21:40:18 T-W1064-MS-543.mdc2.mozilla.com dsc-run: VERBOSE: [T-W1064-MS-543]: [[Script]InstallSupportingModules] Importing function 'Validate-CommandsReturnOrNotRequested'.#015
Oct 04 21:40:18 T-W1064-MS-543.mdc2.mozilla.com dsc-run: VERBOSE: [T-W1064-MS-543]: [[Script]InstallSupportingModules] The 'Validate-FilesContainOrNotRequested' command in the OCC-Validate' module was imported, but because its name does not include an approved verb, it might be difficult to find. For a list of approved verbs, type Get-Verb.#015
Oct 04 21:40:18 T-W1064-MS-543.mdc2.mozilla.com dsc-run: VERBOSE: [T-W1064-MS-543]: [[Script]InstallSupportingModules] Importing function 'Validate-FilesContainOrNotRequested'.#015
Oct 04 21:40:18 T-W1064-MS-543.mdc2.mozilla.com dsc-run: VERBOSE: [T-W1064-MS-543]: [[Script]InstallSupportingModules] The 'Validate-PathsExistOrNotRequested' command in the OCC-Validate' module was imported, but because its name does not include an approved verb, it might be difficult to find. For a list of approved verbs, type Get-Verb.#015
Oct 04 21:40:18 T-W1064-MS-543.mdc2.mozilla.com dsc-run: VERBOSE: [T-W1064-MS-543]: [[Script]InstallSupportingModules] Importing function 'Validate-PathsExistOrNotRequested'.#015
Oct 04 21:40:18 T-W1064-MS-543.mdc2.mozilla.com dsc-run: VERBOSE: [T-W1064-MS-543]: [[Script]InstallSupportingModules] The 'Validate-PathsNotExistOrNotRequested' command in the OCC-Validate' module was imported, but because its name does not include an approved verb, it might be difficult to find. For a list of approved verbs, type Get-Verb.#015

I am going to test with an older version of the file.
Attached patch GitHub Pull Request (obsolete) — Splinter Review
This pull request seems to be the cause: https://github.com/mozilla-releng/OpenCloudConfig/commit/a64ffcf36957a16d5e880966deebe58023ec7bdf

With this change removed, DSC was able to install the needed packages. I think we should roll it back until Rob returns from PTO.
Attachment #9014932 - Flags: review?(pmoore)
Attachment #9014932 - Attachment is patch: true
Attachment #9014932 - Attachment mime type: text/x-github-pull-request → text/plain
Attachment #9014932 - Flags: review?(pmoore) → review-
Attachment #9014932 - Attachment is obsolete: true
Attachment #9015281 - Flags: review?(pmoore)
Attachment #9015281 - Attachment is patch: true
Attachment #9015281 - Attachment mime type: text/x-github-pull-request → text/plain
Attachment #9015281 - Flags: review?(pmoore) → review+
See Also: → 1495035
Apologies Mark. That patch (https://github.com/mozilla-releng/OpenCloudConfig/commit/a64ffcf36957a16d5e880966deebe58023ec7bdf) was indeed faulty. I got back from PTO today and noticed the problem (in Papertrail), but didn't see this bug. I spent the day fixing it with a number of patches (debugging and testing) and actually had the errors fixed and committed with this push: https://github.com/mozilla-releng/OpenCloudConfig/commit/4c6ea88bbf09ec9210e3e6b4f4ff598a6b874e49

Unfortunately, because I wasn't aware of this bug and hadn't posted here, revert merges were made:
- https://github.com/mozilla-releng/OpenCloudConfig/commit/96991ff8c218cb34e03de05c7813af977307e9e6
- https://github.com/mozilla-releng/OpenCloudConfig/commit/3464c36537c1a6449606db9ef824fdbcdaac3205

These didn't take into account that I had already completed and merged working patches, so the reverts actually broke things again. The breakages are most easily seen with these searches:
- hardware: https://papertrailapp.com/groups/1958653/events?q=program%3Adsc-run%20%22SendConfigurationApply%20function%20did%20not%20succeed%22
- ec2: https://papertrailapp.com/groups/2488493/events?q=program%3Adsc-run%20%22SendConfigurationApply%20function%20did%20not%20succeed%22

As soon as I reverted the reverts (https://github.com/mozilla-releng/OpenCloudConfig/commit/43e7daf2cd1e33c618cd24f1e0462944e6f3e708), the problem was sorted again. Please let me know if you see any issues.

Note that the "approved verb" errors mentioned in comment 1 are safe to ignore. They're just warnings indicating that PowerShell prefers the use of approved verbs in function names, e.g. Start-Something instead of Begin-Something ("Start" is an approved verb, "Begin" isn't). Those messages don't indicate that something is broken, just that the approved verb naming convention has been ignored.
The problem in the original patch was to do with my failure to use the new unique filename (introduced by the patch) everywhere that the file is referenced. This was corrected here: https://github.com/mozilla-releng/OpenCloudConfig/commit/4c6ea88bbf09ec9210e3e6b4f4ff598a6b874e49
i've been monitoring the restarting hardware nodes and see that the procmon and procexp errors are no more (these were caused by the earlier defective patch, then fixed as per comment 4).

there is another error relating to the install of the Windows SDK. i don't know yet if this is a new error, or something that's been around for a while. this error is easiest to spot with this search:
https://papertrailapp.com/groups/1958653/events?q=SendConfigurationApply%20ExeInstall_Windows_SDK

one thing that used to cause us problems with SDK installs was that they don't always return an exit code of 0. we'll need to check if this is what's going on by manually running the sdk installer on a hardware instance and then checking the exit code. eg:

    sdk-instal.exe /q
    echo %errorlevel%

if the exit code is not 0 but the install was a success, we can simply modify the manifest component to allow whatever the exit code was. eg if the exit code we want to allow is "7":

    {
      "ComponentName": "Windows_SDK",
      ...
      "AllowedExitCodes": [
        "7"
      ],
      ...
    }

if the exit code is 0, this wasn't the issue. if the manual install fails, we might learn why the dsc install is failing.
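The exit-code check described above can be sketched in Python as follows. This is an illustration of the idea only, not OCC's actual implementation: the function names `allowed_codes` and `run_component` are hypothetical, and the sketch assumes that a component without an "AllowedExitCodes" entry accepts only 0, which may not match how OCC treats the manifest.

```python
import subprocess
import sys


def allowed_codes(component):
    """Return the set of exit codes a manifest component accepts.

    Assumption: a component without "AllowedExitCodes" accepts only 0.
    The manifest stores the codes as strings, so they are converted to int.
    """
    return {int(code) for code in component.get("AllowedExitCodes", ["0"])}


def run_component(command, component):
    """Run an installer command and return (exit_code, accepted)."""
    exit_code = subprocess.run(command).returncode
    return exit_code, exit_code in allowed_codes(component)
```

For a manual test without the real installer, a stand-in command such as `[sys.executable, "-c", "import sys; sys.exit(7)"]` can be passed together with a component dict that lists "7" in "AllowedExitCodes".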
> there is another error relating to the install of the Windows SDK. i don't
> know yet if this is a new error, or something that's been around for a
> while. this error is easiest to spot with this search:
> https://papertrailapp.com/groups/1958653/events?q=SendConfigurationApply%20ExeInstall_Windows_SDK

This is an old error that seems to be non-impacting. I am going to create a bug to keep track of it, but I don't know when I will actually get to it.
Re-imaged all 121 workers which were missing from TC:

T-W1064-MS-016  T-W1064-MS-214  T-W1064-MS-424
T-W1064-MS-020  T-W1064-MS-219  T-W1064-MS-427
T-W1064-MS-022  T-W1064-MS-222  T-W1064-MS-428
T-W1064-MS-031  T-W1064-MS-243  T-W1064-MS-429
T-W1064-MS-034  T-W1064-MS-248  T-W1064-MS-434
T-W1064-MS-041  T-W1064-MS-252  T-W1064-MS-435
T-W1064-MS-062  T-W1064-MS-255  T-W1064-MS-480
T-W1064-MS-063  T-W1064-MS-260  T-W1064-MS-497
T-W1064-MS-064  T-W1064-MS-262  T-W1064-MS-501
T-W1064-MS-066  T-W1064-MS-263  T-W1064-MS-502
T-W1064-MS-069  T-W1064-MS-266  T-W1064-MS-503
T-W1064-MS-070  T-W1064-MS-282  T-W1064-MS-504
T-W1064-MS-076  T-W1064-MS-283  T-W1064-MS-505
T-W1064-MS-077  T-W1064-MS-285  T-W1064-MS-506
T-W1064-MS-081  T-W1064-MS-289  T-W1064-MS-507
T-W1064-MS-090  T-W1064-MS-292  T-W1064-MS-508
T-W1064-MS-107  T-W1064-MS-293  T-W1064-MS-511
T-W1064-MS-108  T-W1064-MS-319  T-W1064-MS-512
T-W1064-MS-110  T-W1064-MS-326  T-W1064-MS-518
T-W1064-MS-111  T-W1064-MS-328  T-W1064-MS-547
T-W1064-MS-118  T-W1064-MS-329  T-W1064-MS-548
T-W1064-MS-120  T-W1064-MS-331  T-W1064-MS-550
T-W1064-MS-128  T-W1064-MS-332  T-W1064-MS-554
T-W1064-MS-129  T-W1064-MS-333  T-W1064-MS-556
T-W1064-MS-133  T-W1064-MS-334  T-W1064-MS-560
T-W1064-MS-152  T-W1064-MS-337  T-W1064-MS-561
T-W1064-MS-154  T-W1064-MS-338  T-W1064-MS-562
T-W1064-MS-155  T-W1064-MS-339  T-W1064-MS-564
T-W1064-MS-159  T-W1064-MS-342  T-W1064-MS-565
T-W1064-MS-164  T-W1064-MS-365  T-W1064-MS-570
T-W1064-MS-165  T-W1064-MS-367  T-W1064-MS-582
T-W1064-MS-170  T-W1064-MS-374  T-W1064-MS-586
T-W1064-MS-172  T-W1064-MS-382  T-W1064-MS-588
T-W1064-MS-173  T-W1064-MS-384  T-W1064-MS-589
T-W1064-MS-176  T-W1064-MS-406  T-W1064-MS-590
T-W1064-MS-177  T-W1064-MS-409  T-W1064-MS-593
T-W1064-MS-199  T-W1064-MS-410  T-W1064-MS-595
T-W1064-MS-201  T-W1064-MS-413  T-W1064-MS-596
T-W1064-MS-202  T-W1064-MS-418  T-W1064-MS-598
T-W1064-MS-204  T-W1064-MS-422
T-W1064-MS-205  T-W1064-MS-423
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED