Closed Bug 1384752 Opened 7 years ago Closed 7 years ago

Windows loaners terminate unexpectedly after ~5 minutes

Categories

(Taskcluster :: Workers, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: jryans, Assigned: grenade)

Details

Attachments

(2 files)

I am trying to use the process described at https://wiki.mozilla.org/ReleaseEngineering/How_To/Self_Provision_a_TaskCluster_Windows_Instance to get a loaner instance of type gecko-t-win10-64-gpu. I am able to get one and log in, but it dies after about 5 minutes while I am still connected. Here's an example task: https://tools.taskcluster.net/groups/fyJYxmiQQ2awc-_AjW46XQ/tasks/fyJYxmiQQ2awc-_AjW46XQ/details ("instance-id": "i-0a8aeb5807a99bc8d")
:jonasfj said to ask :pmoore and :grenade.
Flags: needinfo?(rthijssen)
Flags: needinfo?(pmoore)
the old mechanism of waiting for live logs to be deleted to determine that the task has completed no longer works, so this change queries the taskcluster api to check for task completion instead. i've modified the wiki instructions to include the task id in the loan request, and the prep loaner script to query the taskcluster api.
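For illustration, the completion check described above might look roughly like the following. This is a Python sketch, not the PowerShell in the actual prep loaner script. The function names, the injected fetch_status callable and the polling parameters are all hypothetical. It assumes the queue's task status response carries the task state at status.state, and that completed, failed and exception are the states in which a task will run no further.

```python
import json
import time
from typing import Callable

# Assumed set of states in which a Taskcluster task will not run any further.
RESOLVED_STATES = {"completed", "failed", "exception"}


def task_is_resolved(status_body: str) -> bool:
    """Return True if a task-status response body describes a finished task."""
    state = json.loads(status_body)["status"]["state"]
    return state in RESOLVED_STATES


def wait_for_task(
    fetch_status: Callable[[], str],  # hypothetical: returns the raw JSON status body
    poll_seconds: float = 5.0,
    max_polls: int = 120,
) -> bool:
    """Poll until the task is resolved; return False if max_polls is exhausted."""
    for _ in range(max_polls):
        if task_is_resolved(fetch_status()):
            return True
        time.sleep(poll_seconds)
    return False
```

Passing the fetch in as a callable keeps the HTTP details (endpoint URL, retries) out of the sketch; a caller would supply whatever request the script really makes for the task id in the loan request.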
Assignee: nobody → rthijssen
Status: NEW → ASSIGNED
Flags: needinfo?(rthijssen)
Attachment #8890745 - Flags: review?(pmoore)
J. Ryan, the patch above (when landed) should resolve the terminating instance problem. however, you should also know that gecko-t-win10-64-gpu is a problematic worker type right now. you'll be able to connect, etc., but we currently have issues with accessing the GPU correctly from firefox tests. that issue is tracked in bug 1382625.
Flags: needinfo?(pmoore)
Attachment #8890745 - Flags: review?(pmoore) → review+
Looks like your patch was merged, but I got the same result after trying again (task: HE_oVFF3TgSyRlHgE5dklw, instance: i-0311aceaff012b6f4). Are more steps needed to deploy the change? Or further changes?
Flags: needinfo?(rthijssen)
jryans, thanks for the retest and update. especially the task and instance ids which i was able to use to find the cause of the failure (failure to kill the running gw process). i will try a new patch now...
Flags: needinfo?(rthijssen)
$gwService | Stop-Service -PassThru | Set-Service -StartupType disabled

this command was causing the script to hang or fail due to warnings about the service stop needing some time to complete. removed in favour of a more robust check.
Attachment #8892000 - Flags: review?(pmoore)
Attachment #8892000 - Attachment is patch: true
Attachment #8892000 - Attachment mime type: text/x-github-pull-request → text/plain
Attachment #8892000 - Flags: review?(pmoore) → review+
Still seems to die after a few minutes. I tried again with task: CM38VSfnQMWd7SJVqUEx9g, instance: i-0e55aa3b964eddeec.
Flags: needinfo?(rthijssen)
pmoore: the generic worker service is proving difficult to terminate in g-w 10+

last entries from the prep loaner script in the log are these (https://papertrailapp.com/groups/3695693/events?q=i-0e55aa3b964eddeec):

Jul 31 21:56:19 i-0e55aa3b964eddeec.gecko-t-win10-64-gpu.euc1.mozilla.com PrepLoaner: loan request task completion detected
Jul 31 21:56:20 i-0e55aa3b964eddeec.gecko-t-win10-64-gpu.euc1.mozilla.com PrepLoaner: Remove-GenericWorker :: begin
Jul 31 21:56:20 i-0e55aa3b964eddeec.gecko-t-win10-64-gpu.euc1.mozilla.com PrepLoaner: Remove-GenericWorker :: attempting to stop running generic-worker service.
Jul 31 21:56:21 i-0e55aa3b964eddeec.gecko-t-win10-64-gpu.euc1.mozilla.com nssm: Failed to open process handle for process with PID 4160 when terminating service Generic Worker: The parameter is incorrect.

the line following the log entry about "attempting to stop running generic-worker service" is (https://github.com/mozilla-releng/OpenCloudConfig/blob/cbcf0124ad793ec1201c597cdbeb94c329e259c2/userdata/PrepLoaner.ps1#L139):

$gwService | Stop-Service -Force -WarningAction SilentlyContinue

which seems to trigger the error message from nssm.

can you suggest a way to reliably stop the generic worker service? normal methods are not working.
Flags: needinfo?(rthijssen) → needinfo?(pmoore)
also testing to see if ignoring the service stop error gets us further: https://github.com/mozilla-releng/OpenCloudConfig/commit/51840448adc6b7bc0fe4803b73e9b7930096e913
FWIW, I got two win2012 loaners killed after what appears to be exactly 10 minutes. The last attempt was Zrm3NuP6Qv6llTxx1wbzMw.
Ideally the run-generic-worker.bat script would run the task that creates the loaner, and exit naturally, rather than be killed by something external. Can that logic just be put in run-generic-worker.bat, and the worker set to only run one task at a time?
Flags: needinfo?(pmoore)
it's the nssm service that won't die. the generic-worker.exe process is successfully terminated. since the bat file only controls the gw exe process, altering it won't affect the nssm service. the error seems to be with an invalid process handle/pid being set in the service and the service not having a crash mechanism to handle it. it simply hangs in the running state instead of terminating. most windows services can be stopped with the `net stop` command but this one is refusing to die. i think it's a bug.
Thanks for the analysis Rob. FWIW these are the nssm commands that the generic worker installation process calls:
https://github.com/taskcluster/generic-worker/blob/9bd8dd4f2422b706dbf2a3636ab524c6995b4606/plat_windows.go#L432-L461

The NSSM commands docs are here:
https://nssm.cc/commands

I'll set aside some time later in the week to see if I can get to the bottom of this.
I suspect that what is happening is the following:

* The loaner task runs, generates the Z:\\loan-request.json file
* The generic-worker process exits cleanly (without rebooting, due to the presence of the loan-request.json file)
* At some point later, HaltOnIdle script spots the z:\\loan-request.json file, and creates the new user, and the machine gets rebooted
* Since generic-worker is installed as a service and enabled on boot, it starts up after the reboot
* The generic-worker sees from the registry that a task user has been created, and waits 5 minutes for that user to login
* The generic-worker is not able to get an access token to the task user's login, and shuts down the machine, giving up

If this is the workflow, I think the only change needed is *not* to kill the generic-worker in the HaltOnIdle script, but instead to just disable the windows service if it exists, or even remove it (I say "if it exists" because in older versions of the worker still on other platforms, it isn't installed as a windows service, but as a scheduled task). That could either be done with:

sc delete "Generic Worker"

or if nssm.exe is in the PATH:

nssm remove "Generic Worker" confirm
(In reply to Pete Moore [:pmoore][:pete] from comment #15)
> * At some point later, HaltOnIdle script spots the z:\\loan-request.json
> file, and creates the new user, and the machine gets rebooted

It looks like the reboot occurs here:
https://github.com/mozilla-releng/OpenCloudConfig/blob/fa6b5514209f61f2cd36dc28be6da1801942bf54/userdata/PrepLoaner.ps1#L388
Flags: needinfo?(rthijssen)
Oh, having looked at the code and the bug comments more closely, I see you are already deleting the service. Sorry! Maybe

nssm stop "Generic Worker"
nssm remove "Generic Worker" confirm

might work in place of the existing commands?
So I see that the loaner task waits for HaltOnIdle to create z:\loan\credentials.txt.gpg before exiting. So I think the flow is:

1) The loaner task runs, generates the Z:\\loan-request.json file
2) The HaltOnIdle script spots the z:\\loan-request.json file, and runs PrepLoaner
3) PrepLoaner creates the new Administrator password, performs cleanup
4) PrepLoaner tries but fails to kill generic-worker.exe process
5) PrepLoaner tries but fails to delete the "Generic Worker" service
6) loaner task completes successfully and generic-worker uploads the loaner task artifacts (encrypted credentials etc)
7) PrepLoaner reboots the machine
8) The generic-worker sees from the registry that a task user has been created, and waits 5 minutes for that user to login
9) The generic-worker is not able to get an access token to the task user's login, and shuts down the machine, giving up

I think we don't want to run step 4, because step 6 requires that generic-worker is running. Or maybe step 4 and 6 are racing, and step 6 always wins.

I think it might work if:

* PrepLoaner creates Z:\loan
* PrepLoaner waits for generic-worker.exe process to naturally complete
* PrepLoaner removes the 'Generic Worker' service

I can imagine there could be an NSSM bug, whereby the service still shows as running, even when run-generic-worker.bat script has completed, and then attempts to stop or remove the service fail, because it can't stop it. If that is the case, it might be worth putting run-generic-worker.bat in an infinite goto loop after it exits and sees there is a loan, so that the process is still running, and the service stop/remove works. That could be an nssm bug, which this workaround might address.
... or if the windows service can be disabled on boot, rather than deleted, that might work around the problem
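The ordering proposed in the comments above (let generic-worker.exe finish on its own, then stop and remove the service) could be sketched as below. This is an illustrative Python sketch only, since the real logic lives in PrepLoaner.ps1. The names retire_worker, worker_is_running and run are hypothetical, as are the polling defaults. The only commands it issues are the two nssm invocations quoted earlier in this bug.

```python
import time
from typing import Callable, List, Sequence

SERVICE_NAME = "Generic Worker"


def removal_commands(service_name: str = SERVICE_NAME) -> List[List[str]]:
    """The two nssm invocations suggested in this bug, as argv lists."""
    return [
        ["nssm", "stop", service_name],
        ["nssm", "remove", service_name, "confirm"],
    ]


def retire_worker(
    worker_is_running: Callable[[], bool],  # hypothetical process check
    run: Callable[[Sequence[str]], int],    # hypothetical runner returning an exit code
    poll_seconds: float = 5.0,
    max_polls: int = 120,
) -> bool:
    """Wait for the worker to exit naturally, then stop and remove the service.

    Returns True only if the worker exited and both commands returned 0.
    """
    for _ in range(max_polls):
        if not worker_is_running():
            break
        time.sleep(poll_seconds)
    else:
        return False  # worker never exited, so the service is left untouched
    # Run both commands even if the stop reports an error.
    results = [run(cmd) for cmd in removal_commands()]
    return all(code == 0 for code in results)
```

On a real machine, run could be something like lambda cmd: subprocess.run(cmd).returncode. Whether nssm actually honours the stop request in this situation is exactly what is in question here, so the sketch only fixes the order of operations.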
from the log output i had seen, generic-worker.exe is successfully killed. it's only the nssm service stop that fails. there are already a number of workarounds in place (including the win10 only reboot). what is really needed is for the nssm service to stop, when a stop is requested. we've also tried disabling it on boot. it's basically not possible to stop or disable it. it just refuses to stop and restarts on reboot because the command to disable it (before the reboot) fails. there's not really a lot we can do if this service can't be killed. i've already tried a lot of "workarounds". it hasn't worked. we need the nssm service to stop when requested.
Flags: needinfo?(rthijssen)
(In reply to Rob Thijssen (:grenade - UTC+3) from comment #20)
> from the log output i had seen, generic-worker.exe is successfully killed.
> it's only the nssm service stop that fails.

We should not kill the generic-worker, but let it complete successfully (which happens automatically after the loaner task completes).

> there are already a number of workarounds in place (including the win10 only
> reboot). what is really needed is for the nssm service to stop, when a stop
> is requested. we've also tried disabling it on boot. it's basically not
> possible to stop or disable it. it just refuses to stop and restarts on
> reboot because the command to disable it (before the reboot) fails.

Have you tried the stop command from comment 17?

> there's not really a lot we can do if this service can't be killed. i've
> already tried a lot of "workarounds". it hasn't worked. we need the nssm
> service to stop when requested.

NSSM is a utility we're using rather than something we wrote ourselves, so it is more difficult to fix its behaviour. I will have a stab later today on a live machine to see if I can reproduce the problem, and find a way to stop the service running after the reboot.
(In reply to Pete Moore [:pmoore][:pete] from comment #21)
> We should not kill the generic-worker, but let it complete successfully
> (which happens automatically after the loaner task completes).

we wait for the task to complete:
https://github.com/mozilla-releng/OpenCloudConfig/blob/master/userdata/PrepLoaner.ps1#L379
the kill (and the attempt to stop the service) is to prevent gw from taking further tasks.

> Have you tried the stop command from comment 17?

no, i will.
Status: ASSIGNED → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Component: Generic-Worker → Workers