Closed Bug 1779815 Opened 2 years ago Closed 2 years ago

Some Azure GPU workers are not shutting down on idle

Categories

(Taskcluster :: Workers, defect)

Tracking

(Not tracked)

RESOLVED INVALID

People

(Reporter: markco, Unassigned)

Details

Attachments

(1 file)

This is specific to GPU workers. generic-worker exits with status 68 and the shutdown command is issued, but approximately 10 minutes later the VM reboots.

For example, on vm-mvrrfwkbq8egqlh08mxufqeqjwkyyyqcs1g:

 Jul 15 08:54:33 vm-mvrrfwkbq8egqlh08mxufqeqjwkyyyqcs1g.reddog.microsoft.com worker-runner-service 2022/07/15 15:54:32 Immediate shutdown being issued... 
Jul 15 08:54:33 vm-mvrrfwkbq8egqlh08mxufqeqjwkyyyqcs1g.reddog.microsoft.com worker-runner-service 2022/07/15 15:54:32 Running command: 'C:\Windows\System32\shutdown.exe' /s /t 3 /c 'generic-worker idle timeout' 
Jul 15 08:54:33 vm-mvrrfwkbq8egqlh08mxufqeqjwkyyyqcs1g.reddog.microsoft.com worker-runner-service 2022/07/15 15:54:32 exit status 68 
Jul 15 08:54:33 vm-mvrrfwkbq8egqlh08mxufqeqjwkyyyqcs1g.reddog.microsoft.com User32 The process C:\Windows\System32\shutdown.exe (VM-MVRRFWKBQ8EG) has initiated the shutdown of computer VM-MVRRFWKBQ8EG on behalf of user NT AUTHORITY\SYSTEM for the following reason: No title for this reason could be found   Reason Code: 0x800000ff   Shutdown Type: shutdown   Comment: generic-worker idle timeout 
Jul 15 08:54:36 vm-mvrrfwkbq8egqlh08mxufqeqjwkyyyqcs1g.reddog.microsoft.com Microsoft-Windows-TerminalServices-LocalSessionManager Local multi-user session manager received system shutdown message 
Jul 15 09:05:44 vm-mvrrfwkbq8egqlh08mxufqeqjwkyyyqcs1g.reddog.microsoft.com Microsoft-Windows-TerminalServices-LocalSessionManager Remote Desktop Services: Session logoff succeeded:    User: VM-MVRRFWKBQ8EG\task_165789619135048  Session ID: 1 
Jul 15 09:05:44 vm-mvrrfwkbq8egqlh08mxufqeqjwkyyyqcs1g.reddog.microsoft.com Microsoft-Windows-TerminalServices-LocalSessionManager Plugin RDSAppXPlugin has been successfully initialized 
Jul 15 09:05:44 vm-mvrrfwkbq8egqlh08mxufqeqjwkyyyqcs1g.reddog.microsoft.com Microsoft-Windows-WinRM The WinRM service is not listening for WS-Management requests.      User Action    If you did not intentionally stop the service, use the following command to see the WinRM configuration:      winrm enumerate winrm/config/listener 
Jul 15 09:05:45 vm-mvrrfwkbq8egqlh08mxufqeqjwkyyyqcs1g.reddog.microsoft.com Microsoft-Windows-TerminalServices-LocalSessionManager Begin session arbitration:    User: VM-MVRRFWKBQ8EG\task_165789866807760  Session ID: 1 
Jul 15 09:05:45 vm-mvrrfwkbq8egqlh08mxufqeqjwkyyyqcs1g.reddog.microsoft.com Microsoft-Windows-TerminalServices-LocalSessionManager End session arbitration:    User: VM-MVRRFWKBQ8EG\task_165789866807760  Session ID: 1 
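To put a number on the "approx 10 minutes" gap, here is a minimal Python sketch that parses the leading timestamps of two lines from the log above. The helper name `parse_syslog_time` and the hard-coded year are assumptions made for illustration; it also assumes an English locale so that `%b` matches `Jul`.

```python
from datetime import datetime


def parse_syslog_time(line: str, year: int = 2022) -> datetime:
    """Parse the leading 'Mon DD HH:MM:SS' timestamp of a log line.

    The log lines carry no year, so one is supplied (assumed 2022 here).
    """
    month, day, clock = line.split()[:3]
    return datetime.strptime(f"{year} {month} {day} {clock}", "%Y %b %d %H:%M:%S")


# Two lines copied from the log excerpt above (message text shortened).
shutdown_line = (
    "Jul 15 08:54:33 vm-mvrrfwkbq8egqlh08mxufqeqjwkyyyqcs1g.reddog.microsoft.com "
    "worker-runner-service 2022/07/15 15:54:32 Immediate shutdown being issued"
)
logoff_line = (
    "Jul 15 09:05:44 vm-mvrrfwkbq8egqlh08mxufqeqjwkyyyqcs1g.reddog.microsoft.com "
    "Microsoft-Windows-TerminalServices-LocalSessionManager Session logoff succeeded"
)

gap = parse_syslog_time(logoff_line) - parse_syslog_time(shutdown_line)
print(f"Gap between shutdown request and session logoff: {gap}")
```

For this VM the gap between the shutdown request and the next session activity is a little over 11 minutes.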

This behavior has been observed on two pools: Win10-64-2004-gpu, which uses generic-worker version 43.0.0, and Win10-64-2004-gpu-beta, which uses generic-worker version 44.17.0.

The generic-worker configuration is practically the same for both pools in ci-configuration:

      "workerConfig": {
        "genericWorker": {
          "config": {
            "tasksDir": "Z:\\",
            "cachesDir": "Z:\\caches",
            "workerType": "win10-64-2004-gpu-beta",
            "wstAudience": "firefoxcitc",
            "downloadsDir": "Z:\\downloads",
            "wstServerURL": "https://firefoxci-websocktunnel.services.mozilla.com/",
            "provisionerId": "gecko-t",
            "sentryProject": "generic-worker",
            "disableReboots": false,
            "cleanUpTaskDirs": true,
            "idleTimeoutSecs": 1800,
            "livelogExecutable": "C:\\generic-worker\\livelog.exe",
            "numberOfTasksToRun": 0,
            "runAfterUserCreation": "C:\\generic-worker\\task-user-init.cmd",
            "taskclusterProxyPort": 80,
            "runTasksAsCurrentUser": false,
            "shutdownMachineOnIdle": true,
            "ed25519SigningKeyLocation": "C:\\generic-worker\\ed25519-private.key",
            "taskclusterProxyExecutable": "C:\\generic-worker\\taskcluster-proxy.exe",
            "shutdownMachineOnInternalError": true
          }
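One way to check the "practically the same" claim is to diff the two pools' `config` objects key by key. The sketch below is hypothetical: only the beta pool's values appear in this bug, so the non-beta `workerType` value is an assumption, and only a few keys are reproduced.

```python
def diff_configs(a: dict, b: dict) -> dict:
    """Return {key: (a_value, b_value)} for every key whose value differs.

    A key missing from one side is reported with None on that side.
    """
    return {
        key: (a.get(key), b.get(key))
        for key in sorted(set(a) | set(b))
        if a.get(key) != b.get(key)
    }


# A few keys taken from the beta pool config shown above.
beta = {
    "workerType": "win10-64-2004-gpu-beta",
    "idleTimeoutSecs": 1800,
    "disableReboots": False,
    "shutdownMachineOnIdle": True,
}
# ASSUMPTION: the non-beta pool differs only in workerType.
release = dict(beta, workerType="win10-64-2004-gpu")

print(diff_configs(beta, release))
```

If the pools really are equivalent apart from their names, the diff should contain only `workerType`.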

I wonder if this is related to the VM size Standard_NV6 not fully shutting down after the command is issued.

Flags: needinfo?(pmoore)

It appears that this issue started around 2022-06-28.

The attached log is generated by an Azure runbook that audits running VMs and terminates any that have been up for 24 hours at the time of the audit. For comparison, on most prior days the audit was shutting down between 0 and 10 VMs.
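The audit rule described above can be sketched in Python as follows. This is not the actual runbook: the function name, the input shape (a mapping of VM name to start time), and the VM names are assumptions for illustration only.

```python
from datetime import datetime, timedelta, timezone

MAX_UPTIME = timedelta(hours=24)


def vms_to_terminate(started_at: dict[str, datetime], now: datetime) -> list[str]:
    """Return the names of VMs that have been up for at least 24 hours at audit time."""
    return sorted(
        name for name, start in started_at.items() if now - start >= MAX_UPTIME
    )


# Hypothetical audit run with made-up VM names and start times.
audit_time = datetime(2022, 7, 15, 16, 0, tzinfo=timezone.utc)
running = {
    "vm-example-old": datetime(2022, 7, 14, 12, 0, tzinfo=timezone.utc),  # up 28 hours
    "vm-example-new": datetime(2022, 7, 15, 10, 0, tzinfo=timezone.utc),  # up 6 hours
}

print(vms_to_terminate(running, audit_time))
```

A worker that reboots instead of powering off keeps running and accumulating uptime, which would be consistent with the audit catching more VMs than it did before.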

That is strange. The shutdown command, and the options passed to it, in the above log look correct:

Running command: 'C:\Windows\System32\shutdown.exe' /s /t 3 /c 'generic-worker idle timeout'

See the Microsoft docs on the shutdown command. Specifically, /s signifies shutdown, not reboot. The /t 3 option sets a 3-second delay to give generic-worker a chance to exit reasonably cleanly (although since it is a throwaway VM, it doesn't matter too much).

Note, the code in generic-worker is here.
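As an illustration only (this is not the generic-worker implementation, which is written in Go), the Python sketch below builds the same argument list seen in the log and shows where a reboot would differ: `shutdown.exe` takes `/r` for restart and `/s` for shutdown. The function name and its parameters are hypothetical, and the command is only constructed, never executed.

```python
SHUTDOWN_EXE = r"C:\Windows\System32\shutdown.exe"


def shutdown_args(comment: str, delay_secs: int = 3, reboot: bool = False) -> list[str]:
    """Build a shutdown.exe argument list.

    /s powers the machine off, /r restarts it, /t sets the delay in seconds,
    and /c attaches a comment that appears in the event log.
    """
    action = "/r" if reboot else "/s"
    return [SHUTDOWN_EXE, action, "/t", str(delay_secs), "/c", comment]


# Matches the command in the log: a shutdown, not a reboot.
args = shutdown_args("generic-worker idle timeout")
print(" ".join(args))
```

Since the logged command uses `/s`, the reboot does not appear to come from the arguments generic-worker passes.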

I think this is a good question for Azure support. We are issuing a shutdown command, but the instance is rebooting. It seems like that shouldn't be the case, or maybe there is something wrong with the instance creation settings, so that it is configured to power on if it shuts down?

Flags: needinfo?(pmoore)

Opened Azure support case 2207250040008499.

In short, the Standard_NV6 VMs are restarting after the OS issues a shutdown command. I was able to reproduce this on a VM that was spun up outside of worker-manager and generic-worker, so the problem is not in the worker tooling. Closing this bug; the rest will be tracked in Jira.

Status: NEW → RESOLVED
Closed: 2 years ago
Resolution: --- → INVALID