please make termination of azure instances optional/configurable
Categories
(Taskcluster :: Services, enhancement)
Tracking
(Not tracked)
People
(Reporter: grenade, Assigned: dustin)
windows 7 azure instances are terminated by worker manager when azure returns a provisioning failure caused by the absence of vm agent software on the instance. below is output taken from the azure activity log for the "Create or Update Virtual Machine" operation showing a failure status of OSProvisioningTimedOut.
i am working on mocking the vm agent azure calls but in the meantime, it would be useful if the instances were not terminated immediately after provisioning.
note that the error message suggests that the image may not be generalized, but we do in fact complete the generalize step and mark the instance as generalized. i believe that warning to be a red herring.
also note that before the instance is terminated, it runs successfully and produces normal, healthy log output to papertrail.
"statusMessage": "{\"status\":\"Failed\",\"error\":{\"code\":\"ResourceOperationFailure\",\"message\":\"The resource operation completed with terminal provisioning state 'Failed'.\",\"details\":[{\"code\":\"OSProvisioningTimedOut\",\"message\":\"OS Provisioning for VM 'vm-qgmbdjsasgaczurvztnyowoy7ihd3mrbw4b' did not finish in the allotted time. The VM may still finish provisioning successfully. Please check provisioning state later. Also, make sure the image has been properly prepared (generalized).\\r\\n * Instructions for Windows: https://azure.microsoft.com/documentation/articles/virtual-machines-windows-upload-image/ \\r\\n * Instructions for Linux: https://azure.microsoft.com/documentation/articles/virtual-machines-linux-capture-image/ \"}]}}"
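The statusMessage above is itself a JSON-encoded string, so the failure code has to be parsed out of it before it can be matched. A minimal sketch of that, assuming the shape shown in the activity log entry; the helper name provisioningFailureCodes is hypothetical and the sample message is shortened:

```javascript
// Hypothetical helper (not part of worker-manager): extract the detail codes
// from an Azure activity-log statusMessage, which is a JSON-encoded string.
function provisioningFailureCodes(statusMessage) {
  const parsed = JSON.parse(statusMessage);
  const details = (parsed.error && parsed.error.details) || [];
  return details.map(d => d.code);
}

// Shortened copy of the statusMessage from the activity log, re-encoded as a string.
const statusMessage = JSON.stringify({
  status: 'Failed',
  error: {
    code: 'ResourceOperationFailure',
    message: "The resource operation completed with terminal provisioning state 'Failed'.",
    details: [
      {code: 'OSProvisioningTimedOut', message: 'OS Provisioning did not finish in the allotted time.'},
    ],
  },
});

console.log(provisioningFailureCodes(statusMessage)); // [ 'OSProvisioningTimedOut' ]
```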
Comment 1 (Reporter) • 5 years ago
i think we've cleaned up windows 7 image generalisation to where we have images generalised in the way azure expects, however when these images/instances are instantiated by worker manager, worker-manager still terminates newly started instances while logging the error message:
Unknown provisioningState Creating or powerStates: ProvisioningState/creating
(see https://stage.taskcluster.nonprod.cloudops.mozgcp.net/worker-manager/gecko-t%2Fwin7-32-azure/errors)
is it possible to turn off this instance termination feature in worker-manager or worker-pool configuration?
Comment 2 (Assignee) • 5 years ago
That error comes from:
if (successProvisioningStates.has(provisioningState) &&
    // fairly lame check, succeeds if we've ever been starting/running
    _.some(powerStates, v => successPowerStates.has(v))
) {
  ..
} else if (failProvisioningStates.has(provisioningState) ||
    // if the VM has ever been in a failing power state
    _.some(powerStates, v => failPowerStates.has(v))
) {
  ..
} else {
  ..
  await this.reportError({
    workerPool,
    kind: 'creation-error',
    title: 'Encountered unknown VM provisioningState or powerStates',
    description: `Unknown provisioningState ${provisioningState} or powerStates: ${powerStates.join(', ')}`,
  });
}
where the provisioning-state and power-state sets are defined as:
const successPowerStates = new Set(['PowerState/running', 'PowerState/starting']);
const failPowerStates = new Set(['PowerState/stopping', 'PowerState/stopped', 'PowerState/deallocating', 'PowerState/deallocated']);
const successProvisioningStates = new Set(['Succeeded', 'Creating', 'Updating']);
const failProvisioningStates = new Set(['Failed', 'Deleting', 'Canceled', 'Deallocating']);
from the error, we see that provisioningState is 'Creating' and powerStates is ['ProvisioningState/creating'].
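To make it concrete why that combination lands in the "unknown" branch, here is a small standalone sketch of the same three-way check. It copies the four sets from above, swaps lodash's _.some for Array.prototype.some so it runs without dependencies, and returns a label instead of acting on the worker; the function name classify and the labels are assumptions for illustration only.

```javascript
const successPowerStates = new Set(['PowerState/running', 'PowerState/starting']);
const failPowerStates = new Set(['PowerState/stopping', 'PowerState/stopped', 'PowerState/deallocating', 'PowerState/deallocated']);
const successProvisioningStates = new Set(['Succeeded', 'Creating', 'Updating']);
const failProvisioningStates = new Set(['Failed', 'Deleting', 'Canceled', 'Deallocating']);

// Hypothetical stand-in for the branch structure quoted above.
function classify(provisioningState, powerStates) {
  if (successProvisioningStates.has(provisioningState) &&
      powerStates.some(v => successPowerStates.has(v))) {
    return 'success';
  }
  if (failProvisioningStates.has(provisioningState) ||
      powerStates.some(v => failPowerStates.has(v))) {
    return 'failed';
  }
  return 'unknown';
}

// 'Creating' is a success provisioning state, but no power state is in
// successPowerStates, and nothing matches either fail set.
console.log(classify('Creating', ['ProvisioningState/creating'])); // unknown

// The state pair seen later in the worker-scanner logs.
console.log(classify('Failed', ['ProvisioningState/failed/OSProvisioningTimedOut', 'PowerState/running'])); // failed
```

Because the success branch needs both conditions, a VM that is still 'Creating' and has not yet reported a PowerState/* entry can only fall through to the last branch.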
The docs for these states are here and are as opaque as any Azure docs.
But curiously, the state is not Failed here. Also, that error does not result in shutting down the instance, and in fact worker-scanner continues to see that state for a few more minutes. In the worker-scanner logs, the instance is finally shut down:
2020-08-03 11:22:08.732 GMT reason: "failed state; provisioningState=Failed, powerStates=ProvisioningState/failed/OSProvisioningTimedOut, PowerState/running"
I'm not sure what to do with a resource that has two power states! What combinations are allowed? Does that mean it's in both of those states? Either of them?
So there seem to be two things to fix here:
- Improve the handling of provisioning states and power states, so that powerState ProvisioningState/creating is not considered unknown
- Provide a worker pool config option ignoreFailedProvisioningState with documentation that this is useful in cases where the image does not have a vm agent. The registrationTimeout will still catch instances that fail to register, so this won't "leak" instances the way it would have a while ago.
Comment 3 (Assignee) • 5 years ago
Comment 4 (Assignee) • 5 years ago
OK! Launch configs for azure now have ignoreFailedProvisioningStates, which can be set
ignoreFailedProvisioningStates: ["OSProvisioningTimedOut"]
to ignore this particular provisioning failure (or others, if they come up).
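As a rough illustration of how such a list could be consulted (this is a sketch under assumptions, not the actual worker-manager code), the failure code can be read off the ProvisioningState/failed/<code> entry seen in the worker-scanner log and compared against the configured list. Both helper names below are hypothetical.

```javascript
// Hypothetical helper: collect codes from entries such as
// 'ProvisioningState/failed/OSProvisioningTimedOut'.
function failedProvisioningCodes(powerStates) {
  const prefix = 'ProvisioningState/failed/';
  return powerStates
    .filter(s => s.startsWith(prefix))
    .map(s => s.slice(prefix.length));
}

// Hypothetical check: ignore the failure only if every reported failure code
// is listed in ignoreFailedProvisioningStates.
function shouldIgnoreFailure(powerStates, ignoreFailedProvisioningStates = []) {
  const failureCodes = failedProvisioningCodes(powerStates);
  return failureCodes.length > 0 &&
    failureCodes.every(c => ignoreFailedProvisioningStates.includes(c));
}

const observed = ['ProvisioningState/failed/OSProvisioningTimedOut', 'PowerState/running'];
console.log(shouldIgnoreFailure(observed, ['OSProvisioningTimedOut'])); // true
console.log(shouldIgnoreFailure(observed, [])); // false
```

Requiring every failure code to be listed is a conservative choice in this sketch: an unlisted failure still counts as a failure. Instances that never register would still be caught by registrationTimeout, as noted in comment 2.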
Comment 5 (Assignee) • 5 years ago
(note, this will be in the next release)
Comment 6 (Assignee) • 5 years ago
https://bugzilla.mozilla.org/show_bug.cgi?id=1665920 tracks releasing 37.2.0. This is deployed in staging.