Closed Bug 1308791 Opened 9 years ago Closed 9 years ago

Install Generic Worker on gecko-*-*-win* workers as part of worker DSC, rather than during AMI creation

Categories

(Taskcluster :: Services, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED INVALID

People

(Reporter: pmoore, Unassigned)

Details

When a gecko-*-*-win* worker first starts up, it is not in a position to run the generic worker: it does not yet have a formatted Z: drive, but its config refers to that drive, e.g. as the location to store user home directories. The worker recognises that it is in a bad state and exits. A script therefore checks routinely whether the worker is running and, if it isn't, tries to run it again, so that at some point, when the drive is ready and available, it should run OK.

There are a couple of problems with this approach:

1) It introduces a race condition: the worker may start up before the prerequisite steps have completed without detecting it, which can lead to non-deterministic behaviour of the worker.
2) The worker is designed to shut down the instance if it determines that it is unhealthy. If the worker is started before the prerequisite steps have completed, it could cause the instance to shut down.

A solution to this problem is to move the installation of the worker to the DSC that runs when the worker starts up. The DSC is already responsible for mounting the Z: drive. In this scenario, the AMI creation process could proactively download the generic worker binary, but not actually install it. The worker startup DSC could then install it and reboot the computer after all other steps, such as formatting/mounting the Z: drive, had successfully completed.

I think once this is done, there is no need to have the DSC checking whether generic worker is running, since it should run reliably. Alternatively, if it isn't running, the DSC could choose to shut down the machine rather than restart the worker, since a worker that is not running would indicate that something is seriously wrong. This was not the case before, because the DSC had to account for the worker having exited on purpose when the Z: drive was not available.
It's possible to implement this and I will, to see if it gets us further, but I don't believe it's the right solution. I think that generic worker should wait for the Z: drive to exist and not exit or throw exceptions unless a configurable timeout is reached.

The reason it's problematic to rely on the proposed change to OCC is that the start of generic worker is triggered by the instance booting. More specifically, the auto-login mechanism is triggered by the instance booting, and the start of generic worker is triggered by a logon script. If generic worker is installed on the spot instance rather than on the AMI, we have to introduce an extra reboot after g-w is installed on each spot instance in order to trigger the start of g-w. It's certainly doable, and I will try, but it feels sloppy to add yet another reboot (on every spot instance) when a more efficient g-w could simply wait for the drive to be available.

We deliberately do all of our installations on the AMI (rather than on the spot instance) to save build time and reduce the cost of CPU time on spot instances, so the change proposed here breaks that pattern.
On OS X we do a clever thing where, once the system is configured correctly, we touch a semaphore file which triggers startup of runner and thus buildbot. Could you do something similar here, where the login task waits for a semaphore file (but not forever) and then starts g-w when that file appears?
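A minimal sketch of that semaphore idea, in Python purely for illustration (the real login task would more likely be a batch or PowerShell script). The function name, the default timeout and the poll interval are assumptions of this sketch, not anything taken from OCC or generic-worker:

```python
import os
import time


def wait_for_semaphore(path, timeout_s=600.0, poll_s=5.0):
    """Return True once `path` exists, or False if `timeout_s` elapses first."""
    deadline = time.monotonic() + timeout_s
    while True:
        if os.path.exists(path):
            return True
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            # Give up rather than wait forever; the caller decides what to do.
            return False
        # Never sleep past the deadline.
        time.sleep(min(poll_s, remaining))
```

The login task would call this with whatever file the provisioning step touches once the Z: drive is mounted (for example a hypothetical `Z:\ready.semaphore`) and start g-w only if it returns True. What to do on False (log, retry, shut down) is a policy choice this sketch leaves open.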
I think having the fewest moving parts when spot instances are started is the best way to ensure that we don't have a lot of failure points when trying to claim and complete tasks. It also reduces the time tasks spend waiting to be claimed. Installing more things on the spot instance and adding multiple reboots just sounds like a way to introduce possible failures. Ideally the machine is already in a state where all software is available on boot, and it's a matter of mounting the drive, logging in, and starting up the worker. (Of course there is a lot of hand-waving and unknowns on my end here, so don't read too much into my oversimplification.)
The generic worker is meant to be cross-platform, and also to support both cloud and non-cloud environments. The issue with the Z: drive is specific to EC2 environments only, Windows only, and then our use of SSD on these Windows instances. I feel burning logic into the generic worker to handle this arbitrary use case would be wrong. Instead, I agree that it should be handled outside of the worker, such that the worker is only started up when its prerequisite steps have been met.

Currently the generic worker provides an "install" target which allows for a trivial installation of the worker. In reality, this sets a couple of registry entries and creates a scheduled task. There is no requirement to use this installation target. It is provided for convenience, but you are free to install it as you wish.

Bearing this in mind, and in order to avoid an extra reboot, I would agree with Dustin that it makes sense to have a custom script/startup mechanism that waits for the environment to be ready, and then starts the generic worker. This could probably be best implemented by applying the same mechanics that the current "install" target applies, but adapting the scheduled task such that it waits for the Z: drive to be ready (via a semaphore or other means). This could be done on the AMI.
Effectively, this means swapping out https://github.com/mozilla-releng/OpenCloudConfig/blob/ef035436d8977571dfeb3d40dfd322a84d2b9242/userdata/Manifest/gecko-t-win7-32.json#L425-L468 with a custom install step that:

1) creates an admin user for generic worker
2) sets auto-login for the generic worker user created
3) creates a scheduled task, triggered on login of the generic worker user, that monitors for environment readiness and then starts the generic worker

For comparison, this is the code that the "install" target currently runs: https://github.com/taskcluster/generic-worker/blob/581b86560489d536d896808023d142540c9f04f1/plat_windows.go#L377-L414

As I say, I see this as a custom bootstrap problem, rather than something the generic worker should implicitly support. It seems architecturally cleaner for the worker to only be started when the environment is ready, and this avoids adding very use-case-specific code to generic worker; that code can comfortably live in the bootstrap process, rather than in the application itself.
Note, this is the scheduled task that running the command `generic-worker install ....` creates:

<Task version="1.2" xmlns="http://schemas.microsoft.com/windows/2004/02/mit/task">
  <RegistrationInfo>
    <Date>2016-04-28T17:25:08.4654422</Date>
    <Author>GenericWorker</Author>
    <Description>Runs the generic worker.</Description>
  </RegistrationInfo>
  <Triggers>
    <LogonTrigger>
      <Enabled>true</Enabled>
      <UserId>GenericWorker</UserId>
    </LogonTrigger>
  </Triggers>
  <Principals>
    <Principal id="Author">
      <UserId>GenericWorker</UserId>
      <LogonType>InteractiveToken</LogonType>
      <RunLevel>HighestAvailable</RunLevel>
    </Principal>
  </Principals>
  <Settings>
    <MultipleInstancesPolicy>IgnoreNew</MultipleInstancesPolicy>
    <DisallowStartIfOnBatteries>true</DisallowStartIfOnBatteries>
    <StopIfGoingOnBatteries>true</StopIfGoingOnBatteries>
    <AllowHardTerminate>true</AllowHardTerminate>
    <StartWhenAvailable>false</StartWhenAvailable>
    <RunOnlyIfNetworkAvailable>false</RunOnlyIfNetworkAvailable>
    <IdleSettings>
      <StopOnIdleEnd>true</StopOnIdleEnd>
      <RestartOnIdle>false</RestartOnIdle>
    </IdleSettings>
    <AllowStartOnDemand>true</AllowStartOnDemand>
    <Enabled>true</Enabled>
    <Hidden>false</Hidden>
    <RunOnlyIfIdle>false</RunOnlyIfIdle>
    <WakeToRun>false</WakeToRun>
    <ExecutionTimeLimit>PT0S</ExecutionTimeLimit>
    <Priority>7</Priority>
  </Settings>
  <Actions Context="Author">
    <Exec>
      <Command>C:\generic-worker\run-generic-worker.bat</Command>
    </Exec>
  </Actions>
</Task>

As you see, it runs the script C:\generic-worker\run-generic-worker.bat on login of the user. You could just add some code to the referenced .bat script to wait for the Z: drive to be available. Since `generic-worker install ....` also creates C:\generic-worker\run-generic-worker.bat, you would run the install target first, and then update/replace the run-generic-worker.bat script. This seems the simplest and cleanest solution.
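For the other variant mentioned earlier in the thread (adapting the scheduled task rather than editing the .bat in place), here is a rough Python sketch that rewrites the `<Command>` element of an exported task definition so it points at a wrapper script. The function name and the wrapper path are hypothetical, and re-registering the modified task is deliberately left out:

```python
import xml.etree.ElementTree as ET

# Namespace used by the Task Scheduler XML shown in this bug.
TASK_NS = "http://schemas.microsoft.com/windows/2004/02/mit/task"


def retarget_task_command(task_xml, new_command):
    """Return the task XML with its Exec action pointing at `new_command`."""
    # Serialise with the task namespace as the default (no ns0: prefixes).
    ET.register_namespace("", TASK_NS)
    root = ET.fromstring(task_xml)
    command = root.find("t:Actions/t:Exec/t:Command", {"t": TASK_NS})
    if command is None:
        raise ValueError("task definition has no Exec/Command element")
    command.text = new_command
    return ET.tostring(root, encoding="unicode")
```

Called with the XML above and a hypothetical wrapper such as `C:\generic-worker\wait-then-run.bat`, this leaves the trigger, principal and settings untouched and changes only what runs at logon. The wrapper would then wait for Z: before calling the original run-generic-worker.bat.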
It turns out that the slow mount of the Z: drive was due to a fault in the base AMI. When I rebuilt the base Win7 AMI yesterday in order to shrink the C: drive, I inadvertently rectified the slow mount issue. Windows 7 spot instances now have Z: available before g-w starts. The other Windows AMIs (2012, 10) never had a slow-mounting drive.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → INVALID
That's great news, thanks Rob! :-)
Component: Integration → Services