Closed Bug 1184503 Opened 9 years ago Closed 9 years ago

in-house provisioner for Windows / OS X (Discussion)

Categories

(Taskcluster :: General, defect)

Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: pmoore, Unassigned)

Details

In order for TaskCluster tasks to run on Mozilla-hosted hardware (Windows™ test jobs, possibly OS X tests too if we don't use an external provider) we will no doubt need a custom in-house TaskCluster provisioner.

The in-house provisioner would essentially perform a similar function to the aws-provisioner, except it would provision environments on Mozilla-hosted hardware rather than in ec2.

Historically, Windows, OS X and devices (such as foopies/pandas) have been available for buildbot jobs. With the migration from buildbot to TaskCluster, it will be necessary for TaskCluster to support at least using in-house hardware for Windows and OS X, unless it can be demonstrated that these tests can be run either in aws or by external parties, or can be abandoned.

Fortunately, RelEng already has mechanisms to control in-house environments, using https://wiki.mozilla.org/ReleaseEngineering/PuppetAgain

Therefore the simplest (and perhaps optimal) path to realisation may be to develop a TaskCluster provisioner which manipulates puppet configs in order to provision environments. Of course, a major difference from the aws-provisioner is that capacity is fixed.

It may be that we can get away without a provisioner at all: as long as there are workers running that claim tasks from the Queue, they can operate and report job status back. However, there are several advantages to having a dedicated provisioner, such as transparency of operation, manageability, transparency of worker types, etc.

It will be important for teams to be aligned here - so putting needinfo on people in A-Team, RelEng, RelOps and TaskCluster to make sure all sides are covered. Please add issues / requirements / concerns / opinions as you see fit so we can move the planning forward on this tricky and problem-laden topic.

Thanks.
(can't CC :dustin, will CC instead)
Flags: needinfo?(rail)
Flags: needinfo?(jopsen)
Flags: needinfo?(jhford)
Flags: needinfo?(jgriffin)
Flags: needinfo?(catlee)
Flags: needinfo?(bhearsum)
We've been tossing this idea around for a while now -- it was behind the unsuccessful attempt to utilize OpenStack for bare-metal provisioning.  The idea was basically to build a souped-up version of mozpool (which manages reimaging and maintaining panda boards).

In general, I think that completely automated re-installation of hardware is *very* difficult and has a relatively small benefit over other options.  By completely automated, I mean allowing the provisioner to, for example, re-image a Windows 8 host as Windows 10 because the win10 queue is longer than win8.  The issues include:
 - poor IPMI implementation on hardware (which would leave hosts unusable at an alarming rate)
 - no OOB on apple hardware (although we do have PDUs)
 - lack of flexibility in the netboot process - especially on OS X
 - installs are very slow - on the scale of hours
 - the bootstrap-secret problem (solved with instance metadata in AWS) is even harder, since PXE is completely unencrypted

Of course, we could throw money and engineers at the problem, but our experience with OpenStack suggests it's going to take a lot of money (better OOB control and/or newer, better hardware; maybe a dedicated install network) and a LOT of engineering (we would have to build a solution largely from scratch, spanning OS X and Windows).  There were no useful commercial alternatives as of our investigation a year or two ago.  Vendors say "sure, we do bare metal", then you get into the WebEx and they say "oh, yeah, we don't do that, but we can install *our* application on bare metal using DVDs."

That is assuming that we have relatively few pools (e.g., a half-dozen pools of 50 or so machines each).  It seems that TC in AWS is trending toward having more workerTypes (which would be one-to-one with pools) with fewer instances, which is OK in AWS but will substantially reduce our flexibility on hardware.  Even if we did have automated provisioning, consider the situation if we had 50 pools of 5 machines each.  If a spate of try jobs spikes the pending count in one of those pools, and the provisioner decides it needs 7 machines in the pool, it must find two other pools to steal hardware from.  Those pools will each find their capacity depleted by 20% (one machine out of five), and the oversubscribed pool won't have its new capacity for over an hour.

There are workarounds -- caching images on local drives to speed reinstalls; writing a bootloader that makes an API call to see what already-installed OS it should boot; using a SAN with copy-on-write semantics to boot operating systems; or even using HyperV or VMware and accepting the level of performance instability it brings (aka, giving up on Firefox performance measurement).  None of them are very appealing.

If we accept fixed naming for hardware, rather than try to rename every time we re-image, we could make *manual* reimaging quite a bit easier, and balance pools that way.  We could even build a tool that looks at historical load figures and recommends a mix, then manually re-provision to achieve that mix every week.  As suggested above, I don't think rapid, dynamic re-provisioning is practical anyway.
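The load-recommendation tool suggested above could be sketched roughly as follows. The pool names, load figures, and the simple proportional-allocation heuristic are all illustrative assumptions, not an actual design:

```python
# Hypothetical sketch: recommend a fixed pool mix from historical load.
# Pool names and load figures are made up for illustration.

def recommend_mix(historical_load, total_machines):
    """Allocate machines to pools proportionally to historical load,
    guaranteeing at least one machine per pool."""
    total_load = sum(historical_load.values())
    mix = {pool: max(1, round(total_machines * load / total_load))
           for pool, load in historical_load.items()}
    # Trim any overshoot caused by rounding/minimums, largest pools first.
    while sum(mix.values()) > total_machines:
        biggest = max(mix, key=mix.get)
        mix[biggest] -= 1
    return mix

load = {"win8": 120, "win10": 300, "osx": 180}  # tasks/week, invented
print(recommend_mix(load, 60))  # → {'win8': 12, 'win10': 30, 'osx': 18}
```

A human (or puppet) would then re-provision toward the recommended mix on a weekly cadence, which matches the "no rapid dynamic re-provisioning" conclusion above.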
We'll also need this for linux hardware, for Talos tests.

Regardless of whether we want to try to support automatic OS installations (and I agree, that's a lot of work and probably not something we need for an initial rollout, at least), will we need some kind of mozpool-esque hardware manager to monitor machines, doing things like handling reboots and hangs?  Are we assuming workers on the hardware will claim TC tasks and we don't need any kind of intermediary?
Flags: needinfo?(jgriffin)
In its simplest form a provisioner does the following:
  1) Polls the queue for the number of pending tasks
  2) Launches machines as needed
If we have a fixed hardware pool, we don't need a provisioner.

@dustin, I always favor fewer workerTypes both in and outside aws, based on the theory that:
  fewer workerTypes => more tasks per type => constant task flow => predictable load => less idle-time
In practice, keeping caches hot sometimes justifies specialised workerTypes.

> will we need some kind of mozpool-esque hardware manager to monitor machines,
> doing things like handling reboots and hangs?
Not sure how much, but we should have some sort of health monitoring. Presumably, the machines will
eventually become buggy and need reformatting.

> Are we assuming workers on the hardware will claim TC tasks and we don't need any kind of intermediary?
That's what workers do. Workers don't talk to the provisioner; in aws-land there is a bit of chatter
for secrets and config, but we can just put those in a file on the machine.
Deploying docker-worker with a config file should be fairly straightforward.
There is no need for a provisioner, if some human pushes the start button.
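The "config/secrets in a file" approach might look like the sketch below. The field names (provisionerId, workerType, credentials) are illustrative guesses loosely modeled on TaskCluster worker configs, not a real schema:

```python
import json
import os
import tempfile

# Hypothetical worker config; all field names are assumptions, not a
# real docker-worker schema.
config = {
    "provisionerId": "hardware",   # static pool: no real provisioner behind it
    "workerType": "win10-hw",
    "workerGroup": "scl3",
    "credentials": {"clientId": "worker/win10-hw", "accessToken": "<secret>"},
}

# Puppet (or a human pushing the start button) would drop this file on the
# machine; the worker reads it at startup instead of asking a provisioner.
path = os.path.join(tempfile.mkdtemp(), "worker.json")
with open(path, "w") as f:
    json.dump(config, f, indent=2)

with open(path) as f:
    print(json.load(f)["workerType"])  # → win10-hw
```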

---
 - I don't think we need a provisioner
   (we can always do that in the future, if we have bare-metal provisioning)
 - We might need health monitoring
 - We store configuration/secrets in a file on the machines

This obviously makes rollout of new worker code very painful, but hardware is already painful, right?
(Obviously we should automate the rollout process as much as possible.)
Flags: needinfo?(jopsen)
It won't be particularly painful -- puppet can take care of that fairly easily, depending on exactly how you want to orchestrate it.
I agree, we don't really need a provisioner for bare-metal hardware, unless you're managing reimaging.  If you are managing reimaging, it might be easier to just build something into the reimaging management tool (mozpool?) that checks Queue.pendingTasks() to see which ones are needed most.
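That pendingTasks check could be as simple as the sketch below; pending_counts stands in for a call to Queue.pendingTasks(), and the workerType names are invented:

```python
# Hypothetical sketch of a reimaging tool asking "which workerType has the
# deepest backlog?" before choosing which image to install next.
# pending_counts is a stand-in for Queue.pendingTasks() results.

def most_needed(pending_counts):
    """Return the workerType with the most pending tasks, or None if idle."""
    if not pending_counts or max(pending_counts.values()) == 0:
        return None
    return max(pending_counts, key=pending_counts.get)

print(most_needed({"win8-hw": 3, "win10-hw": 41, "osx-hw": 0}))  # → win10-hw
```

The reimaging management tool (mozpool or similar) would call this each time a machine frees up, rather than running any continuous provisioning loop.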
Flags: needinfo?(jhford)
Flags: needinfo?(catlee)
Flags: needinfo?(rail)
I'm going to close this then. Some other topics came up in the discussion above, but those can be spun off into separate bugs if required - it looks like a provisioner per se is not needed.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → WONTFIX