Closed Bug 1184503 Opened 9 years ago Closed 9 years ago

in-house provisioner for Windows / OS X (Discussion)

Categories

(Taskcluster :: General, defect)

Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: pmoore, Unassigned)

Details

In order for TaskCluster tasks to run on Mozilla-hosted hardware (Windows™ test jobs, possibly OS X tests too if we don't use an external provider) we will no doubt need a custom in-house TaskCluster provisioner.

The in-house provisioner would essentially perform a similar function to the aws-provisioner, except it would provision environments on Mozilla-hosted hardware rather than in ec2.

Historically, Windows, OS X and devices (such as foopies/pandas) have been available for buildbot jobs. With the migration from buildbot to TaskCluster, it will be necessary for TaskCluster to support at least using in-house hardware for Windows and OS X, unless it can be demonstrated that these tests can be run either in aws or by external parties, or can be abandoned.

Fortunately, RelEng already has mechanisms to control in-house environments, using https://wiki.mozilla.org/ReleaseEngineering/PuppetAgain

Therefore the simplest (and perhaps optimal) path to realisation may be to develop a TaskCluster provisioner which manipulates puppet configs in order to provision environments. Of course, a major difference from the aws-provisioner is that capacity is fixed.

It may be that we can get away without a provisioner at all: as long as there are workers running that claim tasks from the Queue, they can operate and report job status back. However, there are several advantages to having a dedicated provisioner, such as transparency of operation, manageability, transparency of worker types, etc.

It will be important for teams to be aligned here - so putting needinfo on people in A-Team, RelEng, RelOps and TaskCluster to make sure all sides are covered. Please add issues / requirements / concerns / opinions as you see fit so we can move the planning forward on this tricky and problem-laden topic.

Thanks.
(can't CC :dustin, will CC instead)
Flags: needinfo?(rail)
Flags: needinfo?(jopsen)
Flags: needinfo?(jhford)
Flags: needinfo?(jgriffin)
Flags: needinfo?(catlee)
Flags: needinfo?(bhearsum)
We've been tossing this idea around for a while now -- it was behind the unsuccessful attempt to utilize OpenStack for bare-metal provisioning.  The idea was basically to build a souped-up version of mozpool (which manages reimaging and maintaining panda boards).

In general, I think that completely automated re-installation of hardware is *very* difficult and has a relatively small benefit over other options.  By completely automated, I mean allowing the provisioner to, for example, re-image a Windows 8 host as Windows 10 because the win10 queue is longer than win8.  The issues include:
 - poor IPMI implementation on hardware (which would leave hosts unusable at an alarming rate)
 - no OOB on apple hardware (although we do have PDUs)
 - lack of flexibility in the netboot process - especially on OS X
 - installs are very slow - on the scale of hours
 - the bootstrap-secret problem (solved with instance metadata in AWS) is even harder, since PXE is completely unencrypted

Of course, we could throw money and engineers at the problem, but our experience with OpenStack suggests it's going to take a lot of money (better OOB control and/or newer, better hardware; maybe a dedicated install network) and a LOT of engineering (we would have to build a solution largely from scratch, spanning OS X and Windows).  There were no useful commercial alternatives as of our investigation a year or two ago.  Vendors say "sure, we do bare metal", then you get into the WebEx and they say "oh, yeah, we don't do that, but we can install *our* application on bare metal using DVDs."

That is assuming that we have relatively few pools (e.g., a half-dozen pools of 50 or so machines each).  It seems that TC in AWS is trending toward having more workerTypes (which would be one-to-one with pools) with fewer instances, which is OK in AWS but will substantially reduce our flexibility on hardware.  Even if we did have automated provisioning, consider the situation if we had 50 pools of 5 machines each.  If a spate of try jobs spikes the pending count in one of those pools, and the provisioner decides it needs 7 machines in the pool, it must find two other pools to steal hardware from.  Those pools will each find their capacity depleted by 20% (one machine out of five), and the oversubscribed pool won't have its new capacity for over an hour.

There are workarounds -- caching images on local drives to speed reinstalls; writing a bootloader that makes an API call to see what already-installed OS it should boot; using a SAN with copy-on-write semantics to boot operating systems; or even using HyperV or VMware and accepting the level of performance instability it brings (aka, giving up on Firefox performance measurement).  None of them are very appealing.

If we accept fixed naming for hardware, rather than try to rename every time we re-image, we could make *manual* reimaging quite a bit easier, and balance pools that way.  We could even build a tool that looks at historical load figures and recommends a mix, then manually re-provision to achieve that mix every week.  As suggested above, I don't think rapid, dynamic re-provisioning is practical anyway.
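The load-recommendation tool suggested above could be sketched roughly as follows. The pool names, load figures, and the simple proportional-allocation heuristic are all illustrative assumptions, not an actual design:

```python
# Hypothetical sketch: recommend a fixed pool mix from historical load.
# Pool names and load figures are made up for illustration.

def recommend_mix(historical_load, total_machines):
    """Allocate machines to pools proportionally to historical load,
    guaranteeing at least one machine per pool."""
    total_load = sum(historical_load.values())
    mix = {pool: max(1, round(total_machines * load / total_load))
           for pool, load in historical_load.items()}
    # Trim any overshoot caused by rounding/minimums, largest pools first.
    while sum(mix.values()) > total_machines:
        biggest = max(mix, key=mix.get)
        mix[biggest] -= 1
    return mix

load = {"win8": 120, "win10": 300, "osx": 180}  # tasks/week, invented
print(recommend_mix(load, 60))  # → {'win8': 12, 'win10': 30, 'osx': 18}
```

A human (or puppet) would then re-provision toward the recommended mix on a weekly cadence, which matches the "no rapid dynamic re-provisioning" conclusion above.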
We'll also need this for linux hardware, for Talos tests.

Regardless of whether we want to try to support automatic OS installations (and I agree, that's a lot of work and probably not something we need for an initial rollout, at least), will we need some kind of mozpool-esque hardware manager to monitor machines, doing things like handling reboots and hangs?  Are we assuming workers on the hardware will claim TC tasks and we don't need any kind of intermediary?
Flags: needinfo?(jgriffin)
In its simplest form a provisioner does the following:
  1) Polls the queue for the number of pending tasks
  2) Launches machines as needed
If we have a fixed hardware pool, we don't need a provisioner.

@dustin, I always favor fewer workerTypes both in and outside aws, based on the theory that:
  fewer workerTypes => more tasks per type => constant task flow => predictable load => less idle-time
In practice, keeping caches hot sometimes justifies specialised workerTypes.

> will we need some kind of mozpool-esque hardware manager to monitor machines,
> doing things like handling reboots and hangs?
Not sure how much, but we should have some sort of health monitoring. Presumably, the machines will
eventually become buggy and need reformatting.

> Are we assuming workers on the hardware will claim TC tasks and we don't need any kind of intermediary?
That's what workers do. Workers don't talk to the provisioner; in aws-land there is a bit of chatter
for secrets and config, but we can just put those in a file on the machine.
Deploying docker-worker with a config file should be fairly straightforward.
There is no need for a provisioner, if some human pushes the start button.
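The "config/secrets in a file" approach might look like the sketch below. The field names (provisionerId, workerType, credentials) are illustrative guesses loosely modeled on TaskCluster worker configs, not a real schema:

```python
import json
import os
import tempfile

# Hypothetical worker config; all field names are assumptions, not a
# real docker-worker schema.
config = {
    "provisionerId": "hardware",   # static pool: no real provisioner behind it
    "workerType": "win10-hw",
    "workerGroup": "scl3",
    "credentials": {"clientId": "worker/win10-hw", "accessToken": "<secret>"},
}

# Puppet (or a human pushing the start button) would drop this file on the
# machine; the worker reads it at startup instead of asking a provisioner.
path = os.path.join(tempfile.mkdtemp(), "worker.json")
with open(path, "w") as f:
    json.dump(config, f, indent=2)

with open(path) as f:
    print(json.load(f)["workerType"])  # → win10-hw
```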

---
 - I don't think we need a provisioner
   (we can always do that in the future, if we have bare-metal provisioning)
 - We might need health monitoring
 - We store configuration/secrets in a file on the machines

This obviously makes rollout of new worker code very painful, but hardware is already painful, right?
(Obviously we should automate the rollout process as much as possible.)
Flags: needinfo?(jopsen)
It won't be particularly painful -- puppet can take care of that fairly easily, depending on exactly how you want to orchestrate it.
I agree, we don't really need a provisioner for bare-metal hardware, unless you're managing reimaging.  If you are managing reimaging, it might be easier to just build something into the reimaging management tool (mozpool?) that checks Queue.pendingTasks() to see which ones are needed most.
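That pendingTasks check could be as simple as the sketch below; pending_counts stands in for a call to Queue.pendingTasks(), and the workerType names are invented:

```python
# Hypothetical sketch of a reimaging tool asking "which workerType has the
# deepest backlog?" before choosing which image to install next.
# pending_counts is a stand-in for Queue.pendingTasks() results.

def most_needed(pending_counts):
    """Return the workerType with the most pending tasks, or None if idle."""
    if not pending_counts or max(pending_counts.values()) == 0:
        return None
    return max(pending_counts, key=pending_counts.get)

print(most_needed({"win8-hw": 3, "win10-hw": 41, "osx-hw": 0}))  # → win10-hw
```

The reimaging management tool (mozpool or similar) would call this each time a machine frees up, rather than running any continuous provisioning loop.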
Flags: needinfo?(jhford)
Flags: needinfo?(catlee)
Flags: needinfo?(rail)
I'm going to close this then. Some other topics came up in the discussion above, but those can be spun off into separate bugs if required - it looks like a provisioner per se is not needed.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → WONTFIX