set up for darwin10 build slaves intel xserves (maybe?)

RESOLVED WONTFIX

Status

P2
normal
RESOLVED WONTFIX
8 years ago
5 years ago

People

(Reporter: joduinn, Unassigned)

Tracking

Firefox Tracking Flags

(Not tracked)

Details

(Whiteboard: [slaveduty][buildslaves])

Met with spencer earlier today; he will coordinate with RelEng buildduty/slaveduty to take one of the intel-based 10.5 xserves out of production, and reimage it with 10.6.

This bug is to track:
* exact version of OSX. To start with, I'd prefer to use the same dot-version as we have on our 10.6 mini builders, but I don't know if that works on these xserves.
* possible new naming convention for 10.6 xserves, so we can distinguish from the 10.5 xserves and the 10.4 ppc xserves?

To start with, to prevent disruption to production, we should only put staging keys on this slave while we figure out the exact toolchain.
(In reply to comment #0)
> * exact version of OSX. To start with, I'd prefer to use the same dot-version
> as we have on our 10.6 mini builders, but dont know if that works on these
> xserves.

This is critical, actually.  Darwin10 build slaves are 10.2.0, while Darwin10 talos slaves are 10.6.0.  At this point, that's the only way puppet has to tell them apart.
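
As a rough illustration of the distinction above (a sketch only, not the actual puppet logic; the table and function names are hypothetical), keying the role off the kernel release looks something like this, and it shows why a machine on any other release cannot be classified:

```python
# Hypothetical sketch: map a Darwin kernel release to a slave role.
# The two releases come from this comment; everything else is illustrative.
ROLE_BY_KERNEL_RELEASE = {
    "10.2.0": "build",  # Darwin10 build slaves
    "10.6.0": "talos",  # Darwin10 talos slaves
}


def slave_role(kernel_release):
    """Return the role for a kernel release, or None if it is unknown."""
    return ROLE_BY_KERNEL_RELEASE.get(kernel_release)


def is_unclassifiable(kernel_release):
    """True when the release matches neither known slave type."""
    return slave_role(kernel_release) is None
```

On a live machine the release string could be read with `platform.release()`, which reports the same value as `uname -r`.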

> * possible new naming convention for 10.6 xserves, so we can distinguish from
> the 10.5 xserves and the 10.4 ppc xserves?

darwin10-xsrv-slaveNN
(and darwin9-xsrv-slaveNN and darwin8-ppc-xsrv-slaveNN, if we ever rename the others)

> To start with, to prevent disruption to production, we should only put staging
> keys on this slave while we figure out the exact toolchain.

If this isn't the exact same toolchain as on other darwin10 builders, then the one-offness will probably outweigh the benefit of using it, so let's see what happens.
Whiteboard: [slaveduty][buildslaves]
I assume "reimage" means "clone from the 10.6.0 builder refimage", in which case it must be the exact same version of OSX.  If this means building a new refimage, I think that we should first discuss how worthwhile this is, and how it fits into our Q1 goals.
After further offline discussions, pulling this back into RelEng, while other higher priority bugs get worked on by ServerOps first.
Assignee: server-ops-releng → nobody
Component: Server Operations: RelEng → Release Engineering
QA Contact: zandr → release
(In reply to comment #1)

> This is critical, actually.  Darwin10 build slaves are 10.2.0, while Darwin10
> talos slaves are 10.6.0.  At this point, that's the only way puppet has to tell
> them apart.

Fixing that should be considered to block any new 10.6 refimages. The probability that we can source hardware that runs anything older than last week's MacOS is usually close to zero.
zandr: yes.  And there are a few puppet bugs for that (notably bug 637686)

Updated

8 years ago
Priority: -- → P4
Whiteboard: [slaveduty][buildslaves] → [slaveduty][buildslaves][waiting for other high priority work gets done]
Summary: reimage an idle xserve with osx10.6 → create a darwin10 image for intel xserves
No longer blocks: 590345
Duplicate of this bug: 590345
I am going to be working on creating this image.  For now, my priority is to get xserves to parity with our existing darwin10 builders.  Updating the darwin10 builders to the latest darwin release is a different project outside the scope of this bug.
Assignee: nobody → jhford
Status: NEW → ASSIGNED
Priority: P4 → P2
Whiteboard: [slaveduty][buildslaves][waiting for other high priority work gets done] → [slaveduty][buildslaves]
Hardware: x86 → x86_64
Once completed, I will file a bug to reimage and rename the following machines:

bm-xserve06 -> darwin10-xs-slave01
bm-xserve07 -> darwin10-xs-slave02
bm-xserve08 -> darwin10-xs-slave03
bm-xserve09 -> darwin10-xs-slave04
bm-xserve10 -> darwin10-xs-slave05
bm-xserve11 -> darwin10-xs-slave06
bm-xserve12 -> darwin10-xs-slave07
bm-xserve15 -> darwin10-xs-slave08
bm-xserve16 -> darwin10-xs-slave09
bm-xserve17 -> darwin10-xs-slave10
bm-xserve18 -> darwin10-xs-slave11
bm-xserve19 -> darwin10-xs-slave12
bm-xserve20 -> darwin10-xs-slave13
bm-xserve21 -> darwin10-xs-slave14
bm-xserve22 -> darwin10-xs-slave15
bm-xserve23 -> darwin10-xs-slave16
bm-xserve24 -> darwin10-xs-slave17
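
For whoever ends up scripting the rename, the list above can be regenerated rather than typed by hand. This is a minimal sketch, assuming the old host numbers are exactly 06-12 and 15-24 as listed; the function name and the `prefix` parameter are hypothetical:

```python
# Old bm-xserve numbers taken from the list above (13 and 14 are skipped there).
OLD_NUMBERS = list(range(6, 13)) + list(range(15, 25))


def rename_plan(old_numbers, prefix="darwin10-xs-slave"):
    """Pair each old hostname with a sequentially numbered new hostname."""
    return [
        ("bm-xserve%02d" % old, "%s%02d" % (prefix, new))
        for new, old in enumerate(old_numbers, start=1)
    ]


if __name__ == "__main__":
    for old_name, new_name in rename_plan(OLD_NUMBERS):
        print("%s -> %s" % (old_name, new_name))
```

If fewer machines are converted, passing a shorter list of numbers keeps the new names contiguous.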
(In reply to comment #8)
> Once completed, I will file a bug to reimage and rename the following
> machines:
> 
> bm-xserve06 -> darwin10-xs-slave01
> bm-xserve07 -> darwin10-xs-slave02
> bm-xserve08 -> darwin10-xs-slave03
> bm-xserve09 -> darwin10-xs-slave04
> bm-xserve10 -> darwin10-xs-slave05
> bm-xserve11 -> darwin10-xs-slave06
> bm-xserve12 -> darwin10-xs-slave07
> bm-xserve15 -> darwin10-xs-slave08
> bm-xserve16 -> darwin10-xs-slave09
> bm-xserve17 -> darwin10-xs-slave10
> bm-xserve18 -> darwin10-xs-slave11
> bm-xserve19 -> darwin10-xs-slave12
> bm-xserve20 -> darwin10-xs-slave13
> bm-xserve21 -> darwin10-xs-slave14
> bm-xserve22 -> darwin10-xs-slave15
> bm-xserve23 -> darwin10-xs-slave16
> bm-xserve24 -> darwin10-xs-slave17

This comment assumes that all xserves are going to be converted to 10.6.  This may not be the case, so this list might end up being shorter.
I'd like to put this project back on ice.

First, when did this become a priority?  Last we discussed this, the costs far outweighed the benefits.  I haven't heard anything since except this flurry of activity.

Relops has a lot of scut-work to do already, and that's blocking the capacity to build a better system that requires less hand-holding.  There's a *significant* opportunity cost to adding new relops work for this or any other project.

Second, before you start imaging things up, what is your approach to the puppet version dependency from above?  It hasn't gone away yet, and won't until we build the better system I just mentioned.

Third, when we build the better system, we'll be migrating builders to it silo by silo.  Why make this expensive change now, only to make a similar change later?  Duplicate work is expensive, particularly with the limited personnel we have.
(In reply to comment #10) 
> First, when did this become a priority?  Last we discussed this, the costs
> far outweighed the benefits.  I haven't heard anything since except this
> flurry of activity.

This is an easy way to add Mac builder capacity while waiting for the better system you describe below.

Of course the new system will be better _when it arrives_, but we have these xserves now, and they are currently doing (essentially) busy work as 10.5 builders. As our beefiest Macs, they could be much better utilized. 

We also have an engineering resource, jhford, who has time to investigate this *and* is actually in MV if it becomes a question of hands-on debugging.

> Relops has a lot of scut-work to do already, and that's blocking the
> capacity to build a better system that requires less hand-holding.  There's
> a *significant* opportunity cost to adding new relops work for this or any
> other project.

We need to be able to do both, quite frankly. We can't tell developers to sit on their hands while we wait for our new system.

We recently trained up a bunch of IT interns to re-image tegras. Could any one of these individuals be tasked with doing the basic imaging jhford needs? 

> Second, before you start imaging things up, what is your approach to the
> puppet version dependency from above?  It hasn't gone away yet, and won't
> until we build the better system I just mentioned.

So I think we agree that the first step here is to make sure we can match the version with the existing builders. If we can't, jhford will figure out where we go from there. Do we hold off pending new hardware? Do we rev all the builders? Can that be done via puppet? etc...

The plan of record here is to do as minimal a base OS install as possible, add puppet, and then perform the rest of the setup via puppet.
 
> Third, when we build the better system, we'll be migrating builders to it
> silo by silo.  Why make this expensive change now, only to make a similar
> change later?  Duplicate work is expensive, particularly with the limited
> personnel we have.

This is a red herring. As you state, we'll need to move these xserves forward at some point, possibly to match newly acquired hardware. Until you can give me a timeline for that, we're working with what we have.

Again, let me reiterate that jhford *has time* to work on this, and his first step is making sure that we can match the current builder OS version before we decide how much further to take this.
(In reply to comment #11)
> Again, let me reiterate that jhford *has time* to work on this, and his
> first step is making sure that we can match the current builder OS version
> before we decide how much further to take this.

My concern is not adding another item to the Build:IT:Priorities page above "build the new system."  I can't give you a timeline for the new system because currently it's below the ever-gets-done threshold on that list, and every time the list gets short, we come up with new things to put onto it.

The xserves are in sjc1 and are not remotely manageable, so installing and imaging them is difficult and doesn't multi-task well with other work.

Let's do this (with goals of minimizing relops work and also doing this "the new way", inasmuch as that new way is defined yet):

Ops fetches bm-xserve06 from sjc1, puts it somewhere in mtv1 where jhford can get to it with a crash cart, and supplies all available OSX install media (10.2.0 if possible, maybe 10.6.0, or maybe 10.7.0 now - and probably OS X Server is required?).  IIRC these suckers are loud, so maybe that means in AFK or one of the closets.

John builds out a base, non-slave image that starts up and communicates with puppet (using version 0.24.8) to download its configuration, preferably as outlined in bug 658678.  I can help with this.  This will be a new refimage, "OS X $version Base (xserve)", and should be described on the RefImages wiki as such, preferably with a short sequence of install steps.

John then uses DeployStudio to snapshot this base image, then starts working up puppet manifests that can build a builder from the base.  This will require some disambiguation of the test from build systems in puppet, but we can figure out how to solve that at the time.

Once that's ready, reimage with the base image, let puppet run, and then snapshot the result as the new darwin10-xs image.  This image is just a shortcut to avoid massive hits on the puppet server during deployment - the slave refimage is defined as (base refimage + puppet).

Then we can rename the other intel xserves and image them.
(In reply to comment #12) 
> Ops fetches bm-xserve06 from sjc1 and put it somewhere in mtv1 where jhford
> can get to it with a crash cart and supplies all available OSX install media
> (10.2.0 if possible, maybe 10.6.0, or maybe 10.7.0 now - and probably OS X
> Server is required?).  IIRC these suckers are loud, so maybe that means in
> AFK or one of the closets.
> [deletia]
> Then we can rename the other intel xserves and image them.

That sounds like a fine plan to me.
So, in the broad sense, I'm fine with bringing back an xserve or two to play with, especially if we learn how to make the LOM work while we're at it. Spending lots of time mucking about with imaging down in MPT is not something I have resources for, so I like this plan better.

There are two sets of xserves:

11x xserve1,1: Four 2.66 GHz Woodcrest cores, 4GB RAM, 2x80GB 7200RPM SATA drives
6x  xserve2,1: Four 2.80 GHz Penryn cores, 4GB RAM, 2x73GB 15.5k SAS drives

In both cases, the drives appear to be configured as RAID1, though I'm not sure of that. (I'm trying to be as un-intrusive as possible, so I'm ssh'ing in and running system_profiler)

My recommendation would be to bring back one of the xserve2,1 machines, as mentioned in the other bug. 

Capturing some conversation in #build this afternoon: it looks like xserve1,1's are about 1.5x faster than the current mini builders. xserve2,1's are probably more like 2x.
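
To put those rough ratios in capacity terms, here is a back-of-the-envelope sketch. It assumes the 1.5x and 2x estimates hold and that build throughput scales linearly with them, neither of which has been measured; the names are hypothetical:

```python
# (count, estimated speed relative to a current mini builder)
# Counts and ratios are the rough figures from this comment, not measurements.
XSERVE_FLEET = {
    "xserve1,1": (11, 1.5),
    "xserve2,1": (6, 2.0),
}


def mini_equivalents(fleet):
    """Estimate total capacity in units of one current mini builder."""
    return sum(count * speedup for count, speedup in fleet.values())


if __name__ == "__main__":
    print("estimated capacity: %.1f current-mini equivalents"
          % mini_equivalents(XSERVE_FLEET))
```

With these figures the 17 xserves come to about 28.5 current-mini equivalents, which is only as good as the two speed guesses.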

Some gotchas:

Snow Leopard Server isn't free, and it's serialized, so we'd need to buy 10.6 upgrade licenses for each of those machines. I've asked for real pricing at quantity 15, but list is $499.

A new mini is $699. I asked for a tag and mozconfig to test on a new mini, since I have a few of these around for the management lab.

$499/each to rehabilitate these machines as 10.6 builders doesn't seem like money well spent, especially with Lion right around the corner. (NB: Apple has said that Server will no longer be a separate product with Lion.)

Having said that, I'd like to get some real test data.

The alternative would be to run 10.6 desktop on that hardware, which seems like it's asking for trouble, and probably wouldn't let us get the LOM running.

It also looks like the 10.6 server distribution is actually 10.6.3. (thus Darwin 10.3.0) We might be able to get the bits for 10.6.0 from Mozilla's Apple Developer account, but I don't have those credentials, and I don't know who does. There should be one license in the Apple Dev account that we can use for experimentation.
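
The version pairing mentioned above (OS X 10.6.3 reporting Darwin 10.3.0) can be written as a small helper. This is a sketch that assumes the same pattern, OS X 10.6.N to Darwin 10.N.0, holds across the Snow Leopard releases; the function name is hypothetical:

```python
def darwin_release_for_snow_leopard(osx_version):
    """Map an OS X 10.6.N version string to the assumed Darwin release 10.N.0."""
    parts = osx_version.split(".")
    if parts[:2] != ["10", "6"]:
        raise ValueError("only OS X 10.6.x is handled: %r" % osx_version)
    # A bare "10.6" is treated as 10.6.0.
    patch = int(parts[2]) if len(parts) > 2 else 0
    return "10.%d.0" % patch
```

Read the other way, a builder reporting Darwin 10.2.0 would be running OS X 10.6.2 under this assumption.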

Please reply here with a tag and mozconfig that would be a useful comparison to make on a new mini, and I'll knock out some tests.

We could bring home an xserve2,1 in parallel, but it may not be worth the time and effort if a new mini can compete.
(In reply to comment #12)
> (In reply to comment #11)
> > Again, let me reiterate that jhford *has time* to work on this, and his
> > first step is making sure that we can match the current builder OS version
> > before we decide how much further to take this.
> 
> My concern is not adding another item to the Build:IT:Priorities page above
> "build the new system."  I can't give you a timeline for the new system
> because currently it's below the ever-gets-done threshold on that list, and
> every time the list gets short, we come up with new things to put onto it.

This project is specifically about baseOS install + puppet manifests to get us a working 10.6 builder slave. Having these manifests in place will speed up deploying new mac machines as production 10.6 builders, when we get those.

To make things less confusing, I've adjusted the subject to match.



> The xserves are in sjc1 and are not remotely manageable, so installing and
> imaging them is difficult and doesn't multi-task well with other work.
> 
> Let's do this (with goals of minimizing relops work and also doing this "the
> new way", inasmuch as that new way is defined yet):
> 
> Ops fetches bm-xserve06 from sjc1 and put it somewhere in mtv1 where jhford
> can get to it with a crash cart and supplies all available OSX install media
> (10.2.0 if possible, maybe 10.6.0, or maybe 10.7.0 now - and probably OS X
> Server is required?).  IIRC these suckers are loud, so maybe that means in
> AFK or one of the closets.
> 
> John builds out a base, non-slave image that starts up and communicates with
> puppet (using version 0.24.8) to download its configuration, preferably as
> outlined in bug 658678.  I can help with this.  This will be a new refimage,
> "OS X $version Base (xserve)", and should be described on the RefImages wiki
> as such, preferably with a short sequence of install steps.
> 
> John then uses DeployStudio to snapshot this base image, then starts working
> up puppet manifests that can build a builder from the base.  This will
> require some disambiguation of the test from build systems in puppet, but we
> can figure out how to solve that at the time.
> 
> Once that's ready, reimage with the base image, let puppet run, and then
> snapshot the result as the new darwin10-xs image.  This image is just a
> shortcut to avoid massive hits on the puppet server during deployment - the
> slave refimage is defined as (base refimage + puppet).
> 
> Then we can rename the other intel xserves and image them.



jhford, does this plan sound ok to you?
Summary: create a darwin10 image for intel xserves → create a darwin10 image and puppet manifests for intel xserves
(In reply to comment #15)

> This project is specifically about baseOS install + puppet manifests to get
> us a working 10.6 builder slave. Having these manifests in place will speed
> up deploying new mac machines as production 10.6 builders, when we get those.
> 
> To make things less confusing, I've adjusted the subject to match.

If the goal is to build up puppet manifests for a 10.6 builder slave for future use, then all the more reason to use a new mini. I can put one on jhford's desk /now/ and order a replacement.
Improving the puppet manifests to not key off the minor darwin version is something we'll need to do anyway.  Since this bug is mostly about xserves, let's devote it to the questions and experiments zandr laid out in comment 14.  Comment 16 points out that the upgrades to the puppet manifests can happen now.  I've opened a new bug for that: bug 661750, genericized the topic of this bug, and temporarily reassigned to zandr for the researching of comment 14.
Assignee: jhford → zandr
Summary: create a darwin10 image and puppet manifests for intel xserves → set up for darwin10 build slaves intel xserves (maybe?)
bug 661750 suggests that we're using 10.7.0 for these. If possible, I'd like to avoid that. We've hit issues with things like hdiutil changing in ways that break some of our scripts between minor revs of OS X in the past.
That comment should be on bug 661750.  I agree, but I don't know how practical that is.
12:39:51 PM joduinn: question about https://bugzilla.mozilla.org/show_bug.cgi?id=637988
12:40:15 PM joduinn: *couldnt follow comment14.*
[brief discussion elided]
12:51:09 PM joduinn: its not part of your colo trip today , so lower priority. /me takes the rest of this to bugmail

Haven't seen any further comment in the bug, so...

What were the other issues?
tl;dr: buy 2G apple RAM ASAP, install in a new mini

OK, spoke with jhford about this and here's the skinny:

At the moment, he's working on getting mobile on buildbot-0.8.2, but this remains a releng priority, so someone will be on it soon.  At that point, there are two projects: getting puppet to work with 10.6.0 builders - bug 661750; and determining the best hardware/os to use for the purpose (this bug).

As far as the latter work goes, let's get a "new mini" (as defined in comment 14) loaded up with 4GB of RAM and a fresh copy of 10.6.0, ready for releng to experiment with when the time comes (meaning order the extra 2GB of RAM now).  Armed with that information and the rough estimates of xserve build times in comment 14, we can make a call on the fastest route to success here - be that new minis, futzing with the old xserves, or maybe even server minis.

Comment 23

8 years ago
A new model should be out any time now, if it is worth considering:
http://buyersguide.macrumors.com/#Mac_Mini (I know I know)
We need to minimize the time we spend in analysis paralysis on this ticket - I'd like to have a purchasing decision made and in progress within a week of resuming work on this bug.  Given that iCloud was just announced, Apple won't be releasing anything new in the next few weeks, so I don't think that's worth considering here (although it's a big part of talos-r4 thoughts).
4GB or 8GB? Cost difference is small, we're already replacing all the RAM. (2x1GB out of the box, going to 2x2 or 2x4) I'm thinking 8GB would be a win if we're targeting throughput.

As to a new mini around the corner, leave playing chicken with Apple to me. I have a few of the C2D minis in hand for the management lab, and will backfill.
The drop-dead for a refresh being relevant to us is the release of Lion, which is "July". If there's no refresh before they ship Lion, the refresh won't run Snow Leopard.

If we're doing this with install+puppet, that should translate to a hypothetical refreshed Mac with little effort.
Let's go with 8GB, then.  Moar RAM is moar faster is moar better.
A lot of back and forth here but I think I can summarize this bug this way:

* RelEng needs to get an OSX10.6 reference image built on either an Xserve or a Mini.

To the best of my knowledge, neither has been done.

Once either is done, IT either:

1. buys 10.6 server licenses for $499 * 16
2. buys new Mac Minis for $699 * 16.

#1 is using legacy hardware and we're already starting to see more failures with the older legacy hardware.  #2 uses a lot less power which means we can put more machines in the same space.
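
For reference, the two totals work out as follows. This is a sketch using the list prices quoted in this bug and a count of 16; it leaves out the RAM upgrades, power, and any quantity discount, and the constant names are hypothetical:

```python
MACHINE_COUNT = 16
SERVER_LICENSE_LIST_PRICE = 499  # 10.6 Server license, list price per machine
NEW_MINI_PRICE = 699             # new Mac Mini, per machine


def total_cost(unit_price, count=MACHINE_COUNT):
    """Total spend in dollars for `count` units at `unit_price` each."""
    return unit_price * count


if __name__ == "__main__":
    licenses = total_cost(SERVER_LICENSE_LIST_PRICE)
    minis = total_cost(NEW_MINI_PRICE)
    print("option 1 (licenses): $%d" % licenses)
    print("option 2 (new minis): $%d" % minis)
    print("difference: $%d" % (minis - licenses))
```

At list price that is $7,984 for licenses against $11,184 for new minis, a difference of $3,200.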

There's already a Mini on John Ford's desk and memory en route to increase it.

How do I help get forward progress?
Assignee: zandr → nobody
That's accurate.  From an IRC conversation, it turns out that while in theory our refimages are <base OS + follow instructions on wiki + run puppet>, in fact the instructions on the wiki for our darwin10 builders are no longer functional, due to unavailable versions of OS X, XCode, and MacPorts libraries.  So in the process of running builds on faster hardware, we have two options for software: 1. run the existing refimage on faster hardware; or 2. build a completely new reference image (and do it right this time - this is effectively a hardware refresh).

The latter is massive scope-creep for this project, so we would like to exhaust all options for #1, and failing that, decide whether now is the time for a hardware refresh (#2).

So in terms of IT requirements, nothing changes: we'll try installing the darwin10 builder refimage on a new mini, and then get build times from that new mini (with 8GB RAM).  Failing that, we'll get an xserve into mtv1 to try the same, although I will owe jhford a lot of beer if the image even boots.  If we decide to do a hardware refresh, then the new mini is a likely platform, so it's still worth having on-hand.
I put bm-xserve06 back into production, so don't yank it without advance preparation.
With 160 new minis on the way and a new generation that is likely to be significantly faster just being released, I don't think we should proceed with this work.  Marking bug as such.
Status: ASSIGNED → RESOLVED
Last Resolved: 7 years ago
Resolution: --- → WONTFIX
(Assignee)

Updated

5 years ago
Product: mozilla.org → Release Engineering