Closed Bug 635907 Opened 13 years ago Closed 13 years ago

new, consistent, higher powered foopies

Categories

(Infrastructure & Operations :: RelOps: General, task, P3)

x86_64
Linux

Tracking

(Not tracked)

RESOLVED DUPLICATE of bug 666044

People

(Reporter: bhearsum, Assigned: zandr)

Details

Currently, this machine is the staging Tegra host, and is running Linux. All of the production Tegra hosts are Macs. We should use one in staging, too. Especially now that we're running unittests, which require local binaries on the host, it's important to be consistent.
Joduinn: do we have any more rev1's?
Assignee: nobody → aki
Summary: replace bm-foopy with a mac → new, consistent, higher powered foopies
Bear, Ben:

Joduinn is on board with consistent, higher powered foopies.

Two options came up:


1) Order 5 ix linux boxen.  One would become staging; four would replace the current production foopies.

We get new hardware, hopefully without the fan problem.  Linux servers, fast, known, ~20 tegras per foopy.


2) Image up, say, 20 rev2 minis with snow leopard.

We currently have old hardware that's faster than our rev1's.  We run ~5 tegras per foopy and run into fewer port/resource contention issues, but rollouts of any new foopy-specific changes will need to be more automated than they have been.

(Save $$; save time since there's no ordering lead time; less performant servers but fewer daemons running on each; foopy management would almost have to move to puppet.  Joduinn sees this last part as a plus rather than a minus.)

We might have to do rolling graceful reboots of the foopies to pick up new puppet changes, but downing 5 tegras at a time would not be too difficult.


Do either of you have opinions here?  Joduinn is willing to place the order if we go with #1 and I'm willing to do initial non-puppet setup work either way we go.
some initial thoughts:

- the move to puppet will help iron out any missing details we have in our current situation, so yea, that's a plus
- using the 20 minis gives us a lot of redundancy with no hardware cost and a faster turnaround
- the IX boxes would be sweet but I'm wondering if they are overkill for the current requirements and could be useful someplace else in our group

there isn't any compelling reason to go IX over mini except for ops-related concerns (but since we already maintain a ton of minis, even that is minimal)

so unless someone has a compelling reason to pick IX, I'm ok with the minis but will not lose any sleep if we go IX
Zandr: I'll let you speak up before we make a decision.

How opposed would you be to having 20 rev2 minis running snow leopard be part of the Tegra production infrastructure?
I am loath to use a mini anywhere we could use something manageable instead.

Can someone point me at a doc that describes what a foopy does? Seems like the sort of thing we want to be high-availability, which means >1 spindle.
a "foopy" server is the host for the combination of tools needed to manage and run tests on the Tegras.  Each Tegra will have the proxy tool (clientproxy.py) and a Buildbot buildslave.  clientproxy.py will talk to the Tegra to check on its state and, if the Tegra is online, will start the buildslave.  The buildslave will then run any unit and/or talos tests, which communicate remotely with the SUTAgent running on the Tegra.

They will all be configured the same and do not require any special management beyond what we would normally require for a mini.  The redundancy is already built in because each foopy server will run between 10 and 20 clientproxy/buildslave instances, and they can be moved as quickly/easily as setting up a directory with the proper files and editing the buildbot.tac
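To make the clientproxy/buildslave relationship concrete, here is a minimal sketch of the watchdog loop described above. All names and ports here are assumptions for illustration (the SUTAgent port and the exact buildslave invocation are not specified in this bug); the real clientproxy.py does considerably more.

```python
# Hypothetical sketch of a clientproxy-style loop: poll the Tegra's
# SUTAgent, and keep a buildslave running while the device is reachable.
import socket
import subprocess
import time

SUT_PORT = 20701  # assumed SUTAgent TCP port, for illustration only


def tegra_online(host, port=SUT_PORT, timeout=5.0):
    """Return True if the Tegra's SUTAgent accepts a TCP connection."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def watch(host, slave_dir, poll_seconds=60):
    """(Re)start a buildslave for this Tegra whenever it is reachable."""
    slave = None
    while True:
        if tegra_online(host):
            if slave is None or slave.poll() is not None:
                # buildslave reads its config from slave_dir/buildbot.tac
                slave = subprocess.Popen(
                    ["buildslave", "start", "--nodaemon", slave_dir])
        time.sleep(poll_seconds)
```

One directory per Tegra (each with its own buildbot.tac) is what makes "moving" an instance between foopies as simple as copying that directory, as described above.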
Let me say this a bit differently, then. Minis are not production systems. On certain platforms, we don't have a choice. This is not one of them. :)

Rather than 5 IX boxes, what about a big machine with RAID and redundant PSUs? I see hints of port exhaustion mentioned above, which is... puzzling.
(In reply to comment #7)
> Let me say this a bit differently, then. Minis are not production systems. On
> certain platforms, we don't have a choice. This is not one of them. :)

:) - agreeing completely, I guess, but in a circuitous manner. I was laying out the benefits of the minis only for completeness.  The call is yours, as both setups will meet our needs.

> Rather than 5 IX boxes, what about a big machine with RAID and redundant PSUs?
> I see hints of port exhaustion mentioned above, which is... puzzling.

I worry about process memory and network I/O if we go with a big machine - it would have to have multiple cores and multiple NICs, as each clientproxy/buildslave combination will have a minimum of 8 python processes running (plus other test tool binaries), and each process will have multiple sockets open to the Tegra.

The reason Aki mentioned port exhaustion is because the Tegras don't run their own services like the N900, so each proxy/slave combo could start a web server, a devicemanager, xre binaries and others.  I'm not sure of the exact count but we could get that if required.
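A quick back-of-the-envelope check of the load described above. The 8-processes-per-combo figure comes from this comment; the sockets-per-process figure is an assumption for illustration, since the exact count isn't known here.

```python
# Rough per-foopy load estimate: each clientproxy/buildslave combo runs
# >= 8 python processes (per this comment); sockets-per-process is a
# guessed placeholder, not a measured number.
def foopy_load(tegras, procs_per_combo=8, sockets_per_proc=4):
    """Return (total processes, total open sockets) for one foopy."""
    procs = tegras * procs_per_combo
    return procs, procs * sockets_per_proc


# At the proposed 20-tegras-per-foopy upper bound:
procs, socks = foopy_load(20)  # 160 processes, 640 sockets (with the
                               # assumed 4 sockets per process)
```

Even with conservative guesses, a single big machine hosting all ~94 tegras would be juggling many hundreds of processes, which is the core of the concern above.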
drat, forgot to add to the above...

I think we would prefer the IX boxes instead of a large box as that would allow us to rollout updates to the foopy environment instead of having to do it all-or-nothing
Yes, having multiple machines lets us a) do multiple dev/staging envs, or b) redundant production envs that we can upgrade in a rolling fashion.

With 20 minis, losing a foopy means losing 5 tegras.
With 5 ix boxes, losing a foopy means losing 20 tegras.
With 1 HA box, losing a foopy means losing all 94 tegras.

"Losing" a foopy may include a-team or releng software updates, which may happen multiple times a week and isn't necessarily preventable from an ops perspective.

So yes, 3 seems like the bare minimum of foopies I'd want to have available: one staging, two prod.
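The blast-radius comparison above, expressed as a quick calculation (94 tegras total, split evenly across however many foopies we deploy; the even-split assumption is mine, which is why 5 foopies comes out at 19 rather than the "~20" quoted above):

```python
# How many tegras go offline when one foopy goes down, assuming the 94
# tegras are split evenly (ceiling division covers the uneven remainder).
def tegras_lost_per_foopy(total_tegras, foopies):
    return -(-total_tegras // foopies)  # ceiling division


minis = tegras_lost_per_foopy(94, 20)   # ~5 tegras lost per mini
ix = tegras_lost_per_foopy(94, 5)       # ~19-20 tegras lost per ix box
ha = tegras_lost_per_foopy(94, 1)       # all 94 tegras lost
```

Since "losing" a foopy includes routine software updates, the per-foopy blast radius is what drives the 3-foopy minimum above.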

It's not entirely port exhaustion as much as the possibility that our homegrown scripts might accidentally decide to use a port that another script or another tegra needs.  20 tegras per foopy is currently our arbitrary upper bound, and I'd rather not put much more on a single foopy... we run a buildbot slave, an httpd.js, a clientproxy.py daemon, a bcontroller.py, a devicemanager.py, and potentially other scripts/daemons per foopy, so even if ports weren't an issue, the number of processes that all need to be equally high priority might be an issue.
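One standard way to avoid the accidental-port-collision problem described above is to bind to port 0 and let the kernel pick a free ephemeral port, then hand the chosen port to whatever needs it. This is a general sketch, not what the foopy scripts actually do (they hard-code or compute ports per tegra).

```python
# Sketch: reserve an OS-chosen free port instead of hard-coding one per
# script, so two daemons on the same foopy can never collide.
import socket


def reserve_port(host="127.0.0.1"):
    """Bind to port 0; return (listening socket, the port the OS chose)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.bind((host, 0))
    sock.listen(5)
    return sock, sock.getsockname()[1]
```

The trade-off is that every consumer (buildbot config, test harness, the Tegra side) must learn the port at runtime rather than from a static config, which is part of why a fixed tegras-per-foopy cap is the simpler guard here.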
(In reply to comment #5)
> I am loath to use a mini anywhere we could use something manageable instead.

Also, are minis unmanageable when they're running OSX?

(Certainly would accept an answer of "yes" here; just wanted to reiterate that we're not planning on putting fedora on these.)
(In reply to comment #11)
> (In reply to comment #5)
> > I am loath to use a mini anywhere we could use something manageable instead.
> 
> Also, are minis unmanageable when they're running OSX?

They're much better than when running other OSs, but I still don't have full lights-out management and power control.

20 minis is also a LOT of rack space/switchports/management overhead.

If we do this with iX boxes, I'd want something better than our current builders, maybe a twin-2U2 (four hosts in 2U) with softraid and quad-core CPUs?

Numbers naturally fall at 16 or 24 Tegras/Foopy, actually. I'd be really happy with four fairly stout Linux machines.

I'm guessing that performance of the Foopy shows up in Talos results, and thus virtualization would be problematic?
(In reply to comment #12)
> (In reply to comment #11)
> > (In reply to comment #5)
> > > I am loath to use a mini anywhere we could use something manageable instead.
> > 
> > Also, are minis unmanageable when they're running OSX?
> 
> They're much better than when running other OSs, but I still don't have full
> lights-out management and power control.
> 
> 20 minis is also a LOT of rack space/switchports/management overhead.
> 
> If we do this with iX boxes, I'd want something better than our current
> builders, maybe a twin-2U2 (four hosts in 2U) with softraid and quad-core CPUs?
> 
> Numbers naturally fall at 16 or 24 Tegras/Foopy, actually. I'd be really happy
> with four fairly stout Linux machines.

This would work IMO as long as we keep the production foopies split properly - the staging one can be anywhere

> 
> I'm guessing that performance of the Foopy shows up in Talos results, and thus
> virtualization would be problematic?

well, the majority of the timed code does run on the Tegra, but the foopy host will be responding to web requests from the Tegra, so it is a concern, though only if we push the upper limit hard.

I would be worried, like Aki mentioned above, about the number of active processes per physical core; we need to make sure each proxy/slave combo doesn't become CPU-bound.

But again, that's something you can factor into your recommendation now that you know our concerns :)
(In reply to comment #12)
> 20 minis is also a LOT of rack space/switchports/management overhead.

Agreed.

> If we do this with iX boxes, I'd want something better than our current
> builders, maybe a twin-2U2 (four hosts in 2U) with softraid and quad-core CPUs?

No objections here.
Verify w/ Mr. Moneybags O'Duinn that the cost is ok, and I'm fine with that.
Depending on Mr. Hearsum's needs, this may be a "please order by EOD Friday to deliver+power up by next week" deal.

> Numbers naturally fall at 16 or 24 Tegras/Foopy, actually. I'd be really happy
> with four fairly stout Linux machines.

That seems to fall within our acceptable bounds.

> I'm guessing that performance of the Foopy shows up in Talos results, and thus
> virtualization would be problematic?

Any virtualization that casts further doubt on the reliability/stability of our performance numbers would be a potential time drain for us and ops.

If there's some virtualization that I am unaware of that doesn't alter performance numbers at all vs. real hardware, I'm open.
I have a slight preference for OS X based ones, because the things I've been working on have been easier to get working on them. If we go with Linux, I'd ask that we install a very modern one, to avoid hitting shared library issues like I did on bm-foopy.

I've been getting along well enough on the existing ones, so I'm in no need of quick turnaround.

Thanks for pushing this forward!
Ok, after discussions in IRC this went from a blocker-urgent bug to an enhancement/good-to-have bug.

New foopies will give us consistency and will keep us from being resource constrained.  They may also be easier to maintain than the 1 P3 + 4 rev1 minis we currently have.

A short term fix for the consistency issue would be replacing bm-foopy with a similar rev1 mini.

If getting rid of minis and having solid infra for the tegra testing is a goal here, we may want to revisit the 3-minis-behind-a-load-balancer as our Talos webserver as well.
For the short term fix, I now have a rev2 mini which we can image with snow leopard, which will (afaik) be close enough to the rev1 snow leopard minis foopy0{1..4}.
We no longer have a short term fix, due to the minis in bug 647051 mistakenly being sent to the a-team.

The long term fix is now more urgent... could you identify a set of manageable POSIX machines we should be running on, and get quotes?
Assignee: aki → server-ops-releng
Component: Release Engineering → Server Operations: RelEng
QA Contact: release → zandr
OK, I'll get those quoted. Are we committed to keeping the tegras all at one site?
The tegras all need to be local to their foopy and their bm-remote-* load-balanced web hosts.  Also, the tegras all need to be re-imageable and rebootable (the latter can be handled via a networked PDU).

Up til now, the above restrictions meant that we kept the tegras all in mtv1.
If we solve the above for more than one colo, then splitting them would make them more robust.  In that case, we should either expand this bug or file a second bug to track new webservers.
So it sounds like 'no', and we should do this with multiple machines.

Will spec out 4 speedy quad-core 1Us. :D
Assignee: server-ops-releng → zandr
Duping to the bug where I'll do the chassis buildout.

The new foopies will be built into a chassis with sets of Tegras.
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → DUPLICATE
Component: Server Operations: RelEng → RelOps
Product: mozilla.org → Infrastructure & Operations