Closed Bug 666044 Opened 13 years ago Closed 11 years ago

tegras rackmount solution

Categories

(Infrastructure & Operations :: RelOps: General, task, P2)

x86
macOS

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: joduinn, Assigned: dividehex)

per meeting with zandr yesterday: we have 4? 5? racks available for these 200 tegras. However, these racks still need:

* shelves
* PDUs
* network cable
* ...??


(zandr, anything I forgot?)
(In reply to comment #0)
> per meeting with zandr yesterday: we have 4? 5? racks available for these
> 200 tegras. However, these racks still need:

Available is a loose term, and slightly beyond the scope of this bug. There are four total racks available in scl1. Things we've discussed moving here: 

iX machines from mtv1 (1.2 racks)
200-300 Tegras (2-3 racks minimum)
42 DL120 G7s (1 rack)

I'd recommend leaving the first 100 Tegras and the iX machines in MTV until scl3 becomes available.
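
As a rough check on why not everything fits in scl1, the rack estimates above can be totalled against the four available racks. This is only a sketch of the arithmetic: the dictionary layout and names are illustrative, and the Tegra figure is stated as a minimum, so the real total could be higher.

```python
# Rack estimates quoted above, as (low, high) rack counts.
# The Tegra range is described as a minimum, so "high" is not a hard ceiling.
demand = {
    "iX machines from mtv1": (1.2, 1.2),
    "200-300 Tegras": (2.0, 3.0),
    "42 DL120 G7s": (1.0, 1.0),
}
available_racks = 4  # total racks available in scl1

low = sum(lo for lo, _ in demand.values())
high = sum(hi for _, hi in demand.values())

print(f"estimated demand: {low:.1f} to {high:.1f} racks")
print(f"available: {available_racks} racks")
print("fits" if high <= available_racks else "does not fit")
```

Even the low end of the estimate exceeds four racks, which is consistent with keeping the first 100 Tegras and the iX machines in MTV for now.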

> (zandr, anything I forgot?)

There are significant design and engineering tasks unresolved before we can do a production-quality deployment of these boards.

Must have:
* Better mechanical mounting of the Tegra boards themselves.
* A better power supply solution. I'm not going to install 200 giant wall warts dangling off 1' power cords in a real data center.


Open questions:
* USB (adb): Is this useful, and should it be part of the production system? (bug 665926)
* HDMI: Is an HDMI switch sufficient to convince the board it has a monitor, and is this useful? This has been discussed a few times, but to my knowledge never tested.
* Remote Buttons: We should be able to rig relays (or open drain outputs) to the buttons and provide access to imaging mode and hardware resets. It is an open question whether this replaces switched PDUs or has significant value on its own.

That's what I have off the top of my head.
Assignee: server-ops-releng → zandr
Depends on: 668526
So, I'm going to appropriate this bug to 'blog' the work on the production solution for the Tegras. 

The thumbnail sketch is 14 tegras and a foopy in a 4U case, with switches/relays/usb hubs/etc integral to the chassis. Details will follow.

As such, duping the 'spec new foopy' bug here.
Whiteboard: [android_tier_1]
Since bugzilla really isn't the place for this, project planning and tracking is happening on the wiki.
Severity: normal → critical
Summary: fit out racks to support 200 tegras → tegras rackmount solution
Combining bugs and capturing info from bug 668526:

In order to make Tegras a tier-one solution, we need to design and build a rackmount solution.

Requirements:
1) Proper airflow
2) no 'waterfall of wall-warts' power supplies
3) remote power management
4) remote imaging for a large percentage of failure modes (bug 665926)
5) stable mechanical mounting

Current thoughts are around a 4U box with a foopy and 14 tegras. 
Two high-current DC supplies to feed the tegras through usb relay boards.
USB for the tegras (and relay boards) connected to the foopy.
On-board unmanaged switch.

Design for enhancement if necessary: 
Add relay boards if pressing buttons remotely is needed. 
  (unfortunate, but reasonable)
Add front panel USB/video connections for crashcarting. 
  (needing this should be considered a failure mode)

14 Tegras + 1 Foopy in 4U gives us a density of 140 Tegras/rack
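
The density figure can be sanity-checked with a little arithmetic. The sketch below assumes a standard 42U rack, which is not stated in this bug; the function and constant names are illustrative only.

```python
import math

RACK_UNITS = 42          # assumption: standard full-height rack, not stated in the bug
CHASSIS_UNITS = 4        # 4U box, per the design above
TEGRAS_PER_CHASSIS = 14  # per the design above
TEGRAS_PER_RACK = 140    # density quoted above


def tegras_per_rack(chassis_count):
    """Tegras hosted by a given number of chassis in one rack."""
    return chassis_count * TEGRAS_PER_CHASSIS


def racks_needed(tegra_count):
    """Whole racks required for tegra_count boards at the quoted density."""
    return math.ceil(tegra_count / TEGRAS_PER_RACK)


# 140 Tegras per rack implies 10 chassis, which occupy 40U.
chassis_per_rack = TEGRAS_PER_RACK // TEGRAS_PER_CHASSIS
rack_units_used = chassis_per_rack * CHASSIS_UNITS
spare_units = RACK_UNITS - rack_units_used

print(chassis_per_rack, rack_units_used, spare_units)
print(racks_needed(200), racks_needed(300))
```

Under that assumption, ten chassis leave 2U spare per rack, and 200 to 300 Tegras need two to three racks.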
If we're hurting for tegra space, we can place the n900s in a box (place, not throw, hurl, smash, or other verbs :) with their power supplies and use their haxxor power and "racks" for tegras.
Severity: critical → normal
zandr has obtained phidget USB relay boards for prototyping
Assignee: zandr → jwatkins
Priority: -- → P2
So I've already seen pictures of this on twitter. Can we get an update here for posterity, please?
The pictures you have seen are of a panda chassis prototype.  We are still working on a tegra version since design requirements are different.
(In reply to Zandr Milewski [:zandr] from comment #2)
> So, I'm going to appropriate this bug to 'blog' the work on the production
> solution for the Tegras. 
> 
> The thumbnail sketch is 14 tegras and a foopy in a 4U case, with
> switches/relays/usb hubs/etc integral to the chassis. Details will follow.
> 
> As such, duping the 'spec new foopy' bug here.

This thumbnail feels obsolete now. We eventually want to get to "no need for foopy", but the ETA for that is unknown. In the meantime we are already off Mac Minis for new foopies and are using Linux HP machines [1U, iirc], so designing a chassis with a foopy built in feels wrong.

(In reply to Amy Rich [:arich] [:arr] from comment #4)
> Since bugzilla really isn't the place for this, project planning and
> tracking is happening on the wiki.

I don't see a wiki page linked here, is there one available?

(In reply to Amy Rich [:arich] [:arr] from comment #5)
> Combining bugs and capturing info from bug 668526:
> 
> In order to make Tegras a tier-one solution, we need to design and build a
> rackmount solution.
> 
> Requirements:
> 1) Proper airflow
> 2) no 'waterfall of wall-warts' power supplies
> 3) remote power management
> 4) remote imaging for a large percentage of failure modes (bug 665926)
> 5) stable mechanical mounting
> 
> Current thoughts are around a 4U box with a foopy and 14 tegras. 
> Two high-current DC supplies to feed the tegras through usb relay boards.
> USB for the tegras (and relay boards) connected to the foopy.
> On-board unmanaged switch.

The USB connections for the tegras should be independent of which foopy controls them. That way, if foopy25 dies, for example, we don't necessarily lose the tegras attached to it, and we can reshuffle dead or hurting tegras to new foopies without needing IT hands-on.

> Design for enhancement if necessary: 
> Add relay boards if pressing buttons remotely is needed. 
>   (unfortunate, but reasonable)

Yes, we do need a means of remotely power-cycling the devices, preferably via PDU SNMP as at present.

> Add front panel USB/video connections for crashcarting. 
>   (needing this should be considered a failure mode)

We need an *easy* way for IT to get hands-on with the devices for common tasks such as swapping SD cards or crashcarting.

Remote imaging is the ideal as well, but it is not the only failure mode we need to account for. For example, I don't want a single tegra that needs an sdcard swap to take 13 other tegras out of service while it is fixed.

> 14 Tegras + 1 Foopy in 4U gives us a density of 140 Tegras/rack

This density will change when we account for not having the foopy part of this inherent design.
Zandr's comment 2 is a year old, so I'm not surprised a lot has changed.  As I think you know, most of these problems have been solved, many in different ways than Zandr's.

Jake, is this bug serving any purpose anymore?  Should we just close it?
Whiteboard: [android_tier_1] → [android_tier_1] [2013Q1]
QA Contact: zandr → arich
The original impetus of bug 821400 was to replace Mac foopies with Linux ones. Just so I'm clear, do we intend to use a Linux foopy (i.e. not a Mac mini) in each tegra chassis?
The foopies won't be *in* the chassis (they aren't for pandas either).  But to the extent possible, yes, we will use Linux foopies.  We won't be moving mac foopies or building new ones.
Whiteboard: [android_tier_1] [2013Q1] → [android_tier_1] [2013Q2]
Whiteboard: [android_tier_1] [2013Q2] → [android_tier_1]
Whiteboard: [android_tier_1]
No longer blocks: 848992
No longer blocks: 848995
No longer blocks: 848983
Component: Server Operations: RelEng → RelOps
Product: mozilla.org → Infrastructure & Operations
The decision has been jointly made by IT, releng, a-team, and product to move the tegras to Evelyn and not into chassis in a datacenter.  They'll continue on as they currently are until they're decommissioned and won't ever be put in chassis.
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → WONTFIX