Closed Bug 767658 Opened 12 years ago Closed 12 years ago

Infrastructure for 80 new tegras in mtv1

Categories

(Infrastructure & Operations :: DCOps, task)

x86
macOS
task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: arich, Assigned: dmoore)

References

Details

(Whiteboard: mtv1 u=fennec c=it p=3)

Derek and Ravi: this is a heads-up on a bug I just saw where 28 more tegras were just ordered; we're going to need space and infrastructure for these in mtv1. When Jake gets back to the office, he can take a look at the physical slots we have left in haxxor, but I'm reasonably certain that there won't be space for all of these.

Derek: what room would be best to put the rest of these in (we already have some in 2/idf), and can we get individually addressable power strips (like we have for the other tegras) and switches ordered to accommodate them if we don't already have them in place?
In talking to coop, this is a first batch from arrow.com, and nvidia has contacted joduinn and offered to donate some (unknown as yet) number of tegras as well.  So we're likely looking at a larger installation (sorry I can't give you exact numbers).  Do we know how many we could stand up in mtv before we run out of facilities?
Per the addition of more tegras in bug 772450, the total of new tegras is now confirmed at 81. A small handful of these will be backfills for dead boards (I'd guess maybe 10).
Summary: Infrastructure for 28 new tegras in mtv1 → Infrastructure for 80 new tegras in mtv1
dmoore: any word on where these might go, and have we ordered PDUs for them (at the very least)?  We're still blocked on SD cards right now, but once those are in and imaged, we would potentially be ready to slot these into place.
Whiteboard: mtv1
We have space for the original 27 tegras in MTV. Those tegras are not imaged yet, since the SD cards were not purchased when the tegras were purchased.

dmoore is working to find space for all of the new tegras, but I do not believe joduinn has received the additional 53 tegras from Nvidia yet.
If the remaining balance of 53 units takes more than three weeks to arrive, we'll probably want to defer them to the scl1 expansion. Current rough ETA there is 8/27.
(In reply to Derek Moore from comment #5)
> If the remaining balance of 53 units takes more than three weeks to arrive,
> we'll probably want to defer them to the scl1 expansion. Current rough ETA
> there is 8/27.

If we do that, we'll need to ensure we have at least one master for tegras in that colo (entirely new setup), the PDUs to handle it, and foopy machines there. Meanwhile, the tegra capacity is in dire need of more hardware [yeah yeah, I have software issues to work out to even handle it, but I suspect I can get that done within the next 1-2 weeks, max].

Three weeks is Aug 8. My question is: if these were to arrive on, say, Aug 10, must we really wait an extra 3 weeks to even have the colo space to host them, never mind the additional hardware we'd need in place to support them in a new colo?

And you state it's a rough ETA, which I presume is a low-end expectation. What's our likely ETA, and what's our "DCOps will convincingly not be a day later than [x] date, if we want to be pessimistic"?

(I'm not the one to make a call here, I'm just expressing what I see as potentially wrong with delaying in favor of another [even if better] colo)
(In reply to Justin Wood (:Callek) from comment #6)

Your concerns about support infrastructure are quite valid. I can't comment on the immediate suitability of scl1 for hosting tegras, so I'm hoping someone else can chime in here.

We are expanding scl1 for the explicit purpose of supporting the upcoming mobile initiatives. All the panda, beagle, etc. boards should be destined for this location.

The real question is: how much time are we willing to wait in order to do this right? Mountain View is an absolutely terrible location for hosting production infrastructure, and it causes tree closures on a regular basis. Can we hold our breath for one, two, or three weeks if it means being able to use a real production environment? Where's the cutoff?
We would need to redesign the racking solution to put tegras in scl1, and that hasn't been done yet because it's blocked behind other projects (OS X, panda, etc).
OK, this bug was supposed to be simply for rack space, power, etc., and I think we are set with that for the additional tegras we've received. If so, I'd like to close this bug; if not, please indicate what else needs to be done in mtv1 to host these new tegras.
Whiteboard: mtv1 → mtv1 u=fennec c=it p=3
colo-trip: --- → mtv1
Closing this; we have these set up.
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Product: mozilla.org → Infrastructure & Operations