Closed Bug 767447 Opened 8 years ago Closed 8 years ago

(Shoe)rack, cable, and image 86 tegras

Categories

(Infrastructure & Operations :: DCOps, task)

ARM
Android
task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: coop, Assigned: van)

References

Details

(Whiteboard: mtv1 [reit-tegra])

Attachments

(2 files)

57.07 KB, application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
709 bytes, patch (coop: review+)
I don't know how much room we have in the existing shoeracks in haxxor due to attrition, but we have 28 new tegras coming in bug 765424 and they'll need to be set up somewhere.
Blocks: 767456
Blocks: 767457
Blocks: 767459
Depends on: 767658
It would have been nice if Jake and I had heard about this sooner so we could prepare Haxxor or 2/IDF for these additional devices. Can we get an ETA of arrival for them?
Matt: The delivery date will be listed in the blocker bug 765424.  And the bug for getting the infrastructure set up in mtv1 is open for dcops in the blocker bug 767658.
Assignee: server-ops-releng → jwatkins
colo-trip: --- → mtv1
Here is an update from the blocker bug.  

Ann Ignacio [:anni] 2012-06-27 13:54:21 PDT

I spoke to the vendor and they are finalizing our credit terms.  If we get approved, then the shipment should arrive on Friday.  I will update the bug if any issues arise.
The first step here should be to determine how many open slots we have in haxxor that have been vacated by dead tegras and replace those first.
Summary: (Shoe)rack, cable, and image 28 tegras → (Shoe)rack, cable, and image 27 tegras
Duplicate of this bug: 772450
Per joduinn:

Per nvidia, we are getting an additional 54 tegras next week (16th-23rd). I'll update bug with exact ETA once I have it.

This is in addition to the 27 tegras already delivered last week. Now that we
have concrete numbers, filing this bug to track planning & setup work in IT.
Summary: (Shoe)rack, cable, and image 27 tegras → (Shoe)rack, cable, and image 81 tegras
I've done the initial step of adding all 81 new boards to DNS.  DHCP/inventory will come later when we have the necessary info.
releng folks, has anyone ordered sd cards for these?
not that I know of
Depends on: 772636
Attached file tegras xls
dcops has received 27 of the tegras. attached is xls with their asset, mac, and serial.

van
van: can you verify the mac for 10819, please?  it didn't have enough characters in the ss.  Based on the others, I'm guessing it's supposed to be 020448010966.
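A short MAC like this is what a spreadsheet produces when a cell is treated as a number and the leading zero is dropped. As a minimal sketch (the helper name `normalize_mac` is hypothetical and is not the inventory script mentioned in this bug), the value can be padded back to 12 hex digits before it goes into inventory. This assumes the only damage is lost leading zeros; a truncated scan would be padded incorrectly, so the result still needs a check against the board label.

```python
import re

MAC_DIGITS = 12  # a MAC address is 48 bits, i.e. 12 hex digits


def normalize_mac(raw: str) -> str:
    """Return a 12-digit lowercase hex MAC, restoring dropped leading zeros.

    Hypothetical helper: assumes the only corruption is missing leading zeros.
    """
    # Strip common separators (colon, dot, dash, whitespace) before validating.
    digits = re.sub(r"[:.\-\s]", "", raw).lower()
    if not re.fullmatch(r"[0-9a-f]{1,12}", digits):
        raise ValueError(f"not a plausible MAC address: {raw!r}")
    return digits.zfill(MAC_DIGITS)
```

For example, `normalize_mac("20448010966")` returns `"020448010966"`, the value guessed above for asset 10819.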
tegra-289 - tegra-315 added to inventory and dhcp.
Going to assign this to van so he and Jake can work on it together.
Assignee: jwatkins → vle
Component: Server Operations: RelEng → Server Operations: DCOps
QA Contact: arich → dmoore
:arr, you are correct. there should have been a leading 0 for the mac address of asset 10819. i probably formatted the cells to "text" after scanning the first one and forgot to update it. i'll make sure it goes into inventory with the correct mac and i'll reattach the csv after i update it with hostnames. thanks for the catch.
Whiteboard: mtv1
Van: I made a guess and put it in inventory correctly, I think.

When you guys unbox the next batch, I can help with inventory, too.  I have a script that inputs most of the necessary data.  Or I can share the script with you if you have write access to inventory.
Oh, and if it wasn't clear, I've put the first 27 into inventory with hostname already, so no need to update the csv unless you need to do so for your purposes.
Whiteboard: mtv1 → mtv1 [reit-tegra]
It is my understanding that the additional 50+ tegras from Nvidia have arrived.
That is correct, SD cards also arrived for them.
We have 1 bad board from the first batch of the 27 boards we've received.

Hangs at: "Waiting for bootloader to initialize" - but it never does and quits.

Do we RMA these boards?

Regards,
Van
I am not seeing the shipment of 54 tegra boards in the MV office. Although there are a bunch of "tegra developer tablets" sitting here. Are these the 50+ tegras from NVidia that Melissa had mentioned?

-Vinh
hwine reached out in irc and will follow-up here, since I do not know what we were expecting from Nvidia.

On the 1 bad board I am not sure if we can RMA since they no longer make these boards I believe.
Just to capture state from irc with expectations and numbers (since this is the first I had heard of the specific types and quantities we were getting):

hwine: okay - total number sounds right - we ordered _both_ tegra boards & developer tablets. Both were supposed to have been delivered Monday - I don't know how different the packaging would be. that should be about 55 boards + 6 tablets as I understand it

Vinh: No boards.  All 61 of them are tablets.
:arr, I received write access to inventory from rtucker today. Can you show me your inventory scripts once I have the next spreadsheet filled out with location/pdu/switchports?

Thanks!
Van
(In reply to Van Le [:van] from comment #19)
> We have 1 bad board from the first batch of the 27 boards we've received.
> 
> Hangs at: "Waiting for bootloader to initialize" - but it never does and
> quits.
> 
> Do we RMA these boards?
> 
> Regards,
> Van

Yes, if that is one of the boards from arrow.com, please do RMA it.
(In reply to Amy Rich [:arich] [:arr] from comment #22)
> Just to capture state from irc with expectations and numbers (since this is
> the first I had heard of the specific types and quantities we were getting):
> 
> hwine: okay - total number sounds right - we ordered _both_ tegra boards &
> developer tablets. Both were supposed to have been delivered Monday - I
> don't know how different the packaging would be. that should be about 55
> boards + 6 tablets as I understand it
> 
> Vinh: No boards.  All 61 of them are tablets.

1) The latest batch of boards from nvidia came in perspex cases, but look to be tegra boards inside. After taking case apart and eyeballing them, they look identical to our "normal" tegra boards. 

2) :sal and I just imaged one of these as tegra-316; it imaged and booted just fine first time. arr added tegra-316 to dhcp with mac address 02044b00dccb and RelEng are now evaluating this board in staging. 

3) we never got a shipping/packing list with Monday's delivery, but :sal just manually counted 60 boards. Our previous delivery of 27 tegras from arrow.com had one dead, so we are down to 26 from arrow.com. Therefore we have 60+26 == 86 new tegras at this time. As this bug is being used to track both batches, tweaking summary to match.
Summary: (Shoe)rack, cable, and image 81 tegras → (Shoe)rack, cable, and image 86 tegras
(In reply to Van Le [:van] from comment #23)
> :arr, I received write access to inventory from rtucker today. Can you show
> me your inventory scripts once I have the next spreadsheet filled out with
> location/pdu/switchports?

I sent you email with the script to add new hosts and some basic directions/usage.
I'll send you another to add in the location/pdu/switchports and a third that meshes the two together.
I've got tegra-316 running in staging. Those with access can view progress here:

http://dev-master01.build.scl1.mozilla.com:8160/buildslaves/tegra-316

Summary: a mixture of green, red, and purple.

Green: yay!
Red: unable to post to graphs-stage. Rarely works on a good day, not too concerned here.
Purple: some early exceptions hit as I got things setup, and then some failures to reboot during the allotted time. Callek tells me this can be "normal," and that the retries this triggered seemed to work fine.

I'd say that it appears this tegra is working properly, but I'll leave it running overnight for more data.
(In reply to Chris Cooper [:coop] from comment #27)
> I've got tegra-316 running in staging. Those with access can view progress
> here:
> 
> http://dev-master01.build.scl1.mozilla.com:8160/buildslaves/tegra-316
> 
> Summary: a mixture of green, red, and purple.
> 
> Green: yay!
> Red: unable to post to graphs-stage. Rarely works on a good day, not too
> concerned here.
> Purple: some early exceptions hit as I got things setup, and then some
> failures to reboot during the allotted time. Callek tells me this can be
> "normal," and that the retries this triggered seemed to work fine.

Well, the error we see is "normal" (sometimes), and it is why we needed to set up verify.py here: the tegra just doesn't come back up fast enough. Turning purple because of this error has not traditionally been normal, so I suspect the purple is simply from a buildbot change I didn't catch going in.
Now that we have verified that what has been received is what we will deploy, when can I expect us to start getting these inventoried, imaged, and into racks?  This assumes, of course, that blocking bug 767658 gets resolved, since I believe the space issues have been taken care of.
The first 26 should be available on 7/31. Van has already imaged and racked them, they are waiting on network and power:

* Network should be completed on 7/30.

* The power cord pigtails we have in stock are not compatible, and a new batch was ordered. They are scheduled for delivery on 7/31, and we can bring the units online at that time.
Attached patch add 317, 318
Landing this right away, and turning these tegras on (including the dev-master slave changes which I will not attach here).

r?coop for the real changes
r?sal for the PDU entries as I was told in IRC
Attachment #646911 - Flags: review?(coop)
Comment on attachment 646911 [details] [diff] [review]
add 317, 318


>+        "pdu": "pdu4.df202-2.build.mtv1.mozilla.com",

Guessing this should be df202-1 based on other entries here, and a ping check. Landing with that change.
Attachment #646911 - Flags: review?(sespinoza)
(In reply to Justin Wood (:Callek) from comment #31)
> Created attachment 646911 [details] [diff] [review]
> add 317, 318
> 
> Landing this right away, and turning these tegras on (including the
> dev-master slave changes which I will not attach here).
> 
> r?coop for the real changes
> r?sal for the PDU entries as I was told in IRC

Ok, so 317:

1 successful job (but hit the "reboot during job" issue)
1 failed verify.py run on next jobs start (likely tegra not rebooted in time)
1 more job (failed graphserver then failed verify.py)

Currently "unable to write to sdcard". I was able to telnet in and check the mount, and it did not even have the sdcard mounted. Did a "rebt" on the tegra and it didn't come back; then did a powercycle via the PDU info, and it hasn't even gone down.

This one will need hands on.
No longer blocks: 767457, 767459
(In reply to Justin Wood (:Callek) from comment #33)
> (In reply to Justin Wood (:Callek) from comment #31)
> > Created attachment 646911 [details] [diff] [review]
> > add 317, 318
> > 
> > Landing this right away, and turning these tegras on (including the
> > dev-master slave changes which I will not attach here).
> > 
> > r?sal for the PDU entries as I was told in IRC
> 
> and 318:

Currently has SUTAgent seemingly hung (can't get a command prompt when telnetting in) and PDU is also not rebooting it.
Derek are the 26 Tegras mentioned in your comment on " Derek Moore 2012-07-27 14:59:35 PDT " up and ready in mtv1?
(In reply to Melissa O'Connor [:moconnor] from comment #35)
> Derek are the 26 Tegras mentioned in your comment on " Derek Moore
> 2012-07-27 14:59:35 PDT " up and ready in mtv1?

They are fully imaged, racked, cabled, and powered. All that remains is the configuration of the new network switch added to accommodate them. This will be expedited today.
We have finished installing and configuring the CDUs and switch for the 26 hosts. There are 4 more that refuse to come up or are dead. 

tegra-289 -> not pingable. tried reflashing and different network cable/switch ports
tegra-291 -> DOA, no power
tegra-292 -> can't get into flash mode so unable to reimage/reflash.
tegra-301 -> no link light on nic. tried reflashing and using different network cable/switch port

Derek is running a test on the switch ports to see if there are any issues.
I have added tegra-289 to tegra-318 to nagios.
Blocks: 767459
No longer blocks: 767456
FYI, The following new tegras are failing their ping check and alerting in nagios:

tegra-289.build.mtv1
tegra-291.build.mtv1
tegra-292.build.mtv1
tegra-301.build.mtv1
tegra-303.build.mtv1 (not listed above)


And the following are responding to ping but failing the agent (tcp port) check and alerting in nagios:

tegra-290.build.mtv1
tegra-293.build.mtv1
tegra-296.build.mtv1
tegra-299.build.mtv1
tegra-306.build.mtv1
tegra-308.build.mtv1
tegra-311.build.mtv1
tegra-315.build.mtv1
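For anyone reproducing the second check by hand, a TCP connect test is enough to tell "pingable but agent down" apart from "agent listening". This is a minimal sketch, not the nagios check itself; the function name is hypothetical, and the agent port is left as a parameter because this bug does not state it.

```python
import socket


def agent_port_open(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the name and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False
```

A host that answers ping but returns `False` here matches the second list above: the board is on the network, but nothing is accepting connections on the agent port.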
(In reply to Amy Rich [:arich] [:arr] from comment #39)

> And the following are responding to ping but failing the agent (tcp port)
> check and alerting in nagios:
> 
> tegra-290.build.mtv1
> tegra-293.build.mtv1
> tegra-296.build.mtv1
> tegra-299.build.mtv1
> tegra-306.build.mtv1
> tegra-308.build.mtv1
> tegra-311.build.mtv1
> tegra-315.build.mtv1


I'm not sure what, if anything, dcops can do to help in this case. Any guidance?
I power cycled the hosts that were not responding on the agent port via the CDU, and these are now reachable:

tegra-290.build.mtv1
tegra-293.build.mtv1
tegra-296.build.mtv1
tegra-299.build.mtv1

I also tried the others that weren't responding to the agent, but they didn't come back up.  My suggestions for debugging are:

1) verify that the PDU info is correct.  I had a running ping to one of them as I power cycled it, and I didn't actually see any packet loss.
2) Try hard power cycling them, just to make sure it's not the PDU that's faulty.
3) If a power cycle still doesn't bring up the agent, try reimaging them.
*sigh*  I apologize, I didn't notice that they were on two separate PDUs.  Now that I've tried the *CORRECT* outlets for the other 4, 2 of them have also come up.  So, the ones that are unresponsive are just the ones that aren't responding to ping.

tegra-289.build.mtv1
tegra-291.build.mtv1
tegra-292.build.mtv1
tegra-301.build.mtv1
tegra-303.build.mtv1 (not listed above as expected)

And two that respond to ping, but the agent hasn't come up (after a confirmed reboot):


tegra-308.build.mtv1
tegra-311.build.mtv1
tegra-311 is up now, it was just slow to start the agent, apparently.
Attachment #646911 - Flags: review?(coop) → review+
What is the status of the 86 tegras being brought up?  What has come up and is available and what is left?
We're working on identifying space for the remaining ~60 units. Originally, we had hoped to occupy the space vacated by Bug 774829 and Bug 712456. Both of those have stalled, however, so we have no available rack space in mtv1.

We'll likely be using temporary shelves in Haxxor to finish this batch.
(In reply to Melissa O'Connor [:moconnor] from comment #46)
> ETA ?

We could use some feedback on urgency. We're about to slip into the London work week, which could easily delay this another 7-10 days.
Please bring up as many as you can before the work week and update the team on the progress as you go.

Thanks
I know it isn't the end of the week but can we get an update on progress? :)
The remaining bank of ~50 tegras are powered, flashed, and networked. All that remains is the (virtual) paperwork to place them into inventory and DHCP. Van and Vinh are planning to work with Amy on this today.
We set up a staging area for these Tegra boards in the fourth floor bull pen and were able to bring tegra-[319-365] up. Several of the boards were DOA and a lot of them came without AC adapters. I have emailed ctalbert to see if he has spares. I'll try to bring the remaining 5 up ASAP.

tegra-319.build.mtv1.mozilla.com is alive
tegra-320.build.mtv1.mozilla.com is alive
tegra-321.build.mtv1.mozilla.com is alive
...
tegra-365.build.mtv1.mozilla.com is alive

Thanks,
Van
I've added 319 - 370 to nagios.

Van, if you have a list of which ones were DOA, I can remove them.
Callek: The rest should likely be acked and referred to in this bug if they're alerting.
Let me rephrase that... acked and referred to in this bug if a power cycle doesn't fix them  :}
Hi Amy, 

366 is DOA. We've brought up tegra-[367-370]. This concludes the imaging/building of the Tegras we received. Please let me know if I'm missing anything.

Thanks,
Van
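The bracketed ranges used in this bug (tegra-[319-365], tegra-[367-370]) can be expanded into full hostnames for feeding into a ping sweep or a monitoring config. A minimal sketch follows; the function name and its parameters are hypothetical, and the domain is taken from the hostnames listed in this bug.

```python
from typing import Iterable, List


def expand_hosts(
    first: int,
    last: int,
    skip: Iterable[int] = (),
    prefix: str = "tegra-",
    domain: str = "build.mtv1.mozilla.com",
) -> List[str]:
    """Expand an inclusive numeric range into fully qualified hostnames."""
    skipped = set(skip)  # board numbers to leave out, e.g. known-dead units
    return [
        f"{prefix}{n}.{domain}"
        for n in range(first, last + 1)
        if n not in skipped
    ]
```

With these inputs, `expand_hosts(319, 370, skip=[366])` yields the 51 hosts reported as brought up, leaving out the DOA tegra-366.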
I have just done a PDU test on all the new tegras, (ping -- then PDU -- then ping again), with success... except...

the following still have issues:

PDU RESPONDS -- TEGRA STILL NOT PINGABLE:

tegra-289
tegra-291
tegra-292
tegra-301
tegra-308

WAS UP -- PDU CYCLE TEST BROUGHT IT DOWN -- STILL NOT PINGABLE _AFTER_ WAITING 3 MINUTES

tegra-290

POSSIBLY WRONG PDU INFO? (PDU trigger didn't reboot tegra):

tegra-335 and up...
 * should be on  "pdu2.dcops.build.mtv1.mozilla.com" according to inventory 
        **** or  "pdu3.dcops.build.mtv1.mozilla.com"
        **** or  "pdu4.dcops.build.mtv1.mozilla.com"
 * SNMP works, and returns a valid response, but the tegra *stays* pingable
 * Alternative, if the PDU info is right: these tegras still have a battery attached, or the power switch is set wrong.
 * Additional alternative: the PDU info per inventory could be wrong
 * If all of the above is "correct" we should reopen Bug 782099 and track/fix there, as it's likely a PDU config issue.

(p.s. for docs, the good ones seem to take almost *exactly* 70 seconds to boot [after pdu trigger] with the base image we have here)
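The ping, PDU, ping-again test above sorts each tegra into one of four outcomes. A rough sketch of that classification is below. It is not the script that was actually run: `ping` and `power_cycle` are hypothetical callables supplied by the caller (for example an ICMP check and an SNMP write to the PDU outlet), and the poll counts are assumptions chosen to give a 3 minute wait, comfortably past the roughly 70 second boot time noted above.

```python
import time
from typing import Callable

OK = "ok"
STAYED_UP = "pdu trigger did not reboot it (possibly wrong PDU info)"
STILL_DOWN = "pdu responds, still not pingable"
WENT_DOWN_FOR_GOOD = "was up, pdu cycle brought it down, did not come back"


def pdu_cycle_test(
    ping: Callable[[], bool],          # hypothetical: True if the host answers
    power_cycle: Callable[[], None],   # hypothetical: triggers the PDU outlet
    down_polls: int = 6,               # polls spent watching for the host to drop
    boot_polls: int = 36,              # 36 polls * 5 s = 3 minutes to come back
    poll_seconds: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Classify one host by pinging, power cycling, then pinging again."""
    was_up = ping()
    power_cycle()

    # Phase 1: a host that was up must drop off the network, otherwise the
    # outlet we cycled is probably not the one feeding this board.
    went_down = not was_up
    for _ in range(down_polls):
        if went_down:
            break
        went_down = not ping()
        if not went_down:
            sleep(poll_seconds)
    if not went_down:
        return STAYED_UP

    # Phase 2: wait for the host to boot and answer ping again.
    for _ in range(boot_polls):
        if ping():
            return OK
        sleep(poll_seconds)
    return WENT_DOWN_FOR_GOOD if was_up else STILL_DOWN
```

Passing `sleep` as a parameter keeps the function testable without real delays. A board that still has a battery attached would land in the "did not reboot" bucket even with correct PDU info, which is why that outcome is labeled only as "possibly" wrong.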
I'll email Ashlee and see if she has time to check on tegra-335 and up as well as Tegra-308.

The following Tegras are dead - comment 37:

tegra-289
tegra-291
tegra-292
tegra-301



Thanks for the heads up.

Van
What are the next steps for the tegras that we determine are dead?  I don't believe we can get replacements.
(In reply to Melissa O'Connor [:moconnor] from comment #57)
> What are the next steps for the tegras that we determine are dead?  I don't
> believe we can get replacements.

If they're from arrow.com, we should seek recompense. Not much we can do about the gift boards from nvidia.
:coop, :moconnor - I'm afraid we didn't make a distinction between the arrow.com and nvidia shipments, since they arrived at roughly the same time. Any thoughts on how to distinguish them, after the fact? Did the arrow.com order/shipment come with some kind of support information with serial numbers?
arr has reminded me that the arrow.com shipment were allocated as:

tegra-289 - tegra-315
Recap: all 4 doa boards are from arrow.com, so MoCo should get refunds. Who can do this? Can we start that process in a separate bug?
:callek,

I just confirmed that the pdu information for tegra-335+ is accurate by rebooting them manually via the web interface and tracing the power. There seems to be a config issue so I've updated Bug 782099.

Thanks,
Van
:hwine, I opened Bug 783944 to track the refund.
No longer blocks: 767459
Depends on: 784278
No longer depends on: 784278
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Attachment #646911 - Flags: review?(sespinoza)
Depends on: 786966
Product: mozilla.org → Infrastructure & Operations