Closed
Bug 767447
Opened 11 years ago
Closed 11 years ago
(Shoe)rack, cable, and image 86 tegras
Categories
(Infrastructure & Operations :: DCOps, task)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: coop, Assigned: van)
References
Details
(Whiteboard: mtv1 [reit-tegra])
Attachments
(2 files)
57.07 KB,
application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
|
Details | |
709 bytes,
patch
|
coop
:
review+
|
Details | Diff | Splinter Review |
I don't know how much room we have in the existing shoeracks in haxxor due to attrition, but we have 28 new tegras coming in bug 765424 and they'll need to be set up somewhere.
Comment 1•11 years ago
|
||
It would have been nice if Jake and I heard about this sooner so we could prepare Haxxor or 2/IDF for these additional devices. Can we get an ETA of arrival for them?
Comment 2•11 years ago
|
||
Matt: The delivery date will be listed in the blocker bug 765424. And the bug for getting the infrastructure set up in mtv1 is open for dcops in the blocker bug 767658.
Updated•11 years ago
|
Assignee: server-ops-releng → jwatkins
colo-trip: --- → mtv1
Comment 3•11 years ago
|
||
Here is an update from the blocker bug.

Ann Ignacio [:anni] 2012-06-27 13:54:21 PDT
I spoke to the vendor and they are finalizing our credit terms. If we get approved, then the shipment should arrive on Friday. I will update the bug if any issues arise.
Comment 4•11 years ago
|
||
The first step here should be to determine how many open slots we have in haxxor that have been vacated by dead tegras and replace those first.
Updated•11 years ago
|
Summary: (Shoe)rack, cable, and image 28 tegras → (Shoe)rack, cable, and image 27 tegras
Comment 6•11 years ago
|
||
Per joduinn:

Per nvidia, we are getting an additional 54 tegras next week (16th-23rd). I'll update bug with exact ETA once I have it. This is in addition to the 27 tegras already delivered last week. Now that we have concrete numbers, filing this bug to track planning & setup work in IT.
Summary: (Shoe)rack, cable, and image 27 tegras → (Shoe)rack, cable, and image 81 tegras
Comment 7•11 years ago
|
||
I've done the initial step of adding all 81 new boards to DNS. DHCP/inventory will come later when we have the necessary info.
Comment 8•11 years ago
|
||
releng folks, has anyone ordered sd cards for these?
Comment 9•11 years ago
|
||
not that I know of
Assignee | ||
Comment 10•11 years ago
|
||
dcops has received 27 of the tegras. attached is xls with their asset, mac, and serial. van
Comment 11•11 years ago
|
||
van: can you verify the mac for 10819, please? it didn't have enough characters in the ss. Based on the others, I'm guessing it's supposed to be 020448010966.
Comment 12•11 years ago
|
||
tegra-289 - tegra-315 added to inventory and dhcp.
Comment 13•11 years ago
|
||
Going to assign this to van so he and Jake can work on it together.
Assignee: jwatkins → vle
Component: Server Operations: RelEng → Server Operations: DCOps
QA Contact: arich → dmoore
Assignee | ||
Comment 14•11 years ago
|
||
:arr, you are correct. there should have been a leading 0 for the mac address of asset 10819. i probably formatted the cells to "text" after scanning the first one and forgot to update it. i'll make sure it goes into inventory with the correct mac and i'll reattach the csv after i update it with hostnames. thanks for the catch.
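The leading-zero loss described above can be guarded against when importing the spreadsheet. What follows is only a minimal sketch, not the inventory script mentioned elsewhere in this bug: the function name `normalize_mac` and its `width` parameter are made up for illustration, and it assumes MACs arrive as strings or integers of bare hex digits, optionally separated by colons or hyphens.

```python
import re

def normalize_mac(raw, width=12):
    """Return a MAC as `width` lowercase hex digits, restoring lost leading zeros.

    Hypothetical helper for illustration; not part of any inventory tooling.
    """
    # Strip colon/hyphen separators and whitespace; str() lets integer cells through.
    digits = re.sub(r"[:\-\s]", "", str(raw)).lower()
    if not re.fullmatch(r"[0-9a-f]+", digits):
        raise ValueError(f"not a hex MAC: {raw!r}")
    if len(digits) > width:
        raise ValueError(f"too many digits for a MAC: {raw!r}")
    # A cell that was stored as a number drops leading zeros; pad them back.
    return digits.zfill(width)

print(normalize_mac("20448010966"))
print(normalize_mac("02:04:4B:00:DC:CB"))
```

Padding only restores zeros at the front, so a value that is short for any other reason (a digit missed while scanning, say) would still need a manual check against the board label.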
Assignee | ||
Updated•11 years ago
|
Whiteboard: mtv1
Comment 15•11 years ago
|
||
Van: I made a guess and put it in inventory correctly, I think. When you guys unbox the next batch, I can help with inventory, too. I have a script that inputs most of the necessary data. Or I can share the script with you if you have write access to inventory.
Comment 16•11 years ago
|
||
Oh, and if it wasn't clear, I've put the first 27 into inventory with hostname already, so no need to update the csv unless you need to do so for your purposes.
Updated•11 years ago
|
Whiteboard: mtv1 → mtv1 [reit-tegra]
Comment 17•11 years ago
|
||
It is my understanding that the additional 50+ tegras from Nvidia have arrived.
Comment 18•11 years ago
|
||
That is correct, SD cards also arrived for them.
Assignee | ||
Comment 19•11 years ago
|
||
We have 1 bad board from the first batch of the 27 boards we've received.

Hangs at: "Waiting for bootloader to initialize" - but it never does and quits.

Do we RMA these boards?

Regards,
Van
Comment 20•11 years ago
|
||
I am not seeing the shipment of 54 tegra boards in the MV office. Although there are a bunch of "tegra developer tablets" sitting here. Are these the 50+ tegras from NVidia that Melissa had mentioned? -Vinh
Comment 21•11 years ago
|
||
hwine reached out in irc and will follow-up here, since I do not know what we were expecting from Nvidia. On the 1 bad board I am not sure if we can RMA since they no longer make these boards I believe.
Comment 22•11 years ago
|
||
Just to capture state from irc with expectations and numbers (since this is the first I had heard of the specific types and quantities we were getting):

hwine: okay - total number sounds right - we ordered _both_ tegra boards & developer tablets. Both were supposed to have been delivered Monday - I don't know how different the packaging would be. that should be about 55 boards + 6 tablets as I understand it

Vinh: No boards. All 61 of them are tablets.
Assignee | ||
Comment 23•11 years ago
|
||
:arr, I received write access to inventory from rtucker today. Can you show me your inventory scripts once I have the next spreadsheet filled out with location/pdu/switchports?

Thanks!
Van
Comment 24•11 years ago
|
||
(In reply to Van Le [:van] from comment #19)
> We have 1 bad board from the first batch of the 27 boards we've received.
>
> Hangs at: "Waiting for bootloader to initialize" - but it never does and
> quits.
>
> Do we RMA these boards?
>
> Regards,
> Van

Yes, if that is one of the boards from arrow.com, please do RMA it.
Comment 25•11 years ago
|
||
(In reply to Amy Rich [:arich] [:arr] from comment #22)
> Just to capture state from irc with expectations and numbers (since this is
> the first I had heard of the specific types and quantities we were getting):
>
> hwine: okay - total number sounds right - we ordered _both_ tegra boards &
> developer tablets. Both were supposed to have been delivered Monday - I
> don't know how different the packaging would be. that should be about 55
> boards + 6 tablets as I understand it
>
> Vinh: No boards. All 61 of them are tablets.

1) The latest batch of boards from nvidia came in perspex cases, but look to be tegra boards inside. After taking a case apart and eyeballing them, they look identical to our "normal" tegra boards.

2) :sal and I just imaged one of these as tegra-316; it imaged and booted just fine first time. arr added tegra-316 to dhcp with mac address 02044b00dccb and RelEng are now evaluating this board in staging.

3) we never got a shipping/packing list with Monday's delivery, but :sal just manually counted 60 boards.

Our previous delivery of 27 tegras from arrow.com had one dead, so we are down to 26 from arrow.com. Therefore we have 60+26 == 86 new tegras at this time. As this bug is being used to track both batches, tweaking summary to match.
Summary: (Shoe)rack, cable, and image 81 tegras → (Shoe)rack, cable, and image 86 tegras
Comment 26•11 years ago
|
||
(In reply to Van Le [:van] from comment #23)
> :arr, I received write access to inventory from rtucker today. Can you show
> me your inventory scripts once I have the next spreadsheet filled out with
> location/pdu/switchports?

I sent you email with the script to add new hosts and some basic directions/usage. I'll send you another to add in the location/pdu/switchports and a third that meshes the two together.
Reporter | ||
Comment 27•11 years ago
|
||
I've got tegra-316 running in staging. Those with access can view progress here:

http://dev-master01.build.scl1.mozilla.com:8160/buildslaves/tegra-316

Summary: a mixture of green, red, and purple.

Green: yay!
Red: unable to post to graphs-stage. Rarely works on a good day, not too concerned here.
Purple: some early exceptions hit as I got things set up, and then some failures to reboot during the allotted time. Callek tells me this can be "normal," and that the retries this triggered seemed to work fine.

I'd say that it appears this tegra is working properly, but I'll leave it running overnight for more data.
Comment 28•11 years ago
|
||
(In reply to Chris Cooper [:coop] from comment #27)
> I've got tegra-316 running in staging. Those with access can view progress
> here:
>
> http://dev-master01.build.scl1.mozilla.com:8160/buildslaves/tegra-316
>
> Summary: a mixture of green, red, and purple.
>
> Green: yay!
> Red: unable to post to graphs-stage. Rarely works on a good day, not too
> concerned here.
> Purple: some early exceptions hit as I got things set up, and then some
> failures to reboot during the allotted time. Callek tells me this can be
> "normal," and that the retries this triggered seemed to work fine.

Well, the error we see is "normal" (sometimes), and it is why we needed to set up verify.py here: the tegra just doesn't come back up fast enough. Going purple because of this error has not traditionally been normal, so I suspect the purple is simply the result of a buildbot change I didn't catch going in.
Comment 29•11 years ago
|
||
Now that we have verified that what has been received is what we will deploy, when can we expect to start getting these inventoried, imaged, and into racks? This assumes, of course, that the blocking bug 767658 gets resolved, since I believe space issues have been taken care of.
Comment 30•11 years ago
|
||
The first 26 should be available on 7/31. Van has already imaged and racked them, they are waiting on network and power:

* Network should be completed on 7/30.
* The power cord pigtails we have in stock are not compatible, and a new batch was ordered. They are scheduled for delivery on 7/31, and we can bring the units online at that time.
Comment 31•11 years ago
|
||
Landing this right away, and turning these tegras on (including the dev-master slave changes which I will not attach here).

r?coop for the real changes
r?sal for the PDU entries as I was told in IRC
Attachment #646911 -
Flags: review?(coop)
Comment 32•11 years ago
|
||
Comment on attachment 646911 [details] [diff] [review]
add 317, 318

>+ "pdu": "pdu4.df202-2.build.mtv1.mozilla.com",

Guessing this should be df202-1 based on other entries here, and a ping check. Landing with that change.
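The "based on other entries here" check lends itself to a quick script. The sketch below assumes the config is a mapping of device name to a dict with a `"pdu"` key (only that key is visible in the patch line quoted above); the function name `unknown_pdus`, the placeholder device names, and the full `df202-1` hostname are illustrative assumptions, not taken from the real file.

```python
def unknown_pdus(existing, new):
    """Return {device: pdu_host} for new entries whose PDU host is not
    used by any existing entry. Layout of the dicts is an assumption."""
    known = {entry["pdu"] for entry in existing.values() if "pdu" in entry}
    return {
        name: entry["pdu"]
        for name, entry in new.items()
        if "pdu" in entry and entry["pdu"] not in known
    }

# Placeholder device names; the df202-1 hostname is inferred from the review comment.
existing = {
    "existing-1": {"pdu": "pdu4.df202-1.build.mtv1.mozilla.com"},
    "existing-2": {"pdu": "pdu4.df202-1.build.mtv1.mozilla.com"},
}
new = {
    "new-1": {"pdu": "pdu4.df202-2.build.mtv1.mozilla.com"},
    "new-2": {"pdu": "pdu4.df202-1.build.mtv1.mozilla.com"},
}
print(unknown_pdus(existing, new))
```

A flagged host is not necessarily wrong, since a new PDU may genuinely have been installed, so a ping check like the one done here is still worth doing before landing.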
Updated•11 years ago
|
Attachment #646911 -
Flags: review?(sespinoza)
Comment 33•11 years ago
|
||
(In reply to Justin Wood (:Callek) from comment #31)
> Created attachment 646911 [details] [diff] [review]
> add 317, 318
>
> Landing this right away, and turning these tegras on (including the
> dev-master slave changes which I will not attach here).
>
> r?coop for the real changes
> r?sal for the PDU entries as I was told in IRC

Ok, so 317:

1 successful job (but hit the "reboot during job" issue)
1 failed verify.py run on next jobs start (likely tegra not rebooted in time)
1 more job (failed graphserver then failed verify.py)

currently "unable to write to sdcard"

was able to telnet in, check the mount and it did not even have the sdcard mounted. did a "rebt" on tegra, didn't come back, then did a powercycle via PDU info, and it hasn't even gone down.

This one will need hands on.
Updated•11 years ago
|
Blocks: android_4.0_testing
Comment 34•11 years ago
|
||
(In reply to Justin Wood (:Callek) from comment #33)
> (In reply to Justin Wood (:Callek) from comment #31)
> > Created attachment 646911 [details] [diff] [review]
> > add 317, 318
> >
> > Landing this right away, and turning these tegras on (including the
> > dev-master slave changes which I will not attach here).
> >
> > r?sal for the PDU entries as I was told in IRC

and 318: Currently has SUTAgent seemingly hung (can't get a command prompt when telnetting in) and PDU is also not rebooting it.
Comment 35•11 years ago
|
||
Derek are the 26 Tegras mentioned in your comment on " Derek Moore 2012-07-27 14:59:35 PDT " up and ready in mtv1?
Comment 36•11 years ago
|
||
(In reply to Melissa O'Connor [:moconnor] from comment #35)
> Derek are the 26 Tegras mentioned in your comment on " Derek Moore
> 2012-07-27 14:59:35 PDT " up and ready in mtv1?

They are fully imaged, racked, cabled, and powered. All that remains is the configuration of the new network switch added to accommodate them. This will be expedited today.
Assignee | ||
Comment 37•11 years ago
|
||
We have finished installing and configuring the CDUs and switch for the 26 hosts. There are 4 more that refuse to come up or are dead.

tegra-289 -> not pingable. tried reflashing and different network cable/switch ports
tegra-291 -> DOA, no power
tegra-292 -> can't get into flash mode so unable to reimage/reflash.
tegra-301 -> no link light on nic. tried reflashing and using different network cable/switch port

Derek is running a test on the switch ports to see if there are any issues.
Comment 38•11 years ago
|
||
I have added tegra-289 to tegra-318 to nagios.
Comment 39•11 years ago
|
||
FYI, the following new tegras are failing their ping check and alerting in nagios:

tegra-289.build.mtv1
tegra-291.build.mtv1
tegra-292.build.mtv1
tegra-301.build.mtv1
tegra-303.build.mtv1 (not listed above)

And the following are responding to ping but failing the agent (tcp port) check and alerting in nagios:

tegra-290.build.mtv1
tegra-293.build.mtv1
tegra-296.build.mtv1
tegra-299.build.mtv1
tegra-306.build.mtv1
tegra-308.build.mtv1
tegra-311.build.mtv1
tegra-315.build.mtv1
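For anyone reproducing the two groups above by hand, the distinction is between a host that fails ping and one that answers ping but has nothing listening on the agent port. The following is a rough sketch of that split, not the actual nagios checks: `agent_port_open` and `classify` are hypothetical names, the port is left as a parameter because the agent port is not stated in this bug, and the ping result is expected to come from whatever ping tooling is at hand.

```python
import socket

def agent_port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, timed out, unresolvable, unreachable: all count as "not up".
        return False

def classify(pingable, agent_up):
    """Map the two check results onto the groups used in the comment above."""
    if not pingable:
        return "no-ping"      # first list: failing the ping check
    if not agent_up:
        return "ping-only"    # second list: pingable, agent port not answering
    return "ok"

print(classify(pingable=True, agent_up=False))
```

A successful TCP connect only shows that something accepted the connection; an agent that accepts and then hangs (as reported for tegra-318 earlier) would still need a real command round trip to detect.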
Comment 40•11 years ago
|
||
(In reply to Amy Rich [:arich] [:arr] from comment #39)
> And the following are responding to ping but failing the agent (tcp port)
> check and alerting in nagios:
>
> tegra-290.build.mtv1
> tegra-293.build.mtv1
> tegra-296.build.mtv1
> tegra-299.build.mtv1
> tegra-306.build.mtv1
> tegra-308.build.mtv1
> tegra-311.build.mtv1
> tegra-315.build.mtv1

I'm not sure what, if anything, dcops can do to help in this case. Any guidance?
Comment 41•11 years ago
|
||
I power cycled the hosts that were not responding on the agent port via the CDU, and these are now reachable:

tegra-290.build.mtv1
tegra-293.build.mtv1
tegra-296.build.mtv1
tegra-299.build.mtv1

I also tried the others that weren't responding to the agent, but they didn't come back up. My suggestions for debugging are:

1) verify that the PDU info is correct. I had a running ping to one of them as I power cycled it, and I didn't actually see any packet loss.
2) Try hard power cycling them, just to make sure it's not the PDU that's faulty.
3) If a power cycle still doesn't bring up the agent, try reimaging them.
Comment 42•11 years ago
|
||
*sigh* I apologize, I didn't notice that they were on two separate PDUs. Now that I've tried the *CORRECT* outlets for the other 4, 2 of them have also come up.

So, the ones that are unresponsive are just the ones that aren't responding to ping:

tegra-289.build.mtv1
tegra-291.build.mtv1
tegra-292.build.mtv1
tegra-301.build.mtv1
tegra-303.build.mtv1 (not listed above as expected)

And two that respond to ping, but the agent hasn't come up (after a confirmed reboot):

tegra-308.build.mtv1
tegra-311.build.mtv1
Comment 43•11 years ago
|
||
tegra-311 is up now, it was just slow to start the agent, apparently.
Reporter | ||
Updated•11 years ago
|
Attachment #646911 -
Flags: review?(coop) → review+
Comment 44•11 years ago
|
||
What is the status of the 86 tegras being brought up? What has come up and is available and what is left?
Comment 45•11 years ago
|
||
We're working on identifying space for the remaining ~60 units. Originally, we had hoped to occupy the space vacated by Bug 774829 and Bug 712456. Both of those have stalled, however, so we have no available rack space in mtv1. We'll likely be using temporary shelves in Haxxor to finish this batch.
Comment 46•11 years ago
|
||
ETA ?
Comment 47•11 years ago
|
||
(In reply to Melissa O'Connor [:moconnor] from comment #46)
> ETA ?

We could use some feedback on urgency. We're about to slip into the London work week, which could easily delay this another 7-10 days.
Comment 48•11 years ago
|
||
Please bring up as many as you can before the work week and update the team on the progress as you go. Thanks
Comment 49•11 years ago
|
||
I know it isn't the end of the week but can we get an update on progress? :)
Comment 50•11 years ago
|
||
The remaining bank of ~50 tegras are powered, flashed, and networked. All that remains is the (virtual) paperwork to place them into inventory and DHCP. Van and Vinh are planning to work with Amy on this today.
Assignee | ||
Comment 51•11 years ago
|
||
We set up a staging area for these Tegra boards on the fourth floor bull pen and were able to bring tegra-[319-365] up. Several of the boards were DOA and a lot of them came without AC adapters. I have emailed ctalbert to see if he has spares. I'll try to bring the remaining 5 up ASAP.

tegra-319.build.mtv1.mozilla.com is alive
tegra-320.build.mtv1.mozilla.com is alive
tegra-321.build.mtv1.mozilla.com is alive
...
tegra-365.build.mtv1.mozilla.com is alive

Thanks,
Van
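For reference, the bracketed range shorthand used above expands to one fully qualified name per board. This is just an illustrative sketch; `expand_hosts` is a made-up helper, and the prefix and domain defaults are copied from the "is alive" lines in this comment.

```python
def expand_hosts(first, last, prefix="tegra-", domain="build.mtv1.mozilla.com"):
    """Return fully qualified names for prefix<first>..prefix<last>, inclusive."""
    if first > last:
        raise ValueError("first must not exceed last")
    return [f"{prefix}{n}.{domain}" for n in range(first, last + 1)]

hosts = expand_hosts(319, 365)
print(len(hosts))   # 47 boards in the 319-365 range
print(hosts[0])     # tegra-319.build.mtv1.mozilla.com
```

The same list can be fed to whatever does the alive check, so the range only has to be written down once.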
Comment 52•11 years ago
|
||
I've added 319 - 370 to nagios. Van, if you have a list of which ones were DOA, I can remove them. Callek: The rest should likely be acked and referred to in this bug if they're alerting.
Comment 53•11 years ago
|
||
Let me rephrase that... acked and referred to in this bug if a power cycle doesn't fix them :}
Assignee | ||
Comment 54•11 years ago
|
||
Hi Amy,

366 is DOA. We've brought up tegra-[367-370]. This concludes the imaging/building of the Tegras we received. Please let me know if I'm missing anything.

Thanks,
Van
Comment 55•11 years ago
|
||
I have just done a PDU test on all the new tegras (ping -- then PDU -- then ping again), with success... except... the following still have issues:

PDU RESPONDS -- TEGRA STILL NOT PINGABLE:
tegra-289
tegra-291
tegra-292
tegra-301
tegra-308

WAS UP -- PDU CYCLE TEST BROUGHT IT DOWN -- STILL NOT PINGABLE _AFTER_ WAITING 3 MINUTES:
tegra-290

POSSIBLY WRONG PDU INFO? (PDU trigger didn't reboot tegra):
tegra-335 and up...
* should be on "pdu2.dcops.build.mtv1.mozilla.com" according to inventory
**** or "pdu3.dcops.build.mtv1.mozilla.com"
**** or "pdu4.dcops.build.mtv1.mozilla.com"
* SNMP works, and returns a valid response, but the tegra *stays* pingable
* Alternate if PDU info is right, is that these tegras still have a battery attached, or power switch is wrong.
* Additional alternate is PDU Info per inventory could be wrong
* If all of the above is "correct" we should reopen Bug 782099 and track/fix there, as it's likely a PDU config issue.

(p.s. for docs, the good ones seem to take almost *exactly* 70 seconds to boot [after pdu trigger] with the base image we have here)
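The ping, PDU, ping sequence and the three failure groups above can be written down as a small decision procedure. This is a sketch of the logic only, not the script actually used for the test: `pdu_cycle_test` and its verdict strings are invented here, `ping` and `power_cycle` are caller-supplied callables (no real PDU or SNMP API is assumed), and the 180-second wait mirrors the 3 minutes mentioned above.

```python
import time

def pdu_cycle_test(host, ping, power_cycle, boot_wait=180, poll=10, sleep=time.sleep):
    """Ping, power-cycle via the PDU, then ping again; return a verdict string.

    ping(host) -> bool and power_cycle(host) are supplied by the caller.
    """
    was_up = ping(host)
    power_cycle(host)
    sleep(poll)  # give the outlet time to drop power
    if was_up and ping(host):
        # Board never went away: wrong outlet in inventory, battery attached, etc.
        return "pdu-did-not-reboot"
    waited = poll
    while True:
        if ping(host):
            return "ok"
        if waited >= boot_wait:
            break
        sleep(poll)
        waited += poll
    return "brought-down-by-cycle" if was_up else "still-not-pingable"
```

With healthy boards taking roughly 70 seconds to boot, a 10-second poll should see the board disappear before it comes back; a much longer first poll risks misreading a fast reboot as "pdu-did-not-reboot".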
Assignee | ||
Comment 56•11 years ago
|
||
I'll email Ashlee and see if she has time to check on tegra-335 and up as well as tegra-308.

The following Tegras are dead - comment 37:
tegra-289
tegra-291
tegra-292
tegra-301

Thanks for the heads up.
Van
Comment 57•11 years ago
|
||
What are the next steps for the tegras that we determine are dead? I don't believe we can get replacements.
Reporter | ||
Comment 58•11 years ago
|
||
(In reply to Melissa O'Connor [:moconnor] from comment #57)
> What are the next steps for the tegras that we determine are dead? I don't
> believe we can get replacements.

If they're from arrow.com, we should seek recompense. Not much we can do about the gift boards from nvidia.
Comment 59•11 years ago
|
||
:coop, :moconnor - I'm afraid we didn't make a distinction between the arrow.com and nvidia shipments, since they arrived at roughly the same time. Any thoughts on how to distinguish them, after the fact? Did the arrow.com order/shipment come with some kind of support information with serial numbers?
Comment 60•11 years ago
|
||
arr has reminded me that the arrow.com shipment was allocated as: tegra-289 - tegra-315
Comment 61•11 years ago
|
||
Recap: all 4 doa boards are from arrow.com, so MoCo should get refunds. Who can do this? Can we start that process in a separate bug?
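Since the two shipments were not kept apart physically, the hostname range is the only record of which boards came from arrow.com. A minimal sketch of that lookup follows; `vendor_for` and `ARROW_RANGE` are illustrative names, the range comes from comment 60, and the dead-board list from comment 56. Anything outside the range is reported as "other" rather than guessed at.

```python
ARROW_RANGE = range(289, 316)  # tegra-289 .. tegra-315 inclusive, per comment 60

def vendor_for(hostname):
    """Return "arrow.com" if the tegra number falls in the arrow.com allocation."""
    # Accepts "tegra-289" as well as "tegra-289.build.mtv1".
    number = int(hostname.split(".")[0].rsplit("-", 1)[1])
    return "arrow.com" if number in ARROW_RANGE else "other"

dead = ["tegra-289", "tegra-291", "tegra-292", "tegra-301"]
for name in dead:
    print(name, vendor_for(name))
```

The same function can be run over any later DOA reports before deciding whether they belong in the refund request.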
Assignee | ||
Comment 62•11 years ago
|
||
:callek,

I just confirmed that the pdu information for tegra-335+ is accurate by rebooting them manually via the web interface and tracing the power. There seems to be a config issue so I've updated Bug 782099.

Thanks,
Van
Assignee | ||
Comment 63•11 years ago
|
||
:hwine, I opened Bug 783944 to track the refund.
Updated•11 years ago
|
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Updated•11 years ago
|
Attachment #646911 -
Flags: review?(sespinoza)
Updated•9 years ago
|
Product: mozilla.org → Infrastructure & Operations