Bug 1429225 (Closed) - Opened 7 years ago, Closed 7 years ago

bad uplink module - moon4-1-access.rit47.inband.releng.mdc1.mozilla.net:16

Categories

(Infrastructure & Operations :: DCOps, task)

task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: van, Assigned: van)

References

Details

(Whiteboard: case 5326088334, 5326802525, 5328599861)

Strange issue on the Moonshot. Re-ran the fibers and tested the optics on both ends; finally moved to a different SFP+ port, which appears to have been the issue. 1/1/14 is currently reconfigured as part of the port channel until we can RMA the module.

XGE1/1/14   UP    10G(a)  F(a)  T  1
XGE1/1/15   UP    10G(a)  F(a)  T  1
XGE1/1/16   DOWN  auto    A     T  1
BAGG46      UP    40G(a)  F(a)  T  1

Error messages:

%Jan 9 19:36:46:383 2018 moon4-access.rit47.inband.releng.mdc1.mozilla.net IFNET/3/PHY_UPDOWN: Ten-GigabitEthernet1/1/16 link status is up.
%Jan 9 19:36:50:617 2018 moon4-access.rit47.inband.releng.mdc1.mozilla.net IFNET/3/PHY_UPDOWN: Ten-GigabitEthernet1/1/16 link status is down.
%Jan 9 19:36:50:929 2018 moon4-access.rit47.inband.releng.mdc1.mozilla.net IFNET/3/PHY_UPDOWN: Ten-GigabitEthernet1/1/16 link status is up.
%Jan 9 19:36:52:469 2018 moon4-access.rit47.inband.releng.mdc1.mozilla.net IFNET/3/PHY_UPDOWN: Ten-GigabitEthernet1/1/16 link status is down.
%Jan 9 19:36:52:764 2018 moon4-access.rit47.inband.releng.mdc1.mozilla.net IFNET/3/PHY_UPDOWN: Ten-GigabitEthernet1/1/16 link status is up.
%Jan 9 19:37:02:916 2018 moon4-access.rit47.inband.releng.mdc1.mozilla.net IFNET/3/PHY_UPDOWN: Ten-GigabitEthernet1/1/16 link status is down.
%Jan 9 19:37:03:211 2018 moon4-access.rit47.inband.releng.mdc1.mozilla.net IFNET/3/PHY_UPDOWN: Ten-GigabitEthernet1/1/16 link status is up.
%Jan 9 19:37:03:561 2018 moon4-access.rit47.inband.releng.mdc1.mozilla.net IFNET/3/PHY_UPDOWN: Ten-GigabitEthernet1/1/16 link status is down.
%Jan 9 19:37:03:854 2018 moon4-access.rit47.inband.releng.mdc1.mozilla.net IFNET/3/PHY_UPDOWN: Ten-GigabitEthernet1/1/16 link status is up.
%Jan 9 19:37:13:547 2018 moon4-access.rit47.inband.releng.mdc1.mozilla.net IFNET/3/PHY_UPDOWN: Ten-GigabitEthernet1/1/16 link status is down.
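(For reference, a rough sketch of the interim change described above: bringing 1/1/14 into the port channel and taking the flapping 1/1/16 out of service. This assumes a Comware-style switch driven via netmiko; the switch hostname comes from the bug summary, but the credentials and exact command sequence are placeholders/assumptions, not the config actually applied. Bridge-aggregation group 46 comes from the BAGG46 line above.)

from netmiko import ConnectHandler

switch = {
    "device_type": "hp_comware",
    "host": "moon4-1-access.rit47.inband.releng.mdc1.mozilla.net",
    "username": "netops",    # placeholder
    "password": "REDACTED",  # placeholder
}

# Interim workaround: add 1/1/14 to Bridge-Aggregation 46 and take the
# flapping 1/1/16 out of the bundle until the uplink module is RMA'd.
workaround = [
    "interface Ten-GigabitEthernet1/1/14",
    "port link-aggregation group 46",
    "quit",
    "interface Ten-GigabitEthernet1/1/16",
    "undo port link-aggregation group",
    "shutdown",
]

with ConnectHandler(**switch) as conn:
    output = conn.send_config_set(workaround)
    print(output)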
Blocks: 1428159
Summary: bad SFP+ interface 16 on Moonshot-45XGc → bad port - moon1-1-access.rit47.inband.releng.mdc1.mozilla.net:16
Opened case 5326088334 with HP for the RMA.
Assignee: server-ops-dcops → vle
Whiteboard: case 5326088334
The RMA part is on site in SCL3. I'd like to schedule a trip to replace the Moonshot switch and observe its behavior when we hot-swap it. Are there any issues with this, or do we need to schedule a CAB? Note: this is moon-chassis-4.
Flags: needinfo?(klibby)
Summary: bad port - moon1-1-access.rit47.inband.releng.mdc1.mozilla.net:16 → bad port - moon4-1-access.rit47.inband.releng.mdc1.mozilla.net:16
What exactly needs replacing, and what's the impact to the chassis? If it's just one port, then it should be fine whenever. If we need to take the whole chassis down, we should just let QA/releng know; I don't think it needs a TCW.
Flags: needinfo?(klibby)
> what's the impact to the chassis?

I was told these are hot-swappable since they're just the uplink modules and run as a redundant pair. It looks like they sent me the internal switch instead of the uplink module, so I'll have to contact HP again.
OK, should be good to go.
Didn't get an update when I emailed them referencing the old case number, so I went ahead and opened case 5326802525 for the 16-port SFP+ uplink module.
Whiteboard: case 5326088334 → case 5326088334, 5326802525
Still working with HPE on this. The previous tech I spoke to said we can do this live; however, the current tech is telling me we need an outage window. The guide says:

• Removing any component from bay A or bay B does not disrupt traffic for the other switch assembly.

So I'm asking for clarification. If this does cause any disruption, I'll open a CAB to handle it during the TCW. We're also pending shipment of the module, as they're currently out of stock.
Summary: bad port - moon4-1-access.rit47.inband.releng.mdc1.mozilla.net:16 → bad uplink module - moon4-1-access.rit47.inband.releng.mdc1.mozilla.net:16
Actually, now that I think about it, we will need an outage window: none of the cartridges are running LACP. We ran into PXE booting issues, so I believe that configuration has been removed. Will loop back once I get more info.
We only have a limited number of production workers on this chassis, so scheduling an outage should be relatively easy.
Replacing the switch didn't resolve the issue. Some things to note:

1) Since we've disabled the 2nd switch due to PXE/operational issues, there is no redundancy; when we remove switch 1, the whole chassis goes down.
2) Even with the replacement switch, port 1/1/16 continuously flaps. I'm not sure whether that's an issue with the chassis or the system board now. Will need to revisit the HPE case.
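(A rough sketch of how the continued flapping on 1/1/16 could be confirmed from the switch side, again assuming netmiko against the Comware CLI. The "Current state" parsing and the polling window are assumptions for illustration, not the exact procedure used here.)

import re
import time

from netmiko import ConnectHandler

switch = {
    "device_type": "hp_comware",
    "host": "moon4-1-access.rit47.inband.releng.mdc1.mozilla.net",
    "username": "netops",    # placeholder
    "password": "REDACTED",  # placeholder
}

def link_state(conn):
    # Comware's "display interface" reports "Current state: UP/DOWN";
    # exact wording may vary by release.
    out = conn.send_command("display interface Ten-GigabitEthernet1/1/16")
    match = re.search(r"Current state:\s*(\S+)", out)
    return match.group(1) if match else "UNKNOWN"

with ConnectHandler(**switch) as conn:
    last = link_state(conn)
    transitions = 0
    for _ in range(30):      # poll every 10s for ~5 minutes
        time.sleep(10)
        state = link_state(conn)
        if state != last:
            transitions += 1
            last = state
    print(f"1/1/16 changed state {transitions} times in ~5 minutes")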
As mentioned in our post-SCL3 meetings, this chassis and all of its blades will go down when we swap out the motherboard. I'd like to schedule next Friday, 4/20, at 10am for the hardware replacement. I'll open a CAB, but I'd like to know whether there are other people I need to contact and whether this date works before I coordinate with HP to send a tech out (I'll need to be on site as well).
Flags: needinfo?(klibby)
Flags: needinfo?(jwatkins)
Whiteboard: case 5326088334, 5326802525 → case 5326088334, 5326802525, 5328599861
(In reply to Van Le [:van] from comment #11)
> As mentioned in our post-SCL3 meetings, this chassis and all of its blades
> will go down when we swap out the motherboard. I'd like to schedule next
> Friday, 4/20, at 10am for the hardware replacement. I'll open a CAB, but I'd
> like to know whether there are other people I need to contact and whether
> this date works before I coordinate with HP to send a tech out (I'll need to
> be on site as well).

ETA on how long you expect to have the chassis offline? I assume not very long. We should just let jmaher and the buildduty folks know when we start, so that if we get alerts for pending counts they know it's related. And either relops or buildduty can quarantine all of the hosts on the chassis to block jobs from running. I think next Friday is fine.
Flags: needinfo?(klibby)
This will take 2-3 hours. We need to remove every component from the chassis when replacing the motherboard. I also plan on showing QTS the process for future smart-hands requests. I'll plan a CAB for next Friday if HP has the parts.
Flags: needinfo?(jwatkins)
CHG0012795 opened for CAB review.
CHG0012795 has been approved. These are the hosts I have in inventory:

xenserver136.ops.releng.mdc1.mozilla.com 31246
xenserver137.ops.releng.mdc1.mozilla.com 31247
xenserver138.ops.releng.mdc1.mozilla.com 31248
xenserver139.ops.releng.mdc1.mozilla.com 31249
xenserver140.ops.releng.mdc1.mozilla.com 31250
xenserver141.ops.releng.mdc1.mozilla.com 31251
xenserver142.ops.releng.mdc1.mozilla.com 31252
xenserver143.ops.releng.mdc1.mozilla.com 31253
xenserver144.ops.releng.mdc1.mozilla.com 31254
xenserver145.ops.releng.mdc1.mozilla.com 31255
xenserver146.ops.releng.mdc1.mozilla.com 31256
xenserver147.ops.releng.mdc1.mozilla.com 31257
xenserver148.ops.releng.mdc1.mozilla.com 31258
xenserver149.ops.releng.mdc1.mozilla.com 31259
xenserver150.ops.releng.mdc1.mozilla.com 31260
xenserver151.ops.releng.mdc1.mozilla.com 31261
xenserver152.ops.releng.mdc1.mozilla.com 31262
xenserver153.ops.releng.mdc1.mozilla.com 31263
xenserver154.ops.releng.mdc1.mozilla.com 31264
xenserver155.ops.releng.mdc1.mozilla.com 31265
xenserver156.ops.releng.mdc1.mozilla.com 31266
xenserver157.ops.releng.mdc1.mozilla.com 31267
xenserver158.ops.releng.mdc1.mozilla.com 31268
xenserver159.ops.releng.mdc1.mozilla.com 31269
xenserver160.ops.releng.mdc1.mozilla.com 31270
xenserver161.ops.releng.mdc1.mozilla.com 31271
xenserver162.ops.releng.mdc1.mozilla.com 31272
xenserver163.ops.releng.mdc1.mozilla.com 31273
xenserver164.ops.releng.mdc1.mozilla.com 31274
xenserver165.ops.releng.mdc1.mozilla.com 31275
xenserver166.ops.releng.mdc1.mozilla.com 31276
xenserver167.ops.releng.mdc1.mozilla.com 31277
xenserver168.ops.releng.mdc1.mozilla.com 31278
xenserver169.ops.releng.mdc1.mozilla.com 31279
xenserver170.ops.releng.mdc1.mozilla.com 31280
xenserver171.ops.releng.mdc1.mozilla.com 31281
xenserver172.ops.releng.mdc1.mozilla.com 31282
xenserver173.ops.releng.mdc1.mozilla.com 31283
xenserver174.ops.releng.mdc1.mozilla.com 31284
xenserver175.ops.releng.mdc1.mozilla.com 31285
xenserver176.ops.releng.mdc1.mozilla.com 31286
xenserver177.ops.releng.mdc1.mozilla.com 31287
xenserver178.ops.releng.mdc1.mozilla.com 31288
xenserver179.ops.releng.mdc1.mozilla.com 31289
xenserver180.ops.releng.mdc1.mozilla.com 31290
Who will be on buildduty and needs to be aware so they can quarantine the hosts?
Flags: needinfo?(jwatkins)
Flags: needinfo?(jlund)
The HP tech requested 2.5 hours; I've asked for 4 in the CAB because I intend to show the QTS techs how to handle this in the future when we do a smart-hands request. In the past, replacing the motherboard took around 3 hours.
I'll let :jlund handle who will be assigned from buildduty. Also, fwiw, I'm not sure which inventory those xenserver hosts came from but they have since been renamed. The list of affected hosts should be:

t-linux64-ms-136.test.releng.mdc1.mozilla.com
t-linux64-ms-137.test.releng.mdc1.mozilla.com
t-linux64-ms-138.test.releng.mdc1.mozilla.com
t-linux64-ms-139.test.releng.mdc1.mozilla.com
t-linux64-ms-140.test.releng.mdc1.mozilla.com
t-linux64-ms-141.test.releng.mdc1.mozilla.com
t-linux64-ms-142.test.releng.mdc1.mozilla.com
t-linux64-ms-143.test.releng.mdc1.mozilla.com
t-linux64-ms-144.test.releng.mdc1.mozilla.com
t-linux64-ms-145.test.releng.mdc1.mozilla.com
t-linux64-ms-146.test.releng.mdc1.mozilla.com
t-linux64-ms-147.test.releng.mdc1.mozilla.com
t-linux64-ms-148.test.releng.mdc1.mozilla.com
t-linux64-ms-149.test.releng.mdc1.mozilla.com
t-linux64-ms-150.test.releng.mdc1.mozilla.com
t-w1064-ms-151.wintest.releng.mdc1.mozilla.com
t-w1064-ms-152.wintest.releng.mdc1.mozilla.com
t-w1064-ms-153.wintest.releng.mdc1.mozilla.com
t-w1064-ms-154.wintest.releng.mdc1.mozilla.com
t-w1064-ms-155.wintest.releng.mdc1.mozilla.com
t-w1064-ms-156.wintest.releng.mdc1.mozilla.com
t-w1064-ms-157.wintest.releng.mdc1.mozilla.com
t-w1064-ms-158.wintest.releng.mdc1.mozilla.com
t-w1064-ms-159.wintest.releng.mdc1.mozilla.com
t-w1064-ms-160.wintest.releng.mdc1.mozilla.com
t-w1064-ms-161.wintest.releng.mdc1.mozilla.com
t-w1064-ms-162.wintest.releng.mdc1.mozilla.com
t-w1064-ms-163.wintest.releng.mdc1.mozilla.com
t-w1064-ms-164.wintest.releng.mdc1.mozilla.com
t-w1064-ms-165.wintest.releng.mdc1.mozilla.com
t-w1064-ms-166.wintest.releng.mdc1.mozilla.com
t-w1064-ms-167.wintest.releng.mdc1.mozilla.com
t-w1064-ms-168.wintest.releng.mdc1.mozilla.com
t-w1064-ms-169.wintest.releng.mdc1.mozilla.com
t-w1064-ms-170.wintest.releng.mdc1.mozilla.com
t-w1064-ms-171.wintest.releng.mdc1.mozilla.com
t-w1064-ms-172.wintest.releng.mdc1.mozilla.com
t-w1064-ms-173.wintest.releng.mdc1.mozilla.com
t-w1064-ms-174.wintest.releng.mdc1.mozilla.com
t-w1064-ms-175.wintest.releng.mdc1.mozilla.com
t-w1064-ms-176.wintest.releng.mdc1.mozilla.com
t-w1064-ms-177.wintest.releng.mdc1.mozilla.com
t-w1064-ms-178.wintest.releng.mdc1.mozilla.com
t-w1064-ms-179.wintest.releng.mdc1.mozilla.com
t-w1064-ms-180.wintest.releng.mdc1.mozilla.com
Flags: needinfo?(jwatkins)
:jlund / buildduty: HPE will be on site at 10am Friday 4/20.
:van, Build duty is aware and the people on shift tomorrow will be ready to quarantine the hosts on your mark.
(In reply to Zsolt Fay [:zsoltfay] from comment #20)
> :van, Build duty is aware and the people on shift tomorrow will be ready to
> quarantine the hosts on your mark.

You probably want to quarantine these hosts a few hours before the chassis is set to go offline. This will give time for any current tasks to complete.
(In reply to Jake Watkins [:dividehex] from comment #21)
> (In reply to Zsolt Fay [:zsoltfay] from comment #20)
> > :van, Build duty is aware and the people on shift tomorrow will be ready to
> > quarantine the hosts on your mark.
>
> You probably want to quarantine these hosts a few hours before the chassis is
> set to go offline. This will give time for any current tasks to complete.

If we haven't already done so, we should quarantine these now.
Flags: needinfo?(jlund) → needinfo?(dlabici)
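(For the record, a rough sketch of what quarantining these workers could look like, assuming it goes through the Taskcluster queue's quarantineWorker endpoint via the Python client. The host numbering comes from the affected-hosts list above; the provisioner ID, worker types, worker group, root URL, credentials, and quarantine end time are illustrative placeholders, not values taken from this bug.)

import taskcluster

# Quarantine until after the scheduled 4/20 maintenance window (placeholder time).
QUARANTINE_UNTIL = "2018-04-20T22:00:00Z"

queue = taskcluster.Queue({
    "rootUrl": "https://taskcluster.example.com",               # placeholder
    "credentials": {"clientId": "...", "accessToken": "..."},   # placeholder
})

# Hosts on moon-chassis-4, per the affected-hosts list above.
hosts = (
    [f"t-linux64-ms-{n}" for n in range(136, 151)]
    + [f"t-w1064-ms-{n}" for n in range(151, 181)]
)

for worker_id in hosts:
    # Worker type names are placeholders, not taken from this bug.
    worker_type = (
        "gecko-t-linux-talos" if worker_id.startswith("t-linux") else "gecko-t-win10-64-hw"
    )
    queue.quarantineWorker(
        "releng-hardware",   # provisionerId (placeholder)
        worker_type,
        "mdc1",              # workerGroup (placeholder)
        worker_id,
        {"quarantineUntil": QUARANTINE_UNTIL},
    )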
As per the request, I have quarantined the requested servers, with the exception of:

t-linux64-ms-: 136, 137, 138, 139, 141, 142
t-w1064-ms-: 170, 178, 179

No idea where to find the missing servers.
Flags: needinfo?(dlabici)
1/1/16 is now stable after the motherboard replacement.
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED