Bug 1429225 (Closed) - Opened 7 years ago, Closed 7 years ago

bad uplink module - moon4-1-access.rit47.inband.releng.mdc1.mozilla.net:16

Categories

(Infrastructure & Operations :: DCOps, task)

task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: van, Assigned: van)

References

Details

(Whiteboard: case 5326088334, 5326802525, 5328599861)

Strange issue on the Moonshot. Re-ran the fibers and tested the optics on both ends; finally moved to a different SFP+ port, which appears to have been the issue. 1/1/14 is currently reconfigured as part of the port channel until we can RMA the module.

XGE1/1/14   UP    10G(a)  F(a)  T  1
XGE1/1/15   UP    10G(a)  F(a)  T  1
XGE1/1/16   DOWN  auto    A     T  1
BAGG46      UP    40G(a)  F(a)  T  1

Error messages:

%Jan 9 19:36:46:383 2018 moon4-access.rit47.inband.releng.mdc1.mozilla.net IFNET/3/PHY_UPDOWN: Ten-GigabitEthernet1/1/16 link status is up.
%Jan 9 19:36:50:617 2018 moon4-access.rit47.inband.releng.mdc1.mozilla.net IFNET/3/PHY_UPDOWN: Ten-GigabitEthernet1/1/16 link status is down.
%Jan 9 19:36:50:929 2018 moon4-access.rit47.inband.releng.mdc1.mozilla.net IFNET/3/PHY_UPDOWN: Ten-GigabitEthernet1/1/16 link status is up.
%Jan 9 19:36:52:469 2018 moon4-access.rit47.inband.releng.mdc1.mozilla.net IFNET/3/PHY_UPDOWN: Ten-GigabitEthernet1/1/16 link status is down.
%Jan 9 19:36:52:764 2018 moon4-access.rit47.inband.releng.mdc1.mozilla.net IFNET/3/PHY_UPDOWN: Ten-GigabitEthernet1/1/16 link status is up.
%Jan 9 19:37:02:916 2018 moon4-access.rit47.inband.releng.mdc1.mozilla.net IFNET/3/PHY_UPDOWN: Ten-GigabitEthernet1/1/16 link status is down.
%Jan 9 19:37:03:211 2018 moon4-access.rit47.inband.releng.mdc1.mozilla.net IFNET/3/PHY_UPDOWN: Ten-GigabitEthernet1/1/16 link status is up.
%Jan 9 19:37:03:561 2018 moon4-access.rit47.inband.releng.mdc1.mozilla.net IFNET/3/PHY_UPDOWN: Ten-GigabitEthernet1/1/16 link status is down.
%Jan 9 19:37:03:854 2018 moon4-access.rit47.inband.releng.mdc1.mozilla.net IFNET/3/PHY_UPDOWN: Ten-GigabitEthernet1/1/16 link status is up.
%Jan 9 19:37:13:547 2018 moon4-access.rit47.inband.releng.mdc1.mozilla.net IFNET/3/PHY_UPDOWN: Ten-GigabitEthernet1/1/16 link status is down.
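(For reference, a rough sketch of the interim change described above: bringing 1/1/14 into the port channel and taking the flapping 1/1/16 out of service. This assumes a Comware-style switch driven via netmiko; the switch hostname comes from the bug summary, but the credentials and exact command sequence are placeholders/assumptions, not the config actually applied. Bridge-aggregation group 46 comes from the BAGG46 line above.)

from netmiko import ConnectHandler

switch = {
    "device_type": "hp_comware",
    "host": "moon4-1-access.rit47.inband.releng.mdc1.mozilla.net",
    "username": "netops",    # placeholder
    "password": "REDACTED",  # placeholder
}

# Interim workaround: add 1/1/14 to Bridge-Aggregation 46 and take the
# flapping 1/1/16 out of the bundle until the uplink module is RMA'd.
workaround = [
    "interface Ten-GigabitEthernet1/1/14",
    "port link-aggregation group 46",
    "quit",
    "interface Ten-GigabitEthernet1/1/16",
    "undo port link-aggregation group",
    "shutdown",
]

with ConnectHandler(**switch) as conn:
    output = conn.send_config_set(workaround)
    print(output)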
Blocks: 1428159
Summary: bad SFP+ interface 16 on Moonshot-45XGc → bad port - moon1-1-access.rit47.inband.releng.mdc1.mozilla.net:16
Opened case 5326088334 with HP for the RMA.
Assignee: server-ops-dcops → vle
Whiteboard: case 5326088334
The RMA part is on site in SCL3. I'd like to schedule a trip to replace the Moonshot switch and observe its behavior when we hot-swap it. Are there any issues with this, or do we need to schedule a CAB? Note: this is moon-chassis-4.
Flags: needinfo?(klibby)
Summary: bad port - moon1-1-access.rit47.inband.releng.mdc1.mozilla.net:16 → bad port - moon4-1-access.rit47.inband.releng.mdc1.mozilla.net:16
What exactly needs replacing, and what's the impact to the chassis? If it's just one port, then it should be fine whenever. If we need to take the whole chassis down, we should just let QA/releng know; I don't think it needs a TCW.
Flags: needinfo?(klibby)
> what's the impact to the chassis?

I was told these are hot-swappable since they're just the uplink modules and run as a redundant pair. It looks like they sent me the internal switch instead of the uplink module, so I'll have to contact HP again.
OK, should be good to go.
Didn't get an update when I emailed them referencing the old case number, so I went ahead and opened case 5326802525 for the 16-port SFP+ uplink module.
Whiteboard: case 5326088334 → case 5326088334, 5326802525
Still working with HPE on this. The previous tech I spoke to said we can do this live; however, the current tech is telling me we need an outage window. The guide says:

• Removing any component from bay A or bay B does not disrupt traffic for the other switch assembly.

So I'm asking for clarification. If this does cause any disruption, I'll open a CAB to handle it during the TCW. We're also pending shipment of the module, as they're currently out of stock.
Summary: bad port - moon4-1-access.rit47.inband.releng.mdc1.mozilla.net:16 → bad uplink module - moon4-1-access.rit47.inband.releng.mdc1.mozilla.net:16
Actually, now that I think about it, we will need an outage window: none of the cartridges are running LACP. We ran into PXE booting issues, so I believe that configuration has been removed. Will loop back once I get more info.
We only have a limited number of production workers on this chassis, so scheduling an outage should be relatively easy.
Replacing the switch didn't resolve the issue. Some things to note:

1) Since we've disabled the 2nd switch due to PXE/operational issues, there is no redundancy; when we remove switch 1, the whole chassis goes down.
2) Even with the replacement switch, port 1/1/16 continuously flaps. I'm not sure whether that's an issue with the chassis or the system board now. Will need to revisit the HPE case.
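(A rough sketch of how the continued flapping on 1/1/16 could be confirmed from the switch side, again assuming netmiko against the Comware CLI. The "Current state" parsing and the polling window are assumptions for illustration, not the exact procedure used here.)

import re
import time

from netmiko import ConnectHandler

switch = {
    "device_type": "hp_comware",
    "host": "moon4-1-access.rit47.inband.releng.mdc1.mozilla.net",
    "username": "netops",    # placeholder
    "password": "REDACTED",  # placeholder
}

def link_state(conn):
    # Comware's "display interface" reports "Current state: UP/DOWN";
    # exact wording may vary by release.
    out = conn.send_command("display interface Ten-GigabitEthernet1/1/16")
    match = re.search(r"Current state:\s*(\S+)", out)
    return match.group(1) if match else "UNKNOWN"

with ConnectHandler(**switch) as conn:
    last = link_state(conn)
    transitions = 0
    for _ in range(30):      # poll every 10s for ~5 minutes
        time.sleep(10)
        state = link_state(conn)
        if state != last:
            transitions += 1
            last = state
    print(f"1/1/16 changed state {transitions} times in ~5 minutes")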
As mentioned in our post-SCL3 meetings, this chassis and all of its blades will go down when we swap out the motherboard. I'd like to schedule next Friday, 4/20, at 10am for the hardware replacement. I'll open a CAB, but I'd like to know whether there are other people I need to contact and whether this date works before I coordinate with HP to send a tech out (I'll need to be on site as well).
Flags: needinfo?(klibby)
Flags: needinfo?(jwatkins)
Whiteboard: case 5326088334, 5326802525 → case 5326088334, 5326802525, 5328599861
(In reply to Van Le [:van] from comment #11)
> As mentioned in our post-SCL3 meetings, this chassis and all of its blades
> will go down when we swap out the motherboard. I'd like to schedule next
> Friday, 4/20, at 10am for the hardware replacement. I'll open a CAB, but I'd
> like to know whether there are other people I need to contact and whether
> this date works before I coordinate with HP to send a tech out (I'll need to
> be on site as well).

ETA on how long you expect to have the chassis offline? I assume not very long. We should just let jmaher and the buildduty folks know when we start, so that if we get alerts for pending counts they know it's related. And either relops or buildduty can quarantine all of the hosts on the chassis to block jobs from running. I think next Friday is fine.
Flags: needinfo?(klibby)
This will take 2-3 hours. We need to remove every component from the chassis when replacing the motherboard. I also plan on showing QTS the process for future smart-hands requests. I'll plan a CAB for next Friday if HP has the parts.
Flags: needinfo?(jwatkins)
CHG0012795 opened for CAB review.
CHG0012795 has been approved. These are the hosts I have in inventory:

xenserver136.ops.releng.mdc1.mozilla.com 31246
xenserver137.ops.releng.mdc1.mozilla.com 31247
xenserver138.ops.releng.mdc1.mozilla.com 31248
xenserver139.ops.releng.mdc1.mozilla.com 31249
xenserver140.ops.releng.mdc1.mozilla.com 31250
xenserver141.ops.releng.mdc1.mozilla.com 31251
xenserver142.ops.releng.mdc1.mozilla.com 31252
xenserver143.ops.releng.mdc1.mozilla.com 31253
xenserver144.ops.releng.mdc1.mozilla.com 31254
xenserver145.ops.releng.mdc1.mozilla.com 31255
xenserver146.ops.releng.mdc1.mozilla.com 31256
xenserver147.ops.releng.mdc1.mozilla.com 31257
xenserver148.ops.releng.mdc1.mozilla.com 31258
xenserver149.ops.releng.mdc1.mozilla.com 31259
xenserver150.ops.releng.mdc1.mozilla.com 31260
xenserver151.ops.releng.mdc1.mozilla.com 31261
xenserver152.ops.releng.mdc1.mozilla.com 31262
xenserver153.ops.releng.mdc1.mozilla.com 31263
xenserver154.ops.releng.mdc1.mozilla.com 31264
xenserver155.ops.releng.mdc1.mozilla.com 31265
xenserver156.ops.releng.mdc1.mozilla.com 31266
xenserver157.ops.releng.mdc1.mozilla.com 31267
xenserver158.ops.releng.mdc1.mozilla.com 31268
xenserver159.ops.releng.mdc1.mozilla.com 31269
xenserver160.ops.releng.mdc1.mozilla.com 31270
xenserver161.ops.releng.mdc1.mozilla.com 31271
xenserver162.ops.releng.mdc1.mozilla.com 31272
xenserver163.ops.releng.mdc1.mozilla.com 31273
xenserver164.ops.releng.mdc1.mozilla.com 31274
xenserver165.ops.releng.mdc1.mozilla.com 31275
xenserver166.ops.releng.mdc1.mozilla.com 31276
xenserver167.ops.releng.mdc1.mozilla.com 31277
xenserver168.ops.releng.mdc1.mozilla.com 31278
xenserver169.ops.releng.mdc1.mozilla.com 31279
xenserver170.ops.releng.mdc1.mozilla.com 31280
xenserver171.ops.releng.mdc1.mozilla.com 31281
xenserver172.ops.releng.mdc1.mozilla.com 31282
xenserver173.ops.releng.mdc1.mozilla.com 31283
xenserver174.ops.releng.mdc1.mozilla.com 31284
xenserver175.ops.releng.mdc1.mozilla.com 31285
xenserver176.ops.releng.mdc1.mozilla.com 31286
xenserver177.ops.releng.mdc1.mozilla.com 31287
xenserver178.ops.releng.mdc1.mozilla.com 31288
xenserver179.ops.releng.mdc1.mozilla.com 31289
xenserver180.ops.releng.mdc1.mozilla.com 31290
Who will be on buildduty and needs to be aware so they can quarantine the hosts?
Flags: needinfo?(jwatkins)
Flags: needinfo?(jlund)
The HP tech requested 2.5 hours; I've asked for 4 in the CAB because I intend to show the QTS techs how to handle this in the future when we do a smart-hands request. In the past, replacing the motherboard took around 3 hours.
I'll let :jlund handle who will be assigned from buildduty. Also, fwiw, I'm not sure which inventory those xenserver hosts came from but they have since been renamed. The list of affected hosts should be:

t-linux64-ms-136.test.releng.mdc1.mozilla.com
t-linux64-ms-137.test.releng.mdc1.mozilla.com
t-linux64-ms-138.test.releng.mdc1.mozilla.com
t-linux64-ms-139.test.releng.mdc1.mozilla.com
t-linux64-ms-140.test.releng.mdc1.mozilla.com
t-linux64-ms-141.test.releng.mdc1.mozilla.com
t-linux64-ms-142.test.releng.mdc1.mozilla.com
t-linux64-ms-143.test.releng.mdc1.mozilla.com
t-linux64-ms-144.test.releng.mdc1.mozilla.com
t-linux64-ms-145.test.releng.mdc1.mozilla.com
t-linux64-ms-146.test.releng.mdc1.mozilla.com
t-linux64-ms-147.test.releng.mdc1.mozilla.com
t-linux64-ms-148.test.releng.mdc1.mozilla.com
t-linux64-ms-149.test.releng.mdc1.mozilla.com
t-linux64-ms-150.test.releng.mdc1.mozilla.com
t-w1064-ms-151.wintest.releng.mdc1.mozilla.com
t-w1064-ms-152.wintest.releng.mdc1.mozilla.com
t-w1064-ms-153.wintest.releng.mdc1.mozilla.com
t-w1064-ms-154.wintest.releng.mdc1.mozilla.com
t-w1064-ms-155.wintest.releng.mdc1.mozilla.com
t-w1064-ms-156.wintest.releng.mdc1.mozilla.com
t-w1064-ms-157.wintest.releng.mdc1.mozilla.com
t-w1064-ms-158.wintest.releng.mdc1.mozilla.com
t-w1064-ms-159.wintest.releng.mdc1.mozilla.com
t-w1064-ms-160.wintest.releng.mdc1.mozilla.com
t-w1064-ms-161.wintest.releng.mdc1.mozilla.com
t-w1064-ms-162.wintest.releng.mdc1.mozilla.com
t-w1064-ms-163.wintest.releng.mdc1.mozilla.com
t-w1064-ms-164.wintest.releng.mdc1.mozilla.com
t-w1064-ms-165.wintest.releng.mdc1.mozilla.com
t-w1064-ms-166.wintest.releng.mdc1.mozilla.com
t-w1064-ms-167.wintest.releng.mdc1.mozilla.com
t-w1064-ms-168.wintest.releng.mdc1.mozilla.com
t-w1064-ms-169.wintest.releng.mdc1.mozilla.com
t-w1064-ms-170.wintest.releng.mdc1.mozilla.com
t-w1064-ms-171.wintest.releng.mdc1.mozilla.com
t-w1064-ms-172.wintest.releng.mdc1.mozilla.com
t-w1064-ms-173.wintest.releng.mdc1.mozilla.com
t-w1064-ms-174.wintest.releng.mdc1.mozilla.com
t-w1064-ms-175.wintest.releng.mdc1.mozilla.com
t-w1064-ms-176.wintest.releng.mdc1.mozilla.com
t-w1064-ms-177.wintest.releng.mdc1.mozilla.com
t-w1064-ms-178.wintest.releng.mdc1.mozilla.com
t-w1064-ms-179.wintest.releng.mdc1.mozilla.com
t-w1064-ms-180.wintest.releng.mdc1.mozilla.com
Flags: needinfo?(jwatkins)
:jlund / buildduty: HPE will be on site at 10am Friday 4/20.
:van, Build duty is aware and the people on shift tomorrow will be ready to quarantine the hosts on your mark.
(In reply to Zsolt Fay [:zsoltfay] from comment #20)
> :van, Build duty is aware and the people on shift tomorrow will be ready to
> quarantine the hosts on your mark.

You probably want to quarantine these hosts a few hours before the chassis is set to go offline. This will give time for any current tasks to complete.
(In reply to Jake Watkins [:dividehex] from comment #21)
> (In reply to Zsolt Fay [:zsoltfay] from comment #20)
> > :van, Build duty is aware and the people on shift tomorrow will be ready to
> > quarantine the hosts on your mark.
>
> You probably want to quarantine these hosts a few hours before the chassis is
> set to go offline. This will give time for any current tasks to complete.

If we haven't already done so, we should quarantine these now.
Flags: needinfo?(jlund) → needinfo?(dlabici)
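(For the record, a rough sketch of what quarantining these workers could look like, assuming it goes through the Taskcluster queue's quarantineWorker endpoint via the Python client. The host numbering comes from the affected-hosts list above; the provisioner ID, worker types, worker group, root URL, credentials, and quarantine end time are illustrative placeholders, not values taken from this bug.)

import taskcluster

# Quarantine until after the scheduled 4/20 maintenance window (placeholder time).
QUARANTINE_UNTIL = "2018-04-20T22:00:00Z"

queue = taskcluster.Queue({
    "rootUrl": "https://taskcluster.example.com",               # placeholder
    "credentials": {"clientId": "...", "accessToken": "..."},   # placeholder
})

# Hosts on moon-chassis-4, per the affected-hosts list above.
hosts = (
    [f"t-linux64-ms-{n}" for n in range(136, 151)]
    + [f"t-w1064-ms-{n}" for n in range(151, 181)]
)

for worker_id in hosts:
    # Worker type names are placeholders, not taken from this bug.
    worker_type = (
        "gecko-t-linux-talos" if worker_id.startswith("t-linux") else "gecko-t-win10-64-hw"
    )
    queue.quarantineWorker(
        "releng-hardware",   # provisionerId (placeholder)
        worker_type,
        "mdc1",              # workerGroup (placeholder)
        worker_id,
        {"quarantineUntil": QUARANTINE_UNTIL},
    )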
As per the request, I have quarantined the requested servers, with the exception of:

t-linux64-ms-: 136, 137, 138, 139, 141, 142
t-w1064-ms-: 170, 178, 179

No idea where to find the missing servers.
Flags: needinfo?(dlabici)
1/1/16 is now stable after the motherboard replacement.
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED