Closed
Bug 1429225
Opened 7 years ago
Closed 7 years ago
bad uplink module - moon4-1-access.rit47.inband.releng.mdc1.mozilla.net:16
Categories
(Infrastructure & Operations :: DCOps, task)
Infrastructure & Operations
DCOps
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: van, Assigned: van)
References
Details
(Whiteboard: case 5326088334, 5326802525, 5328599861)
strange issue on the moonshot. re-ran the fibers and tested the optics on both ends; finally changing the SFP+ port appears to have been the fix. reconfigured 1/1/14 into the port channel for now, until we can RMA the module.
XGE1/1/14 UP 10G(a) F(a) T 1
XGE1/1/15 UP 10G(a) F(a) T 1
XGE1/1/16 DOWN auto A T 1
BAGG46 UP 40G(a) F(a) T 1
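The swap described above (pulling the bad 1/1/16 out of the LAG and putting 1/1/14 in its place) would look roughly like this on a Comware-based switch. This is an illustrative sketch, not the config actually applied; the aggregation group number 46 is inferred from the BAGG46 line above:

```
system-view
# take the flapping port out of the aggregation and disable it
interface Ten-GigabitEthernet1/1/16
 undo port link-aggregation group
 shutdown
quit
# add the spare port to the same bridge-aggregation (BAGG46)
interface Ten-GigabitEthernet1/1/14
 port link-aggregation group 46
quit
```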
error messages:
%Jan 9 19:36:46:383 2018 moon4-access.rit47.inband.releng.mdc1.mozilla.net IFNET/3/PHY_UPDOWN: Ten-GigabitEthernet1/1/16 link status is up.
%Jan 9 19:36:50:617 2018 moon4-access.rit47.inband.releng.mdc1.mozilla.net IFNET/3/PHY_UPDOWN: Ten-GigabitEthernet1/1/16 link status is down.
%Jan 9 19:36:50:929 2018 moon4-access.rit47.inband.releng.mdc1.mozilla.net IFNET/3/PHY_UPDOWN: Ten-GigabitEthernet1/1/16 link status is up.
%Jan 9 19:36:52:469 2018 moon4-access.rit47.inband.releng.mdc1.mozilla.net IFNET/3/PHY_UPDOWN: Ten-GigabitEthernet1/1/16 link status is down.
%Jan 9 19:36:52:764 2018 moon4-access.rit47.inband.releng.mdc1.mozilla.net IFNET/3/PHY_UPDOWN: Ten-GigabitEthernet1/1/16 link status is up.
%Jan 9 19:37:02:916 2018 moon4-access.rit47.inband.releng.mdc1.mozilla.net IFNET/3/PHY_UPDOWN: Ten-GigabitEthernet1/1/16 link status is down.
%Jan 9 19:37:03:211 2018 moon4-access.rit47.inband.releng.mdc1.mozilla.net IFNET/3/PHY_UPDOWN: Ten-GigabitEthernet1/1/16 link status is up.
%Jan 9 19:37:03:561 2018 moon4-access.rit47.inband.releng.mdc1.mozilla.net IFNET/3/PHY_UPDOWN: Ten-GigabitEthernet1/1/16 link status is down.
%Jan 9 19:37:03:854 2018 moon4-access.rit47.inband.releng.mdc1.mozilla.net IFNET/3/PHY_UPDOWN: Ten-GigabitEthernet1/1/16 link status is up.
%Jan 9 19:37:13:547 2018 moon4-access.rit47.inband.releng.mdc1.mozilla.net IFNET/3/PHY_UPDOWN: Ten-GigabitEthernet1/1/16 link status is down.
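For a rough sense of how fast the port was flapping, the PHY_UPDOWN lines above can be parsed and counted. This is an illustrative sketch, not a tool used in this bug; the regex assumes the Comware timestamp format shown above, where the fourth colon-separated time field is milliseconds:

```python
import re
from datetime import datetime

# Comware syslog prefix: %Jan 9 19:36:46:383 2018 <hostname> IFNET/3/PHY_UPDOWN: ...
LOG_RE = re.compile(
    r"%(\w{3}) +(\d+) (\d+):(\d+):(\d+):(\d+) (\d+) \S+ "
    r"IFNET/\d/PHY_UPDOWN: (\S+) link status is (up|down)\."
)

def count_flaps(lines, interface):
    """Return (event count, seconds spanned) for PHY_UPDOWN events on one interface."""
    times = []
    for line in lines:
        m = LOG_RE.match(line)
        if m and m.group(8) == interface:
            mon, day, hh, mm, ss, ms, year = m.group(1, 2, 3, 4, 5, 6, 7)
            # milliseconds -> microseconds for strptime's %f field
            ts = datetime.strptime(
                f"{mon} {day} {year} {hh}:{mm}:{ss}.{int(ms) * 1000:06d}",
                "%b %d %Y %H:%M:%S.%f",
            )
            times.append(ts)
    if len(times) < 2:
        return len(times), 0.0
    return len(times), (times[-1] - times[0]).total_seconds()
```

Against the log excerpt above, this yields 10 up/down events in roughly 27 seconds on Ten-GigabitEthernet1/1/16.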
Updated•7 years ago
Blocks: 1428159
Summary: bad SFP+ interface 16 on Moonshot-45XGc → bad port - moon1-1-access.rit47.inband.releng.mdc1.mozilla.net:16
Comment 1•7 years ago
opened 5326088334 with HP for RMA.
Assignee: server-ops-dcops → vle
Whiteboard: case 5326088334
Comment 2•7 years ago
the RMA is on site in SCL3. i'd like to schedule a trip to replace the moonshot switch and observe its behavior when we hot-swap it. are there any issues with that, or do we need to schedule a CAB? note: this is moon-chassis-4.
Flags: needinfo?(klibby)
Summary: bad port - moon1-1-access.rit47.inband.releng.mdc1.mozilla.net:16 → bad port - moon4-1-access.rit47.inband.releng.mdc1.mozilla.net:16
Comment 3•7 years ago
What exactly needs replacing, and what's the impact to the chassis? If it's just one port, then it should be fine whenever. If we need to take the whole chassis down, we should just let QA/releng know; I don't think it needs a TCW.
Flags: needinfo?(klibby)
Comment 4•7 years ago
> what's the impact to the chassis?
i was told these are hot-swappable since they're just the uplink modules in a redundant pair. it looks like they sent me the internal switch instead of the uplink module, so i'll have to contact HP again.
Comment 5•7 years ago
ok, should be gtg.
Comment 6•7 years ago
didn't get an update when i emailed them referring to the old case number, so i went ahead and opened case 5326802525 for the 16-port SFP+ uplink module.
Whiteboard: case 5326088334 → case 5326088334, 5326802525
Comment 7•7 years ago
still working with HPE on this. the previous tech i spoke to told me we could do this live; however, the current tech is telling me we need an outage window. the guide, though, states:
• Removing any component from bay A or bay B does not disrupt traffic for the other switch assembly.
so i'm asking for clarification. if this does cause any disruption, i'll open a CAB to handle it during the TCW.
we are also pending shipment of the module, as it's currently out of stock.
Updated•7 years ago
Summary: bad port - moon4-1-access.rit47.inband.releng.mdc1.mozilla.net:16 → bad uplink module - moon4-1-access.rit47.inband.releng.mdc1.mozilla.net:16
Comment 8•7 years ago
actually, now that i think about it, we will need an outage window, since none of the cartridges are running LACP. we ran into pxe booting issues, so i believe that configuration was removed. will loop back once i have more info.
Comment 9•7 years ago
We only have a limited number of production workers on this chassis, so scheduling an outage should be relatively easy.
Comment 10•7 years ago
replacing the switch didn't resolve the issue. some things to note:
1) since we disabled the 2nd switch due to pxe/operational issues, there is no redundancy; when we remove switch1, the whole chassis goes down.
2) port 1/1/16 still flaps continuously even with the new switch, so i'm not sure whether the issue is now in the chassis or the system board. will need to revisit the HPE case.
Comment 11•7 years ago
as mentioned in our post-SCL3 meetings, this chassis and all of its blades will go down when we swap out the motherboard. i'd like to schedule next Friday, 4/20, at 10am for the hardware replacement. i will open a CAB, but i'd like to know whether there are other people i need to contact and whether this date works before i coordinate with HP to send a tech out (i'll need to be on site as well).
Flags: needinfo?(klibby)
Flags: needinfo?(jwatkins)
Updated•7 years ago
Whiteboard: case 5326088334, 5326802525 → case 5326088334, 5326802525, 5328599861
Comment 12•7 years ago
(In reply to Van Le [:van] from comment #11)
> as mentioned in our postscl3 meetings, this chassis and all of its blades
> will go down when we swap out the motherboard. i would like to schedule next
> Friday 4/20 at 10am for the hardware replacement. i will open a CAB but
> would like to know if there are other people i need to contact and whether
> this is possible before i coordinate with HP to send a tech out (i'll need
> to be on site as well).
ETA on how long you expect to have the chassis offline? I assume not very long. We should just let jmaher and the buildduty folks know when we start, so that if we get alerts for pending counts they know it's related. And either relops or buildduty can quarantine all of the hosts on the chassis to block jobs from running.
I think next Friday is fine.
Flags: needinfo?(klibby)
Comment 13•7 years ago
this will take 2-3 hours; we need to remove every component from the chassis when replacing the motherboard. i also plan on showing QTS the process for future smart-hands requests.
i'll plan a CAB for next Friday if HP has the parts.
Flags: needinfo?(jwatkins)
Comment 14•7 years ago
CHG0012795 opened for CAB review.
Comment 15•7 years ago
CHG0012795 has been approved. these are the hosts i have in inventory:
xenserver136.ops.releng.mdc1.mozilla.com 31246
xenserver145.ops.releng.mdc1.mozilla.com 31255
xenserver146.ops.releng.mdc1.mozilla.com 31256
xenserver147.ops.releng.mdc1.mozilla.com 31257
xenserver148.ops.releng.mdc1.mozilla.com 31258
xenserver149.ops.releng.mdc1.mozilla.com 31259
xenserver150.ops.releng.mdc1.mozilla.com 31260
xenserver151.ops.releng.mdc1.mozilla.com 31261
xenserver152.ops.releng.mdc1.mozilla.com 31262
xenserver153.ops.releng.mdc1.mozilla.com 31263
xenserver154.ops.releng.mdc1.mozilla.com 31264
xenserver137.ops.releng.mdc1.mozilla.com 31247
xenserver155.ops.releng.mdc1.mozilla.com 31265
xenserver156.ops.releng.mdc1.mozilla.com 31266
xenserver157.ops.releng.mdc1.mozilla.com 31267
xenserver158.ops.releng.mdc1.mozilla.com 31268
xenserver159.ops.releng.mdc1.mozilla.com 31269
xenserver160.ops.releng.mdc1.mozilla.com 31270
xenserver161.ops.releng.mdc1.mozilla.com 31271
xenserver162.ops.releng.mdc1.mozilla.com 31272
xenserver163.ops.releng.mdc1.mozilla.com 31273
xenserver164.ops.releng.mdc1.mozilla.com 31274
xenserver138.ops.releng.mdc1.mozilla.com 31248
xenserver165.ops.releng.mdc1.mozilla.com 31275
xenserver166.ops.releng.mdc1.mozilla.com 31276
xenserver167.ops.releng.mdc1.mozilla.com 31277
xenserver168.ops.releng.mdc1.mozilla.com 31278
xenserver169.ops.releng.mdc1.mozilla.com 31279
xenserver170.ops.releng.mdc1.mozilla.com 31280
xenserver171.ops.releng.mdc1.mozilla.com 31281
xenserver172.ops.releng.mdc1.mozilla.com 31282
xenserver173.ops.releng.mdc1.mozilla.com 31283
xenserver174.ops.releng.mdc1.mozilla.com 31284
xenserver139.ops.releng.mdc1.mozilla.com 31249
xenserver175.ops.releng.mdc1.mozilla.com 31285
xenserver176.ops.releng.mdc1.mozilla.com 31286
xenserver177.ops.releng.mdc1.mozilla.com 31287
xenserver178.ops.releng.mdc1.mozilla.com 31288
xenserver179.ops.releng.mdc1.mozilla.com 31289
xenserver180.ops.releng.mdc1.mozilla.com 31290
xenserver140.ops.releng.mdc1.mozilla.com 31250
xenserver141.ops.releng.mdc1.mozilla.com 31251
xenserver142.ops.releng.mdc1.mozilla.com 31252
xenserver143.ops.releng.mdc1.mozilla.com 31253
xenserver144.ops.releng.mdc1.mozilla.com 31254
Comment 16•7 years ago
who will be on buildduty and needs to be aware to quarantine the hosts?
Flags: needinfo?(jwatkins)
Flags: needinfo?(jlund)
Comment 17•7 years ago
the HP tech requested 2.5 hours; i've asked for 4 in the CAB because i intend to show the QTS techs how to handle this for future smart-hands requests. in the past, replacing the motherboard took around 3 hours.
Comment 18•7 years ago
I'll let :jlund handle who will be assigned from buildduty. Also, fwiw, I'm not sure which inventory those xenserver hosts came from, but they have since been renamed. The list of affected hosts should be:
t-linux64-ms-136.test.releng.mdc1.mozilla.com
t-linux64-ms-137.test.releng.mdc1.mozilla.com
t-linux64-ms-138.test.releng.mdc1.mozilla.com
t-linux64-ms-139.test.releng.mdc1.mozilla.com
t-linux64-ms-140.test.releng.mdc1.mozilla.com
t-linux64-ms-141.test.releng.mdc1.mozilla.com
t-linux64-ms-142.test.releng.mdc1.mozilla.com
t-linux64-ms-143.test.releng.mdc1.mozilla.com
t-linux64-ms-144.test.releng.mdc1.mozilla.com
t-linux64-ms-145.test.releng.mdc1.mozilla.com
t-linux64-ms-146.test.releng.mdc1.mozilla.com
t-linux64-ms-147.test.releng.mdc1.mozilla.com
t-linux64-ms-148.test.releng.mdc1.mozilla.com
t-linux64-ms-149.test.releng.mdc1.mozilla.com
t-linux64-ms-150.test.releng.mdc1.mozilla.com
t-w1064-ms-151.wintest.releng.mdc1.mozilla.com
t-w1064-ms-152.wintest.releng.mdc1.mozilla.com
t-w1064-ms-153.wintest.releng.mdc1.mozilla.com
t-w1064-ms-154.wintest.releng.mdc1.mozilla.com
t-w1064-ms-155.wintest.releng.mdc1.mozilla.com
t-w1064-ms-156.wintest.releng.mdc1.mozilla.com
t-w1064-ms-157.wintest.releng.mdc1.mozilla.com
t-w1064-ms-158.wintest.releng.mdc1.mozilla.com
t-w1064-ms-159.wintest.releng.mdc1.mozilla.com
t-w1064-ms-160.wintest.releng.mdc1.mozilla.com
t-w1064-ms-161.wintest.releng.mdc1.mozilla.com
t-w1064-ms-162.wintest.releng.mdc1.mozilla.com
t-w1064-ms-163.wintest.releng.mdc1.mozilla.com
t-w1064-ms-164.wintest.releng.mdc1.mozilla.com
t-w1064-ms-165.wintest.releng.mdc1.mozilla.com
t-w1064-ms-166.wintest.releng.mdc1.mozilla.com
t-w1064-ms-167.wintest.releng.mdc1.mozilla.com
t-w1064-ms-168.wintest.releng.mdc1.mozilla.com
t-w1064-ms-169.wintest.releng.mdc1.mozilla.com
t-w1064-ms-170.wintest.releng.mdc1.mozilla.com
t-w1064-ms-171.wintest.releng.mdc1.mozilla.com
t-w1064-ms-172.wintest.releng.mdc1.mozilla.com
t-w1064-ms-173.wintest.releng.mdc1.mozilla.com
t-w1064-ms-174.wintest.releng.mdc1.mozilla.com
t-w1064-ms-175.wintest.releng.mdc1.mozilla.com
t-w1064-ms-176.wintest.releng.mdc1.mozilla.com
t-w1064-ms-177.wintest.releng.mdc1.mozilla.com
t-w1064-ms-178.wintest.releng.mdc1.mozilla.com
t-w1064-ms-179.wintest.releng.mdc1.mozilla.com
t-w1064-ms-180.wintest.releng.mdc1.mozilla.com
Flags: needinfo?(jwatkins)
Comment 19•7 years ago
:jlund/build duty, HPE will be on site at 10am Friday 4/20.
Comment 20•7 years ago
:van, Build duty is aware and the people on shift tomorrow will be ready to quarantine the hosts on your mark.
Comment 21•7 years ago
(In reply to Zsolt Fay [:zsoltfay] from comment #20)
> :van, Build duty is aware and the people on shift tomorrow will be ready to
> quarantine the hosts on your mark.
You probably want to quarantine these hosts a few hours before the chassis is set to go offline. This will give time for any current tasks to complete.
Comment 22•7 years ago
(In reply to Jake Watkins [:dividehex] from comment #21)
> (In reply to Zsolt Fay [:zsoltfay] from comment #20)
> > :van, Build duty is aware and the people on shift tomorrow will be ready to
> > quarantine the hosts on your mark.
>
> You probably want to quarantine these host a few hours before the chassis is
> set to go offline. This will give time for any current tasks to complete.
If we haven't already done so, we should quarantine these now.
Flags: needinfo?(jlund) → needinfo?(dlabici)
Comment 23•7 years ago
As requested, I have quarantined the servers, with the exception of:
t-linux64-ms-:
136, 137, 138, 139, 141, 142
t-w1064-ms-:
170, 178, 179
I have no idea where to find these missing servers.
Flags: needinfo?(dlabici)
Comment 24•7 years ago
1/1/16 is now stable after the motherboard replacement.
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED