Closed Bug 1181615 — Opened 9 years ago, Closed 9 years ago

Please investigate switch1.r401-5.ops.releng.scl3.mozilla.net

Categories

(Infrastructure & Operations Graveyard :: NetOps: DC Other, task)

Type: task
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: coop, Assigned: dcurado)

References

Details

Over in bug 1180877 we're tracking a spike in retries on Windows 8 test slaves. The machines in question all seem to be connected to the same switch, switch1.r101-18.console.scl3.mozilla.net.

The relevant comment is https://bugzilla.mozilla.org/show_bug.cgi?id=1180877#c6

Can someone please look at this switch, or get someone from NetOps to do so?
Assignee: nobody → network-operations
Component: MOC: Service Requests → NetOps: DC Other
QA Contact: lypulong → jbarnell
Summary: Please investigate switch1.r101-18.console.scl3.mozilla.net → Please investigate switch1.r401-5.ops.releng.scl3.mozilla.net
this switch is actually switch1.r401-5.ops.releng -> https://inventory.mozilla.org/en-US/systems/show/9473/

we have an open RMA with Juniper to replace node0 of this VC.
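
(Not from the bug: a minimal sketch of how a failed member of a Juniper Virtual Chassis is typically confirmed before opening an RMA, assuming standard Junos CLI on the VC; the commands are real Junos operational commands, but the annotations and the focus on member 0 are assumptions based on the node0 mention above.)

  show virtual-chassis status    <- lists each member's serial, role, and whether it is Prsnt/NotPrsnt
  show chassis alarms            <- any hardware alarms raised against the failing member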
yeah, what Van said...
Assignee: network-operations → dcurado
Status: NEW → ASSIGNED
Are the headaches with the windows boxen something we can limp through, or do we need to take
action here?

Thanks.
Flags: needinfo?(coop)
(In reply to Dave Curado :dcurado from comment #3)
> Are the headaches with the windows boxen something we can limp through, or
> do we need to take
> action here?

We can limp through. We have enough capacity on other switches that we're not starved.
Flags: needinfo?(coop)
:coop, did the urgency for this switch replacement change? we were planning to swap it out during the TCW of 7/18.

12:23 < arr> van: can you double check in the bug/with coop (since he was the last one to follow up)?
12:23 < van> ok
12:23 < arr> I believe the sheriffs disabled a bunch of stuff so we're hurting for capacity
We have the switch now, and we're down 36 t-w864-ix machines until we replace it. We'll need a downtime for the entire rack (so temporarily disabling another 36 machines) to do the replacement due to the oversized PDUs. Another week is a long time to go without > 20% of our w8 capacity, but I'll let releng folks make the decision there.
Flags: needinfo?(coop)
Flags: needinfo?(catlee)
Blocks: 1182631
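(Rough arithmetic, not stated in the bug, just to size the tradeoff discussed below; the pool size is an assumption derived from the ">20%" figure above:)

  36 / 0.20 = 180   -> upper bound on the w8 pool if 36 hosts is more than 20% of it
  36 + 36   = 72    -> hosts idle while the whole rack is down for the swap
  72 / 180  ~ 40%   -> rough share of w8 capacity offline during the replacement window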
(In reply to Amy Rich [:arr] [:arich] from comment #6)
> We have the switch now, and we're down 36 t-w864-ix machines until we
> replace it. We'll need a downtime for the entire rack (so temporarily
> disabling another 36 machines) to do the replacement due to the oversized
> PDUs. Another week is a long time to go without > 20% of our w8 capacity,
> but I'll let releng folks make the decision there.

The piece of information I'm missing here in order to make a decision is how long it would take to do the replacement. 

If it's a few hours (2-4), then I would suggest doing it ASAP this week. If it's longer, we should wait for the TCW.
Flags: needinfo?(coop)
i'd like at least 4 hours. this is our first time replacing a production switch with over-sized PDUs and I'm not sure of the extent of the work needed. i'd like to see if i can finesse it out without powering down any additional chassis, for future repairs/upgrades.

i'm pretty sure if things go smoothly, we can replace it much quicker. 

if possible, i'd like to schedule this tomorrow (tues 7/14) in the afternoon at 1pm as Sal works swing now.
coop/buildduty/arr, are we ok with proceeding at 1pm tomorrow with an expected downtime of 4hrs?

i've updated our moc manager but would like a hard confirmation/approval.
Sounds like coop will be handling the releng side of this tomorrow, so needinfoed him.
Flags: needinfo?(catlee) → needinfo?(coop)
(In reply to Van Le [:van] from comment #9)
> coop/buildduty/arr, are we ok with proceeding at 1pm tomorrow with an
> expected downtime of 4hrs?
> 
> ive updated our moc manager but would like a hard confirmation/approval.

If that's the earliest we can go, then yes, I approve.
Flags: needinfo?(coop)
(In reply to Chris Cooper [:coop] from comment #11)
> (In reply to Van Le [:van] from comment #9)
> > coop/buildduty/arr, are we ok with proceeding at 1pm tomorrow with an
> > expected downtime of 4hrs?
> > 
> > ive updated our moc manager but would like a hard confirmation/approval.
> 
> If that's the earliest we can go, then yes, I approve.

I disabled all releng machines in the rack (401-5). They should all be idle by 1pm PT. 

Even if there are one or two jobs left running, please proceed with the repair. The jobs will simply get retried on available hardware.
work done per #moc - all hosts in rack pingable and re-enabled
switch replacement complete. things worth noting:

1) we had to remove the vertical cable manager/fingers on the side to remove the switch. this is actually pretty tough when the rack is full of hosts and cables, as in this high-density cabinet.

2) we had to remove both PDUs. the OOB switches without redundant PSUs had to be shut off as their power cables weren't long enough.

3) we had to remove the top cable manager for wiggle room, as the network cables were cabled to length and there wasn't a great deal of slack.

4) definitely need at least 2 people, 3 preferred: one has to hold both PDUs, another holds the cables out of the way, and the last person removes the switch.

5) we completed in about 2.5 hours (removed bad switch, upgraded new switch's code, recabled, inventoried) but ran into a software bug with the switch. the switch was configured and showed the correct information in 'show virtual-chassis', but none of the nodes on node0/re0 were pingable. the switch had to be rebooted and then everything was happy again (rough verification sketch below).

will send the RMA unit back to Juniper at earliest convenience.
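
(A minimal post-swap verification sketch, assuming standard Junos CLI on the replacement Virtual Chassis; these are real Junos operational commands, but the sequence and annotations are assumptions, not a record of what was actually run beyond the 'show virtual-chassis' mentioned in point 5.)

  show virtual-chassis status     <- every member Prsnt, one master and one backup RE
  show virtual-chassis vc-port    <- VC ports between members are up
  show interfaces terse           <- member-0 host-facing ports up after the reboot
  request system reboot           <- what ultimately cleared the not-pingable state here (all-members on a VC)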
Status: ASSIGNED → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Blocks: 1185844
No longer blocks: 1185844
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard