Closed Bug 1181615 — Opened 9 years ago, Closed 9 years ago

Please investigate switch1.r401-5.ops.releng.scl3.mozilla.net

Categories

(Infrastructure & Operations Graveyard :: NetOps: DC Other, task)

Type: task
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: coop, Assigned: dcurado)

References

Details

Over in bug 1180877 we're tracking a spike in retries on Windows 8 test slaves. The machines in question all seem to be connected to the same switch, switch1.r101-18.console.scl3.mozilla.net.

The relevant comment is https://bugzilla.mozilla.org/show_bug.cgi?id=1180877#c6

Can someone please look at this switch, or get someone from NetOps to do so?
Assignee: nobody → network-operations
Component: MOC: Service Requests → NetOps: DC Other
QA Contact: lypulong → jbarnell
Summary: Please investigate switch1.r101-18.console.scl3.mozilla.net → Please investigate switch1.r401-5.ops.releng.scl3.mozilla.net
this switch is actually switch1.r401-5.ops.releng -> https://inventory.mozilla.org/en-US/systems/show/9473/

we have an open RMA with Juniper to replace node0 of this VC.
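
(Not from the bug: a minimal sketch of how a failed member of a Juniper Virtual Chassis is typically confirmed before opening an RMA, assuming standard Junos CLI on the VC; the commands are real Junos operational commands, but the annotations and the focus on member 0 are assumptions based on the node0 mention above.)

  show virtual-chassis status    <- lists each member's serial, role, and whether it is Prsnt/NotPrsnt
  show chassis alarms            <- any hardware alarms raised against the failing member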
yeah, what Van said...
Assignee: network-operations → dcurado
Status: NEW → ASSIGNED
Are the headaches with the windows boxen something we can limp through, or do we need to take
action here?

Thanks.
Flags: needinfo?(coop)
(In reply to Dave Curado :dcurado from comment #3)
> Are the headaches with the windows boxen something we can limp through, or
> do we need to take
> action here?

We can limp through. We have enough capacity on other switches that we're not starved.
Flags: needinfo?(coop)
:coop, did the urgency for this switch replacement change? we were planning to swap it out during the TCW of 7/18.

12:23 < arr> van: can you double check in the bug/with coop (since he was the last one to follow up)?
12:23 < van> ok
12:23 < arr> I believe the sheriffs disabled a bunch of stuff so we're hurting for capacity
We have the switch now, and we're down 36 t-w864-ix machines until we replace it. We'll need a downtime for the entire rack (so temporarily disabling another 36 machines) to do the replacement due to the oversized PDUs. Another week is a long time to go without > 20% of our w8 capacity, but I'll let releng folks make the decision there.
Flags: needinfo?(coop)
Flags: needinfo?(catlee)
Blocks: 1182631
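(Rough arithmetic, not stated in the bug, just to size the tradeoff discussed below; the pool size is an assumption derived from the ">20%" figure above:)

  36 / 0.20 = 180   -> upper bound on the w8 pool if 36 hosts is more than 20% of it
  36 + 36   = 72    -> hosts idle while the whole rack is down for the swap
  72 / 180  ~ 40%   -> rough share of w8 capacity offline during the replacement window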
(In reply to Amy Rich [:arr] [:arich] from comment #6)
> We have the switch now, and we're down 36 t-w864-ix machines until we
> replace it. We'll need a downtime for the entire rack (so temporarily
> disabling another 36 machines) to do the replacement due to the oversized
> PDUs. Another week is a long time to go without > 20% of our w8 capacity,
> but I'll let releng folks make the decision there.

The piece of information I'm missing here in order to make a decision is how long it would take to do the replacement. 

If it's a few hours (2-4), then I would suggest doing it ASAP this week. If it's longer, we should wait for the TCW.
Flags: needinfo?(coop)
i'd like at least 4 hours. this is our first time replacing a production switch with over-sized PDUs and I'm not sure of the extent of the work needed. i'd like to see if i can finesse it out without powering down any additional chassis, for future repairs/upgrades.

i'm pretty sure if things go smoothly, we can replace it much quicker. 

if possible, i'd like to schedule this tomorrow (tues 7/14) in the afternoon at 1pm as Sal works swing now.
coop/buildduty/arr, are we ok with proceeding at 1pm tomorrow with an expected downtime of 4hrs?

i've updated our moc manager but would like a hard confirmation/approval.
Sounds like coop will be handling the releng side of this tomorrow, so needinfoed him.
Flags: needinfo?(catlee) → needinfo?(coop)
(In reply to Van Le [:van] from comment #9)
> coop/buildduty/arr, are we ok with proceeding at 1pm tomorrow with an
> expected downtime of 4hrs?
> 
> ive updated our moc manager but would like a hard confirmation/approval.

If that's the earliest we can go, then yes, I approve.
Flags: needinfo?(coop)
(In reply to Chris Cooper [:coop] from comment #11)
> (In reply to Van Le [:van] from comment #9)
> > coop/buildduty/arr, are we ok with proceeding at 1pm tomorrow with an
> > expected downtime of 4hrs?
> > 
> > ive updated our moc manager but would like a hard confirmation/approval.
> 
> If that's the earliest we can go, then yes, I approve.

I disabled all releng machines in the rack (401-5). They should all be idle by 1pm PT. 

Even if there are one or two jobs left running, please proceed with the repair. The jobs will simply get retried on available hardware.
work done per #moc - all hosts in rack pingable and re-enabled
switch replacement complete. things worth noting:

1) we had to remove the vertical cable manager/fingers on the side to remove the switch. this is actually pretty tough when the rack is full of hosts and cables, as in this high-density cabinet.

2) we had to remove both PDUs. the OOB switches without redundant PSUs had to be shut off as their power cables weren't long enough.

3) we had to remove the top cable manager for wiggle room, as the network cables were cabled to length and there wasn't a great deal of slack.

4) definitely need at least 2 people, 3 preferred: one has to hold both PDUs, another holds the cables out of the way, and the last person removes the switch.

5) we completed in about 2.5 hours (removed bad switch, upgraded new switch's code, recabled, inventoried) but ran into a software bug with the switch. the switch was configured and showed the correct information in 'show virtual-chassis', but none of the nodes on node0/re0 were pingable. the switch had to be rebooted and then everything was happy again (rough verification sketch below).

will send the RMA unit back to Juniper at earliest convenience.
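
(A minimal post-swap verification sketch, assuming standard Junos CLI on the replacement Virtual Chassis; these are real Junos operational commands, but the sequence and annotations are assumptions, not a record of what was actually run beyond the 'show virtual-chassis' mentioned in point 5.)

  show virtual-chassis status     <- every member Prsnt, one master and one backup RE
  show virtual-chassis vc-port    <- VC ports between members are up
  show interfaces terse           <- member-0 host-facing ports up after the reboot
  request system reboot           <- what ultimately cleared the not-pingable state here (all-members on a VC)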
Status: ASSIGNED → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Blocks: 1185844
No longer blocks: 1185844
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard