Closed Bug 1056143 Opened 10 years ago Closed 10 years ago

use excess pandas to backfill broken production pandas

Categories

(Infrastructure & Operations :: RelOps: General, task)

x86
macOS
task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: arich, Unassigned)

Attachments

(1 file)

This bug is to track the production pandas that are broken and the pandas that we're going to take out of other chassis (which will be moving into storage) to backfill the broken production chassis.

Coop, can you please start by providing a list of the "broken" pandas so we can determine which we want to backfill and which chassis we want to put into storage?
Flags: needinfo?(coop)
Blocks: 1056145
Van just resolved the most recent panda recovery bug yesterday, so I'm putting a bunch of questionable pandas back into service today that may or may not hold up.

There are a few known-bad pandas I can add to the list today, but I want to see how those recovered pandas hold up before adding ones from that list.
Flags: needinfo?(coop)
I've added the disabled pandas as dependencies, but here's a list in case that's easier:

* panda-0091
* panda-0126
* panda-0129
* panda-0157
* panda-0191
* panda-0219
* panda-0257
* panda-0370
* panda-0373
* panda-0460
* panda-0476
* panda-0490
* panda-0539
* panda-0584
* panda-0587
* panda-0592
* panda-0643
* panda-0647
* panda-0681
* panda-0726
* panda-0730
* panda-0731
* panda-0734
* panda-0736
* panda-0747
* panda-0749
* panda-0778
* panda-0797
* panda-0803
* panda-0807
* panda-0819
* panda-0832
* panda-0834
* panda-0835
* panda-0848

I'll mark them all for decomm in slavealloc and the relevant bugs.

Note: I may have more to add once I go through the list of "broken" pandas, i.e. pandas that are enabled but not reporting.
I've added panda-0588 to the list.
Blocks: panda-0588
Added:

* panda-0330
* panda-0344
* panda-0489
Added:

* panda-0052
* panda-0095
* panda-0107
* panda-0234
* panda-0294
* panda-0302
* panda-0337
* panda-0621
* panda-0664
* panda-0665
* panda-0674
That's good enough for now.

The only other pandas I worry about are marked in red as "broken" in slave health:

https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slavetype.html?class=test&type=panda

That so many of them cluster around 117 days and 38 days since their last job makes me wonder whether a colo move (or similar large event) knocked a whole batch of good pandas offline.

I will file a follow-up bug for that.
(In reply to Chris Cooper [:coop] from comment #6)
> I will file a follow-up bug for that.

Bug 1057069 filed.
There doesn't appear to be any specific pattern, and all of the "broken" pandas are spread out over p1 - p9 (none that I could see in p10).

The distribution is as follows:

p1:
panda-0091
panda-0095
panda-0107
panda-0126
panda-0129
panda-0157

p2:
panda-0191
panda-0219
panda-0234
panda-0257

p3:
panda-0294
panda-0302
panda-0330
panda-0337
panda-0344

p4:
panda-0370
panda-0373

p5:
panda-0460
panda-0476
panda-0489
panda-0490

p6:
panda-0539
panda-0584
panda-0587
panda-0588
panda-0592

p7:
panda-0621
panda-0643
panda-0647
panda-0664
panda-0665
panda-0674
panda-0681

p8:
panda-0726
panda-0730
panda-0731
panda-0734
panda-0736
panda-0747
panda-0749
panda-0778

p9:
panda-0797
panda-0803
panda-0807
panda-0819
panda-0832
panda-0834
panda-0835
panda-0848

It doesn't make a great deal of difference, but there are slightly higher numbers of failures in the last three racks, which adds to my belief that those are the three racks we should put in storage.
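
For reference, the per-rack counts behind that observation can be tallied with a short Python sketch. The panda numbers are copied from the distribution above; the names `BROKEN` and `failures_per_rack` are illustrative only and not part of any releng tooling.

```python
# Broken panda numbers per rack, copied from the distribution above.
BROKEN = {
    "p1": [91, 95, 107, 126, 129, 157],
    "p2": [191, 219, 234, 257],
    "p3": [294, 302, 330, 337, 344],
    "p4": [370, 373],
    "p5": [460, 476, 489, 490],
    "p6": [539, 584, 587, 588, 592],
    "p7": [621, 643, 647, 664, 665, 674, 681],
    "p8": [726, 730, 731, 734, 736, 747, 749, 778],
    "p9": [797, 803, 807, 819, 832, 834, 835, 848],
}


def failures_per_rack(broken):
    """Return {rack: number of broken pandas}, keeping rack order."""
    return {rack: len(numbers) for rack, numbers in broken.items()}


counts = failures_per_rack(BROKEN)
for rack, count in counts.items():
    print(f"{rack}: {count}")
print(f"total: {sum(counts.values())}")
```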

That means that we should disable the pandas listed here:

https://inventory.mozilla.org/en-US/systems/racks/?rack=372
https://inventory.mozilla.org/en-US/systems/racks/?rack=373
https://inventory.mozilla.org/en-US/systems/racks/?rack=374

And my suggestion is that we do backfill as follows:

pandas from panda-chassis-055 relocated into p1 and p2:

p1:
panda-0091 -> panda-0610
panda-0095 -> panda-0611
panda-0107 -> panda-0612
panda-0126 -> panda-0613
panda-0129 -> panda-0614
panda-0157 -> panda-0615

p2:
panda-0191 -> panda-0616
panda-0219 -> panda-0617
panda-0234 -> panda-0618
panda-0257 -> panda-0619

pandas from panda-chassis-056 relocated into p3, p4, and p5:

p3:
panda-0294 -> panda-0620
panda-0302 -> panda-0622
panda-0330 -> panda-0623
panda-0337 -> panda-0624
panda-0344 -> panda-0625

p4:
panda-0370 -> panda-0626
panda-0373 -> panda-0627

p5:
panda-0460 -> panda-0628
panda-0476 -> panda-0629
panda-0489 -> panda-0630
panda-0490 -> panda-0081

pandas from panda-chassis-057 relocated into p6:

p6:
panda-0539 -> panda-0631
panda-0584 -> panda-0632
panda-0587 -> panda-0633
panda-0588 -> panda-0634
panda-0592 -> panda-0635

That leaves panda-chassis-057 with extra boards in it and makes it the
one we should scavenge from first when we next need backfill.
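
As a sanity check, the plan above can be written down as data and verified mechanically. This is only a sketch: the pairs are copied from the lists above, while `BACKFILL`, `name`, and `check_plan` are hypothetical helper names. It confirms that no dead panda and no replacement appears twice, and that no replacement is itself on the dead list.

```python
# (dead, replacement) panda numbers per destination rack, copied from the plan above.
BACKFILL = {
    "p1": [(91, 610), (95, 611), (107, 612), (126, 613), (129, 614), (157, 615)],
    "p2": [(191, 616), (219, 617), (234, 618), (257, 619)],
    "p3": [(294, 620), (302, 622), (330, 623), (337, 624), (344, 625)],
    "p4": [(370, 626), (373, 627)],
    "p5": [(460, 628), (476, 629), (489, 630), (490, 81)],
    "p6": [(539, 631), (584, 632), (587, 633), (588, 634), (592, 635)],
}


def name(number):
    """Format a panda number the way the bug does, e.g. 91 -> 'panda-0091'."""
    return f"panda-{number:04d}"


def check_plan(plan):
    """Raise ValueError on duplicate or overlapping boards; return the pair count."""
    dead = [d for pairs in plan.values() for d, _ in pairs]
    replacements = [r for pairs in plan.values() for _, r in pairs]
    if len(set(dead)) != len(dead):
        raise ValueError("a dead panda is listed more than once")
    if len(set(replacements)) != len(replacements):
        raise ValueError("a replacement panda is used more than once")
    overlap = set(dead) & set(replacements)
    if overlap:
        raise ValueError(f"replacement is also dead: {sorted(map(name, overlap))}")
    return len(dead)


total = check_plan(BACKFILL)
for rack, pairs in BACKFILL.items():
    for dead, replacement in pairs:
        print(f"{rack}: {name(dead)} -> {name(replacement)}")
print(f"{total} boards to backfill")
```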

Coop, does that look good to you?
Flags: needinfo?(coop)
(In reply to Amy Rich [:arich] [:arr] from comment #8)
> That means that we should disable the pandas listed here:
> 
> https://inventory.mozilla.org/en-US/systems/racks/?rack=372
> https://inventory.mozilla.org/en-US/systems/racks/?rack=373
> https://inventory.mozilla.org/en-US/systems/racks/?rack=374
> 
> Coop, does that look good to you?

That's fine. I'll mark the pandas from those racks as disabled and in storage in slavealloc.
Flags: needinfo?(coop)
(In reply to Chris Cooper [:coop] from comment #9)
> That's fine. I'll mark the pandas from those racks as disabled and in
> storage in slavealloc.

This is done.
This removes the dead pandas from nagios. I'll wait until the backfill pandas are in place before adding them to nagios.
The following have also been decommissioned in inventory:

panda-0091.p1.releng.scl3.mozilla.com
panda-0095.p1.releng.scl3.mozilla.com
panda-0107.p1.releng.scl3.mozilla.com
panda-0126.p1.releng.scl3.mozilla.com
panda-0129.p1.releng.scl3.mozilla.com
panda-0157.p1.releng.scl3.mozilla.com
panda-0191.p2.releng.scl3.mozilla.com
panda-0219.p2.releng.scl3.mozilla.com
panda-0234.p2.releng.scl3.mozilla.com
panda-0294.p3.releng.scl3.mozilla.com
panda-0302.p3.releng.scl3.mozilla.com
panda-0330.p3.releng.scl3.mozilla.com
panda-0337.p3.releng.scl3.mozilla.com
panda-0344.p3.releng.scl3.mozilla.com
panda-0370.p4.releng.scl3.mozilla.com
panda-0373.p4.releng.scl3.mozilla.com
panda-0460.p5.releng.scl3.mozilla.com
panda-0476.p5.releng.scl3.mozilla.com
panda-0489.p5.releng.scl3.mozilla.com
panda-0490.p5.releng.scl3.mozilla.com
panda-0539.p6.releng.scl3.mozilla.com
panda-0584.p6.releng.scl3.mozilla.com
panda-0587.p6.releng.scl3.mozilla.com
panda-0588.p6.releng.scl3.mozilla.com
panda-0592.p6.releng.scl3.mozilla.com
(In reply to Amy Rich [:arich] [:arr] from comment #8)

dcops, could you please backfill pandas as described in comment #8? The notation is:

dead panda -> replacement panda scavenged from a chassis we want to put in storage. e.g.
panda-0091 -> panda-0610

We'll need to update inventory for each board with the new location information, IP, vlan, as well as the mobile imaging server and panda-relay key/values. Since there are only a handful, do you want to do them manually, or do you want to try to get uberj's assistance to do these as a batch?
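
If it helps with batching, here is a rough Python sketch of how a per-board worklist could be generated from the `dead -> replacement` notation. It assumes that hostnames follow the `panda-NNNN.pN.releng.scl3.mozilla.com` pattern and that a relocated board keeps its panda number and takes the destination rack's subdomain; both are assumptions, not something confirmed by inventory. The helper names and record keys are made up for illustration, and the IP, vlan, imaging server, and panda-relay values are deliberately left out because they have to come from inventory.

```python
def fqdn(panda, rack):
    """Build a hostname on the assumed panda-NNNN.pN.releng.scl3.mozilla.com pattern."""
    return f"{panda}.{rack}.releng.scl3.mozilla.com"


def worklist(moves):
    """Turn (rack, dead, replacement) tuples into one record per board to update."""
    rows = []
    for rack, dead, replacement in moves:
        rows.append({
            "rack": rack,
            "retire": fqdn(dead, rack),
            "install": replacement,
            # Assumption: the replacement is renamed into the destination rack.
            "assumed_new_fqdn": fqdn(replacement, rack),
        })
    return rows


# Two pairs taken from the plan, shown in the same "dead -> replacement" notation.
rows = worklist([
    ("p1", "panda-0091", "panda-0610"),
    ("p6", "panda-0592", "panda-0635"),
])
for row in rows:
    print(f"{row['retire']} -> {row['assumed_new_fqdn']}")
```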

Once they're all up and functional, I'll add them to nagios.
Assignee: arich → server-ops-dcops
Component: RelOps → Server Operations: DCOps
Product: Infrastructure & Operations → mozilla.org
QA Contact: arich → dmoore
IP, vlan, mobile imaging server, and panda-relay updated in inventory. Waiting on physical move and update of location information in inventory.
colo-trip: --- → scl3
UPDATE: Remaining pandas that need replacement.

panda-0330 -> panda-0623
panda-0337 -> panda-0624
panda-0344 -> panda-0625

p4:
panda-0370 -> panda-0626
panda-0373 -> panda-0627

p5:
panda-0460 -> panda-0628
panda-0476 -> panda-0629
panda-0489 -> panda-0630
panda-0490 -> panda-0081

pandas from panda-chassis-057 relocated into p6

p6:
panda-0539 -> panda-0631
panda-0584 -> panda-0632
panda-0587 -> panda-0633
panda-0588 -> panda-0634
panda-0592 -> panda-0635
All pandas have been replaced, rack location and switch ports updated in inventory.
Self-tested and installed a fresh copy of 4.0.4_v3.3 on each replaced board.
Assignee: server-ops-dcops → relops
Status: NEW → RESOLVED
Closed: 10 years ago
Component: Server Operations: DCOps → RelOps
Product: mozilla.org → Infrastructure & Operations
QA Contact: dmoore → arich
Resolution: --- → FIXED
Depends on: 1072405