Closed Bug 1193002 Opened 5 years ago Closed 4 years ago

decommission more pandas/foopies and mobile imaging servers once bug 1183877 lands

Categories

(Infrastructure & Operations Graveyard :: CIDuty, task)


Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: kmoir, Assigned: kmoir)

References

Details

Attachments

(2 files, 2 obsolete files)

Reallocate the associated foopies as linux32 test machines and reimage them.
We are currently having linux32 test capacity issues.

In inventory these machines appear to be the same; do they need any hardware changes (video card?) to run talos tests?

foopy -> iX Systems - iX21X4 2U Neutron
linux32 -> iX Systems - iX21X4 2U Neutron (Releng Talos Config 1)
Flags: needinfo?(arich)
Summary: disable more pandas once bug 1183877 lands → disable more pandas once bug 1183877 lands and reallocate foopies as linux32 test machines
Summary: disable more pandas once bug 1183877 lands and reallocate foopies as linux32 test machines → disable more pandas once bug 1183877 lands and reallocate foopies as linux32 talos machines
foopies and talos machines have different model CPUs:

foopy: model name      : Intel(R) Xeon(R) CPU           X3470  @ 2.93GHz
talos: model name      : Intel(R) Xeon(R) CPU           X3450  @ 2.67GHz
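The model strings above come from /proc/cpuinfo on each host. A minimal sketch of how to pull them (run over ssh on the foopy/talos machines themselves):

```python
import os

def cpu_model(cpuinfo_text):
    """Return the first 'model name' value from /proc/cpuinfo text.

    The line is repeated once per core, so the first match is enough.
    """
    for line in cpuinfo_text.splitlines():
        if line.startswith("model name"):
            return line.split(":", 1)[1].strip()
    return None

# On the host itself (guarded so it only runs where /proc exists):
if os.path.exists("/proc/cpuinfo"):
    with open("/proc/cpuinfo") as f:
        print(cpu_model(f.read()))
```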
Flags: needinfo?(arich)
see bug 1056139 for the necessary modifications to turn foopies into talos machines.
Also note that the foopies come out of warranty in November.
:jmaher: I don't know whether it is worth reallocating the foopies, given that their CPUs differ from the existing iX machines and their warranties expire soon.
Flags: needinfo?(jmaher)
maybe :bc: could use these for autophone?
Flags: needinfo?(jmaher) → needinfo?(bob)
The only use I can think of would be as Autophone controllers but we are moving away from using Mac minis as controllers and to using Linux servers.
Flags: needinfo?(bob)
If the foopies are out of warranty in a few weeks, then we can just shut them off when they are done.  Do we have a plan for the webserver that the Android tests would use?  I know talos uses the webserver, and we still run talos.
:jmaher: Which web server are you referring to? 

Each panda rack has a set of foopies and an imaging server. If we stop using the pandas in that rack, we can shut off and decomm that infrastructure (except for rack 1, since that's where the master imaging server that syncs up with the database lives).
relengwebadm.private.scl3.mozilla.com; here is a wiki page outlining the update process:
https://wiki.mozilla.org/ReleaseEngineering:Buildduty:Other_Duties#Update_mobile_talos_webhosts

I assume this is just a few machines and in due time we can decommission them as well.
relengweb houses many web services for releng, so it won't be decommissioned any time soon.
I'm not sure what the plan for talos is going forward and whether or not you'll still need those vhosts, but if not we can delete them. That's a different issue than disposing of this hardware, though.
Just disabled pandas 0022-0060, 0082-0306 in slavealloc
Summary: disable more pandas once bug 1183877 lands and reallocate foopies as linux32 talos machines → disable more pandas once bug 1183877 lands and reallocate foopies
Attached patch bug1193002.patch (obsolete) — Splinter Review
patch to decommission pandas 0022-0060, 0082-0306, 0901-0903, 0610-0618 and foopies:
102-104, 39-56
Attachment #8678913 - Flags: review?(bugspam.Callek)
Comment on attachment 8678913 [details] [diff] [review]
bug1193002.patch

Review of attachment 8678913 [details] [diff] [review]:
-----------------------------------------------------------------

stamp
Attachment #8678913 - Flags: review?(bugspam.Callek) → review+
When we do the next batch (assuming it isn't "all of them"), can we please do them by chassis and clean out entire racks at a time? Getting rid of the rest of p3 and p10 first would be optimal. This will make life much easier for dcops, relops, and netops since we can decomm the mobile imaging servers at the same time if we do a whole rack (thus emptying out the rack and allowing netops to delete the vlan, etc). 

Here's the remaining mappings:

panda-relay-027.p3.releng.mozilla.com (303-312)
panda-relay-028.p3.releng.mozilla.com (313-323)
panda-relay-029.p3.releng.mozilla.com (324-334)
panda-relay-030.p3.releng.mozilla.com (335-343, 624-625)

panda-relay-031.p4.releng.mozilla.com (346-356)
panda-relay-032.p4.releng.mozilla.com (357-367)
panda-relay-033.p4.releng.mozilla.com (369-378)
panda-relay-034.p4.releng.mozilla.com (379-389)
panda-relay-035.p4.releng.mozilla.com (390-400)
panda-relay-036.p4.releng.mozilla.com (401-410)
panda-relay-037.p4.releng.mozilla.com (412-422)
panda-relay-038.p4.releng.mozilla.com (423-433)

panda-relay-039.p5.releng.mozilla.com (434-443)
panda-relay-040.p5.releng.mozilla.com (445-455)
panda-relay-041.p5.releng.mozilla.com (456-466, 628)
panda-relay-042.p5.releng.mozilla.com (45, 467-477, 629)
panda-relay-043.p5.releng.mozilla.com (478-488)
panda-relay-044.p5.releng.mozilla.com (491-499)
panda-relay-045.p5.releng.mozilla.com (500-510)
panda-relay-046.p5.releng.mozilla.com (511-521)

panda-relay-047.p6.releng.mozilla.com (522-532)
panda-relay-048.p6.releng.mozilla.com (533-543)
panda-relay-049.p6.releng.mozilla.com (544-554)
panda-relay-050.p6.releng.mozilla.com (555-565)
panda-relay-051.p6.releng.mozilla.com (57, 69, 566-576)
panda-relay-052.p6.releng.mozilla.com (577-585)
panda-relay-053.p6.releng.mozilla.com (589-598, 634)
panda-relay-054.p6.releng.mozilla.com (599-609)

panda-relay-079.p10.releng.mozilla.com (874-884)
panda-relay-080.p10.releng.mozilla.com (887-909)
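When preparing a decomm patch, the relay-to-panda mapping above can be expanded into individual slave names. A quick sketch (the zero-padded panda-NNNN naming follows the slavealloc convention seen elsewhere in this bug):

```python
import re

def expand_relay_line(line):
    """Expand a 'panda-relay-NNN... (a-b, c)' mapping line into
    the relay hostname and its list of panda slave names."""
    relay, ranges = re.match(r"(\S+)\s+\(([^)]+)\)", line).groups()
    numbers = []
    for part in ranges.split(","):
        part = part.strip()
        if "-" in part:
            lo, hi = part.split("-")
            numbers.extend(range(int(lo), int(hi) + 1))
        else:
            numbers.append(int(part))
    return relay, ["panda-%04d" % n for n in numbers]

relay, names = expand_relay_line(
    "panda-relay-030.p3.releng.mozilla.com (335-343, 624-625)")
# names covers panda-0335 through panda-0343 plus panda-0624/0625
```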
Summary: disable more pandas once bug 1183877 lands and reallocate foopies → decommission more pandas/foopies and mobile imaging servers once bug 1183877 lands
Attached patch bug1193002.patch — Splinter Review
Attachment #8678913 - Attachment is obsolete: true
Assignee: nobody → kmoir
FYI, 0620-0629 should also be absent, since they were decommissioned long ago as part of decomming rack p7.
I missed:

panda-relay-005.p10.releng.scl3.mozilla.com (58-68)
panda-relay-006.p10.releng.scl3.mozilla.com (70-80)
Looking at slavealloc, pandas 0620-0629 are running jobs, so it appears they were not decommed.  Perhaps they were assigned as replacements in racks where other pandas died?
Apparently someone has been doing that and not updating nagios. :/
I'll add that to my patch in the other bug.
Attachment #8679038 - Flags: checked-in+
Depends on: 1218571
Okay, we've moved the master mobile imaging server to p4, so let's decomm things in that rack LAST and aim for p3 and p10 FIRST if we're doing another partial batch.
Amy, I talked to jmaher in the mobile meeting and he said it is reasonable to expect talos on autophone to be completely implemented by the end of Q1.  That being said, we still have a couple of hundred pandas enabled (although a lot of them seem to be in a broken state; perhaps buildduty is not actively rebooting them) that are easily keeping up with the current load.  We could disable some more; let me know which racks you would prefer and we can move forward with these changes.
Flags: needinfo?(arich)
I'm not sure how many you want to disable, but I'd like to go in this order. Being able to finish off entire racks at a time would be great. I've separated them out so that the two half racks we have left are first. I've also made sure that p4 is last, since that's where the master mozpool server is now.

panda-relay-027.p3.releng.mozilla.com (303-312)
panda-relay-028.p3.releng.mozilla.com (313-323)
panda-relay-029.p3.releng.mozilla.com (324-334)
panda-relay-030.p3.releng.mozilla.com (335-343, 624-625)

panda-relay-005.p10.releng.scl3.mozilla.com (58-68)
panda-relay-006.p10.releng.scl3.mozilla.com (70-80)
panda-relay-079.p10.releng.mozilla.com (874-884)
panda-relay-080.p10.releng.mozilla.com (887-909)

panda-relay-039.p5.releng.mozilla.com (434-443)
panda-relay-040.p5.releng.mozilla.com (445-455)
panda-relay-041.p5.releng.mozilla.com (456-466, 628)
panda-relay-042.p5.releng.mozilla.com (45, 467-477, 629)
panda-relay-043.p5.releng.mozilla.com (478-488)
panda-relay-044.p5.releng.mozilla.com (491-499)
panda-relay-045.p5.releng.mozilla.com (500-510)
panda-relay-046.p5.releng.mozilla.com (511-521)

panda-relay-047.p6.releng.mozilla.com (522-532)
panda-relay-048.p6.releng.mozilla.com (533-543)
panda-relay-049.p6.releng.mozilla.com (544-554)
panda-relay-050.p6.releng.mozilla.com (555-565)
panda-relay-051.p6.releng.mozilla.com (57, 69, 566-576)
panda-relay-052.p6.releng.mozilla.com (577-585)
panda-relay-053.p6.releng.mozilla.com (589-598, 634)
panda-relay-054.p6.releng.mozilla.com (599-609)

panda-relay-031.p4.releng.mozilla.com (346-356)
panda-relay-032.p4.releng.mozilla.com (357-367)
panda-relay-033.p4.releng.mozilla.com (369-378)
panda-relay-034.p4.releng.mozilla.com (379-389)
panda-relay-035.p4.releng.mozilla.com (390-400)
panda-relay-036.p4.releng.mozilla.com (401-410)
panda-relay-037.p4.releng.mozilla.com (412-422)
panda-relay-038.p4.releng.mozilla.com (423-433)
Flags: needinfo?(arich)
:vlad or alin

Could you go through the list of pandas that are showing up as orange in slave health
https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slavetype.html?class=test&type=panda
and reboot them to try to get them working?  Then we can move forward with patches to disable some more panda racks, with a better idea of how many remaining pandas are actually working.
Flags: needinfo?(vlad.ciobancai)
Flags: needinfo?(alin.selagea)
Because I'm curious like that, I already did the ones that have been idle for more than two weeks, which, unsurprisingly, resulted in "unreachable" bugs being filed for all of them except the ones that don't already have tracking bugs filed, because of bug 1126879.

Do we actually want to make dcops remember how to deal with pandas, which they haven't touched for months, and replace the SD cards in all those boards, which, by being dead, are probably telling us they are the ones we most want to decomm?
No, my intention was not to have dcops deal with pandas and replace SD cards.  I was just going to have buildduty look at the existing pool of pandas (which has a huge number that appear to be broken) and reboot them to see if they could come back online, so we could have a good idea of how many chassis we can actually decomm.
Ah, reboot them other than via slaveapi so they don't get unreachables?
Any buildduty troubleshooting with pandas should be done via mozpool. It tries to perform any corrective measures (that don't require physical intervention) and will tell you what the failure state is: http://mobile-imaging-004.p4.releng.scl3.mozilla.com/ui/lifeguard.html

There are only 12 panda boards showing hardware issues, so I suspect any other issues are related not to the pandas themselves but to the foopies, buildbot, etc. One can always force a reimage of a panda from mozpool if need be.
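Checking a board's lifeguard state could also be scripted against the imaging server named above. A sketch only: the API path below is an assumption for illustration, not taken from mozpool's documentation.

```python
# Imaging server named in this bug; the /api/... path is a hypothetical
# example endpoint, NOT confirmed against mozpool's docs.
MOZPOOL = "http://mobile-imaging-004.p4.releng.scl3.mozilla.com"

def device_state_url(device):
    """Build a (hypothetical) lifeguard state URL for one panda board."""
    return "%s/api/device/%s/state/" % (MOZPOOL, device)

# e.g. device_state_url("panda-0303") could then be fetched with
# urllib.request.urlopen() from a host on the releng network.
```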
Attached patch 1193002_v2.patch (obsolete) — Splinter Review
Attached a patch to decommission the rest of the panda slaves per comment #24
Flags: needinfo?(vlad.ciobancai)
Attachment #8680693 - Flags: review?(kmoir)
Updated the patch
Attachment #8680693 - Attachment is obsolete: true
Attachment #8680693 - Flags: review?(kmoir)
Attachment #8680715 - Flags: review?(kmoir)
Pandas from bug1193002.patch removed from the slavealloc db
Flags: needinfo?(alin.selagea)
Attachment #8680715 - Flags: review?(kmoir) → review+
(In reply to Vlad Ciobancai [:vladC] from comment #32)
> Created attachment 8680715 [details] [diff] [review]
> bug1193002_v2.patch
> 
> Updated the patch

Disabled all the pandas from the above patch in slavealloc
Attachment #8680715 - Flags: checked-in+
Status: NEW → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED
Component: Platform Support → Buildduty
Product: Release Engineering → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard