Investigate why backfilled pandas haven't taken any jobs

RESOLVED FIXED

Status

Release Engineering
Buildduty
P2
normal
RESOLVED FIXED
3 years ago
3 years ago

People

(Reporter: philor, Assigned: coop)

Tracking

Firefox Tracking Flags

(Not tracked)

Details

(Whiteboard: [capacity])

Attachments

(3 attachments)

(Reporter)

Description

3 years ago
None of the pandas that bug 1056143 backfilled into production while their chassis was being removed have taken any jobs since.
If I had to guess, these pandas are likely still associated with their original foopy instead of a foopy that still exists.

I'll dig into this tomorrow.
Assignee: nobody → coop
Status: NEW → ASSIGNED
Priority: -- → P2
Created attachment 8496045 [details] [diff] [review]
Add replacement pandas to new foopies.

I'm slotting the replacement pandas into the same foopies as the original pandas that they replaced. This information is contained in https://bugzilla.mozilla.org/show_bug.cgi?id=1056143#c8
Attachment #8496045 - Flags: review?(bugspam.Callek)
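For illustration, the re-slotting described above can be sketched in Python. This is not the attached patch. It assumes devices.json maps each device name to an object with "foopy", "relayhost" and "relayid" keys, which this bug does not confirm. The device, foopy and relay names are hypothetical, and reassign_devices is a made-up helper.

```python
import json


def reassign_devices(devices, replacements):
    """Return a copy of devices in which each replacement panda takes
    over the foopy of the panda it replaces.

    replacements maps original device name -> replacement device name.
    """
    # Copy each entry so the caller's data is left untouched.
    updated = {name: dict(entry) for name, entry in devices.items()}
    for old_name, new_name in replacements.items():
        if old_name not in devices:
            raise KeyError(f"unknown original device: {old_name}")
        entry = updated.setdefault(new_name, {})
        entry["foopy"] = devices[old_name]["foopy"]
    return updated


# Hypothetical data in the assumed devices.json layout.
devices = json.loads("""
{
  "panda-0100": {"foopy": "foopy-a", "relayhost": "relay-a", "relayid": "1:1"},
  "panda-0900": {"foopy": "foopy-retired", "relayhost": "relay-old", "relayid": "2:3"}
}
""")

updated = reassign_devices(devices, {"panda-0100": "panda-0900"})
print(updated["panda-0900"]["foopy"])
```

Note that this sketch copies only the foopy. The relay fields of the replacement are left as they were, so they still need to be checked separately.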
Comment on attachment 8496045 [details] [diff] [review]
Add replacement pandas to new foopies.

Review of attachment 8496045 [details] [diff] [review]:
-----------------------------------------------------------------

stamp+, are we removing (or did we already) the swapped-out pandas from devices.json?
Attachment #8496045 - Flags: review?(bugspam.Callek) → review+
(In reply to Justin Wood (:Callek) from comment #3) 
> stamp+, are we removing (or did we already) the swapped-out pandas from
> devices.json?

That will be step #2. I'll post a patch for it shortly.
Created attachment 8496289 [details] [diff] [review]
Remove decommissioned pandas from devices.json
Attachment #8496289 - Flags: review?(bugspam.Callek)
Comment on attachment 8496045 [details] [diff] [review]
Add replacement pandas to new foopies.

Review of attachment 8496045 [details] [diff] [review]:
-----------------------------------------------------------------

https://hg.mozilla.org/build/tools/rev/5ad6931211e8
Attachment #8496045 - Flags: checked-in+
Comment on attachment 8496289 [details] [diff] [review]
Remove decommissioned pandas from devices.json

Review of attachment 8496289 [details] [diff] [review]:
-----------------------------------------------------------------

I didn't cross-check the panda list, but r+.
Attachment #8496289 - Flags: review?(bugspam.Callek) → review+
(Reporter)

Comment 8

3 years ago
Step in the right direction, now they are taking jobs, but every single one of them is now disabled, because they all fail every other job (with a tiny bit of variation as they break some jobs even earlier) like https://tbpl.mozilla.org/php/getParsedLog.php?id=49026169&tree=Mozilla-Inbound#error1, failing to powercycle 75 times and thus burning the job.
(In reply to Phil Ringnalda (:philor) from comment #8)
> Step in the right direction, now they are taking jobs, but every single one
> of them is now disabled, because they all fail every other job (with a tiny
> bit of variation as they break some jobs even earlier) like
> https://tbpl.mozilla.org/php/getParsedLog.php?id=49026169&tree=Mozilla-Inbound#error1,
> failing to powercycle 75 times and thus burning the job.

Failing to powercycle means that the relay host is probably wrong and needs updating too. That info isn't available in bug 1056143, so I'm going to need to go spelunking in inventory to find it.
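A cross-check of this kind could look roughly like the Python sketch below. It assumes the inventory data has already been exported into a plain mapping of device name to (relay host, relay id), and that devices.json entries carry "relayhost" and "relayid" keys. Both are assumptions, as are all names in the sample data. find_relay_mismatches is a hypothetical helper, not an existing tool.

```python
def find_relay_mismatches(devices, inventory):
    """Map device name -> (configured, expected) for every disagreement.

    configured comes from the devices.json-style entry, expected from the
    inventory export. A device missing from inventory is reported with
    expected set to None.
    """
    mismatches = {}
    for name, entry in sorted(devices.items()):
        configured = (entry.get("relayhost"), entry.get("relayid"))
        expected = inventory.get(name)
        if expected is None:
            mismatches[name] = (configured, None)
        elif tuple(expected) != configured:
            mismatches[name] = (configured, tuple(expected))
    return mismatches


# Hypothetical sample data.
devices = {
    "panda-x1": {"foopy": "foopy-a", "relayhost": "relay-old", "relayid": "2:3"},
    "panda-x2": {"foopy": "foopy-a", "relayhost": "relay-a", "relayid": "1:2"},
}
inventory = {
    "panda-x1": ("relay-a", "1:1"),
    "panda-x2": ("relay-a", "1:2"),
}

for name, (configured, expected) in find_relay_mismatches(devices, inventory).items():
    print(name, "configured:", configured, "inventory:", expected)
```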
Comment on attachment 8496289 [details] [diff] [review]
Remove decommissioned pandas from devices.json

Review of attachment 8496289 [details] [diff] [review]:
-----------------------------------------------------------------

https://hg.mozilla.org/build/tools/rev/00d99fd9508f
Attachment #8496289 - Flags: checked-in+
(In reply to Chris Cooper [:coop] from comment #9)
> Failing to powercycle means that the relay host is probably wrong and needs
> updating too. That info isn't available in bug 1056143, so I'm going to need
> to go spelunking in inventory to find it.

https://hg.mozilla.org/build/tools/rev/16985a437a01

I've kicked off a re-image of all the affected pandas, and have re-enabled them all in slavealloc.
(Reporter)

Comment 12

3 years ago
I may well have disabled some for failing after you fixed them, thinking that I was just disabling ones that I failed to actually disable last night.
(In reply to Phil Ringnalda (:philor) from comment #12)
> I may well have disabled some for failing after you fixed them, thinking
> that I was just disabling ones that I failed to actually disable last night.

I will check them all again after the in-progress reconfig finishes.
(Reporter)

Comment 14

3 years ago
Still busted, and disabled again: panda-0616, panda-0623, panda-0630, panda-0631.
(Reporter)

Comment 15

3 years ago
Also disabled panda-0615, panda-0624, panda-0628, panda-0632 and panda-0633, so let's just say "all of them, once they finally take two jobs so they can burn one of them."
Since the relay assignments in devices.json are correct now, either the assignments are wrong in the inventory or maybe there's a problem at the mozpool layer.

I'll start diving into the logs today.
Despite a reconfig on Saturday, which is supposed to update the tools checkout on the foopies (tools/buildfarm/maintenance/end_to_end_reconfig.sh), the foopies still had an out-of-date tools repo. As a result, any mozharness scripts that referenced the tools checkout used stale relay data when trying to reboot the pandas.

I updated the tools checkout on the foopies today, and then re-enabled 3 pandas to gauge results. Each of those 3 pandas has now run at least 2 successful jobs in a row, so I've re-enabled all the other pandas now as well.
Status: ASSIGNED → RESOLVED
Last Resolved: 3 years ago
Resolution: --- → FIXED
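The stale-checkout problem described above is the kind of thing a small check could catch after a reconfig. The Python sketch below assumes the revision of each foopy's tools checkout has already been collected somehow (for example by running `hg id -i` in the checkout) into a mapping of foopy name to revision string. The foopy names are hypothetical and stale_checkouts is a made-up helper. The expected revision is the relay fix linked in this bug.

```python
def stale_checkouts(foopy_revisions, expected_revision):
    """Return the sorted names of foopies whose tools checkout is not at
    the expected revision.

    Revisions are compared on their 12-character short form. A trailing
    "+" (which hg appends for a checkout with local changes) is ignored.
    """
    expected = expected_revision[:12]
    return sorted(
        foopy
        for foopy, revision in foopy_revisions.items()
        if revision.rstrip("+")[:12] != expected
    )


# Hypothetical foopy names; the revisions are taken from this bug.
foopy_revisions = {
    "foopy-a": "16985a437a01",
    "foopy-b": "00d99fd9508f",
    "foopy-c": "16985a437a01+",
}

stale = stale_checkouts(foopy_revisions, "16985a437a01")
print(stale)
```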
Created attachment 8497132 [details] [diff] [review]
Prune reserve pandas from devices.json

The pandas I'm removing represent our reserve capacity. They'll be used to backfill any hardware failures in the panda pool. When that happens, a given panda will be slotted into a new foopy and relay, so keeping this old information in devices.json is pointless.

Removing these pandas and their associated foopy mappings will also make fabric actions that touch foopies less painful, because we won't be trying to access 20 decommissioned foopies.
Attachment #8497132 - Flags: review?(bugspam.Callek)
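As a rough illustration of the pruning described above, the Python sketch below removes a set of devices and reports which foopies are no longer referenced by any remaining device. It is not the attached patch. The "foopy" key and all names in the sample data are assumptions, and prune_devices is a hypothetical helper.

```python
def prune_devices(devices, to_remove):
    """Drop the named devices and report foopies left without any device.

    Returns (remaining_devices, orphaned_foopies), where orphaned_foopies
    is a sorted list of foopies referenced before pruning but not after.
    """
    to_remove = set(to_remove)
    remaining = {
        name: entry for name, entry in devices.items() if name not in to_remove
    }
    foopies_before = {entry.get("foopy") for entry in devices.values()}
    foopies_after = {entry.get("foopy") for entry in remaining.values()}
    orphaned = sorted(f for f in foopies_before - foopies_after if f)
    return remaining, orphaned


# Hypothetical sample data in the assumed devices.json layout.
devices = {
    "panda-x1": {"foopy": "foopy-a"},
    "panda-x2": {"foopy": "foopy-a"},
    "panda-r1": {"foopy": "foopy-retired"},
    "panda-r2": {"foopy": "foopy-retired"},
}

remaining, orphaned = prune_devices(devices, ["panda-r1", "panda-r2"])
print(sorted(remaining), orphaned)
```

The list of orphaned foopies is what would tell you which hosts fabric actions no longer need to touch.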
Comment on attachment 8497132 [details] [diff] [review]
Prune reserve pandas from devices.json

Review of attachment 8497132 [details] [diff] [review]:
-----------------------------------------------------------------

Stamp
Attachment #8497132 - Flags: review?(bugspam.Callek) → review+
Comment on attachment 8497132 [details] [diff] [review]
Prune reserve pandas from devices.json

Review of attachment 8497132 [details] [diff] [review]:
-----------------------------------------------------------------

https://hg.mozilla.org/build/tools/rev/179bfe89bf2a
Attachment #8497132 - Flags: checked-in+