Split mozpool managed panda jobs away from non mozpool jobs

RESOLVED FIXED

Status

RESOLVED FIXED
5 years ago
4 months ago

People

(Reporter: Callek, Assigned: Callek)

Tracking

Firefox Tracking Flags

(Not tracked)

Details

Attachments

(4 attachments)

(Assignee)

Description

5 years ago
So, to lead with the problem summary:

* We have unittests on pandas running with mozpool.
** This means we can't "lock" them out
** UnitTest jobs for unittests "request" them from mozpool
** UnitTest jobs "release" the request when complete
** UnitTest jobs do this all as part of mozharness

* Talos jobs are not using Mozpool
** Everything is as before
** Mozpool has no knowledge of when a panda is being used for talos
** Mozpool is unable to distinguish between a device going down because talos wanted to reboot it and a device going down due to some power issue or other hardware issue.
** These jobs used to use "Locked Out" pandas.

* Mozpool watches pandas which are not "Locked Out"
** If an unloaned device goes down during mozpool's regular pinging, mozpool tries to recover it by:
*** Rebooting the device
*** Applying the Self-Test image to the device's SD card
*** Running the Self Test
*** Returning the device to the "ready" state [this state is not an Android image until the next mozpool "request"]
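
The recovery sequence above can be sketched roughly as follows. This is an illustration only: the step names, the `recover` function and the `locked_out` flag are assumptions made for the example, not mozpool's actual state machine.

```python
# Hypothetical sketch of the recovery steps listed above; not mozpool's real code.
RECOVERY_STEPS = [
    "reboot_device",
    "apply_self_test_image",
    "run_self_test",
    "ready",
]


def recover(device):
    """Return the steps a watcher would take for a device that stopped answering pings."""
    if device["locked_out"]:
        # Locked-out devices are not watched, so nothing is done to them.
        return []
    return list(RECOVERY_STEPS)


# A talos reboot on a device that is NOT locked out looks like a failure to the watcher.
steps_unlocked = recover({"name": "panda-0522", "locked_out": False})
steps_locked = recover({"name": "panda-0522", "locked_out": True})
```

With the device locked out, the watcher in this sketch takes no action, which is the behaviour talos jobs rely on.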

This combination of facts yields the problem: talos jobs can (and do) burn because mozpool thinks it needs to recover the devices. However, we also can't *just* lock out pandas for talos, because the same pool of devices runs both talos and unittests.

This bug is about splitting said pool of devices in two, so that we can lock out the ones used for talos and leave the rest managed by mozpool.
(Assignee)

Comment 1

5 years ago
Created attachment 776544 [details] [diff] [review]
[puppet] add new slave class

This is pretty straightforward.
Attachment #776544 - Flags: review?
(Assignee)

Comment 2

5 years ago
Created attachment 776546 [details] [diff] [review]
[buildbotcustom] support new slave class

This adds the new slave class to buildbotcustom where needed. Most cases already do things like startswith('panda') or 'panda_android' in ...; this is the one place that needs it added explicitly.
Attachment #776546 - Flags: review?(aki)
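
A minimal sketch of why most call sites need no change, assuming the new name is the 'panda_android-nomozpool' string that appears in the configs patch. The `is_panda` helper and the `EXPLICIT` set are hypothetical and are not taken from buildbotcustom.

```python
# Hypothetical check; only the two platform names come from this bug.
def is_panda(slave_platform):
    return slave_platform.startswith("panda")


existing = "panda_android"
new = "panda_android-nomozpool"

# Prefix and substring checks already cover the new name.
prefix_matches = [is_panda(existing), is_panda(new)]
substring_matches = ["panda_android" in existing, "panda_android" in new]

# An exact-match lookup is the kind of place that needs the new name added by hand.
EXPLICIT = {"panda_android"}
needs_explicit_entry = new not in EXPLICIT
```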
(Assignee)

Comment 3

5 years ago
Created attachment 776549 [details] [diff] [review]
[configs] Bulk of the work

Things to note:

* The new slave needed adding to slave_platform to satisfy talos.
* Doing so meant providing empty unittest lists, otherwise that barfed.
* The .remove() in the patch is applied ONLY against talos definitions.
* Kept the "slave description" the same to satisfy buildbot (retriggers will work) and the graph server.
* Trychooser will, I think, work with no changes (f? to sfink).
Attachment #776549 - Flags: review?(aki)
Attachment #776549 - Flags: feedback?(sphink)
(Assignee)

Comment 4

5 years ago
Created attachment 776555 [details] [diff] [review]
[dump_master] diff of clean vs patched dump_master

To help with review, here is a diff of a clean vs patched dump_master.

To make the changes easier to see, I did a case-insensitive buildername sort and made the slave name from the configs patch all-caps, so that it is obvious which builders are affected.

This also shows we're adding new schedulers for the nomozpool unittest things on all branches with pandas, but those schedulers have an empty builderlist array.
Attachment #776555 - Flags: feedback?(aki)
(Assignee)

Updated

5 years ago
Attachment #776544 - Flags: review? → review?(aki)
Comment on attachment 776549 [details] [diff] [review]
[configs] Bulk of the work

Review of attachment 776549 [details] [diff] [review]:
-----------------------------------------------------------------

Yeah, this should be ok for trychooser. It cares some about the keys before and after, but you're splitting out a new value for the one in the middle. :-)
Attachment #776549 - Flags: feedback?(sphink) → feedback+
Attachment #776544 - Flags: review?(aki) → review+
Comment on attachment 776555 [details] [diff] [review]
[dump_master] diff of clean vs patched dump_master

This eyeballs ok, with the caps lock change reversed.
Attachment #776555 - Flags: feedback?(aki) → feedback+
Attachment #776546 - Flags: review?(aki) → review+
Comment on attachment 776549 [details] [diff] [review]
[configs] Bulk of the work

> ANDROID = PLATFORMS['android']['slave_platforms']
>+ANDROID_NOT_MOZPOOL = deepcopy(ANDROID)
>+if 'panda_android-nomozpool' in PLATFORMS['android']['slave_platforms']:
>+    ANDROID_NOT_MOZPOOL.remove('panda_android')

We don't use ANDROID anywhere.
This should be fine, though, especially since we want to get rid of the split pool later.
Attachment #776549 - Flags: review?(aki) → review+
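
For reference, a small standalone sketch of what the quoted lines do. The `PLATFORMS` contents here are a made-up stand-in, not the real buildbot-configs structure; only the two panda names and the deepcopy/remove logic come from the patch.

```python
from copy import deepcopy

# Hypothetical stand-in for the real PLATFORMS structure in buildbot-configs.
PLATFORMS = {
    "android": {
        "slave_platforms": [
            "tegra_android",
            "panda_android",
            "panda_android-nomozpool",
        ],
    }
}

ANDROID = PLATFORMS["android"]["slave_platforms"]
# deepcopy gives an independent list, so the remove() below leaves ANDROID untouched.
ANDROID_NOT_MOZPOOL = deepcopy(ANDROID)
if "panda_android-nomozpool" in PLATFORMS["android"]["slave_platforms"]:
    ANDROID_NOT_MOZPOOL.remove("panda_android")
```

The guard means the mozpool-managed name is only dropped when the nomozpool name exists, so in this sketch a platform list without the split keeps its single panda entry.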
(Assignee)

Comment 8

5 years ago
https://hg.mozilla.org/build/puppet/rev/66bdeff85368
https://hg.mozilla.org/build/buildbotcustom/rev/e78dcd4a5e39
https://hg.mozilla.org/build/buildbot-configs/rev/0c6bf8328c94

Sometime after this goes live we need to image pandas 522 to 728 with Android and lock them out from mozpool.

This will ensure that mozpool is not using them, and will be the actual solution to this bug; the patches landed are all to make buildbot behave nicely when we do.
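
As a rough sketch of the split by device number: only the 522 to 728 range and the panda-NNNN naming come from this bug; the helper names and pool labels are hypothetical.

```python
# Hypothetical helpers; only the range and the panda-NNNN naming come from this bug.
def panda_name(n):
    return "panda-%04d" % n


NOMOZPOOL_RANGE = range(522, 729)  # 522 through 728 inclusive


def pool_for(n):
    """Label a panda by which pool it would belong to after the split."""
    return "nomozpool" if n in NOMOZPOOL_RANGE else "mozpool"


# Devices to image with Android and lock out from mozpool.
locked_out = [panda_name(n) for n in NOMOZPOOL_RANGE]
```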
This is in production.
After getting the thumbs up from callek via irc, I've gone ahead and locked out panda-{0522-0728} within mozpool, although 4 of them refused to take the android image. I'll mark these 4 as 'bad pandas' in mozpool and add them to the bad panda log.

panda-0720:
2013-07-18T10:08:57 syslog mkfs.ext4 failed:
2013-07-18T10:11:51 statemachine entering state failed_android_downloading

panda-0696:
2013-07-18T10:11:46 statemachine entering state failed_android_downloading

panda-0674:
2013-07-18T10:11:56 statemachine entering state failed_android_downloading

panda-0664:
2013-07-18T10:14:56 statemachine entering state failed_android_downloading
(Assignee)

Updated

5 years ago
Status: NEW → RESOLVED
Last Resolved: 5 years ago
Resolution: --- → FIXED
Product: mozilla.org → Release Engineering
Component: General Automation → General
Product: Release Engineering → Release Engineering