many panda masters do not have pandas attached

Status

RESOLVED FIXED
Product: Infrastructure & Operations
Component: CIDuty
Opened: 5 years ago
Last updated: 2 months ago

People

(Reporter: kmoir, Assigned: Callek)


(Reporter)

Description

5 years ago
bm29 - last job Oct 25, no pandas connected
bm42 - huge number of retries, last job was Oct 25
bm43 - huge number of retries, no buildslaves connected
bm44 - pandas attached, running jobs
bm45 - one panda attached, huge number of retries around Oct 25 20:00 PST

I have reimaged the pandas attached to bm29 since many of them are in self-test mode.  I wonder if this is related to the change I made in bug 889967

https://hg.mozilla.org/build/buildbot-configs/rev/bbcec3c6f785

but this would not explain why the pandas on 44 are still up.
(Reporter)

Updated

5 years ago
Assignee: nobody → kmoir
(Reporter)

Comment 1

5 years ago
I also reimaged the ones connected to bm42 since they were all down and most were in self-test mode
(Assignee)

Comment 2

5 years ago
I think this is mostly related to the following "known problem" conditions:

* We reboot pandas via relay.py rather than mozpool.
** When this happens, mozpool gets an error using the same relay to check other pandas (since only one reboot can happen on a relay at once), which forces those other pandas into self-test mode.

* Pandas that get sent to self-test for *any* reason will not come back up into Android unless we "request" (with mozpool) the Android image.
** This is a chicken-and-egg problem: verify.py runs ahead of buildbot and checks for all the expected Android state, but we don't officially request the panda until mozharness runs as part of buildbot. So we could be running verify against a self-test image and never request the proper image, which leaves the device "disabled".

* All pandas are locked to specific masters.
** This prevents slavealloc auto-load-balancing from taking place, so a large number of pandas can easily pile up on one master without spreading the load out.
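The second bullet's chicken-and-egg condition can be sketched in code. This is a hypothetical illustration, not the real verify.py or mozpool API: the device names, the `current_image` callback, and the `request_android_image` callback are all assumptions standing in for whatever mozpool interaction would actually be used. The point is the control flow: on seeing a self-test image, request the Android image instead of immediately treating the device as disabled.

```python
# Hypothetical sketch of the verify-time fix for the chicken-and-egg
# problem described above. The callback names and the mozpool
# interaction they stand for are assumptions, not the real API.

def verify_panda(device, current_image, request_android_image):
    """Return True if the device is (or can be made) usable.

    current_image(device) -> str: reports which image the device
    booted (e.g. "android" or "self-test").
    request_android_image(device) -> bool: asks mozpool to reimage
    the device with Android; returns True on success.
    """
    image = current_image(device)
    if image == "android":
        return True  # device is already good to use
    if image == "self-test":
        # Instead of marking the device disabled, ask mozpool to put
        # the Android image back before giving up on it.
        return request_android_image(device)
    return False  # unknown image: leave the device for a human


# Minimal usage example with stub callbacks standing in for mozpool.
images = {"panda-0001": "self-test", "panda-0002": "android"}

def fake_current_image(device):
    return images[device]

def fake_request(device):
    images[device] = "android"  # pretend the reimage succeeded
    return True

ok = verify_panda("panda-0001", fake_current_image, fake_request)
print(ok, images["panda-0001"])  # prints: True android
```

With this shape, a panda knocked into self-test by a relay conflict would be recovered on the next verify pass rather than sitting "disabled" until someone notices.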
(Reporter)

Comment 3

5 years ago
Yes, I understand the reboot issues, which I'm working on. It just seems unusual that suddenly the pandas on most of the masters continually fall down when last week they were fine.
(Assignee)

Updated

5 years ago
Depends on: 889967
(Assignee)

Updated

5 years ago
Depends on: 936615
Blocks: 936827
(Assignee)

Updated

5 years ago
Duplicate of this bug: 936827
No longer blocks: 936827
Comment 5

5 years ago
Callek, Jake, and I met briefly to talk about this.

What Callek said in comment #2 is the proximal cause here: post-request, devices sometimes get self-tested when they really don't need it, and mozpool conservatively assumes that a self-tested device no longer has a usable image on it. Then, before the next request, verify.py sees that there's no Android image on the device and treats the device as disabled instead of requesting that mozpool put the Android image on it.

This became an issue last week when Callek fixed an unrelated bug that had been keeping each panda's Buildbot instance running permanently, never releasing a request. While that bug existed, the post-request and pre-request problems above never had a chance to occur; once it was fixed, requests began being released and the problems surfaced.

As Callek pointed out in bug 936827, the nagios alerts in mtv1 on Sunday were unrelated.
(Assignee)

Comment 6

5 years ago
(In reply to Dustin J. Mitchell [:dustin] (I read my bugmail; don't needinfo me) from comment #5)
> As Callek pointed out in bug 936827, the nagios alerts in mtv1 on Sunday
> were unrelated.

Bug 937322 tracks investigation on said mtv outage.

Comment 7

5 years ago
Callek: what remains to do here? Should I re-assign this to you or pmoore?
Assignee: kmoir → nobody
(Assignee)

Comment 8

5 years ago
Since Pete is driving some b2g stuff, I'll own this; if he gets free before I circle back, we can swap, though.
Assignee: nobody → bugspam.Callek
(Assignee)

Comment 9

5 years ago
Effectively fixed by Bug 936615 - no need for this specific tracker
Status: NEW → RESOLVED
Last Resolved: 5 years ago
Resolution: --- → FIXED

Updated

2 months ago
Product: Release Engineering → Infrastructure & Operations