29 - last job Oct 25, no pandas connected
42 - huge number of retries, last job was Oct 25
43 - huge number of retries, no buildslaves connected
44 - pandas attached, running jobs
45 - one panda attached, huge number of retries around Oct 25 20:00 PST

I have reimaged the pandas attached to bm29 since many of them are in self-test mode. I wonder if this is related to the change I made in bug 889967 (https://hg.mozilla.org/build/buildbot-configs/rev/bbcec3c6f785), but that would not explain why the pandas on 44 are still up.
I also reimaged the ones connected to bm42, since they were all down and most were in self-test mode.
I think this is mostly related to the following "known problem" conditions:

* We reboot pandas via relay.py rather than mozpool.
** Only one reboot can happen on a relay at once, so when this happens while mozpool is trying to check other pandas on the same relay, mozpool gets an error and forces those pandas into self-test mode.
* Pandas that get sent to self-test for *any* reason will not come back up into Android unless we "request" (with mozpool) the Android image.
** This is a chicken-and-egg problem: verify.py runs ahead of buildbot and checks all the expected Android states, but we don't officially request the panda until mozharness runs as part of buildbot. So we could be running verify against a self-test image and never request the proper image, which leaves the device "disabled".
* All pandas are locked to specific masters.
** This prevents the slavealloc auto-load-balancing from taking place, so a whole bunch of pandas can easily end up on one master without spreading their load out.
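To illustrate the relay contention in the first bullet, here is a minimal sketch of serializing reboots per relay board so that a second caller waits instead of getting an error. Everything in it is hypothetical: `RelayBoard`, `power_cycle`, and the device names are made up for illustration and are not the actual relay.py or mozpool interfaces.

```python
import threading


class RelayBoard:
    """Hypothetical stand-in for one relay board shared by several pandas."""

    def __init__(self, name):
        self.name = name
        self._lock = threading.Lock()  # only one reboot per board at a time
        self.log = []                  # records which pandas were power-cycled

    def power_cycle(self, panda, timeout=5.0):
        """Power-cycle one panda, waiting up to `timeout` seconds for the board."""
        # Wait for the board instead of failing immediately when it is busy.
        if not self._lock.acquire(timeout=timeout):
            # Board still busy: report it as contention, not as a device failure.
            return False
        try:
            self.log.append(panda)  # placeholder for the real relay commands
            return True
        finally:
            self._lock.release()
```

With something along these lines, a caller that gets `False` back knows the board was busy and can retry later, rather than concluding the panda is broken and sending it to self-test.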
Yes, I understand the reboot issues, which I'm working on. It just seems unusual that all of the pandas on most of the masters suddenly keep falling down when they were fine last week.
Callek, Jake, and I met briefly to talk about this. What Callek said in comment #2 is the proximal cause here.

Post-request: devices sometimes get self-tested when they really don't need it, and mozpool conservatively assumes that a self-tested device has no usable image on it anymore.

Pre-request (before the next request): verify.py sees that there's no Android image on the device and treats the device as disabled, instead of requesting that mozpool put the Android image on there.

This became an issue last week when Callek fixed an unrelated bug that was keeping each Panda's Buildbot instance running permanently and therefore never releasing its request. Until that fix, requests were never released, so these post-request and pre-request problems never occurred.

As Callek pointed out in bug 936827, the nagios alerts in mtv1 on Sunday were unrelated.
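The pre-request problem can be sketched as a small decision function: a device without an Android image gets an image request rather than being disabled. This is only an illustration of that idea under stated assumptions. The `Image` and `Action` enums, the `reachable` flag, and `verify_action` are hypothetical names, not the real verify.py or mozpool code, and this is not necessarily how the issue was eventually fixed.

```python
from enum import Enum


class Image(Enum):
    ANDROID = "android"
    SELF_TEST = "self-test"
    NONE = "none"


class Action(Enum):
    START_BUILDBOT = "start buildbot"
    REQUEST_ANDROID = "request android image"
    DISABLE = "disable device"


def verify_action(image, reachable):
    """Decide what a pre-flight check should do with one panda (hypothetical)."""
    if not reachable:
        # Assumption: a device we cannot talk to at all is genuinely down.
        return Action.DISABLE
    if image is Image.ANDROID:
        return Action.START_BUILDBOT
    # Self-tested or blank device: ask for the image instead of disabling it.
    return Action.REQUEST_ANDROID
```

For example, `verify_action(Image.SELF_TEST, reachable=True)` returns `Action.REQUEST_ANDROID`, whereas the behavior described above would have left that device disabled.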
(In reply to Dustin J. Mitchell [:dustin] (I read my bugmail; don't needinfo me) from comment #5)
> As Callek pointed out in bug 936827, the nagios alerts in mtv1 on Sunday
> were unrelated.

Bug 937322 tracks investigation on said mtv outage.
Callek: what remains to do here? Should I re-assign this to you or pmoore?
Assignee: kmoir → nobody
Since Pete is driving some b2g stuff, I'll own this. If he gets free before I circle back, we can swap out, though.
Assignee: nobody → bugspam.Callek
Effectively fixed by Bug 936615 - no need for this specific tracker
Status: NEW → RESOLVED
Last Resolved: 5 years ago
Resolution: --- → FIXED