Closed Bug 932231 Opened 11 years ago Closed 10 years ago

many panda masters do not have pandas attached

Categories

(Infrastructure & Operations Graveyard :: CIDuty, task)

Platform: x86
OS: macOS
Type: task
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: kmoir, Assigned: Callek)

References

Details

bm29 - last job Oct 25, no pandas connected
bm42 - huge number of retries, last job was Oct 25
bm43 - huge number of retries, no buildslaves connected
bm44 - pandas attached, running jobs
bm45 - one panda attached, huge number of retries around Oct 25 20:00 PST

I have reimaged the pandas attached to bm29, since many of them are in self-test mode.  I wonder if this is related to the change I made in bug 889967:

https://hg.mozilla.org/build/buildbot-configs/rev/bbcec3c6f785

but that would not explain why the pandas on bm44 are still up.
Assignee: nobody → kmoir
I also reimaged the ones connected to bm42 since they were all down and most were in self-test mode
I think this is mostly related to the following "known problem" conditions:

* We reboot pandas via relay.py rather than via mozpool.
** Only one reboot can happen on a relay at a time, so when a relay.py reboot happens while mozpool is trying to check other pandas on the same relay, mozpool gets an error and forces those pandas into self-test mode. (A sketch of routing reboots through mozpool instead appears below.)

* Pandas that get sent to self-test for *any* reason will not come back up into Android unless we "request" the Android image (via mozpool).
** This is a chicken-and-egg problem: verify.py runs ahead of buildbot and checks for the expected Android state, but we don't officially request the panda until mozharness runs as part of the buildbot job. So we can end up running verify against a self-test image and never requesting the proper image, which leaves the device "disabled". (See the image-request sketch after this list.)

* All pandas are locked to specific masters.
** This prevents slavealloc's automatic load balancing from taking place, so a large number of pandas can easily end up on one master without their load being spread out.
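
To make the verify.py chicken-and-egg case above concrete, here is a minimal sketch of a check-then-request step that asks mozpool for the Android image instead of treating a self-tested panda as disabled. The mozpool base URL, endpoint paths, request payload, and image name are assumptions for illustration, not the documented mozpool API or the real verify.py code.

# Hypothetical sketch only: the mozpool endpoints and payload below are
# assumed, not the documented API, and this is not the real verify.py.
import sys
import requests

MOZPOOL = "http://mozpool.example.mozilla.org"  # assumed base URL


def ensure_android(device):
    """Return True once the device has (or has requested) the Android image."""
    # Assumed state endpoint: reports which image the device currently runs.
    state = requests.get("%s/api/device/%s/state/" % (MOZPOOL, device)).json()
    if state.get("image") == "android":
        return True  # already running Android; verification can proceed

    # Device came back from self-test without Android: request the image
    # through mozpool instead of marking the slave disabled.
    resp = requests.post(
        "%s/api/device/%s/request/" % (MOZPOOL, device),
        json={"image": "android", "assignee": "verify.py", "duration": 3600},
    )
    return resp.status_code == 200


if __name__ == "__main__":
    panda = sys.argv[1]  # e.g. "panda-0123" (hypothetical device name)
    sys.exit(0 if ensure_android(panda) else 1)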
Yes, I understand the reboot issues, which I'm working on.  It just seems unusual that the pandas on most of the masters suddenly started falling down continually when they were fine last week.
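
For the reboot side of this, a rough sketch of what rebooting through mozpool (rather than driving the relay board directly with relay.py) could look like; the base URL and power-cycle endpoint are assumptions, not the documented mozpool API.

# Hypothetical sketch only: the power-cycle endpoint is assumed, not the
# documented mozpool API. The point is that mozpool, not relay.py, talks to
# the relay board, so it can serialize relay operations and is never
# surprised by a reboot it didn't initiate.
import requests

MOZPOOL = "http://mozpool.example.mozilla.org"  # assumed base URL


def reboot_via_mozpool(device):
    """Ask mozpool to power-cycle the device instead of using relay.py."""
    resp = requests.post("%s/api/device/%s/power-cycle/" % (MOZPOOL, device))
    resp.raise_for_status()


if __name__ == "__main__":
    reboot_via_mozpool("panda-0123")  # hypothetical device name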
Depends on: 889967
Depends on: 936615
Blocks: 936827
No longer blocks: 936827
Callek, Jake, and I met briefly to talk about this.

What Callek said in comment #2 is the proximal cause here: post-request, devices sometimes get self-tested when they don't really need it, and mozpool conservatively assumes that a self-tested device no longer has a usable image on it.  Before the next request, verify.py sees that there's no Android image on the device and treats the device as disabled instead of asking mozpool to put the Android image on it.

This became an issue last week when Callek fixed an unrelated bug that had kept each panda's Buildbot instance running permanently, never releasing its request.  While that bug was in place, these post-request and pre-request problems simply never occurred.

As Callek pointed out in bug 936827, the nagios alerts in mtv1 on Sunday were unrelated.
(In reply to Dustin J. Mitchell [:dustin] (I read my bugmail; don't needinfo me) from comment #5)
> As Callek pointed out in bug 936827, the nagios alerts in mtv1 on Sunday
> were unrelated.

Bug 937322 tracks investigation on said mtv outage.
Callek: what remains to do here? Should I re-assign this to you or pmoore?
Assignee: kmoir → nobody
Since Pete is driving some B2G stuff, I'll own this; if he gets free before I circle back, we can swap, though.
Assignee: nobody → bugspam.Callek
Effectively fixed by Bug 936615 - no need for this specific tracker
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Product: Release Engineering → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard