Closed
Bug 831727
Opened 13 years ago
Closed 10 years ago
Mozpool should identify devices that have failed the majority of recent jobs and remove from the pool
Categories
(Testing Graveyard :: Mozpool, defect)
Tracking
(Not tracked)
RESOLVED
WONTFIX
People
(Reporter: emorley, Unassigned)
Details
(Keywords: sheriffing-P1)
For example:
https://secure.pub.build.mozilla.org/buildapi/recent/panda-0856
...has failed 18 out of the last 20 jobs.
Mozpool should pull it from the pool, since at the moment I have to manually inspect the recent jobs page and file bugs to eventually get them pulled out of production (but not until they've already burnt many jobs). Automating this would make me very happy :-)
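The check requested here can be sketched as a sliding window over each device's recent job results. This is only an illustration: `RecentJobTracker`, the window of 20 jobs, and the 50% threshold are assumptions taken from the example above, not anything Mozpool or buildapi provides.

```python
from collections import deque


class RecentJobTracker:
    """Hypothetical helper: remembers the outcome of the last `window` jobs
    run on one device and reports whether most of them failed."""

    def __init__(self, window=20, max_failure_ratio=0.5):
        self.window = window
        self.max_failure_ratio = max_failure_ratio
        # True means the job passed; older results fall off automatically.
        self.results = deque(maxlen=window)

    def record(self, passed):
        self.results.append(bool(passed))

    def failures(self):
        return sum(1 for passed in self.results if not passed)

    def should_pull(self):
        # Only judge a device once a full window of jobs has been seen,
        # so one early failure does not remove a freshly added device.
        if len(self.results) < self.window:
            return False
        return self.failures() / len(self.results) > self.max_failure_ratio


# The panda-0856 case from this bug: 18 failures in the last 20 jobs.
tracker = RecentJobTracker()
for passed in [False] * 18 + [True] * 2:
    tracker.record(passed)
print(tracker.failures(), tracker.should_pull())
```

Requiring a full window before judging is a design choice that trades slower detection for fewer false positives on new devices.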
Comment 1•13 years ago
We [releng] are actually working on automation and dashboards around "did this device/hardware fail a lot recently", but we have no ETA to provide yet.
That said, as long as we continue to have Panda Android "Locked Out" of being managed by mozpool, it won't be able to help the situation that troubles you at the moment anyway.
Comment 2•13 years ago
2 schools of thought here:
1) releng should fix this as it is a big nightmare and will only get worse as we have more 'flaky' devices
2) if releng doesn't fix this we will fix it ourselves with tools out of band which will make releng depend on it and everybody will think it is clunky.
One thought I have is that we could use OrangeFactor to query for bad boards (assuming it gets green status). Then every hour a cron job could look for failing boards and send out an email or file a bug.
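A rough sketch of what the hourly job could compute, assuming the job results have already been fetched as `(device, passed)` pairs. How to get them out of OrangeFactor is not shown, and `failing_board_report`, `min_jobs`, and `threshold` are hypothetical names and defaults.

```python
from collections import Counter


def failing_board_report(job_records, min_jobs=10, threshold=0.5):
    """Build the body of a notification listing boards that failed most
    of their recent jobs. `job_records` is an iterable of
    (device_name, passed) pairs; fetching them is out of scope here."""
    total = Counter()
    failed = Counter()
    for device, passed in job_records:
        total[device] += 1
        if not passed:
            failed[device] += 1

    lines = []
    for device in sorted(total):
        # Skip boards with too few jobs to say anything meaningful.
        if total[device] < min_jobs:
            continue
        if failed[device] / total[device] > threshold:
            lines.append("%s: %d of %d recent jobs failed"
                         % (device, failed[device], total[device]))
    return "\n".join(lines)


# panda-0856 is the board from this bug; panda-0001 is made up.
records = ([("panda-0856", False)] * 18 + [("panda-0856", True)] * 2
           + [("panda-0001", True)] * 20)
report = failing_board_report(records)
print(report)
```

An empty report means there is nothing to send, so the cron job could simply exit without emailing anyone or filing a bug.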
Another area where we could fix this is clientproxy. If we have no guarantee of replacing clientproxy in the next 2 months, then we need to work under the assumption that we will be using it for the foreseeable future.
Comment 3 (Reporter)•13 years ago
(In reply to Justin Wood (:Callek) from comment #1)
> That said, as long as we continue to have Panda Android "Locked Out" of
> being managed by mozpool
Why are they locked out?
Surely one of the primary reasons for mozpool was so that it could do things like this bug automatically?
Comment 4•13 years ago
Mozpool isn't being used to its potential by either B2G or Android. In particular, devices used for Android are marked `locked_out` in Mozpool, which basically prohibits *any* management by Mozpool. But even for B2G, Mozpool's ability to dynamically provision isn't being used by releng - the releng automation always requests a specific panda. Furthermore, Mozpool doesn't have any visibility into the status of jobs run on a particular device, unless we modify the releng automation to tell it that.
If we can figure out *what* about the device is problematic, and reproduce that in a reasonably efficient self-test procedure, then releng automation could trigger a self-test on suspected board-failure conditions, and Mozpool could use the results to pull the board from production, flagged for appropriate remediation. Historically, we haven't had good luck tracking down exactly what makes a device problematic, but we also haven't taken a very formal approach to that problem-solving, so it's hard to say how likely that is.
Alternately, the releng automation Callek described in comment 1 can just mark a board bad when it detects that. In that case, we'll want to avoid the failure mode where a series of bum builds wind up marking all of our devices as bad and starving the pool. This approach also has the downside of producing no useful information for us. Handing a device to DC Ops with "uhh, it's acting funky" isn't productive.
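One way to guard against the pool-starvation failure mode is to refuse to pull anything when too large a share of the pool looks bad at once, since that pattern points at bad builds rather than bad boards. The sketch below is illustrative only: `select_devices_to_pull`, the 50% failure threshold, and the 20% cap are assumptions, not existing releng or Mozpool behavior.

```python
def select_devices_to_pull(failure_ratios, threshold=0.5, max_pool_fraction=0.2):
    """Return (devices_to_pull, builds_suspected).

    `failure_ratios` maps a device name to the fraction of its recent
    jobs that failed. All names and defaults here are hypothetical.
    """
    # Devices over the threshold, worst offenders first.
    flagged = sorted(
        (name for name, ratio in failure_ratios.items() if ratio > threshold),
        key=lambda name: failure_ratios[name],
        reverse=True,
    )
    limit = int(len(failure_ratios) * max_pool_fraction)
    if len(flagged) > limit:
        # Too many devices look bad at the same time. That is more likely
        # a run of broken builds than a wave of hardware failures, so pull
        # nothing and leave the decision to a human.
        return [], True
    return flagged, False


# Ten devices, one of them clearly bad: it gets pulled.
pool = {"panda-%04d" % i: 0.05 for i in range(10)}
pool["panda-0003"] = 0.9
print(select_devices_to_pull(pool))

# Every device failing: nothing is pulled, builds are suspected instead.
print(select_devices_to_pull({name: 0.9 for name in pool}))
```

This only prevents starvation. It does not address the other concern raised above, that marking a board bad this way yields no diagnostic information for whoever has to fix it.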
So the bottom line is that we need to do the hard work of *systematically* determining what is going wrong with these devices, how to detect it, and how to fix it. At that point, Mozpool can help.
Comment 5•13 years ago
Oh, and I assume we're talking about Android judging by the folks on this list. Has someone tried re-imaging these devices with Android? Based on bugs I've seen, I'm guessing not.
If the problem is as simple as Firefox corrupting the Android image, then at least the remediation would be pretty easy (and easy to automate).
Comment 6•10 years ago
Mozpool has been decommissioned along with the pandas.
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → WONTFIX
Updated•10 years ago
Product: Testing → Testing Graveyard