Closed Bug 897546 Opened 12 years ago Closed 12 years ago

Investigation into Panda job retries due to mozpool

Categories

(Infrastructure & Operations Graveyard :: CIDuty, task)

ARM
Android
task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: emorley, Assigned: kmoir)

References

Details

Filing this in response to callek's email.

{
So I'm writing this in hopes you guys are willing to help me gather data in order to fix a panda-on-mozpool bug I escalated merely by fixing a different regression.

First the background:
* We were frequently getting bustages on android-panda robocop2 because we only requested a device for a 30 minute duration, but robocop2 ran for longer than that. After the 30 minutes were up our automation could not reliably predict what mozpool was doing, which in some cases involved reboots...
* Due to that I increased our request duration to 4 hours, which matches the max runtime of these jobs in mozharness.
* When a job finishes, it is meant to inform mozharness that we are done with the job, thus releasing the device from being "held" by a given job and freeing it for the next job we'd want to use that panda for.

The Problem:
* With this 4 hour request duration, we've gotten a lot of jobs from the pandas with the message |ERROR - INFRA-ERROR: Request did not become ready in time|
* This sets RETRY properly
* Each RETRY of that elapses somewhere between 15 and 25 minutes (the one I am looking at now was 22mins)

Assumption:
* This current state is not worse than previous state (since it is the actual "correct" state, even if I did expose a more-visible issue)

The Data Request:
So these devices have a previous job that fails to actually tell mozpool to reclaim it. The previous job may be failing to do so for any number of reasons, I'd like to know what those are so I can fix them. It's possible that with only a small set of data points I'd fix all the [known] issues here, it's also possible I'd fix one and find out that there are 10 other distinct issues to fix. I anticipate each issue to be relatively easy to fix.

What I'd like is to be informed (likely in a bug + needinfo, or direct assign-to-me) when you see the |ERROR - INFRA-ERROR: Request did not become ready in time| along with your identification of the log for the *last* job it ran, where it successfully requested a device, before that error. (Note: if it started to run verify.py, it did successfully request a device.)

Reason I even ask is that this should be relatively easy to grab when you are watching the trees, much harder for me to do after the fact (since I'd need to go in and manually use SQL to go back far enough in recent job history unless I get lucky).
}
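For anyone gathering the data requested above, the lookup can be sketched roughly as follows. This is only an illustration under stated assumptions: the `Job` record, its fields, and the idea of having each job's log text in memory are hypothetical and not part of buildbot, mozharness, or mozpool. The only facts taken from the email are the error string and the hint that a log mentioning verify.py means the device request succeeded.

```python
from dataclasses import dataclass
from typing import List, Optional

# Error text quoted in the email above.
INFRA_ERROR = "INFRA-ERROR: Request did not become ready in time"


@dataclass
class Job:
    """Hypothetical record for one job run on a panda (not a real buildbot type)."""
    job_id: int
    log: str


def last_job_holding_device(jobs: List[Job]) -> Optional[Job]:
    """Return the last job that got a device before the first INFRA-ERROR.

    `jobs` must be ordered oldest first. A job counts as having requested a
    device successfully if its log mentions verify.py, per the email's note.
    Returns None if no INFRA-ERROR job is found, or if no earlier job
    obtained a device.
    """
    candidate = None
    for job in jobs:
        if INFRA_ERROR in job.log:
            return candidate
        if "verify.py" in job.log:
            candidate = job
    return None
```

The job returned is the one whose log is worth attaching to the bug, since it is the most likely to have failed to release the device.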
(In reply to Ed Morley [:edmorley UTC+1] from comment #1)
> First one:
>
> https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slave.html?name=panda-0870
> has had three "INFRA-ERROR: Request did not become ready in time" in a row,
> after a green run:
> http://buildbot-master45.build.scl1.mozilla.com:8201/builders/Android%204.0%20Panda%20try%20opt%20test%20mochitest-3/builds/659

This case, it did release the mozpool lock, but it went into self-test mode on us:

2013-07-24T07:55:37 sut connecting to SUT agent
2013-07-24T08:05:47 sut connecting to SUT agent
2013-07-24T08:06:08 statemachine device failed SUT verification
2013-07-24T08:06:08 statemachine entering state start_self_test

The next job didn't even start until 10 minutes later.
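The timing in that log excerpt can be checked with a small script. This is a minimal sketch that assumes every line has the shape "timestamp component message", which is inferred only from the four lines quoted in this comment and may not hold for other mozpool log output. The function names are made up for illustration.

```python
from datetime import datetime, timedelta
from typing import Optional, Tuple

# The four log lines quoted in the comment above.
LOG = """\
2013-07-24T07:55:37 sut connecting to SUT agent
2013-07-24T08:05:47 sut connecting to SUT agent
2013-07-24T08:06:08 statemachine device failed SUT verification
2013-07-24T08:06:08 statemachine entering state start_self_test
"""


def parse_line(line: str) -> Tuple[datetime, str, str]:
    """Split one line into (timestamp, component, message); assumed layout."""
    stamp, component, message = line.split(" ", 2)
    return datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%S"), component, message


def time_until_self_test(log_text: str) -> Optional[timedelta]:
    """Time from the first log entry until the device enters start_self_test.

    Returns None if the log is empty or the state is never entered.
    """
    entries = [parse_line(l) for l in log_text.splitlines() if l.strip()]
    if not entries:
        return None
    first_seen = entries[0][0]
    for when, component, message in entries:
        if component == "statemachine" and message == "entering state start_self_test":
            return when - first_seen
    return None


if __name__ == "__main__":
    print(time_until_self_test(LOG))
```

For the excerpt above this measures the gap between the first SUT connection attempt and the switch to self-test mode, which is time the device spent unavailable even though the lock had been released.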
See bug 897549 comment 0 for a device that has retried the last 100 jobs, with at least the last few (didn't check beyond that) being "INFRA-ERROR: Request did not become ready in time".
Blocks: 829211
Also now panda-0820 in bug 897566 comment 0.
And the panda in bug 897946.
Bug 897947 (panda-0729).
And yet another! Bug 897948 (panda-0788)
Bug 897950 (panda-0737)
Bug 897952 (panda-0869)
Bug 897953 (panda-0763)
Depends on: 898227
Assignee: nobody → kmoir
Product: mozilla.org → Release Engineering
Haven't seen signs of this in a long while.
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Component: Platform Support → Buildduty
Product: Release Engineering → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard