Closed
Bug 897546
Opened 12 years ago
Closed 12 years ago
Investigation into Panda job retries due to mozpool
Categories
(Infrastructure & Operations Graveyard :: CIDuty, task)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: emorley, Assigned: kmoir)
References
Details
Filing this in response to callek's email.
{
So I'm writing this in the hope that you guys are willing to help me gather data to fix a panda-on-mozpool bug that I exposed merely by fixing a different regression.
First the background:
* We were frequently getting bustages on android-panda robocop2 because we only requested a device for a 30-minute duration, but robocop2 ran for longer than that. Once the 30 minutes were up, our automation could not reliably predict what mozpool was doing with the device, which in some cases involved reboots...
* Because of that, I increased our request duration to 4 hours, which matches the maximum runtime of these jobs in mozharness.
* When a job finishes, mozharness is meant to inform mozpool that we are done with the device, releasing it from being "held" by that job and freeing it for the next job we'd want to run on that panda.
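The lease lifecycle above can be sketched as a toy model (the class, method names, and numbers here are mine for illustration, not mozpool's actual API): a device is leased for a fixed duration, and if the previous job never releases it, the next request cannot become ready until the stale lease expires — which is exactly the failure mode described below.

```python
import time


class DevicePool:
    """Toy model of mozpool-style device leasing (hypothetical, for illustration).

    A job requests a device for a fixed duration; a well-behaved job releases
    the device when it finishes. If it doesn't, the next request has to wait
    out the stale lease, and may give up first.
    """

    def __init__(self, clock=time.monotonic):
        self.clock = clock
        self.lease_expiry = None  # None means the device is free

    def request(self, duration_s, patience_s):
        """Try to acquire the device, giving up after patience_s seconds."""
        deadline = self.clock() + patience_s
        while self.clock() < deadline:
            if self.lease_expiry is None or self.clock() >= self.lease_expiry:
                self.lease_expiry = self.clock() + duration_s
                return True
        # This is the toy analogue of the buildbot symptom:
        # "INFRA-ERROR: Request did not become ready in time"
        return False

    def release(self):
        """What a well-behaved job does when it finishes with the device."""
        self.lease_expiry = None
```

With a short request duration the stale lease expires quickly and the next job limps through; with a 4-hour duration, a missed `release()` turns into a long string of not-ready retries on that device.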
The Problem:
* With this 4-hour request duration, we've seen a lot of jobs on the pandas fail with the message |ERROR - INFRA-ERROR: Request did not become ready in time|
* This correctly sets the job result to RETRY
* Each such RETRY takes somewhere between 15 and 25 minutes to elapse (the one I am looking at now took 22 minutes)
Assumption:
* The current state is no worse than the previous one (it is the actual "correct" state, even if I did expose a more-visible issue)
The Data Request:
So these devices have a previous job that failed to actually tell mozpool to reclaim the device. The previous job may be failing to do so for any number of reasons; I'd like to know what those reasons are so I can fix them.
It's possible that with only a small set of data points I'd fix all the [known] issues here; it's also possible I'd fix one and find out that there are 10 other distinct issues to fix. I expect each individual issue to be relatively easy to fix.
What I'd like is to be informed (likely via a bug + needinfo, or a direct assign-to-me) when you see |ERROR - INFRA-ERROR: Request did not become ready in time|, along with a pointer to the log of the *last* job that device ran where it successfully requested a device, before that error. (Note: if the job started to run verify.py, it did successfully request a device.)
The reason I even ask is that this should be relatively easy to grab while you are watching the trees, and much harder for me to do after the fact (since I'd need to go in and manually use SQL to dig back far enough in recent job history, unless I get lucky).
}
Comment 1•12 years ago
First one:
https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slave.html?name=panda-0870 has had three "INFRA-ERROR: Request did not become ready in time" in a row, after a green run:
http://buildbot-master45.build.scl1.mozilla.com:8201/builders/Android%204.0%20Panda%20try%20opt%20test%20mochitest-3/builds/659
Comment 2•12 years ago
Same slave (panda-0870):
The INFRA-ERROR run (http://buildbot-master45.build.scl1.mozilla.com:8201/builders/Android%204.0%20Panda%20try%20opt%20test%20mochitest-3/builds/653) was preceded by this run:
http://buildbot-master45.build.scl1.mozilla.com:8201/builders/Android%204.0%20Panda%20mozilla-inbound%20opt%20test%20mochitest-8/builds/1015
Comment 3•12 years ago
(In reply to Ed Morley [:edmorley UTC+1] from comment #1)
> First one:
> https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slave.html?name=panda-0870 has had three "INFRA-ERROR: Request did not become ready in time" in a row, after a green run:
> http://buildbot-master45.build.scl1.mozilla.com:8201/builders/Android%204.0%20Panda%20try%20opt%20test%20mochitest-3/builds/659
In this case, it did release the mozpool lock, but the device went into self-test mode on us:
2013-07-24T07:55:37 sut connecting to SUT agent
2013-07-24T08:05:47 sut connecting to SUT agent
2013-07-24T08:06:08 statemachine device failed SUT verification
2013-07-24T08:06:08 statemachine entering state start_self_test
The next job didn't even start until 10 minutes later.
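The ten-minute gap quoted above can be checked directly against the log timestamps. A small helper for doing so (my own, not part of mozharness or mozpool):

```python
from datetime import datetime


def gap_minutes(earlier, later, fmt="%Y-%m-%dT%H:%M:%S"):
    """Minutes elapsed between two mozpool-style ISO log timestamps."""
    delta = datetime.strptime(later, fmt) - datetime.strptime(earlier, fmt)
    return delta.total_seconds() / 60


# Gap between the two SUT connection attempts in the log above:
print(round(gap_minutes("2013-07-24T07:55:37", "2013-07-24T08:05:47"), 1))
```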
Comment 4•12 years ago
See bug 897549 comment 0 for a device that has retried the last 100 jobs, with at least the last few (didn't check beyond that) being "INFRA-ERROR: Request did not become ready in time".
Comment 5•12 years ago
Also now panda-0820 in bug 897566 comment 0.
Comment 6•12 years ago
And the panda in bug 897946.
Comment 7•12 years ago
Bug 897947 (panda-0729).
Comment 8•12 years ago
And yet another! Bug 897948 (panda-0788)
Comment 9•12 years ago
Bug 897950 (panda-0737)
Comment 10•12 years ago
Bug 897952 (panda-0869)
Bug 897953 (panda-0763)
Comment 11•12 years ago
https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slave.html?name=panda-0772
"INFRA-ERROR: Request did not become ready in time":
http://buildbot-master44.build.scl1.mozilla.com:8201/builders/Android%204.0%20Panda%20mozilla-inbound%20opt%20test%20jsreftest-1/builds/3596
Run prior:
http://buildbot-master44.build.scl1.mozilla.com:8201/builders/Android%204.0%20Panda%20try%20opt%20test%20mochitest-7/builds/2383/steps/run_script/logs/stdio
Updated•12 years ago
Assignee: nobody → kmoir
Updated•12 years ago
Product: mozilla.org → Release Engineering
Comment 12•12 years ago
Haven't seen signs of this in a long while.
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Updated•7 years ago
Component: Platform Support → Buildduty
Product: Release Engineering → Infrastructure & Operations
Updated•6 years ago
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard