897546 - Investigation into Panda job retries due to mozpool

Reporter

Description

•

12 years ago

Filing this in response to callek's email.

{
So I'm writing this in hopes you guys are willing to help me gather data in order to fix a panda-on-mozpool bug I escalated merely by fixing a different regression.

First, the background:

* We were frequently getting bustages on android-panda robocop2 because we only requested a device for a 30 minute duration, but robocop2 ran for longer than that. After the 30 minutes were up our automation could not reliably predict what mozpool was doing, which in some cases involved reboots...
* Due to that I increased our request duration to 4 hours, which matches the max runtime of these jobs in mozharness.
* When a job finishes, it is meant to inform mozharness that we are done with the job, thus releasing the device from being "held" by a given job and freeing it for the next job we'd want to use that panda for.

The Problem:

* With this 4 hour request duration, we've gotten a lot of jobs from the pandas with the message |ERROR - INFRA-ERROR: Request did not become ready in time|
* This sets RETRY properly.
* Each RETRY of that elapses somewhere between 15 and 25 minutes (the one I am looking at now was 22 mins).

Assumption:

* This current state is not worse than the previous state (since it is the actual "correct" state, even if I did expose a more-visible issue).

The Data Request:

So these devices have a previous job that fails to actually tell mozpool to reclaim it. The previous job may be failing to do so for any number of reasons, and I'd like to know what those are so I can fix them. It's possible that with only a small set of data points I'd fix all the [known] issues here; it's also possible I'd fix one and find out that there are 10 other distinct issues to fix. I anticipate each issue to be relatively easy to fix.

What I'd like is to be informed (likely in a bug + needinfo, or direct assign-to-me) when you see the |ERROR - INFRA-ERROR: Request did not become ready in time|, along with your identification of the log for the *last* job it ran, where it successfully requested a device, before that error. (Note: if it started to run verify.py, it did successfully request a device.)

The reason I even ask is that this should be relatively easy to grab when you are watching the trees, and much harder for me to do after the fact (since I'd need to go in and manually use SQL to go back far enough in recent job history unless I get lucky).
}
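The data request above amounts to a small search over a device's job history: for each run of INFRA-ERROR jobs, find the most recent earlier job that got as far as verify.py. The following is only a minimal sketch of that search, not anything taken from mozpool, mozharness, or buildbot. The `Job` record, its fields, and the function names are hypothetical, and it assumes the history is already available in chronological order as plain log text.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Message quoted in the email; everything else here is illustrative.
INFRA_ERROR = "INFRA-ERROR: Request did not become ready in time"


@dataclass
class Job:
    """One job that ran on a panda (hypothetical record, not a real schema)."""
    url: str       # link to the job's log
    log_text: str  # full text of the job's log


def requested_device(job: Job) -> bool:
    # Per the email: a job that started to run verify.py did get a device.
    return "verify.py" in job.log_text


def last_good_job_before_errors(
    history: List[Job],
) -> List[Tuple[Job, Optional[Job]]]:
    """Pair the first job of each INFRA-ERROR streak with the last earlier
    job that successfully requested a device (None if there was none).

    `history` must be in chronological order, oldest first.
    """
    pairs = []
    last_good = None
    in_error_streak = False
    for job in history:
        if INFRA_ERROR in job.log_text:
            if not in_error_streak:
                pairs.append((job, last_good))
                in_error_streak = True
        else:
            in_error_streak = False
            if requested_device(job):
                last_good = job
    return pairs
```

Only the first failure of each streak is reported, since every retry in the same streak points back at the same preceding job.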

Ed Morley [:emorley]

Reporter

Comment 1

•

12 years ago

First one: https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slave.html?name=panda-0870 has had three "INFRA-ERROR: Request did not become ready in time" in a row, after a green run: http://buildbot-master45.build.scl1.mozilla.com:8201/builders/Android%204.0%20Panda%20try%20opt%20test%20mochitest-3/builds/659

Ed Morley [:emorley]

Reporter

Comment 2

•

12 years ago

Same slave (panda-0870): The INFRA-ERROR run (http://buildbot-master45.build.scl1.mozilla.com:8201/builders/Android%204.0%20Panda%20try%20opt%20test%20mochitest-3/builds/653) was preceded by this run: http://buildbot-master45.build.scl1.mozilla.com:8201/builders/Android%204.0%20Panda%20mozilla-inbound%20opt%20test%20mochitest-8/builds/1015

Justin Wood (:Callek)

Comment 3

•

12 years ago

(In reply to Ed Morley [:edmorley UTC+1] from comment #1)
> First one:
>
> https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slave.html?name=panda-0870 has had three "INFRA-ERROR: Request did not become ready in time" in a row, after a green run:
> http://buildbot-master45.build.scl1.mozilla.com:8201/builders/Android%204.0%20Panda%20try%20opt%20test%20mochitest-3/builds/659

In this case, it did release the mozpool lock, but it went into self-test mode on us:

2013-07-24T07:55:37 sut connecting to SUT agent
2013-07-24T08:05:47 sut connecting to SUT agent
2013-07-24T08:06:08 statemachine device failed SUT verification
2013-07-24T08:06:08 statemachine entering state start_self_test

The next job didn't even start until 10 minutes later.
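To make the timing in the log excerpt above easier to check across devices, here is a rough sketch that measures how long a device spent between its first SUT connection attempt and entering self-test. The line layout ("timestamp source message", separated by single spaces) is inferred only from the four lines quoted in this comment, and the function names are hypothetical; this is not a mozpool API.

```python
from datetime import datetime

# The four log lines quoted in this comment.
LOG = """\
2013-07-24T07:55:37 sut connecting to SUT agent
2013-07-24T08:05:47 sut connecting to SUT agent
2013-07-24T08:06:08 statemachine device failed SUT verification
2013-07-24T08:06:08 statemachine entering state start_self_test
"""


def parse_line(line):
    """Split one line into (timestamp, source, message).

    Assumes the 'timestamp source message' layout seen in the excerpt.
    """
    stamp, source, message = line.split(" ", 2)
    return datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%S"), source, message


def seconds_until_self_test(log_text):
    """Seconds from the first 'sut' line to the start_self_test transition,
    or None if either event is missing from the log."""
    entries = [parse_line(l) for l in log_text.splitlines() if l.strip()]
    first_sut = next((t for t, src, _ in entries if src == "sut"), None)
    self_test = next(
        (t for t, _, msg in entries if msg == "entering state start_self_test"),
        None,
    )
    if first_sut is None or self_test is None:
        return None
    return (self_test - first_sut).total_seconds()


print(seconds_until_self_test(LOG))  # 631.0, i.e. 10 min 31 s
```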

Ed Morley [:emorley]

Reporter

Comment 4

•

12 years ago

See bug 897549 comment 0 for a device that has retried the last 100 jobs, with at least the last few (didn't check beyond that) being "INFRA-ERROR: Request did not become ready in time".

Ed Morley [:emorley]

Reporter

Updated

•

12 years ago

Blocks: 829211

Ed Morley [:emorley]

Reporter

Comment 5

•

12 years ago

Also now panda-0820 in bug 897566 comment 0.

Ed Morley [:emorley]

Reporter

Comment 6

•

12 years ago

And the panda in bug 897946.

Ed Morley [:emorley]

Reporter

Comment 7

•

12 years ago

Bug 897947 (panda-0729).

Ed Morley [:emorley]

Reporter

Comment 8

•

12 years ago

And yet another! Bug 897948 (panda-0788)

Ed Morley [:emorley]

Reporter

Comment 9

•

12 years ago

Bug 897950 (panda-0737)

Ed Morley [:emorley]

Reporter

Comment 10

•

12 years ago

Bug 897952 (panda-0869)
Bug 897953 (panda-0763)

Ed Morley [:emorley]

Reporter

Comment 11

•

12 years ago

https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slave.html?name=panda-0772

"INFRA-ERROR: Request did not become ready in time":
http://buildbot-master44.build.scl1.mozilla.com:8201/builders/Android%204.0%20Panda%20mozilla-inbound%20opt%20test%20jsreftest-1/builds/3596

Run prior:
http://buildbot-master44.build.scl1.mozilla.com:8201/builders/Android%204.0%20Panda%20try%20opt%20test%20mochitest-7/builds/2383/steps/run_script/logs/stdio

Justin Wood (:Callek)

Updated

•

12 years ago

Depends on: 898227

Kim Moir [:kmoir] ET

Assignee

Updated

•

12 years ago

Assignee: nobody → kmoir

Nobody; OK to take it and work on it

Updated

•

12 years ago

Product: mozilla.org → Release Engineering

Justin Wood (:Callek)

Comment 12

•

12 years ago

Haven't seen signs of this in a long while.

Status: NEW → RESOLVED

Closed: 12 years ago

Resolution: --- → FIXED

Nobody; OK to take it and work on it

Updated

•

7 years ago

Component: Platform Support → Buildduty

Product: Release Engineering → Infrastructure & Operations

BMO Automation

Updated

•

6 years ago

Product: Infrastructure & Operations → Infrastructure & Operations Graveyard


Categories

(Infrastructure & Operations Graveyard :: CIDuty, task)

Tracking

(Not tracked)

People

(Reporter: emorley, Assigned: kmoir)

