Closed Bug 895466 Opened 8 years ago Closed 7 years ago

Increase mozpool request time

Categories

(Release Engineering :: General, defect)

Platform: x86 Android
Type: defect
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: gbrown, Assigned: Callek)

References

Details

Attachments

(2 files)

There is concern that the mozpool request_device duration at:

http://mxr.mozilla.org/build/source/mozpool/mozpoolclient/mozpoolclient/mozpoolclient.py#182

may be too short, and that may be causing, for instance, the rc2 timeouts tracked in bug 883539. We should consider increasing this to 60 minutes or more.

(:jmaher also notes that there are buildbot timeouts to consider...he thinks they are 60 minutes.)
If I understand it correctly, that is *only* a timeout for requesting a device to become available.
This timeout could only happen at the beginning of the job, before even one line of the tests would be run.

Could you please paste a log to see? Just to make sure that it is as I say or if I got it wrong.
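The "timeout for requesting a device to become available" described above amounts to a poll loop: keep asking mozpool for the request's state until it reaches 'ready' or the request duration runs out. A minimal sketch of that idea, with entirely hypothetical names (`wait_for_ready`, `get_state`) that do not correspond to the real mozpoolclient API:

```python
# Illustrative sketch only -- the function and parameter names here are
# invented for explanation and are NOT the actual mozpoolclient interface.
import time

def wait_for_ready(get_state, timeout=30 * 60, poll_interval=60):
    """Poll a request's state until it reaches 'ready' or the timeout expires.

    get_state: a callable returning the current request state string,
               e.g. 'contact_lifeguard', 'pending', 'ready'.
    """
    deadline = time.time() + timeout
    while time.time() < deadline:
        state = get_state()
        if state == 'ready':
            return True
        print("Waiting for request 'ready' stage.  Current state: %r" % state)
        time.sleep(poll_interval)
    return False  # timed out before the device became available
```

This timeout can only fire at the start of a job, which is why it cannot explain a failure that happens mid-test.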
(In reply to Armen Zambrano G. [:armenzg] (Release Engineering) (EDT/UTC-4) from comment #1)
> If I understand it correctly, that is *only* a timeout for requesting a
> device to become available.

I checked with dustin and jake about this: the duration param to that function is indeed a value telling mozpool how long we 'want' the device for, and has nothing to do with the initial request timeout.

> Could you please paste a log to see? Just to make sure that it is as I say
> or if I got it wrong.

The logs are not as informative (or easy to splice) atm; I need to convince the mozpool people that the way I envision it is worthwhile! :-)
Once the request duration runs out, the request is closed, and whatever user requested the device no longer "owns" the device.  At that point, Mozpool is free to do whatever it would like to the device - reboot, selftest, reimage, mine bitcoins, etc.

You can either request a sufficiently-long duration up front, or renew the request before it expires.
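The second option above (renewing before expiry) can be sketched as a keep-alive loop that re-extends the lease on a timer. This is illustrative only: `renew_request` and the 30-minute duration are assumptions for the example, not the real mozpool API.

```python
# Hypothetical keep-alive sketch -- renew_request is a stand-in callback,
# not an actual mozpoolclient method.
import threading

def keep_alive(renew_request, duration=30 * 60, margin=5 * 60, stop_event=None):
    """Renew a mozpool request every (duration - margin) seconds until stopped.

    Renewing with a safety margin before the old lease lapses keeps the
    requester as the device's owner, so mozpool never reclaims it mid-job.
    """
    stop_event = stop_event or threading.Event()
    # Event.wait returns False on timeout (time to renew) and True once
    # stop_event.set() is called (job finished, stop renewing).
    while not stop_event.wait(duration - margin):
        renew_request(duration)
    return stop_event
```

The alternative actually taken in this bug is the first option: request a long enough duration up front, which avoids needing any renewal machinery in the harness.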
Knowing that, increasing the duration should fix the issues pointed out.
So currently, the no-output timeout on these jobs is set to 40 minutes, while the max time they are allowed to run is 4 hours. (This includes all setup and teardown in mozharness.)

This patch makes the mozpool request last 4 hours. In theory this is fine even for short jobs, since we should always release the request when we're done with the job (even on failed jobs); any place where that doesn't hold true will need to be dealt with, though.
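The pattern the patch relies on, requesting the device for the full 4-hour job window and releasing it unconditionally when the job ends, can be sketched like this. `request_device` and `close_request` here are stand-ins for the real mozpoolclient calls, not their actual signatures:

```python
# Illustrative sketch of "long duration up front + guaranteed release".
# The callables are hypothetical placeholders for mozpoolclient operations.
FOUR_HOURS = 4 * 60 * 60  # matches the job's 4-hour max runtime

def run_job_with_device(request_device, close_request, run_job):
    request = request_device(duration=FOUR_HOURS)
    try:
        return run_job(request)
    finally:
        # Releasing even on failed jobs is what makes the long duration safe:
        # without this, a crashed job would leave the device "owned" for the
        # remainder of the 4 hours.
        close_request(request)
```

The try/finally is the key design choice: it makes the long lease harmless for short jobs, because the request is closed as soon as the job exits by any path.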
Assignee: nobody → bugspam.Callek
Status: NEW → ASSIGNED
Attachment #779337 - Flags: review?(aki)
Attachment #779337 - Flags: review?(aki) → review+
Landed and deployed to production:

https://hg.mozilla.org/build/mozharness/rev/e79596c26aae
Status: ASSIGNED → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Product: mozilla.org → Release Engineering
So, based on findings in Bug 883539 and a chat with :gbrown on IRC it turns out that I didn't actually do this bug right, I merely increased duration for *b2g* and not android.

I'm sad that I made that mistake, and even sadder that it wasted time on a few fronts and wasn't caught until now. I'll spin up a new patch today.
Status: RESOLVED → REOPENED
Flags: needinfo?(bugspam.Callek)
Resolution: FIXED → ---
Attachment #8368663 - Flags: review?(gbrown)
Flags: needinfo?(bugspam.Callek)
Attachment #8368663 - Flags: review?(gbrown) → review+
merged to production
Status: REOPENED → RESOLVED
Closed: 8 years ago → 7 years ago
Resolution: --- → FIXED
Unfortunately, this did not fix the robocop 30 minute retries: https://tbpl.mozilla.org/php/getParsedLog.php?id=33894396&tree=Try&full=1#error0.

Thanks anyway -- it was worth a try.
Actually, it did help!

11:40:17     INFO - Waiting for request 'ready' stage.  Current state: 'contact_lifeguard'
11:41:17     INFO - Waiting for request 'ready' stage.  Current state: 'pending'
11:42:17     INFO - Waiting for request 'ready' stage.  Current state: 'pending'
11:43:17     INFO - Running command: ['/tools/buildbot/bin/python', '/builds/sut_tools/verify.py']
...
12:15:00     INFO -  42 INFO SimpleTest FINISHED
12:15:00     INFO -  INFO | automation.py | Application ran for: 0:01:40.592312
12:15:00     INFO -  INFO | zombiecheck | Reading PID log: /tmp/tmpFIKeeHpidlog
12:15:00     INFO -  /data/anr/traces.txt not found
12:15:01     INFO -  WARNING | leakcheck | refcount logging is off, so leaks can't be detected!
12:15:01     INFO -  runtests.py | Running tests: end.
12:15:09     INFO -  MochitestServer : launching [u'/builds/panda-0270/test/build/hostutils/bin/xpcshell', '-g', '/builds/panda-0270/test/build/hostutils/xre', '-v', '170', '-f', '/builds/panda-0270/test/build/hostutils/bin/components/httpd.js', '-e', "const _PROFILE_PATH = '/tmp/tmpcRRPj1'; const _SERVER_PORT = '30270'; const _SERVER_ADDR = '10.12.130.18'; const _TEST_PREFIX = undefined; const _DISPLAY_RESULTS = false;", '-f', './server.js']
12:15:09     INFO -  runtests.py | Server pid: 16259
12:15:10     INFO -  runtests.py | Running tests: start.
12:16:36     INFO -  Robocop process name: org.mozilla.fennec
12:17:27     INFO -  Traceback (most recent call last):
....
12:18:01     INFO - Request 'http://mobile-imaging-003.p3.releng.scl1.mozilla.com/api/request/860091/' deleted on cleanup



Mozpool agrees with that:
2014-01-31T11:42:40 statemachine sending imaging result 'complete' to Mozpool
2014-01-31T11:42:40 statemachine entering state ready
2014-01-31T12:22:53 sut connecting to SUT agent
2014-01-31T12:33:03 sut connecting to SUT agent
2014-01-31T12:43:04 sut connecting to SUT agent

So that is 33 minutes instead of 30-on-the-dot.

Mozpool never checked on the device until ~4 minutes after our request was cleaned up, after the job was done.

So while _this_ is fixed, it certainly points to some other issue in the test harness/tests.
Component: General Automation → General