Closed Bug 895466 Opened 8 years ago Closed 7 years ago
Increase mozpool request time
There is concern that the mozpool request_device duration at: http://mxr.mozilla.org/build/source/mozpool/mozpoolclient/mozpoolclient/mozpoolclient.py#182 may be too short, and that may be causing, for instance, the rc2 timeouts tracked in bug 883539. We should consider increasing this to 60 minutes or more. (:jmaher also notes that there are buildbot timeouts to consider...he thinks they are 60 minutes.)
If I understand it correctly, that is *only* a timeout for requesting a device to become available. This timeout could only happen at the beginning of the job, before even one line of the tests would be run. Could you please paste a log to see? Just to make sure that it is as I say or if I got it wrong.
(In reply to Armen Zambrano G. [:armenzg] (Release Enginerring) (EDT/UTC-4) from comment #1) > If I understand it correctly, that is *only* a timeout for requesting a > device to become available. I checked with dustin and jake about this, the duration param to that function is indeed a value letting mozpool how long we 'want' the device for. And has nothing to do with the initial request timeout. > Could you please paste a log to see? Just to make sure that it is as I say > or if I got it wrong. The logs are not as informative (or easy to splice), atm, I need to convince the mozpool people that the way I envision it is worthwhile! :-)
Once the request duration runs out, the request is closed, and whatever user requested the device no longer "owns" the device. At that point, Mozpool is free to do whatever it would like to the device - reboot, selftest, reimage, mine bitcoins, etc. You can either request a sufficiently-long duration up front, or renew the request before it expires.
Knowing that, fixing it should fix the issues pointed out.
So currently, the no-output timeout on these jobs is set to 40-minutes, while the max-time they are allowed to run is 4 hours. (This includes all setup and teardown of mozharness) This patch makes the mozpool request last 4 hours, in theory this is fine even for short jobs since we should always be releasing the request when we're done with the job (even on failed jobs) - anyplace that doesn't hold true will need to be dealt with though.
Assignee: nobody → bugspam.Callek
Status: NEW → ASSIGNED
Attachment #779337 - Flags: review?(aki)
Landed and deployed to production: https://hg.mozilla.org/build/mozharness/rev/e79596c26aae
Status: ASSIGNED → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Product: mozilla.org → Release Engineering
So, based on findings in Bug 883539 and a chat with :gbrown on IRC it turns out that I didn't actually do this bug right, I merely increased duration for *b2g* and not android. I'm sad that I made that mistake, and even sadder that it wasted time on a few fronts and wasn't caught until now. I'll spin up a new patch today.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Attachment #8368663 - Flags: review?(gbrown)
Attachment #8368663 - Flags: review?(gbrown) → review+
merged to production
Status: REOPENED → RESOLVED
Closed: 8 years ago → 7 years ago
Resolution: --- → FIXED
Unfortunately, this did not fix the robocop 30 minute retries: https://tbpl.mozilla.org/php/getParsedLog.php?id=33894396&tree=Try&full=1#error0. Thanks anyway -- it was worth a try.
actually it did help! 11:40:17 INFO - Waiting for request 'ready' stage. Current state: 'contact_lifeguard' 11:41:17 INFO - Waiting for request 'ready' stage. Current state: 'pending' 11:42:17 INFO - Waiting for request 'ready' stage. Current state: 'pending' 11:43:17 INFO - Running command: ['/tools/buildbot/bin/python', '/builds/sut_tools/verify.py'] ... 12:15:00 INFO - 42 INFO SimpleTest FINISHED 12:15:00 INFO - INFO | automation.py | Application ran for: 0:01:40.592312 12:15:00 INFO - INFO | zombiecheck | Reading PID log: /tmp/tmpFIKeeHpidlog 12:15:00 INFO - /data/anr/traces.txt not found 12:15:01 INFO - WARNING | leakcheck | refcount logging is off, so leaks can't be detected! 12:15:01 INFO - runtests.py | Running tests: end. 12:15:09 INFO - MochitestServer : launching [u'/builds/panda-0270/test/build/hostutils/bin/xpcshell', '-g', '/builds/panda-0270/test/build/hostutils/xre', '-v', '170', '-f', '/builds/panda-0270/test/build/hostutils/bin/components/httpd.js', '-e', "const _PROFILE_PATH = '/tmp/tmpcRRPj1'; const _SERVER_PORT = '30270'; const _SERVER_ADDR = '10.12.130.18'; const _TEST_PREFIX = undefined; const _DISPLAY_RESULTS = false;", '-f', './server.js'] 12:15:09 INFO - runtests.py | Server pid: 16259 12:15:10 INFO - runtests.py | Running tests: start. 12:16:36 INFO - Robocop process name: org.mozilla.fennec 12:17:27 INFO - Traceback (most recent call last): .... 12:18:01 INFO - Request 'http://mobile-imaging-003.p3.releng.scl1.mozilla.com/api/request/860091/' deleted on cleanup Mozpool agrees with that: 2014-01-31T11:42:40 statemachine sending imaging result 'complete' to Mozpool 2014-01-31T11:42:40 statemachine entering state ready 2014-01-31T12:22:53 sut connecting to SUT agent 2014-01-31T12:33:03 sut connecting to SUT agent 2014-01-31T12:43:04 sut connecting to SUT agent So that is 33 minutes instead of 30-on-the-dot. With mozpool never checking on the device until ~4 minutes after our request was cleaned up, after the job was done. So while _this_ is fixed, it certainly spells of some other issue in the test harness/tests
You need to log in before you can comment on or make changes to this bug.