Closed Bug 1256300 Opened 5 years ago Closed 5 years ago
Intermittent Android "command timed out: 2400 seconds without output running ['/tools/buildbot/bin/python', 'scripts/scripts/android
_emulator _unittest .py'..."
As reported in bug 1051908 for the last 2 weeks. Maybe a hang in tooltool.py?
There is variability in the 2400 second timeout logs collected in this bug, so there may be multiple issues at play here. A few logs (the minority), like https://treeherder.mozilla.org/logviewer.html#?repo=mozilla-central&job_id=3785227#L848, look like: 04:14:56 INFO - Copy/paste: /tools/tooltool.py --url https://api.pub.build.mozilla.org/tooltool/ --authentication-file /builds/relengapi.tok fetch -m /builds/slave/test/build/tooltool.tt -o -c /builds/tooltool_cache 04:14:56 INFO - INFO - File android-sdk_r24.0.2-linux.tgz not present in local cache folder /builds/tooltool_cache 04:14:56 INFO - INFO - Attempting to fetch from 'https://api.pub.build.mozilla.org/tooltool/'... command timed out: 2400 seconds without output running ['/tools/buildbot/bin/python', 'scripts/scripts/android_emulator_unittest.py', '--cfg', 'android/androidarm_4_3.py', '--test-suite', 'mochitest-1', '--blob-upload-branch', 'mozilla-central', '--download-symbols', 'ondemand'], attempting to kill process killed by signal 9 program finished with exit code -1 elapsedTime=2417.380167 I think tooltool might be hung. At any rate, it seems possible that tooltool could hang, or download very, very slowly sometimes. I'd prefer to see these cases fail with a specific mozharness timeout rather than wait for the buildbot timeout. A random sampling of test logs suggest tooltool completes in less than 30 seconds for most of our current downloads, so this 600 second output timeout seems reasonably safe. https://treeherder.mozilla.org/#/jobs?repo=try&revision=610594efd242d3f5534534acd33886bc544ec48a suggests this does no harm. I intend to leave-open for follow-up on other timeout causes.
Attachment #8750480 - Flags: review?(rail)
The other, more common, log variation is: 09:38:17 INFO - Writing buildbot properties ['build_url'] to /builds/slave/test/properties/build_url 09:38:17 INFO - Writing to file /builds/slave/test/properties/build_url 09:38:17 INFO - Contents: 09:38:17 INFO - build_url:https://queue.taskcluster.net/v1/task/b0gIAvQVTUuQuR06YXr4zw/artifacts/public/build/fennec-49.0a1.en-US.android-arm.apk command timed out: 2400 seconds without output running ['/tools/buildbot/bin/python', 'scripts/scripts/android_emulator_unittest.py', '--cfg', 'android/androidarm_4_3.py', '--test-suite', 'mochitest-16', '--blob-upload-branch', 'mozilla-inbound', '--download-symbols', 'ondemand'], attempting to kill process killed by signal 9 program finished with exit code -1 elapsedTime=2449.852416 The final "build_url:" log comes from: _download_and_extract _download_installer set_buildbot_properties dump_buildbot_properties write_to_file Note that this variation occurs for both opt and debug jobs. For debug, the next non-trivial operation is to download symbols; for opt, the next non-trivial operation is downloading host utilities. The first, next, missing log statement is the "proxxy config: ..." reported by get_proxies_for_url.
This is a speculative fix for the log variation shown in Comment 7. A hang in urllib2.urlopen() here would be consistent with those logs for both opt and debug cases. Most (all?) other urllib2.urlopen calls in mozharness use a timeout parameter, typically between 10 and 60 seconds. In this case, additional logging in treeherder logs show urlopen taking approximately 20 seconds in the majority of cases - https://treeherder.mozilla.org/#/jobs?repo=try&revision=9087cd124b23 - so I've opted for the longer 120 second timeout. On timeout, I see urllib2.urlopen throwing an SSLError, which is a socket.error, but there may be some variation across urllib2 implementations, and I note other mozharness code handling socket.timeout also, so I've included that as well.
Attachment #8751366 - Flags: review?(jlund)
Comment on attachment 8751366 [details] [diff] [review] add timeout to urllib2.urlopen call Review of attachment 8751366 [details] [diff] [review]: ----------------------------------------------------------------- looks good to me. just one comment about proposing to fold up exception blocks ::: testing/mozharness/mozharness/mozilla/testing/testbase.py @@ +201,5 @@ > except urllib2.URLError: > self.warning("Can't figure out symbols_url from installer_url: %s!" % > self.installer_url) > + except socket.timeout, e: > + self.warning("Can't figure out symbols_url from installer_url: %s - %s" % so exception message and error_level is the same each time? I wonder if we should fold up all of these into one except block. and also use self.exception that will grab/log traceback for us. e.g. except (urllib2.URLError, socket.error, socket.timeout): self.exception("Can't figure out symbols_url from installer_url: %s!" % self.installer_url, level=WARNING)  https://dxr.mozilla.org/mozilla-central/source/testing/mozharness/mozharness/base/log.py#157
Attachment #8751366 - Flags: review?(jlund) → review+
Great idea - thanks jlund!
Failures continue on mozilla-aurora (48). It looks like this has never been reported on 47; I wonder why?
You need to log in before you can comment on or make changes to this bug.