Closed Bug 1256300 Opened 8 years ago Closed 8 years ago

Intermittent Android "command timed out: 2400 seconds without output running ['/tools/buildbot/bin/python', 'scripts/scripts/android_emulator_unittest.py'..."

Categories

(Testing :: General, defect)

defect
Not set
normal

Tracking

(firefox47 unaffected, firefox48 fixed, firefox49 fixed)

RESOLVED FIXED
Tracking Status
firefox47 --- unaffected
firefox48 --- fixed
firefox49 --- fixed

People

(Reporter: gbrown, Assigned: gbrown)

References

Details

(Keywords: intermittent-failure)

Attachments

(2 files)

As reported in bug 1051908 for the last 2 weeks. Maybe a hang in tooltool.py?
See Also: → 1237706
There is variability in the 2400 second timeout logs collected in this bug, so there may be multiple issues at play here. A few logs (the minority), like https://treeherder.mozilla.org/logviewer.html#?repo=mozilla-central&job_id=3785227#L848, look like:

 04:14:56     INFO - Copy/paste: /tools/tooltool.py --url https://api.pub.build.mozilla.org/tooltool/ --authentication-file /builds/relengapi.tok fetch -m /builds/slave/test/build/tooltool.tt -o -c /builds/tooltool_cache
 04:14:56     INFO -  INFO - File android-sdk_r24.0.2-linux.tgz not present in local cache folder /builds/tooltool_cache
 04:14:56     INFO -  INFO - Attempting to fetch from 'https://api.pub.build.mozilla.org/tooltool/'...
 
 command timed out: 2400 seconds without output running ['/tools/buildbot/bin/python', 'scripts/scripts/android_emulator_unittest.py', '--cfg', 'android/androidarm_4_3.py', '--test-suite', 'mochitest-1', '--blob-upload-branch', 'mozilla-central', '--download-symbols', 'ondemand'], attempting to kill
 process killed by signal 9
 program finished with exit code -1
 elapsedTime=2417.380167

I think tooltool might be hung. At any rate, it seems possible that tooltool could hang, or download very, very slowly sometimes. I'd prefer to see these cases fail with a specific mozharness timeout rather than wait for the buildbot timeout.

A random sampling of test logs suggest tooltool completes in less than 30 seconds for most of our current downloads, so this 600 second output timeout seems reasonably safe.

https://treeherder.mozilla.org/#/jobs?repo=try&revision=610594efd242d3f5534534acd33886bc544ec48a suggests this does no harm.


I intend to leave-open for follow-up on other timeout causes.
Attachment #8750480 - Flags: review?(rail)
Attachment #8750480 - Flags: review?(rail) → review+
Keywords: leave-open
The other, more common, log variation is:

 09:38:17     INFO - Writing buildbot properties ['build_url'] to /builds/slave/test/properties/build_url
 09:38:17     INFO - Writing to file /builds/slave/test/properties/build_url
 09:38:17     INFO - Contents:
 09:38:17     INFO -  build_url:https://queue.taskcluster.net/v1/task/b0gIAvQVTUuQuR06YXr4zw/artifacts/public/build/fennec-49.0a1.en-US.android-arm.apk
 
 command timed out: 2400 seconds without output running ['/tools/buildbot/bin/python', 'scripts/scripts/android_emulator_unittest.py', '--cfg', 'android/androidarm_4_3.py', '--test-suite', 'mochitest-16', '--blob-upload-branch', 'mozilla-inbound', '--download-symbols', 'ondemand'], attempting to kill
 process killed by signal 9
 program finished with exit code -1
 elapsedTime=2449.852416

The final "build_url:" log comes from:

  _download_and_extract
    _download_installer
      set_buildbot_properties
        dump_buildbot_properties
          write_to_file

Note that this variation occurs for both opt and debug jobs. For debug, the next non-trivial operation is to download symbols; for opt, the next non-trivial operation is downloading host utilities.

The first, next, missing log statement is the "proxxy config: ..." reported by get_proxies_for_url.
This is a speculative fix for the log variation shown in Comment 7. A hang in urllib2.urlopen() here would be consistent with those logs for both opt and debug cases.

Most (all?) other urllib2.urlopen calls in mozharness use a timeout parameter, typically between 10 and 60 seconds. In this case, additional logging in treeherder logs show urlopen taking approximately 20 seconds in the majority of cases - https://treeherder.mozilla.org/#/jobs?repo=try&revision=9087cd124b23 - so I've opted for the longer 120 second timeout.

On timeout, I see urllib2.urlopen throwing an SSLError, which is a socket.error, but there may be some variation across urllib2 implementations, and I note other mozharness code handling socket.timeout also, so I've included that as well.
Attachment #8751366 - Flags: review?(jlund)
Comment on attachment 8751366 [details] [diff] [review]
add timeout to urllib2.urlopen call

Review of attachment 8751366 [details] [diff] [review]:
-----------------------------------------------------------------

looks good to me. just one comment about proposing to fold up exception blocks

::: testing/mozharness/mozharness/mozilla/testing/testbase.py
@@ +201,5 @@
>              except urllib2.URLError:
>                  self.warning("Can't figure out symbols_url from installer_url: %s!" %
>                               self.installer_url)
> +            except socket.timeout, e:
> +                self.warning("Can't figure out symbols_url from installer_url: %s - %s" %

so exception message and error_level is the same each time? I wonder if we should fold up all of these into one except block. and also use self.exception[1] that will grab/log traceback for us.

e.g.

except (urllib2.URLError, socket.error, socket.timeout):
    self.exception("Can't figure out symbols_url from installer_url: %s!" % self.installer_url, level=WARNING)


[1] https://dxr.mozilla.org/mozilla-central/source/testing/mozharness/mozharness/base/log.py#157
Attachment #8751366 - Flags: review?(jlund) → review+
Great idea - thanks jlund!
Failures continue on mozilla-aurora (48). It looks like this has never been reported on 47; I wonder why?
Status: NEW → RESOLVED
Closed: 8 years ago
Keywords: leave-open
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.