Closed Bug 822039 Opened 12 years ago Closed 10 years ago

Intermittent "command timed out: 60 seconds without output, attempting to kill" during "talos_from_code.py" step

Categories

(Release Engineering :: General, defect, P3)

All
macOS
defect

Tracking

(Not tracked)

RESOLVED WORKSFORME

People

(Reporter: RyanVM, Unassigned)

Details

(Keywords: intermittent-failure, Whiteboard: [talos])

Attachments

(1 file)

https://tbpl.mozilla.org/php/getParsedLog.php?id=17976870&tree=Mozilla-Inbound

Rev4 MacOSX Lion 10.7 mozilla-inbound talos svgr on 2012-12-15 09:37:13 PST for push 82077de3f9bc
slave: talos-r4-lion-038

========= Started '/tools/buildbot/bin/python talos_from_code.py ...' failed (results: 2, elapsed: 20 mins, 0 secs) (at 2012-12-15 09:38:10.129904) =========
/tools/buildbot/bin/python talos_from_code.py --talos-json-url http://hg.mozilla.org/integration/mozilla-inbound/raw-file/82077de3f9bc/testing/talos/talos.json
 in dir /Users/cltbld/talos-slave/test/../talos-data (timeout 1200 secs)
 watching logfiles {}
 argv: ['/tools/buildbot/bin/python', 'talos_from_code.py', '--talos-json-url', 'http://hg.mozilla.org/integration/mozilla-inbound/raw-file/82077de3f9bc/testing/talos/talos.json']
 environment:
  Apple_PubSub_Socket_Render=/tmp/launch-MbJcsm/Render
  CVS_RSH=ssh
  DISPLAY=/tmp/launch-a4eJ9r/org.x:0
  HOME=/Users/cltbld
  LOGNAME=cltbld
  PATH=/usr/bin:/bin:/usr/sbin:/sbin:/usr/local/bin:/usr/X11/bin
  PWD=/Users/cltbld/talos-slave/talos-data
  PYTHONPATH=/Library/Python/2.5/site-packages
  SHELL=/bin/bash
  SSH_AUTH_SOCK=/tmp/launch-UGEBS7/Listeners
  TMPDIR=/var/folders/qd/srwd5f710sj0fcl9z464lkj00000gn/T/
  USER=cltbld
  VERSIONER_PYTHON_PREFER_32_BIT=no
  VERSIONER_PYTHON_VERSION=2.7
  __CF_USER_TEXT_ENCODING=0x1F5:0:0
 using PTY: False

command timed out: 1200 seconds without output, attempting to kill
process killed by signal 9
program finished with exit code -1
elapsedTime=1200.006306
========= Finished '/tools/buildbot/bin/python talos_from_code.py ...' failed (results: 2, elapsed: 20 mins, 0 secs) (at 2012-12-15 09:58:10.165531) =========
Component: General → Talos
This is probably Releng: Automation.  It is not Talos core
Interesting that it's 10.6/10.7 only - is /tools/buildbot/bin/python a version with a less hang-prone urllib2.urlopen on 10.8?
Component: Talos → Release Engineering: Automation (General)
Product: Testing → mozilla.org
QA Contact: catlee
Version: Trunk → other
Priority: -- → P3
Whiteboard: [talos]
I would not be surprised if different versions of Python were causing the discrepancies.
We can fine-tune it in the factory.py code [1].

I don't think we have 2.7 across the slaves.
talos-r4-snow-042:talos-data cltbld$ /tools/buildbot/bin/python --version
Python 2.6.1
talos-mtnlion-r5-008:~ cltbld$ /tools/buildbot/bin/python --version
Python 2.7.2
talos-r4-snow-013:~ cltbld$ /Users/cltbld/bin//python --version
Python 2.5.4
talos-r4-snow-013:~ cltbld$ /tools/buildbot/bin/python --version
Python 2.6.1

Could Python be crashing? Can this be determined?
Maybe adding a timeout to f = urllib2.urlopen(req) could help? (even though I read there is a default value for it) [2]

[1] http://hg.mozilla.org/build/buildbotcustom/file/default/process/factory.py#l5541
[2] http://hg.mozilla.org/mozilla-central/file/default/testing/talos/talos_from_code.py#l94

http://hg.mozilla.org/build/buildbotcustom/file/default/process/factory.py#l5874
    self.addStep(ShellCommand(
        name='download files specified in talos.json',
        command=[self.pythonWithJson(
            self.OS), 'talos_from_code.py',
            '--talos-json-url',
            WithProperties(
                '%(repo_path)s/raw-file/%(revision)s/testing/talos/talos.json')],
        workdir=self.workdirBase,
        haltOnFailure=True,
        log_eval_func=lambda c, s: regex_log_evaluator(
            c, s, talos_hgweb_errors),
    ))
If python is crashing, the step should finish at that point.

urllib2 does support a timeout: http://docs.python.org/2/library/urllib2.html#urllib2.urlopen
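
For what it's worth, here is a minimal sketch of what a timeout plus retry around urlopen could look like. It is not the code in talos_from_code.py: the name fetch_with_timeout, its defaults, and the injectable opener argument are made up for illustration. As far as I can tell from the docs, the timeout argument exists from Python 2.6 on, and when it is omitted urlopen falls back to the global default socket timeout, which means no timeout at all unless something has called socket.setdefaulttimeout().

```python
import socket

try:
    # Python 2.6/2.7, as found on the slaves in this bug
    from urllib2 import urlopen, URLError
except ImportError:
    # Python 3 equivalent, so the sketch runs on a modern interpreter too
    from urllib.request import urlopen
    from urllib.error import URLError


def fetch_with_timeout(url, timeout=30, attempts=3, opener=urlopen):
    """Hypothetical helper: fetch url, giving up on each try after `timeout` seconds.

    `opener` defaults to urlopen; it is injectable only so the retry
    logic can be exercised without touching the network.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for _ in range(attempts):
        try:
            response = opener(url, timeout=timeout)
            try:
                return response.read()
            finally:
                response.close()
        except (URLError, socket.timeout) as exc:
            # Connection problems and socket timeouts are retried
            last_error = exc
    # Every attempt failed: surface the last error to the caller
    raise last_error
```

With an explicit timeout the script fails on its own in a bounded time, instead of sitting silent until buildbot kills it.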

how long does this step normally take? 20 minutes seems like a really long time
Typically, between 0.5 and 0.6 seconds.
ok, so let's adjust the timeout here to 60 seconds and make sure we retry.
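
To make the "N seconds without output" behaviour concrete, here is a rough, self-contained sketch of an idle-output watchdog. It only illustrates the idea and is not buildbot's implementation; run_with_idle_timeout and its return shape are hypothetical.

```python
import subprocess
import sys
import threading
import time


def run_with_idle_timeout(argv, idle_timeout):
    """Run argv and kill it if it prints nothing for idle_timeout seconds.

    Returns (returncode, timed_out, output_lines).
    """
    proc = subprocess.Popen(argv, stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT)
    last_output = [time.time()]  # a list so the reader thread can update it
    lines = []

    def pump():
        # Each line of output resets the idle clock
        for line in iter(proc.stdout.readline, b""):
            last_output[0] = time.time()
            lines.append(line)

    reader = threading.Thread(target=pump)
    reader.daemon = True
    reader.start()

    timed_out = False
    while proc.poll() is None:
        if time.time() - last_output[0] > idle_timeout:
            timed_out = True
            proc.kill()  # SIGKILL on POSIX, like "process killed by signal 9"
            break
        time.sleep(0.05)

    proc.wait()
    reader.join()
    proc.stdout.close()
    return proc.returncode, timed_out, lines


if __name__ == "__main__":
    # A child that stays silent for 30 seconds is killed after ~0.5 seconds
    silent = [sys.executable, "-c", "import time; time.sleep(30)"]
    print(run_with_idle_timeout(silent, idle_timeout=0.5)[:2])
```

With a normal runtime of about half a second, a 60 second idle limit still leaves plenty of headroom while failing 20 times sooner than the old 1200 second limit.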
Comment on attachment 710173 [details] [diff] [review]
decrease timeout and add retry

lgtm
Attachment #710173 - Flags: review?(rail) → review+
Attachment #710173 - Flags: checked-in+
Well, you got your 60 second timeout (https://tbpl.mozilla.org/php/getParsedLog.php?id=19558405&tree=Mozilla-Aurora), but I don't think you're going to get a retry that way. I'm pretty sure (in hindsight, of course) that "command timed out: 60 seconds without output, attempting to kill" is, like "process killed by signal 9", one of those things that you don't get in the log, so your only hope of RETRY is one of those "rc_eval_func({0: SUCCESS, None: RETRY})" instead.
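
A toy version of that return-code mapping, in case it helps. This is a sketch of the idea only, not buildbotcustom's rc_eval_func: make_rc_evaluator and FakeCommand are hypothetical stand-ins, and SUCCESS, FAILURE and RETRY are plain strings here rather than buildbot's status constants. The None key acts as the default for any exit code that is not listed explicitly.

```python
# Stand-in status values (assumption: plain strings, not buildbot's constants)
SUCCESS, FAILURE, RETRY = "SUCCESS", "FAILURE", "RETRY"


class FakeCommand(object):
    """Hypothetical stand-in for a finished command; only carries the exit code."""

    def __init__(self, rc):
        self.rc = rc


def make_rc_evaluator(exit_statuses):
    """Build an evaluator that maps a command's exit code to a step status."""

    def evaluate(cmd):
        if cmd.rc in exit_statuses:
            return exit_statuses[cmd.rc]
        if None in exit_statuses:
            # None is the catch-all for unlisted exit codes
            return exit_statuses[None]
        # No mapping and no default: the usual 0 is good, anything else is bad
        return SUCCESS if cmd.rc == 0 else FAILURE

    return evaluate


evaluate = make_rc_evaluator({0: SUCCESS, None: RETRY})
print(evaluate(FakeCommand(0)))   # SUCCESS
print(evaluate(FakeCommand(-1)))  # RETRY (the killed step exits with -1)
```

Because the decision keys off the exit code rather than the log text, a step killed by the timeout (exit code -1 in the log above) would still map to RETRY.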
Summary: Intermittent "command timed out: 1200 seconds without output, attempting to kill" during "talos_from_code.py" step → Intermittent "command timed out: 60 seconds without output, attempting to kill" during "talos_from_code.py" step
yeah, I was wondering about that...
Product: mozilla.org → Release Engineering
Note these failures have stopped, rather than now being automatically retried (see comment 40).
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → WORKSFORME
Component: General Automation → General