Closed Bug 1024573 Opened 10 years ago Closed 10 years ago

Gaia jobs should wrap node git clones with a retry ("Automation Error: mozprocess timed out after 300 seconds running ['make', 'node_modules', 'NODE_MODULES_GIT_URL=https://git.mozilla.org/b2g/gaia-node-modules.git']")

Categories

(Testing Graveyard :: JSMarionette, defect)

ARM
Gonk (Firefox OS)
defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: emorley, Assigned: jgriffin)

References

(Depends on 1 open bug, Blocks 1 open bug)

Details

(Keywords: intermittent-failure)

Attachments

(1 file, 1 obsolete file)

The following git clone should be wrapped with releng's retry script, so that we have several attempts at cloning and most importantly output a parsable error message if it still doesn't succeed after that. b2g_ubuntu64_vm mozilla-inbound opt test gaia-linter on 2014-06-12 09:14:03 PDT for push edf5e2dc9198 https://tbpl.mozilla.org/php/getParsedLog.php?id=41608345&tree=Mozilla-Inbound { 09:23:29 INFO - ##### 09:23:29 INFO - ##### Running run-tests step. 09:23:29 INFO - ##### 09:23:29 INFO - Running pre-action listener: _resource_record_pre_action 09:23:29 INFO - Running main action method: run_tests 09:23:29 INFO - copying tree: /builds/slave/test/build/application/b2g to /builds/slave/test/gaia/b2g 09:23:29 INFO - rmtree: /builds/slave/test/gaia/b2g 09:23:30 INFO - Running command: ['npm', 'cache', 'clean'] 09:23:30 INFO - Copy/paste: npm cache clean 09:23:35 INFO - Return code: 0 09:23:35 INFO - retry: Calling <bound method GaiaIntegrationTest.run_command of <__main__.GaiaIntegrationTest object at 0x1ea7590>> with args: [['make', 'node_modules', 'NODE_MODULES_GIT_URL=https://git.mozilla.org/b2g/gaia-node-modules.git']], kwargs: {'error_list': [{'substr': 'command not found', 'level': 'error'}, {'substr': 'npm ERR! Error:', 'level': 'error'}], 'cwd': '/builds/slave/test/gaia', 'output_timeout': 300}, attempt #1 09:23:35 INFO - Running command: ['make', 'node_modules', 'NODE_MODULES_GIT_URL=https://git.mozilla.org/b2g/gaia-node-modules.git'] in /builds/slave/test/gaia 09:23:35 INFO - Copy/paste: make node_modules NODE_MODULES_GIT_URL=https://git.mozilla.org/b2g/gaia-node-modules.git 09:23:35 INFO - Calling ['make', 'node_modules', 'NODE_MODULES_GIT_URL=https://git.mozilla.org/b2g/gaia-node-modules.git'] with output_timeout 300 09:23:36 INFO - if [ ! -d "git-gaia-node-modules" ] ; then \ 09:23:36 INFO - git clone "https://git.mozilla.org/b2g/gaia-node-modules.git" "git-gaia-node-modules" ; \ 09:23:36 INFO - fi 09:23:36 INFO - Cloning into 'git-gaia-node-modules'... 09:28:36 INFO - Automation Error: mozprocess timed out after 300 seconds running ['make', 'node_modules', 'NODE_MODULES_GIT_URL=https://git.mozilla.org/b2g/gaia-node-modules.git'] 09:28:36 ERROR - timed out after 300 seconds of no output }
Though I guess this isn't specific to gaia-linter in fact, but any of the gaia jobs that require using node/npm.
Component: General → Gaia::Build
Summary: Gaia-linter should wrap git clones with a retry ("Automation Error: mozprocess timed out after 300 seconds running ['make', 'node_modules', 'NODE_MODULES_GIT_URL=https://git.mozilla.org/b2g/gaia-node-modules.git']") → Gaia jobs should wrap node git clones with a retry ("Automation Error: mozprocess timed out after 300 seconds running ['make', 'node_modules', 'NODE_MODULES_GIT_URL=https://git.mozilla.org/b2g/gaia-node-modules.git']")
b2g_ubuntu64_vm b2g-inbound opt test gaia-build on 2014-06-12 09:18:13 PDT for push cb841932d877 https://tbpl.mozilla.org/php/getParsedLog.php?id=41609484&tree=B2g-Inbound
b2g_ubuntu64_vm b2g-inbound opt test gaia-linter on 2014-06-12 09:32:35 PDT for push 405b97a3955c https://tbpl.mozilla.org/php/getParsedLog.php?id=41609879&tree=B2g-Inbound
Depends on: 1033472
Kevin, can you please take a look at some of these recent failures? I want to say this is infra-related, but it only seems to be hitting Gij. So either every other harness is handling things more robustly or there's something going on with the JSMarionette harness.
Flags: needinfo?(kgrandon)
Seems infra-related to me as well, but I'm not sure what's going on. It seems that we always get to the same line, then it just sits there. From the logs, it looks like we get to here before the timeout: 14:11:49 INFO - > websocket@1.0.14 install /builds/slave/test/gaia/node_modules/marionette-js-runner/node_modules/marionette-js-logger/node_modules/websocket 14:11:49 INFO - > (node-gyp rebuild 2> builderror.log) || (exit 0) The second line, (node-gyp rebuild 2> builderr.log), is also contained within a successful run, so I wonder if something is trying to get logged to builderr.log? I don't know our harness setup too well, but I suppose I would recommend trying to debug and seeing what gets logged to builderr.log. Adding a needinfo? on Gareth and Dale - who have been active in the marionette-js-logger repo[1] recently. [1] https://github.com/mozilla-b2g/marionette-js-logger/commits/master
Component: Gaia::Build → JSMarionette
Flags: needinfo?(kgrandon)
Flags: needinfo?(gaye)
Flags: needinfo?(dale)
Product: Firefox OS → Testing
My $0.02 here - I believe the issue is with the way the repository is being pulled, which prevents a retry from happening. (Based on experience with vcs-sync, we do see occasional network hangs talking to either {hg,git}.m.o -- I'll file a bug based on that, and the data I have. That experience says a "real" retry should succeed.) The mozharness code does have a retry implemented, as requested: http://hg.mozilla.org/build/mozharness/annotate/f05a3b9237d8/mozharness/mozilla/gaia.py#l309 However, it is NOT a retry of the git operation, it is a retry of the make command. Looking at that make target shows that it only cares if the directory exists and is newer than gaia_node_modules.revision: http://mxr.mozilla.org/gaia/source/Makefile#719 My _guess_ is the timed out run ends up leaving an incomplete directory in place That matches with the "instant successful retry" attempt that shows in the logs. adding a ni to :jgriffin - he might know of a way to remove the directory before the make command is retried.
Flags: needinfo?(jgriffin)
Depends on: vcshangs
Bug 1113944 is to find the root cause, which affects other systems. Continued TBPL Robot logging will be helpful, as some of the network devices do not retain logs for very long.
(In reply to Hal Wine [:hwine] (Back Dec 29) (use needinfo) from comment #78) > My $0.02 here - I believe the issue is with the way the repository is being > pulled, which prevents a retry from happening. (Based on experience with > vcs-sync, we do see occasional network hangs talking to either {hg,git}.m.o > -- I'll file a bug based on that, and the data I have. That experience says > a "real" retry should succeed.) > > The mozharness code does have a retry implemented, as requested: > http://hg.mozilla.org/build/mozharness/annotate/f05a3b9237d8/mozharness/ > mozilla/gaia.py#l309 > > However, it is NOT a retry of the git operation, it is a retry of the make > command. Looking at that make target shows that it only cares if the > directory exists and is newer than gaia_node_modules.revision: > http://mxr.mozilla.org/gaia/source/Makefile#719 > > My _guess_ is the timed out run ends up leaving an incomplete directory in > place That matches with the "instant successful retry" attempt that shows > in the logs. > > adding a ni to :jgriffin - he might know of a way to remove the directory > before the make command is retried. I agree with your diagnosis. There's no good way to remove the directory from within mozharness without refactoring the retry method to add something like an on_fail hook. This is a bit hacky, and it seems like this should happen within the Gaia build system instead. cc'ing a few more Gaia folks to see if they can do this.
Flags: needinfo?(kgrandon)
Flags: needinfo?(jlal)
Flags: needinfo?(jgriffin)
I'll leave this up to James, but it seems like taskcluster and automatic retries might be the way to go here.
Flags: needinfo?(kgrandon)
(In reply to Jonathan Griffin (:jgriffin) from comment #80) > (In reply to Hal Wine [:hwine] (Back Dec 29) (use needinfo) from comment #78) > > adding a ni to :jgriffin - he might know of a way to remove the directory > > before the make command is retried. > > I agree with your diagnosis. There's no good way to remove the directory > from within mozharness without refactoring the retry method to add something > like an on_fail hook. This is a bit hacky, and it seems like this should > happen within the Gaia build system instead. cc'ing a few more Gaia folks > to see if they can do this. AFAICT, there is a "cleanup" arg to self.retry, that can be used here instead of trying to hack together a refactor. http://mxr.mozilla.org/build/source/mozharness/mozharness/base/script.py#536 Am I off base?
Flags: needinfo?(jgriffin)
Right you are! I wasn't aware of the 'cleanup' arg. I'll make a fix using that.
Flags: needinfo?(jgriffin)
Assignee: nobody → jgriffin
/r/2051 - Bug 1024573 - Cleanup node_modules dir after make failure, r=ahal Pull down this commit: hg pull review -r 7497fdcd22e6614b71904efc7f281b79f4fc609f
Attachment #8544674 - Flags: review?(ahalberstadt)
Attachment #8544674 - Flags: review?(ahalberstadt) → review+
Flags: needinfo?(jlal)
Flags: needinfo?(gaye)
Flags: needinfo?(dale)
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Should this bug reopened for the additional failure reports, or a new bug be filed for them?
Flags: needinfo?(jgriffin)
(In reply to David Baron [:dbaron] ⏰UTC-7 from comment #135) > Should this bug reopened for the additional failure reports, or a new bug be > filed for them? A new bug should be filed for this. The TaskCluster team basically undid the fix for this once the tests were moved to TaskCluster, so it's likely to be a different set of people who need to resolve this.
Flags: needinfo?(jgriffin)
Blocks: 1150094
Attachment #8544674 - Attachment is obsolete: true
Attachment #8618172 - Flags: review+
Product: Testing → Testing Graveyard
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: