Closed
Bug 1024573
Opened 10 years ago
Closed 10 years ago
Gaia jobs should wrap node git clones with a retry ("Automation Error: mozprocess timed out after 300 seconds running ['make', 'node_modules', 'NODE_MODULES_GIT_URL=https://git.mozilla.org/b2g/gaia-node-modules.git ']")
Categories
(Testing Graveyard :: JSMarionette, defect)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: emorley, Assigned: jgriffin)
References
(Depends on 1 open bug, Blocks 1 open bug)
Details
(Keywords: intermittent-failure)
Attachments
(1 file, 1 obsolete file)
The following git clone should be wrapped with releng's retry script, so that we have several attempts at cloning and most importantly output a parsable error message if it still doesn't succeed after that.
b2g_ubuntu64_vm mozilla-inbound opt test gaia-linter on 2014-06-12 09:14:03 PDT for push edf5e2dc9198
https://tbpl.mozilla.org/php/getParsedLog.php?id=41608345&tree=Mozilla-Inbound
{
09:23:29 INFO - #####
09:23:29 INFO - ##### Running run-tests step.
09:23:29 INFO - #####
09:23:29 INFO - Running pre-action listener: _resource_record_pre_action
09:23:29 INFO - Running main action method: run_tests
09:23:29 INFO - copying tree: /builds/slave/test/build/application/b2g to /builds/slave/test/gaia/b2g
09:23:29 INFO - rmtree: /builds/slave/test/gaia/b2g
09:23:30 INFO - Running command: ['npm', 'cache', 'clean']
09:23:30 INFO - Copy/paste: npm cache clean
09:23:35 INFO - Return code: 0
09:23:35 INFO - retry: Calling <bound method GaiaIntegrationTest.run_command of <__main__.GaiaIntegrationTest object at 0x1ea7590>> with args: [['make', 'node_modules', 'NODE_MODULES_GIT_URL=https://git.mozilla.org/b2g/gaia-node-modules.git']], kwargs: {'error_list': [{'substr': 'command not found', 'level': 'error'}, {'substr': 'npm ERR! Error:', 'level': 'error'}], 'cwd': '/builds/slave/test/gaia', 'output_timeout': 300}, attempt #1
09:23:35 INFO - Running command: ['make', 'node_modules', 'NODE_MODULES_GIT_URL=https://git.mozilla.org/b2g/gaia-node-modules.git'] in /builds/slave/test/gaia
09:23:35 INFO - Copy/paste: make node_modules NODE_MODULES_GIT_URL=https://git.mozilla.org/b2g/gaia-node-modules.git
09:23:35 INFO - Calling ['make', 'node_modules', 'NODE_MODULES_GIT_URL=https://git.mozilla.org/b2g/gaia-node-modules.git'] with output_timeout 300
09:23:36 INFO - if [ ! -d "git-gaia-node-modules" ] ; then \
09:23:36 INFO - git clone "https://git.mozilla.org/b2g/gaia-node-modules.git" "git-gaia-node-modules" ; \
09:23:36 INFO - fi
09:23:36 INFO - Cloning into 'git-gaia-node-modules'...
09:28:36 INFO - Automation Error: mozprocess timed out after 300 seconds running ['make', 'node_modules', 'NODE_MODULES_GIT_URL=https://git.mozilla.org/b2g/gaia-node-modules.git']
09:28:36 ERROR - timed out after 300 seconds of no output
}
Reporter | ||
Comment 1•10 years ago
|
||
Though I guess this isn't specific to gaia-linter in fact, but any of the gaia jobs that require using node/npm.
Component: General → Gaia::Build
Reporter | ||
Updated•10 years ago
|
Summary: Gaia-linter should wrap git clones with a retry ("Automation Error: mozprocess timed out after 300 seconds running ['make', 'node_modules', 'NODE_MODULES_GIT_URL=https://git.mozilla.org/b2g/gaia-node-modules.git']") → Gaia jobs should wrap node git clones with a retry ("Automation Error: mozprocess timed out after 300 seconds running ['make', 'node_modules', 'NODE_MODULES_GIT_URL=https://git.mozilla.org/b2g/gaia-node-modules.git']")
Reporter | ||
Comment 3•10 years ago
|
||
b2g_ubuntu64_vm b2g-inbound opt test gaia-build on 2014-06-12 09:18:13 PDT for push cb841932d877
https://tbpl.mozilla.org/php/getParsedLog.php?id=41609484&tree=B2g-Inbound
Reporter | ||
Comment 4•10 years ago
|
||
b2g_ubuntu64_vm b2g-inbound opt test gaia-linter on 2014-06-12 09:32:35 PDT for push 405b97a3955c
https://tbpl.mozilla.org/php/getParsedLog.php?id=41609879&tree=B2g-Inbound
Reporter | ||
Comment 5•10 years ago
|
||
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment 76•10 years ago
|
||
Kevin, can you please take a look at some of these recent failures? I want to say this is infra-related, but it only seems to be hitting Gij. So either every other harness is handling things more robustly or there's something going on with the JSMarionette harness.
Flags: needinfo?(kgrandon)
Comment 77•10 years ago
|
||
Seems infra-related to me as well, but I'm not sure what's going on. It seems that we always get to the same line, then it just sits there. From the logs, it looks like we get to here before the timeout:
14:11:49 INFO - > websocket@1.0.14 install /builds/slave/test/gaia/node_modules/marionette-js-runner/node_modules/marionette-js-logger/node_modules/websocket
14:11:49 INFO - > (node-gyp rebuild 2> builderror.log) || (exit 0)
The second line, (node-gyp rebuild 2> builderr.log), is also contained within a successful run, so I wonder if something is trying to get logged to builderr.log? I don't know our harness setup too well, but I suppose I would recommend trying to debug and seeing what gets logged to builderr.log.
Adding a needinfo? on Gareth and Dale - who have been active in the marionette-js-logger repo[1] recently.
[1] https://github.com/mozilla-b2g/marionette-js-logger/commits/master
Component: Gaia::Build → JSMarionette
Flags: needinfo?(kgrandon)
Flags: needinfo?(gaye)
Flags: needinfo?(dale)
Product: Firefox OS → Testing
Comment 78•10 years ago
|
||
My $0.02 here - I believe the issue is with the way the repository is being pulled, which prevents a retry from happening. (Based on experience with vcs-sync, we do see occasional network hangs talking to either {hg,git}.m.o -- I'll file a bug based on that, and the data I have. That experience says a "real" retry should succeed.)
The mozharness code does have a retry implemented, as requested:
http://hg.mozilla.org/build/mozharness/annotate/f05a3b9237d8/mozharness/mozilla/gaia.py#l309
However, it is NOT a retry of the git operation, it is a retry of the make command. Looking at that make target shows that it only cares if the directory exists and is newer than gaia_node_modules.revision:
http://mxr.mozilla.org/gaia/source/Makefile#719
My _guess_ is the timed out run ends up leaving an incomplete directory in place That matches with the "instant successful retry" attempt that shows in the logs.
adding a ni to :jgriffin - he might know of a way to remove the directory before the make command is retried.
Flags: needinfo?(jgriffin)
Comment 79•10 years ago
|
||
Bug 1113944 is to find the root cause, which affects other systems. Continued TBPL Robot logging will be helpful, as some of the network devices do not retain logs for very long.
Assignee | ||
Comment 80•10 years ago
|
||
(In reply to Hal Wine [:hwine] (Back Dec 29) (use needinfo) from comment #78)
> My $0.02 here - I believe the issue is with the way the repository is being
> pulled, which prevents a retry from happening. (Based on experience with
> vcs-sync, we do see occasional network hangs talking to either {hg,git}.m.o
> -- I'll file a bug based on that, and the data I have. That experience says
> a "real" retry should succeed.)
>
> The mozharness code does have a retry implemented, as requested:
> http://hg.mozilla.org/build/mozharness/annotate/f05a3b9237d8/mozharness/
> mozilla/gaia.py#l309
>
> However, it is NOT a retry of the git operation, it is a retry of the make
> command. Looking at that make target shows that it only cares if the
> directory exists and is newer than gaia_node_modules.revision:
> http://mxr.mozilla.org/gaia/source/Makefile#719
>
> My _guess_ is the timed out run ends up leaving an incomplete directory in
> place That matches with the "instant successful retry" attempt that shows
> in the logs.
>
> adding a ni to :jgriffin - he might know of a way to remove the directory
> before the make command is retried.
I agree with your diagnosis. There's no good way to remove the directory from within mozharness without refactoring the retry method to add something like an on_fail hook. This is a bit hacky, and it seems like this should happen within the Gaia build system instead. cc'ing a few more Gaia folks to see if they can do this.
Flags: needinfo?(kgrandon)
Flags: needinfo?(jlal)
Flags: needinfo?(jgriffin)
Comment 81•10 years ago
|
||
I'll leave this up to James, but it seems like taskcluster and automatic retries might be the way to go here.
Flags: needinfo?(kgrandon)
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment 102•10 years ago
|
||
(In reply to Jonathan Griffin (:jgriffin) from comment #80)
> (In reply to Hal Wine [:hwine] (Back Dec 29) (use needinfo) from comment #78)
> > adding a ni to :jgriffin - he might know of a way to remove the directory
> > before the make command is retried.
>
> I agree with your diagnosis. There's no good way to remove the directory
> from within mozharness without refactoring the retry method to add something
> like an on_fail hook. This is a bit hacky, and it seems like this should
> happen within the Gaia build system instead. cc'ing a few more Gaia folks
> to see if they can do this.
AFAICT, there is a "cleanup" arg to self.retry, that can be used here instead of trying to hack together a refactor.
http://mxr.mozilla.org/build/source/mozharness/mozharness/base/script.py#536
Am I off base?
Flags: needinfo?(jgriffin)
Assignee | ||
Comment 103•10 years ago
|
||
Right you are! I wasn't aware of the 'cleanup' arg. I'll make a fix using that.
Flags: needinfo?(jgriffin)
Assignee | ||
Updated•10 years ago
|
Assignee: nobody → jgriffin
Assignee | ||
Comment 104•10 years ago
|
||
Trying a fix on mozharness-try: https://treeherder.mozilla.org/ui/#/jobs?repo=try&revision=73cb7beea997
Assignee | ||
Comment 105•10 years ago
|
||
Assignee | ||
Comment 106•10 years ago
|
||
/r/2051 - Bug 1024573 - Cleanup node_modules dir after make failure, r=ahal
Pull down this commit:
hg pull review -r 7497fdcd22e6614b71904efc7f281b79f4fc609f
Assignee | ||
Updated•10 years ago
|
Attachment #8544674 -
Flags: review?(ahalberstadt)
Updated•10 years ago
|
Attachment #8544674 -
Flags: review?(ahalberstadt) → review+
Comment 107•10 years ago
|
||
Assignee | ||
Comment 108•10 years ago
|
||
Flags: needinfo?(jlal)
Flags: needinfo?(gaye)
Flags: needinfo?(dale)
Assignee | ||
Comment 109•10 years ago
|
||
Merged to production: https://hg.mozilla.org/build/mozharness/rev/6638d8bea71c
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Comment 110•10 years ago
|
||
In production: https://hg.mozilla.org/build/mozharness/rev/774932288f08
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Should this bug reopened for the additional failure reports, or a new bug be filed for them?
Flags: needinfo?(jgriffin)
Assignee | ||
Comment 136•10 years ago
|
||
(In reply to David Baron [:dbaron] ⏰UTC-7 from comment #135)
> Should this bug reopened for the additional failure reports, or a new bug be
> filed for them?
A new bug should be filed for this. The TaskCluster team basically undid the fix for this once the tests were moved to TaskCluster, so it's likely to be a different set of people who need to resolve this.
Flags: needinfo?(jgriffin)
Assignee | ||
Comment 137•9 years ago
|
||
Attachment #8544674 -
Attachment is obsolete: true
Attachment #8618172 -
Flags: review+
Assignee | ||
Comment 138•9 years ago
|
||
Updated•7 years ago
|
Product: Testing → Testing Graveyard
You need to log in
before you can comment on or make changes to this bug.
Description
•