Closed Bug 920161 Opened 7 years ago Closed 6 years ago

Cloning of talos repo does not retry and/or output a TBPL compatible failure message ("command timed out: 3600 seconds without output, attempting to kill")

Categories

(Release Engineering :: General, defect)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: emorley, Assigned: aki)

Details

(Keywords: intermittent-failure, sheriffing-P1)

Attachments

(1 file)

+++ This bug was initially created as a clone of Bug #920153 +++

The talos equivalent of bug 920153.

eg:
https://tbpl.mozilla.org/php/getParsedLog.php?id=27701351&tree=Mozilla-Inbound

{
06:13:24     INFO - #####
06:13:24     INFO - ##### Running clone-talos step.
06:13:24     INFO - #####
06:13:24     INFO - Running pre-action listener: _resource_record_pre_action
06:13:24     INFO - Running main action method: clone_talos
06:13:24     INFO - Populating webroot /builds/slave/talos-slave/talos-data...
06:13:24     INFO - rmtree: /builds/slave/talos-slave/talos-data/talos
06:13:24     INFO - retry: Calling <function rmtree at 0xb6e6d454> with args: ('/builds/slave/talos-slave/talos-data/talos',), kwargs: {}, attempt #1
06:13:24     INFO - retry: Calling <bound method Talos._get_revision of <mozharness.mozilla.testing.talos.Talos object at 0xb7040b6c>> with args: (<mozharness.base.vcs.mercurial.MercurialVCS object at 0xb6ea1a2c>, '/builds/slave/talos-slave/test/build/talos_repo'), kwargs: {}, attempt #1
06:13:24     INFO - Setting /builds/slave/talos-slave/test/build/talos_repo to http://hg.mozilla.org/build/talos revision ca2229a32cb6.
06:13:24     INFO - Cloning http://hg.mozilla.org/build/talos to /builds/slave/talos-slave/test/build/talos_repo.
06:13:24     INFO - Running command: ['hg', '--config', 'ui.merge=internal:merge', 'clone', 'http://hg.mozilla.org/build/talos', '/builds/slave/talos-slave/test/build/talos_repo']
06:13:24     INFO - Copy/paste: hg --config ui.merge=internal:merge clone http://hg.mozilla.org/build/talos /builds/slave/talos-slave/test/build/talos_repo

command timed out: 3600 seconds without output, attempting to kill
process killed by signal 9
program finished with exit code -1
elapsedTime=3658.177546
========= Finished '/tools/buildbot/bin/python scripts/scripts/talos_script.py ...' failed (results: 2, elapsed: 1 hrs, 58 secs) (at 2013-09-11 07:13:24.901622) =========
}

Expected:
* A few retries of the hg clone
* "Automation Error: Unable to clone talos repo" (and buildbot not having to kill the run).
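For illustration, the expected behaviour above could look roughly like the sketch below. This is not the mozharness implementation: `clone_with_retries` and its parameters are hypothetical names, and `clone` stands in for whatever callable actually runs the hg clone and returns True on success.

```python
import time


def clone_with_retries(clone, attempts=3, sleep_seconds=30,
                       sleep=time.sleep, log=print):
    """Call clone() until it returns True or the attempts run out.

    All names here are hypothetical; clone is any zero-argument callable
    that returns a truthy value on success.
    """
    for attempt in range(1, attempts + 1):
        log("retry: cloning talos repo, attempt #%d" % attempt)
        try:
            if clone():
                return True
        except Exception as exc:
            # Treat an exception like any other failed attempt.
            log("clone raised %r" % (exc,))
        if attempt < attempts:
            sleep(sleep_seconds)
    # Emit a message the log parser can match, instead of waiting for
    # buildbot to kill the job.
    log("Automation Error: Unable to clone talos repo")
    return False
```

The point of the final log line is that the job fails by itself with a recognisable message, rather than hanging until the 3600 second buildbot timeout.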
Keywords: sheriffing-P1
Chris, is there someone that could take a look at this for us? :-)
Flags: needinfo?(catlee)
Attached patch: talos_timeout
This patch:

* allows for an "output_timeout" in the repo definition for MercurialVCS
* fixes a bunch of pep8 in mozharness.mozilla.testing.talos
* adds a hardcoded 1200 second talos clone timeout
* adds a hardcoded mozprocess dependency; we now create the virtualenv before cloning talos, and then update the virtualenv with talos+pyyaml later.
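As a rough illustration of what an output timeout means here, the sketch below kills a command once it has printed nothing for a given number of seconds. It uses only the Python standard library and is not the mozprocess API or the code in the patch; `run_with_output_timeout` is a hypothetical name.

```python
import queue
import subprocess
import sys
import threading


def run_with_output_timeout(cmd, output_timeout):
    """Run cmd and kill it if it is silent for output_timeout seconds.

    Returns (returncode, lines, timed_out). Hypothetical helper, stdlib only.
    """
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT, text=True)
    q = queue.Queue()

    def reader():
        # Forward each output line to the main thread; None marks EOF.
        for line in proc.stdout:
            q.put(line)
        q.put(None)

    t = threading.Thread(target=reader, daemon=True)
    t.start()

    lines = []
    timed_out = False
    while True:
        try:
            # The timeout restarts on every line, so only silence counts.
            line = q.get(timeout=output_timeout)
        except queue.Empty:
            timed_out = True
            proc.kill()
            break
        if line is None:
            break
        lines.append(line)

    proc.wait()
    t.join(timeout=1)
    return proc.returncode, lines, timed_out


if __name__ == "__main__":
    # A silent command is killed after about 1 second of no output.
    print(run_with_output_timeout(
        [sys.executable, "-c", "import time; time.sleep(30)"], 1))
```

With something like this, a hung hg clone would be stopped after 1200 seconds of silence and could then be retried, well before the 3600 second buildbot limit.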

Got past the talos clone on ash with no concerns; the talos clone had an output timeout of 1200 as expected.
Assignee: nobody → aki
Attachment #8355690 - Flags: review?(jgriffin)
Flags: needinfo?(catlee)
Comment on attachment 8355690: talos_timeout

Review of attachment 8355690:
-----------------------------------------------------------------

lgtm
Attachment #8355690 - Flags: review?(jgriffin) → review+
Armen: are you going to merge mozharness at some point this week, or should I?
Flags: needinfo?(armenzg)
Merged mozharness (not getting CCed to this bug).

(In reply to Aki Sasaki [:aki] from comment #123)
> Armen: are you going to merge mozharness at some point this week, or should
> I?

I had pushed new code to Cypress to make sure that default was in good shape. I had to do a lot of retries until it all looked green.
Flags: needinfo?(armenzg)
Thanks!

This bug should be resolved for desktop talos.
Android panda talos runs a separate workflow and may not be fixed yet, but it looks to be a minority of the issues posted here.

I *think* I should resolve this bug, but could leave open/morph for Pandas.
Do you have a preference?
Flags: needinfo?(emorley)
(In reply to Aki Sasaki [:aki] from comment #125)
> I *think* I should resolve this bug, but could leave open/morph for Pandas.
> Do you have a preference?

Let's close this - very few of the failures were for the Pandas - we can always file another bug if needed.

Thank you all for fixing this! :-D
Status: NEW → RESOLVED
Closed: 6 years ago
Flags: needinfo?(emorley)
Resolution: --- → FIXED
Component: General Automation → General