Bug 696056 (Closed) — Opened 8 years ago, Closed 8 years ago

Some release jobs did not re-trigger after HG failures

Categories

(Release Engineering :: Release Automation: Other, defect, P3)

x86
All
defect

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: armenzg, Unassigned)

Details

(Whiteboard: [release-process-improvement][automation][mercurial][retry])

For instance:
* Linux64 had an hg timeout rather than an error and did not re-trigger [1]
* Linux64 xulrunner had a connection refused [2]
* Fennec source had not been clobbered from the previous run [3] (this was a re-triggered job)
* Android failed on mozharness multi-locale script [4]
* All 6 linux mobile repacks failed with a Bad Gateway [5]

All of these jobs required manual re-triggering rather than being retried automatically.

This is an unusual case, as HG was in an extremely bad state, but it shows that some steps go red (failure) rather than purple (infrastructure exception), even though failures like these are generally good candidates for an automatic retry.

[1]
command timed out: 3600 seconds without output, attempting to kill
elapsedTime=3600.006155
program finished with exit code -1

[2]
abort: error: Connection refused
elapsedTime=0.717556
program finished with exit code 255

[3] 
Process stderr:
abort: destination 'mozilla-beta' is not empty

program finished with exit code 1
elapsedTime=1870.255260

[4]
11:28:52    ERROR -  abort: HTTP Error 500: Internal Server Error
11:28:52    ERROR - Return code: 255
...
11:28:52    ERROR - CalledProcessError: Command '['hg', 'share', '-U', '/builds/hg-shared/releases/l10n/mozilla-beta/it', '/builds/slave/rel-m-beta-lnx-andrd-bld/mozilla-beta/it']' returned non-zero exit status 255
...
Traceback (most recent call last):
  File "mozharness/scripts/multil10n.py", line 52, in <module>
    multi_locale_build.run()
  File "/builds/slave/rel-m-beta-lnx-andrd-bld/mozharness/mozharness/base/script.py", line 509, in run
    self._possibly_run_method(method_name, error_if_missing=True)
  File "/builds/slave/rel-m-beta-lnx-andrd-bld/mozharness/mozharness/base/script.py", line 480, in _possibly_run_method
    return getattr(self, method_name)()
  File "/builds/slave/rel-m-beta-lnx-andrd-bld/mozharness/mozharness/l10n/multi_locale_build.py", line 189, in pull_locale_source
    tag_override=c.get('tag_override'))
  File "/builds/slave/rel-m-beta-lnx-andrd-bld/mozharness/mozharness/base/vcs/vcsbase.py", line 102, in vcs_checkout_repos
    self.vcs_checkout(**kwargs)
  File "/builds/slave/rel-m-beta-lnx-andrd-bld/mozharness/mozharness/base/vcs/vcsbase.py", line 87, in vcs_checkout
    raise VCSException, "No got_revision from ensure_repo_and_revision()"
mozharness.base.errors.VCSException: No got_revision from ensure_repo_and_revision()
program finished with exit code 1
elapsedTime=12.626796

[5]
abort: HTTP Error 502: Bad Gateway
program finished with exit code 255
elapsedTime=0.457380
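The failure signatures above suggest a pattern-based classifier could distinguish transient hg failures (worth retrying, i.e. turning the step purple) from genuine errors like the unclobbered destination in [3]. The following is a minimal sketch, not existing buildbotcustom/mozharness code; the patterns are taken directly from the logs in this bug.

```python
import re

# Hypothetical classifier: patterns drawn from failures [1], [2], [4], [5] above.
# Matching output like this should turn a step purple (RETRY) instead of
# red (FAILURE). The clobber error in [3] is deliberately NOT matched,
# since retrying without a clobber would fail the same way again.
TRANSIENT_HG_PATTERNS = [
    r"command timed out: \d+ seconds without output",   # [1] hg timeout
    r"abort: error: Connection refused",                # [2]
    r"abort: HTTP Error 50[023]",                       # [4] 500, [5] 502
    r"No got_revision from ensure_repo_and_revision",   # [4] mozharness VCSException
]

def is_retryable_hg_failure(log_text):
    """Return True if the log matches a known transient hg failure."""
    return any(re.search(p, log_text) for p in TRANSIENT_HG_PATTERNS)
```

A step wrapper could feed its captured stdio through this check and flip the result to RETRY on a match, leaving non-matching failures red for human attention.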
As I've mentioned elsewhere, hg clone jobs that *were* retrying automatically actually made things worse by sustaining and increasing the load.

While I'd probably make an exception for release jobs, I'm more in favor of improving our hg infrastructure to handle our required load.
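The load concern above is the classic retry-storm problem: immediate retries from many slaves hammer an already-struggling server. A common mitigation is jittered exponential backoff. This is an illustrative sketch only; `run_hg_with_backoff` and its parameters are hypothetical, not an existing buildbotcustom/mozharness API.

```python
import random
import subprocess
import time

def run_hg_with_backoff(cmd, attempts=5, base_delay=30, max_delay=600):
    """Run a command, retrying failures with jittered exponential backoff.

    Each retry waits a random amount between 0 and the capped exponential
    delay ("full jitter"), so retries from many slaves spread out in time
    instead of hitting the hg server in synchronized waves.
    All names and defaults here are illustrative assumptions.
    """
    for attempt in range(attempts):
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode == 0:
            return result
        if attempt < attempts - 1:
            delay = random.uniform(0, min(max_delay, base_delay * 2 ** attempt))
            time.sleep(delay)
    return result  # last failing result, for the caller to report
```

Whether release jobs should get more attempts (or skip backoff entirely, as suggested above) is a policy choice layered on top of a helper like this.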
Priority: -- → P3
Summary: Some jobs did not re-trigger after HG failures → Some release jobs did not re-trigger after HG failures
Whiteboard: [release-process-improvement][automation][mercurial][retry]
Mass move of bugs to Release Automation component.
Component: Release Engineering → Release Engineering: Automation (Release Automation)
No longer blocks: hg-automation
Comment #0 talks about a bunch of different failures. I know that some of these are fixed, and that we're in a better place these days w.r.t. recovering from hg failures. Let's file any new issues that come up individually.
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Product: mozilla.org → Release Engineering