Closed Bug 696056 Opened 13 years ago Closed 13 years ago

Some release jobs did not re-trigger after HG failures

Categories

(Release Engineering :: Release Automation: Other, defect, P3)

x86
All
defect

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: armenzg, Unassigned)

Details

(Whiteboard: [release-process-improvement][automation][mercurial][retry])

For instance:
* Linux64 had an hg timeout rather than an error and did not re-trigger [1]
* Linux64 xulrunner had a connection refused [2]
* Fennec source had not been clobbered from the previous run [3] (this was a re-triggered job)
* Android failed on mozharness multi-locale script [4]
* All 6 linux mobile repacks failed with a Bad Gateway [5]

All of these jobs required manual re-triggering rather than being retried automatically. This is an unusual situation, as HG was extremely bad, but it shows that some steps go red rather than purple on failures that are, in general, good to retry.

[1]
    command timed out: 3600 seconds without output, attempting to kill
    elapsedTime=3600.006155
    program finished with exit code -1

[2]
    abort: error: Connection refused
    elapsedTime=0.717556
    program finished with exit code 255

[3]
    Process stderr:
    abort: destination 'mozilla-beta' is not empty
    program finished with exit code 1
    elapsedTime=1870.255260

[4]
    11:28:52 ERROR - abort: HTTP Error 500: Internal Server Error
    11:28:52 ERROR - Return code: 255
    ...
    11:28:52 ERROR - CalledProcessError: Command '['hg', 'share', '-U', '/builds/hg-shared/releases/l10n/mozilla-beta/it', '/builds/slave/rel-m-beta-lnx-andrd-bld/mozilla-beta/it']' returned non-zero exit status 255
    ...
    Traceback (most recent call last):
      File "mozharness/scripts/multil10n.py", line 52, in <module>
        multi_locale_build.run()
      File "/builds/slave/rel-m-beta-lnx-andrd-bld/mozharness/mozharness/base/script.py", line 509, in run
        self._possibly_run_method(method_name, error_if_missing=True)
      File "/builds/slave/rel-m-beta-lnx-andrd-bld/mozharness/mozharness/base/script.py", line 480, in _possibly_run_method
        return getattr(self, method_name)()
      File "/builds/slave/rel-m-beta-lnx-andrd-bld/mozharness/mozharness/l10n/multi_locale_build.py", line 189, in pull_locale_source
        tag_override=c.get('tag_override'))
      File "/builds/slave/rel-m-beta-lnx-andrd-bld/mozharness/mozharness/base/vcs/vcsbase.py", line 102, in vcs_checkout_repos
        self.vcs_checkout(**kwargs)
      File "/builds/slave/rel-m-beta-lnx-andrd-bld/mozharness/mozharness/base/vcs/vcsbase.py", line 87, in vcs_checkout
        raise VCSException, "No got_revision from ensure_repo_and_revision()"
    mozharness.base.errors.VCSException: No got_revision from ensure_repo_and_revision()
    program finished with exit code 1
    elapsedTime=12.626796

[5]
    abort: HTTP Error 502: Bad Gateway
    program finished with exit code 255
    elapsedTime=0.457380
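To make the red-versus-purple distinction concrete, here is a minimal sketch of how a step's log output could be classified as worth retrying. It is not the actual buildbot or mozharness logic: the names `TRANSIENT_HG_PATTERNS`, `RETRY`, `FAILURE`, and `classify_hg_failure` are hypothetical, and the patterns are taken only from the log excerpts quoted in this bug. The "destination is not empty" case [3] is deliberately left out, since it needs a clobber rather than a plain retry.

```python
import re

# Hypothetical pattern list, built only from the log excerpts [1], [2],
# [4] and [5] in this bug. A real list would need to be maintained
# against what hg actually prints.
TRANSIENT_HG_PATTERNS = [
    re.compile(r"command timed out: \d+ seconds without output"),
    re.compile(r"abort: error: Connection refused"),
    re.compile(r"abort: HTTP Error 5\d\d"),
]

# Hypothetical result labels standing in for "purple" and "red".
RETRY = "retry"
FAILURE = "failure"


def classify_hg_failure(log_text):
    """Return RETRY if the log shows a known transient hg error, else FAILURE."""
    for pattern in TRANSIENT_HG_PATTERNS:
        if pattern.search(log_text):
            return RETRY
    return FAILURE
```

With this sketch, the Bad Gateway output from [5] would be classified as `RETRY`, while the leftover-checkout error from [3] would stay a `FAILURE`.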
As I've mentioned elsewhere, hg clone jobs that *were* retrying automatically actually made things worse by sustaining and increasing the load. While I'd probably make an exception for release jobs, I'm more in favor of improving our hg infrastructure to handle our required load.
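One common way to keep automatic retries from piling load onto a struggling server is to bound the number of attempts and space them out with exponential backoff plus random jitter. The following is a minimal sketch of that idea, not what the automation in this bug actually did: `TransientError`, `retry_with_backoff`, and all default values are hypothetical.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for failures considered safe to retry."""


def retry_with_backoff(action, attempts=5, base_delay=30.0, max_delay=600.0,
                       sleep=time.sleep, rng=random.random):
    """Call action(), retrying on TransientError with jittered backoff.

    The wait before retry k is a random fraction of
    min(max_delay, base_delay * 2**k), so clients that failed together
    do not all come back at the same moment.
    """
    for attempt in range(attempts):
        try:
            return action()
        except TransientError:
            if attempt == attempts - 1:
                raise  # out of attempts: let the step fail
            cap = min(max_delay, base_delay * 2 ** attempt)
            sleep(cap * rng())
```

The `sleep` and `rng` parameters exist only so the delays can be replaced in tests. Whether backoff alone would have been enough under the load described here is an open question.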
Priority: -- → P3
Summary: Some jobs did not re-trigger after HG failures → Some release jobs did not re-trigger after HG failures
Whiteboard: [release-process-improvement][automation][mercurial][retry]
Mass move of bugs to Release Automation component.
Component: Release Engineering → Release Engineering: Automation (Release Automation)
No longer blocks: hg-automation
Comment #0 talks about a bunch of different failures. I know that some of these are fixed, and that we're in a better place these days w.r.t. recovering from hg failures. Let's file any new issues that come up individually.
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Product: mozilla.org → Release Engineering