Open Bug 1371378 Opened 7 years ago Updated 3 days ago

Intermittent abort: error applying bundle

Categories

(Developer Services :: Mercurial: robustcheckout, defect)

defect
Not set
normal

Tracking

(Not tracked)

REOPENED

People

(Reporter: aryx, Unassigned)

References

(Blocks 1 open bug)

Details

(Keywords: intermittent-failure, Whiteboard: [stockwell infra])

Attachments

(1 file)

+++ This bug was initially created as a clone of Bug #1317594 +++

https://treeherder.mozilla.org/logviewer.html#?job_id=105581680&repo=autoland

[vcs 2017-06-08T18:46:53.914760Z] executing ['hg', 'robustcheckout', '--sharebase', '/home/worker/checkouts/hg-store', '--purge', '--upstream', 'https://hg.mozilla.org/mozilla-unified', '--revision', '09251a0a67718d0c341606c231c1cf0da3e41a3d', 'https://hg.mozilla.org/integration/autoland/', '/home/worker/checkouts/gecko']
[vcs 2017-06-08T18:46:53.977912Z] ensuring https://hg.mozilla.org/integration/autoland/@09251a0a67718d0c341606c231c1cf0da3e41a3d is available at /home/worker/checkouts/gecko
[vcs 2017-06-08T18:46:53.978092Z] (cloning from upstream repo https://hg.mozilla.org/mozilla-unified)
[vcs 2017-06-08T18:46:54.849694Z] (sharing from new pooled repository 8ba995b74e18334ab3707f27e9eb8f4e37ba3d29)
[vcs 2017-06-08T18:46:55.302676Z] applying clone bundle from https://s3-external-1.amazonaws.com/moz-hg-bundles-us-east-1/mozilla-unified/fd72dc86c01db8ef2035c7aaeccd9758ec9bc8ae.packed1.hg
[vcs 2017-06-08T18:49:02.578380Z] error fetching bundle: Connection timed out
[vcs 2017-06-08T18:49:02.580006Z] abort: error applying bundle
There are 36 failures in the past 7 days.
Most of the occurrences are on Linux x64 debug/ asan/ opt, osx-cross opt/ debug, Linux opt and debug, android-4-0-armv7-api16 opt, linux64-stylo-only opt and debug, osx-cross-devedition opt and windows-mingw32 all.
Recent log failure: https://treeherder.mozilla.org/logviewer.html#?repo=autoland&job_id=169147432&lineNumber=33
Whiteboard: [stockwell needswork]
:gps, is this something that you would be able to fix?
Flags: needinfo?(gps)
Whiteboard: [stockwell needswork] → [stockwell infra]
Yes, we should be able to fix this. But before we can, we need more details about the exception so we can detect it in robustcheckout. The patch submitted for review will facilitate that detection.
Flags: needinfo?(gps)
Comment on attachment 8961607 [details]
robustcheckout: log details of unhandled exception (bug 1371378);

https://reviewboard.mozilla.org/r/230470/#review236196
Attachment #8961607 - Flags: review?(sheehan) → review+
Pushed by cosheehan@mozilla.com:
https://hg.mozilla.org/hgcustom/version-control-tools/rev/f8e922e5539f
robustcheckout: log details of unhandled exception ; r=sheehan
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
I'll get this landed in bug 1448438. We'll have to wait for new failures to arrive before we can fix the underlying issue though.
Status: RESOLVED → REOPENED
Depends on: 1448438
Resolution: FIXED → ---
Over the last 7 days there are 39 failures on this bug. These happen on Android 4.2 x86, android-4-0-armv7-api16, android-4-0-armv7-api16, lint, Linux, Linux x64, linux64-noopt, osx-cross, osx-cross-noopt.

Here is the most recent log example: https://treeherder.mozilla.org/logviewer.html#?repo=mozilla-central&job_id=171311222&lineNumber=34
Flags: needinfo?(aryx.bugmail)
We have 31 failures in the last 7 days.
Platforms: Linux opt/ pgo/ debug, Linux x64 opt/ debug and asan, osx-cross opt, and for the followings we have one occurrence for each: lint opt, android-4-0-armv7-api16 opt, Android 4.2 x86 opt and Windows 2012 x64 ccov.
Recent log failure: 
https://treeherder.mozilla.org/logviewer.html#?repo=autoland&job_id=175181802&lineNumber=29
Flags: needinfo?(gbrown)
These failures come and go, but overall we seem to be getting them more frequently than we were a couple of months ago.

I don't see the logging added in comment 39, but now there is "TLS/SSL connection has been closed (EOF)":

[vcs 2018-04-20T20:14:44.941Z] applying clone bundle from https://s3-external-1.amazonaws.com/moz-hg-bundles-us-east-1/mozilla-unified/8bcaeaf2a5a02a28e88a932c50fef64cd09df8b2.packed1.hg
[vcs 2018-04-20T20:15:38.696Z] error fetching bundle: TLS/SSL connection has been closed (EOF) (_ssl.c:581)
[vcs 2018-04-20T20:15:38.700Z] PERFHERDER_DATA: {"framework": {"name": "vcs"}, "suites": [{"extraOptions": ["m4.4xlarge"], "lowerIsBetter": true, "name": "clone_errored", "shouldAlert": false, "subtests": [], "value": 54.4082088470459}, {"extraOptions": ["m4.4xlarge"], "lowerIsBetter": true, "name": "overall", "shouldAlert": false, "subtests": [], "value": 55.04014706611633}]}
[vcs 2018-04-20T20:15:38.700Z] abort: error applying bundle
[vcs 2018-04-20T20:15:38.700Z] (if this error persists, consider contacting the server operator or disable clone bundles via "--config ui.clonebundles=false")
[taskcluster 2018-04-20 20:15:39.017Z] === Task Finished ===

:gps - Do you know what's happening here?
Flags: needinfo?(gbrown) → needinfo?(gps)
Digging into Mercurial's source code, we see:

def trypullbundlefromurl(ui, repo, url):
    """Attempt to apply a bundle from a URL."""
    with repo.lock(), repo.transaction('bundleurl') as tr:
        try:
            fh = urlmod.open(ui, url)
            cg = readbundle(ui, fh, 'stream')

            if isinstance(cg, streamclone.streamcloneapplier):
                cg.apply(repo)
            else:
                bundle2.applybundle(repo, cg, tr, 'clonebundles', url)
            return True
        except urlerr.httperror as e:
            ui.warn(_('HTTP error fetching bundle: %s\n') %
                    stringutil.forcebytestr(e))
        except urlerr.urlerror as e:
            ui.warn(_('error fetching bundle: %s\n') %
                    stringutil.forcebytestr(e.reason))

        return False


And it is called as such:

    if trypullbundlefromurl(repo.ui, repo, url):
        repo.ui.status(_('finished applying clone bundle\n'))
    # Bundle failed.
    #
    # We abort by default to avoid the thundering herd of
    # clients flooding a server that was expecting expensive
    # clone load to be offloaded.
    elif repo.ui.configbool('ui', 'clonebundlefallback'):
        repo.ui.warn(_('falling back to normal clone\n'))
    else:
        raise error.Abort(_('error applying bundle'),
                          hint=_('if this error persists, consider contacting '
                                 'the server operator or disable clone '
                                 'bundles via '
                                 '"--config ui.clonebundles=false"'))


So Mercurial is swallowing the urlerr.urlerror error, converting it to a warning message. And the failure to complete the operation is being converted into a generic error.Abort, which has no knowledge of the underlying URLError.

This will make this issue difficult to fix without an upstream code change.
Flags: needinfo?(gps)
this seems fixed, :gps can we resolve this bug?
Flags: needinfo?(gps)
I'd like to keep this open to track implementing comment 58.
Flags: needinfo?(gps)
Blocks: 1461919
Blocks: 1501520

From the most recent errors that happened on autoland just now, it seems there might be a race condition between the actual availability of the bundles on s3 and the server returning the updated list of clonebundles.

Flags: needinfo?(sheehan)

... or not. The url that 404s (a stream bundle on us-west-1 s3) works just fine from home and from a small server I have on us-west-2a.

Interestingly, the occurrence today was all due to one rogue worker repeatedly getting a 404 over the course of 20 minutes (and maybe more if the tree hadn't been closed), while other workers were doing fine.

Flags: needinfo?(sheehan)

There have been 34 total failures in the last 7 days, recent failure log.
Affected platforms are:

  • lint
  • toolchains
  • windows10-64
  • windows2012-32
  • windows2012-32-shippable
  • windows2012-64
  • windows2012-64-shippable
[setup 2023-01-14T12:06:59.762Z] run-task started in Z:\task_167369793059414
[vcs 2023-01-14T12:06:59.763Z] WARNING: vcs checkout path (./build/src) not in cache or volume; performance will likely suffer
[vcs 2023-01-14T12:06:59.763Z] WARNING: HG_STORE_PATH (y:/hg-shared) not in cache or volume; performance will likely suffer
[vcs 2023-01-14T12:06:59.763Z] TASKCLUSTER_WORKER_LOCATION environment variable not set; using public hg.mozilla.org service
[vcs 2023-01-14T12:06:59.763Z] executing ['C:\\Program Files\\Mercurial\\hg.exe', 'robustcheckout', '--sharebase', 'y:\\hg-shared', '--purge', '--upstream', 'https://hg.mozilla.org/mozilla-unified', '--revision', '92bf950f388660ebc143d4a3308c0cd3f6249eb7', 'https://hg.mozilla.org/mozilla-central', 'Z:\\task_167369793059414\\build\\src']
[vcs 2023-01-14T12:07:02.318Z] (using Mercurial 6.2.1+hg6.a331725d01f8)
[vcs 2023-01-14T12:07:16.311Z] ensuring https://hg.mozilla.org/mozilla-central@92bf950f388660ebc143d4a3308c0cd3f6249eb7 is available at Z:\task_167369793059414\build\src
[vcs 2023-01-14T12:07:16.311Z] (cloning from upstream repo https://hg.mozilla.org/mozilla-unified)
[vcs 2023-01-14T12:07:16.311Z] (sharing from new pooled repository 8ba995b74e18334ab3707f27e9eb8f4e37ba3d29)
[vcs 2023-01-14T12:07:16.311Z] applying clone bundle from https://hg.cdn.mozilla.net/mozilla-unified/3881f3870b22ce4745c2be682760e9c5fcb003f7.stream-v2.hg
[vcs 2023-01-14T12:07:16.311Z] error fetching bundle: [Errno 11001] getaddrinfo failed
[vcs 2023-01-14T12:07:16.346Z] Traceback (most recent call last):
[vcs 2023-01-14T12:07:16.346Z]   File "mercurial.scmutil", line 153, in callcatch
[vcs 2023-01-14T12:07:16.346Z]     return func()
[vcs 2023-01-14T12:07:16.346Z]   File "mercurial.dispatch", line 441, in _runcatchfunc
[vcs 2023-01-14T12:07:16.346Z]     return _dispatch(req)
[vcs 2023-01-14T12:07:16.346Z]   File "mercurial.dispatch", line 1254, in _dispatch
[vcs 2023-01-14T12:07:16.346Z]     return runcommand(
[vcs 2023-01-14T12:07:16.346Z]   File "mercurial.dispatch", line 899, in runcommand
[vcs 2023-01-14T12:07:16.346Z]     ret = _runcommand(ui, options, cmd, d)
[vcs 2023-01-14T12:07:16.346Z]   File "mercurial.dispatch", line 1266, in _runcommand
[vcs 2023-01-14T12:07:16.346Z]     return cmdfunc()
[vcs 2023-01-14T12:07:16.346Z]   File "mercurial.dispatch", line 1252, in <lambda>
[vcs 2023-01-14T12:07:16.346Z]     d = lambda: util.checksignature(func)(ui, *args, **strcmdopt)
[vcs 2023-01-14T12:07:16.346Z]   File "mercurial.util", line 1880, in check
[vcs 2023-01-14T12:07:16.346Z]     return func(*args, **kwargs)
[vcs 2023-01-14T12:07:16.346Z]   File "C:/mozilla-build/robustcheckout.py", line 202, in robustcheckout
[vcs 2023-01-14T12:07:16.346Z]     return _docheckout(
[vcs 2023-01-14T12:07:16.346Z]   File "C:/mozilla-build/robustcheckout.py", line 613, in _docheckout
[vcs 2023-01-14T12:07:16.346Z]     res = hg.clone(
[vcs 2023-01-14T12:07:16.346Z]   File "mercurial.hg", line 798, in clone
[vcs 2023-01-14T12:07:16.346Z]     return clonewithshare(
[vcs 2023-01-14T12:07:16.346Z]   File "mercurial.hg", line 548, in clonewithshare
[vcs 2023-01-14T12:07:16.346Z]     clone(
[vcs 2023-01-14T12:07:16.346Z]   File "mercurial.hg", line 943, in clone
[vcs 2023-01-14T12:07:16.346Z]     exchange.pull(
[vcs 2023-01-14T12:07:16.346Z]   File "mercurial.exchange", line 1698, in pull
[vcs 2023-01-14T12:07:16.346Z]     _maybeapplyclonebundle(pullop)
[vcs 2023-01-14T12:07:16.346Z]   File "mercurial.exchange", line 2823, in _maybeapplyclonebundle
[vcs 2023-01-14T12:07:16.346Z]     raise error.Abort(
[vcs 2023-01-14T12:07:16.346Z] mercurial.error.Abort: error applying bundle
[vcs 2023-01-14T12:07:16.346Z] abort: error applying bundle
[vcs 2023-01-14T12:07:16.346Z] (if this error persists, consider contacting the server operator or disable clone bundles via "--config ui.clonebundles=false")
[taskcluster 2023-01-14T12:07:16.407Z]    Exit Code: 255
[taskcluster 2023-01-14T12:07:16.407Z]    User Time: 0s
[taskcluster 2023-01-14T12:07:16.407Z]  Kernel Time: 0s
[taskcluster 2023-01-14T12:07:16.407Z]    Wall Time: 16.7678695s
[taskcluster 2023-01-14T12:07:16.407Z]       Result: FAILED
[taskcluster 2023-01-14T12:07:16.407Z] === Task Finished ===
[taskcluster 2023-01-14T12:07:16.407Z] Task Duration: 16.7698697s
Flags: needinfo?(sheehan)

(In reply to Marian-Vasile Laza from comment #210)

...
[vcs 2023-01-14T12:07:16.311Z] error fetching bundle: [Errno 11001] getaddrinfo failed

This is a networking issue; DNS failed to return an IP address for hg.cdn.mozilla.net, or name resolution failed in some other way.

Flags: needinfo?(sheehan)
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: