Closed Bug 1364695 Opened 3 years ago Closed 2 years ago

Intermittent ConnectionError: ('Connection aborted.', BadStatusLine("''",))

Categories

(Infrastructure & Operations :: CIDuty, task, P1, critical)

Tracking

(firefox55 fixed, firefox56 fixed)

RESOLVED FIXED

People

(Reporter: aryx, Assigned: garbas)

Details

(Keywords: intermittent-failure, Whiteboard: [stockwell infra])

https://treeherder.mozilla.org/logviewer.html#?job_id=98866763&repo=mozilla-inbound

07:57:24     INFO -   0:19.23 Downloading... 100.0 %
07:57:28     INFO -   0:23.48 Downloaded artifact to c:\builds\tooltool_cache\babc414ffc0457d27f5a1ed24a8e4873afbe2f1c1a4075469a27c005e1babc3b2a788f643f825efedff95b79686664c67ec4340ed535487168a3482e68559bc7
07:57:29     INFO -   0:24.83 hashed u'c:\\builds\\tooltool_cache\\babc414ffc0457d27f5a1ed24a8e4873afbe2f1c1a4075469a27c005e1babc3b2a788f643f825efedff95b79686664c67ec4340ed535487168a3482e68559bc7' with sha512 to be babc414ffc0457d27f5a1ed24a8e4873afbe2f1c1a4075469a27c005e1babc3b2a788f643f825efedff95b79686664c67ec4340ed535487168a3482e68559bc7
07:57:29     INFO -   0:24.83 Downloading clang.tar.bz2
07:57:29     INFO -   0:24.83 attempt 1/5
07:57:29     INFO -   0:24.83 Downloading to temporary location c:\builds\tooltool_cache\44dee70d525ea93952af27f943d1cc773311970c31d971d2bc2e3437cce0c899f3a03ddd8e42e86f1b4fd9ab1c4bc1767cdb0406eb4b3934ae4fc272dab830dc
07:57:30     INFO -  Error running mach:
07:57:30     INFO -      ['artifact', 'toolchain', '-v', '--retry', '4', '--tooltool-manifest', 'z:\\task_1494661831\\build\\src\\browser\\config\\tooltool-manifests\\win64\\clang.manifest', '--tooltool-url', 'https://api.pub.build.mozilla.org/tooltool/', '--authentication-file', 'c:\\builds\\relengapi.tok', '--cache-dir', 'c:/builds/tooltool_cache']
07:57:30     INFO -  The error occurred in code that was called by the mach command. This is either
07:57:30     INFO -  a bug in the called code itself or in the way that mach is calling it.
07:57:30     INFO -  You should consider filing a bug for this issue.
07:57:30     INFO -  If filing a bug, please include the full output of mach, including this error
07:57:30     INFO -  message.
07:57:30     INFO -  The details of the failure are as follows:
07:57:30     INFO -  ConnectionError: ('Connection aborted.', BadStatusLine("''",))
07:57:30     INFO -    File "z:\task_1494661831\build\src\python/mozbuild/mozbuild/mach_commands.py", line 1755, in artifact_toolchain
07:57:30     INFO -      record.fetch_with(cache)
07:57:30     INFO -    File "z:\task_1494661831\build\src\python/mozbuild/mozbuild/mach_commands.py", line 1659, in fetch_with
07:57:30     INFO -      self.filename = cache.fetch(self.url)
07:57:30     INFO -    File "z:\task_1494661831\build\src\python/mozbuild\mozbuild\artifacts.py", line 816, in fetch
07:57:30     INFO -      dl.wait()
07:57:30     INFO -    File "z:\task_1494661831\build\src\python/dlmanager\dlmanager\manager.py", line 101, in wait
07:57:30     INFO -      self.raise_if_error()
07:57:30     INFO -    File "z:\task_1494661831\build\src\python/dlmanager\dlmanager\manager.py", line 116, in raise_if_error
07:57:30     INFO -      six.reraise(*self.__error)
07:57:30     INFO -    File "z:\task_1494661831\build\src\python/dlmanager\dlmanager\manager.py", line 157, in _download
07:57:30     INFO -      with closing(session.get(url, stream=True)) as response:
07:57:30     INFO -    File "z:\task_1494661831\build\src\python/requests\requests\sessions.py", line 480, in get
07:57:30     INFO -      return self.request('GET', url, **kwargs)
07:57:30     INFO -    File "z:\task_1494661831\build\src\python/requests\requests\sessions.py", line 468, in request
07:57:30     INFO -      resp = self.send(prep, **send_kwargs)
07:57:30     INFO -    File "z:\task_1494661831\build\src\python/requests\requests\sessions.py", line 597, in send
07:57:30     INFO -      history = [resp for resp in gen] if allow_redirects else []
07:57:30     INFO -    File "z:\task_1494661831\build\src\python/requests\requests\sessions.py", line 195, in resolve_redirects
07:57:30     INFO -      **adapter_kwargs
07:57:30     INFO -    File "z:\task_1494661831\build\src\python/requests\requests\sessions.py", line 576, in send
07:57:30     INFO -      r = adapter.send(request, **kwargs)
07:57:30     INFO -    File "z:\task_1494661831\build\src\python/requests\requests\adapters.py", line 426, in send
07:57:30     INFO -      raise ConnectionError(err, request=request)
07:57:30    ERROR - Return code: 1
07:57:30    ERROR - 1 not in success codes: [0]
07:57:30  WARNING - setting return code to 2
One of the relengapi webheads did not have port 5432 open for connections to the new database. This was fixed in Bug 1344364.
:garbas -- This is much better compared to the May 22 spike, but failures continue at a rate of a dozen or so per day. Do you think you'll be able to eliminate this failure? Would it be helpful to retry when this happens? (It looks like there is retry logic involved, but it is not utilized in this case - most failures happen on "attempt 1/5".)
Flags: needinfo?(rgarbas)
Severity: blocker → critical
Priority: -- → P1
:gbrown: It looks like something is going wrong with the retry logic. I will take a look.
Flags: needinfo?(rgarbas)
Retry currently only happens when a ``requests.exceptions.HTTPError`` exception is raised. I think retry should also happen on ``requests.exceptions.ConnectionError``.

Should I also add a delay between retries (e.g. one second)?
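
A minimal sketch of the idea in this comment: retry on a configurable set of exception types, with a fixed pause between attempts. `fetch_with_retry` and its parameters are hypothetical names for illustration, not the code in `mach_commands.py`. The demo uses Python 3's built-in `ConnectionError` as a stand-in so it runs without network access; in the real command the `retry_on` tuple would presumably hold `requests.exceptions.HTTPError` and `requests.exceptions.ConnectionError`.

```python
import time


def fetch_with_retry(fetch, retry_on, attempts=5, sleeptime=1.0, sleep=time.sleep):
    """Call fetch() up to `attempts` times, pausing `sleeptime` seconds
    after each failure whose exception type is listed in `retry_on`.

    Hypothetical helper: names and defaults are assumptions.
    """
    for attempt in range(1, attempts + 1):
        try:
            return fetch()
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            sleep(sleeptime)


# Demo: a download that fails twice with a connection error, then succeeds.
calls = []


def flaky_download():
    calls.append(1)
    if len(calls) < 3:
        raise ConnectionError("Connection aborted.")
    return "clang.tar.bz2"


delays = []  # record the pauses instead of actually sleeping
result = fetch_with_retry(flaky_download, retry_on=(ConnectionError,), sleep=delays.append)
```

Any exception type not listed in `retry_on` propagates immediately, which is the behaviour seen in the log above: the `ConnectionError` escaped on "attempt 1/5" because only `HTTPError` was being caught.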
Attachment #8876734 - Flags: review?(gps)
Comment on attachment 8876734 [details] [diff] [review]
tooltool_retry_on_connection_error.patch

Review of attachment 8876734 [details] [diff] [review]:
-----------------------------------------------------------------

This is a valid solution so it earns r+.

But I think a better solution is to use the built-in retry logic in requests. See https://stackoverflow.com/questions/15431044/can-i-set-max-retries-for-requests-request for code patterns. Note how it is even possible to configure backoff intervals for the retry logic. Also, remember to .mount('https://') as well as 'http://'.
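
For reference, the pattern suggested in this review might look roughly like the sketch below. It assumes a `requests` version that bundles or depends on `urllib3` with the `Retry` class; `make_retrying_session` and the chosen values are hypothetical, not taken from the patch.

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry


def make_retrying_session(total=5, backoff_factor=1.0):
    """Return a Session whose transport adapters retry failed requests.

    Hypothetical helper: the name and default values are assumptions.
    """
    retry = Retry(
        total=total,
        backoff_factor=backoff_factor,  # growing pause between attempts
        status_forcelist=(500, 502, 503, 504),  # also retry these statuses
    )
    adapter = HTTPAdapter(max_retries=retry)
    session = requests.Session()
    # Mount on both schemes, as the review comment points out.
    session.mount("http://", adapter)
    session.mount("https://", adapter)
    return session
```

A caller would then use `make_retrying_session().get(url, stream=True)` in place of a plain session. Whether this covers every failure mode seen in this bug has not been verified here.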
Attachment #8876734 - Flags: review?(gps) → review+
(In reply to Gregory Szorc [:gps] from comment #16)
> Comment on attachment 8876734 [details] [diff] [review]
> tooltool_retry_on_connection_error.patch
> 
> Review of attachment 8876734 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> This is a valid solution so it earns r+.
> 
> But I think a better solution is to use the built-in retry logic in
> requests. See
> https://stackoverflow.com/questions/15431044/can-i-set-max-retries-for-
> requests-request for code patterns. Note how it is even possible to
> configure backoff intervals for the retry logic. Also, remember to
> .mount('https://') as well as 'http://'.

The problem with using the built-in retry logic is that it won't retry for HTTP errors, and then you end up with two retry strategies.
Comment on attachment 8876734 [details] [diff] [review]
tooltool_retry_on_connection_error.patch

Review of attachment 8876734 [details] [diff] [review]:
-----------------------------------------------------------------

::: python/mozbuild/mozbuild/mach_commands.py
@@ +1790,5 @@
>                                                       sleeptime=60)):
>                  try:
>                      record.fetch_with(cache)
> +                except (requests.exceptions.HTTPError,
> +                        requests.exceptions.ConnectionError) as e:

Note it might be worth being broader than ConnectionError and HTTPError here, and use RequestException. Although that might be too broad... maybe add Timeout only?
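
To make the trade-off in this comment concrete, the sketch below checks which `requests` exception classes the patched `except` clause would catch. It only inspects the class hierarchy and makes no network calls; the variable names are illustrative.

```python
from requests import exceptions as rex

# The exception types caught by the patch under review.
patch_catches = (rex.HTTPError, rex.ConnectionError)

candidates = [
    rex.HTTPError,
    rex.ConnectionError,
    rex.Timeout,
    rex.ConnectTimeout,
    rex.ReadTimeout,
    rex.TooManyRedirects,
]

# True means the patched `except` clause would catch that exception.
covered = {exc.__name__: issubclass(exc, patch_catches) for exc in candidates}

# Every candidate derives from RequestException, the broad option.
all_request_exceptions = all(
    issubclass(exc, rex.RequestException) for exc in candidates
)
```

In `requests`, `ConnectTimeout` inherits from both `ConnectionError` and `Timeout`, so the patch already covers it, while `ReadTimeout` inherits only from `Timeout` and would need `Timeout` added to the tuple. Catching `RequestException` would cover all of these but also things like `TooManyRedirects`, where a retry is unlikely to help.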
Pushed by mh@glandium.org:
https://hg.mozilla.org/integration/mozilla-inbound/rev/45b27cacb06e
Make `mach artifact toolchain` also retry on ConnectionError. r=gps
Product: Release Engineering → Infrastructure & Operations