Closed Bug 1739202 Opened 4 years ago Closed 4 years ago

[traceback] ConnectionError HTTPSConnectionPool(...): Read timed out.

Categories

(Tecken :: Symbolication, defect, P2)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: willkg, Assigned: willkg)

References

Details

Attachments

(1 file)

Eliot got a ConnectionError when trying to download something from S3. It should be using a session that's set up to retry, so I'm puzzled.

Sentry: https://sentry.prod.mozaws.net/operations/symbols-stage/issues/13176320/

timeout: The read operation timed out
  File "urllib3/response.py", line 436, in _error_catcher
    yield
  File "urllib3/response.py", line 518, in read
    data = self._fp.read(amt) if not fp_closed else b""
  File "http/client.py", line 462, in read
    n = self.readinto(b)
  File "http/client.py", line 506, in readinto
    n = self.fp.readinto(b)
  File "python3.9/socket.py", line 704, in readinto
    return self._sock.recv_into(b)
  File "python3.9/ssl.py", line 1241, in recv_into
    return self.read(nbytes, buffer)
  File "python3.9/ssl.py", line 1099, in read
    return self._sslobj.read(len, buffer)

ReadTimeoutError: HTTPSConnectionPool(host='s3.us-west-2.amazonaws.com', port=443): Read timed out.
  File "requests/models.py", line 758, in generate
    for chunk in self.raw.stream(chunk_size, decode_content=True):
  File "urllib3/response.py", line 575, in stream
    data = self.read(amt=amt, decode_content=decode_content)
  File "urllib3/response.py", line 540, in read
    raise IncompleteRead(self._fp_bytes_read, self.length_remaining)
  File "python3.9/contextlib.py", line 137, in __exit__
    self.gen.throw(typ, value, traceback)
  File "urllib3/response.py", line 441, in _error_catcher
    raise ReadTimeoutError(self._pool, None, "Read timed out.")

ConnectionError: HTTPSConnectionPool(host='s3.us-west-2.amazonaws.com', port=443): Read timed out.
  File "falcon/app.py", line 361, in falcon.app.App.__call__
  File "markus/main.py", line 515, in _timer_decorator
    return fun(*args, **kwargs)
  File "eliot/symbolicate_resource.py", line 500, in on_post
    results.append(self.symbolicate(stacks, modules))
  File "eliot/symbolicate_resource.py", line 325, in symbolicate
    self.get_symcache(module_info)
  File "eliot/symbolicate_resource.py", line 257, in get_symcache
    sym_file = self.download_sym_file(debug_filename, debug_id)
  File "eliot/symbolicate_resource.py", line 183, in download_sym_file
    data = self.downloader.get(debug_filename, debug_id, sym_filename)
  File "eliot/downloader.py", line 163, in get
    return source.get(debug_filename, debug_id, filename)
  File "eliot/downloader.py", line 61, in _time_download_fun
    rv = fun(*args, **kwargs)
  File "eliot/downloader.py", line 114, in get
    resp = self.session.get(url, allow_redirects=True)
  File "requests/sessions.py", line 555, in get
    return self.request('GET', url, **kwargs)
  File "requests/sessions.py", line 542, in request
    resp = self.send(prep, **send_kwargs)
  File "raven/breadcrumbs.py", line 341, in send
    resp = real_send(self, request, *args, **kwargs)
  File "requests/sessions.py", line 677, in send
    history = [resp for resp in gen]
  File "requests/sessions.py", line 677, in <listcomp>
    history = [resp for resp in gen]
  File "requests/sessions.py", line 237, in resolve_redirects
    resp = self.send(
  File "raven/breadcrumbs.py", line 341, in send
    resp = real_send(self, request, *args, **kwargs)
  File "requests/sessions.py", line 697, in send
    r.content
  File "requests/models.py", line 836, in content
    self._content = b''.join(self.iter_content(CONTENT_CHUNK_SIZE)) or b''
  File "requests/models.py", line 765, in generate
    raise ConnectionError(e)

I briefly looked at the urllib3 retry code and it's possible we need to pass in an additional parameter specifying how many times to retry on a connection error:

https://github.com/urllib3/urllib3/blob/40b3228495ad9848fae56928ad75a90cebeebc6e/src/urllib3/util/retry.py#L84-L90

I think the requests library still vendors urllib3. If that's the case, we need to check what the vendored version is doing.

Anyhow, this bug covers figuring out what's going on and fixing it. We can't be having connection errors messing everything up.

Assignee: nobody → willkg
Blocks: 1636210
Status: NEW → ASSIGNED

I think what's going on is that it is retrying after a ConnectionError, but it's hitting the max retries and reraising the error.

This error is from September 17th, 2021. We were using urllib3 1.25.11 at the time.

The retry code is here:

https://github.com/urllib3/urllib3/blob/9b95f29c575d73260ec4f5a8c0ea368cf242019e/src/urllib3/util/retry.py#L365

When Eliot creates a Retry instance, it doesn't specify read, so read is set in the __init__ as None. However, this code is looking for False. I have no idea when read could ever be False. Also, the type hints from their master tip suggest the value is either int or None. Curious.

In our case, read is set to None and so it should skip the bit about read and retry total number of times. However, if that happens, then we should have gotten a MaxRetryError at the top of the error stack and we didn't.

Looking at the session code and the adapter code, it sure looks like the HTTPAdapterWithTimeout only covers sending the HTTP request and once a request has been sent, it generates a response instance and it's done. Thus the read timeout isn't covered in the retry setup we built.

I need to look into this more. Probably by writing an HTTP server that accepts the HTTP request and then starts to issue a response and then sits on it. Then I can set up a client with the session_with_retries session and see what it does.

Here's a script that reproduces the issue:

from eliot.librequests import session_with_retries

import logging                                                                            
import http                                                                               

http.client.HTTPConnection.debuglevel = 1
logging.basicConfig(level=logging.DEBUG)
logging.getLogger().setLevel(logging.DEBUG)
req_log = logging.getLogger('urllib3')
req_log.setLevel(logging.DEBUG)
req_log.propagate = True

session = session_with_retries()
print(session.get("https://httpbin.org/drip?duration=60&numbytes=2"))

Result:

... lots of traceback output ...

Traceback (most recent call last):
  File "/app/eliot-service/ret.py", line 17, in <module>
    print(session.get("https://httpbin.org/drip?duration=60&numbytes=2"))
  File "/usr/local/lib/python3.9/site-packages/requests/sessions.py", line 555, in get
    return self.request('GET', url, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/requests/sessions.py", line 542, in request
    resp = self.send(prep, **send_kwargs)
  File "/usr/local/lib/python3.9/site-packages/requests/sessions.py", line 697, in send
    r.content
  File "/usr/local/lib/python3.9/site-packages/requests/models.py", line 836, in content
    self._content = b''.join(self.iter_content(CONTENT_CHUNK_SIZE)) or b''
  File "/usr/local/lib/python3.9/site-packages/requests/models.py", line 765, in generate
    raise ConnectionError(e)
requests.exceptions.ConnectionError: HTTPSConnectionPool(host='httpbin.org', port=443): Read timed out.

httpbin drip will send send the response headers and then send numbytes of data (2 here) over duration seconds (60). That exceeds the 5 second timeout session_with_retries defaults to. We get the exact error as in the Sentry issue and no mention of retrying.

Let's switch it to httpbin delay which gets the HTTP request, then waits X seconds, and then sends the whole response:

from eliot.librequests import session_with_retries

import logging                                                                            
import http                                                                               

http.client.HTTPConnection.debuglevel = 1
logging.basicConfig(level=logging.DEBUG)
logging.getLogger().setLevel(logging.DEBUG)
req_log = logging.getLogger('urllib3')
req_log.setLevel(logging.DEBUG)
req_log.propagate = True

session = session_with_retries()
print(session.get("https://httpbin.org/delay/60"))

Result:

... lots of retry output for 10 attempts ...

Traceback (most recent call last):
  File "/app/eliot-service/ret.py", line 18, in <module>
    print(session.get("https://httpbin.org/delay/60"))
  File "/usr/local/lib/python3.9/site-packages/requests/sessions.py", line 555, in get
    return self.request('GET', url, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/requests/sessions.py", line 542, in request
    resp = self.send(prep, **send_kwargs)
  File "/usr/local/lib/python3.9/site-packages/requests/sessions.py", line 655, in send
    r = adapter.send(request, **kwargs)
  File "/app/eliot-service/eliot/librequests.py", line 37, in send
    return super().send(*args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/requests/adapters.py", line 516, in send
    raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPSConnectionPool(host='httpbin.org', port=443): Max retries exceeded with url: /delay/60 (Caused by ReadTimeoutError("HTTPSConnectionPool(host='httpbin.org', port=443): Read timed out. (read timeout=5.0)"))

Ergo, I'm right on and the session_with_retries setup doesn't cover timeouts when reading the body of the HTTP response and I need a different way of doing things.

I think I could probably return a session wrapper that does what I want it to, but it'll be messy because it's got multiple things "retrying". Peter used to use backoff. I'll look at that first.

willkg merged PR #2445: "bug 1739202: rework requests wrappers to handle content errors" in 94b5d32.

I ended up reworking things and moving the "retryable" stuff off to backoff. That allows me to wrap a method that sends the HTTP request and also reads the response payload as a retryable block.

This deployed to prod in bug #1739601. Marking as FIXED.

Status: ASSIGNED → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: