Closed Bug 1305095 Opened 3 years ago Closed 3 years ago

Intermittent-infra "fetching hg.mozilla.org fingerprint" resulting in "socket.timeout: timed out"

Categories

(Taskcluster :: General, defect)

defect
Not set

Tracking

(Not tracked)

RESOLVED FIXED
mozilla52

People

(Reporter: philor, Unassigned)

References

Details

(Keywords: intermittent-failure)

Attachments

(2 files)

We won't get an accurate idea of frequency, since there's no parseable failure message to it so it just gets ignored, but 

https://treeherder.mozilla.org/logviewer.html#?job_id=36375574&repo=mozilla-inbound

 [vcs 2016-09-23T15:27:20.479158Z] fetching hg.mozilla.org fingerprint from http://taskcluster/secrets/v1/secret/project/taskcluster/gecko/hgfingerprint
 Traceback (most recent call last):
   File "/home/worker/bin/run-task", line 263, in <module>
     sys.exit(main(sys.argv[1:]))
   File "/home/worker/bin/run-task", line 249, in main
     vcs_checkout(args)
   File "/home/worker/bin/run-task", line 101, in vcs_checkout
     res = urllib2.urlopen(FINGERPRINT_URL, timeout=10)
   File "/usr/lib/python2.7/urllib2.py", line 154, in urlopen
     return opener.open(url, data, timeout)
   File "/usr/lib/python2.7/urllib2.py", line 429, in open
     response = self._open(req, data)
   File "/usr/lib/python2.7/urllib2.py", line 447, in _open
     '_open', req)
   File "/usr/lib/python2.7/urllib2.py", line 407, in _call_chain
     result = func(*args)
   File "/usr/lib/python2.7/urllib2.py", line 1228, in http_open
     return self.do_open(httplib.HTTPConnection, req)
   File "/usr/lib/python2.7/urllib2.py", line 1201, in do_open
     r = h.getresponse(buffering=True)
   File "/usr/lib/python2.7/httplib.py", line 1136, in getresponse
     response.begin()
   File "/usr/lib/python2.7/httplib.py", line 453, in begin
     version, status, reason = self._read_status()
   File "/usr/lib/python2.7/httplib.py", line 409, in _read_status
     line = self.fp.readline(_MAXLINE + 1)
   File "/usr/lib/python2.7/socket.py", line 480, in readline
     data = self._sock.recv(self._rbufsize)
socket.timeout: timed out
Duplicate of this bug: 1305104
HTTP requests that aren't retried with exponential backoff will never be reliable, see:
https://dxr.mozilla.org/mozilla-central/source/testing/docker/recipes/run-task#101

Note. I would just do the lazy thing here an retry in the event of any error.
That way it'll be fine even if it's a DNS lookup issue or something, which FYI, we've seen in EC2/heroku.
Looking at at least one of these instances shows that not only did the proxy have address resolution errors, but so did the worker itself when reporting an error to sentry.  It looks like DNS communication was a problem.
In theory, `taskcluster` is a local connection, and the taskclusterProxy handles retries, so run-task shouldn't need to retry this; rather, taskclusterProxy should retry in this case.
taskclusterProxy *may* have retried internally. But there's no way of knowing from the task-visible logs. All we know is the HTTP request to http://taskcluster timed out after 10s. An explanation for the timeout is taskclusterProxy was retrying.

I /think/ the bug here is the timeout querying http://taskcluster is too low?
See Also: → 1310505
Comment on attachment 8801718 [details]
Bug 1305095 - Add a fallback hg fingerprint, supports timeouts in automation and supports local developers who wish to use 'run locally'.

https://reviewboard.mozilla.org/r/86390/#review85128

Looks good to me.  With luck, we'll never update fingerprints again :)
Attachment #8801718 - Flags: review?(dustin) → review+
Comment on attachment 8801718 [details]
Bug 1305095 - Add a fallback hg fingerprint, supports timeouts in automation and supports local developers who wish to use 'run locally'.

https://reviewboard.mozilla.org/r/86390/#review85226

This is mostly good.

Just a few issues to fix.

::: testing/docker/recipes/run-task:108
(Diff revision 1)
> -        secret = json.loads(secret, encoding='utf-8')
> +            secret = json.loads(secret, encoding='utf-8')
> -    except ValueError:
> +        except ValueError:
> -        print('invalid JSON in hg fingerprint secret')
> +            print('invalid JSON in hg fingerprint secret')
> -        sys.exit(1)
> +            sys.exit(1)
> +    except urllib2.URLError:
> +        print('Unable to retrieve current hg.mozilla.org fingerprint using'

This should be using `print_line()` so output is consistent.

::: testing/docker/recipes/run-task:108
(Diff revision 1)
> +        print('Unable to retrieve current hg.mozilla.org fingerprint using'
> +              'the secret service, using fallback instead.')
> +        secret = {'secret': FALLBACK_FINGERPRINT}

It may be obvious, but I think there should be an inline comment saying the hard-coded fingerprint won't be accurate if running an old revision after the server fingerprint has changed. The important thing to convey is the fallback is here to increase automation's robustness in the common case and failure of the secret store or a certificate change are not common.
Attachment #8801718 - Flags: review?(gps) → review-
Comment on attachment 8801718 [details]
Bug 1305095 - Add a fallback hg fingerprint, supports timeouts in automation and supports local developers who wish to use 'run locally'.

https://reviewboard.mozilla.org/r/86390/#review85932
Attachment #8801718 - Flags: review?(gps) → review+
Pushed by gszorc@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/d64b1dfabdef
Add a fallback hg fingerprint, supports timeouts in automation and supports local developers who wish to use 'run locally'. r=dustin,gps
https://hg.mozilla.org/mozilla-central/rev/d64b1dfabdef
Status: NEW → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla52
Whiteboard: [checkin-needed-aurora][checkin-needed-beta]
Assignee: nobody → bugspam.Callek
https://hg.mozilla.org/releases/mozilla-aurora/rev/56cbbfe97981
Whiteboard: [checkin-needed-aurora][checkin-needed-beta] → [checkin-needed-beta]
Whiteboard: [checkin-needed-beta]
https://treeherder.mozilla.org/logviewer.html#?job_id=5499694&repo=autoland#L140
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Not working on this past the landed patch...
Assignee: bugspam.Callek → nobody
There's been a spike in these failures today. 

I do not see the fallback being attempted:

https://treeherder.mozilla.org/logviewer.html#?repo=autoland&job_id=6324591#L0-L47

 [vcs 2016-11-08T14:02:03.814362Z] fetching hg.mozilla.org fingerprint from http://taskcluster/secrets/v1/secret/project/taskcluster/gecko/hgfingerprint

 Traceback (most recent call last):

   File "/home/worker/bin/run-task", line 323, in <module>

     sys.exit(main(sys.argv[1:]))

   File "/home/worker/bin/run-task", line 300, in main

     branch=os.environ.get('GECKO_HEAD_REF'))

   File "/home/worker/bin/run-task", line 101, in vcs_checkout

     res = urllib2.urlopen(FINGERPRINT_URL, timeout=10)

   File "/usr/lib/python2.7/urllib2.py", line 154, in urlopen

     return opener.open(url, data, timeout)

   File "/usr/lib/python2.7/urllib2.py", line 431, in open

     response = self._open(req, data)

   File "/usr/lib/python2.7/urllib2.py", line 449, in _open

     '_open', req)

   File "/usr/lib/python2.7/urllib2.py", line 409, in _call_chain

     result = func(*args)

   File "/usr/lib/python2.7/urllib2.py", line 1227, in http_open

     return self.do_open(httplib.HTTPConnection, req)

   File "/usr/lib/python2.7/urllib2.py", line 1200, in do_open

     r = h.getresponse(buffering=True)

   File "/usr/lib/python2.7/httplib.py", line 1132, in getresponse

     response.begin()

   File "/usr/lib/python2.7/httplib.py", line 453, in begin

     version, status, reason = self._read_status()

   File "/usr/lib/python2.7/httplib.py", line 409, in _read_status

     line = self.fp.readline(_MAXLINE + 1)

   File "/usr/lib/python2.7/socket.py", line 480, in readline

     data = self._sock.recv(self._rbufsize)

socket.timeout: timed out 

I think we need to handle socket.timeout in addition to URLError.
Attachment #8808832 - Flags: review?(bugspam.Callek)
Comment on attachment 8808832 [details] [diff] [review]
use fallback on socket.timeout also

Review of attachment 8808832 [details] [diff] [review]:
-----------------------------------------------------------------

<obligatory nag about using mozreview for gecko patches where possible>
Attachment #8808832 - Flags: review?(bugspam.Callek) → review+
Pushed by gbrown@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/1fd37c448b3f
Use fallback hg fingerprint on socket.timeout also; r=Callek
https://hg.mozilla.org/mozilla-central/rev/1fd37c448b3f
Status: REOPENED → RESOLVED
Closed: 3 years ago3 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.