Closed
Bug 1305095
Opened 8 years ago
Closed 8 years ago
Intermittent-infra "fetching hg.mozilla.org fingerprint" resulting in "socket.timeout: timed out"
Categories
(Taskcluster :: General, defect)
Taskcluster
General
Tracking
(Not tracked)
RESOLVED
FIXED
mozilla52
People
(Reporter: philor, Unassigned)
References
Details
(Keywords: intermittent-failure)
Attachments
(2 files)
We won't get an accurate idea of frequency, since there's no parseable failure message to it so it just gets ignored, but https://treeherder.mozilla.org/logviewer.html#?job_id=36375574&repo=mozilla-inbound [vcs 2016-09-23T15:27:20.479158Z] fetching hg.mozilla.org fingerprint from http://taskcluster/secrets/v1/secret/project/taskcluster/gecko/hgfingerprint Traceback (most recent call last): File "/home/worker/bin/run-task", line 263, in <module> sys.exit(main(sys.argv[1:])) File "/home/worker/bin/run-task", line 249, in main vcs_checkout(args) File "/home/worker/bin/run-task", line 101, in vcs_checkout res = urllib2.urlopen(FINGERPRINT_URL, timeout=10) File "/usr/lib/python2.7/urllib2.py", line 154, in urlopen return opener.open(url, data, timeout) File "/usr/lib/python2.7/urllib2.py", line 429, in open response = self._open(req, data) File "/usr/lib/python2.7/urllib2.py", line 447, in _open '_open', req) File "/usr/lib/python2.7/urllib2.py", line 407, in _call_chain result = func(*args) File "/usr/lib/python2.7/urllib2.py", line 1228, in http_open return self.do_open(httplib.HTTPConnection, req) File "/usr/lib/python2.7/urllib2.py", line 1201, in do_open r = h.getresponse(buffering=True) File "/usr/lib/python2.7/httplib.py", line 1136, in getresponse response.begin() File "/usr/lib/python2.7/httplib.py", line 453, in begin version, status, reason = self._read_status() File "/usr/lib/python2.7/httplib.py", line 409, in _read_status line = self.fp.readline(_MAXLINE + 1) File "/usr/lib/python2.7/socket.py", line 480, in readline data = self._sock.recv(self._rbufsize) socket.timeout: timed out
Comment 2•8 years ago
|
||
HTTP requests that aren't retried with exponential backoff will never be reliable, see: https://dxr.mozilla.org/mozilla-central/source/testing/docker/recipes/run-task#101 Note. I would just do the lazy thing here an retry in the event of any error. That way it'll be fine even if it's a DNS lookup issue or something, which FYI, we've seen in EC2/heroku.
Comment 3•8 years ago
|
||
Looking at at least one of these instances shows that not only did the proxy have address resolution errors, but so did the worker itself when reporting an error to sentry. It looks like DNS communication was a problem.
Comment hidden (Intermittent Failures Robot) |
Comment 5•8 years ago
|
||
In theory, `taskcluster` is a local connection, and the taskclusterProxy handles retries, so run-task shouldn't need to retry this; rather, taskclusterProxy should retry in this case.
Comment 6•8 years ago
|
||
taskclusterProxy *may* have retried internally. But there's no way of knowing from the task-visible logs. All we know is the HTTP request to http://taskcluster timed out after 10s. An explanation for the timeout is taskclusterProxy was retrying. I /think/ the bug here is the timeout querying http://taskcluster is too low?
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (mozreview-request) |
Comment 11•8 years ago
|
||
mozreview-review |
Comment on attachment 8801718 [details] Bug 1305095 - Add a fallback hg fingerprint, supports timeouts in automation and supports local developers who wish to use 'run locally'. https://reviewboard.mozilla.org/r/86390/#review85128 Looks good to me. With luck, we'll never update fingerprints again :)
Attachment #8801718 -
Flags: review?(dustin) → review+
Updated•8 years ago
|
Attachment #8801718 -
Flags: review?(gps)
Comment 12•8 years ago
|
||
mozreview-review |
Comment on attachment 8801718 [details] Bug 1305095 - Add a fallback hg fingerprint, supports timeouts in automation and supports local developers who wish to use 'run locally'. https://reviewboard.mozilla.org/r/86390/#review85226 This is mostly good. Just a few issues to fix. ::: testing/docker/recipes/run-task:108 (Diff revision 1) > - secret = json.loads(secret, encoding='utf-8') > + secret = json.loads(secret, encoding='utf-8') > - except ValueError: > + except ValueError: > - print('invalid JSON in hg fingerprint secret') > + print('invalid JSON in hg fingerprint secret') > - sys.exit(1) > + sys.exit(1) > + except urllib2.URLError: > + print('Unable to retrieve current hg.mozilla.org fingerprint using' This should be using `print_line()` so output is consistent. ::: testing/docker/recipes/run-task:108 (Diff revision 1) > + print('Unable to retrieve current hg.mozilla.org fingerprint using' > + 'the secret service, using fallback instead.') > + secret = {'secret': FALLBACK_FINGERPRINT} It may be obvious, but I think there should be an inline comment saying the hard-coded fingerprint won't be accurate if running an old revision after the server fingerprint has changed. The important thing to convey is the fallback is here to increase automation's robustness in the common case and failure of the secret store or a certificate change are not common.
Attachment #8801718 -
Flags: review?(gps) → review-
Comment hidden (mozreview-request) |
Comment 14•8 years ago
|
||
mozreview-review |
Comment on attachment 8801718 [details] Bug 1305095 - Add a fallback hg fingerprint, supports timeouts in automation and supports local developers who wish to use 'run locally'. https://reviewboard.mozilla.org/r/86390/#review85932
Attachment #8801718 -
Flags: review?(gps) → review+
Comment 15•8 years ago
|
||
Pushed by gszorc@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/d64b1dfabdef Add a fallback hg fingerprint, supports timeouts in automation and supports local developers who wish to use 'run locally'. r=dustin,gps
Reporter | ||
Comment 16•8 years ago
|
||
bugherder |
https://hg.mozilla.org/mozilla-central/rev/d64b1dfabdef
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla52
Updated•8 years ago
|
Whiteboard: [checkin-needed-aurora][checkin-needed-beta]
Updated•8 years ago
|
Assignee: nobody → bugspam.Callek
Comment 17•8 years ago
|
||
bugherder uplift |
https://hg.mozilla.org/releases/mozilla-aurora/rev/56cbbfe97981
Whiteboard: [checkin-needed-aurora][checkin-needed-beta] → [checkin-needed-beta]
Comment 18•8 years ago
|
||
bugherder uplift |
https://hg.mozilla.org/releases/mozilla-beta/rev/ddd81d183374
Updated•8 years ago
|
Whiteboard: [checkin-needed-beta]
Reporter | ||
Comment 19•8 years ago
|
||
https://treeherder.mozilla.org/logviewer.html#?job_id=5499694&repo=autoland#L140
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Comment 20•8 years ago
|
||
Not working on this past the landed patch...
Assignee: bugspam.Callek → nobody
Comment 21•8 years ago
|
||
There's been a spike in these failures today. I do not see the fallback being attempted: https://treeherder.mozilla.org/logviewer.html#?repo=autoland&job_id=6324591#L0-L47 [vcs 2016-11-08T14:02:03.814362Z] fetching hg.mozilla.org fingerprint from http://taskcluster/secrets/v1/secret/project/taskcluster/gecko/hgfingerprint Traceback (most recent call last): File "/home/worker/bin/run-task", line 323, in <module> sys.exit(main(sys.argv[1:])) File "/home/worker/bin/run-task", line 300, in main branch=os.environ.get('GECKO_HEAD_REF')) File "/home/worker/bin/run-task", line 101, in vcs_checkout res = urllib2.urlopen(FINGERPRINT_URL, timeout=10) File "/usr/lib/python2.7/urllib2.py", line 154, in urlopen return opener.open(url, data, timeout) File "/usr/lib/python2.7/urllib2.py", line 431, in open response = self._open(req, data) File "/usr/lib/python2.7/urllib2.py", line 449, in _open '_open', req) File "/usr/lib/python2.7/urllib2.py", line 409, in _call_chain result = func(*args) File "/usr/lib/python2.7/urllib2.py", line 1227, in http_open return self.do_open(httplib.HTTPConnection, req) File "/usr/lib/python2.7/urllib2.py", line 1200, in do_open r = h.getresponse(buffering=True) File "/usr/lib/python2.7/httplib.py", line 1132, in getresponse response.begin() File "/usr/lib/python2.7/httplib.py", line 453, in begin version, status, reason = self._read_status() File "/usr/lib/python2.7/httplib.py", line 409, in _read_status line = self.fp.readline(_MAXLINE + 1) File "/usr/lib/python2.7/socket.py", line 480, in readline data = self._sock.recv(self._rbufsize) socket.timeout: timed out I think we need to handle socket.timeout in addition to URLError.
Attachment #8808832 -
Flags: review?(bugspam.Callek)
Comment hidden (Intermittent Failures Robot) |
Comment 23•8 years ago
|
||
Comment on attachment 8808832 [details] [diff] [review] use fallback on socket.timeout also Review of attachment 8808832 [details] [diff] [review]: ----------------------------------------------------------------- <obligatory nag about using mozreview for gecko patches where possible>
Attachment #8808832 -
Flags: review?(bugspam.Callek) → review+
Comment 24•8 years ago
|
||
Pushed by gbrown@mozilla.com: https://hg.mozilla.org/integration/mozilla-inbound/rev/1fd37c448b3f Use fallback hg fingerprint on socket.timeout also; r=Callek
Comment 25•8 years ago
|
||
bugherder |
https://hg.mozilla.org/mozilla-central/rev/1fd37c448b3f
Status: REOPENED → RESOLVED
Closed: 8 years ago → 8 years ago
Resolution: --- → FIXED
Comment 26•8 years ago
|
||
bugherder uplift |
https://hg.mozilla.org/releases/mozilla-aurora/rev/20a4335dc0c0
Comment 27•8 years ago
|
||
bugherder uplift |
https://hg.mozilla.org/releases/mozilla-release/rev/652c87a1a068
Comment hidden (Intermittent Failures Robot) |
You need to log in
before you can comment on or make changes to this bug.
Description
•