Open Bug 1451080 Opened 7 years ago Updated 23 days ago

Intermittent abort: reached maximum number of network attempts; giving up

Categories

(Developer Services :: Mercurial: hg.mozilla.org, defect, P5)


Tracking

(Not tracked)

REOPENED

People

(Reporter: intermittent-bug-filer, Unassigned)

References

Details

(Keywords: intermittent-failure, Whiteboard: [stockwell infra][stockwell unknown])

Filed by: apavel [at] mozilla.com

https://treeherder.mozilla.org/logviewer.html#?job_id=171689371&repo=autoland
https://queue.taskcluster.net/v1/task/JBOula_1SQitmhjHyW04XQ/runs/0/artifacts/public/logs/live_backing.log

[taskcluster 2018-04-03 17:58:30.125Z] === Task Starting ===
[setup 2018-04-03T17:58:30.465Z] run-task started
[cache 2018-04-03T17:58:30.466Z] cache /builds/worker/checkouts exists; requirements: gid=1000 uid=1000 version=1
[cache 2018-04-03T17:58:30.467Z] cache /builds/worker/tooltool-cache exists; requirements: gid=1000 uid=1000 version=1
[cache 2018-04-03T17:58:30.467Z] cache /builds/worker/workspace is empty; writing requirements: gid=1000 uid=1000 version=1
[volume 2018-04-03T17:58:30.467Z] volume /builds/worker/checkouts is a cache
[volume 2018-04-03T17:58:30.467Z] volume /builds/worker/tooltool-cache is a cache
[volume 2018-04-03T17:58:30.467Z] volume /builds/worker/workspace is a cache
[setup 2018-04-03T17:58:30.467Z] running as worker:worker
[vcs 2018-04-03T17:58:30.467Z] executing ['hg', 'robustcheckout', '--sharebase', '/builds/worker/checkouts/hg-store', '--purge', '--upstream', 'https://hg.mozilla.org/mozilla-unified', '--revision', '8f3d33f7c72babbf232f1938e982d4dd5f54f142', 'https://hg.mozilla.org/integration/autoland', '/builds/worker/workspace/build/src']
[vcs 2018-04-03T17:58:30.526Z] (using Mercurial 4.5.2)
[vcs 2018-04-03T17:58:30.526Z] ensuring https://hg.mozilla.org/integration/autoland@8f3d33f7c72babbf232f1938e982d4dd5f54f142 is available at /builds/worker/workspace/build/src
[vcs 2018-04-03T18:02:45.212Z] socket error: [Errno 110] Connection timed out
[vcs 2018-04-03T18:02:45.212Z] (retrying after network failure on attempt 1 of 3)
[vcs 2018-04-03T18:02:45.213Z] (waiting 6.22s before retry)
[vcs 2018-04-03T18:02:51.441Z] ensuring https://hg.mozilla.org/integration/autoland@8f3d33f7c72babbf232f1938e982d4dd5f54f142 is available at /builds/worker/workspace/build/src
[vcs 2018-04-03T18:07:06.074Z] socket error: [Errno 110] Connection timed out
[vcs 2018-04-03T18:07:06.074Z] (retrying after network failure on attempt 2 of 3)
[vcs 2018-04-03T18:07:06.074Z] (waiting 8.37s before retry)
[vcs 2018-04-03T18:07:14.449Z] ensuring https://hg.mozilla.org/integration/autoland@8f3d33f7c72babbf232f1938e982d4dd5f54f142 is available at /builds/worker/workspace/build/src
[vcs 2018-04-03T18:11:28.986Z] socket error: [Errno 110] Connection timed out
[vcs 2018-04-03T18:11:28.990Z] PERFHERDER_DATA: {"framework": {"name": "vcs"}, "suites": [{"extraOptions": ["m4.4xlarge"], "lowerIsBetter": true, "name": "overall", "shouldAlert": false, "subtests": [], "value": 778.4606559276581}]}
[vcs 2018-04-03T18:11:28.990Z] abort: reached maximum number of network attempts; giving up
[vcs 2018-04-03T18:11:28.990Z]
[taskcluster 2018-04-03 18:11:29.287Z] === Task Finished ===
[taskcluster 2018-04-03 18:11:29.415Z] Artifact "public/build" not found at "/builds/worker/artifacts/"
[taskcluster 2018-04-03 18:11:29.818Z] Unsuccessful task run with exit code: 255 completed in 780.697 seconds
:kmoir, is this your team, relops, or some other team? Without a triage owner I'm not sure who to redirect this bug to.
Flags: needinfo?(kmoir)
Whiteboard: [stockwell infra]
It looks like robustcheckout failed and subsequent retries failed as well. I suspect an intermittent network error, because the revision exists and can be checked out in other jobs in that taskgroup. I would advise rerunning the failed task. If this is an ongoing issue we can investigate.
Flags: needinfo?(kmoir)
dhouse: this smells like we may be hitting capacity limits on hg.mo. Would you be able to look at the zlb logs to see if we're hitting capacity on the hgweb machines and timing out on the zlb?
Flags: needinfo?(dhouse)
I found logs on the zeus nodes covering all activity for the last 2 hours. The latest event in the Treeherder search was this afternoon, so it was not in those logs. I'll check tomorrow to see if I can catch one within the 2-hour window and review the logs on zeus. I also downloaded the failure events from zeus for the last ten days, and the few failures I spot-checked from Treeherder have no matching failure events recorded.
Kendall showed me around the logs and where they are archived. I've requested access to the archives in bug 1453475.

hgweb is on zlb5 and zlb3:
63.245.215.25/24 zlb5.ops.scl3.mozilla.com/eth0.5
63.245.215.102/24 zlb3.ops.scl3.mozilla.com/eth0.5

Logs go to /var/log/zeus/hg.mozilla.org-ssl.access_%{%Y-%m-%d-%H}t and are rsync'd to metrics-logger1.private.scl3:/data/logs/
I have not yet found failures in the logs for the occurrence at 7:48 PDT today (all 200s):
```
[11/Apr/2018:07:31:26 -0700] "GET /integration/autoland/rev/73e6728b22c5 HTTP/1.1" 200 10118 "-" "Mozilla/5.0 (compatible; The Lounge IRC Client; +https://github.com/thelounge/lounge)" "ssl:SSL_RSA_WITH_AES_128_GCM_SHA256, version=TLSv1.2, bits=128" "node:10.22.74.175:80" node_s:1.025409 req_s:1.400907 retries:0
[11/Apr/2018:07:32:06 -0700] "GET /integration/autoland/json-automationrelevance/73e6728b22c5bc2a79c650907c4e30fcd84f55a0 HTTP/1.1" 200 1408 "-" "python-requests/2.9.1" "ssl:SSL_RSA_WITH_AES_128_GCM_SHA256, version=TLSv1.2, bits=128" "node:10.22.74.175:80" node_s:0.144212 req_s:0.235547 retries:0
[11/Apr/2018:07:39:42 -0700] "GET /integration/autoland/rev/73e6728b22c5 HTTP/1.1" 200 3447 "https://bugzilla.mozilla.org/" "Mozilla/5.0 (X11; Linux x86_64; rv:61.0) Gecko/20100101 Firefox/61.0" "ssl:SSL_ECDHE_RSA_WITH_AES_128_GCM_SHA256, version=TLSv1.2, bits=128" "node:10.22.74.175:80" node_s:1.177323 req_s:1.467772 retries:0
[11/Apr/2018:07:39:42 -0700] "GET /integration/autoland/static/mercurial.js HTTP/1.1" 200 14266 "https://hg.mozilla.org/integration/autoland/rev/73e6728b22c5" "Mozilla/5.0 (X11; Linux x86_64; rv:61.0) Gecko/20100101 Firefox/61.0" "ssl:SSL_ECDHE_RSA_WITH_AES_128_GCM_SHA256, version=TLSv1.2, bits=128" "node:10.22.74.174:80" node_s:0.020804 req_s:0.021473 retries:0
[11/Apr/2018:07:39:42 -0700] "GET /integration/autoland/static/style-gitweb.css HTTP/1.1" 200 3129 "https://hg.mozilla.org/integration/autoland/rev/73e6728b22c5" "Mozilla/5.0 (X11; Linux x86_64; rv:61.0) Gecko/20100101 Firefox/61.0" "ssl:SSL_ECDHE_RSA_WITH_AES_128_GCM_SHA256, version=TLSv1.2, bits=128" "node:10.22.74.175:80" node_s:0.036429 req_s:0.067729 retries:0
[11/Apr/2018:07:39:43 -0700] "GET /integration/autoland/static/moz-logo-bw-rgb.svg HTTP/1.1" 200 2270 "https://hg.mozilla.org/integration/autoland/rev/73e6728b22c5" "Mozilla/5.0 (X11; Linux x86_64; rv:61.0) Gecko/20100101 Firefox/61.0" "ssl:SSL_ECDHE_RSA_WITH_AES_128_GCM_SHA256, version=TLSv1.2, bits=128" "node:10.22.74.174:80" node_s:0.016045 req_s:0.016045 retries:0
[11/Apr/2018:07:49:51 -0700] "GET /integration/autoland/json-pushes?changeset=73e6728b22c5bc2a79c650907c4e30fcd84f55a0 HTTP/1.1" 200 698 "-" "autophone" "ssl:SSL_RSA_WITH_AES_128_GCM_SHA256, version=TLSv1.2, bits=128" "node:10.22.74.161:80" node_s:0.051242 req_s:0.071920 retries:0
```
That was on zlb5; the same 200s appear in that timeframe on zlb3. I do not think I'm seeing the requests from robustcheckout (urllib2) in the failure timeframe:
```
[11/Apr/2018:07:31:26 -0700] "GET /integration/autoland/rev/73e6728b22c5 HTTP/1.1" 200 10118 "-" "Mozilla/5.0 (compatible; The Lounge IRC Client; +https://github.com/thelounge/lounge)" "ssl:SSL_RSA_WITH_AES_128_GCM_SHA256, version=TLSv1.2, bits=128" "node:10.22.74.174:80" node_s:1.124576 req_s:1.552493 retries:0
[11/Apr/2018:07:31:26 -0700] "GET /integration/autoland/rev/73e6728b22c5 HTTP/1.1" 200 10118 "-" "Mozilla/5.0 (compatible; The Lounge IRC Client; +https://github.com/thelounge/lounge)" "ssl:SSL_RSA_WITH_AES_128_GCM_SHA256, version=TLSv1.2, bits=128" "node:10.22.74.176:80" node_s:1.244409 req_s:1.665172 retries:0
[11/Apr/2018:07:31:28 -0700] "GET /integration/autoland/raw-file/73e6728b22c5bc2a79c650907c4e30fcd84f55a0/.taskcluster.yml HTTP/1.1" 200 2947 "-" "node-superagent/1.8.5" "ssl:SSL_RSA_WITH_AES_128_GCM_SHA256, version=TLSv1.2, bits=128" "node:10.22.74.161:80" node_s:0.335373 req_s:0.432932 retries:0
[11/Apr/2018:07:49:52 -0700] "GET /integration/autoland/json-pushes?changeset=73e6728b22c5bc2a79c650907c4e30fcd84f55a0 HTTP/1.1" 200 698 "-" "autophone" "ssl:SSL_RSA_WITH_AES_128_GCM_SHA256, version=TLSv1.2, bits=128" "node:10.22.74.161:80" node_s:0.055291 req_s:0.084248 retries:0
[11/Apr/2018:07:49:52 -0700] "GET /integration/autoland/json-pushes?changeset=73e6728b22c5bc2a79c650907c4e30fcd84f55a0 HTTP/1.1" 200 698 "-" "autophone" "ssl:SSL_RSA_WITH_AES_128_GCM_SHA256, version=TLSv1.2, bits=128" "node:10.22.74.176:80" node_s:0.042835 req_s:0.062447 retries:0
[11/Apr/2018:07:49:53 -0700] "GET /integration/autoland/raw-rev/73e6728b22c5bc2a79c650907c4e30fcd84f55a0 HTTP/1.1" 200 1365 "-" "autophone" "ssl:SSL_RSA_WITH_AES_128_GCM_SHA256, version=TLSv1.2, bits=128" "node:10.22.74.176:80" node_s:1.291405 req_s:1.317616 retries:0
```
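(For anyone repeating this kind of search later: a minimal sketch of the filtering used here, in Python. The log path, time window, and the "mercurial" user-agent substring are placeholders/assumptions, not taken from the actual tooling.)
```
# Hypothetical helper: narrow a zeus/zlb access log to a time window and look
# for Mercurial-client requests. Log path, window, and UA substring are
# placeholders/assumptions.
from datetime import datetime, timedelta

LOG = "/var/log/zeus/hg.mozilla.org-ssl.access_2018-04-11-07"  # placeholder
START = datetime(2018, 4, 11, 7, 30)
END = START + timedelta(minutes=30)


def parse_ts(line):
    # Entries start like: [11/Apr/2018:07:31:26 -0700] "GET ..."
    raw = line.split("]", 1)[0].lstrip("[").rsplit(" ", 1)[0]
    return datetime.strptime(raw, "%d/%b/%Y:%H:%M:%S")


with open(LOG) as fh:
    for line in fh:
        try:
            ts = parse_ts(line)
        except (ValueError, IndexError):
            continue  # skip lines without a parseable timestamp
        if START <= ts <= END and "mercurial" in line.lower():
            print(line.rstrip())
```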
See Also: → 1451193
Treeherder found this failure 3 times this morning. I checked the latest, failing between 6:30 and 6:43 AM PDT from IP 34.207.64.52, and found no record of the requests in the zeus logs. (As with the earlier occurrences, other tools/IPs are requesting the same raw-file URLs and getting successes.)
Greg, I've added a see-also onto bug 1451193. It doesn't look like the requests are reaching the zlb, and we're not seeing any failures or capacity issues.
Flags: needinfo?(dhouse)
I experienced this error on one of my own Try pushes. The error also coincided with nagios alerts about servers hitting max worker capacity, and they were at max worker capacity because ActiveData was hammering hg.mo with thousands of HTTP requests from many concurrent clients. So this appears to be an intermittent worker-pool-at-capacity issue, or at least that's a partial cause.
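(Not a fix, but an overload pattern like this should be visible in the same zlb access logs discussed above. A rough sketch that counts requests per user agent to spot one client dominating the traffic; the log path is a placeholder.)
```
# Rough sketch: count requests per user agent in a zlb access log to spot a
# single client (e.g. an ActiveData-style crawler) dominating the traffic.
# The log path is a placeholder; the UA is the third quoted field in these logs.
import re
from collections import Counter

LOG = "/var/log/zeus/hg.mozilla.org-ssl.access_2018-04-24-10"  # placeholder

counts = Counter()
with open(LOG) as fh:
    for line in fh:
        quoted = re.findall(r'"([^"]*)"', line)
        if len(quoted) >= 3:
            # quoted fields: request, referrer, user agent, ssl info, node
            counts[quoted[2]] += 1

for agent, n in counts.most_common(10):
    print("%8d  %s" % (n, agent))
```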
No failures in >2 weeks.
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Recent failure log: https://treeherder.mozilla.org/logviewer.html#?job_id=178446831&repo=mozilla-inbound&lineNumber=104

[taskcluster 2018-05-15T00:09:37.555Z] === Task Starting ===
[taskcluster 2018-05-15T00:09:38.819Z] Uploading redirect artifact public/logs/live.log to URL https://elluejqaaaawgynk222j42rissfzgc7q6cktwplkx6y2r526.taskcluster-worker.net:60023/log/XtEc0nXRStSiUkA3H1SPwQ with mime type "text/plain; charset=utf-8" and expiry 2018-05-15T02:39:36.116Z
[taskcluster 2018-05-15T00:09:39.306Z] Executing command 0: :: sccache currently uses the full compiler commandline as input to the
[taskcluster 2018-05-15T00:09:39.306Z] :: cache hash key, so create a symlink to the task dir and build from
[taskcluster 2018-05-15T00:09:39.306Z] :: the symlink dir to get consistent paths.
[taskcluster 2018-05-15T00:09:39.306Z] if exist z:\build rmdir z:\build
Z:\task_1526342528>if exist z:\build rmdir z:\build
[taskcluster 2018-05-15T00:09:39.346Z] Exit Code: 0
[taskcluster 2018-05-15T00:09:39.346Z] User Time: 0s
[taskcluster 2018-05-15T00:09:39.346Z] Kernel Time: 15.625ms
[taskcluster 2018-05-15T00:09:39.346Z] Wall Time: 40.0037ms
[taskcluster 2018-05-15T00:09:39.346Z] Result: SUCCEEDED
[taskcluster 2018-05-15T00:09:39.346Z] Executing command 1: mklink /d z:\build %cd%
Z:\task_1526342528>mklink /d z:\build Z:\task_1526342528
symbolic link created for z:\build <<===>> Z:\task_1526342528
[taskcluster 2018-05-15T00:09:39.396Z] Exit Code: 0
[taskcluster 2018-05-15T00:09:39.396Z] User Time: 0s
[taskcluster 2018-05-15T00:09:39.396Z] Kernel Time: 0s
[taskcluster 2018-05-15T00:09:39.396Z] Wall Time: 47.0002ms
[taskcluster 2018-05-15T00:09:39.396Z] Result: SUCCEEDED
[taskcluster 2018-05-15T00:09:39.396Z] Executing command 2: icacls z:\build /grant *S-1-1-0:D /L
Z:\task_1526342528>icacls z:\build /grant *S-1-1-0:D /L
processed file: z:\build
Successfully processed 1 files; Failed processing 0 files
[taskcluster 2018-05-15T00:09:39.454Z] Exit Code: 0
[taskcluster 2018-05-15T00:09:39.454Z] User Time: 0s
[taskcluster 2018-05-15T00:09:39.454Z] Kernel Time: 0s
[taskcluster 2018-05-15T00:09:39.454Z] Wall Time: 58.0001ms
[taskcluster 2018-05-15T00:09:39.454Z] Result: SUCCEEDED
[taskcluster 2018-05-15T00:09:39.454Z] Executing command 3: cd /d z:\build
Z:\task_1526342528>cd /d z:\build
[taskcluster 2018-05-15T00:09:39.502Z] Exit Code: 0
[taskcluster 2018-05-15T00:09:39.502Z] User Time: 15.625ms
[taskcluster 2018-05-15T00:09:39.502Z] Kernel Time: 0s
[taskcluster 2018-05-15T00:09:39.502Z] Wall Time: 47.0491ms
[taskcluster 2018-05-15T00:09:39.502Z] Result: SUCCEEDED
[taskcluster 2018-05-15T00:09:39.502Z] Executing command 4: "c:\Program Files\Mercurial\hg.exe" robustcheckout --sharebase y:\hg-shared --purge --upstream https://hg.mozilla.org/mozilla-unified --revision 7581c1e47d4dd23a4b00c9627fbfc626cf48c087 https://hg.mozilla.org/integration/mozilla-inbound .\build\src
z:\build>"c:\Program Files\Mercurial\hg.exe" robustcheckout --sharebase y:\hg-shared --purge --upstream https://hg.mozilla.org/mozilla-unified --revision 7581c1e47d4dd23a4b00c9627fbfc626cf48c087 https://hg.mozilla.org/integration/mozilla-inbound .\build\src
(using Mercurial 4.5.3)
ensuring https://hg.mozilla.org/integration/mozilla-inbound@7581c1e47d4dd23a4b00c9627fbfc626cf48c087 is available at .\build\src
socket error: [Errno 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond
(retrying after network failure on attempt 1 of 3)
(waiting 3.39s before retry)
ensuring https://hg.mozilla.org/integration/mozilla-inbound@7581c1e47d4dd23a4b00c9627fbfc626cf48c087 is available at .\build\src
socket error: [Errno 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond
(retrying after network failure on attempt 2 of 3)
(waiting 8.74s before retry)
ensuring https://hg.mozilla.org/integration/mozilla-inbound@7581c1e47d4dd23a4b00c9627fbfc626cf48c087 is available at .\build\src
socket error: [Errno 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond
PERFHERDER_DATA: {"framework": {"name": "vcs"}, "suites": [{"extraOptions": ["c4.4xlarge"], "lowerIsBetter": true, "name": "overall", "shouldAlert": false, "subtests": [], "value": 138.41100001335144}]}
Traceback (most recent call last):
  File "mercurial\scmutil.pyc", line 154, in callcatch
  File "mercurial\dispatch.pyc", line 314, in _runcatchfunc
  File "mercurial\dispatch.pyc", line 918, in _dispatch
  File "mercurial\dispatch.pyc", line 673, in runcommand
  File "mercurial\dispatch.pyc", line 926, in _runcommand
  File "mercurial\dispatch.pyc", line 915, in <lambda>
  File "mercurial\util.pyc", line 1195, in check
  File "C:/mozilla-build/robustcheckout.py", line 260, in robustcheckout
  File "C:/mozilla-build/robustcheckout.py", line 473, in _docheckout
  File "C:/mozilla-build/robustcheckout.py", line 295, in callself
  File "C:/mozilla-build/robustcheckout.py", line 473, in _docheckout
  File "C:/mozilla-build/robustcheckout.py", line 295, in callself
  File "C:/mozilla-build/robustcheckout.py", line 472, in _docheckout
  File "C:/mozilla-build/robustcheckout.py", line 448, in handlepullerror
  File "C:/mozilla-build/robustcheckout.py", line 401, in handlenetworkfailure
Abort: reached maximum number of network attempts; giving up
abort: reached maximum number of network attempts; giving up
[taskcluster:error] Exit Code: 255
[taskcluster:error] User Time: 0s
[taskcluster:error] Kernel Time: 15.625ms
[taskcluster:error] Wall Time: 2m18.9039991s
[taskcluster:error] Result: FAILED
[taskcluster 2018-05-15T00:11:58.406Z] === Task Finished ===
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Status: REOPENED → RESOLVED
Closed: 7 years ago → 7 years ago
Resolution: --- → WORKSFORME
Status: REOPENED → RESOLVED
Closed: 7 years ago → 6 years ago
Resolution: --- → INCOMPLETE
Status: REOPENED → RESOLVED
Closed: 6 years ago → 6 years ago
Resolution: --- → INCOMPLETE
Status: RESOLVED → REOPENED
Resolution: INCOMPLETE → ---
Status: REOPENED → RESOLVED
Closed: 6 years ago → 6 years ago
Resolution: --- → INCOMPLETE
Status: RESOLVED → REOPENED
Resolution: INCOMPLETE → ---
Status: REOPENED → RESOLVED
Closed: 6 years ago → 6 years ago
Resolution: --- → INCOMPLETE
Status: RESOLVED → REOPENED
Resolution: INCOMPLETE → ---

There are 23 total failures in the last 7 days: https://treeherder.mozilla.org/intermittent-failures.html#/bugdetails?startday=2019-10-08&endday=2019-10-15&tree=trunk&bug=1451080

Recent failure log: https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=271223279&repo=autoland&lineNumber=96

The only affected platform so far is linux64 opt.

Connor, can you redirect this to someone who can take a look at it?

Flags: needinfo?(sheehan)

Most of these failures have the following in common:

  • The task can't access Taskcluster secrets and thus falls back to the public hg.mozilla.org interface in MDC1.
  • The task hits the socket error 3 times in a row.
  • One or two tasks in the same push experience the same issue.

My initial thought is that the failure to access the TC secret is a symptom of a larger networking issue that also shows up as the socket error. This seems likely because linux64 is the platform in most cases, so there could be an improperly configured task class for that platform. Another theory is that the task is improperly scoped to get the secret, and that something in MDC1 is separately causing intermittent socket errors that show up in a few tasks running at the same time.

Flags: needinfo?(sheehan)

This has quite a spike since November 23rd on the source-test-jsshell-bench SpiderMonkey tests. https://treeherder.mozilla.org/intermittent-failures.html#/bugdetails?startday=2019-11-18&endday=2019-11-25&tree=trunk&bug=1451080

Recent failure log: https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=278016165&repo=autoland&lineNumber=95

Connor, could you please take a look over what's going on here? Thank you.

Flags: needinfo?(sheehan)

Hi Connor, did you have the time to look at Cosmin's comment?

Flags: needinfo?(sheehan)
Flags: needinfo?(sheehan)

Apologies, I had a team onboarding work-week last week and haven't been responding to any needinfo's unless they are urgent.

It's tough to determine what could be happening here. These machines are RelOps managed hosts residing in a Mozilla DC (from the releng-hardware/gecko-t-linux-talos on the first line). They all seem to be failing to talk to hg.mozilla.org over SSL. If these machines are in MDC1, they would be in the same DC, which is especially odd. Why this is affecting only jsshell bench tests, and only on autoland, is even more confusing.

Dave, can you think of anything that might be causing these problems?

Flags: needinfo?(sheehan) → needinfo?(dhouse)

(In reply to Connor Sheehan [:sheehan] from comment #64)

> Apologies, I had a team onboarding work-week last week and haven't been responding to any needinfo's unless they are urgent.
>
> It's tough to determine what could be happening here. These machines are RelOps managed hosts residing in a Mozilla DC (from the releng-hardware/gecko-t-linux-talos on the first line). They all seem to be failing to talk to hg.mozilla.org over SSL. If these machines are in MDC1, they would be in the same DC, which is especially odd. Why this is affecting only jsshell bench tests, and only on autoland, is even more confusing.
>
> Dave, can you think of anything that might be causing these problems?

I cannot think of anything yet, but there must be something because of the odd specific circumstances: only in MDC1, only on linux64 hardware workers, and only with a single test set when run on autoland.

My first thoughts are:

  1. Maybe this one test set has fewer retries configured for this pull, and we might be seeing failures+retries on other workers at these times.
  2. Maybe we have something wrong in our networking configuration for these workers. But then we'd see the same failures on any type of task.
  3. Do these workers have an old version of robustcheckout that is not as robust?

I'm gambling on an outdated robustcheckout first. I'll check the robustcheckout on these linux64 workers (expecting it is ancient).

Flags: needinfo?(dhouse)

Connor, do you think changes in robustcheckout since March could fix this problem?

It looks like you (thanks! :)) updated robustcheckout for the linux64 hardware workers in bug 1535011.

commit a3ea1d4a01712d3dfff733a6e529a9033e6fd35c
Author: Connor Sheehan <sheehan@mozilla.com>
Date:   Wed Mar 20 17:11:35 2019 -0400

    Bug 1535011: vendor latest `robustcheckout` from version-control-tools (#430)

    Recent versions of robustcheckout include new metrics output
    and compatibility with newer versions of Mercurial. This commit
    vendors the latest robustcheckout from version-control-tools,
    revision 8e3bb142dfa9.

commit:

8e3bb142dfa9d1f20b25394baad9a5038a5aea8b
2019-03-14 16:39 +0000 	Connor Sheehan - robustcheckout: add a `serverUrl` key to CI metrics output (Bug 1535011) r=smacleod
Flags: needinfo?(sheehan)

None of the recent changes to robustcheckout would affect this bug, I'm afraid. Most changes in recent memory have been Py3 and Windows related.

The error message we're getting in the logs is "socket error: EOF occurred in violation of protocol". We see this error message when the network error caught by robustcheckout originally came from urllib and the exception itself satisfies `isinstance(e.reason, socket.error)`. The fact that such a low-level exception is being hit by this bug makes me think that option 2 (from comment 65) is likely correct.

It could also be option 1, assuming you're talking about Taskcluster retries. robustcheckout is only written to do three retries, although you can configure min and max jitter values in the Mercurial hgrc file.
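(For readers following along, the retry behaviour being described is roughly the following shape. This is a sketch only, not the actual extension code; the jitter bounds shown are placeholders for whatever the hgrc configures.)
```
# Rough sketch of the retry loop robustcheckout implements (not the real
# extension code): network-level failures are retried up to 3 times with a
# randomized backoff, and the jitter bounds can be tuned via hgrc.
import random
import socket
import time

try:
    from urllib2 import URLError          # Python 2, as on these workers
except ImportError:
    from urllib.error import URLError     # Python 3


def pull_with_retries(pull_once, max_attempts=3, jittermin=3.0, jittermax=10.0):
    """Retry a network operation the way robustcheckout does (sketch only)."""
    for attempt in range(1, max_attempts + 1):
        try:
            return pull_once()
        except URLError as e:
            # Mirrors the check mentioned above: only treat the failure as a
            # transient network error when the underlying reason is a socket
            # error (e.g. a connection timeout or an SSL EOF).
            if not isinstance(e.reason, socket.error):
                raise
            if attempt == max_attempts:
                raise RuntimeError("reached maximum number of network attempts; giving up")
            delay = random.uniform(jittermin, jittermax)
            print("(retrying after network failure on attempt %d of %d)" % (attempt, max_attempts))
            print("(waiting %.2fs before retry)" % delay)
            time.sleep(delay)
```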

Flags: needinfo?(sheehan)

There are 28 total failures in the last 7 days, mostly on linux64 opt; 2 of them are on macosx64-shippable opt.

Recent failure log: https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=285739388&repo=autoland&lineNumber=96

[vcs 2020-01-21T09:42:20.988Z] ensuring https://hg.mozilla.org/integration/autoland@4f49e933eb839b72b69757ba8c633687e474dbae is available at /home/cltbld/checkouts/gecko
[vcs 2020-01-21T09:42:20.989Z] (existing repository shared store: /home/cltbld/checkouts/hg-store/8ba995b74e18334ab3707f27e9eb8f4e37ba3d29/.hg)
[vcs 2020-01-21T09:42:21.701Z] (pulling to obtain 4f49e933eb839b72b69757ba8c633687e474dbae)
[vcs 2020-01-21T09:42:21.861Z] socket error: EOF occurred in violation of protocol (_ssl.c:590)
[vcs 2020-01-21T09:42:21.861Z] (retrying after network failure on attempt 1 of 3)
[vcs 2020-01-21T09:42:21.861Z] (waiting 5.02s before retry)
[vcs 2020-01-21T09:42:26.889Z] ensuring https://hg.mozilla.org/integration/autoland@4f49e933eb839b72b69757ba8c633687e474dbae is available at /home/cltbld/checkouts/gecko
[vcs 2020-01-21T09:42:26.889Z] (existing repository shared store: /home/cltbld/checkouts/hg-store/8ba995b74e18334ab3707f27e9eb8f4e37ba3d29/.hg)
[vcs 2020-01-21T09:42:27.049Z] socket error: EOF occurred in violation of protocol (_ssl.c:590)
[vcs 2020-01-21T09:42:27.049Z] (retrying after network failure on attempt 2 of 3)
[vcs 2020-01-21T09:42:27.049Z] (waiting 5.65s before retry)
[vcs 2020-01-21T09:42:32.701Z] ensuring https://hg.mozilla.org/integration/autoland@4f49e933eb839b72b69757ba8c633687e474dbae is available at /home/cltbld/checkouts/gecko
[vcs 2020-01-21T09:42:32.701Z] (existing repository shared store: /home/cltbld/checkouts/hg-store/8ba995b74e18334ab3707f27e9eb8f4e37ba3d29/.hg)
[vcs 2020-01-21T09:42:32.861Z] socket error: EOF occurred in violation of protocol (_ssl.c:590)
[vcs 2020-01-21T09:42:32.865Z] Traceback (most recent call last):
[vcs 2020-01-21T09:42:32.865Z] File "/usr/lib/python2.7/dist-packages/mercurial/scmutil.py", line 161, in callcatch
[vcs 2020-01-21T09:42:32.865Z] return func()
[vcs 2020-01-21T09:42:32.865Z] File "/usr/lib/python2.7/dist-packages/mercurial/dispatch.py", line 344, in _runcatchfunc
[vcs 2020-01-21T09:42:32.865Z] return _dispatch(req)
[vcs 2020-01-21T09:42:32.865Z] File "/usr/lib/python2.7/dist-packages/mercurial/dispatch.py", line 984, in _dispatch
[vcs 2020-01-21T09:42:32.865Z] cmdpats, cmdoptions)
[vcs 2020-01-21T09:42:32.865Z] File "/usr/lib/python2.7/dist-packages/mercurial/dispatch.py", line 730, in runcommand
[vcs 2020-01-21T09:42:32.865Z] ret = _runcommand(ui, options, cmd, d)
[vcs 2020-01-21T09:42:32.865Z] File "/usr/lib/python2.7/dist-packages/mercurial/dispatch.py", line 992, in _runcommand
[vcs 2020-01-21T09:42:32.865Z] return cmdfunc()
[vcs 2020-01-21T09:42:32.865Z] File "/usr/lib/python2.7/dist-packages/mercurial/dispatch.py", line 981, in <lambda>
[vcs 2020-01-21T09:42:32.865Z] d = lambda: util.checksignature(func)(ui, *args, **strcmdopt)
[vcs 2020-01-21T09:42:32.865Z] File "/usr/lib/python2.7/dist-packages/mercurial/util.py", line 1530, in check
[vcs 2020-01-21T09:42:32.865Z] return func(*args, **kwargs)
[vcs 2020-01-21T09:42:32.865Z] File "/usr/local/lib/hgext/robustcheckout.py", line 185, in robustcheckout
[vcs 2020-01-21T09:42:32.865Z] sparse_profile=sparseprofile)
[vcs 2020-01-21T09:42:32.865Z] File "/usr/local/lib/hgext/robustcheckout.py", line 602, in _docheckout
[vcs 2020-01-21T09:42:32.865Z] return callself()
[vcs 2020-01-21T09:42:32.865Z] File "/usr/local/lib/hgext/robustcheckout.py", line 274, in callself
[vcs 2020-01-21T09:42:32.865Z] sparse_profile=sparse_profile)
[vcs 2020-01-21T09:42:32.865Z] File "/usr/local/lib/hgext/robustcheckout.py", line 461, in _docheckout
[vcs 2020-01-21T09:42:32.865Z] return callself()
[vcs 2020-01-21T09:42:32.865Z] File "/usr/local/lib/hgext/robustcheckout.py", line 274, in callself
[vcs 2020-01-21T09:42:32.865Z] sparse_profile=sparse_profile)
[vcs 2020-01-21T09:42:32.865Z] File "/usr/local/lib/hgext/robustcheckout.py", line 460, in _docheckout
[vcs 2020-01-21T09:42:32.865Z] if handlepullerror(e):
[vcs 2020-01-21T09:42:32.865Z] File "/usr/local/lib/hgext/robustcheckout.py", line 436, in handlepullerror
[vcs 2020-01-21T09:42:32.865Z] handlenetworkfailure()
[vcs 2020-01-21T09:42:32.865Z] File "/usr/local/lib/hgext/robustcheckout.py", line 381, in handlenetworkfailure
[vcs 2020-01-21T09:42:32.865Z] raise error.Abort(b'reached maximum number of network attempts; '
[vcs 2020-01-21T09:42:32.865Z] Abort: reached maximum number of network attempts; giving up
[vcs 2020-01-21T09:42:32.865Z]
[vcs 2020-01-21T09:42:32.865Z] abort: reached maximum number of network attempts; giving up
[vcs 2020-01-21T09:42:32.865Z]
[taskcluster 2020-01-21T09:42:32.891Z] Exit Code: 255
[taskcluster 2020-01-21T09:42:32.891Z] User Time: 288ms
[taskcluster 2020-01-21T09:42:32.891Z] Kernel Time: 32ms
[taskcluster 2020-01-21T09:42:32.891Z] Wall Time: 12.047262683s
[taskcluster 2020-01-21T09:42:32.891Z] Result: FAILED
[taskcluster 2020-01-21T09:42:32.891Z] === Task Finished ===
[taskcluster 2020-01-21T09:42:32.891Z] Task Duration: 12.048786882s
[taskcluster 2020-01-21T09:42:32.891Z] [mounts] Preserving cache: Moving "/home/cltbld/tasks/task_1579599343/home/cltbld/checkouts" to "/home/cltbld/caches/QvQ7HXaSTC6VWX-cFaKMkg"
[taskcluster:error] exit status 255

Update:
There have been 33 failures within the last 7 days, all of them on Linux x64 opt.

Recent log: https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=286741101&repo=autoland&lineNumber=97

Connor, could you please take a look at the recent failures?
There seems to be a spike of failures on Linux as Natalia mentioned above.

Flags: needinfo?(sheehan)

There are 30 total failures in the last 7 days on linux64 opt.

Recent failure log: https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=290536502&repo=autoland&lineNumber=98

There are 23 failures associated with this bug in the last 7 days. These are occurring on linux64 opt.

Over the last 7 days there are 39 failures on this bug. These happen only on linux64 opt.

Here is the most recent failure log: https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=293226376&repo=autoland&lineNumber=96

This bug failed 41 times in the last 7 days. Occurs on linux64 opt build types.

Recent log:
https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=294205582&repo=autoland&lineNumber=96

Connor: Can you please have someone take a look at this bug?

@dhouse Are these requests routed through a proxy server?

Flags: needinfo?(dhouse)

releng-hardware/gecko-t-linux-talos
uses the datacenter proxies:
proxy.dmz.mdc1.mozilla.com [10.48.122.10]
proxy.dmz.mdc2.mozilla.com [10.50.122.10]

Connor, let me know if you want for me to set up one of these workers for you to access or with a test re-trying a pull.
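(For the record, a quick way to confirm what a worker would actually use — a sketch only; it checks the proxy settings urllib would pick up from the environment and any [http_proxy] section in the usual Mercurial config locations, which may differ on these hosts.)
```
# Sketch of a proxy sanity check for one of these workers: what urllib would
# pick up from the environment, plus any [http_proxy] section in the usual
# Mercurial config locations. Paths/locations are assumptions; adjust as needed.
import os

try:
    from urllib import getproxies          # Python 2, as on these workers
except ImportError:
    from urllib.request import getproxies  # Python 3

print("urllib proxies: %r" % getproxies())

for var in ("http_proxy", "https_proxy", "no_proxy", "HTTP_PROXY", "HTTPS_PROXY"):
    if var in os.environ:
        print("env %s=%s" % (var, os.environ[var]))

for hgrc in ("/etc/mercurial/hgrc", os.path.expanduser("~/.hgrc")):
    if not os.path.exists(hgrc):
        continue
    in_proxy = False
    with open(hgrc) as fh:
        for line in fh:
            stripped = line.strip()
            if stripped.startswith("["):
                in_proxy = (stripped == "[http_proxy]")
            elif in_proxy and stripped and not stripped.startswith("#"):
                print("%s: %s" % (hgrc, stripped))
```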

Flags: needinfo?(dhouse)

robustcheckout.py on these workers looks correct, matching the configuration repository:

[root@t-linux64-ms-011 ~]# md5sum /usr/local/lib/hgext/robustcheckout.py
e16e36fb8a283c12b2d60714618eecef  /usr/local/lib/hgext/robustcheckout.py
[root@t-linux64-ms-011 ~]# md5sum /home/cltbld/workspace/mozharness/external_tools/robustcheckout.py
1157d85ed13577db9329be690d6aad82  /home/cltbld/workspace/mozharness/external_tools/robustcheckout.py
[root@t-linux64-ms-011 ~]# md5sum /home/cltbld/checkouts/gecko/testing/mozharness/external_tools/robustcheckout.py
e16e36fb8a283c12b2d60714618eecef  /home/cltbld/checkouts/gecko/testing/mozharness/external_tools/robustcheckout.py
[root@t-linux64-ms-011 ~]# wget -O - https://raw.githubusercontent.com/mozilla-releng/build-puppet/718bb3a0ded3de96b0b5fa83eee1c3d15ef10017/modules/mercurial/files/robustcheckout.py | md5sum
[...]
e16e36fb8a283c12b2d60714618eecef  -

I don't think it is using the "/home/cltbld/workspace/mozharness/external_tools/robustcheckout.py"
e16e36fb8a283c12b2d60714618eecef matches version-control-tools at ff8a7c9c843a9c516588606aaafc39ac03d1eea0

I'm reviewing the config to see if our hg is using the proxy, and if we can turn it off for some testing (maybe on a subset of the linux64 hardware workers).
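(If it helps to sweep more of the linux64 workers, the same spot-check can be scripted — a sketch using the paths and the pinned build-puppet revision from the commands above.)
```
# Sketch: script the same spot-check across workers, hashing the deployed
# robustcheckout.py copies against the pinned build-puppet revision used in
# the wget command above.
import hashlib

try:
    from urllib2 import urlopen            # Python 2, as on these workers
except ImportError:
    from urllib.request import urlopen     # Python 3

REFERENCE = ("https://raw.githubusercontent.com/mozilla-releng/build-puppet/"
             "718bb3a0ded3de96b0b5fa83eee1c3d15ef10017/"
             "modules/mercurial/files/robustcheckout.py")
LOCAL_PATHS = [
    "/usr/local/lib/hgext/robustcheckout.py",
    "/home/cltbld/workspace/mozharness/external_tools/robustcheckout.py",
]

expected = hashlib.md5(urlopen(REFERENCE).read()).hexdigest()
print("reference md5: %s" % expected)

for path in LOCAL_PATHS:
    try:
        with open(path, "rb") as fh:
            digest = hashlib.md5(fh.read()).hexdigest()
    except IOError as exc:
        print("%s -> %s" % (path, exc))
        continue
    print("%s -> %s %s" % (path, digest, "OK" if digest == expected else "MISMATCH"))
```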

Flags: needinfo?(dhouse)

(In reply to Dave House [:dhouse] from comment #92)

> releng-hardware/gecko-t-linux-talos uses the datacenter proxies:
> proxy.dmz.mdc1.mozilla.com [10.48.122.10]
> proxy.dmz.mdc2.mozilla.com [10.50.122.10]
>
> Connor, let me know if you want for me to set up one of these workers for you to access or with a test re-trying a pull.

I was wrong. I checked with Jake and looked in the build-puppet configuration, and verified the proxies are not being used on these workers.

Flags: needinfo?(dhouse)

There are 37 failures associated with this bug in the last 7 days https://treeherder.mozilla.org/intermittent-failures.html#/bugdetails?startday=2020-03-24&endday=2020-03-31&tree=trunk&bug=1451080
Recent failure log: https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=295499712&repo=autoland&lineNumber=99
This has failed on linux64, macosx1014-64, android-4-0-armv7-api16, and osx-shippable, across pgo, opt, and asan build types, with most of the failures on linux64 opt.

Over the last 7 days there are 48 failures present on this bug. These happen on macosx1014-64-qr and linux64.

Here is the most recent failure log: https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=296679202&repo=autoland&lineNumber=98

There are 83 failures associated with this bug in the last 7 days. These are occurring on linux64-shippable and linux64 opt builds.

recent log: https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=298012527&repo=autoland&lineNumber=95
[vcs 2020-04-16T23:36:36.411Z] socket error: EOF occurred in violation of protocol (_ssl.c:590)
[vcs 2020-04-16T23:36:36.415Z] Traceback (most recent call last):
[vcs 2020-04-16T23:36:36.415Z] File "/usr/lib/python2.7/dist-packages/mercurial/scmutil.py", line 161, in callcatch
[vcs 2020-04-16T23:36:36.415Z] return func()
[vcs 2020-04-16T23:36:36.415Z] File "/usr/lib/python2.7/dist-packages/mercurial/dispatch.py", line 344, in _runcatchfunc
[vcs 2020-04-16T23:36:36.415Z] return _dispatch(req)
[vcs 2020-04-16T23:36:36.415Z] File "/usr/lib/python2.7/dist-packages/mercurial/dispatch.py", line 984, in _dispatch
[vcs 2020-04-16T23:36:36.415Z] cmdpats, cmdoptions)
[vcs 2020-04-16T23:36:36.415Z] File "/usr/lib/python2.7/dist-packages/mercurial/dispatch.py", line 730, in runcommand
[vcs 2020-04-16T23:36:36.415Z] ret = _runcommand(ui, options, cmd, d)
[vcs 2020-04-16T23:36:36.415Z] File "/usr/lib/python2.7/dist-packages/mercurial/dispatch.py", line 992, in _runcommand
[vcs 2020-04-16T23:36:36.415Z] return cmdfunc()
[vcs 2020-04-16T23:36:36.415Z] File "/usr/lib/python2.7/dist-packages/mercurial/dispatch.py", line 981, in <lambda>
[vcs 2020-04-16T23:36:36.415Z] d = lambda: util.checksignature(func)(ui, *args, **strcmdopt)
[vcs 2020-04-16T23:36:36.415Z] File "/usr/lib/python2.7/dist-packages/mercurial/util.py", line 1530, in check
[vcs 2020-04-16T23:36:36.415Z] return func(*args, **kwargs)
[vcs 2020-04-16T23:36:36.415Z] File "/usr/local/lib/hgext/robustcheckout.py", line 185, in robustcheckout
[vcs 2020-04-16T23:36:36.415Z] sparse_profile=sparseprofile)
[vcs 2020-04-16T23:36:36.415Z] File "/usr/local/lib/hgext/robustcheckout.py", line 461, in _docheckout
[vcs 2020-04-16T23:36:36.415Z] return callself()
[vcs 2020-04-16T23:36:36.415Z] File "/usr/local/lib/hgext/robustcheckout.py", line 274, in callself
[vcs 2020-04-16T23:36:36.415Z] sparse_profile=sparse_profile)
[vcs 2020-04-16T23:36:36.415Z] File "/usr/local/lib/hgext/robustcheckout.py", line 461, in _docheckout
[vcs 2020-04-16T23:36:36.415Z] return callself()
[vcs 2020-04-16T23:36:36.415Z] File "/usr/local/lib/hgext/robustcheckout.py", line 274, in callself
[vcs 2020-04-16T23:36:36.415Z] sparse_profile=sparse_profile)
[vcs 2020-04-16T23:36:36.415Z] File "/usr/local/lib/hgext/robustcheckout.py", line 460, in _docheckout
[vcs 2020-04-16T23:36:36.415Z] if handlepullerror(e):
[vcs 2020-04-16T23:36:36.415Z] File "/usr/local/lib/hgext/robustcheckout.py", line 436, in handlepullerror
[vcs 2020-04-16T23:36:36.415Z] handlenetworkfailure()
[vcs 2020-04-16T23:36:36.415Z] File "/usr/local/lib/hgext/robustcheckout.py", line 381, in handlenetworkfailure
[vcs 2020-04-16T23:36:36.415Z] raise error.Abort(b'reached maximum number of network attempts; '
[vcs 2020-04-16T23:36:36.415Z] Abort: reached maximum number of network attempts; giving up
[vcs 2020-04-16T23:36:36.415Z]
[vcs 2020-04-16T23:36:36.415Z] abort: reached maximum number of network attempts; giving up
[vcs 2020-04-16T23:36:36.415Z]
[taskcluster 2020-04-16T23:36:36.444Z] Exit Code: 255
[taskcluster 2020-04-16T23:36:36.444Z] User Time: 168ms
[taskcluster 2020-04-16T23:36:36.444Z] Kernel Time: 8ms
[taskcluster 2020-04-16T23:36:36.444Z] Wall Time: 13.103446892s
[taskcluster 2020-04-16T23:36:36.444Z] Result: FAILED
[taskcluster 2020-04-16T23:36:36.444Z] === Task Finished ===
[taskcluster 2020-04-16T23:36:36.444Z] Task Duration: 13.105217581s
[taskcluster 2020-04-16T23:36:36.444Z] [mounts] Preserving cache: Moving "/home/cltbld/tasks/task_1587080152/home/cltbld/checkouts" to "/home/cltbld/caches/V3pwC4KoSt2nL1xBVE5H4w"
[taskcluster:error] exit status 255

:sheehan any updates on this?

Connor there seems to be an increase in failures here, can you take a look?

Flags: needinfo?(sheehan)
Flags: needinfo?(sheehan)

The issue over the weekend might be related to whatever happened in bug 1633152.

This bug failed 41 times in the last 7 days. Occurs on linux64-shippable opt build types:

Recent log:
https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=302567102&repo=autoland&lineNumber=96

Connor: Any updates on this bug?

Whiteboard: [stockwell infra] → [stockwell infra][stockwell needswork]

There were 26 failures in the past 7 days. Happens on android-5-0-x86_64 debug, lint, linux64-shippable opt and windows2012-32 opt.

Recent failure log: https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=318148519&repo=autoland&lineNumber=41

See Also: → 1889955

I don't have any meaningful updates here. Please re-needinfo me if more action is warranted on this bug.

Flags: needinfo?(sheehan)