Open Bug 1451080 Opened 7 years ago Updated 23 days ago

Intermittent abort: reached maximum number of network attempts; giving up

Categories

(Developer Services :: Mercurial: hg.mozilla.org, defect, P5)


Tracking

(Not tracked)

REOPENED

People

(Reporter: intermittent-bug-filer, Unassigned)

References

Details

(Keywords: intermittent-failure, Whiteboard: [stockwell infra][stockwell unknown])

Filed by: apavel [at] mozilla.com

https://treeherder.mozilla.org/logviewer.html#?job_id=171689371&repo=autoland
https://queue.taskcluster.net/v1/task/JBOula_1SQitmhjHyW04XQ/runs/0/artifacts/public/logs/live_backing.log

[taskcluster 2018-04-03 17:58:30.125Z] === Task Starting ===
[setup 2018-04-03T17:58:30.465Z] run-task started
[cache 2018-04-03T17:58:30.466Z] cache /builds/worker/checkouts exists; requirements: gid=1000 uid=1000 version=1
[cache 2018-04-03T17:58:30.467Z] cache /builds/worker/tooltool-cache exists; requirements: gid=1000 uid=1000 version=1
[cache 2018-04-03T17:58:30.467Z] cache /builds/worker/workspace is empty; writing requirements: gid=1000 uid=1000 version=1
[volume 2018-04-03T17:58:30.467Z] volume /builds/worker/checkouts is a cache
[volume 2018-04-03T17:58:30.467Z] volume /builds/worker/tooltool-cache is a cache
[volume 2018-04-03T17:58:30.467Z] volume /builds/worker/workspace is a cache
[setup 2018-04-03T17:58:30.467Z] running as worker:worker
[vcs 2018-04-03T17:58:30.467Z] executing ['hg', 'robustcheckout', '--sharebase', '/builds/worker/checkouts/hg-store', '--purge', '--upstream', 'https://hg.mozilla.org/mozilla-unified', '--revision', '8f3d33f7c72babbf232f1938e982d4dd5f54f142', 'https://hg.mozilla.org/integration/autoland', '/builds/worker/workspace/build/src']
[vcs 2018-04-03T17:58:30.526Z] (using Mercurial 4.5.2)
[vcs 2018-04-03T17:58:30.526Z] ensuring https://hg.mozilla.org/integration/autoland@8f3d33f7c72babbf232f1938e982d4dd5f54f142 is available at /builds/worker/workspace/build/src
[vcs 2018-04-03T18:02:45.212Z] socket error: [Errno 110] Connection timed out
[vcs 2018-04-03T18:02:45.212Z] (retrying after network failure on attempt 1 of 3)
[vcs 2018-04-03T18:02:45.213Z] (waiting 6.22s before retry)
[vcs 2018-04-03T18:02:51.441Z] ensuring https://hg.mozilla.org/integration/autoland@8f3d33f7c72babbf232f1938e982d4dd5f54f142 is available at /builds/worker/workspace/build/src
[vcs 2018-04-03T18:07:06.074Z] socket error: [Errno 110] Connection timed out
[vcs 2018-04-03T18:07:06.074Z] (retrying after network failure on attempt 2 of 3)
[vcs 2018-04-03T18:07:06.074Z] (waiting 8.37s before retry)
[vcs 2018-04-03T18:07:14.449Z] ensuring https://hg.mozilla.org/integration/autoland@8f3d33f7c72babbf232f1938e982d4dd5f54f142 is available at /builds/worker/workspace/build/src
[vcs 2018-04-03T18:11:28.986Z] socket error: [Errno 110] Connection timed out
[vcs 2018-04-03T18:11:28.990Z] PERFHERDER_DATA: {"framework": {"name": "vcs"}, "suites": [{"extraOptions": ["m4.4xlarge"], "lowerIsBetter": true, "name": "overall", "shouldAlert": false, "subtests": [], "value": 778.4606559276581}]}
[vcs 2018-04-03T18:11:28.990Z] abort: reached maximum number of network attempts; giving up
[vcs 2018-04-03T18:11:28.990Z]
[taskcluster 2018-04-03 18:11:29.287Z] === Task Finished ===
[taskcluster 2018-04-03 18:11:29.415Z] Artifact "public/build" not found at "/builds/worker/artifacts/"
[taskcluster 2018-04-03 18:11:29.818Z] Unsuccessful task run with exit code: 255 completed in 780.697 seconds
:kmoir, is this your team, relops, or some other team? Without a triage owner I'm not sure who to redirect this bug to.
Flags: needinfo?(kmoir)
Whiteboard: [stockwell infra]
It looks like robustcheckout failed and subsequent retries failed as well. I suspect an intermittent network error, because the revision exists and can be checked out in other jobs in that taskgroup. I would advise rerunning the failed task. If this is an ongoing issue we can investigate.
Flags: needinfo?(kmoir)
dhouse: this smells like we may be hitting capacity limits on hg.mo. Would you be able to look at the zlb logs to see if we're hitting capacity on the hgweb machines and timing out on the zlb?
Flags: needinfo?(dhouse)
I found logs on the zeus nodes covering all activity for the last 2 hours. The latest event in the Treeherder search was this afternoon, so it was not in those logs. I'll check tomorrow to see if I can catch one within the 2-hour window and review the logs on zeus. I also downloaded the failure events from zeus for the last ten days, and the few failures I spot-checked from Treeherder have no matching failure events recorded.
Kendall showed me around the logs and where they are archived. I've requested access to the archives in bug 1453475.

hgweb is on zlb5 and zlb3:
63.245.215.25/24 zlb5.ops.scl3.mozilla.com/eth0.5
63.245.215.102/24 zlb3.ops.scl3.mozilla.com/eth0.5

Logs go to /var/log/zeus/hg.mozilla.org-ssl.access_%{%Y-%m-%d-%H}t and are rsync'd to metrics-logger1.private.scl3:/data/logs/
I have not yet found failures in the logs for the occurrence at 7:48 PDT today (all 200s):
```
[11/Apr/2018:07:31:26 -0700] "GET /integration/autoland/rev/73e6728b22c5 HTTP/1.1" 200 10118 "-" "Mozilla/5.0 (compatible; The Lounge IRC Client; +https://github.com/thelounge/lounge)" "ssl:SSL_RSA_WITH_AES_128_GCM_SHA256, version=TLSv1.2, bits=128" "node:10.22.74.175:80" node_s:1.025409 req_s:1.400907 retries:0
[11/Apr/2018:07:32:06 -0700] "GET /integration/autoland/json-automationrelevance/73e6728b22c5bc2a79c650907c4e30fcd84f55a0 HTTP/1.1" 200 1408 "-" "python-requests/2.9.1" "ssl:SSL_RSA_WITH_AES_128_GCM_SHA256, version=TLSv1.2, bits=128" "node:10.22.74.175:80" node_s:0.144212 req_s:0.235547 retries:0
[11/Apr/2018:07:39:42 -0700] "GET /integration/autoland/rev/73e6728b22c5 HTTP/1.1" 200 3447 "https://bugzilla.mozilla.org/" "Mozilla/5.0 (X11; Linux x86_64; rv:61.0) Gecko/20100101 Firefox/61.0" "ssl:SSL_ECDHE_RSA_WITH_AES_128_GCM_SHA256, version=TLSv1.2, bits=128" "node:10.22.74.175:80" node_s:1.177323 req_s:1.467772 retries:0
[11/Apr/2018:07:39:42 -0700] "GET /integration/autoland/static/mercurial.js HTTP/1.1" 200 14266 "https://hg.mozilla.org/integration/autoland/rev/73e6728b22c5" "Mozilla/5.0 (X11; Linux x86_64; rv:61.0) Gecko/20100101 Firefox/61.0" "ssl:SSL_ECDHE_RSA_WITH_AES_128_GCM_SHA256, version=TLSv1.2, bits=128" "node:10.22.74.174:80" node_s:0.020804 req_s:0.021473 retries:0
[11/Apr/2018:07:39:42 -0700] "GET /integration/autoland/static/style-gitweb.css HTTP/1.1" 200 3129 "https://hg.mozilla.org/integration/autoland/rev/73e6728b22c5" "Mozilla/5.0 (X11; Linux x86_64; rv:61.0) Gecko/20100101 Firefox/61.0" "ssl:SSL_ECDHE_RSA_WITH_AES_128_GCM_SHA256, version=TLSv1.2, bits=128" "node:10.22.74.175:80" node_s:0.036429 req_s:0.067729 retries:0
[11/Apr/2018:07:39:43 -0700] "GET /integration/autoland/static/moz-logo-bw-rgb.svg HTTP/1.1" 200 2270 "https://hg.mozilla.org/integration/autoland/rev/73e6728b22c5" "Mozilla/5.0 (X11; Linux x86_64; rv:61.0) Gecko/20100101 Firefox/61.0" "ssl:SSL_ECDHE_RSA_WITH_AES_128_GCM_SHA256, version=TLSv1.2, bits=128" "node:10.22.74.174:80" node_s:0.016045 req_s:0.016045 retries:0
[11/Apr/2018:07:49:51 -0700] "GET /integration/autoland/json-pushes?changeset=73e6728b22c5bc2a79c650907c4e30fcd84f55a0 HTTP/1.1" 200 698 "-" "autophone" "ssl:SSL_RSA_WITH_AES_128_GCM_SHA256, version=TLSv1.2, bits=128" "node:10.22.74.161:80" node_s:0.051242 req_s:0.071920 retries:0
```
That was on zlb5; the same 200s appear in that timeframe on zlb3. I do not think I'm seeing the requests from robustcheckout (urllib2) in the failure timeframe:
```
[11/Apr/2018:07:31:26 -0700] "GET /integration/autoland/rev/73e6728b22c5 HTTP/1.1" 200 10118 "-" "Mozilla/5.0 (compatible; The Lounge IRC Client; +https://github.com/thelounge/lounge)" "ssl:SSL_RSA_WITH_AES_128_GCM_SHA256, version=TLSv1.2, bits=128" "node:10.22.74.174:80" node_s:1.124576 req_s:1.552493 retries:0
[11/Apr/2018:07:31:26 -0700] "GET /integration/autoland/rev/73e6728b22c5 HTTP/1.1" 200 10118 "-" "Mozilla/5.0 (compatible; The Lounge IRC Client; +https://github.com/thelounge/lounge)" "ssl:SSL_RSA_WITH_AES_128_GCM_SHA256, version=TLSv1.2, bits=128" "node:10.22.74.176:80" node_s:1.244409 req_s:1.665172 retries:0
[11/Apr/2018:07:31:28 -0700] "GET /integration/autoland/raw-file/73e6728b22c5bc2a79c650907c4e30fcd84f55a0/.taskcluster.yml HTTP/1.1" 200 2947 "-" "node-superagent/1.8.5" "ssl:SSL_RSA_WITH_AES_128_GCM_SHA256, version=TLSv1.2, bits=128" "node:10.22.74.161:80" node_s:0.335373 req_s:0.432932 retries:0
[11/Apr/2018:07:49:52 -0700] "GET /integration/autoland/json-pushes?changeset=73e6728b22c5bc2a79c650907c4e30fcd84f55a0 HTTP/1.1" 200 698 "-" "autophone" "ssl:SSL_RSA_WITH_AES_128_GCM_SHA256, version=TLSv1.2, bits=128" "node:10.22.74.161:80" node_s:0.055291 req_s:0.084248 retries:0
[11/Apr/2018:07:49:52 -0700] "GET /integration/autoland/json-pushes?changeset=73e6728b22c5bc2a79c650907c4e30fcd84f55a0 HTTP/1.1" 200 698 "-" "autophone" "ssl:SSL_RSA_WITH_AES_128_GCM_SHA256, version=TLSv1.2, bits=128" "node:10.22.74.176:80" node_s:0.042835 req_s:0.062447 retries:0
[11/Apr/2018:07:49:53 -0700] "GET /integration/autoland/raw-rev/73e6728b22c5bc2a79c650907c4e30fcd84f55a0 HTTP/1.1" 200 1365 "-" "autophone" "ssl:SSL_RSA_WITH_AES_128_GCM_SHA256, version=TLSv1.2, bits=128" "node:10.22.74.176:80" node_s:1.291405 req_s:1.317616 retries:0
```
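(For anyone repeating this kind of search later: a minimal sketch of the filtering used here, in Python. The log path, time window, and the "mercurial" user-agent substring are placeholders/assumptions, not taken from the actual tooling.)
```
# Hypothetical helper: narrow a zeus/zlb access log to a time window and look
# for Mercurial-client requests. Log path, window, and UA substring are
# placeholders/assumptions.
from datetime import datetime, timedelta

LOG = "/var/log/zeus/hg.mozilla.org-ssl.access_2018-04-11-07"  # placeholder
START = datetime(2018, 4, 11, 7, 30)
END = START + timedelta(minutes=30)


def parse_ts(line):
    # Entries start like: [11/Apr/2018:07:31:26 -0700] "GET ..."
    raw = line.split("]", 1)[0].lstrip("[").rsplit(" ", 1)[0]
    return datetime.strptime(raw, "%d/%b/%Y:%H:%M:%S")


with open(LOG) as fh:
    for line in fh:
        try:
            ts = parse_ts(line)
        except (ValueError, IndexError):
            continue  # skip lines without a parseable timestamp
        if START <= ts <= END and "mercurial" in line.lower():
            print(line.rstrip())
```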
See Also: → 1451193
Treeherder found this failure 3 times this morning. I checked the latest, failing between 6:30 and 6:43 AM PDT from IP 34.207.64.52, and found no record of the requests in the zeus logs. (As with the earlier occurrences, other tools/IPs are requesting the same raw-file URLs and getting successes.)
Greg, I've added a see-also onto bug 1451193. It doesn't look like the requests are reaching the zlb, and we're not seeing any failures or capacity issues.
Flags: needinfo?(dhouse)
I experienced this error on one of my own Try pushes. The error also coincided with nagios alerts about servers hitting max worker capacity, and they were at max worker capacity because ActiveData was hammering hg.mo with thousands of HTTP requests from many concurrent clients. So this appears to be an intermittent worker-pool-at-capacity issue, or at least that's a partial cause.
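(Not a fix, but an overload pattern like this should be visible in the same zlb access logs discussed above. A rough sketch that counts requests per user agent to spot one client dominating the traffic; the log path is a placeholder.)
```
# Rough sketch: count requests per user agent in a zlb access log to spot a
# single client (e.g. an ActiveData-style crawler) dominating the traffic.
# The log path is a placeholder; the UA is the third quoted field in these logs.
import re
from collections import Counter

LOG = "/var/log/zeus/hg.mozilla.org-ssl.access_2018-04-24-10"  # placeholder

counts = Counter()
with open(LOG) as fh:
    for line in fh:
        quoted = re.findall(r'"([^"]*)"', line)
        if len(quoted) >= 3:
            # quoted fields: request, referrer, user agent, ssl info, node
            counts[quoted[2]] += 1

for agent, n in counts.most_common(10):
    print("%8d  %s" % (n, agent))
```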
No failures in >2 weeks.
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Recent failure log: https://treeherder.mozilla.org/logviewer.html#?job_id=178446831&repo=mozilla-inbound&lineNumber=104

[taskcluster 2018-05-15T00:09:37.555Z] === Task Starting ===
[taskcluster 2018-05-15T00:09:38.819Z] Uploading redirect artifact public/logs/live.log to URL https://elluejqaaaawgynk222j42rissfzgc7q6cktwplkx6y2r526.taskcluster-worker.net:60023/log/XtEc0nXRStSiUkA3H1SPwQ with mime type "text/plain; charset=utf-8" and expiry 2018-05-15T02:39:36.116Z
[taskcluster 2018-05-15T00:09:39.306Z] Executing command 0: :: sccache currently uses the full compiler commandline as input to the
[taskcluster 2018-05-15T00:09:39.306Z] :: cache hash key, so create a symlink to the task dir and build from
[taskcluster 2018-05-15T00:09:39.306Z] :: the symlink dir to get consistent paths.
[taskcluster 2018-05-15T00:09:39.306Z] if exist z:\build rmdir z:\build
Z:\task_1526342528>if exist z:\build rmdir z:\build
[taskcluster 2018-05-15T00:09:39.346Z] Exit Code: 0
[taskcluster 2018-05-15T00:09:39.346Z] User Time: 0s
[taskcluster 2018-05-15T00:09:39.346Z] Kernel Time: 15.625ms
[taskcluster 2018-05-15T00:09:39.346Z] Wall Time: 40.0037ms
[taskcluster 2018-05-15T00:09:39.346Z] Result: SUCCEEDED
[taskcluster 2018-05-15T00:09:39.346Z] Executing command 1: mklink /d z:\build %cd%
Z:\task_1526342528>mklink /d z:\build Z:\task_1526342528
symbolic link created for z:\build <<===>> Z:\task_1526342528
[taskcluster 2018-05-15T00:09:39.396Z] Exit Code: 0
[taskcluster 2018-05-15T00:09:39.396Z] User Time: 0s
[taskcluster 2018-05-15T00:09:39.396Z] Kernel Time: 0s
[taskcluster 2018-05-15T00:09:39.396Z] Wall Time: 47.0002ms
[taskcluster 2018-05-15T00:09:39.396Z] Result: SUCCEEDED
[taskcluster 2018-05-15T00:09:39.396Z] Executing command 2: icacls z:\build /grant *S-1-1-0:D /L
Z:\task_1526342528>icacls z:\build /grant *S-1-1-0:D /L
processed file: z:\build
Successfully processed 1 files; Failed processing 0 files
[taskcluster 2018-05-15T00:09:39.454Z] Exit Code: 0
[taskcluster 2018-05-15T00:09:39.454Z] User Time: 0s
[taskcluster 2018-05-15T00:09:39.454Z] Kernel Time: 0s
[taskcluster 2018-05-15T00:09:39.454Z] Wall Time: 58.0001ms
[taskcluster 2018-05-15T00:09:39.454Z] Result: SUCCEEDED
[taskcluster 2018-05-15T00:09:39.454Z] Executing command 3: cd /d z:\build
Z:\task_1526342528>cd /d z:\build
[taskcluster 2018-05-15T00:09:39.502Z] Exit Code: 0
[taskcluster 2018-05-15T00:09:39.502Z] User Time: 15.625ms
[taskcluster 2018-05-15T00:09:39.502Z] Kernel Time: 0s
[taskcluster 2018-05-15T00:09:39.502Z] Wall Time: 47.0491ms
[taskcluster 2018-05-15T00:09:39.502Z] Result: SUCCEEDED
[taskcluster 2018-05-15T00:09:39.502Z] Executing command 4: "c:\Program Files\Mercurial\hg.exe" robustcheckout --sharebase y:\hg-shared --purge --upstream https://hg.mozilla.org/mozilla-unified --revision 7581c1e47d4dd23a4b00c9627fbfc626cf48c087 https://hg.mozilla.org/integration/mozilla-inbound .\build\src
z:\build>"c:\Program Files\Mercurial\hg.exe" robustcheckout --sharebase y:\hg-shared --purge --upstream https://hg.mozilla.org/mozilla-unified --revision 7581c1e47d4dd23a4b00c9627fbfc626cf48c087 https://hg.mozilla.org/integration/mozilla-inbound .\build\src
(using Mercurial 4.5.3)
ensuring https://hg.mozilla.org/integration/mozilla-inbound@7581c1e47d4dd23a4b00c9627fbfc626cf48c087 is available at .\build\src
socket error: [Errno 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond
(retrying after network failure on attempt 1 of 3)
(waiting 3.39s before retry)
ensuring https://hg.mozilla.org/integration/mozilla-inbound@7581c1e47d4dd23a4b00c9627fbfc626cf48c087 is available at .\build\src
socket error: [Errno 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond
(retrying after network failure on attempt 2 of 3)
(waiting 8.74s before retry)
ensuring https://hg.mozilla.org/integration/mozilla-inbound@7581c1e47d4dd23a4b00c9627fbfc626cf48c087 is available at .\build\src
socket error: [Errno 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond
PERFHERDER_DATA: {"framework": {"name": "vcs"}, "suites": [{"extraOptions": ["c4.4xlarge"], "lowerIsBetter": true, "name": "overall", "shouldAlert": false, "subtests": [], "value": 138.41100001335144}]}
Traceback (most recent call last):
  File "mercurial\scmutil.pyc", line 154, in callcatch
  File "mercurial\dispatch.pyc", line 314, in _runcatchfunc
  File "mercurial\dispatch.pyc", line 918, in _dispatch
  File "mercurial\dispatch.pyc", line 673, in runcommand
  File "mercurial\dispatch.pyc", line 926, in _runcommand
  File "mercurial\dispatch.pyc", line 915, in <lambda>
  File "mercurial\util.pyc", line 1195, in check
  File "C:/mozilla-build/robustcheckout.py", line 260, in robustcheckout
  File "C:/mozilla-build/robustcheckout.py", line 473, in _docheckout
  File "C:/mozilla-build/robustcheckout.py", line 295, in callself
  File "C:/mozilla-build/robustcheckout.py", line 473, in _docheckout
  File "C:/mozilla-build/robustcheckout.py", line 295, in callself
  File "C:/mozilla-build/robustcheckout.py", line 472, in _docheckout
  File "C:/mozilla-build/robustcheckout.py", line 448, in handlepullerror
  File "C:/mozilla-build/robustcheckout.py", line 401, in handlenetworkfailure
Abort: reached maximum number of network attempts; giving up
abort: reached maximum number of network attempts; giving up
[taskcluster:error] Exit Code: 255
[taskcluster:error] User Time: 0s
[taskcluster:error] Kernel Time: 15.625ms
[taskcluster:error] Wall Time: 2m18.9039991s
[taskcluster:error] Result: FAILED
[taskcluster 2018-05-15T00:11:58.406Z] === Task Finished ===
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Status: REOPENED → RESOLVED
Closed: 7 years ago → 7 years ago
Resolution: --- → WORKSFORME
Status: REOPENED → RESOLVED
Closed: 7 years ago → 6 years ago
Resolution: --- → INCOMPLETE
Status: REOPENED → RESOLVED
Closed: 6 years ago → 6 years ago
Resolution: --- → INCOMPLETE
Status: RESOLVED → REOPENED
Resolution: INCOMPLETE → ---
Status: REOPENED → RESOLVED
Closed: 6 years ago → 6 years ago
Resolution: --- → INCOMPLETE
Status: RESOLVED → REOPENED
Resolution: INCOMPLETE → ---
Status: REOPENED → RESOLVED
Closed: 6 years ago → 6 years ago
Resolution: --- → INCOMPLETE
Status: RESOLVED → REOPENED
Resolution: INCOMPLETE → ---

There are 23 total failures in the last 7 days: https://treeherder.mozilla.org/intermittent-failures.html#/bugdetails?startday=2019-10-08&endday=2019-10-15&tree=trunk&bug=1451080

Recent failure log: https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=271223279&repo=autoland&lineNumber=96

The only affected platform so far is linux64 opt.

Connor, can you redirect this to someone who can take a look at it?

Flags: needinfo?(sheehan)

Most of these failures have the following in common:

  • The task can't access Taskcluster secrets and thus falls back to the public hg.mozilla.org interface in MDC1.
  • The task hits the socket error 3 times in a row.
  • One or two tasks in the same push experience the same issue.

My initial thought is that the failure to access the TC secret is a symptom of a larger networking issue that also shows up as the socket error. This seems likely because linux64 is the platform in most cases, so there could be an improperly configured task class for that platform. Another theory is that the task is improperly scoped to get the secret, and that something in MDC1 is separately causing intermittent socket errors that show up in a few tasks running at the same time.

Flags: needinfo?(sheehan)

This has quite a spike since November 23rd on the source-test-jsshell-bench SpiderMonkey tests. https://treeherder.mozilla.org/intermittent-failures.html#/bugdetails?startday=2019-11-18&endday=2019-11-25&tree=trunk&bug=1451080

Recent failure log: https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=278016165&repo=autoland&lineNumber=95

Connor, could you please take a look over what's going on here? Thank you.

Flags: needinfo?(sheehan)

Hi Connor, did you have the time to look at Cosmin's comment?

Flags: needinfo?(sheehan)
Flags: needinfo?(sheehan)

Apologies, I had a team onboarding work-week last week and haven't been responding to any needinfo's unless they are urgent.

It's tough to determine what could be happening here. These machines are RelOps managed hosts residing in a Mozilla DC (from the releng-hardware/gecko-t-linux-talos on the first line). They all seem to be failing to talk to hg.mozilla.org over SSL. If these machines are in MDC1, they would be in the same DC, which is especially odd. Why this is affecting only jsshell bench tests, and only on autoland, is even more confusing.

Dave, can you think of anything that might be causing these problems?

Flags: needinfo?(sheehan) → needinfo?(dhouse)

(In reply to Connor Sheehan [:sheehan] from comment #64)

> Apologies, I had a team onboarding work-week last week and haven't been responding to any needinfo's unless they are urgent.
>
> It's tough to determine what could be happening here. These machines are RelOps managed hosts residing in a Mozilla DC (from the releng-hardware/gecko-t-linux-talos on the first line). They all seem to be failing to talk to hg.mozilla.org over SSL. If these machines are in MDC1, they would be in the same DC, which is especially odd. Why this is affecting only jsshell bench tests, and only on autoland, is even more confusing.
>
> Dave, can you think of anything that might be causing these problems?

I cannot think of anything yet, but there must be something because of the odd specific circumstances: only in MDC1, only on linux64 hardware workers, and only with a single test set when run on autoland.

My first thoughts are:

  1. Maybe this one test set has fewer retries configured for this pull, and we might be seeing failures+retries on other workers at these times.
  2. Maybe we have something wrong in our networking configuration for these workers. But then we'd see the same failures on any type of task.
  3. Do these workers have an old version of robustcheckout that is not as robust?

I'm gambling on an outdated robustcheckout first. I'll check the robustcheckout on these linux64 workers (expecting it is ancient).

Flags: needinfo?(dhouse)

Connor, do you think changes in robustcheckout since March could fix this problem?

It looks like you (thanks! :)) updated robustcheckout for the linux64 hardware workers in bug 1535011.

commit a3ea1d4a01712d3dfff733a6e529a9033e6fd35c
Author: Connor Sheehan <sheehan@mozilla.com>
Date:   Wed Mar 20 17:11:35 2019 -0400

    Bug 1535011: vendor latest `robustcheckout` from version-control-tools (#430)

    Recent versions of robustcheckout include new metrics output
    and compatibility with newer versions of Mercurial. This commit
    vendors the latest robustcheckout from version-control-tools,
    revision 8e3bb142dfa9.

commit:

8e3bb142dfa9d1f20b25394baad9a5038a5aea8b
2019-03-14 16:39 +0000 	Connor Sheehan - robustcheckout: add a `serverUrl` key to CI metrics output (Bug 1535011) r=smacleod
Flags: needinfo?(sheehan)

None of the recent changes to robustcheckout would affect this bug, I'm afraid. Most changes in recent memory have been Py3 and Windows related.

The error message we're getting in the logs is "socket error: EOF occurred in violation of protocol". We see this error message when the network error caught by robustcheckout originally came from urllib and the exception itself satisfies `isinstance(e.reason, socket.error)`. The fact that such a low-level exception is being hit by this bug makes me think that option 2 (from comment 65) is likely correct.

It could also be option 1, assuming you're talking about Taskcluster retries. robustcheckout is only written to do three retries, although you can configure min and max jitter values in the Mercurial hgrc file.
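(For readers following along, the retry behaviour being described is roughly the following shape. This is a sketch only, not the actual extension code; the jitter bounds shown are placeholders for whatever the hgrc configures.)
```
# Rough sketch of the retry loop robustcheckout implements (not the real
# extension code): network-level failures are retried up to 3 times with a
# randomized backoff, and the jitter bounds can be tuned via hgrc.
import random
import socket
import time

try:
    from urllib2 import URLError          # Python 2, as on these workers
except ImportError:
    from urllib.error import URLError     # Python 3


def pull_with_retries(pull_once, max_attempts=3, jittermin=3.0, jittermax=10.0):
    """Retry a network operation the way robustcheckout does (sketch only)."""
    for attempt in range(1, max_attempts + 1):
        try:
            return pull_once()
        except URLError as e:
            # Mirrors the check mentioned above: only treat the failure as a
            # transient network error when the underlying reason is a socket
            # error (e.g. a connection timeout or an SSL EOF).
            if not isinstance(e.reason, socket.error):
                raise
            if attempt == max_attempts:
                raise RuntimeError("reached maximum number of network attempts; giving up")
            delay = random.uniform(jittermin, jittermax)
            print("(retrying after network failure on attempt %d of %d)" % (attempt, max_attempts))
            print("(waiting %.2fs before retry)" % delay)
            time.sleep(delay)
```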

Flags: needinfo?(sheehan)

There are 28 total failures in the last 7 days, mostly on linux64 opt; 2 of them are on macosx64-shippable opt.

Recent failure log: https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=285739388&repo=autoland&lineNumber=96

[vcs 2020-01-21T09:42:20.988Z] ensuring https://hg.mozilla.org/integration/autoland@4f49e933eb839b72b69757ba8c633687e474dbae is available at /home/cltbld/checkouts/gecko
[vcs 2020-01-21T09:42:20.989Z] (existing repository shared store: /home/cltbld/checkouts/hg-store/8ba995b74e18334ab3707f27e9eb8f4e37ba3d29/.hg)
[vcs 2020-01-21T09:42:21.701Z] (pulling to obtain 4f49e933eb839b72b69757ba8c633687e474dbae)
[vcs 2020-01-21T09:42:21.861Z] socket error: EOF occurred in violation of protocol (_ssl.c:590)
[vcs 2020-01-21T09:42:21.861Z] (retrying after network failure on attempt 1 of 3)
[vcs 2020-01-21T09:42:21.861Z] (waiting 5.02s before retry)
[vcs 2020-01-21T09:42:26.889Z] ensuring https://hg.mozilla.org/integration/autoland@4f49e933eb839b72b69757ba8c633687e474dbae is available at /home/cltbld/checkouts/gecko
[vcs 2020-01-21T09:42:26.889Z] (existing repository shared store: /home/cltbld/checkouts/hg-store/8ba995b74e18334ab3707f27e9eb8f4e37ba3d29/.hg)
[vcs 2020-01-21T09:42:27.049Z] socket error: EOF occurred in violation of protocol (_ssl.c:590)
[vcs 2020-01-21T09:42:27.049Z] (retrying after network failure on attempt 2 of 3)
[vcs 2020-01-21T09:42:27.049Z] (waiting 5.65s before retry)
[vcs 2020-01-21T09:42:32.701Z] ensuring https://hg.mozilla.org/integration/autoland@4f49e933eb839b72b69757ba8c633687e474dbae is available at /home/cltbld/checkouts/gecko
[vcs 2020-01-21T09:42:32.701Z] (existing repository shared store: /home/cltbld/checkouts/hg-store/8ba995b74e18334ab3707f27e9eb8f4e37ba3d29/.hg)
[vcs 2020-01-21T09:42:32.861Z] socket error: EOF occurred in violation of protocol (_ssl.c:590)
[vcs 2020-01-21T09:42:32.865Z] Traceback (most recent call last):
[vcs 2020-01-21T09:42:32.865Z] File "/usr/lib/python2.7/dist-packages/mercurial/scmutil.py", line 161, in callcatch
[vcs 2020-01-21T09:42:32.865Z] return func()
[vcs 2020-01-21T09:42:32.865Z] File "/usr/lib/python2.7/dist-packages/mercurial/dispatch.py", line 344, in _runcatchfunc
[vcs 2020-01-21T09:42:32.865Z] return _dispatch(req)
[vcs 2020-01-21T09:42:32.865Z] File "/usr/lib/python2.7/dist-packages/mercurial/dispatch.py", line 984, in _dispatch
[vcs 2020-01-21T09:42:32.865Z] cmdpats, cmdoptions)
[vcs 2020-01-21T09:42:32.865Z] File "/usr/lib/python2.7/dist-packages/mercurial/dispatch.py", line 730, in runcommand
[vcs 2020-01-21T09:42:32.865Z] ret = _runcommand(ui, options, cmd, d)
[vcs 2020-01-21T09:42:32.865Z] File "/usr/lib/python2.7/dist-packages/mercurial/dispatch.py", line 992, in _runcommand
[vcs 2020-01-21T09:42:32.865Z] return cmdfunc()
[vcs 2020-01-21T09:42:32.865Z] File "/usr/lib/python2.7/dist-packages/mercurial/dispatch.py", line 981, in <lambda>
[vcs 2020-01-21T09:42:32.865Z] d = lambda: util.checksignature(func)(ui, *args, **strcmdopt)
[vcs 2020-01-21T09:42:32.865Z] File "/usr/lib/python2.7/dist-packages/mercurial/util.py", line 1530, in check
[vcs 2020-01-21T09:42:32.865Z] return func(*args, **kwargs)
[vcs 2020-01-21T09:42:32.865Z] File "/usr/local/lib/hgext/robustcheckout.py", line 185, in robustcheckout
[vcs 2020-01-21T09:42:32.865Z] sparse_profile=sparseprofile)
[vcs 2020-01-21T09:42:32.865Z] File "/usr/local/lib/hgext/robustcheckout.py", line 602, in _docheckout
[vcs 2020-01-21T09:42:32.865Z] return callself()
[vcs 2020-01-21T09:42:32.865Z] File "/usr/local/lib/hgext/robustcheckout.py", line 274, in callself
[vcs 2020-01-21T09:42:32.865Z] sparse_profile=sparse_profile)
[vcs 2020-01-21T09:42:32.865Z] File "/usr/local/lib/hgext/robustcheckout.py", line 461, in _docheckout
[vcs 2020-01-21T09:42:32.865Z] return callself()
[vcs 2020-01-21T09:42:32.865Z] File "/usr/local/lib/hgext/robustcheckout.py", line 274, in callself
[vcs 2020-01-21T09:42:32.865Z] sparse_profile=sparse_profile)
[vcs 2020-01-21T09:42:32.865Z] File "/usr/local/lib/hgext/robustcheckout.py", line 460, in _docheckout
[vcs 2020-01-21T09:42:32.865Z] if handlepullerror(e):
[vcs 2020-01-21T09:42:32.865Z] File "/usr/local/lib/hgext/robustcheckout.py", line 436, in handlepullerror
[vcs 2020-01-21T09:42:32.865Z] handlenetworkfailure()
[vcs 2020-01-21T09:42:32.865Z] File "/usr/local/lib/hgext/robustcheckout.py", line 381, in handlenetworkfailure
[vcs 2020-01-21T09:42:32.865Z] raise error.Abort(b'reached maximum number of network attempts; '
[vcs 2020-01-21T09:42:32.865Z] Abort: reached maximum number of network attempts; giving up
[vcs 2020-01-21T09:42:32.865Z]
[vcs 2020-01-21T09:42:32.865Z] abort: reached maximum number of network attempts; giving up
[vcs 2020-01-21T09:42:32.865Z]
[taskcluster 2020-01-21T09:42:32.891Z] Exit Code: 255
[taskcluster 2020-01-21T09:42:32.891Z] User Time: 288ms
[taskcluster 2020-01-21T09:42:32.891Z] Kernel Time: 32ms
[taskcluster 2020-01-21T09:42:32.891Z] Wall Time: 12.047262683s
[taskcluster 2020-01-21T09:42:32.891Z] Result: FAILED
[taskcluster 2020-01-21T09:42:32.891Z] === Task Finished ===
[taskcluster 2020-01-21T09:42:32.891Z] Task Duration: 12.048786882s
[taskcluster 2020-01-21T09:42:32.891Z] [mounts] Preserving cache: Moving "/home/cltbld/tasks/task_1579599343/home/cltbld/checkouts" to "/home/cltbld/caches/QvQ7HXaSTC6VWX-cFaKMkg"
[taskcluster:error] exit status 255

Update:
There have been 33 failures within the last 7 days, all of them on Linux x64 opt.

Recent log: https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=286741101&repo=autoland&lineNumber=97

Connor, could you please take a look at the recent failures?
There seems to be a spike of failures on Linux as Natalia mentioned above.

Flags: needinfo?(sheehan)

There are 30 total failures in the last 7 days on linux64 opt.

Recent failure log: https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=290536502&repo=autoland&lineNumber=98

There are 23 failures associated with this bug in the last 7 days. These are occurring on linux64 opt.

Over the last 7 days there are 39 failures on this bug. These happen only on linux64 opt.

Here is the most recent failure log: https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=293226376&repo=autoland&lineNumber=96

This bug failed 41 times in the last 7 days. Occurs on linux64 opt build types.

Recent log:
https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=294205582&repo=autoland&lineNumber=96

Connor: Can you please have someone take a look at this bug?

@dhouse Are these requests routed through a proxy server?

Flags: needinfo?(dhouse)

releng-hardware/gecko-t-linux-talos
uses the datacenter proxies:
proxy.dmz.mdc1.mozilla.com [10.48.122.10]
proxy.dmz.mdc2.mozilla.com [10.50.122.10]

Connor, let me know if you want for me to set up one of these workers for you to access or with a test re-trying a pull.
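(For the record, a quick way to confirm what a worker would actually use — a sketch only; it checks the proxy settings urllib would pick up from the environment and any [http_proxy] section in the usual Mercurial config locations, which may differ on these hosts.)
```
# Sketch of a proxy sanity check for one of these workers: what urllib would
# pick up from the environment, plus any [http_proxy] section in the usual
# Mercurial config locations. Paths/locations are assumptions; adjust as needed.
import os

try:
    from urllib import getproxies          # Python 2, as on these workers
except ImportError:
    from urllib.request import getproxies  # Python 3

print("urllib proxies: %r" % getproxies())

for var in ("http_proxy", "https_proxy", "no_proxy", "HTTP_PROXY", "HTTPS_PROXY"):
    if var in os.environ:
        print("env %s=%s" % (var, os.environ[var]))

for hgrc in ("/etc/mercurial/hgrc", os.path.expanduser("~/.hgrc")):
    if not os.path.exists(hgrc):
        continue
    in_proxy = False
    with open(hgrc) as fh:
        for line in fh:
            stripped = line.strip()
            if stripped.startswith("["):
                in_proxy = (stripped == "[http_proxy]")
            elif in_proxy and stripped and not stripped.startswith("#"):
                print("%s: %s" % (hgrc, stripped))
```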

Flags: needinfo?(dhouse)

robustcheckout.py on these workers looks correct, matching the configuration repository:

[root@t-linux64-ms-011 ~]# md5sum /usr/local/lib/hgext/robustcheckout.py
e16e36fb8a283c12b2d60714618eecef  /usr/local/lib/hgext/robustcheckout.py
[root@t-linux64-ms-011 ~]# md5sum /home/cltbld/workspace/mozharness/external_tools/robustcheckout.py
1157d85ed13577db9329be690d6aad82  /home/cltbld/workspace/mozharness/external_tools/robustcheckout.py
[root@t-linux64-ms-011 ~]# md5sum /home/cltbld/checkouts/gecko/testing/mozharness/external_tools/robustcheckout.py
e16e36fb8a283c12b2d60714618eecef  /home/cltbld/checkouts/gecko/testing/mozharness/external_tools/robustcheckout.py
[root@t-linux64-ms-011 ~]# wget -O - https://raw.githubusercontent.com/mozilla-releng/build-puppet/718bb3a0ded3de96b0b5fa83eee1c3d15ef10017/modules/mercurial/files/robustcheckout.py | md5sum
[...]
e16e36fb8a283c12b2d60714618eecef  -

I don't think it is using the "/home/cltbld/workspace/mozharness/external_tools/robustcheckout.py"
e16e36fb8a283c12b2d60714618eecef matches version-control-tools at ff8a7c9c843a9c516588606aaafc39ac03d1eea0

I'm reviewing the config to see if our hg is using the proxy, and if we can turn it off for some testing (maybe on a subset of the linux64 hardware workers).
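(If it helps to sweep more of the linux64 workers, the same spot-check can be scripted — a sketch using the paths and the pinned build-puppet revision from the commands above.)
```
# Sketch: script the same spot-check across workers, hashing the deployed
# robustcheckout.py copies against the pinned build-puppet revision used in
# the wget command above.
import hashlib

try:
    from urllib2 import urlopen            # Python 2, as on these workers
except ImportError:
    from urllib.request import urlopen     # Python 3

REFERENCE = ("https://raw.githubusercontent.com/mozilla-releng/build-puppet/"
             "718bb3a0ded3de96b0b5fa83eee1c3d15ef10017/"
             "modules/mercurial/files/robustcheckout.py")
LOCAL_PATHS = [
    "/usr/local/lib/hgext/robustcheckout.py",
    "/home/cltbld/workspace/mozharness/external_tools/robustcheckout.py",
]

expected = hashlib.md5(urlopen(REFERENCE).read()).hexdigest()
print("reference md5: %s" % expected)

for path in LOCAL_PATHS:
    try:
        with open(path, "rb") as fh:
            digest = hashlib.md5(fh.read()).hexdigest()
    except IOError as exc:
        print("%s -> %s" % (path, exc))
        continue
    print("%s -> %s %s" % (path, digest, "OK" if digest == expected else "MISMATCH"))
```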

Flags: needinfo?(dhouse)

(In reply to Dave House [:dhouse] from comment #92)

> releng-hardware/gecko-t-linux-talos uses the datacenter proxies:
> proxy.dmz.mdc1.mozilla.com [10.48.122.10]
> proxy.dmz.mdc2.mozilla.com [10.50.122.10]
>
> Connor, let me know if you want for me to set up one of these workers for you to access or with a test re-trying a pull.

I was wrong. I checked with Jake and looked in the build-puppet configuration, and verified the proxies are not being used on these workers.

Flags: needinfo?(dhouse)

There are 37 failures associated with this bug in the last 7 days https://treeherder.mozilla.org/intermittent-failures.html#/bugdetails?startday=2020-03-24&endday=2020-03-31&tree=trunk&bug=1451080
Recent failure log: https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=295499712&repo=autoland&lineNumber=99
This has failed on linux64, macosx1014-64, android-4-0-armv7-api16, and osx-shippable, across pgo, opt, and asan build types, with most of the failures on linux64 opt.

Over the last 7 days there are 48 failures present on this bug. These happen on macosx1014-64-qr and linux64.

Here is the most recent failure log: https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=296679202&repo=autoland&lineNumber=98

There are 83 failures associated with this bug in the last 7 days. These are occurring on linux64-shippable and linux64 opt builds.

recent log: https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=298012527&repo=autoland&lineNumber=95
[vcs 2020-04-16T23:36:36.411Z] socket error: EOF occurred in violation of protocol (_ssl.c:590)
[vcs 2020-04-16T23:36:36.415Z] Traceback (most recent call last):
[vcs 2020-04-16T23:36:36.415Z] File "/usr/lib/python2.7/dist-packages/mercurial/scmutil.py", line 161, in callcatch
[vcs 2020-04-16T23:36:36.415Z] return func()
[vcs 2020-04-16T23:36:36.415Z] File "/usr/lib/python2.7/dist-packages/mercurial/dispatch.py", line 344, in _runcatchfunc
[vcs 2020-04-16T23:36:36.415Z] return _dispatch(req)
[vcs 2020-04-16T23:36:36.415Z] File "/usr/lib/python2.7/dist-packages/mercurial/dispatch.py", line 984, in _dispatch
[vcs 2020-04-16T23:36:36.415Z] cmdpats, cmdoptions)
[vcs 2020-04-16T23:36:36.415Z] File "/usr/lib/python2.7/dist-packages/mercurial/dispatch.py", line 730, in runcommand
[vcs 2020-04-16T23:36:36.415Z] ret = _runcommand(ui, options, cmd, d)
[vcs 2020-04-16T23:36:36.415Z] File "/usr/lib/python2.7/dist-packages/mercurial/dispatch.py", line 992, in _runcommand
[vcs 2020-04-16T23:36:36.415Z] return cmdfunc()
[vcs 2020-04-16T23:36:36.415Z] File "/usr/lib/python2.7/dist-packages/mercurial/dispatch.py", line 981, in <lambda>
[vcs 2020-04-16T23:36:36.415Z] d = lambda: util.checksignature(func)(ui, *args, **strcmdopt)
[vcs 2020-04-16T23:36:36.415Z] File "/usr/lib/python2.7/dist-packages/mercurial/util.py", line 1530, in check
[vcs 2020-04-16T23:36:36.415Z] return func(*args, **kwargs)
[vcs 2020-04-16T23:36:36.415Z] File "/usr/local/lib/hgext/robustcheckout.py", line 185, in robustcheckout
[vcs 2020-04-16T23:36:36.415Z] sparse_profile=sparseprofile)
[vcs 2020-04-16T23:36:36.415Z] File "/usr/local/lib/hgext/robustcheckout.py", line 461, in _docheckout
[vcs 2020-04-16T23:36:36.415Z] return callself()
[vcs 2020-04-16T23:36:36.415Z] File "/usr/local/lib/hgext/robustcheckout.py", line 274, in callself
[vcs 2020-04-16T23:36:36.415Z] sparse_profile=sparse_profile)
[vcs 2020-04-16T23:36:36.415Z] File "/usr/local/lib/hgext/robustcheckout.py", line 461, in _docheckout
[vcs 2020-04-16T23:36:36.415Z] return callself()
[vcs 2020-04-16T23:36:36.415Z] File "/usr/local/lib/hgext/robustcheckout.py", line 274, in callself
[vcs 2020-04-16T23:36:36.415Z] sparse_profile=sparse_profile)
[vcs 2020-04-16T23:36:36.415Z] File "/usr/local/lib/hgext/robustcheckout.py", line 460, in _docheckout
[vcs 2020-04-16T23:36:36.415Z] if handlepullerror(e):
[vcs 2020-04-16T23:36:36.415Z] File "/usr/local/lib/hgext/robustcheckout.py", line 436, in handlepullerror
[vcs 2020-04-16T23:36:36.415Z] handlenetworkfailure()
[vcs 2020-04-16T23:36:36.415Z] File "/usr/local/lib/hgext/robustcheckout.py", line 381, in handlenetworkfailure
[vcs 2020-04-16T23:36:36.415Z] raise error.Abort(b'reached maximum number of network attempts; '
[vcs 2020-04-16T23:36:36.415Z] Abort: reached maximum number of network attempts; giving up
[vcs 2020-04-16T23:36:36.415Z]
[vcs 2020-04-16T23:36:36.415Z] abort: reached maximum number of network attempts; giving up
[vcs 2020-04-16T23:36:36.415Z]
[taskcluster 2020-04-16T23:36:36.444Z] Exit Code: 255
[taskcluster 2020-04-16T23:36:36.444Z] User Time: 168ms
[taskcluster 2020-04-16T23:36:36.444Z] Kernel Time: 8ms
[taskcluster 2020-04-16T23:36:36.444Z] Wall Time: 13.103446892s
[taskcluster 2020-04-16T23:36:36.444Z] Result: FAILED
[taskcluster 2020-04-16T23:36:36.444Z] === Task Finished ===
[taskcluster 2020-04-16T23:36:36.444Z] Task Duration: 13.105217581s
[taskcluster 2020-04-16T23:36:36.444Z] [mounts] Preserving cache: Moving "/home/cltbld/tasks/task_1587080152/home/cltbld/checkouts" to "/home/cltbld/caches/V3pwC4KoSt2nL1xBVE5H4w"
[taskcluster:error] exit status 255

:sheehan any updates on this?

Connor there seems to be an increase in failures here, can you take a look?

Flags: needinfo?(sheehan)
Flags: needinfo?(sheehan)

The issue over the weekend might be related to whatever happened in bug 1633152.

This bug failed 41 times in the last 7 days. Occurs on linux64-shippable opt build types:

Recent log:
https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=302567102&repo=autoland&lineNumber=96

Connor: Any updates on this bug?

Whiteboard: [stockwell infra] → [stockwell infra][stockwell needswork]

There were 26 failures in the past 7 days. Happens on android-5-0-x86_64 debug, lint, linux64-shippable opt and windows2012-32 opt.

Recent failure log: https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=318148519&repo=autoland&lineNumber=41

See Also: → 1889955

I don't have any meaningful updates here. Please re-needinfo me if more action is warranted on this bug.

Flags: needinfo?(sheehan)