Closed Bug 1481291 Opened 6 years ago Closed 9 months ago

Intermittent abort: HTTP Error 500: Internal Server Error after unhandled URLError

Categories

(Developer Services :: Mercurial: hg.mozilla.org, defect, P5)

defect

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: intermittent-bug-filer, Assigned: sheehan)

References

Details

(Keywords: intermittent-failure, Whiteboard: [stockwell infra])

Attachments

(5 files)

Filed by: apavel [at] mozilla.com

https://treeherder.mozilla.org/logviewer.html#?job_id=192324452&repo=mozilla-beta

https://queue.taskcluster.net/v1/task/C3UDurVvQ9GtqG6XBDiTuA/runs/0/artifacts/public/logs/live_backing.log

[vcs 2018-08-06T18:54:23.735Z] executing ['hg', 'robustcheckout', '--sharebase', '/builds/worker/checkouts/hg-store', '--purge', '--upstream', 'https://hg.mozilla.org/mozilla-unified', '--revision', 'f7876ad4ed3ed8660fe0982e6898bcde59580f60', 'https://hg.mozilla.org/releases/mozilla-beta', '/builds/worker/workspace/build/src']
[vcs 2018-08-06T18:54:23.795Z] (using Mercurial 4.5.2)
[vcs 2018-08-06T18:54:23.795Z] ensuring https://hg.mozilla.org/releases/mozilla-beta@f7876ad4ed3ed8660fe0982e6898bcde59580f60 is available at /builds/worker/workspace/build/src
[vcs 2018-08-06T18:54:24.226Z] unhandled URLError; reason type: str; value: Internal Server ErrorPERFHERDER_DATA: {"framework": {"name": "vcs"}, "suites": [{"extraOptions": ["m4.4xlarge"], "lowerIsBetter": true, "name": "overall", "shouldAlert": false, "subtests": [], "value": 0.42868590354919434}]}
[vcs 2018-08-06T18:54:24.226Z] abort: HTTP Error 500: Internal Server Error
[taskcluster 2018-08-06 18:54:24.866Z] === Task Finished ===
[taskcluster 2018-08-06 18:54:24.983Z] Artifact "public/build" not found at "/builds/worker/artifacts/"
[taskcluster 2018-08-06 18:54:25.458Z] Unsuccessful task run with exit code: 255 completed in 4.04 seconds
https://wiki.mozilla.org/Bug_Triage#Intermittent_Test_Failure_Cleanup
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → INCOMPLETE

There are 24 total failures in the last 7 days on android, lint, linux64, osx and win.

Recent failure log: https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=264522423&repo=autoland&lineNumber=32

[vcs 2019-09-02T08:58:05.613Z] unhandled URLError; reason type: str; value: Internal Server ErrorPERFHERDER_DATA: {"framework": {"name": "vcs"}, "suites": [{"extraOptions": ["m5.4xlarge"], "lowerIsBetter": true, "name": "clone_errored", "serverUrl": "us-east-1.hgmointernal.net", "shouldAlert": false, "subtests": [], "value": 0.6516640186309814}, {"extraOptions": ["m5.4xlarge"], "lowerIsBetter": true, "name": "overall", "serverUrl": "us-east-1.hgmointernal.net", "shouldAlert": false, "subtests": [], "value": 0.7938621044158936}, {"extraOptions": ["m5.4xlarge"], "lowerIsBetter": true, "name": "overall_pull", "serverUrl": "us-east-1.hgmointernal.net", "shouldAlert": false, "subtests": [], "value": 0.7938621044158936}, {"extraOptions": ["m5.4xlarge"], "lowerIsBetter": true, "name": "overall_pull_fullcheckout", "serverUrl": "us-east-1.hgmointernal.net", "shouldAlert": false, "subtests": [], "value": 0.7938621044158936}, {"extraOptions": ["m5.4xlarge"], "lowerIsBetter": true, "name": "overall_pull_populatedwdir", "serverUrl": "us-east-1.hgmointernal.net", "shouldAlert": false, "subtests": [], "value": 0.7938621044158936}]}
[vcs 2019-09-02T08:58:05.613Z] abort: HTTP Error 500: Internal Server Error
[taskcluster 2019-09-02 08:58:05.977Z] === Task Finished ===
[taskcluster 2019-09-02 08:58:06.094Z] Artifact "public/build/maven" not found at "/builds/worker/workspace/build/src/obj-firefox/gradle/build/mobile/android/geckoview/maven/"
[taskcluster 2019-09-02 08:58:06.206Z] Artifact "public/build/geckoview_example.apk" not found at "/builds/worker/workspace/build/src/obj-firefox/gradle/build/mobile/android/geckoview_example/outputs/apk/withGeckoBinaries/debug/geckoview_example-withGeckoBinaries-debug.apk"
[taskcluster 2019-09-02 08:58:06.290Z] Artifact "public/build" not found at "/builds/worker/artifacts/"
[taskcluster 2019-09-02 08:58:06.406Z] Artifact "public/build/geckoview-androidTest.apk" not found at "/builds/worker/workspace/build/src/obj-firefox/gradle/build/mobile/android/geckoview/outputs/apk/androidTest/withGeckoBinaries/debug/geckoview-withGeckoBinaries-debug-androidTest.apk"
[taskcluster 2019-09-02 08:58:06.510Z] Artifact "public/logs" not found at "/builds/worker/logs/"
[taskcluster 2019-09-02 08:58:06.815Z] Unsuccessful task run with exit code: 255 completed in 3.942 seconds

Geoff can this get the stockwell infra whiteboard label?

Flags: needinfo?(gbrown)

Sure. That seems appropriate.

Flags: needinfo?(gbrown)
Whiteboard: [stockwell infra]

We keep hitting this today:
https://treeherder.mozilla.org/#/jobs?repo=autoland&resultStatus=testfailed%2Cbusted%2Cexception&searchStr=build&fromchange=7b3c76142ad70ce1fa8df3d55f32206b42cf3e4b&selectedJob=269616224

and also https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=269616453&repo=autoland&lineNumber=86

[vcs 2019-10-03T14:56:27.292Z] socket error: TLS/SSL connection has been closed (EOF) (_ssl.c:661)
[vcs 2019-10-03T14:56:27.292Z] (retrying after network failure on attempt 1 of 3)
[vcs 2019-10-03T14:56:27.294Z] (waiting 2.85s before retry)
[vcs 2019-10-03T14:57:03.295Z] ensuring https://hg.mozilla.org/integration/autoland@eb3f06b807587f023efa7d9e919df5b72c71a156 is available at Z:\task_1570112862\build\src
[vcs 2019-10-03T14:57:03.300Z] unhandled URLError; reason type: str; value: bad HTTP status line: ''PERFHERDER_DATA: {"framework": {"name": "vcs"}, "suites": [{"extraOptions": ["c4.4xlarge"], "lowerIsBetter": true, "name": "overall", "serverUrl": "hg.mozilla.org", "shouldAlert": false, "subtests": [], "value": 61.12000012397766}, {"extraOptions": ["c4.4xlarge"], "lowerIsBetter": true, "name": "overall_nopull", "serverUrl": "hg.mozilla.org", "shouldAlert": false, "subtests": [], "value": 61.12000012397766}, {"extraOptions": ["c4.4xlarge"], "lowerIsBetter": true, "name": "overall_nopull_fullcheckout", "serverUrl": "hg.mozilla.org", "shouldAlert": false, "subtests": [], "value": 61.12000012397766}, {"extraOptions": ["c4.4xlarge"], "lowerIsBetter": true, "name": "overall_nopull_populatedwdir", "serverUrl": "hg.mozilla.org", "shouldAlert": false, "subtests": [], "value": 61.12000012397766}]}
[vcs 2019-10-03T14:57:03.300Z] Traceback (most recent call last):
[vcs 2019-10-03T14:57:03.300Z] File "mercurial\scmutil.pyc", line 161, in callcatch
[vcs 2019-10-03T14:57:03.300Z] File "mercurial\dispatch.pyc", line 344, in _runcatchfunc
[vcs 2019-10-03T14:57:03.300Z] File "mercurial\dispatch.pyc", line 984, in _dispatch
[vcs 2019-10-03T14:57:03.300Z] File "mercurial\dispatch.pyc", line 730, in runcommand
[vcs 2019-10-03T14:57:03.300Z] File "mercurial\dispatch.pyc", line 992, in _runcommand
[vcs 2019-10-03T14:57:03.300Z] File "mercurial\dispatch.pyc", line 981, in <lambda>
[vcs 2019-10-03T14:57:03.300Z] File "mercurial\util.pyc", line 1528, in check
[vcs 2019-10-03T14:57:03.300Z] File "C:/mozilla-build/robustcheckout.py", line 283, in robustcheckout
[vcs 2019-10-03T14:57:03.300Z] File "C:/mozilla-build/robustcheckout.py", line 551, in _docheckout
[vcs 2019-10-03T14:57:03.300Z] File "C:/mozilla-build/robustcheckout.py", line 372, in callself
[vcs 2019-10-03T14:57:03.300Z] File "C:/mozilla-build/robustcheckout.py", line 544, in _docheckout
[vcs 2019-10-03T14:57:03.300Z] File "mercurial\hg.pyc", line 189, in peer
[vcs 2019-10-03T14:57:03.300Z] File "mercurial\hg.pyc", line 163, in _peerorrepo
[vcs 2019-10-03T14:57:03.300Z] File "mercurial\httppeer.pyc", line 988, in instance
[vcs 2019-10-03T14:57:03.300Z] File "mercurial\httppeer.pyc", line 952, in makepeer
[vcs 2019-10-03T14:57:03.300Z] File "mercurial\httppeer.pyc", line 872, in performhandshake
[vcs 2019-10-03T14:57:03.300Z] File "mercurial\httppeer.pyc", line 311, in sendrequest
[vcs 2019-10-03T14:57:03.300Z] File "urllib2.pyc", line 429, in open
[vcs 2019-10-03T14:57:03.300Z] File "urllib2.pyc", line 447, in _open
[vcs 2019-10-03T14:57:03.300Z] File "urllib2.pyc", line 407, in _call_chain
[vcs 2019-10-03T14:57:03.301Z] File "mercurial\url.pyc", line 391, in https_open
[vcs 2019-10-03T14:57:03.301Z] File "mercurial\keepalive.pyc", line 246, in do_open
[vcs 2019-10-03T14:57:03.301Z] URLError: <urlopen error bad HTTP status line: ''>
[vcs 2019-10-03T14:57:03.301Z] abort: error: bad HTTP status line: ''
[taskcluster 2019-10-03T14:57:03.331Z] Exit Code: 255
[taskcluster 2019-10-03T14:57:03.331Z] User Time: 0s
[taskcluster 2019-10-03T14:57:03.331Z] Kernel Time: 0s
[taskcluster 2019-10-03T14:57:03.331Z] Wall Time: 1m1.8007817s
[taskcluster 2019-10-03T14:57:03.331Z] Result: FAILED
[taskcluster 2019-10-03T14:57:03.331Z] === Task Finished ===
[taskcluster 2019-10-03T14:57:03.331Z] Task Duration: 1m1.9797908s
[taskcluster:error] Uploading error artifact public/logs from file logs with message "Could not read directory 'Z:\task_1570112862\logs'", reason "file-missing-on-worker" and expiry 2020-10-02T14:52:20.198Z
[taskcluster:error] TASK FAILURE during artifact upload: file-missing-on-worker: Could not read directory 'Z:\task_1570112862\logs'
[taskcluster:error] Uploading error artifact public/build from file public/build with message "Could not read directory 'Z:\task_1570112862\public\build'", reason "file-missing-on-worker" and expiry 2020-10-02T14:52:20.198Z
[taskcluster:error] TASK FAILURE during artifact upload: file-missing-on-worker: Could not read directory 'Z:\task_1570112862\public\build'
[taskcluster 2019-10-03T14:57:04.568Z] Uploading artifact public/logs/certified.log from file generic-worker\certified.log with content encoding "gzip", mime type "text/plain; charset=utf-8" and expiry 2020-10-02T14:52:20.198Z
[taskcluster 2019-10-03T14:57:05.174Z] Uploading artifact public/chain-of-trust.json from file generic-worker\chain-of-trust.json with content encoding "gzip", mime type "text/plain; charset=utf-8" and expiry 2020-10-02T14:52:20.198Z
[taskcluster 2019-10-03T14:57:05.561Z] Uploading artifact public/chain-of-trust.json.sig from file generic-worker\chain-of-trust.json.sig with content encoding "gzip", mime type "application/octet-stream" and expiry 2020-10-02T14:52:20.198Z
[taskcluster 2019-10-03T14:57:06.130Z] Uploading redirect artifact public/logs/live.log to URL https://queue.taskcluster.net/v1/task/esd7_bjWQoSZmKkNJcI_1A/runs/0/artifacts/public/logs/live_backing.log with mime type "text/plain; charset=utf-8" and expiry 2020-10-02T14:52:20.198Z
[taskcluster:error] exit status 255
[taskcluster:error] file-missing-on-worker: Could not read directory 'Z:\task_1570112862\logs'
[taskcluster:error] file-missing-on-worker: Could not read directory 'Z:\task_1570112862\public\build'

TREES ARE CLOSED FOR THIS.

Severity: normal → blocker
Flags: needinfo?(sheehan)

Incident report: https://mozilla.statuspage.io/incidents/fcmbt2v0cx0j

Issue is now solved, trees are open.

Severity: blocker → normal
Flags: needinfo?(sheehan)

For the record, the spike in comment 61 came from a deploy to hgmo that included a race condition (between an extension upgrade and either the associated config update, or different clients which had already connected and communicated with the old version of the extension creating new connections that broke when communicating with the new version).

Hi Connor, can you look into the recent spike here or is that not necessary?

Flags: needinfo?(sheehan)

There were two events that took place around the time of the spike. One is bug 1593126 and the other was a human error of Tom's. Since most of the failures are from autoland, I think the spike correlates to the latter and can be safely deferred as a one-off problem for now.

Flags: needinfo?(sheehan)

Thank you.

Recent failures here are mostly on try.

There are 24 total failures in the last 7 days on all platforms.

Recent failure log: https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=280376418&repo=autoland&lineNumber=38

[taskcluster 2019-12-09 22:14:04.190Z] Downloaded image is corrupted: (HTTP code 404) no such image - no such image: ec9897eeafd6a65aeb03bce7aa330eac: No such image: ec9897eeafd6a65aeb03bce7aa330eac:latest
[taskcluster 2019-12-09 22:14:04.190Z] Downloading artifact "public/image.tar.zst" from task ID: eaJxbTjBQ2mb2ICxRfeFVg.
[taskcluster 2019-12-09 22:14:06.061Z] Downloaded artifact successfully.
[taskcluster 2019-12-09 22:14:06.061Z] Downloaded 252.049 mb
[taskcluster 2019-12-09 22:14:06.061Z] Decompressing downloaded image
[taskcluster 2019-12-09 22:14:07.767Z] Loading docker image from downloaded archive.
[taskcluster 2019-12-09 22:14:17.482Z] Image 'public/image.tar.zst' from task 'eaJxbTjBQ2mb2ICxRfeFVg' loaded. Using image ID sha256:9c39ee7bfc854fb2dc37acb89be4defe9fcbad88c5b9d3bd14de48a1a61e3224.
[taskcluster 2019-12-09 22:14:17.664Z] === Task Starting ===
[setup 2019-12-09T22:14:18.103Z] run-task started in /builds/worker
[cache 2019-12-09T22:14:18.104Z] cache /builds/worker/checkouts exists; requirements: gid=1000 uid=1000 version=1
[cache 2019-12-09T22:14:18.104Z] cache /builds/worker/tooltool-cache exists; requirements: gid=1000 uid=1000 version=1
[cache 2019-12-09T22:14:18.104Z] cache /builds/worker/workspace is empty; writing requirements: gid=1000 uid=1000 version=1
[volume 2019-12-09T22:14:18.104Z] volume /builds/worker/checkouts is a cache
[volume 2019-12-09T22:14:18.104Z] volume /builds/worker/tooltool-cache is a cache
[volume 2019-12-09T22:14:18.104Z] volume /builds/worker/workspace is a cache
[setup 2019-12-09T22:14:18.105Z] running as worker:worker
[vcs 2019-12-09T22:14:18.105Z] fetching hgmointernal config from http://taskcluster/secrets/v1/secret/project/taskcluster/gecko/hgmointernal
[vcs 2019-12-09T22:14:18.233Z] hgmointernal rate hit; cloning from private hgweb mirror
[vcs 2019-12-09T22:14:18.233Z] fetching hg.mozilla.org fingerprint from http://taskcluster/secrets/v1/secret/project/taskcluster/gecko/hgfingerprint
[vcs 2019-12-09T22:14:18.340Z] executing ['hg', 'robustcheckout', '--sharebase', '/builds/worker/checkouts/hg-store', '--purge', '--config', 'hostsecurity.hg.mozilla.org:fingerprints=sha256:17:38:aa:92:0b:84:3e:aa:8e:52:52:e9:4c:2f:98:a9:0e:bf:6c:3e:e9:15:ff:0a:29:80:f7:06:02:5b:e8:48,sha256:8e:ad:f7:6a:eb:44:06:15:ed:f3:e4:69:a6:64:60:37:2d:ff:98:88:37:bf:d7:b8:40:84:01:48:9c:26:ce:d9', '--upstream', 'https://us-east-1.hgmointernal.net/mozilla-unified', '--revision', 'c4144927452f239d86ee82147524096300e5cd60', 'https://us-east-1.hgmointernal.net/integration/autoland', '/builds/worker/workspace/build/src']
[vcs 2019-12-09T22:14:18.399Z] (using Mercurial 4.8.1)
[vcs 2019-12-09T22:14:18.399Z] ensuring https://us-east-1.hgmointernal.net/integration/autoland@c4144927452f239d86ee82147524096300e5cd60 is available at /builds/worker/workspace/build/src
[vcs 2019-12-09T22:14:18.536Z] (cloning from upstream repo https://us-east-1.hgmointernal.net/mozilla-unified)
[vcs 2019-12-09T22:14:18.584Z] (sharing from existing pooled repository 8ba995b74e18334ab3707f27e9eb8f4e37ba3d29)
[vcs 2019-12-09T22:14:18.895Z] searching for changes
[vcs 2019-12-09T22:14:19.188Z] unhandled URLError; reason type: str; value: Internal Server ErrorPERFHERDER_DATA: {"framework": {"name": "vcs"}, "suites": [{"extraOptions": ["m5.4xlarge"], "lowerIsBetter": true, "name": "clone_errored", "serverUrl": "us-east-1.hgmointernal.net", "shouldAlert": false, "subtests": [], "value": 0.6505281925201416}, {"extraOptions": ["m5.4xlarge"], "lowerIsBetter": true, "name": "overall", "serverUrl": "us-east-1.hgmointernal.net", "shouldAlert": false, "subtests": [], "value": 0.7872011661529541}, {"extraOptions": ["m5.4xlarge"], "lowerIsBetter": true, "name": "overall_pull", "serverUrl": "us-east-1.hgmointernal.net", "shouldAlert": false, "subtests": [], "value": 0.7872011661529541}, {"extraOptions": ["m5.4xlarge"], "lowerIsBetter": true, "name": "overall_pull_fullcheckout", "serverUrl": "us-east-1.hgmointernal.net", "shouldAlert": false, "subtests": [], "value": 0.7872011661529541}, {"extraOptions": ["m5.4xlarge"], "lowerIsBetter": true, "name": "overall_pull_populatedwdir", "serverUrl": "us-east-1.hgmointernal.net", "shouldAlert": false, "subtests": [], "value": 0.7872011661529541}]}
[vcs 2019-12-09T22:14:19.188Z] abort: HTTP Error 500: Internal Server Error
[taskcluster 2019-12-09 22:14:19.510Z] === Task Finished ===
[taskcluster 2019-12-09 22:14:19.619Z] Artifact "public/logs" not found at "/builds/worker/logs/"
[taskcluster 2019-12-09 22:14:19.706Z] Artifact "public/build" not found at "/builds/worker/artifacts/"
[taskcluster 2019-12-09 22:14:19.864Z] Unsuccessful task run with exit code: 255 completed in 16.41 seconds

Recent failures here were caused by some hgweb3 issues, alerts and conversation on #vcs, slack.

Issue got solved.

This has an interesting ripple-effect.
Because some jobs in try are failing, they are show as error at review phase:
https://phabricator.services.mozilla.com/D61187#1883463

Kim, is that something on your radar?
thanks

Flags: needinfo?(kmoir)
Flags: needinfo?(zeid)
Flags: needinfo?(sheehan)
Flags: needinfo?(kmoir)

When the code review bot hits the code path where it comments "The analysis task <task name> failed, but we could not detect any issue.", could it not simply re-run the task? From the Taskcluster task view it doesn't seem to have done so.

The failure rate for this bug is pretty low - I'm not sure it's worth the effort required to fix it right now, given other priorities.

Flags: needinfo?(zeid)
Flags: needinfo?(sheehan)

Forgot NI.

Flags: needinfo?(sledru)

We are thinking of doing that, but there is a big drawback as it would increase the time between when the bot detects an error and when it publishes it.
We could mitigate it by making the bot report errors as soon as they are detected, but this too has a drawback: developers might think the analysis is finished when it isn't yet.

Seems rare enough indeed.
Maybe we should indeed do a retry of the task.

Flags: needinfo?(sledru)

I'm not sure if this is happening more frequently lately, or if not all failures are being filed up to this bug.. but it happens all the time in my try pushes. Here's a push with 5 instances of this failure:
https://treeherder.mozilla.org/#/jobs?repo=try&resultStatus=testfailed%2Cbusted%2Cexception%2Cretry%2Cusercancel%2Crunning%2Cpending%2Crunnable&searchStr=source-test&revision=ccd362035562c9c5b623732e9592241d8c43e344

I guess I should start starring it more.

It seems to be specific to source-test tasks which might indicate an issue in the lint image.

(In reply to Andrew Halberstadt [:ahal] from comment #104)

It seems to be specific to source-test tasks which might indicate an issue in the lint image.

Probably because of bug 1589712.

True, maybe that's the only reason it feels like it happens on source-test tasks more often.. other tasks don't clone as much.

Still.. given that other tasks vastly outnumber source-test, it still feels (being very scientific here) disproportionally weighted to source-test. The error happens due to a 500 being returned from the server, so there are likely many underlying causes for this bug.

It's possible in the lint image we are doing something sub-optimally in our robustcheckout call that makes the problem more likely to appear.

This is happening a lot right now on try. Could someone give that a look?

Flags: needinfo?(sheehan)

(In reply to Simon Giesecke [:sg] [he/him] from comment #147)

This is happening a lot right now on try. Could someone give that a look?

There were some issues in that day discussed and fixed here: https://app.slack.com/client/T027LFU12/CH54CSSK0/thread/CH54CSSK0-1606771188.499200

Flags: needinfo?(sheehan)

Hm, it's still happening to me today.

The current calculation we use sets the MaxClients value to just 18 on
the private mirrors. Looking at the logs in Chronograf I can see that the
occurrances of 500s generally happen when we reach the max number of clients.
When we hit the maximum the instances still have CPU and memory to spare,
leading me to believe the Apache config should be tweaked to allow more
concurrent requests.

Also remove an unnecessary comment while we're here.

Assignee: nobody → sheehan

Pushed by cosheehan@mozilla.com:
https://hg.mozilla.org/hgcustom/version-control-tools/rev/860410f1fe86
ansible: set apache MaxClients to 25 on private hgweb mirrors r=mhentges

Status: REOPENED → RESOLVED
Closed: 6 years ago3 years ago
Resolution: --- → FIXED
Status: RESOLVED → REOPENED
Resolution: FIXED → ---

~600 failures in the last week.

Flags: needinfo?(sheehan)

Seeing lots of them today.

Today's failures are focused on t-linux-xlarge-source machines (which run python test suites?), e.g. https://treeherder.mozilla.org/jobs?repo=try&resultStatus=testfailed%2Cbusted%2Cexception%2Cretry%2Cusercancel%2Crunnable&revision=c3af8f493570d672875fa83cd918e8c8b7227eef

If anybody from Developer Services could check the situation (no hg mirror?), this would be appreciated.

It looks like we're hitting capacity on the mirrors in us-east-1 and us-west-2. I'm going to add a new hgweb node in each region where we're seeing failures.

Flags: needinfo?(sheehan)

I see multiple failures of this form on every try push (and have done so for months). Can we at least figure out how to classify these as infrastructure problems so that the job automatically retries?

Expand capacity in us-east-1 and us-west-2 as it seems we've increased
traffic in those regions.

Status: REOPENED → RESOLVED
Closed: 3 years ago3 years ago
Resolution: --- → FIXED
Status: RESOLVED → REOPENED
Resolution: FIXED → ---

Is it possible that this failure is more common on android?
I'm sometimes seeing up to half of the android perf jobs fail, e.g.
https://treeherder.mozilla.org/jobs?repo=try&selectedTaskRun=fVhVdt7HQrCduZkkkx0YMQ.0&tier=1%2C2%2C3&revision=464ab169672eeb54ce6fd542c3e6adca9e8757b3

(In reply to Andrew Creskey [:acreskey] [he/him] from comment #178)

Is it possible that this failure is more common on android?
I'm sometimes seeing up to half of the android perf jobs fail, e.g.
https://treeherder.mozilla.org/jobs?repo=try&selectedTaskRun=fVhVdt7HQrCduZkkkx0YMQ.0&tier=1%2C2%2C3&revision=464ab169672eeb54ce6fd542c3e6adca9e8757b3

The Android (perf) infrastructure is rather different than mainline TC, since the jobs get routed into a different cloud: Bitbar. Or, at least, that was the setup 1.5 years ago. So it's entirely possible that the rates are different for Android. But that doesn't help address your problem :(

In this run we see that 2 of the 4 android performance tests failed due to the "HTTP Error 500: Internal Server Error after unhandled URLError"
https://treeherder.mozilla.org/jobs?repo=mozilla-central&group_state=expanded&selectedTaskRun=Pc_ik6u5RdWJJoVC86f9qw.0&tier=1%2C2%2C3&revision=1d03336aafcfb8af4e39d61a52ff4983aaa4d007&searchStr=perftest%2Cview

We'd like to have these Geckoview and Fenix tests sheriffed, but the failures mean that we miss results.

(In reply to Andrew Creskey [:acreskey] [he/him] from comment #184)

In this run we see that 2 of the 4 android performance tests failed due to the "HTTP Error 500: Internal Server Error after unhandled URLError"
https://treeherder.mozilla.org/jobs?repo=mozilla-central&group_state=expanded&selectedTaskRun=Pc_ik6u5RdWJJoVC86f9qw.0&tier=1%2C2%2C3&revision=1d03336aafcfb8af4e39d61a52ff4983aaa4d007&searchStr=perftest%2Cview

We'd like to have these Geckoview and Fenix tests sheriffed, but the failures mean that we miss results.

For these specific tasks, I can see this in the logs:

[vcs 2021-04-12T04:07:02.821Z] (using Mercurial 4.5.3)
[vcs 2021-04-12T04:07:02.821Z] ensuring https://hg.mozilla.org/mozilla-central@1d03336aafcfb8af4e39d61a52ff4983aaa4d007 is available at /builds/task_161820041870969/checkouts/gecko
[vcs 2021-04-12T04:07:03.094Z] (cloning from upstream repo https://hg.mozilla.org/mozilla-unified)
[vcs 2021-04-12T04:07:03.145Z] (sharing from new pooled repository 8ba995b74e18334ab3707f27e9eb8f4e37ba3d29)
[vcs 2021-04-12T04:07:03.494Z] no compatible clone bundles available on server; falling back to regular clone
[vcs 2021-04-12T04:07:03.494Z] (you may want to report this to the server operator)

It seems the image that runs these tests uses Mercurial 4.5.3. That version of hg is unaware of the sparserevlog repository format and is therefore unable to use clonebundles to clone the repo. We should upgrade the version of hg in those docker images to the latest. This is likely the cause of the inflated failure rates on Android, since other images in CI are on newer versions of hg.

(In reply to Connor Sheehan [:sheehan] from comment #185)

It seems the image that runs these tests uses Mercurial 4.5.3. That version of hg is unaware of the sparserevlog repository format and is therefore unable to use clonebundles to clone the repo. We should upgrade the version of hg in those docker images to the latest. This is likely the cause of the inflated failure rates on Android, since other images in CI are on newer versions of hg.

:sheehan should we file a dependency bug for updating Mercurial in these Docker images, or are you planning to use this bug? Has there been any progress? Can we do anything to help?

Flags: needinfo?(sheehan)

(In reply to Dave Hunt [:davehunt] [he/him] ⌚GMT from comment #192)

(In reply to Connor Sheehan [:sheehan] from comment #185)

It seems the image that runs these tests uses Mercurial 4.5.3. That version of hg is unaware of the sparserevlog repository format and is therefore unable to use clonebundles to clone the repo. We should upgrade the version of hg in those docker images to the latest. This is likely the cause of the inflated failure rates on Android, since other images in CI are on newer versions of hg.

:sheehan should we file a dependency bug for updating Mercurial in these Docker images, or are you planning to use this bug? Has there been any progress? Can we do anything to help?

We should probably file a new bug in the relevant component for those Docker images. If you could point out where the images are defined I can take a crack at updating them to the latest hg. It's probably just a requirements.txt file update, or we can move the install process from something like a deb package to a requirements.txt file.

Flags: needinfo?(sheehan) → needinfo?(dave.hunt)

:aerickson are you able to direct :sheehan to the images?

Flags: needinfo?(dave.hunt) → needinfo?(aerickson)

I believe they're all located in tree at taskcluster/docker. Mihai/Releng would know best.

Updating the Ubuntu 18.04 Docker image is problematic (see https://bugzilla.mozilla.org/show_bug.cgi?id=1678451#c6).

Flags: needinfo?(aerickson)

Hi Connor, can you please take a look?

There are 217 total failures in the last 2 days on multiple platforms. Recent failure: https://treeherder.mozilla.org/logviewer?job_id=364747532&repo=autoland&lineNumber=29

setup 2022-01-20T03:04:06.206Z] running as worker:worker
[vcs 2022-01-20T03:04:06.206Z] fetching hgmointernal config from http://taskcluster/secrets/v1/secret/project/taskcluster/gecko/hgmointernal
[vcs 2022-01-20T03:04:06.300Z] hgmointernal rate miss; using public hg.mozilla.org service
[vcs 2022-01-20T03:04:06.300Z] fetching hg.mozilla.org fingerprint from http://taskcluster/secrets/v1/secret/project/taskcluster/gecko/hgfingerprint
[vcs 2022-01-20T03:04:06.369Z] executing ['hg', 'robustcheckout', '--sharebase', '/builds/worker/checkouts/hg-store', '--purge', '--config', 'hostsecurity.hg.mozilla.org:fingerprints=sha256:4D:EB:21:6E:35:2F:99:C6:8F:C3:47:9B:57:B8:6C:17:15:8F:86:09:D4:6C:17:1D:87:B0:DE:F9:0E:51:70:FC,sha256:FF:E7:8D:93:E9:56:3C:C0:19:FC:00:4C:18:B9:86:E5:08:E5:10:F5:E2:EA:48:E8:22:D3:A3:3A:CA:99:C3:4C', '--upstream', 'https://hg.mozilla.org/mozilla-unified', '--revision', '323f3c52b376c577c6288b757cae7f17413759bc', 'https://hg.mozilla.org/integration/autoland', '/builds/worker/checkouts/gecko']
[vcs 2022-01-20T03:04:06.464Z] (using Mercurial 5.8.1)
[vcs 2022-01-20T03:04:06.464Z] ensuring https://hg.mozilla.org/integration/autoland@323f3c52b376c577c6288b757cae7f17413759bc is available at /builds/worker/checkouts/gecko
[vcs 2022-01-20T03:04:06.465Z] (existing repository shared store: /builds/worker/checkouts/hg-store/8ba995b74e18334ab3707f27e9eb8f4e37ba3d29/.hg)
[vcs 2022-01-20T03:05:52.889Z] unhandled URLError; reason type: str; value: Internal Server Error
[vcs 2022-01-20T03:05:52.889Z] PERFHERDER_DATA: {"framework": {"name": "vcs"}, "suites": [{"extraOptions": ["c5n.4xlarge"], "hgVersion": "5.8.1", "lowerIsBetter": true, "name": "overall", "serverUrl": "hg.mozilla.org", "shouldAlert": false, "subtests": [], "value": 106.42477297782898}, {"extraOptions": ["c5n.4xlarge"], "hgVersion": "5.8.1", "lowerIsBetter": true, "name": "overall_nopull", "serverUrl": "hg.mozilla.org", "shouldAlert": false, "subtests": [], "value": 106.42477297782898}, {"extraOptions": ["c5n.4xlarge"], "hgVersion": "5.8.1", "lowerIsBetter": true, "name": "overall_nopull_fullcheckout", "serverUrl": "hg.mozilla.org", "shouldAlert": false, "subtests": [], "value": 106.42477297782898}, {"extraOptions": ["c5n.4xlarge"], "hgVersion": "5.8.1", "lowerIsBetter": true, "name": "overall_nopull_populatedwdir", "serverUrl": "hg.mozilla.org", "shouldAlert": false, "subtests": [], "value": 106.42477297782898}]}
[vcs 2022-01-20T03:05:52.889Z] abort: HTTP Error 500: Internal Server Error
[taskcluster 2022-01-20 03:05:53.228Z] === Task Finished ===
[taskcluster 2022-01-20 03:05:53.359Z] Artifact "public/logs" not found at "/builds/worker/logs/"
[taskcluster 2022-01-20 03:05:53.442Z] Artifact "public/build/geckoview_example.aab" not found at "/builds/worker/workspace/obj-build/gradle/build/mobile/android/geckoview_example/outputs/bundle/withGeckoBinariesDebug/geckoview_example-withGeckoBinaries-debug.aab"
[taskcluster 2022-01-20 03:05:53.546Z] Artifact "public/build/maven" not found at "/builds/worker/workspace/obj-build/gradle/maven/"
[taskcluster 2022-01-20 03:05:53.630Z] Artifact "public/build/geckoview_example.apk" not found at "/builds/worker/workspace/obj-build/gradle/build/mobile/android/geckoview_example/outputs/apk/withGeckoBinaries/debug/geckoview_example-withGeckoBinaries-debug.apk"
[taskcluster 2022-01-20 03:05:53.718Z] Artifact "public/build" not found at "/builds/worker/artifacts/"
[taskcluster 2022-01-20 03:05:53.834Z] Artifact "public/build/geckoview-androidTest.apk" not found at "/builds/worker/workspace/obj-build/gradle/build/mobile/android/geckoview/outputs/apk/androidTest/withGeckoBinaries/debug/geckoview-withGeckoBinaries-debug-androidTest.apk"
[taskcluster 2022-01-20 03:05:53.938Z] Artifact "public/build/geckoview-test_runner.apk" not found at "/builds/worker/workspace/obj-build/gradle/build/mobile/android/test_runner/outputs/apk/withGeckoBinaries/debug/test_runner-withGeckoBinaries-debug.apk"
[taskcluster 2022-01-20 03:05:54.030Z] Artifact "public/build/geckoview-test_runner.aab" not found at "/builds/worker/workspace/obj-build/gradle/build/mobile/android/test_runner/outputs/bundle/withGeckoBinariesDebug/test_runner-withGeckoBinaries-debug.aab"
[taskcluster 2022-01-20 03:05:54.119Z] Unsuccessful task run with exit code: 255 completed in 112.642 seconds

There were also alerts on #vcs during the same time

:nagios_status_critical_red: Thu 03:07:56 UTC [91014] [devsvcslist] hgweb3.dmz.mdc1.mozilla.com:Load is CRITICAL CRITICAL - load average: 48.01, 36.94, 19.79

5:08
:nagios_status_critical_red: Thu 03:08:11 UTC [91016] [devsvcslist] hg.public.mdc1.mozilla.com:https - ssl hgweb string search is CRITICAL CRITICAL - Socket timeout
5:08
:nagios_status_critical_red: Thu 03:08:13 UTC [91018] [devsvcslist] hgweb3.dmz.mdc1.mozilla.com:http - hgweb Port 80 is CRITICAL CRITICAL - Socket timeout
5:08
:nagios_status_critical_red: Thu 03:08:19 UTC [91020] [devsvcslist] hgweb1.dmz.mdc1.mozilla.com:http - hgweb Port 80 is CRITICAL CRITICAL - Socket timeout
5:09
:nagios_status_critical_red: Thu 03:09:17 UTC [91022] [devsvcslist] hgweb4.dmz.mdc1.mozilla.com:Load is CRITICAL CRITICAL - load average: 40.98, 36.06, 20.59



nagiosbot-mdc1APP  5:59 AM
:nagios_status_critical_red: Thu 03:59:29 UTC [91117] [devsvcslist] hgweb2.dmz.mdc1.mozilla.com:http - hgweb Port 80 is CRITICAL CRITICAL - Socket timeout
6:01
:nagios_status_critical_red: Thu 04:01:28 UTC [91119] [devsvcslist] hg.public.mdc1.mozilla.com:https - ssl hgweb string search is CRITICAL CRITICAL - Socket timeout

nagiosbot-mdc1APP  6:07 AM
:nagios_status_warning_yellow: Thu 04:07:01 UTC [91121] [devsvcslist] hgweb1.dmz.mdc1.mozilla.com:Zookeeper - hg is WARNING ENSEMBLE WARNING - only have 3/4 expected followers
6:08
:nagios_status_critical_red: Thu 04:08:11 UTC [91123] [devsvcslist] hgssh1.dmz.mdc1.mozilla.com:Zookeeper - hg is CRITICAL CHECK_NRPE STATE CRITICAL: Socket timeout after 10 seconds.
6:08
:nagios_status_critical_red: Thu 04:08:11 UTC [91125] [devsvcslist] hgweb1.dmz.mdc1.mozilla.com:Zookeeper - hg is CRITICAL CHECK_NRPE STATE CRITICAL: Socket timeout after 10 seconds.
6:08
:nagios_status_critical_red: Thu 04:08:14 UTC [91127] [devsvcslist] hgweb2.dmz.mdc1.mozilla.com:Zookeeper - hg is CRITICAL CHECK_NRPE STATE CRITICAL: Socket timeout after 10 seconds.
6:08
:nagios_status_warning_yellow: Thu 04:08:25 UTC [91129] [devsvcslist] hgweb4.dmz.mdc1.mozilla.com:Zookeeper - hg is WARNING ENSEMBLE WARNING - only have 3/4 expected followers
6:09
:nagios_status_ok_green: Thu 04:09:18 UTC [91131] [devsvcslist] hgssh1.dmz.mdc1.mozilla.com:Zookeeper - hg is OK zookeeper node and ensemble OK
6:09
:nagios_status_ok_green: Thu 04:09:18 UTC [91133] [devsvcslist] hgweb2.dmz.mdc1.mozilla.com:Zookeeper - hg is OK zookeeper node and ensemble OK
6:09
:nagios_status_ok_green: Thu 04:09:21 UTC [91135] [devsvcslist] hgweb1.dmz.mdc1.mozilla.com:Zookeeper - hg is OK zookeeper node and ensemble OK
6:09
:nagios_status_critical_red: Thu 04:09:35 UTC [91137] [devsvcslist] hgweb4.dmz.mdc1.mozilla.com:Zookeeper - hg is CRITICAL CHECK_NRPE STATE CRITICAL: Socket timeout after 10 seconds.
6:10
:nagios_status_ok_green: Thu 04:10:42 UTC [91139] [devsvcslist] hgweb4.dmz.mdc1.mozilla.com:Zookeeper - hg is OK zookeeper node and ensemble OK
6:11
:nagios_status_critical_red: Thu 04:11:26 UTC [91141] [devsvcslist] hgweb1.dmz.mdc1.mozilla.com:http - hgweb Port 80 is CRITICAL CRITICAL - Socket timeout
6:12
:nagios_status_critical_red: Thu 04:12:38 UTC [91143] [devsvcslist] hgssh1.dmz.mdc1.mozilla.com:Zookeeper - hg is CRITICAL CHECK_NRPE STATE CRITICAL: Socket timeout after 10 seconds.
6:12
:nagios_status_critical_red: Thu 04:12:39 UTC [91145] [devsvcslist] hgweb2.dmz.mdc1.mozilla.com:Zookeeper - hg is CRITICAL CHECK_NRPE STATE CRITICAL: Socket timeout after 10 seconds.
6:13
:nagios_status_warning_yellow: Thu 04:13:36 UTC [91147] [devsvcslist] hg.mozilla.org:https - /try is UNKNOWN (Service check timed out after 90.01 seconds)
6:13
:nagios_status_critical_red: Thu 04:13:41 UTC [91149] [devsvcslist] hg.mozilla.org:https - ssl hgweb string search is CRITICAL CRITICAL - Socket timeout
6:14
:nagios_status_critical_red: Thu 04:14:36 UTC [91151] [devsvcslist] hgssh1.dmz.mdc1.mozilla.com:Kafka - hg heartbeat is CRITICAL CHECK_NRPE STATE CRITICAL: Socket timeout after 20 seconds.
6:15
:nagios_status_ok_green: Thu 04:15:41 UTC [91153] [devsvcslist] hgssh1.dmz.mdc1.mozilla.com:Kafka - hg heartbeat is OK OK - heartbeat message sent to 8/8 partitions successfully





6:17
:nagios_status_critical_red: Thu 04:17:05 UTC [91155] [devsvcslist] hgweb3.dmz.mdc1.mozilla.com:http - hgweb Port 80 is CRITICAL CRITICAL - Socket timeout
6:17
:nagios_status_warning_yellow: Thu 04:17:13 UTC [91157] [devsvcslist] hgweb2.dmz.mdc1.mozilla.com:Swap is WARNING SWAP WARNING - 26% free (4022 MB out of 16064 MB)
6:19
:nagios_status_critical_red: Thu 04:19:41 UTC [91159] [devsvcslist] hgweb4.dmz.mdc1.mozilla.com:http - hgweb Port 80 is CRITICAL CRITICAL - Socket timeout
6:19
:nagios_status_warning_yellow: Thu 04:19:45 UTC [91161] [devsvcslist] zlb1.external.ops.mdc1.mozilla.com:ZLB Backend - hg.mozilla.org is UNKNOWN (Service check timed out after 90.01 seconds)
6:22
:nagios_status_ok_green: Thu 04:22:13 UTC [91163] [devsvcslist] hgweb2.dmz.mdc1.mozilla.com:Swap is OK SWAP OK - 90% free (14315 MB out of 16064 MB)

nagiosbot-mdc1APP  6:28 AM
:nagios_status_ok_green: Thu 04:28:39 UTC [91165] [devsvcslist] hg.mozilla.org:https - /try is OK HTTP OK: HTTP/1.1 200 Script output follows - 27957 bytes in 3.646 second response time
Flags: needinfo?(sheehan)
Status: REOPENED → RESOLVED
Closed: 3 years ago9 months ago
Resolution: --- → FIXED
Flags: needinfo?(sheehan)
Pushed by jmoss@mozilla.com:
https://hg.mozilla.org/ci/ci-configuration/rev/a75101fc50fb
Update robustcheckout.py for win11 pools. r=relsre-reviewers,markco
Pushed by jmoss@mozilla.com:
https://hg.mozilla.org/ci/ci-configuration/rev/6de1919d9803
Update robustcheckout.py for windows builders. r=relsre-reviewers,markco
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: