Closed Bug 1276813 Opened 8 years ago Closed 6 years ago

Intermittent "Failed to download from all available URLs, aborting" downloading from queue.taskcluster.net

Categories: Taskcluster :: Operations and Service Requests
Type: task
Priority: Not set
Severity: normal

Tracking: (Not tracked)

Status: RESOLVED WORKSFORME

People: (Reporter: philor, Unassigned)

Keywords: intermittent-failure
Whiteboard: [stockwell infra]

Attachments: (2 files)

Worth noting that these are 500 errors.

I can open a new case with AWS.  Before we do that, it's worth looking at the CloudFront logs, though, to give the AWS support folks some request-ids to look at.  I'm at PyCon and don't have the resources to pull that data.
Appears this issue hasn't come up again in over a week.  If it happens again, we need to reopen this bug and attach logs and request IDs that can be sent to AWS.
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → WORKSFORME
(In reply to Greg Arndt [:garndt] from comment #7)
> Appears this issue hasn't come up again in over a week.  If it happens
> again, we need to reopen this bug and attach logs and request IDs that
> can be sent to AWS.


https://treeherder.mozilla.org/logviewer.html#?job_id=67211628&repo=autoland&lineNumber=315
Status: RESOLVED → REOPENED
Flags: needinfo?(garndt)
Resolution: WORKSFORME → ---
Looking over the errors, they were all from datacenter OS X machines and showed a variety of errors (some connection, some SSL).  I'm guessing there was a hiccup here that resolved quickly on its own; it seems to be different from what this bug was about.
Flags: needinfo?(garndt)
This picked up again on the 19th. :garndt, can you help figure out what is going on here?  This looks to be OS X specific.
Flags: needinfo?(garndt)
Looking over the results for the last month, outside of a few Linux tests that were incorrectly starred under this bug, and some 503s reported by the Windows machines early in the month, the remaining failures are entirely from buildbot OS X machines in the datacenter.

Unfortunately most of the errors reported display one of the following:
Caught exception: ''
OR
Caught exception: <urlopen error [Errno 8] _ssl.c:504: EOF occurred in violation of protocol>

neither of which gives us much information about what's going on.  I suspect these are blips, considering it's not happening on every push and not on the thousands of other machines running Linux and Windows.  Is there a way we can get more information from mozharness when this fails?

Also, looking at our logs from the queue for that artifact, I see plenty of standard responses (redirects to S3), but no unexpected ones (like internal server errors).
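
As a sketch of the kind of extra detail mozharness could log here (illustrative Python only, not the actual mozharness code): an exception raised with no message stringifies to an empty string, which matches the bare "Caught exception: ''" lines, so logging the exception class name alongside the message would already narrow things down.

    import socket

    # An exception constructed without a message stringifies to '', which is
    # consistent with the bare "Caught exception: ''" lines in these logs.
    try:
        raise socket.timeout()  # e.g. a socket read that hit its timeout
    except Exception as e:
        print("Caught exception: '%s'" % e)                         # -> Caught exception: ''
        print("Caught %s exception: '%s'" % (type(e).__name__, e))  # also names the class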
Flags: needinfo?(garndt)
Just a heads up, you're going to see hundreds of these come in today.  There was a transient issue that Dustin and John have been looking into with downloading artifacts.  It appears to be different from the OS X ones because an actual exception was printed to the logs (a timeout).
Whiteboard: [stockwell infra]
As noted in comment 21, we were expecting this uptick due to some timeout issues experienced on the 22nd.  There were only a few failures after that, all on OS X machines as mentioned in comment 20.  I am not sure why only the OS X machines hit errors like that.
oh another big spike today :(
I spot-checked a dozen of the jobs that were starred, and they were failing to download a file from GitHub, not queue.taskcluster.net.  We should probably move that somewhere else so we're not depending on GitHub.

Socket error when accessing https://raw.githubusercontent.com/mozilla/build-tooltool/master/tooltool.py: The read operation timed out
Depends on: 1345972
Almost all of the ones in comment 29 are because of comment 27.
this is behaving better lately :)
It's great that it's a little better.  It's still troubling that this is primarily happening on the OS X machines using BB in the datacenter and the logging does not indicate what the problem is. 

I do not know who we need to loop in to check out those machines and see if they can put together some information about what is failing, but because the failures are largely limited to that set of machines, I think this is isolated to the datacenter rather than being a larger Taskcluster issue.
See Also: → 1350320
Kim, could you have a look at this, or redirect to someone who might know what's happening here?

This bug likely started out as multiple problems, but now is only about intermittent download failures on OS X. We've already run it by Taskcluster folks (comment 33).
Flags: needinfo?(kmoir)
The code involved in the failure is _download_file() rather than _fetch_into_memory() [1]
https://dxr.mozilla.org/mozilla-central/source/testing/mozharness/mozharness/base/script.py#413-485

It seems to generally involve the .dmg file after having downloaded other files from that same task ID.
Is there something special about how cloud-mirror stores or copies .dmg files around that could fail intermittently?

I've noticed in one of the logs something useful (I hope) [2]:
04:24:35     INFO - Caught exception: <urlopen error [Errno 8] _ssl.c:504: EOF occurred in violation of protocol>

Here's also the try/except block related to this:
https://dxr.mozilla.org/mozilla-central/source/testing/mozharness/mozharness/base/script.py#1102-1111


[1] https://docs.google.com/document/d/1HRIvVBqTDsSebOkXqMPxTRw3nPwsYYkonooHrC0ksvA/edit
[2]
04:21:21     INFO - Downloading https://queue.taskcluster.net/v1/task/U7fRZclaQlSrXZl7tyjlnQ/artifacts/public/build/firefox-53.0.en-US.mac.dmg to /builds/slave/test/installer.dmg
04:21:21     INFO - retry: Calling _download_file with args: (), kwargs: {'url': 'https://queue.taskcluster.net/v1/task/U7fRZclaQlSrXZl7tyjlnQ/artifacts/public/build/firefox-53.0.en-US.mac.dmg', 'file_name': '/builds/slave/test/installer.dmg'}, attempt #1
04:21:53     INFO - retry: attempt #1 caught exception: ''
04:21:53     INFO - retry: Failed, sleeping 30 seconds before retrying
04:22:23     INFO - retry: Calling _download_file with args: (), kwargs: {'url': 'https://queue.taskcluster.net/v1/task/U7fRZclaQlSrXZl7tyjlnQ/artifacts/public/build/firefox-53.0.en-US.mac.dmg', 'file_name': '/builds/slave/test/installer.dmg'}, attempt #2
04:22:57     INFO - retry: attempt #2 caught exception: ''
04:22:57     INFO - retry: Failed, sleeping 60 seconds before retrying
04:23:57     INFO - retry: Calling _download_file with args: (), kwargs: {'url': 'https://queue.taskcluster.net/v1/task/U7fRZclaQlSrXZl7tyjlnQ/artifacts/public/build/firefox-53.0.en-US.mac.dmg', 'file_name': '/builds/slave/test/installer.dmg'}, attempt #3
04:24:35  WARNING - URL Error: https://queue.taskcluster.net/v1/task/U7fRZclaQlSrXZl7tyjlnQ/artifacts/public/build/firefox-53.0.en-US.mac.dmg
04:24:35     INFO - Running command: ['nslookup', 'queue.taskcluster.net']
04:24:35     INFO - Copy/paste: nslookup queue.taskcluster.net
04:24:35     INFO -  Server:		10.26.75.40
04:24:35     INFO -  Address:	10.26.75.40#53
04:24:35     INFO -  queue.taskcluster.net	canonical name = toyama-73636.herokussl.com.
04:24:35     INFO -  toyama-73636.herokussl.com	canonical name = elb081832-1105536650.us-east-1.elb.amazonaws.com.
04:24:35     INFO -  Name:	elb081832-1105536650.us-east-1.elb.amazonaws.com
04:24:35     INFO -  Address: 54.243.65.185
04:24:35     INFO -  Name:	elb081832-1105536650.us-east-1.elb.amazonaws.com
04:24:35     INFO -  Address: 23.21.42.179
04:24:35     INFO -  Name:	elb081832-1105536650.us-east-1.elb.amazonaws.com
04:24:35     INFO -  Address: 54.243.87.215
04:24:35     INFO - Return code: 0
04:24:35     INFO - retry: attempt #3 caught exception: <urlopen error [Errno 8] _ssl.c:504: EOF occurred in violation of protocol>
04:24:35     INFO - Can't download from https://queue.taskcluster.net/v1/task/U7fRZclaQlSrXZl7tyjlnQ/artifacts/public/build/firefox-53.0.en-US.mac.dmg to /builds/slave/test/installer.dmg!
04:24:35     INFO - Caught exception: ''
04:24:35     INFO - Caught exception: ''
04:24:35     INFO - Caught exception: <urlopen error [Errno 8] _ssl.c:504: EOF occurred in violation of protocol>
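
For context, the "retry: Calling ... attempt #N" and "Failed, sleeping N seconds" lines above come from a retry wrapper with a doubling sleep. A minimal stand-in that reproduces the pattern (illustrative only, not the real mozharness retry()):

    import time

    def retry(action, attempts=3, sleeptime=30, log=print, **kwargs):
        # Illustrative stand-in for the retry wrapper whose output appears above.
        for attempt in range(1, attempts + 1):
            log("retry: Calling %s with kwargs: %r, attempt #%d"
                % (action.__name__, kwargs, attempt))
            try:
                return action(**kwargs)
            except Exception as e:
                log("retry: attempt #%d caught %s exception: '%s'"
                    % (attempt, type(e).__name__, e))
                if attempt == attempts:
                    raise
                log("retry: Failed, sleeping %d seconds before retrying" % sleeptime)
                time.sleep(sleeptime)
                sleeptime *= 2  # back off: 30, 60, 120, ...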
Since Armen responded, I'm going to clear my needinfo. If you need someone from buildduty to look at this, let me know and I can redirect it.
Flags: needinfo?(kmoir)
Yes, please! (I don't think Armen will be following up...he was just adding some potentially helpful info. We need someone to dig in here and try to resolve these failures.)
Flags: needinfo?(kmoir)
(In reply to Armen Zambrano [:armenzg] (EDT/UTC-4) from comment #39)

> It seems to generally involve the .dmg file after having downloaded other
> files from that same task ID.
> Is there something special about how cloud-mirror stores or copies .dmg files
> around that could fail intermittently?
> 

These are machines downloading from the datacenter (outside of AWS), so cloud-mirror is not a factor.  Cloud-mirror only copies things between AWS regions as machines within AWS request them.  Otherwise, all machines outside of AWS download those files through CloudFront in front of our us-west-2 S3 bucket.
Alin, could you get someone from the buildduty team to investigate this bug when you have a chance?
Flags: needinfo?(kmoir) → needinfo?(aselagea)
Yup, Andrei will take a look.
Flags: needinfo?(aselagea)
Attached image 3status.PNG —
Did some investigation here but couldn't find any reason why the download sometimes fails and sometimes succeeds for [url].
An example set of mochitest e10s jobs [1] shows the same jobs sometimes failing and sometimes managing to download [url]. Jobs [2] and [3] are examples of jobs that ran at almost the same time but had different outcomes. The logs indicate that the download sometimes does not work when fetching data from [url]. I attached three log examples (3status.PNG) of the same job at the point where it begins downloading from [url].


[url] https://queue.taskcluster.net/v1/task/JAmeJ_9LSL6fKECXBE-gFA/artifacts/public/build/target.test_packages.json
[1] https://treeherder.mozilla.org/#/jobs?repo=autoland&filter-searchStr=mochitest%20e10s&tochange=feaeb4c4a1149a7925e9d0e32a61fde7ad74b8f2
[2] https://treeherder.mozilla.org/#/jobs?repo=autoland&filter-searchStr=mochitest%20e10s&tochange=feaeb4c4a1149a7925e9d0e32a61fde7ad74b8f2&selectedJob=89126373
[3] https://treeherder.mozilla.org/#/jobs?repo=autoland&filter-searchStr=mochitest%20e10s&tochange=feaeb4c4a1149a7925e9d0e32a61fde7ad74b8f2&selectedJob=89126364
I'd like to try increasing some timeouts or retries -- maybe we can recover if we wait long enough?
Flags: needinfo?(gbrown)
Looking at failures such as https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&revision=83e7c9bfbd76715d6e6a73f0ac45bb3cd4305d7e&selectedJob=91888786&filter-searchStr=osx%20debug, I notice that most (maybe all) of the failures happened between 05:11 and 05:13 - a fairly short window.

There are successful jobs at about the same time, but most of those initially fail and then succeed on retry. For example,

https://archive.mozilla.org/pub/firefox/tinderbox-builds/mozilla-inbound-macosx64-debug/mozilla-inbound_yosemite_r7-debug_test-crashtest-e10s-bm136-tests1-macosx-build180.txt.gz

05:11:03     INFO - Fetch https://queue.taskcluster.net/v1/task/GAOVP3HbSDK-kCbHlJTrLg/artifacts/public/build/target.reftest.tests.zip into memory
05:11:35     INFO - retry: attempt #1 caught exception: ''
05:11:35     INFO - retry: Failed, sleeping 60 seconds before retrying
05:12:35     INFO - retry: Calling fetch_url_into_memory with args: (), kwargs: {'url': u'https://queue.taskcluster.net/v1/task/GAOVP3HbSDK-kCbHlJTrLg/artifacts/public/build/target.reftest.tests.zip'}, attempt #2
05:12:35     INFO - Fetch https://queue.taskcluster.net/v1/task/GAOVP3HbSDK-kCbHlJTrLg/artifacts/public/build/target.reftest.tests.zip into memory
05:13:07     INFO - retry: attempt #2 caught exception: ''
05:13:07     INFO - retry: Failed, sleeping 120 seconds before retrying
05:15:07     INFO - retry: Calling fetch_url_into_memory with args: (), kwargs: {'url': u'https://queue.taskcluster.net/v1/task/GAOVP3HbSDK-kCbHlJTrLg/artifacts/public/build/target.reftest.tests.zip'}, attempt #3
05:15:07     INFO - Fetch https://queue.taskcluster.net/v1/task/GAOVP3HbSDK-kCbHlJTrLg/artifacts/public/build/target.reftest.tests.zip into memory
05:15:13     INFO - Expected file size: 35872624
05:15:13     INFO - Obtained file size: 35872624

and

https://archive.mozilla.org/pub/firefox/tinderbox-builds/mozilla-inbound-macosx64-debug/mozilla-inbound_yosemite_r7-debug_test-gtest-bm106-tests1-macosx-build199.txt.gz

05:12:35     INFO - Fetch https://queue.taskcluster.net/v1/task/GAOVP3HbSDK-kCbHlJTrLg/artifacts/public/build/target.crashreporter-symbols.zip into memory
05:13:06     INFO - retry: attempt #1 caught exception: ''
05:13:06     INFO - retry: Failed, sleeping 60 seconds before retrying
05:14:06     INFO - retry: Calling fetch_url_into_memory with args: (), kwargs: {'url': 'https://queue.taskcluster.net/v1/task/GAOVP3HbSDK-kCbHlJTrLg/artifacts/public/build/target.crashreporter-symbols.zip'}, attempt #2
05:14:06     INFO - Fetch https://queue.taskcluster.net/v1/task/GAOVP3HbSDK-kCbHlJTrLg/artifacts/public/build/target.crashreporter-symbols.zip into memory
05:14:40     INFO - retry: attempt #2 caught exception: ''
05:14:40     INFO - retry: Failed, sleeping 120 seconds before retrying
05:16:40     INFO - retry: Calling fetch_url_into_memory with args: (), kwargs: {'url': 'https://queue.taskcluster.net/v1/task/GAOVP3HbSDK-kCbHlJTrLg/artifacts/public/build/target.crashreporter-symbols.zip'}, attempt #3
05:16:40     INFO - Fetch https://queue.taskcluster.net/v1/task/GAOVP3HbSDK-kCbHlJTrLg/artifacts/public/build/target.crashreporter-symbols.zip into memory
05:16:44     INFO - Expected file size: 134502514
05:16:44     INFO - Obtained file size: 134502514
Also, I notice that most recent OS X and Windows failures occur while downloading the test_packages.json file, which sleeps less between attempts than the larger files downloaded with fetch_url_into_memory -- we seem to fail less often when we retry over a longer time period.
Attached patch retry more — Splinter Review
Based on comment 53 and comment 54, I think at least some of the failures here would be avoided if mozharness retried downloads over a longer time period. 

In download_proxied_file() (where most failures occur), I increased retries from 3 to 5 and kept the initial sleep time at 30 seconds. Instead of the current behavior:
 - attempt #1 (might take up to 30 seconds)
 - sleep 30
 - attempt #2 (up to 30 seconds)
 - sleep 60
 - attempt #3 (...)
Pathological worst case time: 180 seconds

Now I expect:
 - attempt #1 (might take up to 30 seconds)
 - sleep 30
 - attempt #2
 - sleep 60
 - attempt #3
 - sleep 120
 - attempt #4
 - sleep 240
 - attempt #5
Pathological worst case time: 600 seconds

fetch_url_into_memory failures are fewer. The current behavior there is:
 - attempt #1 (might take up to 30 seconds)
 - sleep 60
 - attempt #2 (up to 30 seconds)
 - sleep 120
 - attempt #3 (...)
Pathological worst case time: 270 seconds

and I'm changing those retry parameters to be the same as download_proxied_file() - initial sleep of 30 seconds (might recover faster) and 5 retries.

There is at least one disadvantage to more retries: It will take longer for jobs to fail, so when there is a long-term, non-recoverable outage, it may take longer to notice the problem and take corrective action. Still, given the persistent failures we are seeing, I think this change is appropriate.

(There is one minor ride-along change here, to log the exception name - see comment 51.)

My plan is to leave this bug open and monitor it; if failures persist, I'll likely back out this patch.
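
A rough way to sanity-check those worst-case numbers, assuming each attempt can itself block for up to a 30-second timeout before failing (illustrative arithmetic only, not code from the patch):

    def worst_case(attempts, initial_sleep, attempt_timeout=30):
        # Each failed attempt may block up to attempt_timeout seconds, and the
        # sleep between attempts doubles each time.
        sleeps = [initial_sleep * 2 ** i for i in range(attempts - 1)]
        return attempts * attempt_timeout + sum(sleeps)

    print(worst_case(3, 30))  # old download_proxied_file() schedule: 180 seconds
    print(worst_case(5, 30))  # new schedule: 600 seconds
    print(worst_case(3, 60))  # old fetch_url_into_memory() schedule: 270 seconds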
Assignee: nobody → gbrown
Flags: needinfo?(gbrown)
Attachment #8858919 - Flags: review?(armenzg)
Attachment #8858919 - Flags: review?(armenzg) → review+
Keywords: leave-open
Pushed by gbrown@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/5317782b6984
Retry mozharness failed downloads more persistently; r=armenzg
The change from comment 57 seems to have made a world of difference: no failures since it landed!

I expect occasional failures due to extended network/service outages, so let's keep this open.
Assignee: gbrown → nobody
I am seeing a few failures today on OS X:
https://treeherder.mozilla.org/#/jobs?repo=try&author=jmaher@mozilla.com&selectedJob=99473976&filter-searchStr=osx%20other

09:05:57     INFO - retry: attempt #5 caught HTTPError exception: HTTP Error 404: Not Found
09:05:57     INFO - Can't download from https://queue.taskcluster.net/v1/task/a6-fWNwlTnKG2YkAJaSDMQ/artifacts/public/build/firefox-55.0a1.en-US.mac.test_packages.json to /builds/slave/test/build/firefox-55.0a1.en-US.mac.test_packages.json!
09:05:57     INFO - Caught exception: HTTP Error 404: Not Found
09:05:57     INFO - Caught exception: HTTP Error 404: Not Found
09:05:57     INFO - Caught exception: HTTP Error 404: Not Found
09:05:57     INFO - Caught exception: HTTP Error 404: Not Found
09:05:57     INFO - Caught exception: HTTP Error 404: Not Found
09:05:57    FATAL - Failed to download from all available URLs, aborting
09:05:57    FATAL - Running post_fatal callback...
09:05:57    FATAL - Exiting 3



:garndt, can you check if all services are running ok?
Flags: needinfo?(garndt)
This appears to be a different problem and I have opened a bug for it: bug 1365350.
Flags: needinfo?(garndt)
This bug still happens from time to time, is mainly concentrated on OS X machines, and resolves on its own.
:garndt, is there any further work we could do to reduce this issue?
Flags: needinfo?(garndt)
I'm not sure what further work could be done here and I'm open to suggestions.  This seems to always be limited to the datacenter machines, and it appears the clients trying to download the files are retrying with backoff.
Flags: needinfo?(garndt)
Also in many cases the exception being raised is not helpful:
Caught exception: ''
There appears to have been a period when the queue was returning internal server errors (ISEs) more than usual.
I have set up alerts in Heroku to help us identify periods where the queue might be part of the problem (a higher rate of failed requests).
This spike was caused by the problems AWS was having with S3, which was returning 503 (Slow Down) responses.
:gbrown, a year ago you marked this as leave-open. We have had little to no failures since then; can we close this?
Flags: needinfo?(gbrown)
Status: REOPENED → RESOLVED
Closed: 8 years ago6 years ago
Flags: needinfo?(gbrown)
Resolution: --- → WORKSFORME
Component: Operations → Operations and Service Requests