Closed Bug 1231938 Opened 10 years ago Closed 7 years ago

Intermittent test_direct_update.py TestDirectUpdate.test_update, test_fallback_update.py TestFallbackUpdate.test_update | AssertionError: Update has been downloaded (transfer error: Failed) (2152398849)

Categories

(Testing :: Firefox UI Tests, defect)

Type: defect
Priority: Not set
Severity: normal
Tracking

(Not tracked)

RESOLVED INCOMPLETE

People

(Reporter: whimboo, Unassigned)

References

Details

(Keywords: intermittent-failure)

Attachments

(1 obsolete file)

I can see a couple of transfer issues for our update tests when downloading the MAR files:

05:04:22 INFO - *** AUS:SVC Downloader:onProgress - progress: 1500000/4221828
05:04:22 INFO - *** AUS:SVC Downloader:onStopRequest - original URI spec: https://mozilla-nightly-updates.s3.amazonaws.com/mozilla-aurora/20151210004006/Firefox-mozilla-aurora-44.0a2-macosx64-ru-20151206004008-20151210004006.partial.mar?versionId=46vPz_RkI1vMdmWLhWSdh3E3VKotWk3I, final URI spec: https://mozilla-nightly-updates.s3.amazonaws.com/mozilla-aurora/20151210004006/Firefox-mozilla-aurora-44.0a2-macosx64-ru-20151206004008-20151210004006.partial.mar?versionId=46vPz_RkI1vMdmWLhWSdh3E3VKotWk3I, status: 2152398924
05:04:22 INFO - *** AUS:SVC Downloader:onStopRequest - status: 2152398924, current fail: 0, max fail: 10, retryTimeout: 2000
05:04:22 INFO - *** AUS:SVC Downloader:onStopRequest - non-verification failure
05:04:22 INFO - *** AUS:SVC getStatusTextFromCode - transfer error: Неудача (причина неизвестна) [Russian for "Failure (reason unknown)"], default code: 2152398849

This looks very similar to bug 1219934, which is about downloading the Firefox installer.
I missed an important line in the previous comment:

05:03:06 INFO - *** AUS:SVC Downloader:onProgress - progress: 1496661/4221828
05:04:02 INFO - *** AUS:SVC Downloader:onProgress - progress: 1550693/4221828
05:04:22 INFO - *** AUS:SVC Downloader:onProgress - progress: 1500000/4221828

So the progress meter decreases, perhaps due to the networking problems we still have in SCL3. As a result, Firefox cancels the download. Jeremy, can you confirm that this is because of bug 1219934 and bug 1170832?
Flags: needinfo?(oremj)
This bucket isn't owned by Cloud Services, and it doesn't go through CloudFront either, so this is likely unrelated to the two bugs listed.
Flags: needinfo?(oremj)
Summary: TEST-UNEXPECTED-FAIL | test_direct_update.py TestDirectUpdate.test_update | AssertionError: Update has been downloaded → TEST-UNEXPECTED-FAIL | test_direct_update.py TestDirectUpdate.test_update | AssertionError: Update has been downloaded (status: 2152398924)
The status codes seem to be different. Got another one here which reports as 2152398849: https://treeherder.mozilla.org/logviewer.html#?job_id=2887237&repo=mozilla-central

05:41:35 INFO - *** AUS:SVC Downloader:onProgress - progress: 1067745/7894430
05:41:55 INFO - *** AUS:SVC Downloader:onProgress - progress: 900000/7894430
05:41:55 INFO - *** AUS:SVC Downloader:onStopRequest - original URI spec: https://mozilla-nightly-updates.s3.amazonaws.com/mozilla-central/20151213030241/Firefox-mozilla-central-45.0a1-macosx64-en-US-20151211030207-20151213030241.partial.mar?versionId=.4DCj2N7fEHuAxy5.34H.kK.74aa.rLw, final URI spec: https://mozilla-nightly-updates.s3.amazonaws.com/mozilla-central/20151213030241/Firefox-mozilla-central-45.0a1-macosx64-en-US-20151211030207-20151213030241.partial.mar?versionId=.4DCj2N7fEHuAxy5.34H.kK.74aa.rLw, status: 2152398924
05:41:55 INFO - *** AUS:SVC Downloader:onStopRequest - status: 2152398924, current fail: 0, max fail: 10, retryTimeout: 2000
05:41:55 INFO - *** AUS:SVC Downloader:onStopRequest - non-verification failure
05:41:55 INFO - *** AUS:SVC getStatusTextFromCode - transfer error: Failed (unknown reason), default code: 2152398849
Summary: TEST-UNEXPECTED-FAIL | test_direct_update.py TestDirectUpdate.test_update | AssertionError: Update has been downloaded (status: 2152398924) → TEST-UNEXPECTED-FAIL | test_direct_update.py TestDirectUpdate.test_update | AssertionError: Update has been downloaded (transfer error: Failed)
Nick or Rail, I assume this bucket is owned by RelEng then? Do we have any logs which could be checked for the above reported issue?
Flags: needinfo?(rail)
Flags: needinfo?(nthomas)
See Also: 1219934
No longer depends on: 1170832
I do have access to that bucket and logging is enabled for it. I can try to dig into the logs and extract some information, but I doubt it would help, because we have no way to fix S3. I've seen a bunch of different errors coming from S3: socket, ssl, 5xx, 419, etc. So far the best solution is to retry with exponential backoff. Not sure if it would be useful for tests though.
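The retry-with-exponential-backoff approach mentioned above can be sketched like this (a minimal illustration only; the function name and parameters are invented, and real download code would pass an actual HTTP fetch as fn):

```python
import time

def retry_with_backoff(fn, max_retries=10, base_delay=2.0, sleep=time.sleep):
    """Call fn() until it succeeds; after each failure wait twice as long.

    fn is any callable that raises OSError (socket/SSL trouble) on failure.
    """
    for attempt in range(max_retries):
        try:
            return fn()
        except OSError:
            if attempt == max_retries - 1:
                raise  # give up after the last attempt
            sleep(base_delay * 2 ** attempt)  # waits 2s, 4s, 8s, ...
```

As noted, masking transient S3 errors this way is useful for release automation but defeats the purpose of the update tests themselves.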
Flags: needinfo?(rail)
(In reply to Rail Aliiev [:rail] from comment #6)
> I do have access to that bucket and logging is enabled for it. I can try to
> dig into the logs and extract some information, but I doubt it would help,
> because we have no way to fix S3. I've seen a bunch of different errors

So why can't we get those failures investigated and reported to Amazon? These issues are clearly on their side to fix, and I think they would need logs which showcase the exact failure.

> coming from S3: socket, ssl, 5xx, 419, etc. So far the best solution is to
> retry with exponential backoff. Not sure if it would be useful for tests
> though.

I don't think that this is something we should do for our update tests, because it would mean that we no longer test what we actually want to verify. A failure like that could be a potential problem for lots of people when it comes to a release and Firefox fails to update to it.
FTR, to make any progress with tickets against Amazon you need the header that identifies the request (x-amz-request-id, I think). Even then it's not a high priority for them. It would be helpful to know what code 2152398849 translates to, and why the progress appears to go backwards:

05:41:35 INFO - *** AUS:SVC Downloader:onProgress - progress: 1067745/7894430
05:41:55 INFO - *** AUS:SVC Downloader:onProgress - progress: 900000/7894430
Flags: needinfo?(nthomas)
Based on the message printed to the log it should be one of these: https://dxr.mozilla.org/mozilla-central/source/toolkit/mozapps/update/nsUpdateService.js#4130

    } else if (status != Cr.NS_BINDING_ABORTED &&
               status != Cr.NS_ERROR_ABORT &&
               status != Cr.NS_ERROR_DOCUMENT_NOT_CACHED) {
      LOG("Downloader:onStopRequest - non-verification failure");

I believe that Brian will know what 2152398924 means.
Flags: needinfo?(netzen)
Flags: needinfo?(netzen)
Hm, it doesn't name an exact failure. When I check the list (http://james-ross.co.uk/mozilla/misc/nserror_list) for 0x804B004C I do not find any entry. The entry before it is NS_ERROR_REMOTE_XUL (0x804B004B, code 75), which can currently be found here: https://dxr.mozilla.org/mozilla-central/source/xpcom/base/ErrorList.h#233. The next error in that list is NS_ERROR_LOAD_SHOWED_ERRORPAGE (77), but I'm not sure if that maps, given the gap of 2.
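For what it's worth, the status value can be decoded by hand. A minimal sketch, assuming the standard nsresult layout (high severity bit, 13-bit module field offset by NS_ERROR_MODULE_BASE_OFFSET = 0x45, 16-bit code); under that layout 0x804B004C is a network-module failure with code 76, exactly the gap between entries 75 and 77 in the list:

```python
NS_ERROR_MODULE_BASE_OFFSET = 0x45  # from xpcom/base/ErrorList.h
NS_ERROR_MODULE_NETWORK = 6

def decode_nsresult(status):
    """Split an nsresult value into (is_failure, module, code)."""
    is_failure = bool(status & 0x80000000)        # top bit marks failure
    module = ((status >> 16) & 0x1FFF) - NS_ERROR_MODULE_BASE_OFFSET
    code = status & 0xFFFF
    return is_failure, module, code

# 2152398924 == 0x804B004C: failure, network module, code 76
print(decode_nsresult(2152398924))
```

For comparison, 2152398849 (0x804B0001) decodes to network-module failure code 1.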
The HTTP log can be found here: https://firefox-ui-tests.s3.amazonaws.com/170c2332-621b-4ff2-babd-21ac77484a2c/http.log

First HTTP response:

2015-12-13 13:41:55.387702 UTC - 245157888[1003299f0]: http response [
2015-12-13 13:41:55.387732 UTC - 245157888[1003299f0]: HTTP/1.1 200 OK
2015-12-13 13:41:55.387739 UTC - 245157888[1003299f0]: x-amz-id-2: xskWWw8hDXMSY+R9qYWtVkxb1tVd7UJzTkijdJ77uNhaJEYFN1z/B/6MiwLiexQl1rNdI/qJRQk=
2015-12-13 13:41:55.387745 UTC - 245157888[1003299f0]: x-amz-request-id: 492E91CDF7C23A9B
2015-12-13 13:41:55.387751 UTC - 245157888[1003299f0]: Date: Sun, 13 Dec 2015 13:41:52 GMT
2015-12-13 13:41:55.387756 UTC - 245157888[1003299f0]: x-amz-version-id: x7Uq3NKdMuO_Kxg0fkczOFR_fyfLgZqG
2015-12-13 13:41:55.387762 UTC - 245157888[1003299f0]: Last-Modified: Sun, 13 Dec 2015 13:38:34 GMT
2015-12-13 13:41:55.387769 UTC - 245157888[1003299f0]: Etag: "18a3d8154c66fff54c71c832370851b3"
2015-12-13 13:41:55.387778 UTC - 245157888[1003299f0]: Accept-Ranges: bytes
2015-12-13 13:41:55.387787 UTC - 245157888[1003299f0]: Content-Type: application/octet-stream
2015-12-13 13:41:55.387795 UTC - 245157888[1003299f0]: Content-Length: 108719352
2015-12-13 13:41:55.387804 UTC - 245157888[1003299f0]: Server: AmazonS3
2015-12-13 13:41:55.387814 UTC - 245157888[1003299f0]: ]

Here is the interesting part. Have a look at the Content-Length value:

2015-12-13 13:42:04.959719 UTC - 196444160[1003299f0]: http response [
2015-12-13 13:42:04.959760 UTC - 196444160[1003299f0]: HTTP/1.1 206 Partial Content
2015-12-13 13:42:04.959773 UTC - 196444160[1003299f0]: x-amz-id-2: gbMaffi7IdRY7vJQzRUOalsz3M0XKIwTyLMLN062eSXsMI96UNuzwBPIyQLQBzD+lBC9EMDEym0=
2015-12-13 13:42:04.959783 UTC - 196444160[1003299f0]: x-amz-request-id: CCE0A7F52ED37B45
2015-12-13 13:42:04.959793 UTC - 196444160[1003299f0]: Date: Sun, 13 Dec 2015 13:42:02 GMT
2015-12-13 13:42:04.959802 UTC - 196444160[1003299f0]: x-amz-version-id: x7Uq3NKdMuO_Kxg0fkczOFR_fyfLgZqG
2015-12-13 13:42:04.959811 UTC - 196444160[1003299f0]: Last-Modified: Sun, 13 Dec 2015 13:38:34 GMT
2015-12-13 13:42:04.959820 UTC - 196444160[1003299f0]: Etag: "18a3d8154c66fff54c71c832370851b3"
2015-12-13 13:42:04.959829 UTC - 196444160[1003299f0]: Accept-Ranges: bytes
2015-12-13 13:42:04.959838 UTC - 196444160[1003299f0]: Content-Range: bytes 900000-108719351/108719352
2015-12-13 13:42:04.959847 UTC - 196444160[1003299f0]: Content-Type: application/octet-stream
2015-12-13 13:42:04.959856 UTC - 196444160[1003299f0]: Content-Length: 107819352
2015-12-13 13:42:04.959864 UTC - 196444160[1003299f0]: Server: AmazonS3
2015-12-13 13:42:04.959873 UTC - 196444160[1003299f0]: ]

As you can see we have 108719352 vs. 107819352, which at first glance looks like two transposed digits. Note, though, that for a 206 response the Content-Length covers only the requested range, and 108719351 - 900000 + 1 = 107819352, so the headers may actually be consistent. Brian, could this be the reason for the failure?
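The 200 vs. 206 Content-Length values above can be cross-checked mechanically; a standalone sketch (the helper name is made up, not part of the test suite):

```python
def range_matches_length(content_range, content_length):
    """Check a 206 response: Content-Length must equal the span in Content-Range."""
    # content_range looks like "bytes 900000-108719351/108719352"
    span, _total = content_range.split(" ", 1)[1].split("/")
    start, end = (int(part) for part in span.split("-"))
    return content_length == end - start + 1

# Headers from the log above: 108719351 - 900000 + 1 == 107819352
print(range_matches_length("bytes 900000-108719351/108719352", 107819352))
```

So the 107819352 in the 206 response matches the requested byte range rather than being a corrupted copy of the full length 108719352.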
Flags: needinfo?(netzen)
Go to http://james-ross.co.uk/mozilla/misc/nserror and enter the error code; you'll get 2152398849 = NS_BINDING_FAILED (the default code logged by getStatusTextFromCode).
Flags: needinfo?(netzen)
Product: Mozilla QA → Testing
So what does NS_BINDING_FAILED in this case mean? We hit the same issue 3 times today for the ru Nightly build on OS X: https://treeherder.mozilla.org/#/jobs?repo=mozilla-aurora&revision=a1843cf99df1&filter-searchStr=fxup%20os&filter-tier=1&filter-tier=2&filter-tier=3
(In reply to Henrik Skupin (:whimboo) from comment #15)
> So what does NS_BINDING_FAILED in this case mean? We hit the same issue 3
> times today for the ru Nightly build on OS X:
> https://treeherder.mozilla.org/#/jobs?repo=mozilla-aurora&revision=a1843cf99df1&filter-searchStr=fxup%20os&filter-tier=1&filter-tier=2&filter-tier=3

That comes from the networking code; someone on the networking team should be able to answer that.
Patrick, would you be able to have a look at comment 8 and comment 12? We see some strange behavior in our update tests when downloading the MAR files from S3. In case you need HTTP logs, you can find them attached on Treeherder (see the links given in this bug). Thanks.
Flags: needinfo?(mcmanus)
Summary: TEST-UNEXPECTED-FAIL | test_direct_update.py TestDirectUpdate.test_update | AssertionError: Update has been downloaded (transfer error: Failed) → TEST-UNEXPECTED-FAIL | test_direct_update.py TestDirectUpdate.test_update | AssertionError: Update has been downloaded (transfer error: Failed) (2152398849)
Summary: TEST-UNEXPECTED-FAIL | test_direct_update.py TestDirectUpdate.test_update | AssertionError: Update has been downloaded (transfer error: Failed) (2152398849) → TEST-UNEXPECTED-FAIL | test_direct_update.py, test_fallback_update.py | AssertionError: Update has been downloaded (transfer error: Failed) (2152398849)
Rail, interestingly this test failure happened a lot on Saturday, January 30th, for almost all de-2 funsize updates across platforms. It is not visible for de-3 on Sunday, so maybe something was wrong with the MAR file creation? Neither en-US nor the other 3 locales we test were affected.
Flags: needinfo?(rail)
Sorry it took so long to reply... TBH I have trouble debugging the issue here. It is definitely somehow related to the way S3 works (or doesn't! :) ). I know we see similar issues when we run our update verification tests for releases and hit a lot of errors just after we upload the bits. Sometimes it takes hours before we stop seeing errors, sometimes minutes...
Flags: needinfo?(rail)
So, given that this seems to be an issue with S3, can we get a ticket filed with them? I think we have the necessary information for them to take a look at this. Having dozens of test failures for releases is something I'm a bit concerned about, because it would always require additional checks and, if necessary, retriggering of test jobs.
Summary: TEST-UNEXPECTED-FAIL | test_direct_update.py, test_fallback_update.py | AssertionError: Update has been downloaded (transfer error: Failed) (2152398849) → TEST-UNEXPECTED-FAIL | test_direct_update.py TestDirectUpdate.test_update, test_fallback_update.py TestFallbackUpdate.test_update | AssertionError: Update has been downloaded (transfer error: Failed) (2152398849)
Attached file Update log (obsolete) —
I think that I hit this problem today while trying to upgrade my Surface tablet from an old Nightly build. Attached is the log, which ends in that known error message. It looks like access to the file AUS suggests is denied.
Flags: needinfo?(mcmanus) → needinfo?(rail)
So the last comment was actually a different issue, which is now on file as bug 1273899 and is related to the current expiration settings of S3 for old MAR files.
Flags: needinfo?(rail)
Attachment #8753846 - Attachment is obsolete: true
Summary: TEST-UNEXPECTED-FAIL | test_direct_update.py TestDirectUpdate.test_update, test_fallback_update.py TestFallbackUpdate.test_update | AssertionError: Update has been downloaded (transfer error: Failed) (2152398849) → Intermittent test_direct_update.py TestDirectUpdate.test_update, test_fallback_update.py TestFallbackUpdate.test_update | AssertionError: Update has been downloaded (transfer error: Failed) (2152398849)
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → INCOMPLETE
