Closed
Bug 817688
Opened 12 years ago
Closed 12 years ago
Firefox 15.0.1 -> 17.0.1 fallback updates fail intermittently
Categories
(Release Engineering :: Release Requests, defect)
Release Engineering
Release Requests
Tracking
(Not tracked)
RESOLVED
INCOMPLETE
People
(Reporter: u279076, Unassigned)
References
Details
When testing the Firefox 17.0.1 live updates on Friday nearly all of our update tests for 15.0.1 -> 17.0.1 failed in automation. I tried to reproduce the behaviour manually and found that the failure happens when trying to connect to the update server to grab the complete snippet. I was able to reproduce this only a couple of times out of 10 attempts. I tried to debug this with Ben Hearsum but we couldn't find anything which would indicate a problem. The update.xml file looked correct, the mar file size was correct, the snippet via AUS URL looked correct, and I was able to download the snippet manually by navigating to the download URL. Unfortunately, I don't have any of the actual logs anymore as they were on pastebin which doesn't host the files for that long. If necessary, I can try to generate the logs again. One of the Mozmill reports can be seen here: http://mozmill-ondemand.blargon7.com/#/update/report/352218d7e3125c857fc1d371674ccfcd Please advise how I can help investigate this further (if there's something worth investigating).
Updated•12 years ago
|
Component: Release Engineering → Release Engineering: Releases
QA Contact: bhearsum
Comment 1•12 years ago
|
||
bhearsum told me this was a problem downloading the complete.mar, which comes from the meta-CDN, so this might be quite hard to debug. Can you see if the failure is reproducible today using mozmill ? If so can you try either of curl -i "http://download.mozilla.org/?product=firefox-17.0.1-complete&os=win&lang=en-US" or wget -S "http://download.mozilla.org/?product=firefox-17.0.1-complete&os=win&lang=en-US" to see if we can figure out which CDN is the issue.
(In reply to Nick Thomas [:nthomas] from comment #1) > Can you see if the failure is reproducible today using mozmill ? I ran the test again, it happened once out of 27 testruns. > curl -i "http://download.mozilla.org/?product=firefox-17.0.1-complete&os=win&lang=en-US" I tried this 50 times and each time it returned "HTTP/1.1 302 Found" along with a bunch of other information. > wget -S "http://download.mozilla.org/?product=firefox-17.0.1-complete&os=win&lang=en-US" I tried this 50 times and each time the complete snippet was downloaded.
Comment 3•12 years ago
|
||
FTR, snippets are turned into an update.xml by AUS, while a mar file is quite separate. The wget/curl calls were done on the machine that the single test (of 27) failed on ? After initial 302 from donwload.m.o, what was the HTTP code from the CDN ? Did you get the same CDN IP on those 50 tests ? Have you seen any other network flakiness on the mozmill machine ?
(In reply to Nick Thomas [:nthomas] from comment #3) > FTR, snippets are turned into an update.xml by AUS, while a mar file is > quite separate. > > The wget/curl calls were done on the machine that the single test (of 27) > failed on ? After initial 302 from donwload.m.o, what was the HTTP code from > the CDN ? Did you get the same CDN IP on those 50 tests ? Have you seen any > other network flakiness on the mozmill machine ? Here's the entire output of one of the wgets: > wget -S "http://download.mozilla.org/?product=firefox-17.0.1-complete&os=win&lang=en-US" > --2012-12-03 14:28:24-- http://download.mozilla.org/?product=firefox-17.0.1-complete&os=win&lang=en-US > Resolving download.mozilla.org (download.mozilla.org)... 63.245.217.39, 63.245.217.36, 2620:101:8008:5::2:a > Connecting to download.mozilla.org (download.mozilla.org)|63.245.217.39|:80... connected. > HTTP request sent, awaiting response... > HTTP/1.1 302 Found > Server: Apache > X-Backend-Server: bouncer7.webapp.phx1.mozilla.com > Cache-Control: max-age=15 > Content-Type: text/html; charset=UTF-8 > Date: Mon, 03 Dec 2012 22:28:15 GMT > Location: http://download.cdn.mozilla.net/pub/mozilla.org/firefox/releases/17.0.1/update/win32/en-US/firefox-17.0.1.complete.mar > Keep-Alive: timeout=3, max=500 > Content-Length: 0 > Connection: Keep-Alive > X-Cache-Info: cached > Location: http://download.cdn.mozilla.net/pub/mozilla.org/firefox/releases/17.0.1/update/win32/en-US/firefox-17.0.1.complete.mar [following] > --2012-12-03 14:28:24-- http://download.cdn.mozilla.net/pub/mozilla.org/firefox/releases/17.0.1/update/win32/en-US/firefox-17.0.1.complete.mar > Resolving download.cdn.mozilla.net (download.cdn.mozilla.net)... 173.223.52.177, 173.223.52.195 > Connecting to download.cdn.mozilla.net (download.cdn.mozilla.net)|173.223.52.177|:80... connected. > HTTP request sent, awaiting response... > HTTP/1.1 200 OK > Server: Apache > X-Backend-Server: ftp2.dmz.scl3.mozilla.com > Content-Type: application/octet-stream > Accept-Ranges: bytes > Access-Control-Allow-Origin: * > ETag: "f0434e-1728b62-4cf9e410d0f80" > Last-Modified: Thu, 29 Nov 2012 08:41:50 GMT > Content-Length: 24284002 > X-Cache-Info: cached > Cache-Control: max-age=78231 > Expires: Tue, 04 Dec 2012 20:12:15 GMT > Date: Mon, 03 Dec 2012 22:28:24 GMT > Connection: keep-alive > Length: 24284002 (23M) [application/octet-stream] > Saving to: `index.html?product=firefox-17.0.1-complete&os=win&lang=en-US.19' Each time I run the command I get a different bouncerN.webapp.phx1.mozilla.com but I always get ftp2.dmz.scl3.mozilla.com for the actual download. Sometimes the IP resolves to 173.223.52.177 or 173.223.52.195.
Comment 5•12 years ago
|
||
Here the failing update test in the Jenkins CI: http://10.250.73.243:8080/job/ondemand_update/5233/console *** AUS:SVC Downloader:downloadUpdate - downloading from http://download.mozilla.org/?product=firefox-17.0.1-complete&os=win&lang=en-US&force=1 to C:\Users\mozauto\AppData\Local\Mozilla\Firefox\firefox\updates\0\update.mar TEST-PASS | c:\users\mozauto\appdata\local\temp\tmpiy36pt.mozmill-tests\tests\update\testFallbackUpdate\test3.js | test3.js::setupModule TEST-START | c:\users\mozauto\appdata\local\temp\tmpiy36pt.mozmill-tests\tests\update\testFallbackUpdate\test3.js | testFallbackUpdate_ErrorPatching *** AUS:SVC Downloader:onStopRequest - original URI spec: http://download.mozilla.org/?product=firefox-17.0.1-complete&os=win&lang=en-US&force=1, final URI spec: http://download.cdn.mozilla.net/pub/mozilla.org/firefox/releases/17.0.1/update/win32/en-US/firefox-17.0.1.complete.mar, status: 2152398850 *** AUS:SVC Downloader:onStopRequest - setting state to: downloading *** AUS:UI gDownloadingPage:onStopRequest - pausing download *** AUS:SVC Downloader:onStartRequest - original URI spec: http://download.mozilla.org/?product=firefox-17.0.1-complete&os=win&lang=en-US&force=1, final URI spec: http://download.mozilla.org/?product=firefox-17.0.1-complete&os=win&lang=en-US&force=1 *** AUS:SVC Downloader:onStopRequest - original URI spec: http://download.mozilla.org/?product=firefox-17.0.1-complete&os=win&lang=en-US&force=1, final URI spec: http://download.mozilla.org/?product=firefox-17.0.1-complete&os=win&lang=en-US&force=1, status: 2147549183 *** AUS:SVC Downloader:onStopRequest - non-verification failure *** AUS:SVC getStatusTextFromCode - transfer error: Failed (unknown reason), default code: 2152398849 *** AUS:SVC Downloader:onStopRequest - setting state to: download-failed *** AUS:UI gErrorsPage:onPageShow - update.statusText: Failed (unknown reason) *** AUS:SVC Downloader:onStopRequest - all update patch downloads failed
Comment 6•12 years ago
|
||
(In reply to Anthony Hughes, Mozilla QA (:ashughes) from comment #4) The X-Backend-Server of ftp2.dmz is a red-herring here. Your request isn't actually hitting that, it's the cached response from when the CDN grabbed the file. The important line is 'Connecting to download.cdn.mozilla.net ...', specifically the IP it resolves to. That domain is actually two or three different CDN providers, and you get the one that a company called Cedexis thinks is best for your location. In this case you got Akamai (172.222.*.* and 172.223.*.*). Really we're looking the IP when a download like this fails. (In reply to Henrik Skupin (:whimboo) from comment #5) How does this differ from a fallback test that passes ? Is 'status: 2152398850' normal ?
Comment 7•12 years ago
|
||
(In reply to Nick Thomas [:nthomas] from comment #6) > (In reply to Henrik Skupin (:whimboo) from comment #5) > How does this differ from a fallback test that passes ? Is 'status: > 2152398850' normal ? 2152398849 == 0x804B0001 2152398850 == 0x804B0002 https://developer.mozilla.org/en-US/docs/Table_Of_Errors#Network_Errors NS_BINDING_FAILED (0x804B0001) NS_BINDING_ABORTED (0x804B0002) The asynchronous request failed because it was aborted by some user action. Not sure what 2152398850 actually is but we can also see it in passing update tests as you can see here: http://10.250.73.243:8080/job/ondemand_update/5252/console spec: http://download.mozilla.org/?product=firefox-17.0.1-complete&os=win&lang=en-US&force=1, status: 2152398850 *** AUS:SVC Downloader:onStopRequest - setting state to: downloading *** AUS:UI gDownloadingPage:onStopRequest - pausing download *** AUS:SVC Downloader:onStartRequest - original URI spec: http://download.mozilla.org/?product=firefox-17.0.1-complete&os=win&lang=en-US&force=1, final URI spec: http://download.cdn.mozilla.net/pub/mozilla.org/firefox/releases/17.0.1/update/win32/en-US/firefox-17.0.1.complete.mar *** AUS:SVC Downloader:onProgress - progress: 768982/24284002 *** AUS:SVC Downloader:onProgress - progress: 6324282/24284002 *** AUS:SVC Downloader:onProgress - progress: 12459202/24284002 *** AUS:SVC Downloader:onProgress - progress: 18220362/24284002 *** AUS:SVC Downloader:onProgress - progress: 24225342/24284002 *** AUS:SVC Downloader:onProgress - progress: 24284002/24284002 *** AUS:SVC Downloader:onStopRequest - original URI spec: http://download.mozilla.org/?product=firefox-17.0.1-complete&os=win&lang=en-US&force=1, final URI spec: http://download.cdn.mozilla.net/pub/mozilla.org/firefox/releases/17.0.1/update/win32/en-US/firefox-17.0.1.complete.mar, status: 0 The only difference is that we finish with 'status: 0'. Robert, could you please tell us what we should look for to get this problem investigated and fixed? Thanks.
Flags: needinfo?(robert.bugzilla)
Comment 8•12 years ago
|
||
Is anyone at all able to reproduce without using mozmill?
Flags: needinfo?(robert.bugzilla)
Comment 9•12 years ago
|
||
(In reply to Robert Strong [:rstrong] (do not email) from comment #8) > Is anyone at all able to reproduce without using mozmill? Rob, see the initial comment. Anthony was able to reproduce it manually.
Comment 10•12 years ago
|
||
Can you provide the same log output as provided in comment #7 for a failure?
Updated•12 years ago
|
Flags: needinfo?
Updated•12 years ago
|
Flags: needinfo? → needinfo?(anthony.s.hughes)
Reporter | ||
Comment 11•12 years ago
|
||
I'm not able to reproduce this manually anymore. I tried 40 times without failure. It's quite tedious and time consuming. Given this is happening far less in automation (1/27 runs yesterday vs 14/14 runs on release day) I'm not sure it's worth investigating further.
Flags: needinfo?(anthony.s.hughes)
Comment 12•12 years ago
|
||
I have been told in passing that the CDN sometimes serves the wrong bits though I have no confirmation of this. If you can get the AUS error console messages for a failure I can look at this further.
Comment 13•12 years ago
|
||
I also saw (in another bug) there was a proxy added between some QA machines and the outside internet. Does that apply to the mozmill machines ?
Comment 14•12 years ago
|
||
(In reply to Nick Thomas [:nthomas] from comment #13) > I also saw (in another bug) there was a proxy added between some QA machines > and the outside internet. Does that apply to the mozmill machines ? Not sure which QA machines you are referring but none of our mozmill machines have a proxy yet. But it will change soon when we are moving our CI to SCL3.
Comment 15•12 years ago
|
||
Is this reproducible at all anymore (through mozmill or otherwise)?
Comment 16•12 years ago
|
||
It happens intermittently so no way to get it reproduced. At least it didn't happen over the weekend by our mozmill tests.
Comment 17•12 years ago
|
||
(In reply to Robert Strong [:rstrong] (do not email) from comment #12) > I have been told in passing that the CDN sometimes serves the wrong bits > though I have no confirmation of this. If you can get the AUS error console > messages for a failure I can look at this further. Are the messages from comment #5 or #7 not enough? I'm not sure I see why having them driven by Mozmill makes a difference...
Comment 18•12 years ago
|
||
The comments in comment #5 shows a partial log from a failed mozmill test and comment #7 shows a successful update. It doesn't give me enough to go on. We've had other cases where mozmill has had one-off issues in the past. My bet is on either the CDN sometime sending the wrong file or possibly something like the caching bug.
Comment 19•12 years ago
|
||
(In reply to Robert Strong [:rstrong] (do not email) from comment #18) > The comments in comment #5 shows a partial log from a failed mozmill test > and comment #7 shows a successful update. It doesn't give me enough to go > on. Then you should have taken a look at the console output as I have mentioned in both of my comments. It would have contained all the necessary information. Now the logs have been removed due to log cleanup. We will have to wait for the next occurrence then. I will then make sure to attach the whole console log as attachment to the bug.
Comment 20•12 years ago
|
||
I did look at it... what I am saying is that it isn't enough to figure out what is going on. Once again, "My bet is on either the CDN sometime sending the wrong file or possibly something like the caching bug."
Comment 21•12 years ago
|
||
Oh, ok. Sorry in that case.
Comment 22•12 years ago
|
||
(In reply to Henrik Skupin (:whimboo) from comment #19) > (In reply to Robert Strong [:rstrong] (do not email) from comment #18) > > The comments in comment #5 shows a partial log from a failed mozmill test > > and comment #7 shows a successful update. It doesn't give me enough to go > > on. > > Then you should have taken a look at the console output as I have mentioned > in both of my comments. It would have contained all the necessary > information. Now the logs have been removed due to log cleanup. We will have > to wait for the next occurrence then. I will then make sure to attach the > whole console log as attachment to the bug. Yes, please include all necessary information in the debugging on the bug. Sounds like there's nothing else we can do here, hooray for unsolved mysteries!
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → INCOMPLETE
Comment 23•12 years ago
|
||
(In reply to Robert Strong [:rstrong] (do not email) from comment #20) > I did look at it... what I am saying is that it isn't enough to figure out > what is going on. Once again, "My bet is on either the CDN sometime sending > the wrong file or possibly something like the caching bug." Rob, is there something which we can add to our Mozmill tests to check in general? If yes, we can include that so we might get the necessary information the next time the issue occurs.
Comment 24•12 years ago
|
||
Difficult to say since this is likely a server side issue. It would be better to have a test against the servers outside of app update that made sure it was behaving properly.
Comment 25•12 years ago
|
||
(In reply to Robert Strong [:rstrong] (do not email) from comment #24) > Difficult to say since this is likely a server side issue. It would be > better to have a test against the servers outside of app update that made > sure it was behaving properly. Well that's part of the problem, isn't it? We can't reliably reproduce this. What if we turned on some additional lower level network logging in addition to update logging or something? It seems to me that if we could get raw request/response bodies we'd be closer to being able to say precisely what's happened here.
Comment 26•12 years ago
|
||
(In reply to Ben Hearsum [:bhearsum] from comment #25) > (In reply to Robert Strong [:rstrong] (do not email) from comment #24) > > Difficult to say since this is likely a server side issue. It would be > > better to have a test against the servers outside of app update that made > > sure it was behaving properly. > > Well that's part of the problem, isn't it? We can't reliably reproduce this. > What if we turned on some additional lower level network logging in addition > to update logging or something? It seems to me that if we could get raw > request/response bodies we'd be closer to being able to say precisely what's > happened here. Agreed and another part of the problem is that we are trying to use app update to diagnose server side issues that should likely be diagnosed using networking code as you state or other methods to verify the servers. Getting app update out of the way when checking server side issues would be a good thing as I see it since it would make it possible for the right people whether they are server admins or networking devs to focus on those issues instead of app update devs and allow app update devs to focus on the app update problems.
Comment 27•12 years ago
|
||
(In reply to Robert Strong [:rstrong] (do not email) from comment #26) > (In reply to Ben Hearsum [:bhearsum] from comment #25) > > (In reply to Robert Strong [:rstrong] (do not email) from comment #24) > > > Difficult to say since this is likely a server side issue. It would be > > > better to have a test against the servers outside of app update that made > > > sure it was behaving properly. > > > > Well that's part of the problem, isn't it? We can't reliably reproduce this. > > What if we turned on some additional lower level network logging in addition > > to update logging or something? It seems to me that if we could get raw > > request/response bodies we'd be closer to being able to say precisely what's > > happened here. > Agreed and another part of the problem is that we are trying to use app > update to diagnose server side issues that should likely be diagnosed using > networking code as you state or other methods to verify the servers. Getting > app update out of the way when checking server side issues would be a good > thing 100% agreed. It's not good that you and other update devs get pulled into these bugs when we're not even sure if the servers are functioning correctly. If we had better information in the logs (be it from the updater itself or something more low level) we could be more sure about what's going on, maybe even find a way to repro, and leave you folks out of things until we're more certain that there's code issues. If you have any pointers on how to get such information that'd be great. If not, I'll ask around.
Comment 28•12 years ago
|
||
I filed bug 821007 on getting Mozmill to log more information about update requests.
Comment 29•12 years ago
|
||
I will see what I can do on bug 821007. Hopefully we will have something soon so we can catch other instances of this problem. Thanks Ben!
Depends on: 821007
Comment 30•11 years ago
|
||
We hit the following failure today with a fr Nightly build of Firefox: http://10.22.73.128:8080/job/mozilla-central_update/2202/console (MPT) *** AUS:SVC Downloader:onProgress - progress: 28677919/28677919 *** AUS:SVC Downloader:onStopRequest - original URI spec: http://ftp.mozilla.org/pub/mozilla.org/firefox/nightly/2013/04/2013-04-03-09-09-50-mozilla-central-l10n/firefox-23.0a1.fr.win32.complete.mar, final URI spec: http://ftp.mozilla.org/pub/mozilla.org/firefox/nightly/2013/04/2013-04-03-09-09-50-mozilla-central-l10n/firefox-23.0a1.fr.win32.complete.mar, status: 0 *** AUS:SVC Downloader:onStopRequest - status: 0, current fail: 0, max fail: 10, retryTimeout: 2000 *** AUS:SVC Downloader:_verifyDownload called *** AUS:SVC Downloader:_verifyDownload downloaded size == expected size. *** AUS:SVC Downloader:_verifyDownload hashes do not match. *** AUS:SVC Downloader:onStopRequest - download verification failed *** AUS:SVC getStatusTextFromCode - transfer error: L'intégrité de la mise à jour n'a pas pu être vérifiée, code: verification_failed *** AUS:SVC Downloader:onStopRequest - setting state to: download-failed *** AUS:UI gErrorsPage:onPageShow - update.statusText: L'intégrité de la mise à jour n'a pas pu être vérifiée *** AUS:SVC Downloader:onStopRequest - all update patch downloads failed Is it something I should file a new bug on?
Comment 31•11 years ago
|
||
I looked at the full console log for job 2202, and also 2201. In both the complete.mar download was interrupted several times. Perhaps it's related to the proxy you're using now ?
Comment 32•11 years ago
|
||
We are using the new CI system already since January and it is the first time we have seen this now. I can watch it closely and will make sure to file a new bug if it happens again. It really seems to be unrelated to this issue.
Assignee | ||
Updated•11 years ago
|
Product: mozilla.org → Release Engineering
You need to log in
before you can comment on or make changes to this bug.
Description
•