Closed Bug 817688 Opened 12 years ago Closed 12 years ago

Firefox 15.0.1 -> 17.0.1 fallback updates fail intermittently

Categories

(Release Engineering :: Release Requests, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED INCOMPLETE

People

(Reporter: u279076, Unassigned)

References

Details

When testing the Firefox 17.0.1 live updates on Friday nearly all of our update tests for 15.0.1 -> 17.0.1 failed in automation. I tried to reproduce the behaviour manually and found that the failure happens when trying to connect to the update server to grab the complete snippet. I was able to reproduce this only a couple of times out of 10 attempts.

I tried to debug this with Ben Hearsum but we couldn't find anything which would indicate a problem. The update.xml file looked correct, the mar file size was correct, the snippet via AUS URL looked correct, and I was able to download the snippet manually by navigating to the download URL.

Unfortunately, I don't have any of the actual logs anymore as they were on pastebin which doesn't host the files for that long. If necessary, I can try to generate the logs again.

One of the Mozmill reports can be seen here:
http://mozmill-ondemand.blargon7.com/#/update/report/352218d7e3125c857fc1d371674ccfcd

Please advise how I can help investigate this further (if there's something worth investigating).
Component: Release Engineering → Release Engineering: Releases
QA Contact: bhearsum
bhearsum told me this was a problem downloading the complete.mar, which comes from the meta-CDN, so this might be quite hard to debug. Can you see if the failure is reproducible today using mozmill ? 

If so can you try either of
 curl -i "http://download.mozilla.org/?product=firefox-17.0.1-complete&os=win&lang=en-US"
or 
 wget -S "http://download.mozilla.org/?product=firefox-17.0.1-complete&os=win&lang=en-US"
to see if we can figure out which CDN is the issue.
(In reply to Nick Thomas [:nthomas] from comment #1)
> Can you see if the failure is reproducible today using mozmill ? 
I ran the test again, it happened once out of 27 testruns.

>  curl -i "http://download.mozilla.org/?product=firefox-17.0.1-complete&os=win&lang=en-US"
I tried this 50 times and each time it returned "HTTP/1.1 302 Found" along with a bunch of other information.

>  wget -S "http://download.mozilla.org/?product=firefox-17.0.1-complete&os=win&lang=en-US"
I tried this 50 times and each time the complete snippet was downloaded.
FTR, snippets are turned into an update.xml by AUS, while a mar file is quite separate.

The wget/curl calls were done on the machine that the single test (of 27) failed on ? After initial 302 from donwload.m.o, what was the HTTP code from the CDN ? Did you get the same CDN IP on those 50 tests ? Have you seen any other network flakiness on the mozmill machine ?
(In reply to Nick Thomas [:nthomas] from comment #3)
> FTR, snippets are turned into an update.xml by AUS, while a mar file is
> quite separate.
> 
> The wget/curl calls were done on the machine that the single test (of 27)
> failed on ? After initial 302 from donwload.m.o, what was the HTTP code from
> the CDN ? Did you get the same CDN IP on those 50 tests ? Have you seen any
> other network flakiness on the mozmill machine ?

Here's the entire output of one of the wgets:
> wget -S "http://download.mozilla.org/?product=firefox-17.0.1-complete&os=win&lang=en-US"
> --2012-12-03 14:28:24--  http://download.mozilla.org/?product=firefox-17.0.1-complete&os=win&lang=en-US
> Resolving download.mozilla.org (download.mozilla.org)... 63.245.217.39, 63.245.217.36, 2620:101:8008:5::2:a
> Connecting to download.mozilla.org (download.mozilla.org)|63.245.217.39|:80... connected.
> HTTP request sent, awaiting response... 
>   HTTP/1.1 302 Found
>   Server: Apache
>   X-Backend-Server: bouncer7.webapp.phx1.mozilla.com
>   Cache-Control: max-age=15
>   Content-Type: text/html; charset=UTF-8
>   Date: Mon, 03 Dec 2012 22:28:15 GMT
>   Location: http://download.cdn.mozilla.net/pub/mozilla.org/firefox/releases/17.0.1/update/win32/en-US/firefox-17.0.1.complete.mar
>   Keep-Alive: timeout=3, max=500
>   Content-Length: 0
>   Connection: Keep-Alive
>   X-Cache-Info: cached
> Location: http://download.cdn.mozilla.net/pub/mozilla.org/firefox/releases/17.0.1/update/win32/en-US/firefox-17.0.1.complete.mar [following]
> --2012-12-03 14:28:24--  http://download.cdn.mozilla.net/pub/mozilla.org/firefox/releases/17.0.1/update/win32/en-US/firefox-17.0.1.complete.mar
> Resolving download.cdn.mozilla.net (download.cdn.mozilla.net)... 173.223.52.177, 173.223.52.195
> Connecting to download.cdn.mozilla.net (download.cdn.mozilla.net)|173.223.52.177|:80... connected.
> HTTP request sent, awaiting response... 
>   HTTP/1.1 200 OK
>   Server: Apache
>   X-Backend-Server: ftp2.dmz.scl3.mozilla.com
>   Content-Type: application/octet-stream
>   Accept-Ranges: bytes
>   Access-Control-Allow-Origin: *
>   ETag: "f0434e-1728b62-4cf9e410d0f80"
>   Last-Modified: Thu, 29 Nov 2012 08:41:50 GMT
>   Content-Length: 24284002
>   X-Cache-Info: cached
>   Cache-Control: max-age=78231
>   Expires: Tue, 04 Dec 2012 20:12:15 GMT
>   Date: Mon, 03 Dec 2012 22:28:24 GMT
>   Connection: keep-alive
> Length: 24284002 (23M) [application/octet-stream]
> Saving to: `index.html?product=firefox-17.0.1-complete&os=win&lang=en-US.19'

Each time I run the command I get a different bouncerN.webapp.phx1.mozilla.com but I always get ftp2.dmz.scl3.mozilla.com for the actual download. Sometimes the IP resolves to 173.223.52.177 or 173.223.52.195.
Here the failing update test in the Jenkins CI:
http://10.250.73.243:8080/job/ondemand_update/5233/console

*** AUS:SVC Downloader:downloadUpdate - downloading from http://download.mozilla.org/?product=firefox-17.0.1-complete&os=win&lang=en-US&force=1 to C:\Users\mozauto\AppData\Local\Mozilla\Firefox\firefox\updates\0\update.mar
TEST-PASS | c:\users\mozauto\appdata\local\temp\tmpiy36pt.mozmill-tests\tests\update\testFallbackUpdate\test3.js | test3.js::setupModule
TEST-START | c:\users\mozauto\appdata\local\temp\tmpiy36pt.mozmill-tests\tests\update\testFallbackUpdate\test3.js | testFallbackUpdate_ErrorPatching
*** AUS:SVC Downloader:onStopRequest - original URI spec: http://download.mozilla.org/?product=firefox-17.0.1-complete&os=win&lang=en-US&force=1, final URI spec: http://download.cdn.mozilla.net/pub/mozilla.org/firefox/releases/17.0.1/update/win32/en-US/firefox-17.0.1.complete.mar, status: 2152398850
*** AUS:SVC Downloader:onStopRequest - setting state to: downloading
*** AUS:UI gDownloadingPage:onStopRequest - pausing download
*** AUS:SVC Downloader:onStartRequest - original URI spec: http://download.mozilla.org/?product=firefox-17.0.1-complete&os=win&lang=en-US&force=1, final URI spec: http://download.mozilla.org/?product=firefox-17.0.1-complete&os=win&lang=en-US&force=1
*** AUS:SVC Downloader:onStopRequest - original URI spec: http://download.mozilla.org/?product=firefox-17.0.1-complete&os=win&lang=en-US&force=1, final URI spec: http://download.mozilla.org/?product=firefox-17.0.1-complete&os=win&lang=en-US&force=1, status: 2147549183
*** AUS:SVC Downloader:onStopRequest - non-verification failure
*** AUS:SVC getStatusTextFromCode - transfer error: Failed (unknown reason), default code: 2152398849
*** AUS:SVC Downloader:onStopRequest - setting state to: download-failed
*** AUS:UI gErrorsPage:onPageShow - update.statusText: Failed (unknown reason)
*** AUS:SVC Downloader:onStopRequest - all update patch downloads failed
(In reply to Anthony Hughes, Mozilla QA (:ashughes) from comment #4)
The X-Backend-Server of ftp2.dmz is a red-herring here. Your request isn't actually hitting that, it's the cached response from when the CDN grabbed the file. 

The important line is 'Connecting to download.cdn.mozilla.net ...', specifically the IP it resolves to. That domain is actually two or three different CDN providers, and you get the one that a company called Cedexis thinks is best for your location. In this case you got Akamai (172.222.*.* and 172.223.*.*). Really we're looking the IP when a download like this fails.

(In reply to Henrik Skupin (:whimboo) from comment #5)
How does this differ from a fallback test that passes ? Is 'status: 2152398850' normal ?
(In reply to Nick Thomas [:nthomas] from comment #6)
> (In reply to Henrik Skupin (:whimboo) from comment #5)
> How does this differ from a fallback test that passes ? Is 'status:
> 2152398850' normal ?

2152398849 == 0x804B0001
2152398850 == 0x804B0002 

https://developer.mozilla.org/en-US/docs/Table_Of_Errors#Network_Errors

NS_BINDING_FAILED (0x804B0001)

NS_BINDING_ABORTED (0x804B0002)
    The asynchronous request failed because it was aborted by some user action.

Not sure what 2152398850 actually is but we can also see it in passing update tests as you can see here:

http://10.250.73.243:8080/job/ondemand_update/5252/console

spec: http://download.mozilla.org/?product=firefox-17.0.1-complete&os=win&lang=en-US&force=1, status: 2152398850
*** AUS:SVC Downloader:onStopRequest - setting state to: downloading
*** AUS:UI gDownloadingPage:onStopRequest - pausing download
*** AUS:SVC Downloader:onStartRequest - original URI spec: http://download.mozilla.org/?product=firefox-17.0.1-complete&os=win&lang=en-US&force=1, final URI spec: http://download.cdn.mozilla.net/pub/mozilla.org/firefox/releases/17.0.1/update/win32/en-US/firefox-17.0.1.complete.mar
*** AUS:SVC Downloader:onProgress - progress: 768982/24284002
*** AUS:SVC Downloader:onProgress - progress: 6324282/24284002
*** AUS:SVC Downloader:onProgress - progress: 12459202/24284002
*** AUS:SVC Downloader:onProgress - progress: 18220362/24284002
*** AUS:SVC Downloader:onProgress - progress: 24225342/24284002
*** AUS:SVC Downloader:onProgress - progress: 24284002/24284002
*** AUS:SVC Downloader:onStopRequest - original URI spec: http://download.mozilla.org/?product=firefox-17.0.1-complete&os=win&lang=en-US&force=1, final URI spec: http://download.cdn.mozilla.net/pub/mozilla.org/firefox/releases/17.0.1/update/win32/en-US/firefox-17.0.1.complete.mar, status: 0

The only difference is that we finish with 'status: 0'. 

Robert, could you please tell us what we should look for to get this problem investigated and fixed? Thanks.
Flags: needinfo?(robert.bugzilla)
Is anyone at all able to reproduce without using mozmill?
Flags: needinfo?(robert.bugzilla)
(In reply to Robert Strong [:rstrong] (do not email) from comment #8)
> Is anyone at all able to reproduce without using mozmill?

Rob, see the initial comment. Anthony was able to reproduce it manually.
Can you provide the same log output as provided in comment #7 for a failure?
Flags: needinfo?
Flags: needinfo? → needinfo?(anthony.s.hughes)
I'm not able to reproduce this manually anymore. I tried 40 times without failure. It's quite tedious and time consuming. Given this is happening far less in automation (1/27 runs yesterday vs 14/14 runs on release day) I'm not sure it's worth investigating further.
Flags: needinfo?(anthony.s.hughes)
I have been told in passing that the CDN sometimes serves the wrong bits though I have no confirmation of this. If you can get the AUS error console messages for a failure I can look at this further.
I also saw (in another bug) there was a proxy added between some QA machines and the outside internet. Does that apply to the mozmill machines ?
(In reply to Nick Thomas [:nthomas] from comment #13)
> I also saw (in another bug) there was a proxy added between some QA machines
> and the outside internet. Does that apply to the mozmill machines ?

Not sure which QA machines you are referring but none of our mozmill machines have a proxy yet. But it will change soon when we are moving our CI to SCL3.
Is this reproducible at all anymore (through mozmill or otherwise)?
It happens intermittently so no way to get it reproduced. At least it didn't happen over the weekend by our mozmill tests.
(In reply to Robert Strong [:rstrong] (do not email) from comment #12)
> I have been told in passing that the CDN sometimes serves the wrong bits
> though I have no confirmation of this. If you can get the AUS error console
> messages for a failure I can look at this further.

Are the messages from comment #5 or #7 not enough? I'm not sure I see why having them driven by Mozmill makes a difference...
The comments in comment #5 shows a partial log from a failed mozmill test and comment #7 shows a successful update. It doesn't give me enough to go on. We've had other cases where mozmill has had one-off issues in the past. My bet is on either the CDN sometime sending the wrong file or possibly something like the caching bug.
(In reply to Robert Strong [:rstrong] (do not email) from comment #18)
> The comments in comment #5 shows a partial log from a failed mozmill test
> and comment #7 shows a successful update. It doesn't give me enough to go
> on. 

Then you should have taken a look at the console output as I have mentioned in both of my comments. It would have contained all the necessary information. Now the logs have been removed due to log cleanup. We will have to wait for the next occurrence then. I will then make sure to attach the whole console log as attachment to the bug.
I did look at it... what I am saying is that it isn't enough to figure out what is going on. Once again, "My bet is on either the CDN sometime sending the wrong file or possibly something like the caching bug."
Oh, ok. Sorry in that case.
(In reply to Henrik Skupin (:whimboo) from comment #19)
> (In reply to Robert Strong [:rstrong] (do not email) from comment #18)
> > The comments in comment #5 shows a partial log from a failed mozmill test
> > and comment #7 shows a successful update. It doesn't give me enough to go
> > on. 
> 
> Then you should have taken a look at the console output as I have mentioned
> in both of my comments. It would have contained all the necessary
> information. Now the logs have been removed due to log cleanup. We will have
> to wait for the next occurrence then. I will then make sure to attach the
> whole console log as attachment to the bug.

Yes, please include all necessary information in the debugging on the bug.

Sounds like there's nothing else we can do here, hooray for unsolved mysteries!
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → INCOMPLETE
(In reply to Robert Strong [:rstrong] (do not email) from comment #20)
> I did look at it... what I am saying is that it isn't enough to figure out
> what is going on. Once again, "My bet is on either the CDN sometime sending
> the wrong file or possibly something like the caching bug."

Rob, is there something which we can add to our Mozmill tests to check in general? If yes, we can include that so we might get the necessary information the next time the issue occurs.
Difficult to say since this is likely a server side issue. It would be better to have a test against the servers outside of app update that made sure it was behaving properly.
(In reply to Robert Strong [:rstrong] (do not email) from comment #24)
> Difficult to say since this is likely a server side issue. It would be
> better to have a test against the servers outside of app update that made
> sure it was behaving properly.

Well that's part of the problem, isn't it? We can't reliably reproduce this. What if we turned on some additional lower level network logging in addition to update logging or something? It seems to me that if we could get raw request/response bodies we'd be closer to being able to say precisely what's happened here.
(In reply to Ben Hearsum [:bhearsum] from comment #25)
> (In reply to Robert Strong [:rstrong] (do not email) from comment #24)
> > Difficult to say since this is likely a server side issue. It would be
> > better to have a test against the servers outside of app update that made
> > sure it was behaving properly.
> 
> Well that's part of the problem, isn't it? We can't reliably reproduce this.
> What if we turned on some additional lower level network logging in addition
> to update logging or something? It seems to me that if we could get raw
> request/response bodies we'd be closer to being able to say precisely what's
> happened here.
Agreed and another part of the problem is that we are trying to use app update to diagnose server side issues that should likely be diagnosed using networking code as you state or other methods to verify the servers. Getting app update out of the way when checking server side issues would be a good thing as I see it since it would make it possible for the right people whether they are server admins or networking devs to focus on those issues instead of app update devs and allow app update devs to focus on the app update problems.
(In reply to Robert Strong [:rstrong] (do not email) from comment #26)
> (In reply to Ben Hearsum [:bhearsum] from comment #25)
> > (In reply to Robert Strong [:rstrong] (do not email) from comment #24)
> > > Difficult to say since this is likely a server side issue. It would be
> > > better to have a test against the servers outside of app update that made
> > > sure it was behaving properly.
> > 
> > Well that's part of the problem, isn't it? We can't reliably reproduce this.
> > What if we turned on some additional lower level network logging in addition
> > to update logging or something? It seems to me that if we could get raw
> > request/response bodies we'd be closer to being able to say precisely what's
> > happened here.
> Agreed and another part of the problem is that we are trying to use app
> update to diagnose server side issues that should likely be diagnosed using
> networking code as you state or other methods to verify the servers. Getting
> app update out of the way when checking server side issues would be a good
> thing

100% agreed. It's not good that you and other update devs get pulled into these bugs when we're not even sure if the servers are functioning correctly. If we had better information in the logs (be it from the updater itself or something more low level) we could be more sure about what's going on, maybe even find a way to repro, and leave you folks out of things until we're more certain that there's code issues. If you have any pointers on how to get such information that'd be great. If not, I'll ask around.
I filed bug 821007 on getting Mozmill to log more information about update requests.
I will see what I can do on bug 821007. Hopefully we will have something soon so we can catch other instances of this problem. Thanks Ben!
Depends on: 821007
We hit the following failure today with a fr Nightly build of Firefox:

http://10.22.73.128:8080/job/mozilla-central_update/2202/console (MPT)

*** AUS:SVC Downloader:onProgress - progress: 28677919/28677919
*** AUS:SVC Downloader:onStopRequest - original URI spec: http://ftp.mozilla.org/pub/mozilla.org/firefox/nightly/2013/04/2013-04-03-09-09-50-mozilla-central-l10n/firefox-23.0a1.fr.win32.complete.mar, final URI spec: http://ftp.mozilla.org/pub/mozilla.org/firefox/nightly/2013/04/2013-04-03-09-09-50-mozilla-central-l10n/firefox-23.0a1.fr.win32.complete.mar, status: 0
*** AUS:SVC Downloader:onStopRequest - status: 0, current fail: 0, max fail: 10, retryTimeout: 2000
*** AUS:SVC Downloader:_verifyDownload called
*** AUS:SVC Downloader:_verifyDownload downloaded size == expected size.
*** AUS:SVC Downloader:_verifyDownload hashes do not match. 
*** AUS:SVC Downloader:onStopRequest - download verification failed
*** AUS:SVC getStatusTextFromCode - transfer error: L'intégrité de la mise à jour n'a pas pu être vérifiée, code: verification_failed
*** AUS:SVC Downloader:onStopRequest - setting state to: download-failed
*** AUS:UI gErrorsPage:onPageShow - update.statusText: L'intégrité de la mise à jour n'a pas pu être vérifiée
*** AUS:SVC Downloader:onStopRequest - all update patch downloads failed

Is it something I should file a new bug on?
I looked at the full console log for job 2202, and also 2201. In both the complete.mar download was interrupted several times. Perhaps it's related to the proxy you're using now ?
We are using the new CI system already since January and it is the first time we have seen this now. I can watch it closely and will make sure to file a new bug if it happens again. It really seems to be unrelated to this issue.
Product: mozilla.org → Release Engineering
You need to log in before you can comment on or make changes to this bug.