Closed Bug 836044 Opened 9 years ago Closed 9 years ago

Aurora stub installer doesn't seem to be working

Categories

(Firefox :: Installer, defect)

20 Branch
x86
Windows 7
defect
Not set
normal

Tracking

()

VERIFIED FIXED
Firefox 22
Tracking Status
firefox20 - verified
firefox21 + verified
firefox22 --- verified
firefox-esr17 --- wontfix
b2g18 --- wontfix
b2g18-v1.0.0 --- wontfix
b2g18-v1.0.1 --- wontfix

People

(Reporter: jbecerra, Assigned: robert.strong.bugs)

References

Details

(Whiteboard: [stub+])

I've been trying to install Aurora on a Windows 7 VM today, and I haven't been able to successfully run the stub installer to completion. It stalls at "Installing."

I tried at least 5 times from here: http://www.mozilla.org/en-US/firefox/channel/#aurora
The aurora installer that is downloaded by the stub appears to be corrupt.
I've also had difficulty getting the installer downloaded from (it stopped at about 18.2 MB)
http://download.cdn.mozilla.net/pub/mozilla.org/firefox/nightly/latest-mozilla-aurora/firefox-20.0a2.en-US.win32.installer.exe
I highly suspect a change with something server side especially since we haven't changed anything in the stub since the time I verified it was working last.
(In reply to juan becerra [:juanb] from comment #0)
> I've been trying to install Aurora on a Windows 7 VM today, and I haven't
> been able to successfully run the stub installer to completion. It stalls at
> "Installing."

If it's stalling after downloading, it doesn't seem likely to be a delivery problem. Could it be a bad build of the full installer or something?

Aurora worked for me just now, start to finish via stub installer.


(In reply to Robert Strong [:rstrong] (do not email) from comment #1)
> The aurora installer that is downloaded by the stub appears to be corrupt.

Is there any way to verify this apart from using the stub installer?


(In reply to Robert Strong [:rstrong] (do not email) from comment #2)
> I've also had difficulty getting the installer downloaded from (it stopped
> at about 18.2 MB)
> http://download.cdn.mozilla.net/pub/mozilla.org/firefox/nightly/latest-
> mozilla-aurora/firefox-20.0a2.en-US.win32.installer.exe

Works for me at the moment. If you can duplicate, please let me know what IP "download.cdn.mozilla.net" is resolving to from the machine that is having trouble, at the time it is having trouble.
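For anyone wanting to capture the resolved IP at the moment of failure, a minimal sketch along these lines could be run on the affected machine. It uses only Python's standard library; the function name `resolve_ipv4` and the injectable `getaddrinfo` parameter are illustrative assumptions (the latter only lets the logic be exercised without a network), not anything the stub installer provides.

```python
import socket


def resolve_ipv4(hostname, getaddrinfo=socket.getaddrinfo):
    """Return the sorted, de-duplicated IPv4 addresses a hostname resolves to."""
    infos = getaddrinfo(hostname, 80, socket.AF_INET, socket.SOCK_STREAM)
    # Each entry is (family, type, proto, canonname, sockaddr);
    # for IPv4 the sockaddr is an (ip, port) pair.
    return sorted({info[4][0] for info in infos})


# Example call on the machine that is having trouble:
#   resolve_ipv4("download.cdn.mozilla.net")
```

Because CDN DNS answers rotate, the lookup would need to be repeated right when a download fails for the result to be meaningful.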


(In reply to Robert Strong [:rstrong] (do not email) from comment #3)
> I highly suspect a change with something server side especially since we
> haven't changed anything in the stub since the time I verified it was
> working last.

No changes that I'm aware of. We don't generally mess with product delivery unilaterally, except in the case of major outages/emergencies.

Of course that doesn't mean there isn't an error... only that I've nothing to go on in that direction.


If we could get some sort of logging or verbose output from the stub installer, that would help tremendously in troubleshooting this sort of thing. Seems like we've gone through some sort of trouble several times now, and it's always a bit of a guessing game as to what the problem is.
(In reply to Jake Maul [:jakem] from comment #4)
> (In reply to juan becerra [:juanb] from comment #0)
> > I've been trying to install Aurora on a Windows 7 VM today, and I haven't
> > been able to successfully run the stub installer to completion. It stalls at
> > "Installing."
> 
> If it's stalling after downloading, it doesn't seem likely to be a delivery
> problem. Could it be a bad build of the full installer or something?
I checked the download from the stub and it was corrupted.

> 
> Aurora worked for me just now, start to finish via stub installer.
It had not been working for me (several tries)... I just tried it again with the same stub and it is now working.

> 
> 
> (In reply to Robert Strong [:rstrong] (do not email) from comment #1)
> > The aurora installer that is downloaded by the stub appears to be corrupt.
> 
> Is there any way to verify this apart from using the stub installer?
No idea if there are checks for the binaries distributed to the servers or if bouncer is redirecting correctly.

> 
> (In reply to Robert Strong [:rstrong] (do not email) from comment #2)
> > I've also had difficulty getting the installer downloaded from (it stopped
> > at about 18.2 MB)
> > http://download.cdn.mozilla.net/pub/mozilla.org/firefox/nightly/latest-
> > mozilla-aurora/firefox-20.0a2.en-US.win32.installer.exe
> 
> Works for me at the moment. If you can duplicate, please let me know what IP
> "download.cdn.mozilla.net" is resolving to from the machine that is having
> trouble, at the time it is having trouble.
Will try if I run into it... Juan will more likely see it than I will.

> 
> 
> (In reply to Robert Strong [:rstrong] (do not email) from comment #3)
> > I highly suspect a change with something server side especially since we
> > haven't changed anything in the stub since the time I verified it was
> > working last.
> 
> No changes that I'm aware of. We don't generally mess with product delivery
> unilaterally, except in the case of major outages/emergencies.
> 
> Of course that doesn't mean there isn't an error... only that I've nothing
> to go on in that direction.
> 
> 
> If we could get some sort of logging or verbose output from the stub
> installer, that would help tremendously in troubleshooting this sort of
> thing. Seems like we've gone through some sort of trouble several times now,
> and it's always a bit of a guessing game as to what the problem is.
The last couple of times iirc it was bouncer. Perhaps there could be a process to verify that is working correctly? I am adding more logging to the stub but it seems like there should be something other than the stub for verifying the server side especially since we want the stub to remain small in size.
I haven't been able to reproduce this again in the last half hour, however I've noticed that my machine cursor becomes a little spinner once it has crossed the "downloading" line and begun the installation process. That's not something I was seeing earlier, so perhaps the file never quite finished downloading despite the progress indicator being right on the line.

I will keep trying for a little bit, but unless I can observe this again, I don't know how to proceed.
The OS changing the cursor to a spinner is fairly typical (I see it often) at the start of the install, since we launch an external process. It is expected.
I tried this today with the stub installer for Aurora on Windows 7 64 and I don't see this problem. It works as expected.
We'll track and wait for the builds from comment 5 that rs mentions as having extra logging. Understood that this doesn't appear to be an issue currently, but may be intermittent.
Can we confirm this is no longer occurring?  Happy to keep this tracked for the rest of the week until FF 20 moves to Beta but after that we should either continue tracking on 21 once it lands on Aurora or resolve this WFM.
Keywords: qawanted
Flags: needinfo?(jbecerra)
I tried this about 20 times on a couple of machines, and I was able to reproduce the problem once.
Flags: needinfo?(jbecerra)
Passing to Rob to see if we can get "some sort of logging or verbose output from the stub installer" to help Jake with troubleshooting here?

Also, moving tracking to 21 as it will be moving to Aurora channel on Tuesday's Merge Day.
Assignee: nobody → robert.bugzilla
The best way to get verbose logging would be to use wireshark and capture the download until it is reproduced. I'll try to do so when I have the time.
I was able to reproduce this while running Wireshark. I'm uploading the log file to a Dropbox location, and once that's done I'll post the link here.
Robert, the log is in the following link. Let me know if there's anything else you need: http://dl.dropbox.com/u/143596/20130219-aurora-stub-stuck.pcapng
Jake, while debugging this I noticed that some of the servers have nightly builds that are a couple of days old.

download.cdn.mozilla.net IP's that should have Build ID 20130218031106
63.236.253.19 has Build ID 20130216031127
204.93.47.59 has Build ID 20130216031127

Possibly others as well.
I believe it should have actually been a 20130219 build (20130219031055?) across the board. Is the cdn usually a day behind?
Note: so far it is failing to download the majority of the time with 93.184.215.248 and has succeeded 5 out of 5 times with 63.236.253.19.
I just tested several of the IP addresses that are being returned with the following results (there were many more of the same before I actually started counting):

165.254.94.64 download.cdn.mozilla.net  # Good 5 out of 5
165.254.94.16 download.cdn.mozilla.net  # Good 5 out of 5
63.236.253.24 download.cdn.mozilla.net  # Good 5 out of 5
205.234.218.40 download.cdn.mozilla.net # Good 5 out of 5
63.236.253.49 download.cdn.mozilla.net  # Good 5 out of 5
63.236.253.19 download.cdn.mozilla.net  # Good 5 out of 5
209.211.216.24 download.cdn.mozilla.net # Good 5 out of 5
93.184.215.248 download.cdn.mozilla.net # Bad 10 out of 10

I added a bunch of logging to the stub installer and it is the same when it succeeds as when it fails.
(In reply to Robert Strong [:rstrong] (do not email) from comment #19)
> 93.184.215.248 download.cdn.mozilla.net # Bad 10 out of 10
> 
> I added a bunch of logging to the stub installer and it is the same when
> it succeeds as when it fails.

Sending over to Jake. Thanks Rob!
Assignee: robert.bugzilla → nmaul
Excellent, that is precisely the data I need, thank you.

All of those working IPs are Akamai. The failing one is Edgecast.

Just to confirm, we're still looking at this URL, right?
http://download.cdn.mozilla.net/pub/mozilla.org/firefox/nightly/latest-mozilla-aurora/firefox-20.0a2.en-US.win32.installer.exe

Comparing the md5sum's from each of those IPs downloading that file (as well as the FTP cluster directly), here's what I get:

MD5  (165.254.94.64.exe)   =  a7df4df13300adfa59b9ea9c914f5740
MD5  (165.254.94.16.exe)   =  a7df4df13300adfa59b9ea9c914f5740
MD5  (63.236.253.24.exe)   =  a7df4df13300adfa59b9ea9c914f5740
MD5  (205.234.218.40.exe)  =  a7df4df13300adfa59b9ea9c914f5740
MD5  (63.236.253.49.exe)   =  a7df4df13300adfa59b9ea9c914f5740
MD5  (63.236.253.19.exe)   =  a7df4df13300adfa59b9ea9c914f5740
MD5  (209.211.216.24.exe)  =  a7df4df13300adfa59b9ea9c914f5740
MD5  (93.184.215.248.exe)  =  fd9ebd7ca34854e1fc7847114fdda892
MD5  (ftp.exe)             =  fd9ebd7ca34854e1fc7847114fdda892

File size on that one differs slightly, too:
-rw-r--r--  1  jakemaul  staff  21179216  Feb  20  10:36  165.254.94.64.exe
-rw-r--r--  1  jakemaul  staff  21179216  Feb  20  10:36  165.254.94.16.exe
-rw-r--r--  1  jakemaul  staff  21179216  Feb  20  10:37  63.236.253.24.exe
-rw-r--r--  1  jakemaul  staff  21179216  Feb  20  10:37  205.234.218.40.exe
-rw-r--r--  1  jakemaul  staff  21179216  Feb  20  10:37  63.236.253.49.exe
-rw-r--r--  1  jakemaul  staff  21179216  Feb  20  10:37  63.236.253.19.exe
-rw-r--r--  1  jakemaul  staff  21179216  Feb  20  10:41  209.211.216.24.exe
-rw-r--r--  1  jakemaul  staff  21179024  Feb  20  10:41  93.184.215.248.exe
-rw-r--r--  1  jakemaul  staff  21179024  Feb  20  10:59  ftp.exe

That explains why the stub installer fails *after* downloading, in the "Installing" phase. The delivery is fine... the contents being delivered are faulty somehow.
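The checksum and size comparison above can be sketched roughly as follows. This is an illustration only: `fingerprint` and `group_by_fingerprint` are made-up names, and the downloads are assumed to have already been read into memory as bytes (for example with `pathlib.Path(...).read_bytes()`), keyed by the IP or source they came from.

```python
import hashlib
from collections import defaultdict


def fingerprint(data):
    """Return (md5 hex digest, size in bytes) for one downloaded blob."""
    return hashlib.md5(data).hexdigest(), len(data)


def group_by_fingerprint(downloads):
    """Map each distinct fingerprint to the sorted source labels that served it.

    downloads: dict of label (e.g. an IP address) -> file contents as bytes.
    """
    groups = defaultdict(list)
    for label, data in downloads.items():
        groups[fingerprint(data)].append(label)
    return {fp: sorted(labels) for fp, labels in groups.items()}
```

More than one key in the result means the sources are not all serving the same bytes, which is the situation shown in the listings above.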


Response headers for one of the good Akamai nodes, the bad Edgecast node, and the FTP cluster:

Akamai:
< HTTP/1.1 200 OK
< Server: Apache
< X-Backend-Server: ftp2.dmz.scl3.mozilla.com
< Content-Type: application/octet-stream
< Accept-Ranges: bytes
< Access-Control-Allow-Origin: *
< ETag: "1e522bc-1432b50-4d5ee1ece642e"
< Last-Modified: Sun, 17 Feb 2013 16:30:02 GMT
< X-Cache-Info: caching
< Content-Length: 21179216
< Cache-Control: max-age=164507
< Expires: Fri, 22 Feb 2013 15:23:08 GMT
< Date: Wed, 20 Feb 2013 17:41:21 GMT
< Connection: keep-alive

Edgecast:
< HTTP/1.1 200 OK
< Accept-Ranges: bytes
< Access-Control-Allow-Origin: *
< Cache-Control: max-age=345600
< Content-Type: application/octet-stream
< Date: Wed, 20 Feb 2013 17:41:37 GMT
< ETag: "834f22-1432a90-4d6169c610688"
< Expires: Sun, 24 Feb 2013 17:41:37 GMT
< Last-Modified: Tue, 19 Feb 2013 16:48:28 GMT
< Server: ECAcc (cpm/F8A3)
< X-Backend-Server: ftp3.dmz.scl3.mozilla.com
< X-Cache: HIT
< X-Cache-Info: cached
< Content-Length: 21179024

ftp.mozilla.org:
< HTTP/1.1 200 OK
< Server: Apache
< X-Backend-Server: ftp6.dmz.scl3.mozilla.com
< Cache-Control: max-age=345600
< Content-Type: application/octet-stream
< Date: Wed, 20 Feb 2013 17:43:16 GMT
< Expires: Sun, 24 Feb 2013 17:43:16 GMT
< Accept-Ranges: bytes
< Access-Control-Allow-Origin: *
< ETag: "834f22-1432a90-4d6169c610688"
< Last-Modified: Tue, 19 Feb 2013 16:48:28 GMT
< X-Cache-Info: caching
< Content-Length: 21179024

Looking carefully at the dates, it appears that Akamai is currently serving an installer from Feb 17... Edgecast and ftp.mozilla.org are serving one from Feb 19. It appears to me that the older full installer is working properly, but the newer one is not.

We are sending a far too long Expires header, at least, but that's not the problem here (it's just making things confusing because they don't all have the same contents). I'll work on this today. However, based on this data, I'd have to say that fixing this is likely to make the problem *worse*, because it will force the CDNs to stay more up-to-date, which means they'll both be serving the "bad" version instead.
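As a quick check on the cache lifetimes in those headers: max-age=345600 works out to exactly 4 days, and the Akamai Expires value equals its Date plus its max-age. A small sketch of that arithmetic, with hypothetical helper names:

```python
from datetime import timedelta
from email.utils import parsedate_to_datetime


def max_age_in_days(max_age_seconds):
    """Convert a Cache-Control max-age value from seconds to days."""
    return max_age_seconds / 86400


def cache_expiry(date_header, max_age_seconds):
    """Return the moment a response stops being fresh: Date plus max-age."""
    return parsedate_to_datetime(date_header) + timedelta(seconds=max_age_seconds)


# Values taken from the response headers quoted above:
#   cache_expiry("Wed, 20 Feb 2013 17:41:21 GMT", 164507)  # Akamai
#   cache_expiry("Wed, 20 Feb 2013 17:41:37 GMT", 345600)  # Edgecast
```

With a file that is rebuilt daily, a 4-day freshness window leaves plenty of room for different caches to hold different builds at the same time.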


In the meantime, you might want to try this... set this line in your /etc/hosts file:
63.245.215.46    download.cdn.mozilla.net

This will send you straight to the FTP cluster, cutting out both CDNs entirely. Be sure to flush your local DNS cache after setting this. If *that* fails (as I now suspect it will, given the above data), then we can definitively rule out either CDN as being a problem. That would tell me that there's a problem in the full installer itself, or at least something that is tripping up the stub installer.
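If the goal is only to see what a particular IP serves (this does not exercise the stub installer itself, which still needs the hosts entry), a request can be sent straight to the IP with the Host header set by hand. A minimal sketch using Python's `http.client`; `head_via_ip` and its `connection_factory` parameter are names invented here, and the factory exists only so the function can be tried without a network.

```python
import http.client


def head_via_ip(ip, host, path, connection_factory=http.client.HTTPConnection):
    """Send a HEAD request to a specific IP while presenting the given Host header.

    Returns (status code, dict of response headers).
    """
    conn = connection_factory(ip, 80, timeout=30)
    try:
        # Passing Host explicitly stops http.client from deriving it from the IP.
        conn.request("HEAD", path, headers={"Host": host})
        resp = conn.getresponse()
        return resp.status, dict(resp.getheaders())
    finally:
        conn.close()


# Example, using the FTP cluster IP and URL path from this bug:
#   head_via_ip(
#       "63.245.215.46",
#       "download.cdn.mozilla.net",
#       "/pub/mozilla.org/firefox/nightly/latest-mozilla-aurora/"
#       "firefox-20.0a2.en-US.win32.installer.exe",
#   )
```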
Just to confirm, I checked all of the IPs in comment 19... all of the working Akamai IPs are serving up:

< Last-Modified: Sun, 17 Feb 2013 16:30:02 GMT
< ETag: "1e522bc-1432b50-4d5ee1ece642e"
< Content-Length: 21179216


The broken Edgecast IP is serving up:

< Last-Modified: Tue, 19 Feb 2013 16:48:28 GMT
< ETag: "834f22-1432a90-4d6169c610688"
< Content-Length: 21179024


ftp.mozilla.org serves up (same as Edgecast, so presumably broken, but this needs to be tested):

< Last-Modified: Tue, 19 Feb 2013 16:48:28 GMT
< ETag: "834f22-1432a90-4d6169c610688"
< Content-Length: 21179024
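The comparison above boils down to treating (ETag, Last-Modified, Content-Length) as a version key. A rough sketch, with all function names hypothetical. `etag_size` additionally assumes the ETags follow Apache's inode-size-mtime layout with hex fields; that is an assumption, though the values in this bug are consistent with it (0x1432b50 = 21179216 and 0x1432a90 = 21179024).

```python
def parse_headers(raw):
    """Parse curl-style '< Name: value' lines into a dict keyed by lowercase name."""
    headers = {}
    for line in raw.splitlines():
        line = line.lstrip("< ").strip()
        # Skip blank lines and the status line, which carry no header.
        if ":" not in line or line.upper().startswith("HTTP/"):
            continue
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return headers


def version_key(headers):
    """Return the fields that together identify which build is being served."""
    return (
        headers.get("etag"),
        headers.get("last-modified"),
        headers.get("content-length"),
    )


def etag_size(etag):
    """Decode the middle hex field of an 'inode-size-mtime' style ETag as bytes."""
    return int(etag.strip('"').split("-")[1], 16)
```

Two sources with equal version keys are claiming to serve the same build; whether the bytes actually match still needs a checksum.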
btw: I also tested using ftp for the download url without any failures (4 out of 4 good).
http://ftp.mozilla.org/pub/mozilla.org/firefox/nightly/latest-trunk/firefox-21.0a1.en-US.win32.installer.exe

The IP's I hit for ftp are:
63.245.215.46
63.245.215.56
Since nightly fails as well I was testing using nightly.
(In reply to Jake Maul [:jakem] from comment #21)
>...
> In the meantime, you might want to try this... set this line in your
> /etc/hosts file:
> 63.245.215.46    download.cdn.mozilla.net
> 
> This will send you straight to the FTP cluster, cutting out both CDNs
> entirely. Be sure to flush your local DNS cache after setting this. If
> *that* fails (as I now suspect it will, given the above data), then we can
> definitively rule out either CDN as being a problem. That would tell me that
> there's a problem in the full installer itself, or at least something that
> is tripping up the stub installer.
It appears to only happen when using Edgecast and I suspect that just as with Akamai setting the ftp IP in my hosts file will succeed.

I'll get data using aurora later today.
Jake, this might also be related to bug 816472 where we periodically see mar files that are larger than expected. Also, bug 816472 happens in Firefox and not in the stub.
(In reply to Robert Strong [:rstrong] (do not email) from comment #26)
> Jake, this might also be related to bug 816472 where we periodically see mar
> files that are larger than expected. Also, bug 816472 happens in Firefox and
> not in the stub.
btw: I have seen this happen in the stub as well... less often than the other error, but more often than successful downloads, and also from the Edgecast IP.
I have fixed the expires headers, and issued a purge of these files on both Akamai and Edgecast. By my testing they both contain the same file now, and it's the same file as on the FTP cluster.
I just tried 5 times and they were all successful.

juan, can you still reproduce?
I remember having to do this maybe 20 times before I was able to reproduce this. I'll give it a try again today.
I went with adding 93.184.215.248 download.cdn.mozilla.net to my hosts file which was giving me consistent failures. Now it is giving me consistent success.
I haven't been able to reproduce the problem in the past hour, after trying tens of times.
With 93.184.215.248 download.cdn.mozilla.net in my hosts file it successfully downloaded / installed over 35 times. Last night it consistently failed.

Jake, whatever you did appears to have fixed it for me and Juan. Any idea what could have been the cause?
Is it possible to get all of the IP Addresses? With that I can probably create a test installer to verify all of the servers.
(In reply to Jake Maul [:jakem] from comment #22)
> Just to confirm, I checked all of the IPs in comment 19... all of the
> working Akamai IPs are serving up:
> 
> < Last-Modified: Sun, 17 Feb 2013 16:30:02 GMT
> < ETag: "1e522bc-1432b50-4d5ee1ece642e"
> < Content-Length: 21179216
> 
> 
> The broken Edgecast IP is serving up:
> 
> < Last-Modified: Tue, 19 Feb 2013 16:48:28 GMT
> < ETag: "834f22-1432a90-4d6169c610688"
> < Content-Length: 21179024
I was able to extract most of one of the corrupt 93.184.215.248 downloads from yesterday and the file times are 2/18/2013 6:19 AM. So, it appears that it thinks it is serving up 2/19 and the files I received were from 2/18.
Though it is possible that 93.184.215.248 started serving the 2/19 build after I experienced that. Still, what was served was either corrupted on the server or corrupted on the client side, and it appears that whatever you did has fixed that.
Is it feasible that ... I don't really know how to even ask this ... that stub installer is "expecting" a given version of a file in some way, and when getting some other version it chokes?



Another possibility: Edgecast uses Anycast IPs. That means worldwide, despite hundreds (thousands?) of cache servers, they only present ~3 actual IPs. It's possible that one or more Edgecast nodes responding behind that IP actually did have bad data of some kind, totally unrelated to the mere fact that they had different files from Akamai. I'm in Phoenix (and AFAIK you're all in the Bay Area)... the nodes I hit and pulled files from weren't necessarily the same as the ones you accessed... so it's possible that the files *you* were getting had a different md5sum than the ones I was getting, even though we got them from the same IP.

This all hinges on "maybe some Edgecast nodes had corrupt data"... unlikely, but definitely not impossible. If this is the case, then the cache flush would have cured it. This feels unlikely because the previous Expires header was only 4 days... we've had this bug open much longer. If it was bad data on some node, it would have been flushed naturally a long time ago. If the problem has been continuous since this bug was opened, then it's really hard to point a finger at something as transient as "bad cache data".

The only way to debug this is to examine the HTTP headers of failed attempts vs successful ones: Edgecast includes a "Server: ECAcc (cpm/F8A3)" which uniquely identifies which precise node served your request. It might be possible to distinguish successful vs failed attempts looking just at that Edgecast IP.
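Grouping attempts by that Server header could look something like the sketch below. The function name is hypothetical, and the input is assumed to be a list of (Server header, succeeded) pairs collected by hand from captures or logs.

```python
from collections import Counter


def tally_by_node(attempts):
    """Count good and bad downloads per Server header value.

    attempts: iterable of (server_header, succeeded) pairs.
    Returns a dict of server_header -> (good count, bad count).
    """
    tally = {}
    for server, ok in attempts:
        counts = tally.setdefault(server, Counter())
        counts["good" if ok else "bad"] += 1
    return {server: (c["good"], c["bad"]) for server, c in tally.items()}
```

If failures cluster on one node ID while others behind the same IP succeed, that would point at bad data on a specific cache node.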



One other possible scenario, though I can't see how it would matter: I noted that it's been over 24 hours since a full installer build for Aurora was generated. That means, given the new Expires settings in bug 829207, that the CDNs are not actually caching right now... they'll merely be forwarding data from the origin (ftp.mozilla.org), because the Expires header will not allow them to cache. It seems pretty unlikely that the stub installer would care one way or the other if the full installer was served from cache or not, but I wanted to mention it for the sake of full disclosure.
(In reply to Robert Strong [:rstrong] (do not email) from comment #34)
> Is it possible to get all of the IP Addresses? With that I can probably
> create a test installer to verify all of the servers.

Yes, but it won't do you much good. Edgecast uses Anycast to route their CDN traffic, meaning they expose only a handful of IPs worldwide, for lots and lots of cache nodes. This means queries to the same IP from different locations will actually hit different servers. Makes troubleshooting this kind of thing harder. :(

I can get you a list of Edgecast IPs that will access our origin, but this is not the same list of IPs that end users will access to get the files. I don't know if you can query them directly and get an intelligible answer.


Akamai is more traditional... one IP per cache node/cluster, so (AFAIK) you can expect consistent behavior from a given IP. I don't think they publicize a list, but if it's of interest to you I can ask for a snapshot. I dunno if they'd be willing to divulge it.
(In reply to Jake Maul [:jakem] from comment #37)
> Is it feasible that ... I don't really know how to even ask this ... that
> stub installer is "expecting" a given version of a file in some way, and
> when getting some other version it chokes?
No. It is just doing a WinInet download and, at least yesterday, it only ended up with a corrupted file when downloading from the Edgecast IP.

> Another possibility: Edgecast uses Anycast IPs. That means worldwide,
> despite hundreds (thousands?) of cache servers, they only present ~3 actual
> IPs. It's possible that one or more Edgecast nodes responding behind that IP
> actually did have bad data of some kind, totally unrelated to the mere
> fact that they had different files from Akamai. I'm in Phoenix (and AFAIK
> you're all in the Bay area)... the nodes I hit and pulled files from weren't
> necessarily the same as the ones you accessed... so it's possible that the
> files *you* were getting are a different md5sum than the ones I was getting,
> even though we got them from the same IP.
> 
> This all hinges on "maybe some Edgecast nodes had corrupt data"... unlikely,
> but definitely not impossible. If this is the case, then the cache flush
> would have cured it. This feels unlikely because the previous Expires header
> was only 4 days... we've had this bug open much longer. If it was bad data
> on some node, it would have been flushed naturally a long time ago. If the
> problem has been continuous since this bug was opened, then it's really hard
> to point a finger at something as transient as "bad cache data".
I don't think we know if it has been continuous. We do know that it was sporadic until I forced it to use the Edgecast IP.

> The only way to debug this is to examine the HTTP headers of failed attempts
> vs successful ones: Edgecast includes a "Server: ECAcc (cpm/F8A3)" which
> uniquely identifies which precise node served your request. It might be
> possible to distinguish successful vs failed attempts looking just at that
> Edgecast IP.
Juan posted a Wireshark log from a failed attempt in comment #15.

btw: I have compared the WinInet logging for both a failed and a successful download and they were exactly the same so I don't think we will be able to get any additional insight there.

> One other possible scenario, though I can't see how it would matter: I noted
> that it's been over 24 hours since a full installer build for Aurora was
> generated. That means, given the new Expires settings in bug 829207, that
> the CDNs are not actually caching right now... they'll merely be forwarding
> data from the origin (ftp.mozilla.org), because the Expires header will not
> allow them to cache. It seems pretty unlikely that the stub installer would
> care one way or the other if the full installer was served from cache or
> not, but I wanted to mention it for the sake of full disclosure.
The stub does individual range requests and I wonder if the Edgecast server is serving up a different file for some of the requests.
s/Edgecast server/Edgecast IP/
(In reply to Robert Strong [:rstrong] (do not email) from comment #35)
> (In reply to Jake Maul [:jakem] from comment #22)
> > Just to confirm, I checked all of the IPs in comment 19... all of the
> > working Akamai IPs are serving up:
> > 
> > < Last-Modified: Sun, 17 Feb 2013 16:30:02 GMT
> > < ETag: "1e522bc-1432b50-4d5ee1ece642e"
> > < Content-Length: 21179216
> > 
> > 
> > The broken Edgecast IP is serving up:
> > 
> > < Last-Modified: Tue, 19 Feb 2013 16:48:28 GMT
> > < ETag: "834f22-1432a90-4d6169c610688"
> > < Content-Length: 21179024
> I was able to extract most of one of the corrupt 93.184.215.248 downloads
> from yesterday and the file times are 2/18/2013 6:19 AM. So, it appears that
> it thinks it is serving up 2/19 and the files I received were from 2/18.
btw: all of the files had a 2/18/2013 6:XX AM timestamp
(In reply to Robert Strong [:rstrong] (do not email) from comment #36)

> The stub does individual range requests and I wonder if the Edgecast server
> is serving up a different file for some of the requests.

Aha, I mid-aired with you to ask this very question, if it does anything interesting like Range requests.

Anycast IPs fronting independent caching nodes
File contents changing daily, but 4-day-long Expires headers
Range requests

... possibly each Range request is hitting a different Edgecast node, and they don't all happen to have the same version of the file? That would severely screw up the download if you got 300KB from one version, and then 300KB from another one.
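To illustrate the hypothesis, here is a small simulation of a file being reassembled from range requests that are answered by caches holding different versions. It is a toy model, not the stub installer's code: the function name, the chunk size, and the way a version is picked per chunk are all assumptions.

```python
import hashlib


def assemble_from_ranges(versions, chunk_size, pick_version):
    """Rebuild a file chunk by chunk from possibly different cached versions.

    versions: list of bytes objects, one per cache node's copy of the file.
    pick_version: function mapping a chunk index to an index into versions,
        standing in for whichever node happens to answer that range request.
    """
    length = len(versions[0])  # the client trusts the size it learned first
    out = bytearray()
    for index, start in enumerate(range(0, length, chunk_size)):
        source = versions[pick_version(index)]
        out += source[start:start + chunk_size]
    return bytes(out)


def md5_hex(data):
    """Return the md5 hex digest of a bytes object."""
    return hashlib.md5(data).hexdigest()
```

When every chunk comes from the same version the result is intact; as soon as chunks alternate between versions, the assembled file matches neither checksum, which fits a download that completes and then fails at "Installing."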
(In reply to Jake Maul [:jakem] from comment #42)
> (In reply to Robert Strong [:rstrong] (do not email) from comment #36)
> 
> > The stub does individual range requests and I wonder if the Edgecast server
> > is serving up a different file for some of the requests.
> 
> Aha, I mid-aired with you to ask this very question, if it does anything
> interesting like Range requests.
> 
> Anycast IPs fronting independent caching nodes
> File contents changing daily, but 4-day-long Expires headers
> Range requests
> 
> ... possibly each Range request is hitting a different Edgecast node, and
> they don't all happen to have the same version of the file? That would
> severely screw up the download if you got 300KB from one version, and then
> 300KB from another one.
I highly suspect that is what is going on. I suppose there is no way to have consistency of files served by Edgecast?... at least not with our current setup where the url served by bouncer is always the same?

I'll see what I can do in the code for this scenario.
If we can't guarantee consistency on Edgecast it would be helpful if the problem was still present so I can verify that any changes I make fix the problem. Can you revert Edgecast to the previous config if that is the case?
(In reply to Robert Strong [:rstrong] (do not email) from comment #43)
> (In reply to Jake Maul [:jakem] from comment #42)
> > ... possibly each Range request is hitting a different Edgecast node, and
> > they don't all happen to have the same version of the file? That would
> > severely screw up the download if you got 300KB from one version, and then
> > 300KB from another one.
> I highly suspect that is what is going on. I suppose there is no way to have
> consistency of files served by Edgecast?... at least not with our current
> setup where the url served by bouncer is always the same?
> 
> I'll see what I can do in the code for this scenario.

Not really, no... our headers are explicitly telling them "it's okay to cache this file for X hours/days", and then we violate that by changing the contents of the file before then.

I'll bring this situation up to them, but I suspect the response will be something along the lines of "yeah, don't do that". :)


The obvious fixes, as I see them, are:


1) Sidestep the problem by not using Range requests. No idea what this entails in terms of work, or why it uses them now. I know why Firefox does this for updating, but don't know the rationale for stub installer. If you get the whole file in one shot, it'll be consistent.


2) Don't use Edgecast. Since Akamai's IPs are internally "safe" in this respect, and you're not likely to do another DNS lookup between Range requests, this should generally work. However, on a very slow link, you might have to do another lookup before finishing, and thus you could run into the same problem with any CDN. For this reason I would consider this a stopgap, not a real fix.


3) Retool our product delivery more significantly. Essentially, change the filename every time we change the contents. It will take some thought, but I suspect it may be feasible to use a simple query string to bust the cache and create new objects as needed. The query string would be junk, just so long as it's *different* junk whenever the file changes. The CDNs can recognize that as "this is a new object" and treat it accordingly. This is the standard solution to "my file changed and the CDN has the old version"... I think it will work in this situation ("I'm using range requests and the file is sometimes different") just as well, although it would be good to have someone else double-check my thought process on this.

This basically means bouncer changes.

It might be possible to have bouncer send you to a file like:
http://download.cdn.mozilla.net/path/to/exe/installer.exe?<md5sum-of-current-file>

How bouncer is going to get hold of that md5sum is up for debate... it can't be done on-the-fly though (performance), and needs to rotate whenever the file changes on the ftp cluster.
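A rough sketch of the query-string idea, purely for illustration: the function names are hypothetical and this is not how bouncer is implemented. The token is assumed to be computed once when the file lands on the FTP cluster and stored, rather than on each request.

```python
import hashlib
from urllib.parse import urlsplit, urlunsplit


def content_token(file_bytes):
    """Compute a token that changes whenever the file contents change."""
    return hashlib.md5(file_bytes).hexdigest()


def cache_busting_url(url, token):
    """Return url with its query string replaced by the precomputed token."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, token, parts.fragment))
```

Note that this sketch overwrites any existing query string. The CDNs would then see each new build as a distinct object, so cached copies of different builds are never keyed under the same URL.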
(In reply to Robert Strong [:rstrong] (do not email) from comment #44)
> If we can't guarantee consistency on Edgecast it would be helpful if the
> problem was still present so I can verify that any changes I make fix the
> problem. Can you revert Edgecast to the previous config if that is the case?

I can't make it inconsistent again, but I can definitely set it up for failure by undoing my change to the Expires headers... in a few days they'll wander out of sync again and the problem will start to reappear. Want me to do that?
Yes. That will give me some confidence that any changes I make fix the problem for the stub.
Blocks: 829207
Jake/Robert, is qawanted still needed on this bug? If so, what more can we do to assist you?
At the moment there's nothing I need from QA. The next step for me is to implement a decision on comment 45 (one of the options, or a different option), and to re-implement the fix in bug 829207 when :rstrong gives the green light to do so.

I'm CC'ing Brandon and Laura on this bug, because one of the options in comment 45 is to alter Bouncer to include a query string when directing users to a mirror. This would need some thought and implementation, so I'd like to have them roped in on it. We can have a vidyo meeting to fill you two in on the background. :)
(In reply to Jake Maul [:jakem] from comment #49)
> At the moment there's nothing I need from QA. 

Thanks Jake. Dropping QAWANTED.
Keywords: qawanted
I'm removing range requests in bug 811573 which will make it so the stub doesn't break when hitting Edgecast servers so adding dependency.
Depends on: 811573
(In reply to Jake Maul [:jakem] from comment #46)
> (In reply to Robert Strong [:rstrong] (do not email) from comment #44)
> > If we can't guarantee consistency on Edgecast it would be helpful if the
> > problem was still present so I can verify that any changes I make fix the
> > problem. Can you revert Edgecast to the previous config if that is the case?
> 
> I can't make it inconsistent again, but I can definitely set it up for
> failure by undoing my change to the Expires headers... in a few days they'll
> wander out of sync again and the problem will start to reappear. Want me to
> do that?
Thanks for doing this. It made it so I was able to find a bug in the new code while testing the fix in bug 811573.
(In reply to Jake Maul [:jakem] from comment #49)
> At the moment there's nothing I need from QA. The next step for me is to
> implement a decision on comment 45 (one of the options, or a different
> option), and to re-implement the fix in bug 829207 when :rstrong gives the
> green light to do so.
> 
> I'm CC'ing Brandon and Laura on this bug, because one of the options in
> comment 45 is to alter Bouncer to include a query string when directing
> users to a mirror. This would need some thought and implementation, so I'd
> like to have them roped in on it. We can have a vidyo meeting to fill you
> two in on the background. :)

Is this still something that you want to do?
@laura: no, I think you and Brandon are in the clear on this now. :)

Just waiting on the green light to re-apply the fix here, and then we can close this bug out.
Jake, green light is given on re-applying the fix in bug 829207. Thanks!
Duplicate of this bug: 848145
Pushed to mozilla-central in bug 811573
https://hg.mozilla.org/mozilla-central/rev/216ec69cc531
Assignee: nmaul → robert.bugzilla
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Whiteboard: [stub+]
Target Milestone: --- → Firefox 22
Keywords: verifyme
Please note that the workaround for Fiddler that was added to the stub installer has been removed, since it breaks downloading from Edgecast. Fiddler also breaks other parts of Firefox, and since very few people are likely to use Fiddler, removing the workaround is the better trade-off.
I received a sample set of the data last night, and the new data points suggest that the symptoms seen during the download when this bug occurred have been fixed.
FF 20b4, Aurora 21.0a2 (2013-03-07) and Nightly 22.0a1 (2013-03-07) stub installers work fine now.
Is this enough for verifying this bug?
If not, what exactly should be tested here?
Seems like enough to me. All evidence points to this being unreproducible now, after the fix(es).
Keywords: verifyme