Closed Bug 1042481 Opened 10 years ago Closed 9 years ago

Investigate Firefox installer download timeouts over CDN

Categories

(Infrastructure & Operations Graveyard :: WebOps: Other, task)

task
Not set
normal

Tracking

(Not tracked)

RESOLVED WORKSFORME

People

(Reporter: gps, Unassigned)

References

(Blocks 1 open bug)

Details

(Whiteboard: [kanban:https://webops.kanbanize.com/ctrl_board/2/53] )

The client-side logs from the Firefox update hotfix are revealing some "interesting" behavior with CDN interaction.

Take this log for example:

2e2c671c-8fda-4d9e-9b3e-466873d913ce
====================================
1405638455708   install() - 5
1405638455715   startup() - 5
1405638455721   Created state directory.
1405638455726   Performing start-up tasks.
1405638455727   Fresh install. Loading fresh state.
1405638455727   Saving state file.
1405638455759   Reported Windows version: 6.1 SP1
1405638455837   Actual Windows version: 6.1 SP1
1405638455838   Hotfix applicable
1405638455840   Installer URL: http://download-installer.cdn.mozilla.net/pub/firefox/releases/30.0/win32/pt-BR/Firefox%20Setup%2030.0.exe
1405638455840   Installer size (expected): 29816680
1405638455841   Installer hash (expected): 99d1c33f2e12e82b0df0d71e2eef30cf4fa68311576c32d5b73e47fa4f431965af26b52f98640e468e798279d0fafe3eec2e5b4187372b74a29db5b36ca264d7
1405638455841   Deferring save because another in progress.
1405638455974   State save finished.
1405638455974   Saving state file from chain.
1405638456049   State save finished.
1405638456050   Finished start-up.
1405638456052   Starting download. Attempt 1
1405638456053   Saving state file.
1405638456057   Saving state file from chain.
1405638456119   State save finished.
1405638456137   State save finished.
1405638456733   channel:onStartRequest()
1405638456733   Download is resumable.
1405638456734   Saving state file.
1405638456792   copier:onStartRequest()
1405638456874   State save finished.
1405638469136   Download progress: 1033240/29816680
1405638494316   Download progress: 2033340/29816680
1405638517624   Download progress: 3068480/29816680
1405638539827   Download progress: 4068580/29816680
1405638558428   Download progress: 5109560/29816680
1405638586680   Download progress: 6153460/29816680
1405638602799   Download progress: 7153560/29816680
1405638620887   Download progress: 8155120/29816680
1405638643296   Download progress: 9156680/29816680
1405638663986   Download progress: 10158240/29816680
1405638683690   Download progress: 11159800/29816680
1405638698592   Download progress: 12161360/29816680
1405638720024   Download progress: 13161460/29816680
1405638730263   Download progress: 14167400/29816680
1405638745170   Download progress: 15168960/29816680
1405638763721   Download progress: 16170520/29816680
1405638777779   Download progress: 17170620/29816680
1405638794466   Download progress: 18172180/29816680
1405638811022   Download progress: 19172280/29816680
1405638832479   Download progress: 20173840/29816680
1405638853366   Download progress: 21258620/29816680
1405638865184   Download progress: 22263100/29816680
1405638881607   Download progress: 23267580/29816680
1405638897823   Download progress: 24270600/29816680
1405638912421   Download progress: 25275080/29816680
1405638934025   Download progress: 26282480/29816680
1405638952429   Download progress: 27286960/29816680
1405638970568   Download progress: 28288520/29816680
1405638981109   Download progress: 29300300/29816680
1405646368555   shutdown() - 2
1405646368556   Performing shutdown actions.
1405646369060   channel:onStopRequest(NS_ERROR_ABORT)
1405646369117   Closing pipe due to failed channel.
1405646390281   startup() - 1
1405646390295   Waiting for session restore to perform startup.
1405646394691   Performing start-up tasks.
1405646394831   Reported Windows version: 6.1 SP1
1405646394831   Actual Windows version: 6.1 SP1
1405646394832   Hotfix applicable
1405646394839   Installer URL: http://download-installer.cdn.mozilla.net/pub/firefox/releases/30.0/win32/pt-BR/Firefox%20Setup%2030.0.exe
1405646394839   Installer size (expected): 29816680
1405646394840   Installer hash (expected): 99d1c33f2e12e82b0df0d71e2eef30cf4fa68311576c32d5b73e47fa4f431965af26b52f98640e468e798279d0fafe3eec2e5b4187372b74a29db5b36ca264d7
1405646394840   Saving state file.
1405646394960   State save finished.
1405646394960   Finished start-up.
1405646394963   Starting download. Attempt 2
1405646394964   Saving state file.
1405646394966   Resuming download at byte offset 29813760
1405646394968   Saving state file from chain.
1405646395136   State save finished.
1405646395367   State save finished.
1405646466347   shutdown() - 2
1405646466347   Performing shutdown actions.
1405646640816   startup() - 1
1405646640861   Waiting for session restore to perform startup.
1405646645795   Performing start-up tasks.
1405646645928   Reported Windows version: 6.1 SP1
1405646645929   Actual Windows version: 6.1 SP1
1405646645932   Hotfix applicable
1405646645940   Installer URL: http://download-installer.cdn.mozilla.net/pub/firefox/releases/30.0/win32/pt-BR/Firefox%20Setup%2030.0.exe
1405646645940   Installer size (expected): 29816680
1405646645940   Installer hash (expected): 99d1c33f2e12e82b0df0d71e2eef30cf4fa68311576c32d5b73e47fa4f431965af26b52f98640e468e798279d0fafe3eec2e5b4187372b74a29db5b36ca264d7
1405646645940   Saving state file.
1405646646062   State save finished.
1405646646062   Finished start-up.
1405646646067   Starting download. Attempt 3
1405646646067   Saving state file.
1405646646203   Resuming download at byte offset 29813760
1405646646207   Saving state file from chain.
1405646646610   State save finished.
1405646646688   State save finished.
1405646647871   channel:onStartRequest()
...

It appears the download "hangs" at byte offset 29813760.

What's interesting here is the very long pause between the last download progress event and shutdown (time 1405638981109). Furthermore, at time 1405646394966 we see an attempted resume that goes nowhere for minutes until the application shuts down.

I can come up with the following explanations:

1) Necko is being buggy
2) An intermediary HTTP agent (proxy, cache, etc) isn't handling the download properly
3) CDN is being wonky

I'm filing this against the CDN component to see if anyone has reported anything like this before. If not, before we jump to conclusions, a Necko bug or a poor behaving intermediate HTTP agent should be investigated before we get too concerned about the CDN being buggy.

Honza, Robert: Have you heard of anything like this?
Flags: needinfo?(robert.strong.bugs)
Flags: needinfo?(honzab.moz)
Whiteboard: [kanban:https://kanbanize.com/ctrl_board/4/577]
Nothing quite like that.

Jake, any ideas?
Flags: needinfo?(robert.strong.bugs) → needinfo?(nmaul)
Honza is on PTO for a bit, and Patrick has actually fiddled with our incremental downloader code more recently than anyone else...
Flags: needinfo?(honzab.moz) → needinfo?(mcmanus)
Well, we aren't using the incremental downloader code. We had to write the hotfix with APIs that worked all the way back to Firefox 10. See https://hg.mozilla.org/releases/firefox-hotfixes/file/d29a3d24405c/v20140527.01/resource/update.jsm#l1359.

We have discovered at least one issue with the code (bug 1042461). But that doesn't smell like the same problem as this.
I'm afraid we're just off into raw speculation - I'm happy to join in.

1] it could be someone playing with gzip incorrectly(either server or intermediary). I say that because the bytes received are just slightly smaller than expected which is what you would get if you tried to gzip a compressed file wrapped in an .exe header. It might be interesting to log the content length and content-encoding response headers too. (though you're doing the right thing - theoretically it could have a smaller content-length, and a content encoding and what would be delivered through onDataAvailable would be the identity size.. but maybe we could see a pattern.)

2] its a little curious that you are only logging one channel::onStartRequest(), if it were caught in a loop I would expect to see one onstart/onstop for every asyncopen(). Am I reading the log wrong?

One explanation might simply be that the network went away just before the transfer finished.. that would explain the timeout and why the restart never logged an onstartrequest() event (that comes when the response headers are received). Are there lots of cases of it failing just before the end? (which would argue against this)
Flags: needinfo?(mcmanus)
FWIW, the CDN stats from Akamai indicate that only ~11% of downloads of complete updates are successful. ~50% of downloads of partial updates are successful. I'm pretty sure these are taking into account the range requested, since they record that something like 98% of requests receive an HTTP 206 (partial content) status.
A few points/questions/suggestions:


Just for clarification, you're downloading the latest *installer*, right? Not a complete.mar update? And it's all in one shot, no Range requests?

In comment 0, it's downloading over non-SSL. Is that on purpose? Our standard scheme calls for installers to be delivered over SSL- except the file downloaded by Stub Installer, as SI doesn't support SSL and will (AIUI) validate the integrity of the downloaded file itself anyway. If the hotfix does the same, fine... if not, it should probably use SSL instead. :)


Nothing obvious comes to mind as to why this issue would occur- the download-installer CDN config is one of the simplest, most direct ones we have, and from my understanding we're doing things in the most straightforward manner (no Range requests, no funky cache-control requirements, etc). I do have a recommendation, though, possibly for v2:

If possible, try to log the entire set of HTTP Request and Response headers... including what IP address is being connected to, and (if possible) what IP, region, or ISP we're connecting *from*. This will generally provide all the info we need to know a) what CDN is being connected to, b) what region the CDN node being contacted is in, c) if there's any ISP or region commonality among the problem clients, and d) various things like cache hits/misses, datestamps, etc. It might also give us hints as to whether an intermediate proxy is indeed in the way through the omission, addition, or modification of headers from their expected values.
Flags: needinfo?(nmaul)
Thanks for the details.

We're talking about the full installers, e.g. https://ftp.mozilla.org/pub/mozilla.org/firefox/releases/30.0/win32/es-ES/Firefox%20Setup%2030.0.exe.

We aren't using TLS. The hotfix embeds the expected hash and verifies that, so no security worries there.

We do support download resuming, so Range requests are in play.

Adding additional logging to v2 is certainly in play. However, recording client IP is likely off the table for privacy reasons. But I can check. Logging the full set of request and response headers /might/ be in play. But again, there are privacy considerations here. I think logging a whitelist of standard HTTP headers is more realistic.
If content-encoding is changing between responses, you cannot reliably do range request resume:

http://hg.mozilla.org/mozilla-central/annotate/bf27e27c994d/netwerk/protocol/http/nsHttpChannel.cpp#l2135
Whiteboard: [kanban:https://kanbanize.com/ctrl_board/4/577] → [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/3138] [kanban:https://kanbanize.com/ctrl_board/4/577]
Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/3138] [kanban:https://kanbanize.com/ctrl_board/4/577] → [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/3143] [kanban:https://kanbanize.com/ctrl_board/4/577]
Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/3143] [kanban:https://kanbanize.com/ctrl_board/4/577] → [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/3148] [kanban:https://kanbanize.com/ctrl_board/4/577]
Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/3148] [kanban:https://kanbanize.com/ctrl_board/4/577] → [kanban:https://kanbanize.com/ctrl_board/4/577]
Whiteboard: [kanban:https://kanbanize.com/ctrl_board/4/577] → [kanban:https://webops.kanbanize.com/ctrl_board/2/53]
(In reply to Gregory Szorc [:gps] from comment #0)
> I'm filing this against the CDN component to see if anyone has reported
> anything like this before. If not, before we jump to conclusions, a Necko
> bug or a poor behaving intermediate HTTP agent should be investigated before
> we get too concerned about the CDN being buggy.

We haven't received any other reports of issues with the CDN like this (or, really, at all), so I'm not sure there's any further actions we can take to be of assistance here.

I'm going to tentatively RESO WFM this bug, with the understanding that it might get reopened to another component. Note that the CDN is being handed off to Cloud Services, so if there are further issues to investigate, this should probably go to Mozilla Services :: Operations.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → WORKSFORME
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard
You need to log in before you can comment on or make changes to this bug.