Closed
Bug 1042481
Opened 10 years ago
Closed 9 years ago
Investigate Firefox installer download timeouts over CDN
Categories
(Infrastructure & Operations Graveyard :: WebOps: Other, task)
Infrastructure & Operations Graveyard
WebOps: Other
Tracking
(Not tracked)
RESOLVED
WORKSFORME
People
(Reporter: gps, Unassigned)
References
(Blocks 1 open bug)
Details
(Whiteboard: [kanban:https://webops.kanbanize.com/ctrl_board/2/53] )
The client-side logs from the Firefox update hotfix are revealing some "interesting" behavior with CDN interaction. Take this log for example: 2e2c671c-8fda-4d9e-9b3e-466873d913ce ==================================== 1405638455708 install() - 5 1405638455715 startup() - 5 1405638455721 Created state directory. 1405638455726 Performing start-up tasks. 1405638455727 Fresh install. Loading fresh state. 1405638455727 Saving state file. 1405638455759 Reported Windows version: 6.1 SP1 1405638455837 Actual Windows version: 6.1 SP1 1405638455838 Hotfix applicable 1405638455840 Installer URL: http://download-installer.cdn.mozilla.net/pub/firefox/releases/30.0/win32/pt-BR/Firefox%20Setup%2030.0.exe 1405638455840 Installer size (expected): 29816680 1405638455841 Installer hash (expected): 99d1c33f2e12e82b0df0d71e2eef30cf4fa68311576c32d5b73e47fa4f431965af26b52f98640e468e798279d0fafe3eec2e5b4187372b74a29db5b36ca264d7 1405638455841 Deferring save because another in progress. 1405638455974 State save finished. 1405638455974 Saving state file from chain. 1405638456049 State save finished. 1405638456050 Finished start-up. 1405638456052 Starting download. Attempt 1 1405638456053 Saving state file. 1405638456057 Saving state file from chain. 1405638456119 State save finished. 1405638456137 State save finished. 1405638456733 channel:onStartRequest() 1405638456733 Download is resumable. 1405638456734 Saving state file. 1405638456792 copier:onStartRequest() 1405638456874 State save finished. 1405638469136 Download progress: 1033240/29816680 1405638494316 Download progress: 2033340/29816680 1405638517624 Download progress: 3068480/29816680 1405638539827 Download progress: 4068580/29816680 1405638558428 Download progress: 5109560/29816680 1405638586680 Download progress: 6153460/29816680 1405638602799 Download progress: 7153560/29816680 1405638620887 Download progress: 8155120/29816680 1405638643296 Download progress: 9156680/29816680 1405638663986 Download progress: 10158240/29816680 1405638683690 Download progress: 11159800/29816680 1405638698592 Download progress: 12161360/29816680 1405638720024 Download progress: 13161460/29816680 1405638730263 Download progress: 14167400/29816680 1405638745170 Download progress: 15168960/29816680 1405638763721 Download progress: 16170520/29816680 1405638777779 Download progress: 17170620/29816680 1405638794466 Download progress: 18172180/29816680 1405638811022 Download progress: 19172280/29816680 1405638832479 Download progress: 20173840/29816680 1405638853366 Download progress: 21258620/29816680 1405638865184 Download progress: 22263100/29816680 1405638881607 Download progress: 23267580/29816680 1405638897823 Download progress: 24270600/29816680 1405638912421 Download progress: 25275080/29816680 1405638934025 Download progress: 26282480/29816680 1405638952429 Download progress: 27286960/29816680 1405638970568 Download progress: 28288520/29816680 1405638981109 Download progress: 29300300/29816680 1405646368555 shutdown() - 2 1405646368556 Performing shutdown actions. 1405646369060 channel:onStopRequest(NS_ERROR_ABORT) 1405646369117 Closing pipe due to failed channel. 1405646390281 startup() - 1 1405646390295 Waiting for session restore to perform startup. 1405646394691 Performing start-up tasks. 1405646394831 Reported Windows version: 6.1 SP1 1405646394831 Actual Windows version: 6.1 SP1 1405646394832 Hotfix applicable 1405646394839 Installer URL: http://download-installer.cdn.mozilla.net/pub/firefox/releases/30.0/win32/pt-BR/Firefox%20Setup%2030.0.exe 1405646394839 Installer size (expected): 29816680 1405646394840 Installer hash (expected): 99d1c33f2e12e82b0df0d71e2eef30cf4fa68311576c32d5b73e47fa4f431965af26b52f98640e468e798279d0fafe3eec2e5b4187372b74a29db5b36ca264d7 1405646394840 Saving state file. 1405646394960 State save finished. 1405646394960 Finished start-up. 1405646394963 Starting download. Attempt 2 1405646394964 Saving state file. 1405646394966 Resuming download at byte offset 29813760 1405646394968 Saving state file from chain. 1405646395136 State save finished. 1405646395367 State save finished. 1405646466347 shutdown() - 2 1405646466347 Performing shutdown actions. 1405646640816 startup() - 1 1405646640861 Waiting for session restore to perform startup. 1405646645795 Performing start-up tasks. 1405646645928 Reported Windows version: 6.1 SP1 1405646645929 Actual Windows version: 6.1 SP1 1405646645932 Hotfix applicable 1405646645940 Installer URL: http://download-installer.cdn.mozilla.net/pub/firefox/releases/30.0/win32/pt-BR/Firefox%20Setup%2030.0.exe 1405646645940 Installer size (expected): 29816680 1405646645940 Installer hash (expected): 99d1c33f2e12e82b0df0d71e2eef30cf4fa68311576c32d5b73e47fa4f431965af26b52f98640e468e798279d0fafe3eec2e5b4187372b74a29db5b36ca264d7 1405646645940 Saving state file. 1405646646062 State save finished. 1405646646062 Finished start-up. 1405646646067 Starting download. Attempt 3 1405646646067 Saving state file. 1405646646203 Resuming download at byte offset 29813760 1405646646207 Saving state file from chain. 1405646646610 State save finished. 1405646646688 State save finished. 1405646647871 channel:onStartRequest() ... It appears the download "hangs" at byte offset 29813760. What's interesting here is the very long pause between the last download progress event and shutdown (time 1405638981109). Furthermore, at time 1405646394966 we see an attempted resume that goes nowhere for minutes until the application shuts down. I can come up with the following explanations: 1) Necko is being buggy 2) An intermediary HTTP agent (proxy, cache, etc) isn't handling the download properly 3) CDN is being wonky I'm filing this against the CDN component to see if anyone has reported anything like this before. If not, before we jump to conclusions, a Necko bug or a poor behaving intermediate HTTP agent should be investigated before we get too concerned about the CDN being buggy. Honza, Robert: Have you heard of anything like this?
Flags: needinfo?(robert.strong.bugs)
Flags: needinfo?(honzab.moz)
Comment 1•10 years ago
|
||
Nothing quite like that. Jake, any ideas?
Flags: needinfo?(robert.strong.bugs) → needinfo?(nmaul)
Comment 2•10 years ago
|
||
Honza is on PTO for a bit, and Patrick has actually fiddled with our incremental downloader code more recently than anyone else...
Flags: needinfo?(honzab.moz) → needinfo?(mcmanus)
Reporter | ||
Comment 3•10 years ago
|
||
Well, we aren't using the incremental downloader code. We had to write the hotfix with APIs that worked all the way back to Firefox 10. See https://hg.mozilla.org/releases/firefox-hotfixes/file/d29a3d24405c/v20140527.01/resource/update.jsm#l1359. We have discovered at least one issue with the code (bug 1042461). But that doesn't smell like the same problem as this.
Comment 4•10 years ago
|
||
I'm afraid we're just off into raw speculation - I'm happy to join in. 1] it could be someone playing with gzip incorrectly(either server or intermediary). I say that because the bytes received are just slightly smaller than expected which is what you would get if you tried to gzip a compressed file wrapped in an .exe header. It might be interesting to log the content length and content-encoding response headers too. (though you're doing the right thing - theoretically it could have a smaller content-length, and a content encoding and what would be delivered through onDataAvailable would be the identity size.. but maybe we could see a pattern.) 2] its a little curious that you are only logging one channel::onStartRequest(), if it were caught in a loop I would expect to see one onstart/onstop for every asyncopen(). Am I reading the log wrong? One explanation might simply be that the network went away just before the transfer finished.. that would explain the timeout and why the restart never logged an onstartrequest() event (that comes when the response headers are received). Are there lots of cases of it failing just before the end? (which would argue against this)
Flags: needinfo?(mcmanus)
Comment 5•10 years ago
|
||
FWIW, the CDN stats from Akamai indicate that only ~11% of downloads of complete updates are successful. ~50% of downloads of partial updates are successful. I'm pretty sure these are taking into account the range requested, since they record that something like 98% of requests receive an HTTP 206 (partial content) status.
Comment 6•10 years ago
|
||
A few points/questions/suggestions: Just for clarification, you're downloading the latest *installer*, right? Not a complete.mar update? And it's all in one shot, no Range requests? In comment 0, it's downloading over non-SSL. Is that on purpose? Our standard scheme calls for installers to be delivered over SSL- except the file downloaded by Stub Installer, as SI doesn't support SSL and will (AIUI) validate the integrity of the downloaded file itself anyway. If the hotfix does the same, fine... if not, it should probably use SSL instead. :) Nothing obvious comes to mind as to why this issue would occur- the download-installer CDN config is one of the simplest, most direct ones we have, and from my understanding we're doing things in the most straightforward manner (no Range requests, no funky cache-control requirements, etc). I do have a recommendation, though, possibly for v2: If possible, try to log the entire set of HTTP Request and Response headers... including what IP address is being connected to, and (if possible) what IP, region, or ISP we're connecting *from*. This will generally provide all the info we need to know a) what CDN is being connected to, b) what region the CDN node being contacted is in, c) if there's any ISP or region commonality among the problem clients, and d) various things like cache hits/misses, datestamps, etc. It might also give us hints as to whether an intermediate proxy is indeed in the way through the omission, addition, or modification of headers from their expected values.
Flags: needinfo?(nmaul)
Reporter | ||
Comment 7•10 years ago
|
||
Thanks for the details. We're talking about the full installers, e.g. https://ftp.mozilla.org/pub/mozilla.org/firefox/releases/30.0/win32/es-ES/Firefox%20Setup%2030.0.exe. We aren't using TLS. The hotfix embeds the expected hash and verifies that, so no security worries there. We do support download resuming, so Range requests are in play. Adding additional logging to v2 is certainly in play. However, recording client IP is likely off the table for privacy reasons. But I can check. Logging the full set of request and response headers /might/ be in play. But again, there are privacy considerations here. I think logging a whitelist of standard HTTP headers is more realistic.
Comment 8•10 years ago
|
||
If content-encoding is changing between responses, you cannot reliably do range request resume: http://hg.mozilla.org/mozilla-central/annotate/bf27e27c994d/netwerk/protocol/http/nsHttpChannel.cpp#l2135
Updated•10 years ago
|
Whiteboard: [kanban:https://kanbanize.com/ctrl_board/4/577] → [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/3138] [kanban:https://kanbanize.com/ctrl_board/4/577]
Updated•10 years ago
|
Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/3138] [kanban:https://kanbanize.com/ctrl_board/4/577] → [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/3143] [kanban:https://kanbanize.com/ctrl_board/4/577]
Updated•10 years ago
|
Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/3143] [kanban:https://kanbanize.com/ctrl_board/4/577] → [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/3148] [kanban:https://kanbanize.com/ctrl_board/4/577]
Updated•10 years ago
|
Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/3148] [kanban:https://kanbanize.com/ctrl_board/4/577] → [kanban:https://kanbanize.com/ctrl_board/4/577]
Whiteboard: [kanban:https://kanbanize.com/ctrl_board/4/577] → [kanban:https://webops.kanbanize.com/ctrl_board/2/53]
(In reply to Gregory Szorc [:gps] from comment #0) > I'm filing this against the CDN component to see if anyone has reported > anything like this before. If not, before we jump to conclusions, a Necko > bug or a poor behaving intermediate HTTP agent should be investigated before > we get too concerned about the CDN being buggy. We haven't received any other reports of issues with the CDN like this (or, really, at all), so I'm not sure there's any further actions we can take to be of assistance here. I'm going to tentatively RESO WFM this bug, with the understanding that it might get reopened to another component. Note that the CDN is being handed off to Cloud Services, so if there are further issues to investigate, this should probably go to Mozilla Services :: Operations.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → WORKSFORME
Updated•5 years ago
|
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard
You need to log in
before you can comment on or make changes to this bug.
Description
•