Last Comment Bug 816472 - Updates failing for Firefox because downloaded patch is larger than the one referenced in the update snippet
: Updates failing for Firefox because downloaded patch is larger than the one r...
Status: RESOLVED FIXED
[qa-]
:
Product: Core
Classification: Components
Component: Networking (show other bugs)
: Trunk
: All All
: -- major (vote)
: mozilla27
Assigned To: Patrick McManus [:mcmanus] PTO until Sep 6
:
Mentors:
Depends on: 887678
Blocks: 1073459 816470
  Show dependency treegraph
 
Reported: 2012-11-29 04:19 PST by Henrik Skupin (:whimboo)
Modified: 2014-09-26 03:49 PDT (History)
14 users (show)
Ms2ger: in‑testsuite?
See Also:
Crash Signature:
(edit)
QA Whiteboard:
Iteration: ---
Points: ---
Has Regression Range: ---
Has STR: ---
-
-
-
affected
-
fixed


Attachments
13040301_log (31.51 KB, text/plain)
2013-04-04 00:44 PDT, Henrik Skupin (:whimboo)
no flags Details
http.log.zip (451.17 KB, application/zip)
2013-09-23 02:33 PDT, Henrik Skupin (:whimboo)
no flags Details
log file with app.update.log content (36.50 KB, text/x-log)
2013-09-23 08:31 PDT, Henrik Skupin (:whimboo)
no flags Details
full http.log (778.16 KB, application/zip)
2013-09-23 08:32 PDT, Henrik Skupin (:whimboo)
no flags Details
http.log (successful update) (369.30 KB, application/zip)
2013-09-23 12:37 PDT, Henrik Skupin (:whimboo)
no flags Details
incremental downloader error checking around bad server byte ranges (7.02 KB, patch)
2013-10-01 14:41 PDT, Patrick McManus [:mcmanus] PTO until Sep 6
jduell.mcbugs: review+
Details | Diff | Splinter Review

Description Henrik Skupin (:whimboo) 2012-11-29 04:19:33 PST
Mozilla/5.0 (Windows NT 6.2; rv:19.0) Gecko/19.0 Firefox/19.0 (fr, 20121127042013)

Something we have seen lately seen by our Mozmill tests (bug 816470) is that our update tests are failing because Firefox downloads an update which is larger than the one advertised by the update snippet:

*** AUS:SVC readStatusFile - status: downloading, path: C:\Users\mozilla\AppData\Local\Mozilla\Firefox\firefox\updates\0\update.status
*** AUS:SVC Downloader:_selectPatch - found existing patch with state: downloading
*** AUS:SVC Downloader:_selectPatch - resuming download
*** AUS:SVC Downloader:downloadUpdate - downloading from http://ftp.mozilla.org/pub/mozilla.org/firefox/nightly/2012/11/2012-11-28-04-20-14-mozilla-aurora-l10n/firefox-19.0a2.fr.win32.complete.mar to C:\Users\mozilla\AppData\Local\Mozilla\Firefox\firefox\updates\0\update.mar
TEST-PASS | c:\users\mozilla\appdata\local\temp\tmprrld_c.mozmill-tests\tests\update\testFallbackUpdate\test3.js | test3.js::setupModule
TEST-START | c:\users\mozilla\appdata\local\temp\tmprrld_c.mozmill-tests\tests\update\testFallbackUpdate\test3.js | testFallbackUpdate_ErrorPatching
*** AUS:SVC Downloader:onProgress - progress: 25200000/26393403
*** AUS:SVC Downloader:onStopRequest - original URI spec: http://ftp.mozilla.org/pub/mozilla.org/firefox/nightly/2012/11/2012-11-28-04-20-14-mozilla-aurora-l10n/firefox-19.0a2.fr.win32.complete.mar, final URI spec: http://ftp.mozilla.org/pub/mozilla.org/firefox/nightly/2012/11/2012-11-28-04-20-14-mozilla-aurora-l10n/firefox-19.0a2.fr.win32.complete.mar, status: 2152398850
*** AUS:SVC Downloader:onStopRequest - status: 2152398850, current fail: 0, max fail: 10, retryTimeout: 2000
*** AUS:SVC Downloader:onStopRequest - setting state to: downloading
*** AUS:UI gDownloadingPage:onStopRequest - pausing download
*** AUS:SVC Downloader:onStartRequest - original URI spec: http://ftp.mozilla.org/pub/mozilla.org/firefox/nightly/2012/11/2012-11-28-04-20-14-mozilla-aurora-l10n/firefox-19.0a2.fr.win32.complete.mar, final URI spec: http://ftp.mozilla.org/pub/mozilla.org/firefox/nightly/2012/11/2012-11-28-04-20-14-mozilla-aurora-l10n/firefox-19.0a2.fr.win32.complete.mar
*** AUS:SVC Downloader:onProgress - progress: 25213928/26393403
*** AUS:SVC Downloader:onProgress - progress: 29764992/26393403
*** AUS:SVC Downloader:onProgress - progress: 29764992 is higher than patch size: 26393403
*** AUS:SVC Downloader: cancel
*** AUS:SVC Downloader:onProgress - progress: 29700000/26393403
*** AUS:SVC Downloader:onProgress - progress: 29700000 is higher than patch size: 26393403
*** AUS:SVC Downloader: cancel
*** AUS:SVC Downloader:onStopRequest - original URI spec: http://ftp.mozilla.org/pub/mozilla.org/firefox/nightly/2012/11/2012-11-28-04-20-14-mozilla-aurora-l10n/firefox-19.0a2.fr.win32.complete.mar, final URI spec: http://ftp.mozilla.org/pub/mozilla.org/firefox/nightly/2012/11/2012-11-28-04-20-14-mozilla-aurora-l10n/firefox-19.0a2.fr.win32.complete.mar, status: 2147549183
*** AUS:SVC Downloader:onStopRequest - status: 2147549183, current fail: 0, max fail: 10, retryTimeout: 2000
*** AUS:SVC Downloader:onStopRequest - non-verification failure
*** AUS:SVC getStatusTextFromCode - transfer error: Échec (raison inconnue), default code: 2152398849
*** AUS:SVC Downloader:onStopRequest - setting state to: download-failed
*** AUS:UI gErrorsPage:onPageShow - update.statusText: Échec (raison inconnue)
*** AUS:SVC Downloader:onStopRequest - all update patch downloads failed



Checking with wget I get the correct size:

Connecting to ftp.mozilla.org|63.245.215.46|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 26393403 (25M) [application/octet-stream]
Saving to: ‘firefox-19.0a2.fr.win32.complete.mar’

So not sure what's going on here but it looks like it belongs the software update component.

Not sure if Firefox 18 or lower are affected.
Comment 1 Brian R. Bondy [:bbondy] 2012-11-29 05:02:21 PST
The updates should fail in this case as of bug 801855.  The fix is to include the proper size in the XML.  I think this belongs in the AUS component.
Comment 2 Ben Hearsum (:bhearsum) 2012-11-29 05:10:06 PST
Actually, RelEng is a better place for now.
Comment 3 Henrik Skupin (:whimboo) 2012-11-29 05:34:09 PST
The update snippets show the correct size:

<patch type="complete" URL="http://ftp.mozilla.org/pub/mozilla.org/firefox/nightly/2012/11/2012-11-28-04-20-14-mozilla-aurora-l10n/firefox-19.0a2.fr.win32.complete.mar" hashFunction="sha512" hashValue="2d7da85753c7a6729c2dc95f52ab9f9ce787692e52a4d7914dd046738528fb1c44e3619f172eb6e94587f64574e8fe1dccb9911ca0cb75e4655308954961a916" size="26393403"/>

https://aus3.mozilla.org/update/3/Firefox/19.0a2/20121127042013/WINNT_x86-msvc/fr/aurora/Windows_NT%206.2/default/default/update.xml?force=1
Comment 4 Ben Hearsum (:bhearsum) 2012-11-29 05:42:14 PST
For posterity, the current contents of the URL that Henrik linked to:
<updates><update type="minor" version="19.0a2" extensionVersion="19.0a2" buildID="20121128042014"><patch type="complete" URL="http://ftp.mozilla.org/pub/mozilla.org/firefox/nightly/2012/11/2012-11-28-04-20-14-mozilla-aurora-l10n/firefox-19.0a2.fr.win32.complete.mar" hashFunction="sha512" hashValue="2d7da85753c7a6729c2dc95f52ab9f9ce787692e52a4d7914dd046738528fb1c44e3619f172eb6e94587f64574e8fe1dccb9911ca0cb75e4655308954961a916" size="26393403"/><patch type="partial" URL="http://ftp.mozilla.org/pub/mozilla.org/firefox/nightly/2012/11/2012-11-28-04-20-14-mozilla-aurora-l10n/firefox-19.0a2.fr.win32.partial.20121127042013-20121128042014.mar" hashFunction="sha512" hashValue="1d8d25eeb0ab118c2673eb7e9d623253478ff5d265393cef064948876e5df48698ef59373d7038b664a5cbb2af801c760ce61bc5c7221354cf29f9781119a461" size="3550155"/></update></updates>
Comment 5 Henrik Skupin (:whimboo) 2012-12-05 03:37:08 PST
Happened again yesterday with an Aurora update test on Windows 8:
http://10.250.73.243:8080/job/mozilla-aurora_update/1763/console (MV)

*** AUS:SVC Downloader:onProgress - progress: 25500000/26299938
*** AUS:SVC Downloader:onProgress - progress: 25800000/26299938
*** AUS:SVC Downloader:onProgress - progress: 26100000/26299938
*** AUS:SVC Downloader:onProgress - progress: 26400000/26299938
*** AUS:SVC Downloader:onProgress - progress: 26400000 is higher than patch size: 26299938
*** AUS:SVC Downloader: cancel
*** AUS:SVC Downloader:onStopRequest - original URI spec: http://ftp.mozilla.org/pub/mozilla.org/firefox/nightly/2012/12/2012-12-04-04-20-15-mozilla-aurora/firefox-19.0a2.en-US.win32.complete.mar, final URI spec: http://ftp.mozilla.org/pub/mozilla.org/firefox/nightly/2012/12/2012-12-04-04-20-15-mozilla-aurora/firefox-19.0a2.en-US.win32.complete.mar, status: 2147549183
*** AUS:SVC Downloader:onStopRequest - status: 2147549183, current fail: 0, max fail: 10, retryTimeout: 2000
*** AUS:SVC Downloader:onStopRequest - non-verification failure
*** AUS:SVC getStatusTextFromCode - transfer error: Failed (unknown reason), default code: 2152398849
Comment 6 Henrik Skupin (:whimboo) 2012-12-05 03:50:21 PST
I have retriggered the same update test and it works fine now:
http://10.250.73.243:8080/job/mozilla-aurora_update/1774/console

*** AUS:SVC Downloader:onProgress - progress: 327210/26299938
*** AUS:SVC Downloader:onProgress - progress: 5283910/26299938
*** AUS:SVC Downloader:onProgress - progress: 16831050/26299938
*** AUS:SVC Downloader:onProgress - progress: 26299938/26299938
*** AUS:SVC Downloader:onStopRequest - original URI spec: http://ftp.mozilla.org/pub/mozilla.org/firefox/nightly/2012/12/2012-12-04-04-20-15-mozilla-aurora/firefox-19.0a2.en-US.win32.complete.mar, final URI spec: http://ftp.mozilla.org/pub/mozilla.org/firefox/nightly/2012/12/2012-12-04-04-20-15-mozilla-aurora/firefox-19.0a2.en-US.win32.complete.mar, status: 0
*** AUS:SVC Downloader:onStopRequest - status: 0, current fail: 0, max fail: 10, retryTimeout: 2000
*** AUS:SVC Downloader:_verifyDownload called
*** AUS:SVC Downloader:_verifyDownload downloaded size == expected size.

The difference between both is the speed updates have been downloaded. The one in comment 5 has been taken way longer and only downloaded chunks of 300KB while this one only had a few steps which much larger chunks.

Could this be a regression in the software update code?
Comment 7 Robert Strong [:rstrong] (use needinfo to contact me) 2012-12-05 09:41:46 PST
The only change iirc was to the interval to download which was changed to the same value that has been used by nightly and aurora for a very long time.
Comment 8 Brian R. Bondy [:bbondy] 2012-12-05 09:44:28 PST
In worst case we can revert that check I think, it was a security recommendation but I don't think it was a basecamp blocker.
Comment 9 Robert Strong [:rstrong] (use needinfo to contact me) 2012-12-05 12:41:15 PST
It would be a good thing to check again now that bug 818038 is fixed. The uptake rate of 17 was very low and the akamai bandwidth usage started to increase on 12/4 around noon and has increased by around 4 fold since then.
Comment 10 Henrik Skupin (:whimboo) 2012-12-05 13:51:36 PST
(In reply to Robert Strong [:rstrong] (do not email) from comment #9)
> It would be a good thing to check again now that bug 818038 is fixed. The
> uptake rate of 17 was very low and the akamai bandwidth usage started to
> increase on 12/4 around noon and has increased by around 4 fold since then.

Anything we can do with Mozmill update tests? If yes would be nice to get the information which updates we would have to test. Anthony could probably start a testrun once the 18.0 beta tests are done.
Comment 11 Robert Strong [:rstrong] (use needinfo to contact me) 2012-12-05 13:52:39 PST
If you are still getting a new failure that you weren't getting previously.
Comment 12 Henrik Skupin (:whimboo) 2012-12-07 10:16:03 PST
We still see it. Today with an esr10 build:

*** AUS:SVC Downloader:onProgress - progress: 19500000/19794186
*** AUS:SVC Downloader:onProgress - progress: 19800000/19794186
*** AUS:SVC Downloader:onStopRequest - original URI spec: http://ftp.mozilla.org/pub/mozilla.org/firefox/nightly/2012/12/2012-12-07-03-45-01-mozilla-esr10/firefox-10.0.11esrpre.en-US.win32.complete.mar, final URI spec: http://ftp.mozilla.org/pub/mozilla.org/firefox/nightly/2012/12/2012-12-07-03-45-01-mozilla-esr10/firefox-10.0.11esrpre.en-US.win32.complete.mar, status: 2147549183
*** AUS:SVC Downloader:onStopRequest - non-verification failure
*** AUS:SVC getStatusTextFromCode - transfer error: Failed (unknown reason), default code: 2152398849
*** AUS:SVC Downloader:onStopRequest - setting state to: download-failed
*** AUS:UI gErrorsPage:onPageShow - update.statusText: Failed (unknown reason)
Comment 13 Brian R. Bondy [:bbondy] 2012-12-07 10:27:28 PST
Is there any way to reproduce this?  There are xpcshell tests that cover the size check against the XML which I haven't seen fail, but those use a mock incremental downloader and those download small MAR files.
Comment 14 Henrik Skupin (:whimboo) 2012-12-07 11:47:18 PST
No, it happens intermittently. I just retriggered the failing job and now the update passed.

But if you want to have a look at the logs please check:
http://10.250.73.243:8080/job/mozilla-esr10_update/539/console (MV VPN)
Comment 15 Henrik Skupin (:whimboo) 2012-12-27 13:39:30 PST
Happened again lately on the esr17 branch:

http://10.250.73.243:8080/job/mozilla-esr17_update/252/console

*** AUS:SVC gCanCheckForUpdates - able to check for updates
*** AUS:SVC gCanApplyUpdates - testing write access C:\Users\mozauto\AppData\Local\Mozilla\Firefox\firefox\update.test
*** AUS:SVC gCanApplyUpdates - windowsVersion = 6.0
*** AUS:SVC gCanApplyUpdates - on Vista, userCanElevate: false
*** AUS:SVC gCanApplyUpdates - testing write access c:\users\mozauto\appdata\local\temp\tmpemd1j5.binary\update.test
*** AUS:SVC gCanApplyUpdates - able to apply updates
*** AUS:SVC readStatusFile - status: downloading, path: C:\Users\mozauto\AppData\Local\Mozilla\Firefox\firefox\updates\0\update.status
*** AUS:SVC readStatusFile - status: downloading, path: C:\Users\mozauto\AppData\Local\Mozilla\Firefox\firefox\updates\0\update.status
*** AUS:SVC UpdateService:_postUpdateProcessing - patch found in downloading state
*** AUS:SVC readStatusFile - status: downloading, path: C:\Users\mozauto\AppData\Local\Mozilla\Firefox\firefox\updates\0\update.status
*** AUS:SVC Downloader:_selectPatch - found existing patch with state: downloading
*** AUS:SVC Downloader:_selectPatch - resuming download
*** AUS:SVC Downloader:downloadUpdate - downloading from http://ftp.mozilla.org/pub/mozilla.org/firefox/nightly/2012/12/2012-12-24-03-45-02-mozilla-esr17/firefox-17.0.1esrpre.en-US.win32.complete.mar to C:\Users\mozauto\AppData\Local\Mozilla\Firefox\firefox\updates\0\update.mar
*** UTM:SVC TimerManager:registerTimer - id: browser-cleanup-thumbnails
*** AUS:SVC Downloader:onStartRequest - original URI spec: http://ftp.mozilla.org/pub/mozilla.org/firefox/nightly/2012/12/2012-12-24-03-45-02-mozilla-esr17/firefox-17.0.1esrpre.en-US.win32.complete.mar, final URI spec: http://ftp.mozilla.org/pub/mozilla.org/firefox/nightly/2012/12/2012-12-24-03-45-02-mozilla-esr17/firefox-17.0.1esrpre.en-US.win32.complete.mar
*** AUS:SVC Downloader:onProgress - progress: 24694070/24207638
*** AUS:SVC Downloader:onProgress - progress: 28339669/24207638
*** AUS:SVC Downloader:onProgress - progress: 31903030/24207638
*** AUS:SVC Downloader:onProgress - progress: 34884918/24207638
*** AUS:SVC Downloader:onProgress - progress: 38128950/24207638
*** AUS:SVC Downloader:onProgress - progress: 41274678/24207638
*** AUS:SVC Downloader:onProgress - progress: 43601206/24207638
*** AUS:SVC Downloader:onProgress - progress: 47815276/24207638
*** AUS:SVC readStatusFile - status: downloading, path: C:\Users\mozauto\AppData\Local\Mozilla\Firefox\firefox\updates\0\update.status
*** AUS:SVC Downloader:_selectPatch - found existing patch with state: downloading
*** AUS:SVC Downloader:_selectPatch - resuming download
*** AUS:SVC Downloader:downloadUpdate - downloading from http://ftp.mozilla.org/pub/mozilla.org/firefox/nightly/2012/12/2012-12-24-03-45-02-mozilla-esr17/firefox-17.0.1esrpre.en-US.win32.complete.mar to C:\Users\mozauto\AppData\Local\Mozilla\Firefox\firefox\updates\0\update.mar
*** AUS:SVC Downloader:onStopRequest - original URI spec: http://ftp.mozilla.org/pub/mozilla.org/firefox/nightly/2012/12/2012-12-24-03-45-02-mozilla-esr17/firefox-17.0.1esrpre.en-US.win32.complete.mar, final URI spec: http://ftp.mozilla.org/pub/mozilla.org/firefox/nightly/2012/12/2012-12-24-03-45-02-mozilla-esr17/firefox-17.0.1esrpre.en-US.win32.complete.mar, status: 2152398850
*** AUS:SVC Downloader:onStopRequest - setting state to: downloading
*** AUS:SVC Downloader:onStartRequest - original URI spec: http://ftp.mozilla.org/pub/mozilla.org/firefox/nightly/2012/12/2012-12-24-03-45-02-mozilla-esr17/firefox-17.0.1esrpre.en-US.win32.complete.mar, final URI spec: http://ftp.mozilla.org/pub/mozilla.org/firefox/nightly/2012/12/2012-12-24-03-45-02-mozilla-esr17/firefox-17.0.1esrpre.en-US.win32.complete.mar
*** AUS:SVC Downloader:onStopRequest - original URI spec: http://ftp.mozilla.org/pub/mozilla.org/firefox/nightly/2012/12/2012-12-24-03-45-02-mozilla-esr17/firefox-17.0.1esrpre.en-US.win32.complete.mar, final URI spec: http://ftp.mozilla.org/pub/mozilla.org/firefox/nightly/2012/12/2012-12-24-03-45-02-mozilla-esr17/firefox-17.0.1esrpre.en-US.win32.complete.mar, status: 2147549183
*** AUS:SVC Downloader:onStopRequest - non-verification failure
*** AUS:SVC getStatusTextFromCode - transfer error: Failed (unknown reason), default code: 2152398849
*** AUS:SVC Downloader:onStopRequest - setting state to: download-failed
*** AUS:SVC Downloader:onStopRequest - all update patch downloads failed
*** AUS:UI gUpdates:onLoad - setting current page to startpage errorpatching
Comment 16 Ben Hearsum (:bhearsum) 2012-12-27 13:41:28 PST
Can you attach the full snippet and MAR file?
Comment 17 Henrik Skupin (:whimboo) 2012-12-27 14:33:14 PST
We don't log the snippets yet. So I cannot provide them for tests ran a couple of days ago. Sorry. The downloaded mar files you can find as links in my last comment. But all those are about 23MB in size as the snippet included.
Comment 18 Alex Keybl [:akeybl] 2013-01-07 17:37:21 PST
It sounds like RelEng may not be able to take next steps until they can take a look at an affected mar file.

I'm tempted to take this off the tracking lists, given it impacts ESR10/17 as well. But I don't want a bug like this (which could impact uptake) to fall of our radar.

What next steps can we take here?
Comment 19 Henrik Skupin (:whimboo) 2013-01-08 00:29:09 PST
Next step on my plate will be to integrate the content of the update snippet into our Mozmill report. That way we can at least see details about the mar file served at that day.
Comment 20 Alex Keybl [:akeybl] 2013-01-16 13:13:17 PST
(In reply to Henrik Skupin (:whimboo) from comment #19)
> Next step on my plate will be to integrate the content of the update snippet
> into our Mozmill report. That way we can at least see details about the mar
> file served at that day.

Do you have an ETA?
Comment 21 Henrik Skupin (:whimboo) 2013-01-18 08:16:11 PST
Once we talk about it it's failing again:

http://mm-ci-master.qa.scl3.mozilla.com:8080/job/mozilla-aurora_update/101/console

Here the snippet which should still be accurate:

*** AUS:SVC Checker:getUpdateURL - update URL: https://aus3.mozilla.org/update/3/Firefox/20.0a2/20130117042017/Linux_x86-gcc3/en-US/aurora/Linux%203.2.0-34-generic-pae%20(GTK%202.24.10)/default/default/update.xml?force=1
*** AUS:SVC Checker:checkForUpdates - sending request to: https://aus3.mozilla.org/update/3/Firefox/20.0a2/20130117042017/Linux_x86-gcc3/en-US/aurora/Linux%203.2.0-34-generic-pae%20(GTK%202.24.10)/default/default/update.xml?force=1
*** AUS:SVC Checker:onProgress - 925/925
*** AUS:SVC Checker:onLoad - request completed downloading document
*** AUS:SVC Checker:getUpdateURL - update URL: https://aus3.mozilla.org/update/3/Firefox/20.0a2/20130117042017/Linux_x86-gcc3/en-US/aurora/Linux%203.2.0-34-generic-pae%20(GTK%202.24.10)/default/default/update.xml?force=1

Sadly I have seen this problem too late to react accordingly. New update snippets are available in the meantime. The test run at around 1:23pm UTC. So might it be possible that we replaced the snippets or the mar file at exactly this time with the ones for todays Aurora build? And that this caused the problem?

I still find is suspicious that whenever we hit this problem we have this strange 1MB progress update in the log:

*** AUS:SVC Downloader:onProgress - progress: 27000000/28816396
*** AUS:SVC Downloader:onProgress - progress: 27300000/28816396
*** AUS:SVC Downloader:onProgress - progress: 27600000/28816396
*** AUS:SVC Downloader:onProgress - progress: 27900000/28816396
*** AUS:SVC Downloader:onProgress - progress: 28200000/28816396
*** AUS:SVC Downloader:onProgress - progress: 28500000/28816396
*** AUS:SVC Downloader:onProgress - progress: 28800000/28816396
*** AUS:SVC Downloader:onProgress - progress: 29100000/28816396
*** AUS:SVC Downloader:onProgress - progress: 29100000 is higher than patch size: 28816396
*** AUS:SVC Downloader: cancel

All other updates are a fraction of 1MB.

(In reply to Alex Keybl [:akeybl] from comment #20)
> Do you have an ETA?

Having our new CI system ready in SCL3 was my highest priority. Now that it is present I will see what I can do here. My focus is currently on WebRTC but I might find a bit of time for patching our update tests.
Comment 22 Alex Keybl [:akeybl] 2013-01-22 07:32:49 PST
Is this an issue that would be obvious as part of final update sign-offs, and thus fixable prior to release? We're trying to determine whether this is actually critical for release or can be resolved in the medium-term.
Comment 23 Henrik Skupin (:whimboo) 2013-01-22 10:29:12 PST
(In reply to Alex Keybl [:akeybl] from comment #22)
> Is this an issue that would be obvious as part of final update sign-offs,
> and thus fixable prior to release? We're trying to determine whether this is
> actually critical for release or can be resolved in the medium-term.

Actually I'm not sure if we have ever seen that for ondemand update tests which cover releases. Anthony might be able to give an answer here. It's not something I'm taking care of.
Comment 24 juan becerra [:juanb] 2013-01-23 16:30:36 PST
I have not noticed this issue as part of the update testing during our beta cycles. I can be on the look out for failures in aurora or nightly.

I can't tell how often this has happened, and I don't know if this is critical right at the moment.
Comment 25 Alex Keybl [:akeybl] 2013-01-30 15:44:28 PST
In email, Rob suggests that "We haven't seen this on user systems (personally or in bug reports) and with this so far only being reproduced with mozmill I suspect it is mozmill specific or perhaps a server side issue".

No need to track given that.
Comment 26 Anthony Hughes (:ashughes) [GFX][QA][Mentor] 2013-01-31 15:42:27 PST
I can't personally recall seeing this with the ondemand update tests. As with Juan, I'll keep an eye out with future releases. I would support adding anything to the reports which would make this problem easier to debug though, and not waiting for this to happen with ondemand tests to make this change.
Comment 27 Henrik Skupin (:whimboo) 2013-02-20 10:50:50 PST
Could this be related to bug 836044? So far only Aurora builds are affected and that sounds kinda connected to each other.
Comment 28 Robert Strong [:rstrong] (use needinfo to contact me) 2013-02-20 10:51:57 PST
Quite possibly
Comment 29 Nick Thomas [:nthomas] 2013-03-03 23:23:50 PST
The URLs for nightly/aurora updates don't go into dirs like firefox/nightly/latest-<branch>/, they go to firefox/nightly/YYYY/MM/YYYY-MM-DD-HH-MM-SS-<branch>/ instead (and we don't overwrite files in there unless something is really really wrong). So I don't think bug 836044 is relevant here.
Comment 30 Robert Strong [:rstrong] (use needinfo to contact me) 2013-03-04 11:53:55 PST
(In reply to Nick Thomas [:nthomas] from comment #29)
> The URLs for nightly/aurora updates don't go into dirs like
> firefox/nightly/latest-<branch>/, they go to
> firefox/nightly/YYYY/MM/YYYY-MM-DD-HH-MM-SS-<branch>/ instead (and we don't
> overwrite files in there unless something is really really wrong). So I
> don't think bug 836044 is relevant here.
Agreed.
Comment 31 Henrik Skupin (:whimboo) 2013-04-03 16:08:12 PDT
We have seen the problem again today when upgrading a fr Nightly build of Firefox:

http://10.22.73.128:8080/job/mozilla-central_update/2201/console

*** AUS:SVC Downloader:downloadUpdate - downloading from http://ftp.mozilla.org/pub/mozilla.org/firefox/nightly/2013/04/2013-04-03-09-09-50-mozilla-central-l10n/firefox-23.0a1.fr.win32.complete.mar to C:\Documents and Settings\mozilla\Local Settings\Application Data\Mozilla\Firefox\firefox\updates\0\update.mar
*** AUS:SVC Downloader:onProgress - progress: 6900000/28677919
*** AUS:SVC Downloader:onStopRequest - original URI spec: http://ftp.mozilla.org/pub/mozilla.org/firefox/nightly/2013/04/2013-04-03-09-09-50-mozilla-central-l10n/firefox-23.0a1.fr.win32.complete.mar, final URI spec: http://ftp.mozilla.org/pub/mozilla.org/firefox/nightly/2013/04/2013-04-03-09-09-50-mozilla-central-l10n/firefox-23.0a1.fr.win32.complete.mar, status: 2152398850
*** AUS:SVC Downloader:onStopRequest - status: 2152398850, current fail: 0, max fail: 10, retryTimeout: 2000
*** AUS:SVC Downloader:onStopRequest - setting state to: downloading
*** AUS:UI gDownloadingPage:onStopRequest - pausing download
*** AUS:SVC Downloader:onStartRequest - original URI spec: http://ftp.mozilla.org/pub/mozilla.org/firefox/nightly/2013/04/2013-04-03-09-09-50-mozilla-central-l10n/firefox-23.0a1.fr.win32.complete.mar, final URI spec: http://ftp.mozilla.org/pub/mozilla.org/firefox/nightly/2013/04/2013-04-03-09-09-50-mozilla-central-l10n/firefox-23.0a1.fr.win32.complete.mar
*** AUS:SVC Downloader:onProgress - progress: 7332673/28677919
*** AUS:SVC Downloader:onProgress - progress: 9067303/28677919
*** AUS:SVC Downloader:onProgress - progress: 10856907/28677919
*** AUS:SVC Downloader:onProgress - progress: 12791235/28677919
*** AUS:SVC Downloader:onProgress - progress: 14720587/28677919
*** AUS:SVC Downloader:onProgress - progress: 16413337/28677919
*** AUS:SVC Downloader:onProgress - progress: 18344475/28677919
*** AUS:SVC Downloader:onProgress - progress: 20045421/28677919
*** AUS:SVC Downloader:onProgress - progress: 21637251/28677919
*** AUS:SVC Downloader:onProgress - progress: 23401491/28677919
*** AUS:SVC Downloader:onProgress - progress: 25308283/28677919
*** AUS:SVC Downloader:onProgress - progress: 27302077/28677919
*** AUS:SVC Downloader:onProgress - progress: 29058459/28677919
*** AUS:SVC Downloader:onProgress - progress: 29058459 is higher than patch size: 28677919
*** AUS:SVC Downloader: cancel
*** AUS:SVC Downloader:onProgress - progress: 28800000/28677919
*** AUS:SVC Downloader:onProgress - progress: 28800000 is higher than patch size: 28677919
Comment 32 Robert Strong [:rstrong] (use needinfo to contact me) 2013-04-03 16:22:31 PDT
Jake, can you think of any reason why the following file would be larger than expected?

http://ftp.mozilla.org/pub/mozilla.org/firefox/nightly/2013/04/2013-04-03-09-09-50-mozilla-central-l10n/firefox-23.0a1.fr.win32.complete.mar


Henrik, have you been running your tests successfully the vast majority of the time? Also, the download didn't start at 0. Do you know what might be different on your system that could cause this? Is this mozmill? If so, then it might be performing some of the update steps in a manner to cause this. Also, when you link to the console log it usually times out. Please attach the log as an attachment in the future if you think there is relevant info.
Comment 33 Jake Maul [:jakem] 2013-04-03 17:21:27 PDT
Hmm... well, according to comment 29 the files are never being overwritten, which is definitely the most obvious possible problem. I can't think of any other way to get a file size to be different from expected unless the file itself is changed between when the download started and when it ended. And a slower download speed (more, smaller chunks with longer wait times) would definitely be more significantly impacted.

Just for clarification, we're always talking about hitting ftp.mozilla.org directly, right? None of this goes through download.cdn.mozilla.net? Just narrowing down possibilities here...

I don't know anything about this testing infrastructure. Is it possible to try the tests with different settings? For instance, could we try fewer, larger chunks (preferably up to and including no chunking at all... just one big GET)? This would reduce the time differential between the initial AUS hit and the finished download, as well as the number of requests made to ftp.mozilla.org. If we can get the latter down to 1, it should eliminate any potential for variation on the server infra side (1 atomic request, no chance to "cross the streams").

That all seems like a long shot though... if the files don't change, and the AUS XML data doesn't change, then I'm at a loss as to what could cause this on the infra side. :(
Comment 34 Henrik Skupin (:whimboo) 2013-04-04 00:44:44 PDT
Created attachment 733212 [details]
13040301_log

(In reply to Robert Strong [:rstrong] (do not email) from comment #32)
> Henrik, have you been running your tests successfully the vast majority of
> the time?

That's correct. It happens very rarely that we hit this failure.

> Also, the download didn't start at 0. Do you know what might be
> different on your system that could cause this? Is this mozmill? If so, then
> it might be performing some of the update steps in a manner to cause this.

We always use a fresh profile and also a fresh installation directory. Update checks in the background are disabled by default and we don't manually check that earlier as when the test starts. There is only a restart of the browser at the beginning so that we can get the update log enabled.

> Also, when you link to the console log it usually times out. Please attach
> the log as an attachment in the future if you think there is relevant info.

All those links are accessible when you are connected to MPT. But I will make sure to attach a full copy of the log in the future.
Comment 35 Henrik Skupin (:whimboo) 2013-04-05 11:37:59 PDT
(In reply to Robert Strong [:rstrong] (do not email) from comment #32)
> Henrik, have you been running your tests successfully the vast majority of
> the time? Also, the download didn't start at 0. Do you know what might be
> different on your system that could cause this? Is this mozmill? If so, then

As of now I figured out what's going on. It's a result of a bad update test which failed with status 7. We have to find a workaround in our automation scripts and Mozmill to get this working. See bug 858686.

Anyway, doesn't the updater save the locale beside all the other update status? Means if we start an update for another locale shouldn't the update start from the beginning?
Comment 36 Henrik Skupin (:whimboo) 2013-04-05 11:59:26 PDT
Ok, so this might be related to the fallback updates with the second step in downloading the full update. When we set 'status: 6' to indicate that a failure happened, Firefox will show the warning the screen the next time it gets started. At this time I can already see that the download of the mar file has been started in the background. There might be a little delay in when we click on the OK button, so a small fraction of the patch has been downloaded. That's why we do not start by 0KB.
Comment 37 Robert Strong [:rstrong] (use needinfo to contact me) 2013-04-05 12:58:32 PDT
(In reply to Henrik Skupin (:whimboo) from comment #35)
> (In reply to Robert Strong [:rstrong] (do not email) from comment #32)
> > Henrik, have you been running your tests successfully the vast majority of
> > the time? Also, the download didn't start at 0. Do you know what might be
> > different on your system that could cause this? Is this mozmill? If so, then
> 
> As of now I figured out what's going on. It's a result of a bad update test
> which failed with status 7. We have to find a workaround in our automation
> scripts and Mozmill to get this working. See bug 858686.
> 
> Anyway, doesn't the updater save the locale beside all the other update
> status? Means if we start an update for another locale shouldn't the update
> start from the beginning?
The only update status that the updater uses is in update.status. Specifically, the state information for the update. The info in update.version and active-update.xml is used by Firefox and not the updater.
Comment 38 Henrik Skupin (:whimboo) 2013-04-11 23:59:03 PDT
Yesterday I was working on a patch (bug 860677) for our Mozmill automation scripts so that we will delete downloaded updates at the end of the testrun. Given that those were previously located in the binary folder it didn't cause us problems. But now those are stored in a shared location and can cause us trouble for following testruns on the same machine. Lets see if we still see this behavior after the patch landed.
Comment 39 Henrik Skupin (:whimboo) 2013-04-12 02:28:51 PDT
(In reply to Robert Strong [:rstrong] (do not email) from comment #37)
> The only update status that the updater uses is in update.status.
> Specifically, the state information for the update. The info in
> update.version and active-update.xml is used by Firefox and not the updater.

So something is strange here then and it might be Firefox which misbehaves. As we have seen yesterday during the release update tests we had a situation when an et build updated to en-US. I would blame an update which was already located in the shared folder from a previous en-US update of the same Firefox version. And we continued the download even it is not the correct locale:

*** AUS:SVC Downloader:_selectPatch - resuming download
*** AUS:SVC Downloader:downloadUpdate - downloading from http://download.mozilla.org/?product=firefox-20.0.1-complete&os=win&lang=en-US&force=1 to C:\Users\mozauto\AppData\Local\Mozilla\Firefox\firefox\updates\0\update.mar
*** UTM:SVC TimerManager:registerTimer - id: browser-cleanup-thumbnails
*** AUS:SVC Downloader:onStartRequest - original URI spec: http://download.mozilla.org/?product=firefox-20.0.1-complete&os=win&lang=en-US&force=1, final URI spec: http://download.cdn.mozilla.net/pub/mozilla.org/firefox/releases/20.0.1/update/win32/en-US/firefox-20.0.1.complete.mar
*** AUS:SVC Downloader:onProgress - progress: 4200000/26178137

Rob, I assume we should cover that with a new bug?
Comment 40 Robert Strong [:rstrong] (use needinfo to contact me) 2013-04-12 10:50:17 PDT
(In reply to Henrik Skupin (:whimboo) from comment #39)
> (In reply to Robert Strong [:rstrong] (do not email) from comment #37)
> > The only update status that the updater uses is in update.status.
> > Specifically, the state information for the update. The info in
> > update.version and active-update.xml is used by Firefox and not the updater.
> 
> So something is strange here then and it might be Firefox which misbehaves.
> As we have seen yesterday during the release update tests we had a situation
> when an et build updated to en-US. I would blame an update which was already
> located in the shared folder from a previous en-US update of the same
> Firefox version. And we continued the download even it is not the correct
> locale:
> 
> *** AUS:SVC Downloader:_selectPatch - resuming download
> *** AUS:SVC Downloader:downloadUpdate - downloading from
> http://download.mozilla.org/?product=firefox-20.0.1-complete&os=win&lang=en-
> US&force=1 to
> C:\Users\mozauto\AppData\Local\Mozilla\Firefox\firefox\updates\0\update.mar
> *** UTM:SVC TimerManager:registerTimer - id: browser-cleanup-thumbnails
> *** AUS:SVC Downloader:onStartRequest - original URI spec:
> http://download.mozilla.org/?product=firefox-20.0.1-complete&os=win&lang=en-
> US&force=1, final URI spec:
> http://download.cdn.mozilla.net/pub/mozilla.org/firefox/releases/20.0.1/
> update/win32/en-US/firefox-20.0.1.complete.mar
> *** AUS:SVC Downloader:onProgress - progress: 4200000/26178137
> 
> Rob, I assume we should cover that with a new bug?
As usual, if you can reproduce without using mozmill then definitely file a bug with the steps used. The thing is, this very much looks like it was caused by mozmill trying to simulate the conditions for update.
Comment 41 Henrik Skupin (:whimboo) 2013-04-19 08:04:40 PDT
We had this failure again today with a de Nightly build which gets it updates from ftp.m.o:

*** AUS:SVC Downloader:onProgress - progress: 28852297/28804302
*** AUS:SVC Downloader:onProgress - progress: 28852297 is higher than patch size: 28804302
*** AUS:SVC Downloader: cancel
*** AUS:SVC Downloader:onProgress - progress: 28800000/28804302
*** AUS:SVC Downloader:onStopRequest - original URI spec: http://ftp.mozilla.org/pub/mozilla.org/firefox/nightly/2013/04/2013-04-19-03-09-57-mozilla-central-l10n/firefox-23.0a1.de.win32.complete.mar, final URI spec: http://ftp.mozilla.org/pub/mozilla.org/firefox/nightly/2013/04/2013-04-19-03-09-57-mozilla-central-l10n/firefox-23.0a1.de.win32.complete.mar, status: 2147549183
*** AUS:SVC Downloader:onStopRequest - status: 2147549183, current fail: 0, max fail: 10, retryTimeout: 2000
*** AUS:SVC Downloader:onStopRequest - non-verification failure
*** AUS:SVC getStatusTextFromCode - transfer error: Fehlgeschlagen (Unbekannter Grund), default code: 2152398849
*** AUS:SVC Downloader:onStopRequest - setting state to: download-failed
*** AUS:UI gErrorsPage:onPageShow - update.statusText: Fehlgeschlagen (Unbekannter Grund)
*** AUS:SVC Downloader:onStopRequest - all update patch downloads failed

That means it's not related to the CDN mirror network.
Comment 42 Henrik Skupin (:whimboo) 2013-07-16 09:08:20 PDT
Not sure if it's related to this bug but with bug 887678 we found a very serious issue today, where we accidentally produce repacks for localized builds from debug builds. Those are getting accidentally uploaded to latest-mozilla-central. So given that we end up in different file sizes at least for the builds. We are not sure about the mar files, which might also be generated under those circumstances.

We will continue to watch this bug, but it seems to happen very rarely those day. Probably because we run the tests a couple of seconds after their upload.
Comment 43 Henrik Skupin (:whimboo) 2013-07-22 23:51:40 PDT
We haven't seen this failure for a really long time. So I'm inclined to close it as WFM. It might have been fixed by one of the dependent bugs or some others. We will reopen whenever we hit this problem again.
Comment 44 Henrik Skupin (:whimboo) 2013-08-12 06:51:28 PDT
This happened again as noted on bug 816470 comment 16. Nick Thomas gave an answer to that on bug 816470 comment 17. It happened for the Aurora build 20130804004006.

As seen by Anthony we had the same issue for the 24.0b1 release. But here it was a misconfiguration of the update snippets.
Comment 45 Henrik Skupin (:whimboo) 2013-09-23 02:33:58 PDT
Created attachment 808478 [details]
http.log.zip

I trapped into this issue again today while running update tests for mozilla-esr17 on Ubuntu 32bit. Given that we record the HTTP log now, I have attached it to this bug. I hope it helps to analyze the problem.
Comment 46 Henrik Skupin (:whimboo) 2013-09-23 02:42:25 PDT
Brian, could you have a look at the HTTP log? You might want to search for '25800000', which is the failing range for the HTTP request. The failing host here is ftp5.dmz.scl3.mozilla.com.

2013-09-23 09:13:39.300015 UTC - -1216837888[b75361c0]: http request [
2013-09-23 09:13:39.300024 UTC - -1216837888[b75361c0]:   GET /pub/mozilla.org/firefox/nightly/2013/09/2013-09-22-00-05-01-mozilla-esr17/firefox-17.0.9esrpre.en-US.linux-i686.complete.mar HTTP/1.1
2013-09-23 09:13:39.300029 UTC - -1216837888[b75361c0]:   Host: ftp.mozilla.org
2013-09-23 09:13:39.300034 UTC - -1216837888[b75361c0]:   User-Agent: Mozilla/5.0 (X11; Linux i686; rv:17.0) Gecko/20130921 Firefox/17.0
2013-09-23 09:13:39.300039 UTC - -1216837888[b75361c0]:   Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
2013-09-23 09:13:39.300044 UTC - -1216837888[b75361c0]:   Accept-Language: en-US,en;q=0.5
2013-09-23 09:13:39.300049 UTC - -1216837888[b75361c0]:   Connection: keep-alive
2013-09-23 09:13:39.300053 UTC - -1216837888[b75361c0]:   Range: bytes=25800000-
2013-09-23 09:13:39.300059 UTC - -1216837888[b75361c0]:   Cookie: __utma=150903082.1984087841.1379927591.1379927591.1379927591.1; __utmb=150903082; __utmz=150903082.1379927591.1.1.utmccn=(direct)|utmcsr=(direct)|utmcmd=(none); __utmc=150903082
2013-09-23 09:13:39.300063 UTC - -1216837888[b75361c0]: ]
[..]
2013-09-23 09:13:39.303510 UTC - -1269826752[b7536b60]: http response [
2013-09-23 09:13:39.303523 UTC - -1269826752[b7536b60]:   HTTP/1.1 206 Partial Content
2013-09-23 09:13:39.303528 UTC - -1269826752[b7536b60]:   Server: Apache
2013-09-23 09:13:39.303533 UTC - -1269826752[b7536b60]:   X-Backend-Server: ftp5.dmz.scl3.mozilla.com
2013-09-23 09:13:39.303538 UTC - -1269826752[b7536b60]:   Cache-Control: max-age=3600
2013-09-23 09:13:39.303543 UTC - -1269826752[b7536b60]:   Content-Type: application/octet-stream
2013-09-23 09:13:39.303547 UTC - -1269826752[b7536b60]:   Date: Mon, 23 Sep 2013 09:14:05 GMT
2013-09-23 09:13:39.303552 UTC - -1269826752[b7536b60]:   Keep-Alive: timeout=5, max=1000
2013-09-23 09:13:39.303557 UTC - -1269826752[b7536b60]:   Expires: Mon, 23 Sep 2013 10:14:05 GMT
2013-09-23 09:13:39.303561 UTC - -1269826752[b7536b60]:   Accept-Ranges: bytes
2013-09-23 09:13:39.303566 UTC - -1269826752[b7536b60]:   Access-Control-Allow-Origin: *
2013-09-23 09:13:39.303570 UTC - -1269826752[b7536b60]:   Content-Range: bytes 0-299999/25604694
2013-09-23 09:13:39.303575 UTC - -1269826752[b7536b60]:   Etag: "edb453-186b256-4e6f46da8f7c2"
2013-09-23 09:13:39.303580 UTC - -1269826752[b7536b60]:   Last-Modified: Sun, 22 Sep 2013 08:09:53 GMT
2013-09-23 09:13:39.303584 UTC - -1269826752[b7536b60]:   Content-Length: 300000
2013-09-23 09:13:39.303589 UTC - -1269826752[b7536b60]:   Connection: Keep-Alive
2013-09-23 09:13:39.303593 UTC - -1269826752[b7536b60]:   X-Cache-Info: cached
2013-09-23 09:13:39.303597 UTC - -1269826752[b7536b60]: ]

The previous request is from 25500000 of the same ftp host, in case you have to go back to it.
Comment 47 Brian R. Bondy [:bbondy] 2013-09-23 08:00:14 PDT
Questions:
1) Do you know what the MAR size was that was advertised for download?
2) Do you have the app update log as well?
3) 
> The previous request is from 25500000 of the same ftp host, in case you have to go back to it.
Can you provide a full HTTP trace or at least each of the Range: headers given in the requests and Content-Range: given in the responses?

My guess from the above is that the size was advertised to be larger than 25800000 in the XML, but the real size of the file 25604694.  Maybe the XML was obtained from a different server with a different MAR or something like that and wasn't updated yet to contain the new info?  Then the MAR requests went to a different server than the update XML was obtained from?  

The above 3 things will help me learn more about the problem.
Comment 48 Henrik Skupin (:whimboo) 2013-09-23 08:26:37 PDT
(In reply to Brian R. Bondy [:bbondy] from comment #47)
> Questions:
> 1) Do you know what the MAR size was that was advertised for download?
> 2) Do you have the app update log as well?
> 3) 
> > The previous request is from 25500000 of the same ftp host, in case you have to go back to it.
> Can you provide a full HTTP trace or at least each of the Range: headers
> given in the requests and Content-Range: given in the responses?

All that you can get when you connect to the MV office VPN and accessing this url:
http://10.250.73.48:8080/job/mozilla-esr17_update/1/

http.log: http://10.250.73.48:8080/job/mozilla-esr17_update/1/artifact/http.log
AUS log:  http://10.250.73.48:8080/job/mozilla-esr17_update/1/consoleFull

I will attach copies of those information to this bug so it doesn't get lost.
Comment 49 Henrik Skupin (:whimboo) 2013-09-23 08:31:00 PDT
Created attachment 808606 [details]
log file with app.update.log content
Comment 50 Henrik Skupin (:whimboo) 2013-09-23 08:32:37 PDT
Created attachment 808609 [details]
full http.log
Comment 51 Henrik Skupin (:whimboo) 2013-09-23 08:34:30 PDT
The content of the AUS snippet used by this particular fallback update test, I sadly don't have. If it's really necessary please let me know. Some changes to our tests would have to be made to always store its content.
Comment 52 Brian R. Bondy [:bbondy] 2013-09-23 09:09:57 PDT
So the HTTP log doesn't make sense to me.
The HTTP requests look correct, they ask for ranges starting with:
Range: bytes=0-299999
then 
Range: bytes=300000-
then
Range: bytes=600000-
...
Increasing by 300000 each time.

That seems right.


But the HTTP response is always including a header like this:
Content-Range: bytes 0-299999/25604694

For each request it keeps returning the same range of bytes. It is like the server does not see the requested range. The server should be increasing the values on the left side of the / on each response but is not.

Does anyone else know what's going on?  The server seems to have Accept-Ranges: bytes
Comment 53 Ben Hearsum (:bhearsum) 2013-09-23 09:12:33 PDT
(In reply to Brian R. Bondy [:bbondy] from comment #52)
> So the HTTP log doesn't make sense to me.
> The HTTP requests look correct, they ask for ranges starting with:
> Range: bytes=0-299999
> then 
> Range: bytes=300000-
> then
> Range: bytes=600000-
> ...
> Increasing by 300000 each time.
> 
> That seems right.
> 
> 
> But the HTTP response is always including a header like this:
> Content-Range: bytes 0-299999/25604694
> 
> For each request it keeps returning the same range of bytes. It is like the
> server does not see the requested range. The server should be increasing the
> values on the left side of the / on each response but is not.
> 
> Does anyone else know what's going on?  The server seems to have
> Accept-Ranges: bytes

Jake, any idea about this? Could parts of our CDN network not be respecting the Range header?
Comment 54 Henrik Skupin (:whimboo) 2013-09-23 09:39:36 PDT
(In reply to Ben Hearsum [:bhearsum] from comment #53)
> Jake, any idea about this? Could parts of our CDN network not be respecting
> the Range header?

Do nightly builds are use CDN for patch file delivery? I thought all that is handled internally in terms of our own servers.
Comment 55 Brian R. Bondy [:bbondy] 2013-09-23 10:44:05 PDT
Note also that the ETag is the same for each response (not surprising since the content range is always the same).  What is more surprising is that the Date header is the same each time.

Maybe there's a bug in whatever generated the log files to always output the first response?
Comment 56 Henrik Skupin (:whimboo) 2013-09-23 11:39:31 PDT
(In reply to Brian R. Bondy [:bbondy] from comment #55)
> Maybe there's a bug in whatever generated the log files to always output the
> first response?

Which log do you mean? The HTTP log? In that case we do not generate that but set the following environment variables:

NSPR_LOG_MODULES=timestamp,nsHttp:5,nsSocketTransport:5,nsHostResolver:5,append
NSPR_LOG_FILE=http.log

With append all HTTP logs from each running Firefox instance will be appended. If we don't do this the log file will be overwritten each time we restart Firefox.
Comment 57 Brian R. Bondy [:bbondy] 2013-09-23 11:46:20 PDT
ah ok so it seems like either
i) there's a bug in the http logging inside netwerk somewhere around here http://dxr.mozilla.org/mozilla-central/source/netwerk/protocol/http/nsHttpTransaction.cpp#l1250
or
ii) our cache handling isn't taking into effect the range header (bug in netwerk http handling)?
or
iii) Some kind of server side or proxy problem
or
iv) Something else I'm not thinking of
Comment 58 Brian R. Bondy [:bbondy] 2013-09-23 11:50:59 PDT
If I had more time right now I'd work on bug 886061 (Add tests for coverage of the nsIIncrementalDownload implementation), maybe it could lead to some underlying problem. The http log shows the size and ranges requested, as well as the server response chunks given.  We could add a similar test.
Comment 59 Henrik Skupin (:whimboo) 2013-09-23 11:52:34 PDT
The interesting thing is that this happens very very rarely. So not sure what can really cause this. Btw, how many internal ftp servers do we have which serve the updates? In this case it was ftp5.mozilla.org.

Jason, can you help us to get the questions from Brian's last comment sorted out? We would kindly appreciate that. Thanks.
Comment 60 Henrik Skupin (:whimboo) 2013-09-23 11:54:39 PDT
(In reply to Brian R. Bondy [:bbondy] from comment #57)
> iii) Some kind of server side or proxy problem

The result from today is from a machine in MV and not SCL3, which means there is no proxy configured. So I think we should not follow that idea with those new information.
Comment 61 Brian R. Bondy [:bbondy] 2013-09-23 12:11:26 PDT
Could you also attach a known good log? I just want to compare, I'm sure the content-range response header will be filled out properly there and the date and etag headers will change too.
Comment 62 Henrik Skupin (:whimboo) 2013-09-23 12:37:40 PDT
Created attachment 808746 [details]
http.log (successful update)

As requested here an http log for a successful update test.
Comment 63 Brian R. Bondy [:bbondy] 2013-09-23 12:57:05 PDT
Yep so the Date header is updated between responses this time and the Content-Range header is filled out correctly after the 2nd request. The ETags are the same this time too but maybe that's because it is per resource (entity) and not per byte range?

So to summarize the problem:


For the log with the problem, response headers are always the same:
1) 
Request Header > Range: bytes=0-299999
Response Header < Content-Range: bytes 0-299999/25604694
Response Header < Etag: "edb453-186b256-4e6f46da8f7c2"

2) 
Request Header > Range: bytes=300000-
Response Header < Content-Range: bytes 0-299999/25604694
Response Header < Date: Mon, 23 Sep 2013 09:14:05 GMT
Response Header < Etag: "edb453-186b256-4e6f46da8f7c2"

3) 
Request Header > bytes=600000-
Response Header < Content-Range: bytes 0-299999/25604694
Response Header < Date: Mon, 23 Sep 2013 09:14:05 GMT
Response Header < Etag: "edb453-186b256-4e6f46da8f7c2"

4) 
Request Header > bytes=900000-
Response Header < Content-Range: bytes 0-299999/25604694
Response Header < Date: Mon, 23 Sep 2013 09:14:05 GMT
Response Header < Etag: "edb453-186b256-4e6f46da8f7c2"

...

(continues for a long time until the client side file is larger than what it's expected to be and then it throws an error)


For the log for the good update:
1)
Request Header > Range: bytes=0-299999
Response Header < Content-Range: bytes 0-299999/24225869
Response Header < Date: Mon, 23 Sep 2013 13:04:35 GMT
Response Header < Etag: "f38edb-171a84d-4e70c8b30ad3e"

2) 
Request Header > Range: bytes=300000-
Response Header < Content-Range: bytes 300000-24225868/24225869
Response Header < Date: Mon, 23 Sep 2013 13:06:21 GMT
Response Header < Etag: "f38edb-171a84d-4e70c8b30ad3e"

3)
(end of requests / responses)

It seems like either a bug in the server itself or in our http handling for incremental downloading.
Comment 64 Jason Duell [:jduell] (needinfo me) 2013-09-23 21:23:54 PDT
Patrick: comment 52 summarizes pretty well what's going on here.  Grep the log for 'complete.mar' to see the URI that this is happening with.

1) I'm not clear on why Firefox is asking for separate range requests to do the update. Is there some reason the updater is asking the channel to get the file in separate range requests?  I wasn't aware of the HTTP channel ever splitting things up like this unless it's told to, or it hits a partial cache entry (which we're not here)

2) The server is definitely broken or at least being very unhelpful by replying with a range request other than what was asked for.

3) Necko is broken for not failing when the wrong range is returned.

Patrick I remember you were just mentioning we're missing some range checks in our code, right?
Comment 65 Henrik Skupin (:whimboo) 2013-09-24 01:30:46 PDT
It's not necessarily the back-end server which is broken here. Given other passing update tests, which also make use of ftp5.dmz.scl3.mozilla.com, are working fine. So this might be a strange combination of different symptoms?
Comment 66 Henrik Skupin (:whimboo) 2013-09-24 01:38:08 PDT
Wait. There is something strange. I have checked a lot of http logs for all passing update tests and I have seen that in those cases we do only a single request to ftp5.dmz.scl3.mozilla.com. Sometimes we get back the whole content, but in case or partial content we switch over to another ftp server. That one serves the remaining content from 300.000 bytes up to the full file size.

That means whenever we are not hopping to another ftp server we might run into that problem. I will continue to observe http log files and will check if that is the real correlation.
Comment 67 Henrik Skupin (:whimboo) 2013-09-24 01:46:04 PDT
The following snippet most likely invalidates my thoughts from the previous comment. Looks like ftp5 is able to send the correct range request:

2013-09-21 08:55:29.936473 UTC - 152600576[100330400]: http response [
2013-09-21 08:55:29.936484 UTC - 152600576[100330400]:   HTTP/1.1 206 Partial Content
2013-09-21 08:55:29.936489 UTC - 152600576[100330400]:   Server: Apache
2013-09-21 08:55:29.936493 UTC - 152600576[100330400]:   X-Backend-Server: ftp5.dmz.scl3.mozilla.com
2013-09-21 08:55:29.936497 UTC - 152600576[100330400]:   Cache-Control: max-age=3600
2013-09-21 08:55:29.936501 UTC - 152600576[100330400]:   Content-Type: application/octet-stream
2013-09-21 08:55:29.936506 UTC - 152600576[100330400]:   Date: Sat, 21 Sep 2013 08:55:29 GMT
2013-09-21 08:55:29.936510 UTC - 152600576[100330400]:   Keep-Alive: timeout=5, max=1000
2013-09-21 08:55:29.936514 UTC - 152600576[100330400]:   Expires: Sat, 21 Sep 2013 09:55:29 GMT
2013-09-21 08:55:29.936518 UTC - 152600576[100330400]:   Accept-Ranges: bytes
2013-09-21 08:55:29.936522 UTC - 152600576[100330400]:   Access-Control-Allow-Origin: *
2013-09-21 08:55:29.936527 UTC - 152600576[100330400]:   Content-Range: bytes 300000-57273212/57273213
2013-09-21 08:55:29.936531 UTC - 152600576[100330400]:   Etag: "166c607-369eb7d-4e6e0b14aec68"
2013-09-21 08:55:29.936535 UTC - 152600576[100330400]:   Connection: Keep-Alive
2013-09-21 08:55:29.936539 UTC - 152600576[100330400]:   Last-Modified: Sat, 21 Sep 2013 08:37:08 GMT
2013-09-21 08:55:29.936543 UTC - 152600576[100330400]:   X-Cache-Info: caching
2013-09-21 08:55:29.936547 UTC - 152600576[100330400]:   Content-Length: 56973213
2013-09-21 08:55:29.936551 UTC - 152600576[100330400]: ]

I have no idea what specifically I could check further in the log files. If someone has ideas please let me know and I can continue checking the log files.
Comment 68 Patrick McManus [:mcmanus] PTO until Sep 6 2013-09-27 06:53:32 PDT
in the failed case from the log in comment 46 we've got repeated replies with the same date header and the same content-range and each of them after the first contains a response header "X-Cached-Info: cached". It seems pretty likely some cache along the way believes that 206 response is the right answer regardless of the range: request header (which is obviously wrong).

so what cache is it that is adding that header and can it be fixed? Its not internal to firefox - I can tell from the nspr log it is going to the network.

Somewhat separately, I'd like to look at the code that drives this request (i.e. the code that opens the channel and either resumes it or makes these 300KB requests).. I suspect the inclusion if an If-Range header is something you really want (to ensure you don't get mixed up versions), but Its not likely the source of this problem as the etags are consistent.

we definitely have a problem where if the server responds with a different range than the one we requested (but from the right revision of the document) we corrupt the response (more or less ignoring the range responded with). That might be coming into play here. I think a bug needs to be filed on it. But fixing that issue won't help you much here - you still won't have the right reply it will just be detected in necko.
Comment 69 Brian R. Bondy [:bbondy] 2013-09-27 15:23:32 PDT
The code is hit here:
http://dxr.mozilla.org/mozilla-central/source/toolkit/mozapps/update/nsUpdateService.js#l4133

Note that these headers are being set on the request as well:
> this._request.setRequestHeader("Cache-Control", "no-cache");
> this._request.setRequestHeader("Pragma", "no-cache");

I think the response Date header should at least be updated, even if the server is not respecting the range requests. Which supports your theory that some kind of caching is in play.
Comment 70 Patrick McManus [:mcmanus] PTO until Sep 6 2013-09-27 19:37:59 PDT
thanks brian!

(In reply to Brian R. Bondy [:bbondy] from comment #69)
> The code is hit here:
> http://dxr.mozilla.org/mozilla-central/source/toolkit/mozapps/update/
> nsUpdateService.js#l4133

that seems to call into netwerk/base/src/nsIncrementalDownload .. which honestly I'm not familiar with (though it looks straightfrward) and seems to go back to cvs days. yum!

that's probably the code that should have if-range added to it. (you can file a bug or I will do it monday).. but I doubt in this case it will help.

> 
> Note that these headers are being set on the request as well:
> > this._request.setRequestHeader("Cache-Control", "no-cache");
> > this._request.setRequestHeader("Pragma", "no-cache");

actually those are on a differnet thing right - the xhr not the incrementaldownload? (both called _request but in different objects) (I assume the xhr is some kind of meta data). The log definitely does not show the no-cache headers being sent on the range based requests. But I don't think they should be there in the general case anyhow - what could be a more perfect use of a cache infrastructure than checksummed and signed updates?

The problem here is the cache is broken.

> 
> I think the response Date header should at least be updated, even if the
> server is not respecting the range requests. Which supports your theory that
> some kind of caching is in play.

right.

I'm guessing it is a HTTP/1.0 era cache that sees a 206 and just considers it a 2xx code and caches it without an understanding of byte ranges.. in that sense the future requests are cache hits and each receive a copy of the first reply.

its consistent with what we are seeing.

I can patch nsIncrementalDownload to more or less detect when the wrong content-range comes back (if it does not match the requested range). it already has some internal retry logic that can be reused there; so I can have just those retries use no-cache and the normal transfers should continue unchanged.

I would kinda be coding blind on that one - its just a hypothesis. do we have a reasonable way to test a fix perhaps from a try build? mozmill?
Comment 71 Brian R. Bondy [:bbondy] 2013-09-27 19:52:52 PDT
ah, yep about the cache headers.  Those are sent for the app update xml. We still want them there.  That will tell us which MAR to download, and then the MAR file is downloaded with the incremental downloader.

> I can patch nsIncrementalDownload to more or less detect when the wrong content-range
> comes back (if it does not match the requested range). 
> it already has some internal retry logic that can be reused there; so I can have just
> those retries use no-cache and the normal transfers should continue unchanged.

That sounds good to me.

> I would kinda be coding blind on that one - its just a hypothesis. do we have a reasonable 
> way to test a fix perhaps from a try build? mozmill?

I don't think we have a way to immediately verify it, but I think we can see via this bug via mozmill over time.
Comment 72 Henrik Skupin (:whimboo) 2013-09-28 14:33:09 PDT
(In reply to Brian R. Bondy [:bbondy] from comment #71)
> > I would kinda be coding blind on that one - its just a hypothesis. do we have a reasonable 
> > way to test a fix perhaps from a try build? mozmill?
> 
> I don't think we have a way to immediately verify it, but I think we can see
> via this bug via mozmill over time.

Sadly that's true. We see that issue very rarely. So there is no way for us to somehow verify a possible patch. If we would have further debug information in the code we could at least dump that to the console. But then the code has to live in central. But I cannot promise when we will hit this problem the next time. It could take a couple of days or even weeks.

But good to hear that we got closer in the understanding of the issue.
Comment 73 Jake Maul [:jakem] 2013-09-30 12:32:59 PDT
Lots to comment on here...

1)
If we're querying ftp.mozilla.org (as I believe we should be for nightly/aurora), then the CDNs do not factor in. No need to worry about them. If, however, we see this problem on beta or release, then we need to be concerned here.

2)
X-Cache-Info is a header added by our own (caching) load balancers. When you see this header as "cached", the "Server: ftpX.dmz.scl3.mozilla.com" header is unreliable... that server didn't actually serve the request, it was served out of cache- that server simply generated the original item that was put into cache in the first place. If you see "X-Cache-Info: caching", then the request is being put into cache, but was indeed returned by that named server. The other option is something like "X-Cache-Info: not cacheable", which is self-explanatory.

3)
We do know these load balancers have an issue with Range requests when the response is a 302 redirect... they will return a 206 instead of the 302. This makes their default behavior unsuitable for download.mozilla.org, which returns 302's over to the real file. This particular bug shouldn't matter here, but I mention it just on the off chance there are other oddities with its Range handling. Note that we've been using them for years now, so if this is the issue, it's been around since 2011.

4)
The fact that these issues occur extremely rarely is troubling. It would be good if we could nail down exactly the right characteristics for this.

5)
It's worth noting that the same backend which powers ftp.mozilla.org also powers the origin from which the CDNs get their data. That means, if there is a significant infra issue with headers and responding with the wrong content, it seems likely it would 'bubble up' and affect the CDNs too.

6)
We've had delivery issues with nightly/aurora in the past due to the nature of those updates- the contents change nightly, but the filenames don't. This is troublesome for caching, and forces us to use small cache TTLs on them. One of the issues we've seen in the past (particularly with longer TTLs though) is that sometimes a given block (a 300KB chunk, I mean) from the previous build will still be cached, but other blocks will be from the new build. This obviously breaks the checksum and updating fails. This doesn't line up with the headers being reported here, though, so I can't point a finger at it either.


I would like to ask, is there any correlation between time-of-day or day-of-week as to when these issues happen? Perhaps they're being affected by some other force...
Comment 74 Henrik Skupin (:whimboo) 2013-09-30 14:55:55 PDT
Aurora around 2012-11-29T02:20:01Z:
Win_8_x86: http://mozmill-ci.blargon7.com/#/update/report/352218d7e3125c857fc1d3716713d32d

Aurora around 2013-02-16T16:36:49Z:
Win_8_x86: http://mozmill-ci.blargon7.com/#/update/report/a83c700664548dba07298b74bf98b9f0
Win_8_x86_64: http://mozmill-ci.blargon7.com/#/update/report/a83c700664548dba07298b74bf98a196
Win_Vista_x86: http://mozmill-ci.blargon7.com/#/update/report/a83c700664548dba07298b74bf98a099
Win_XP_x86: http://mozmill-ci.blargon7.com/#/update/report/a83c700664548dba07298b74bf98761f
Win_7_x86_64: http://mozmill-ci.blargon7.com/#/update/report/a83c700664548dba07298b74bf985efe
Win_7_x86: http://mozmill-ci.blargon7.com/#/update/report/a83c700664548dba07298b74bf986ee6

Nightly around 2013-04-03T14:11:53Z:
Win_7_x86_64: http://mozmill-ci.blargon7.com/#/update/report/740ab23e062758ae8a240b263d064cbd

Nightly around 2013-09-10T13:56:17Z:
Win_8_x86: http://mozmill-daily.blargon7.com/#/update/report/26d5854914a68aa13b333f28152e365c

So as far as I can see from those reports, we never have seen such failures for Linux or Mac OS X, but only Windows. Given that the Windows builds are taking longer before they are ready, it might be time related.

Keep in mind that we run our tests a couple of seconds after the Pulse snippets for finished builds have been arrived. Not sure if all ftp servers are synced up already by that time, and if that could cause the trouble.
Comment 75 Jake Maul [:jakem] 2013-09-30 18:04:24 PDT
I don't know anything about Pulse, so I can't comment on timing related to that. That does seem like a feasible possibility, that maybe things just aren't ready yet for Windows, but they consistently are for Linux and OSX.

If nothing else, "Windows" is alphabetically after "Linux" and "Mac" or "OSX". It wouldn't be the first time I've seen things near the end of the alphabet broken simply because they're done last (and in this case, maybe that means it takes just a bit longer to be ready for testing?).

Windows is probably also better cached, because it gets more hits, and therefore is more likely to be served out of load balancer cache whereas Linux and OSX are more likely to be served directly from the web nodes. Immediately after an update, that cache is obviously more likely to have the previous days' data, and so it's more likely to deliver the old content. This is something that only affects nightly and aurora, because beta and release actually change filenames, whereas nightly and aurora simply change the contents of a file, which is awful for cacheability.

The current load balancer TTL for http://ftp.mozilla.org/ is 1 hour. It's already pretty short to deal with precisely this problem. If this is indeed the problem, I don't know if we can feasibly shorten it enough to make mozmill happy.


Is there any way to try things with a longer "sleep" after the Pulse snippets? Something like 5 minutes, at least for the channels and platforms affected? That would at least rule out the first possibility.
Comment 76 Henrik Skupin (:whimboo) 2013-10-01 00:34:19 PDT
(In reply to Jake Maul [:jakem] from comment #75)
> If nothing else, "Windows" is alphabetically after "Linux" and "Mac" or
> "OSX". It wouldn't be the first time I've seen things near the end of the
> alphabet broken simply because they're done last (and in this case, maybe
> that means it takes just a bit longer to be ready for testing?).

It's simply that the build process takes longer. But I think we can forget about that, because it shouldn't play into account here.

> This is something that only affects nightly and aurora, because beta and
> release actually change filenames, whereas nightly and aurora simply change
> the contents of a file, which is awful for cacheability.

We have never seen problems in downloading the actual executable. It's for updates, and here specifically for fallback updates. Those are referencing the complete mar file like: firefox-27.0a1.en-US.win64-x86_64.complete.mar, whereby partial updates will change per build: firefox-27.0a1.en-US.win32.partial.20130930030205-20130930071202.mar.

And as far as I can remember we never have seen the problem with partial updates but only fallback (complete) updates.

> The current load balancer TTL for http://ftp.mozilla.org/ is 1 hour. It's
> already pretty short to deal with precisely this problem. If this is indeed
> the problem, I don't know if we can feasibly shorten it enough to make
> mozmill happy.

Brian, is there a way to let Firefox circumvent the cache when downloading the update? I wonder if that could help us here. I don't want to wait with update tests a complete hour to be sure that we get the right bits.

> Is there any way to try things with a longer "sleep" after the Pulse
> snippets? Something like 5 minutes, at least for the channels and platforms
> affected? That would at least rule out the first possibility.

I can add a sleep but that would be the last resort. I cannot do it for a specific platform, but all would be delayed. I don't really like that. But even if we change that, there is no guarantee that we can be sure it will fix it, given the very bad reproduction rate over the last year. All the above test results were nearly all failures we had for daily builds.

We also had those problems with update tests on betatest or releasetest, but I don't have examples for it right now. If it happens again, we now have http logs available.
Comment 77 Robert Strong [:rstrong] (use needinfo to contact me) 2013-10-01 00:39:05 PDT
(In reply to Henrik Skupin (:whimboo) from comment #76)
> (In reply to Jake Maul [:jakem] from comment #75)
>...
> > The current load balancer TTL for http://ftp.mozilla.org/ is 1 hour. It's
> > already pretty short to deal with precisely this problem. If this is indeed
> > the problem, I don't know if we can feasibly shorten it enough to make
> > mozmill happy.
> 
> Brian, is there a way to let Firefox circumvent the cache when downloading
> the update? I wonder if that could help us here. I don't want to wait with
> update tests a complete hour to be sure that we get the right bits.
The cache in question is the server cache and it would be better to ask Jake if there is a way for the client to tell the server to circumvent the cache but that would be a change to how the client code works and hence that wouldn't be testing what the code clients actually use.
Comment 78 Henrik Skupin (:whimboo) 2013-10-01 00:42:21 PDT
(In reply to Robert Strong [:rstrong] (do not email) from comment #77)
> The cache in question is the server cache and it would be better to ask Jake
> if there is a way for the client to tell the server to circumvent the cache
> but that would be a change to how the client code works and hence that
> wouldn't be testing what the code clients actually use.

Good call. We most likely should not do that then. A real fix is always better than a workaround, especially if we wouldn't test the real bits anymore.
Comment 79 Ben Hearsum (:bhearsum) 2013-10-01 06:10:29 PDT
Note: I've triggered additional nightlies on Friday and Monday for testing various things that affect them. It sounds like that may correlate with the increased occurrence of this lately.
Comment 80 Patrick McManus [:mcmanus] PTO until Sep 6 2013-10-01 14:41:43 PDT
Created attachment 812841 [details] [diff] [review]
incremental downloader error checking around bad server byte ranges

Change 1: For >= the second chunk add an If-Range request header
including the etag or l-m of the first chunk's response. That will
require a compliant server to send all of the responses from the same
version. non-compliant server's will obviously not give a crap, but
our current approach is a bug without it.

Change 2: If at any point the content-range response header looks
'funny' then redo that chunk with "no-cache" and also use no-cache
on all future chunks. funny includes: a range that doesn't start at
the offset we asked for, or a range from a total size that is
different than the total size in the first chunk.

I've tested this with the TestIncrementalDownload C++ test tool.

Its hard for me to say whether the cache that is causing the problem
here will honor the no-cache directive (it is in general hard to
predict what broken software will do), but if it does that ought to
resolve the issue seen in the logs.
Comment 81 Jake Maul [:jakem] 2013-10-01 16:51:32 PDT
(In reply to Patrick McManus [:mcmanus] from comment #80)
> Its hard for me to say whether the cache that is causing the problem
> here will honor the no-cache directive (it is in general hard to
> predict what broken software will do), but if it does that ought to
> resolve the issue seen in the logs.

It will honor an incoming no-cache directive.

Or rather, it *should*. We've done this before. The load balancers by default do honor incoming cache header requests. But then, hey, they also honor Range requests, and it also works *right now* almost all the time, and only sometimes we get this odd behavior, so who knows! :)

That's the only thing really troubling me, the unpredictability of reproducing this problem. 60% of the time, it works every time.


I can't say offhand if the CDNs used by beta and release will honor it. But since there's apparently no problem there, it's probably not worth thinking about too much. Such a header would only affect the single file/range requested, and not even on all CDN nodes, so the bandwidth implications are probably minimal either way.
Comment 82 Patrick McManus [:mcmanus] PTO until Sep 6 2013-10-01 19:13:04 PDT
https://tbpl.mozilla.org/?tree=Try&rev=ebf12798abfb
Comment 83 Ben Hearsum (:bhearsum) 2013-10-02 06:05:14 PDT
Moving this bug to what I think is now the correct component.
Comment 84 Jason Duell [:jduell] (needinfo me) 2013-10-15 20:06:03 PDT
Comment on attachment 812841 [details] [diff] [review]
incremental downloader error checking around bad server byte ranges

Review of attachment 812841 [details] [diff] [review]:
-----------------------------------------------------------------

So my (quite possibly wrong: I don't know this code well) understanding is that this patch changes the error scenario we've been hitting to look like this:

  1st request: range 1-999   // OK

  2nd request (for 1000-1999)  // server gives us bogus Range 1-999

  3rd request: we retry with no-cache, plus If-Range, plus Range 1000-1999 
  - if that makes the server reply with the correct Range, then we're good.
  - if the server replies with a full (non partial) 200 response (which is valid
    given we're passing If-Range), we throw it away (right? that's what seems to
    happen unless mInterval == 1 and I don't see us setting that), and try
    again until we hit mNonPartialCount. 
  - if we keep getting the bogus Range, we just keep re-trying until we hit
    mNonPartialCount then give up.

If I'm right about the algorithm here, I'm wondering if we'd be better off
  
  1) handling a full 200 reply if we get one in response to If-Request

  2) if we see an error in the 3rd request, actually just *requesting* a full,
     non-partial load.

I.e. if things go wrong with partial loads I'm guessing we're just better off grabbing the whole entity, unless there's some really good reason why that's a terrible thing to do even in this failure mode (This entire nsIncrementalDownload thing feels very 1996-dial-up-modem to me).

OTOH I don't want to burn your time on this Patrick, so if you think the current patch is good enough (or I've misinterpreted it), I'm open to just landing this.

> I've tested this with the TestIncrementalDownload C++ test tool.

You tested with a misbehaving server, or just with a compliant one?

::: netwerk/base/src/nsIncrementalDownload.cpp
@@ +591,5 @@
> +      rv = http->GetResponseHeader(NS_LITERAL_CSTRING("Content-Range"), buf);
> +      if (NS_FAILED(rv))
> +        return rv; // it isn't a 206 without a CONTENT-RANGE of some sort
> +
> +    // Content-Range: bytes 0-299999/25604694

indent 2 spaces

@@ +616,5 @@
> +          } else {
> +            int32_t slash = buf.FindChar('/');
> +            int64_t rangeSize = 0;
> +            if (slash != kNotFound &&
> +                (PR_sscanf(buf.get() + slash + 1, "%lld", (int64_t *) &rangeSize) == 1) &&

Does sscanf really not work without casting int64_t address to (int64_t*)?
Comment 85 Brian Smith (:briansmith, :bsmith, use NEEDINFO?) 2013-10-15 21:09:27 PDT
Comment on attachment 812841 [details] [diff] [review]
incremental downloader error checking around bad server byte ranges

Review of attachment 812841 [details] [diff] [review]:
-----------------------------------------------------------------

I agree with Jason that we're better off doing the conservative thing when we've detected a bad response and just start over with a non-byterange request.

I also agree with Jason that we should handle a 200 response to a byte-range request correctly.

::: netwerk/base/src/nsIncrementalDownload.cpp
@@ +578,5 @@
>    } else {
>      // We got a partial response, so clear this counter in case the next chunk
>      // results in a 200 response.
>      mNonPartialCount = 0;
> +

May be worth checking cache validator matches too, considering we're trying to work around broken servers.

Also would be good to avoid treating a 206 HTTP/1.0 response as being broken.

@@ +589,5 @@
> +      bool confirmedOK = false;
> +
> +      rv = http->GetResponseHeader(NS_LITERAL_CSTRING("Content-Range"), buf);
> +      if (NS_FAILED(rv))
> +        return rv; // it isn't a 206 without a CONTENT-RANGE of some sort

the comment can be read in two different, contradictory ways. I suggest "a valid 206 response must have a Content-Range"

@@ +622,5 @@
> +              confirmedOK = true;
> +            }
> +          }
> +        }
> +      }

I think it would be great if we only had one Content-Range parser in Gecko. Can we factor out ChannelMediaResource::ParseContentRangeHeader into something more generic? (See also nsMultiMixedConv::ParseHeaders which seems like a not-so-good parser.)

One thing that is better about ChannelMediaResource::ParseContentRangeHeader is that it uses only nsCString operations so it is more clearly memory-safe. Also, it doesn't use any scanf/printf formatting so no format specifier issues.

@@ +646,5 @@
>      rv = http->GetURI(getter_AddRefs(mFinalURI));
>      if (NS_FAILED(rv))
>        return rv;
> +    http->GetResponseHeader(NS_LITERAL_CSTRING("Etag"), mPartialValidator);
> +    if (mPartialValidator.IsEmpty())

Technically we should also ignore weak ETags...

@@ +647,5 @@
>      if (NS_FAILED(rv))
>        return rv;
> +    http->GetResponseHeader(NS_LITERAL_CSTRING("Etag"), mPartialValidator);
> +    if (mPartialValidator.IsEmpty())
> +      http->GetResponseHeader(NS_LITERAL_CSTRING("Last-Modified"), mPartialValidator);

...and weak Last-Modified.

@@ +866,5 @@
> +  if (mCacheBust) {
> +    newHttpChannel->SetRequestHeader(NS_LITERAL_CSTRING("Cache-Control"),
> +                                     NS_LITERAL_CSTRING("no-cache"), false);
> +    newHttpChannel->SetRequestHeader(NS_LITERAL_CSTRING("Pragma"),
> +                                     NS_LITERAL_CSTRING("no-cache"), false);

Better to put this duplicated code in a reused function too.
Comment 86 Brian Smith (:briansmith, :bsmith, use NEEDINFO?) 2013-10-15 21:27:44 PDT
Comment on attachment 812841 [details] [diff] [review]
incremental downloader error checking around bad server byte ranges

Review of attachment 812841 [details] [diff] [review]:
-----------------------------------------------------------------

::: netwerk/base/src/nsIncrementalDownload.cpp
@@ +860,5 @@
> +  if (!ifRange.IsEmpty()) {
> +    rv = newHttpChannel->SetRequestHeader(NS_LITERAL_CSTRING("If-Range"),
> +                                          ifRange, false);
> +    NS_ENSURE_SUCCESS(rv, rv);
> +  }

Thinking about it more: It is wrong to copy the If-Range header from the old response, isn't it? Consider:

Request  1: GET http://example.org/resource1
Response 1: 200, ETag: A
Request  2: GET http://example.org/resource1
Response 2: 301, Location: http://example.org/resource2

You cannot assume that ETag "A" is valid for resource2 just because it was valid at one point for resource1 and resource1 is now redirecting you to resource2.

Consequently, you cannot do a range request here, only a full request.
Comment 87 Patrick McManus [:mcmanus] PTO until Sep 6 2013-10-16 04:46:41 PDT
if we're going to insist on perfection, this is the wrong priority for me to be working on.
Comment 88 Patrick McManus [:mcmanus] PTO until Sep 6 2013-10-16 07:14:57 PDT
(In reply to Patrick McManus [:mcmanus] from comment #87)
> if we're going to insist on perfection, this is the wrong priority for me to
> be working on.

sorry about that - I know better than that. But I feel we should go ahead as is with this patch (modulo nits).

This is a bugfix not a feature. I believe the right standard for that is "does this improve the code and address the problem? i.e. does it move the ball forward?". The standard for a feature is somewhat higher - involving larger questions of completeness, organization, and what not.

In this case the file in question is kinda wonky in its overall approach, but on the other hand I respect the stability it has had over the years. We don't need to re-invent the feature in the bugfix. I am only aware of two real complaints regarding its operation (as opposed to moral complaints you might generate by reading it): #1 is this bug, and #2 is relayed from bahanava at the summit that the background downloads are too aggressive for some applications. I think 2 is quite worth solving, not in this bugfix, and a transition to a blind 200 without also adding some rate limiting logic is actually quite the opposite effect.

(In reply to Jason Duell (:jduell) from comment #84)

> If I'm right about the algorithm here, I'm wondering if we'd be better off
>   
>   1) handling a full 200 reply if we get one in response to If-Request
>

perhaps - but I'm not really interested in changing the incremental downloader that much without confirmation that it is necessary. It has never supported that use case and its unknown whether or not we need that big of a change to address this issue. Indeed it doesn't even currently support a 200 reply in response to a request for the first chunk without the if-request (which is certainly legal of the server to do). More importantly there isn't evidence that a 200 reply ever happens after a 206 though I agree it could theoretically happen. I would probably merge that code as a separate patch if contributed, but I don't think its germane to the log included in this bug.
 
>   2) if we see an error in the 3rd request, actually just *requesting* a
> full,
>      non-partial load.
> 
> I.e. if things go wrong with partial loads I'm guessing we're just better
> off grabbing the whole entity,

this is a little trickier. you're speculating on what the broken server needs to get going on the right path again (a non range request) and so am I. (cache-control). We don't have a way of determining which is right without deploying the code. Your approach has the benefit of being a bigger hammer and being more likely to fix the interop problem.. but my approach has the benefit of being less risky  because it doesn't require adding a whole new code path (a 200 on top of a partial 206 spool) which also has behavioral risks in terms of overloading the background download rate. (the rate is currently limited by this range based chunking not by use of the token bucket). I prefer to try the targetted fix and only go to the broader one if our hand is forced. 

> unless there's some really good reason why
> that's a terrible thing to do even in this failure mode (This entire
> nsIncrementalDownload thing feels very 1996-dial-up-modem to me).
>

I agree - but like I said, I respect code that ships and mostly works :)
 
> 
> You tested with a misbehaving server, or just with a compliant one?
>

I have tested with a misbehaving server (synthetically injected errors), but of course not the one in question with this bug because that's intermittent and not reproducible. So I can test the code does what I want but I cannot test how the real server reacts to that workaround.

 (In reply to Brian Smith (:briansmith, was :bsmith@mozilla.com) from comment #86)
> Comment on attachment 812841 [details] [diff] [review]
> incremental downloader error checking around bad server byte ranges
> 
> Review of attachment 812841 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> ::: netwerk/base/src/nsIncrementalDownload.cpp
> @@ +860,5 @@
> > +  if (!ifRange.IsEmpty()) {
> > +    rv = newHttpChannel->SetRequestHeader(NS_LITERAL_CSTRING("If-Range"),
> > +                                          ifRange, false);
> > +    NS_ENSURE_SUCCESS(rv, rv);
> > +  }
> 
> Thinking about it more: It is wrong to copy the If-Range header from the old
> response, isn't it? Consider:
> 
> Request  1: GET http://example.org/resource1
> Response 1: 200, ETag: A
> Request  2: GET http://example.org/resource1
> Response 2: 301, Location: http://example.org/resource2
> 
> You cannot assume that ETag "A" is valid for resource2 just because it was
> valid at one point for resource1 and resource1 is now redirecting you to
> resource2.
> 
> Consequently, you cannot do a range request here, only a full request.

that's not quite what is going on.. response 1 needs to be a 206, and request 2 is really "GET /resource1, Range: 1000-whatever, if-range: a".. if-range will never be set if we don't already have a partial response spooled. if we get a 30x there we can do one of four things:
1] accept on blind faith this is a continuation of what is already spooled. This is the current behavior and it is backstopped by a checksum at a higher layer.
2] enforce etag A on the subsequent 206 (transaction 3).. which is certainly some sanity protection - if they are different this clearly isn't going to work :) that is what this patch does. It isn't technically right in the sense that resource1:A doesn't have to be the same as resource2:A so we might get a checksum error on the larger download. but we have protection there.
3] we can simply throw an error anytime we get a 3xx in response to a request for chunk > 0. That really shouldn't happen (the redirections should all be resolved on chunk 0.)
4] we could react to a 3xx in response to a request for chunk > 0 by starting over with non-ranged requests. That is probably superior overall - but given the risks I talked about above in reply to jason and the fact that there is no evidence of this problem at all I don't want to pollute this bugfix with that code nor would I personally prioritize its work.

on balance, I think 2 is the right path for this patch. It does have an extremely low risk of breaking an existing (broken!) use case but I don't see any evidence to suggest that. It leaves some wiggle room if you've indicated with etags that everything is consistent. If you feel really strongly about the other options let me know, but I think this is one of those things where we are trying to balance flexibility with correctness in a path that can't really happen in a logical sense (because the redirects have to be on the chunk 0 transaction) and there isn't a hard and fast right answer.
Comment 89 Jason Duell [:jduell] (needinfo me) 2013-10-16 14:07:29 PDT
Comment on attachment 812841 [details] [diff] [review]
incremental downloader error checking around bad server byte ranges

Review of attachment 812841 [details] [diff] [review]:
-----------------------------------------------------------------

> Your approach has the benefit of being a bigger hammer and being more likely
> to fix the interop problem.. but my approach has the benefit of being less
> risky  because it doesn't require adding a whole new code path (a 200 on top
> of a partial 206 spool) which also has behavioral risks in terms of
> overloading the background download rate. 

Fair enough.  I'm good with landing this and if it fixes the specific intermittent server bug here, calling this done.  If/when we see evidence that actual implementations need more attention to get things right we can revisit.

Brian, are you ok with that?
Comment 90 Patrick McManus [:mcmanus] PTO until Sep 6 2013-10-19 17:41:12 PDT
(In reply to Brian Smith (:briansmith, was :bsmith@mozilla.com) from comment #85)

> Technically we should also ignore weak ETags...

good point. thanks.

I've also given the redirection case more thought - and doing byte ranges with it makes even less sense than before for anything other than chunk 0.. so I'm going to essentially keep that behavior unchanged by not copying over the if-range on the redirect and also clearing the validator so it won't be used for future chunks either. That will preserve the status quo, which at least has the virtue of seemingly working.
Comment 91 Patrick McManus [:mcmanus] PTO until Sep 6 2013-10-19 18:09:12 PDT
remote:   https://hg.mozilla.org/integration/mozilla-inbound/rev/cbf2fa780040
Comment 92 :Ms2ger (⌚ UTC+1/+2) 2013-10-20 03:22:23 PDT
https://hg.mozilla.org/mozilla-central/rev/cbf2fa780040
Comment 93 Henrik Skupin (:whimboo) 2013-10-20 23:23:18 PDT
I wonder if we should also get this backported to Aurora.
Comment 94 Patrick McManus [:mcmanus] PTO until Sep 6 2013-10-21 04:52:18 PDT
(In reply to Henrik Skupin (:whimboo) from comment #93)
> I wonder if we should also get this backported to Aurora.

I would argue no
1] don't mess with the baking time on something as critical as updates
2] we don't know without experience if this actually fixes the intermittent problem
Comment 95 Lukas Blakk [:lsblakk] use ?needinfo 2013-10-21 13:13:08 PDT
Agree with Patrick, let's not put additional risk by lowering bake time for this plus it's been around for a long time.
Comment 96 Lukas Blakk [:lsblakk] use ?needinfo 2013-10-21 13:13:41 PDT
Sorry, bad flagging - don't need to track this long-standing issue for release.
Comment 97 Anthony Hughes (:ashughes) [GFX][QA][Mentor] 2013-10-22 15:25:22 PDT
This isn't something we can easily verify, it only reproduced intermittently in our automation. We'll just have to keep an eye out in our update tests and file a follow up bug if it happens again.
Comment 98 Brian Smith (:briansmith, :bsmith, use NEEDINFO?) 2013-10-23 16:05:53 PDT
Comment on attachment 812841 [details] [diff] [review]
incremental downloader error checking around bad server byte ranges

Review of attachment 812841 [details] [diff] [review]:
-----------------------------------------------------------------

I still stand by my previous review comments but a patch different than this one was already committed.
Comment 99 Henrik Skupin (:whimboo) 2014-09-26 03:04:21 PDT
I have seen this problem again with an update for Nightly from 09/23 to 09/25. The complete mar file failed in this case:

*** AUS:SVC Downloader:onProgress - progress: 53400000 is higher than patch size: 53299706
*** AUS:SVC Downloader: cancel
*** AUS:SVC Downloader:onStopRequest - original URI spec: http://ftp.mozilla.org/pub/mozilla.org/firefox/nightly/2014/09/2014-09-25-03-02-03-mozilla-central/firefox-35.0a1.en-US.linux-x86_64.complete.mar, final URI spec: http://ftp.mozilla.org/pub/mozilla.org/firefox/nightly/2014/09/2014-09-25-03-02-03-mozilla-central/firefox-35.0a1.en-US.linux-x86_64.complete.mar, status: 2147549183

I will keep an eye on it. If it happens again or regularly I will get a new bug filed.

Note You need to log in before you can comment on or make changes to this bug.