Closed
Bug 1073459
Opened 10 years ago
Closed 2 years ago
Updates failing for Firefox because downloaded patch is larger than the one referenced in the update snippet
Categories
(Core :: Networking, defect, P3)
Tracking
()
RESOLVED
WORKSFORME
People
(Reporter: whimboo, Unassigned)
References
Details
(Whiteboard: [mozmill][necko-backlog])
Attachments
(1 file)
21.08 KB,
text/plain
|
Details |
+++ This bug was initially created as a clone of Bug #816472 +++
We have seen this behavior already on bug 816472 a while ago and Patrick pushed a fix for Firefox 27. Sadly something seems to have been regressed here, or I'm hitting another problem.
So what I'm testing is to update the 2014-09-24-03-02-04-mozilla-central-firefox-35.0a1.en-US.linux-x86_64.tar.bz2 build to yesterday's one. The direct update via the partial mar file works fine, while the fallback update by using the full mar file fails.
Update snippets used:
https://aus4.mozilla.org/update/3/Firefox/35.0a1/20140924030204/Linux_x86_64-gcc3/en-US/nightly/Linux%203.13.0-35-generic%20(GTK%202.24.23)/default/default/update.xml?force=1
<updates><update type="minor" displayVersion="35.0a1" appVersion="35.0a1" platformVersion="35.0a1" buildID="20140925030203"><patch type="complete" URL="http://ftp.mozilla.org/pub/mozilla.org/firefox/nightly/2014/09/2014-09-25-03-02-03-mozilla-central/firefox-35.0a1.en-US.linux-x86_64.complete.mar" hashFunction="sha512" hashValue="5e25d232c442e70b92c9a4f5ac65c79d187ff363ce8f5269ddd3ceeb6cea5e9de18df74977bfe2a6f9f1fcb345172eb00da4140a76eb72dab36fb9044b2e140f" size="53299706"/><patch type="partial" URL="http://ftp.mozilla.org/pub/mozilla.org/firefox/nightly/2014/09/2014-09-25-03-02-03-mozilla-central/firefox-35.0a1.en-US.linux-x86_64.partial.20140924030204-20140925030203.mar" hashFunction="sha512" hashValue="1f28c66ee80cdda7b430df0bd696d750319de0f662b4e788ce91836c509fa2cba77d617ac3f016ac8bd48d7322d6ab58a2f04a010d9546860ef3f119c5136ca0" size="9606465"/></update></updates>
Reporter | ||
Comment 1•10 years ago
|
||
Comment 2•10 years ago
|
||
The problem will likely be in networking so network logs will probably be needed.
Reporter | ||
Comment 3•10 years ago
|
||
I re-run this update a couple of times and I can reproduce it once each 3-4 times. Finally I got a HTTP log for such a failed update:
https://www.dropbox.com/s/ef6ipbyws8rpckw/log.txt?dl=0
Patrick, would you mind having a look at this?
Flags: needinfo?(mcmanus)
Reporter | ||
Comment 4•10 years ago
|
||
Please disregard the above log file. I missed to specify 'append' to the HTTP_LOG env variable. So what's in there is only for the latest restart. The failure happened earlier. I will try to get the right log.
Reporter | ||
Comment 5•10 years ago
|
||
Interestingly a new version of Firefox Nightly came out while I was trying to get the log for this problem. Sadly I was not able to reproduce it anymore with the target build 09/26. Thankfully I saved off the above update snippet, so that I was able to override the update url and still serve the old snippet content.
So here the the HTTP log:
https://www.dropbox.com/s/h62wizk7czvzx4m/log-larger.zip?dl=0
Comment 6•10 years ago
|
||
This was pretty involved and took me a while to get to it. Sorry about that.
Here's what the log shows (analysis in next comment):
51.804: http request [
51.804: calling PR_Poll [active=4 idle=0]
51.804: GET /pub/mozilla.org/firefox/nightly/2014/09/2014-09-25-03-02-03-mozilla-central/firefox-35.0a1.en-US.linux-x86_64.complete.mar HTTP/1.1
51.804: Host: ftp.mozilla.org
51.804: User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:35.0) Gecko/20100101 Firefox/35.0
51.804: Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
51.804: Accept-Language: en-US,en;q=0.5
51.804: Range: bytes=53100000-
51.804: If-Range: "4874170-32d49fa-503e245bea46f"
51.804: Cache-Control: no-cache
51.804: Pragma: no-cache
51.804: Cookie: optimizelySegments=%7B%22245875585%22%3A%22direct%22%2C%22245617832%22%3A%22none%22%2C%22246048108%22%3A%22false%22%2C%22245677587%22%3A%22ff%22%2C%22869421433%22%3A%22true%22%7D; optimizelyEndUserId=oeu1411735435041r0.33049987065833963; optimi\
zelyBuckets=%7B%7D; __utma=150903082.178961275.1411735435.1411735435.1411735435.1; __utmb=150903082.3.10.1411735435; __utmz=150903082.1411735435.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none); __utmc=150903082
51.804: Connection: keep-alive
51.804: ]
and then
51.992: http response [
51.992: HTTP/1.1 206 Partial Content
51.992: Server: Apache
51.992: X-Backend-Server: ftp1.dmz.scl3.mozilla.com
51.992: Cache-Control: max-age=3600
51.992: Content-Type: application/octet-stream
51.992: Date: Fri, 26 Sep 2014 12:41:26 GMT
51.992: Keep-Alive: timeout=5, max=1000
51.992: Expires: Fri, 26 Sep 2014 13:41:26 GMT
51.992: Accept-Ranges: bytes
51.992: Access-Control-Allow-Origin: *
51.992: Content-Range: bytes 40500000-40799999/53299706
51.992: Etag: "4874170-32d49fa-503e245bea46f"
51.992: Last-Modified: Thu, 25 Sep 2014 11:40:51 GMT
51.992: Content-Length: 300000
51.992: Connection: Keep-Alive
51.992: X-Cache-Info: cached
51.992: ]
Flags: needinfo?(mcmanus)
Comment 7•10 years ago
|
||
This is broken cache behavior.
The request
51.804: Range: bytes=53100000-
The response
51.992: Content-Range: bytes 40500000-40799999/53299706
so in short - we're asking for bytes 53100000 to EOF, and the server is responding with 40500000-4079999.
At least the etag and if-range match, so they're talking about the same representation.
and then we get
52.752846: nsHttpChannel::Cancel [this=7f151f5dc800 status=8000ffff] NS_ERROR_UNEXPECTED
So the incremental downloader afaict doesn't even notice that its the wrong range.. it just assumes this is the right thing and appends it to the partially downloaded mar file.
I think what it notices is the content length of 300,000 + the 53100000 it already had is bigger than the expected 53299706. That's where it throws the error:
https://mxr.mozilla.org/mozilla-central/source/netwerk/base/src/nsIncrementalDownload.cpp#702
Comment 8•10 years ago
|
||
What can we do?
option a - fix the broken thing. In this case that's the server which is responding incorrectly to the range request. Fixing broken things is always a good plan.
option b - stop using the incremental downloader and use an nsichannel directly. If you're concerned about rate limiting that's pretty easy to handle with suspend and resume at a gross scale (which ought to work here).
however, if you're concerned about connections being stable enough to download the whole mar reliably in one shot then its not a good idea. And I could understand if you feel that way.
option c - change incremental downloader to notice when its getting the wrong chunk. This will catch other errors too. also change it so that it will retry without a range request and overwrite the existing partial mar with the full 200 response.
we can do c.. but whomever owns a really should do that too - its just begging for corruption. and I'm not going to bother with c is b is in the plan (which I sometimes hear).
let me know.
Comment 9•10 years ago
|
||
> option c - change incremental downloader to notice when its getting the
> wrong chunk. This will catch other errors too.
and the are other errors in this same transfer that we don't seem to notice (presumably the mar processor would find it based on a checksum if the transfer didn't die on the last chunk). Fix that server - that will obviously cure the problem in deployed clients.
request
45:48.305883 Range: bytes=51900000-
45:48.305886 If-Range: "4874170-32d49fa-503e245bea46f"
response
45:48.494333 HTTP/1.1 206 Partial Content
45:48.494377 Content-Range: bytes 40500000-40799999/53299706
Comment 10•10 years ago
|
||
I am leaning toward b for the client though a should also be fixed.
Patrick, when you say "suspend and resume at a gross scale" could you provide additional detail for someone not familiar with it or a dxr link to an example usage of suspend and resume? Thanks!
Jake, can you get the server fixed?
Flags: needinfo?(nmaul)
Flags: needinfo?(mcmanus)
Comment 11•10 years ago
|
||
(In reply to Robert Strong [:rstrong] (use needinfo to contact me) from comment #10)
>
> Patrick, when you say "suspend and resume at a gross scale" could you
> provide additional detail for someone not familiar with it or a dxr link to
> an example usage of suspend and resume? Thanks!
>
https://mxr.mozilla.org/mozilla-central/source/netwerk/base/public/nsIRequest.idl#82
an nsichannel is a subclass of nsirequest.
I think the possible concern here would be that in some of our markets the connectivity wouldn't be robust enough to get the complete download in one transfer. but maybe that's un-necessary worrying.
Flags: needinfo?(mcmanus)
Comment 12•10 years ago
|
||
That and we also support exiting the application during a download and resuming it on next app launch.
Comment 13•10 years ago
|
||
I can't replicate this by hand. What am I missing?
$ curl -v 'http://ftp.mozilla.org/pub/mozilla.org/firefox/nightly/2014/09/2014-09-25-03-02-03-mozilla-central/firefox-35.0a1.en-US.linux-x86_64.complete.mar' -o /dev/null -H 'Range: bytes=53100000-' -H 'If-Range: "4874170-32d49fa-503e245bea46f"' -H 'Cache-Control: no-cache' -H 'Pragma: no-cache' -H 'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8' -H 'Accept-Language: en-US,en;q=0.5' -H 'User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:35.0) Gecko/20100101 Firefox/35.0' -H 'Connection: keep-alive'
* About to connect() to ftp.mozilla.org port 80 (#0)
* Trying 63.245.215.56...
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0* connected
* Connected to ftp.mozilla.org (63.245.215.56) port 80 (#0)
> GET /pub/mozilla.org/firefox/nightly/2014/09/2014-09-25-03-02-03-mozilla-central/firefox-35.0a1.en-US.linux-x86_64.complete.mar HTTP/1.1
> Host: ftp.mozilla.org
> Range: bytes=53100000-
> If-Range: "4874170-32d49fa-503e245bea46f"
> Cache-Control: no-cache
> Pragma: no-cache
> Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
> Accept-Language: en-US,en;q=0.5
> User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:35.0) Gecko/20100101 Firefox/35.0
> Connection: keep-alive
>
< HTTP/1.1 206 Partial Content
< Server: Apache
< X-Backend-Server: ftp8.dmz.scl3.mozilla.com
< Cache-Control: max-age=3600
< Content-Type: application/octet-stream
< Date: Mon, 20 Oct 2014 23:45:17 GMT
< Expires: Tue, 21 Oct 2014 00:45:17 GMT
< Accept-Ranges: bytes
< Access-Control-Allow-Origin: *
< Content-Range: bytes 53100000-53299705/53299706
< ETag: "4874170-32d49fa-503e245bea46f"
< Last-Modified: Thu, 25 Sep 2014 11:40:51 GMT
< Content-Length: 199706
< Connection: Keep-Alive
< X-Cache-Info: cached
If I can replicate this, we stand a chance of maybe putting in place some sort of rule that corrects the errant behavior. I'm not sure how realistic this is, but maybe. More likely is that if we can provide a reproducible test case, we can provide it to the load balancer vendor (Riverbed) and perhaps get a patch.
Flags: needinfo?(nmaul) → needinfo?(mcmanus)
Reporter | ||
Comment 14•10 years ago
|
||
Jake, maybe it's a single server only? You sent a request against the load balancer. In the case of failure it happened for ftp1.dmz.scl3.mozilla.com. You got back a response from ftp8.dmz.scl3.mozilla.com.
Comment 15•10 years ago
|
||
I don't know how to make the server do the thing the log says it did :)
Flags: needinfo?(mcmanus)
Comment 16•10 years ago
|
||
that last comment came off a little weird.. sorry about that. I just mean I don't know what triggers the bug either - there are lots of cases of the server doing the right thing in the log too. perhaps some kind of cache corruption that has long since cleared up.
Updated•9 years ago
|
Whiteboard: [mozmill] → [mozmill][necko-backlog]
Comment 17•7 years ago
|
||
Bulk change to priority: https://bugzilla.mozilla.org/show_bug.cgi?id=1399258
Priority: -- → P1
Comment 18•7 years ago
|
||
Bulk change to priority: https://bugzilla.mozilla.org/show_bug.cgi?id=1399258
Priority: P1 → P3
Comment 19•2 years ago
|
||
In the process of migrating remaining bugs to the new severity system, the severity for this bug cannot be automatically determined. Please retriage this bug using the new severity system.
Severity: major → --
Comment 20•2 years ago
|
||
I assume this works now.
If anyone is still seeing these kinds of errors, please reopen the bug.
Status: NEW → RESOLVED
Closed: 2 years ago
Resolution: --- → WORKSFORME
You need to log in
before you can comment on or make changes to this bug.
Description
•