Closed Bug 1073459 Opened 10 years ago Closed 2 years ago

Updates failing for Firefox because downloaded patch is larger than the one referenced in the update snippet

Categories

(Core :: Networking, defect, P3)

35 Branch
defect


RESOLVED WORKSFORME

People

(Reporter: whimboo, Unassigned)

References

Details

(Whiteboard: [mozmill][necko-backlog])

Attachments

(1 file)

+++ This bug was initially created as a clone of Bug #816472 +++

We already saw this behavior in bug 816472 a while ago, and Patrick pushed a fix for Firefox 27. Sadly something seems to have regressed here, or I'm hitting another problem.

What I'm testing is updating the 2014-09-24-03-02-04-mozilla-central-firefox-35.0a1.en-US.linux-x86_64.tar.bz2 build to yesterday's build. The direct update via the partial mar file works fine, while the fallback update using the full mar file fails.

Update snippet used: https://aus4.mozilla.org/update/3/Firefox/35.0a1/20140924030204/Linux_x86_64-gcc3/en-US/nightly/Linux%203.13.0-35-generic%20(GTK%202.24.23)/default/default/update.xml?force=1

<updates>
  <update type="minor" displayVersion="35.0a1" appVersion="35.0a1" platformVersion="35.0a1" buildID="20140925030203">
    <patch type="complete" URL="http://ftp.mozilla.org/pub/mozilla.org/firefox/nightly/2014/09/2014-09-25-03-02-03-mozilla-central/firefox-35.0a1.en-US.linux-x86_64.complete.mar" hashFunction="sha512" hashValue="5e25d232c442e70b92c9a4f5ac65c79d187ff363ce8f5269ddd3ceeb6cea5e9de18df74977bfe2a6f9f1fcb345172eb00da4140a76eb72dab36fb9044b2e140f" size="53299706"/>
    <patch type="partial" URL="http://ftp.mozilla.org/pub/mozilla.org/firefox/nightly/2014/09/2014-09-25-03-02-03-mozilla-central/firefox-35.0a1.en-US.linux-x86_64.partial.20140924030204-20140925030203.mar" hashFunction="sha512" hashValue="1f28c66ee80cdda7b430df0bd696d750319de0f662b4e788ce91836c509fa2cba77d617ac3f016ac8bd48d7322d6ab58a2f04a010d9546860ef3f119c5136ca0" size="9606465"/>
  </update>
</updates>
The problem will likely be in networking so network logs will probably be needed.
I re-ran this update a couple of times and can reproduce it roughly once every 3-4 attempts. I finally got an HTTP log for such a failed update: https://www.dropbox.com/s/ef6ipbyws8rpckw/log.txt?dl=0 Patrick, would you mind having a look at this?
Flags: needinfo?(mcmanus)
Please disregard the above log file. I forgot to specify 'append' in the HTTP_LOG env variable, so what's in there covers only the latest restart; the failure happened earlier. I will try to get the right log.
Interestingly, a new version of Firefox Nightly came out while I was trying to get the log for this problem. Sadly I was no longer able to reproduce it with the 09/26 target build. Thankfully I had saved the above update snippet, so I was able to override the update URL and still serve the old snippet content. So here is the HTTP log: https://www.dropbox.com/s/h62wizk7czvzx4m/log-larger.zip?dl=0
This was pretty involved and took me a while to get to it. Sorry about that. Here's what the log shows (analysis in next comment):

51.804: http request [
51.804: calling PR_Poll [active=4 idle=0]
51.804: GET /pub/mozilla.org/firefox/nightly/2014/09/2014-09-25-03-02-03-mozilla-central/firefox-35.0a1.en-US.linux-x86_64.complete.mar HTTP/1.1
51.804: Host: ftp.mozilla.org
51.804: User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:35.0) Gecko/20100101 Firefox/35.0
51.804: Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
51.804: Accept-Language: en-US,en;q=0.5
51.804: Range: bytes=53100000-
51.804: If-Range: "4874170-32d49fa-503e245bea46f"
51.804: Cache-Control: no-cache
51.804: Pragma: no-cache
51.804: Cookie: optimizelySegments=%7B%22245875585%22%3A%22direct%22%2C%22245617832%22%3A%22none%22%2C%22246048108%22%3A%22false%22%2C%22245677587%22%3A%22ff%22%2C%22869421433%22%3A%22true%22%7D; optimizelyEndUserId=oeu1411735435041r0.33049987065833963; optimizelyBuckets=%7B%7D; __utma=150903082.178961275.1411735435.1411735435.1411735435.1; __utmb=150903082.3.10.1411735435; __utmz=150903082.1411735435.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none); __utmc=150903082
51.804: Connection: keep-alive
51.804: ]

and then

51.992: http response [
51.992: HTTP/1.1 206 Partial Content
51.992: Server: Apache
51.992: X-Backend-Server: ftp1.dmz.scl3.mozilla.com
51.992: Cache-Control: max-age=3600
51.992: Content-Type: application/octet-stream
51.992: Date: Fri, 26 Sep 2014 12:41:26 GMT
51.992: Keep-Alive: timeout=5, max=1000
51.992: Expires: Fri, 26 Sep 2014 13:41:26 GMT
51.992: Accept-Ranges: bytes
51.992: Access-Control-Allow-Origin: *
51.992: Content-Range: bytes 40500000-40799999/53299706
51.992: Etag: "4874170-32d49fa-503e245bea46f"
51.992: Last-Modified: Thu, 25 Sep 2014 11:40:51 GMT
51.992: Content-Length: 300000
51.992: Connection: Keep-Alive
51.992: X-Cache-Info: cached
51.992: ]
Flags: needinfo?(mcmanus)
This is broken cache behavior.

The request:
51.804: Range: bytes=53100000-

The response:
51.992: Content-Range: bytes 40500000-40799999/53299706

In short: we're asking for bytes 53100000 to EOF, and the server is responding with 40500000-40799999. At least the ETag and If-Range match, so they're talking about the same representation.

And then we get:
52.752846: nsHttpChannel::Cancel [this=7f151f5dc800 status=8000ffff] NS_ERROR_UNEXPECTED

So the incremental downloader, afaict, doesn't even notice that it's the wrong range; it just assumes this is the right thing and appends it to the partially downloaded mar file. What I think it does notice is that the content length of 300000, plus the 53100000 it already had, is bigger than the expected 53299706. That's where it throws the error: https://mxr.mozilla.org/mozilla-central/source/netwerk/base/src/nsIncrementalDownload.cpp#702
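The mismatch Patrick describes is mechanical to detect. A minimal sketch (not the actual necko code; the function name and shape are hypothetical) of validating a 206 response's Content-Range against the resume offset the client asked for:

```python
import re

def content_range_matches(requested_start, content_range_header, expected_total=None):
    """Return True if a 206 Content-Range header starts at the byte
    offset we requested (and, optionally, reports the expected total size)."""
    m = re.match(r"bytes (\d+)-(\d+)/(\d+)", content_range_header)
    if not m:
        return False
    start, end, total = (int(g) for g in m.groups())
    if start != requested_start:
        return False  # server ignored or mangled our Range header
    if expected_total is not None and total != expected_total:
        return False
    return end >= start

# The failing exchange from the log: we asked for bytes=53100000-,
# the server answered with a chunk from the middle of the file.
print(content_range_matches(53100000, "bytes 40500000-40799999/53299706"))  # False
print(content_range_matches(53100000, "bytes 53100000-53299705/53299706"))  # True
```

With a check like this the downloader would reject the bad chunk at 51.992 immediately instead of appending it and only failing later on the length overflow.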
What can we do?

Option a - fix the broken thing. In this case that's the server, which is responding incorrectly to the range request. Fixing broken things is always a good plan.

Option b - stop using the incremental downloader and use an nsIChannel directly. If you're concerned about rate limiting, that's pretty easy to handle with suspend and resume at a gross scale (which ought to work here). However, if you're concerned about connections being stable enough to download the whole mar reliably in one shot, then it's not a good idea. And I could understand if you feel that way.

Option c - change the incremental downloader to notice when it's getting the wrong chunk. This will catch other errors too. Also change it so that it retries without a range request and overwrites the existing partial mar with the full 200 response.

We can do c, but whoever owns a really should do that too - it's just begging for corruption. And I'm not going to bother with c if b is in the plan (which I sometimes hear). Let me know.
> Option c - change the incremental downloader to notice when it's getting
> the wrong chunk. This will catch other errors too.

And there are other errors in this same transfer that we don't seem to notice (presumably the mar processor would find them based on a checksum if the transfer didn't die on the last chunk). Fix that server - that will obviously cure the problem in deployed clients.

Request:
45:48.305883 Range: bytes=51900000-
45:48.305886 If-Range: "4874170-32d49fa-503e245bea46f"

Response:
45:48.494333 HTTP/1.1 206 Partial Content
45:48.494377 Content-Range: bytes 40500000-40799999/53299706
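Option c boils down to a small decision step when the response headers arrive. A hedged sketch of that logic (hypothetical names, not the nsIncrementalDownload implementation):

```python
def plan_next_step(status, resume_offset, content_range_start):
    """Decide what an incremental downloader should do with a response.

    status: HTTP status code of the response
    resume_offset: byte offset where the partial file currently ends
    content_range_start: first byte of the returned range, or None if
                         no Content-Range header was present
    """
    if status == 200:
        # Server ignored the Range header entirely: discard the partial
        # file and take the full body.
        return "restart-with-full-body"
    if status == 206:
        if content_range_start == resume_offset:
            return "append"  # the chunk lines up; keep going
        # Wrong chunk (the bug in this report): do NOT append. Retry
        # without a Range header and overwrite the partial mar.
        return "retry-without-range"
    return "error"

print(plan_next_step(206, 53100000, 40500000))  # retry-without-range
print(plan_next_step(206, 53100000, 53100000))  # append
```

The "retry-without-range" branch is the overwrite-with-a-full-200 fallback described above; everything else is today's happy path.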
I am leaning toward b for the client, though a should also be fixed. Patrick, when you say "suspend and resume at a gross scale", could you provide additional detail for someone not familiar with it, or a DXR link to an example usage of suspend and resume? Thanks! Jake, can you get the server fixed?
Flags: needinfo?(nmaul)
Flags: needinfo?(mcmanus)
(In reply to Robert Strong [:rstrong] (use needinfo to contact me) from comment #10)
> Patrick, when you say "suspend and resume at a gross scale" could you
> provide additional detail for someone not familiar with it or a dxr link to
> an example usage of suspend and resume? Thanks!

https://mxr.mozilla.org/mozilla-central/source/netwerk/base/public/nsIRequest.idl#82

An nsIChannel is a subclass of nsIRequest. I think the possible concern here would be that in some of our markets the connectivity wouldn't be robust enough to get the complete download in one transfer. But maybe that's unnecessary worrying.
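For readers unfamiliar with "suspend and resume at a gross scale": the idea is coarse-grained throttling, pausing the whole channel whenever the observed transfer rate exceeds a budget and resuming after a short delay, rather than shaping individual packets. A rough illustrative sketch in plain Python (hypothetical helper; the real mechanism is nsIRequest::Suspend/Resume on the channel):

```python
def over_rate_budget(bytes_received, elapsed_seconds, max_bytes_per_second):
    """True if the transfer so far exceeds the coarse rate budget,
    i.e. the listener should call Suspend() and schedule a Resume()."""
    if elapsed_seconds <= 0:
        return False
    return bytes_received / elapsed_seconds > max_bytes_per_second

# In OnDataAvailable one would do roughly:
#   if over_rate_budget(total_bytes, now - start_time, limit):
#       request.Suspend()               # nsIRequest::Suspend
#       schedule(request.Resume, delay) # resume after a pause
print(over_rate_budget(10_000_000, 5.0, 1_000_000))  # True: 2 MB/s > 1 MB/s limit
```

Because the check runs only once per data chunk and the pause is long relative to a packet, the control is "gross scale": average rate is bounded, instantaneous rate is not.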
Flags: needinfo?(mcmanus)
That and we also support exiting the application during a download and resuming it on next app launch.
I can't replicate this by hand. What am I missing?

$ curl -v 'http://ftp.mozilla.org/pub/mozilla.org/firefox/nightly/2014/09/2014-09-25-03-02-03-mozilla-central/firefox-35.0a1.en-US.linux-x86_64.complete.mar' -o /dev/null \
    -H 'Range: bytes=53100000-' \
    -H 'If-Range: "4874170-32d49fa-503e245bea46f"' \
    -H 'Cache-Control: no-cache' \
    -H 'Pragma: no-cache' \
    -H 'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8' \
    -H 'Accept-Language: en-US,en;q=0.5' \
    -H 'User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:35.0) Gecko/20100101 Firefox/35.0' \
    -H 'Connection: keep-alive'
* About to connect() to ftp.mozilla.org port 80 (#0)
*   Trying 63.245.215.56... connected
* Connected to ftp.mozilla.org (63.245.215.56) port 80 (#0)
> GET /pub/mozilla.org/firefox/nightly/2014/09/2014-09-25-03-02-03-mozilla-central/firefox-35.0a1.en-US.linux-x86_64.complete.mar HTTP/1.1
> Host: ftp.mozilla.org
> Range: bytes=53100000-
> If-Range: "4874170-32d49fa-503e245bea46f"
> Cache-Control: no-cache
> Pragma: no-cache
> Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
> Accept-Language: en-US,en;q=0.5
> User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:35.0) Gecko/20100101 Firefox/35.0
> Connection: keep-alive
>
< HTTP/1.1 206 Partial Content
< Server: Apache
< X-Backend-Server: ftp8.dmz.scl3.mozilla.com
< Cache-Control: max-age=3600
< Content-Type: application/octet-stream
< Date: Mon, 20 Oct 2014 23:45:17 GMT
< Expires: Tue, 21 Oct 2014 00:45:17 GMT
< Accept-Ranges: bytes
< Access-Control-Allow-Origin: *
< Content-Range: bytes 53100000-53299705/53299706
< ETag: "4874170-32d49fa-503e245bea46f"
< Last-Modified: Thu, 25 Sep 2014 11:40:51 GMT
< Content-Length: 199706
< Connection: Keep-Alive
< X-Cache-Info: cached

If I can replicate this, we stand a chance of maybe putting in place some sort of rule that corrects the errant behavior.
I'm not sure how realistic this is, but maybe. More likely, if we can provide a reproducible test case, we can pass it along to the load balancer vendor (Riverbed) and perhaps get a patch.
Flags: needinfo?(nmaul) → needinfo?(mcmanus)
Jake, maybe it's a single server only? You sent your request through the load balancer. In the failing case the response came from ftp1.dmz.scl3.mozilla.com, whereas your curl request got a response from ftp8.dmz.scl3.mozilla.com.
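Since the load balancer fronts several backends and the failure is intermittent, pinning it down may take many probes with failures attributed per backend. A small sketch (hypothetical helper) of tallying which X-Backend-Server values returned a chunk at the wrong offset:

```python
from collections import Counter

def tally_bad_backends(observations, requested_start):
    """observations: list of (backend_server, content_range_start) tuples,
    one per probe, where content_range_start is the first byte of the
    returned range. Returns a Counter of backends whose chunk started
    at the wrong offset."""
    bad = Counter()
    for backend, start in observations:
        if start != requested_start:
            bad[backend] += 1
    return bad

# The two exchanges seen in this bug (one good response, one bad):
probes = [
    ("ftp8.dmz.scl3.mozilla.com", 53100000),  # correct range (the curl run)
    ("ftp1.dmz.scl3.mozilla.com", 40500000),  # wrong range, as in the HTTP log
]
print(tally_bad_backends(probes, 53100000))  # only ftp1 is flagged
```

Run in a loop with `curl -sI` against the balancer, something like this would show quickly whether the bad ranges all come from one backend or from the balancer's cache itself.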
I don't know how to make the server do the thing the log says it did :)
Flags: needinfo?(mcmanus)
That last comment came off a little weird. Sorry about that. I just mean that I don't know what triggers the bug either; there are lots of cases of the server doing the right thing in the log too. Perhaps some kind of cache corruption that has long since cleared up.
Whiteboard: [mozmill] → [mozmill][necko-backlog]
Priority: -- → P1
Priority: P1 → P3
QA Whiteboard: qa-not-actionable

In the process of migrating remaining bugs to the new severity system, the severity for this bug cannot be automatically determined. Please retriage this bug using the new severity system.

Severity: major → --

I assume this works now.
If anyone is still seeing these kinds of errors, please reopen the bug.

Status: NEW → RESOLVED
Closed: 2 years ago
Resolution: --- → WORKSFORME
