Closed Bug 1073459 Opened 10 years ago Closed 2 years ago

Updates failing for Firefox because downloaded patch is larger than the one referenced in the update snippet

Categories

(Core :: Networking, defect, P3)

35 Branch
defect


RESOLVED WORKSFORME

People

(Reporter: whimboo, Unassigned)

References

Details

(Whiteboard: [mozmill][necko-backlog])

Attachments

(1 file)

+++ This bug was initially created as a clone of Bug #816472 +++

We already saw this behavior in bug 816472 a while ago, and Patrick pushed a fix for Firefox 27. Sadly something seems to have regressed here, or I'm hitting another problem.

What I'm testing is updating the 2014-09-24-03-02-04-mozilla-central-firefox-35.0a1.en-US.linux-x86_64.tar.bz2 build to yesterday's build. The direct update via the partial mar file works fine, while the fallback update using the full mar file fails.

Update snippet used: https://aus4.mozilla.org/update/3/Firefox/35.0a1/20140924030204/Linux_x86_64-gcc3/en-US/nightly/Linux%203.13.0-35-generic%20(GTK%202.24.23)/default/default/update.xml?force=1

<updates>
  <update type="minor" displayVersion="35.0a1" appVersion="35.0a1" platformVersion="35.0a1" buildID="20140925030203">
    <patch type="complete" URL="http://ftp.mozilla.org/pub/mozilla.org/firefox/nightly/2014/09/2014-09-25-03-02-03-mozilla-central/firefox-35.0a1.en-US.linux-x86_64.complete.mar" hashFunction="sha512" hashValue="5e25d232c442e70b92c9a4f5ac65c79d187ff363ce8f5269ddd3ceeb6cea5e9de18df74977bfe2a6f9f1fcb345172eb00da4140a76eb72dab36fb9044b2e140f" size="53299706"/>
    <patch type="partial" URL="http://ftp.mozilla.org/pub/mozilla.org/firefox/nightly/2014/09/2014-09-25-03-02-03-mozilla-central/firefox-35.0a1.en-US.linux-x86_64.partial.20140924030204-20140925030203.mar" hashFunction="sha512" hashValue="1f28c66ee80cdda7b430df0bd696d750319de0f662b4e788ce91836c509fa2cba77d617ac3f016ac8bd48d7322d6ab58a2f04a010d9546860ef3f119c5136ca0" size="9606465"/>
  </update>
</updates>
The problem will likely be in networking so network logs will probably be needed.
I re-ran this update a couple of times and can reproduce it roughly once every 3-4 attempts. I finally got an HTTP log for such a failed update: https://www.dropbox.com/s/ef6ipbyws8rpckw/log.txt?dl=0 Patrick, would you mind having a look at this?
Flags: needinfo?(mcmanus)
Please disregard the above log file. I forgot to specify 'append' in the HTTP_LOG env variable, so what's in there covers only the latest restart; the failure happened earlier. I will try to get the right log.
Interestingly, a new version of Firefox Nightly came out while I was trying to get the log for this problem. Sadly I was no longer able to reproduce it with the 09/26 target build. Thankfully I had saved the above update snippet, so I was able to override the update URL and still serve the old snippet content. So here is the HTTP log: https://www.dropbox.com/s/h62wizk7czvzx4m/log-larger.zip?dl=0
This was pretty involved and took me a while to get to it. Sorry about that. Here's what the log shows (analysis in next comment):

51.804: http request [
51.804: calling PR_Poll [active=4 idle=0]
51.804: GET /pub/mozilla.org/firefox/nightly/2014/09/2014-09-25-03-02-03-mozilla-central/firefox-35.0a1.en-US.linux-x86_64.complete.mar HTTP/1.1
51.804: Host: ftp.mozilla.org
51.804: User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:35.0) Gecko/20100101 Firefox/35.0
51.804: Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
51.804: Accept-Language: en-US,en;q=0.5
51.804: Range: bytes=53100000-
51.804: If-Range: "4874170-32d49fa-503e245bea46f"
51.804: Cache-Control: no-cache
51.804: Pragma: no-cache
51.804: Cookie: optimizelySegments=%7B%22245875585%22%3A%22direct%22%2C%22245617832%22%3A%22none%22%2C%22246048108%22%3A%22false%22%2C%22245677587%22%3A%22ff%22%2C%22869421433%22%3A%22true%22%7D; optimizelyEndUserId=oeu1411735435041r0.33049987065833963; optimizelyBuckets=%7B%7D; __utma=150903082.178961275.1411735435.1411735435.1411735435.1; __utmb=150903082.3.10.1411735435; __utmz=150903082.1411735435.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none); __utmc=150903082
51.804: Connection: keep-alive
51.804: ]

and then

51.992: http response [
51.992: HTTP/1.1 206 Partial Content
51.992: Server: Apache
51.992: X-Backend-Server: ftp1.dmz.scl3.mozilla.com
51.992: Cache-Control: max-age=3600
51.992: Content-Type: application/octet-stream
51.992: Date: Fri, 26 Sep 2014 12:41:26 GMT
51.992: Keep-Alive: timeout=5, max=1000
51.992: Expires: Fri, 26 Sep 2014 13:41:26 GMT
51.992: Accept-Ranges: bytes
51.992: Access-Control-Allow-Origin: *
51.992: Content-Range: bytes 40500000-40799999/53299706
51.992: Etag: "4874170-32d49fa-503e245bea46f"
51.992: Last-Modified: Thu, 25 Sep 2014 11:40:51 GMT
51.992: Content-Length: 300000
51.992: Connection: Keep-Alive
51.992: X-Cache-Info: cached
51.992: ]
Flags: needinfo?(mcmanus)
This is broken cache behavior.

The request:
51.804: Range: bytes=53100000-

The response:
51.992: Content-Range: bytes 40500000-40799999/53299706

In short: we're asking for bytes 53100000 to EOF, and the server is responding with 40500000-40799999. At least the ETag and If-Range match, so they're talking about the same representation.

And then we get:
52.752846: nsHttpChannel::Cancel [this=7f151f5dc800 status=8000ffff] NS_ERROR_UNEXPECTED

So the incremental downloader, afaict, doesn't even notice that it's the wrong range; it just assumes this is the right thing and appends it to the partially downloaded mar file. What I think it does notice is that the content length of 300000, plus the 53100000 it already had, is bigger than the expected 53299706. That's where it throws the error: https://mxr.mozilla.org/mozilla-central/source/netwerk/base/src/nsIncrementalDownload.cpp#702
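The mismatch Patrick describes is mechanical to detect. A minimal sketch (not the actual necko code; the function name and shape are hypothetical) of validating a 206 response's Content-Range against the resume offset the client asked for:

```python
import re

def content_range_matches(requested_start, content_range_header, expected_total=None):
    """Return True if a 206 Content-Range header starts at the byte
    offset we requested (and, optionally, reports the expected total size)."""
    m = re.match(r"bytes (\d+)-(\d+)/(\d+)", content_range_header)
    if not m:
        return False
    start, end, total = (int(g) for g in m.groups())
    if start != requested_start:
        return False  # server ignored or mangled our Range header
    if expected_total is not None and total != expected_total:
        return False
    return end >= start

# The failing exchange from the log: we asked for bytes=53100000-,
# the server answered with a chunk from the middle of the file.
print(content_range_matches(53100000, "bytes 40500000-40799999/53299706"))  # False
print(content_range_matches(53100000, "bytes 53100000-53299705/53299706"))  # True
```

With a check like this the downloader would reject the bad chunk at 51.992 immediately instead of appending it and only failing later on the length overflow.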
What can we do?

Option a - fix the broken thing. In this case that's the server, which is responding incorrectly to the range request. Fixing broken things is always a good plan.

Option b - stop using the incremental downloader and use an nsIChannel directly. If you're concerned about rate limiting, that's pretty easy to handle with suspend and resume at a gross scale (which ought to work here). However, if you're concerned about connections being stable enough to download the whole mar reliably in one shot, then it's not a good idea. And I could understand if you feel that way.

Option c - change the incremental downloader to notice when it's getting the wrong chunk. This will catch other errors too. Also change it so that it retries without a range request and overwrites the existing partial mar with the full 200 response.

We can do c, but whoever owns a really should do that too - it's just begging for corruption. And I'm not going to bother with c if b is in the plan (which I sometimes hear). Let me know.
> Option c - change the incremental downloader to notice when it's getting
> the wrong chunk. This will catch other errors too.

And there are other errors in this same transfer that we don't seem to notice (presumably the mar processor would find them based on a checksum if the transfer didn't die on the last chunk). Fix that server - that will obviously cure the problem in deployed clients.

Request:
45:48.305883 Range: bytes=51900000-
45:48.305886 If-Range: "4874170-32d49fa-503e245bea46f"

Response:
45:48.494333 HTTP/1.1 206 Partial Content
45:48.494377 Content-Range: bytes 40500000-40799999/53299706
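Option c boils down to a small decision step when the response headers arrive. A hedged sketch of that logic (hypothetical names, not the nsIncrementalDownload implementation):

```python
def plan_next_step(status, resume_offset, content_range_start):
    """Decide what an incremental downloader should do with a response.

    status: HTTP status code of the response
    resume_offset: byte offset where the partial file currently ends
    content_range_start: first byte of the returned range, or None if
                         no Content-Range header was present
    """
    if status == 200:
        # Server ignored the Range header entirely: discard the partial
        # file and take the full body.
        return "restart-with-full-body"
    if status == 206:
        if content_range_start == resume_offset:
            return "append"  # the chunk lines up; keep going
        # Wrong chunk (the bug in this report): do NOT append. Retry
        # without a Range header and overwrite the partial mar.
        return "retry-without-range"
    return "error"

print(plan_next_step(206, 53100000, 40500000))  # retry-without-range
print(plan_next_step(206, 53100000, 53100000))  # append
```

The "retry-without-range" branch is the overwrite-with-a-full-200 fallback described above; everything else is today's happy path.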
I am leaning toward b for the client, though a should also be fixed. Patrick, when you say "suspend and resume at a gross scale", could you provide additional detail for someone not familiar with it, or a DXR link to an example usage of suspend and resume? Thanks! Jake, can you get the server fixed?
Flags: needinfo?(nmaul)
Flags: needinfo?(mcmanus)
(In reply to Robert Strong [:rstrong] (use needinfo to contact me) from comment #10)
> Patrick, when you say "suspend and resume at a gross scale" could you
> provide additional detail for someone not familiar with it or a dxr link to
> an example usage of suspend and resume? Thanks!

https://mxr.mozilla.org/mozilla-central/source/netwerk/base/public/nsIRequest.idl#82

An nsIChannel is a subclass of nsIRequest. I think the possible concern here would be that in some of our markets the connectivity wouldn't be robust enough to get the complete download in one transfer. But maybe that's unnecessary worrying.
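For readers unfamiliar with "suspend and resume at a gross scale": the idea is coarse-grained throttling, pausing the whole channel whenever the observed transfer rate exceeds a budget and resuming after a short delay, rather than shaping individual packets. A rough illustrative sketch in plain Python (hypothetical helper; the real mechanism is nsIRequest::Suspend/Resume on the channel):

```python
def over_rate_budget(bytes_received, elapsed_seconds, max_bytes_per_second):
    """True if the transfer so far exceeds the coarse rate budget,
    i.e. the listener should call Suspend() and schedule a Resume()."""
    if elapsed_seconds <= 0:
        return False
    return bytes_received / elapsed_seconds > max_bytes_per_second

# In OnDataAvailable one would do roughly:
#   if over_rate_budget(total_bytes, now - start_time, limit):
#       request.Suspend()               # nsIRequest::Suspend
#       schedule(request.Resume, delay) # resume after a pause
print(over_rate_budget(10_000_000, 5.0, 1_000_000))  # True: 2 MB/s > 1 MB/s limit
```

Because the check runs only once per data chunk and the pause is long relative to a packet, the control is "gross scale": average rate is bounded, instantaneous rate is not.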
Flags: needinfo?(mcmanus)
That and we also support exiting the application during a download and resuming it on next app launch.
I can't replicate this by hand. What am I missing?

$ curl -v 'http://ftp.mozilla.org/pub/mozilla.org/firefox/nightly/2014/09/2014-09-25-03-02-03-mozilla-central/firefox-35.0a1.en-US.linux-x86_64.complete.mar' -o /dev/null \
    -H 'Range: bytes=53100000-' \
    -H 'If-Range: "4874170-32d49fa-503e245bea46f"' \
    -H 'Cache-Control: no-cache' \
    -H 'Pragma: no-cache' \
    -H 'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8' \
    -H 'Accept-Language: en-US,en;q=0.5' \
    -H 'User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:35.0) Gecko/20100101 Firefox/35.0' \
    -H 'Connection: keep-alive'
* About to connect() to ftp.mozilla.org port 80 (#0)
*   Trying 63.245.215.56... connected
* Connected to ftp.mozilla.org (63.245.215.56) port 80 (#0)
> GET /pub/mozilla.org/firefox/nightly/2014/09/2014-09-25-03-02-03-mozilla-central/firefox-35.0a1.en-US.linux-x86_64.complete.mar HTTP/1.1
> Host: ftp.mozilla.org
> Range: bytes=53100000-
> If-Range: "4874170-32d49fa-503e245bea46f"
> Cache-Control: no-cache
> Pragma: no-cache
> Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
> Accept-Language: en-US,en;q=0.5
> User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:35.0) Gecko/20100101 Firefox/35.0
> Connection: keep-alive
>
< HTTP/1.1 206 Partial Content
< Server: Apache
< X-Backend-Server: ftp8.dmz.scl3.mozilla.com
< Cache-Control: max-age=3600
< Content-Type: application/octet-stream
< Date: Mon, 20 Oct 2014 23:45:17 GMT
< Expires: Tue, 21 Oct 2014 00:45:17 GMT
< Accept-Ranges: bytes
< Access-Control-Allow-Origin: *
< Content-Range: bytes 53100000-53299705/53299706
< ETag: "4874170-32d49fa-503e245bea46f"
< Last-Modified: Thu, 25 Sep 2014 11:40:51 GMT
< Content-Length: 199706
< Connection: Keep-Alive
< X-Cache-Info: cached

If I can replicate this, we stand a chance of maybe putting in place some sort of rule that corrects the errant behavior.
I'm not sure how realistic this is, but maybe. More likely, if we can provide a reproducible test case, we can pass it along to the load balancer vendor (Riverbed) and perhaps get a patch.
Flags: needinfo?(nmaul) → needinfo?(mcmanus)
Jake, maybe it's a single server only? You sent your request through the load balancer. In the failing case the response came from ftp1.dmz.scl3.mozilla.com, whereas your curl request got a response from ftp8.dmz.scl3.mozilla.com.
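Since the load balancer fronts several backends and the failure is intermittent, pinning it down may take many probes with failures attributed per backend. A small sketch (hypothetical helper) of tallying which X-Backend-Server values returned a chunk at the wrong offset:

```python
from collections import Counter

def tally_bad_backends(observations, requested_start):
    """observations: list of (backend_server, content_range_start) tuples,
    one per probe, where content_range_start is the first byte of the
    returned range. Returns a Counter of backends whose chunk started
    at the wrong offset."""
    bad = Counter()
    for backend, start in observations:
        if start != requested_start:
            bad[backend] += 1
    return bad

# The two exchanges seen in this bug (one good response, one bad):
probes = [
    ("ftp8.dmz.scl3.mozilla.com", 53100000),  # correct range (the curl run)
    ("ftp1.dmz.scl3.mozilla.com", 40500000),  # wrong range, as in the HTTP log
]
print(tally_bad_backends(probes, 53100000))  # only ftp1 is flagged
```

Run in a loop with `curl -sI` against the balancer, something like this would show quickly whether the bad ranges all come from one backend or from the balancer's cache itself.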
I don't know how to make the server do the thing the log says it did :)
Flags: needinfo?(mcmanus)
That last comment came off a little weird. Sorry about that. I just mean that I don't know what triggers the bug either; there are lots of cases of the server doing the right thing in the log too. Perhaps some kind of cache corruption that has long since cleared up.
Whiteboard: [mozmill] → [mozmill][necko-backlog]
Priority: -- → P1
Priority: P1 → P3
QA Whiteboard: qa-not-actionable

In the process of migrating remaining bugs to the new severity system, the severity for this bug cannot be automatically determined. Please retriage this bug using the new severity system.

Severity: major → --

I assume this works now.
If anyone is still seeing these kinds of errors, please reopen the bug.

Status: NEW → RESOLVED
Closed: 2 years ago
Resolution: --- → WORKSFORME
