Bug 352853 - 64k update chunks are maxing out mirrors connection limits before they use up their bandwidth
Status: RESOLVED FIXED
Keywords: fixed1.8.0.8, fixed1.8.1
Product: Toolkit
Classification: Components
Component: Application Update
Version: 1.8.0 Branch
Platform: All All
Importance: -- normal
Assigned To: Darin Fisher
http://bonsai.mozilla.org/cvsblame.cg...
Reported: 2006-09-15 11:45 PDT by Dave Miller [:justdave] (justdave@bugzilla.org)
Modified: 2009-07-14 13:50 PDT
CC: 23 users
mconnor: blocking‑firefox2+
mconnor: blocking1.8.0.8+


Attachments
v1 patch (1.29 KB, patch)
2006-09-17 18:49 PDT, Darin Fisher
robert.strong.bugs: review+
v1.1 patch - 300k chunks (657 bytes, patch)
2006-09-18 10:46 PDT, Darin Fisher
dveditz: approval1.8.0.8+
mtschrep: approval1.8.1+
patch for the MOZILLA_1_8_0_BRANCH (1.16 KB, patch)
2006-10-24 22:07 PDT, (not reading, please use seth@sspitzer.org instead)
no flags

Description Dave Miller [:justdave] (justdave@bugzilla.org) 2006-09-15 11:45:40 PDT
We had a complaint from one of the mirrors today about the 64k chunksize used by the update service.  Seems the quantity of people updating nowadays means the number of people hitting simultaneously to grab these little tiny chunks is growing fast, and because of the cost of setup/teardown on each connection, all of the available connection slots on the server are getting used up a long time before they max out their bandwidth.

From IRC:

10:12:09 < maswan> justdave: anyway, can you please pass on that 1-meg chunks might be fine, but 64k really is killing us and not sustainable for larger releases. If this goes on we'll probably have to start aiming lower for the ACC mirror, which would be sad given that we have plenty of bandwidth to spare.
Comment 1 Justin Fitzhugh 2006-09-15 11:47:28 PDT
I got the same request from multiple other mirrors so I second this...
Comment 2 Myk Melez [:myk] [@mykmelez] 2006-09-15 13:04:04 PDT
The fix itself would be trivial, but what's the risk?  Seems like something worth considering for Fx2.
Comment 3 Michael Marineau 2006-09-15 13:49:22 PDT
I don't see what the risk could be, but then I don't understand why small chunks are used in the first place. Anyone care to fill me in? I'm curious.
Comment 4 Darin Fisher 2006-09-15 14:03:19 PDT
The risk is the impact this may have on dial-up users.  If their modem connection is saturated with a 1-meg download, then they will be very unhappy users.  64k was as large as I thought we could go without impairing dial-up users significantly.  The downloader (nsIncrementalDownload.cpp) could really benefit from some adaptive logic that figures out the available bandwidth and adjusts accordingly.
Comment 5 Darin Fisher 2006-09-15 14:06:45 PDT
Actually, perhaps an easy way to solve this problem would be for the downloader to increase its fetch size but to read that at a slowed rate.  For example, after each 64k chunk, the downloader could suspend the network channel for a couple seconds to relieve pressure on the user's network connection.  This would cause some extra ACKs across the network as the two TCP ends recognize the slowdown, which might also suck.  Other ideas?
Comment 6 Mattias Wadenstein 2006-09-15 16:18:26 PDT
(I'm maswan in the IRC paste)

Well, instead of hanging up after 64k, how about sleeping half a second every second or some similar approach? Or for that matter, if you want to do it by bytes, read N k from the net, then sleep for as long as that took to read?

If you only do 64k before hanging up, you'll never get good rates for those of us on good bandwidth either, considering that it takes much more to get a decent-sized tcp window.

We were quite prepared to handle 1+ Gbit/s burst in traffic, but 1k+ requests per second was quite the surprise, and not what our mirror was tuned for.
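A rough sketch of the duty-cycle idea above (read a block, then sleep for as long as the read took); all names here are hypothetical stand-ins, not the actual nsIncrementalDownload code:

```javascript
// Hypothetical sketch of a ~50% duty-cycle throttle, per the suggestion
// above: after each read, sleep for as long as the read took, roughly
// halving average bandwidth use while keeping one long-lived connection.
// readChunk and sleep are stand-ins, not real downloader APIs.
async function throttledDownload(readChunk, sleep, totalBytes, blockBytes) {
  let downloaded = 0;
  while (downloaded < totalBytes) {
    const start = Date.now();
    downloaded += await readChunk(Math.min(blockBytes, totalBytes - downloaded));
    const elapsed = Date.now() - start;
    await sleep(elapsed); // idle for as long as the read took
  }
  return downloaded;
}
```

With a ~10 s read time per 64 KB block on a modem, this would keep the user's link free about half the time while still presenting one persistent HTTP connection to the mirror.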
Comment 7 Tomas Ögren 2006-09-15 18:33:07 PDT
(In reply to comment #6)
> We were quite prepared to handle 1+ Gbit/s burst in traffic, but 1k+ requests
> per second was quite the surprise, and not what our mirror was tuned for.

And we are getting log management problems due to a gazillion 64k requests (like 30 million or so in a day).
Comment 8 Darin Fisher 2006-09-17 18:40:16 PDT
So, currently FF fetches a 64k chunk every minute.  Over a 50 kbps modem, that takes about 10 seconds to download.  A typical partial update is around 500k, which takes only 8 chunks to download (or ~8 minutes).

Ignoring the logs issue for a moment, have you guys tried increasing the keep-alive interval of your HTTP connections to over a minute?  That would eliminate the setup and teardown cost associated with the TCP connections.

We can also very easily tune FF to fetch the 64k chunk less frequently.  If we aim for having partial updates delivered within the span of an hour, then we could wait about 7-8 minutes between chunks.

We can and should also make the downloader use a more sophisticated method of minimizing bandwidth utilization, so that we can avoid 64k chunks in the first place.  However, that will entail more risk.
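Spelling out the arithmetic behind those figures (a quick back-of-the-envelope sketch, not code from the tree):

```javascript
// Back-of-the-envelope check of the numbers above: a 64 KB chunk over a
// 50 kbps modem, and a ~500 KB partial update at one chunk per minute.
const CHUNK_BYTES = 64 * 1024;     // 65536
const MODEM_BPS = 50000;           // 50 kbps modem
const PARTIAL_BYTES = 500 * 1024;  // typical partial update, ~500 KB

const secondsPerChunk = (CHUNK_BYTES * 8) / MODEM_BPS;  // ~10.5 s per chunk
const chunks = Math.ceil(PARTIAL_BYTES / CHUNK_BYTES);  // 8 chunks
const minutesTotal = chunks;                            // one chunk/minute -> ~8 min
```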
Comment 9 Darin Fisher 2006-09-17 18:49:54 PDT
Created attachment 238950 [details] [diff] [review]
v1 patch

OK, this patch bumps the chunk size up to 100000 bytes and the interval up to 10 minutes.  This is hopefully a decent short-term solution.
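The v1 patch is essentially a constants change; a hypothetical sketch of the new tuning values (the identifier names follow DOWNLOAD_CHUNK_SIZE and the interval constants quoted later in comment 29; the actual diff is attachment 238950):

```javascript
// Hypothetical sketch of the v1 tuning in nsUpdateService.js.in, not the
// actual diff: chunk size 64 KB -> 100000 bytes, background fetch
// interval 1 minute -> 10 minutes.
const DOWNLOAD_CHUNK_SIZE = 100000;        // bytes per fetch; was 65536
const DOWNLOAD_BACKGROUND_INTERVAL = 600;  // seconds between chunks; was 60
```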
Comment 10 Mike Schroepfer 2006-09-17 19:56:27 PDT
Darin - why 100k?   Why not 256K or bigger? On Windows, update sizes and the resulting request counts per chunk size have been as follows (update: size, then requests at each chunk size):

          size    65K   100K   200K   300K
0.1:      751K     12      8      4      3
0.2:      557K      9      6      3      2
0.3:      232K      4      3      2      1
0.4:      511K      8      6      3      2
0.5:      565K      9      6      3      2
0.6:       49K      1      1      1      1
0.7:      531K      9      6      3      2

Do we deal gracefully if the partial chunk dl is aborted?   What's the advantage to keeping it small?
Comment 11 Mattias Wadenstein 2006-09-18 03:43:34 PDT
(In reply to comment #8)

> Ignoring the logs issue for a moment, have you guys tried increasing the
> keep-alive interval of your HTTP connections to over a minute?  That would
> eliminate the setup and teardown cost associated with the TCP connections.

We actually dropped max keep-alive from 5 seconds to 1 second, since we only have a finite number of threads. Over a minute would require something like 100k MaxClients (~1000-1200 requests/s), and I'm not sure either our machines or Apache would be happy with 100k threads, even if most of them are just idling in keep-alive.

> We can also very easily tune FF to fetch the 64k chunk less frequently.  If we
> aim for having partial updates delivered within the span of an hour, then we
> could wait about 7-8 minutes between chunks.

Does this really matter? In total the same number of chunks are going to be fetched; it will just be more spread out over time per client (but likely, more different clients will be fetching at any given moment).

/Mattias Wadenstein
Comment 12 Darin Fisher 2006-09-18 09:00:16 PDT
> Darin - why 100k?   Why not 256K or bigger?

100k takes about 15 seconds to download over a 50 kbps modem.  During that time the user's network is saturated with this download, and they will be unable to use their network connection for much of anything else.  Their OS will ensure that other TCP connections are not entirely blocked out, but it will certainly seem like their network connection is suddenly much slower for that interval of time.  So, I think we should strive to keep that interval small-ish.


> Do we deal gracefully if the partial chunk dl is aborted?

Yes, the downloader is designed for that.  It'll pick up where it left off the next time the browser is started.


> Does this really matter? In total the same number of chunks are going to be
> fetched, it will just be more spread out over time per client (but likely,
> more different clients will be fetching at every moment).

I was assuming that a big part of the problem was the swell of downloads after a release, so if we spread that swell out a little more, it would help.  Instead of a big spike in connections, with this change you'd see a softer spike in connections, no?
Comment 13 Myk Melez [:myk] [@mykmelez] 2006-09-18 10:24:21 PDT
> I was assuming that a big part of the problem was the swell of downloads after
> a release, so if we spread that swell out a little more, it would help. 
> Instead of a big spike in connections, with this change you'd see a softer
> spike in connections, no?

I think the primary problem is not the swell in downloads (as Mattias says in comment 6, they were "quite prepared to handle 1+ Gbit/s burst in traffic") but rather inefficient utilization of capacity.  Currently, a significant portion of mirror bandwidth isn't being used because the servers hit their connection limits before they hit their bandwidth capacity.

If you spread out the swell by pausing longer between requests, so the update gets downloaded more slowly, then more users will be able to start downloading the software sooner, but you still won't take any better advantage of the mirrors' additional bandwidth capacity.

The secondary problem is that many small requests create overlarge logs, increasing the log management burden for mirrors.  This problem also won't be mitigated by spreading out downloads, since the additional users who become able to connect will keep the servers at their connection limits and thus keep generating the same number of log entries.
Comment 14 Mattias Wadenstein 2006-09-18 10:26:06 PDT
> I was assuming that a big part of the problem was the swell of downloads after
> a release, so if we spread that swell out a little more, it would help. 
> Instead of a big spike in connections, with this change you'd see a softer
> spike in connections, no?

Spreading out the spike by a few hours won't make much of a difference. We're still seeing >500 requests per second today, several days after the release. And while this is less than half the peak, it is still way more than we find reasonable.

    http://www.acc.umu.se/technical/statistics/ftp/monitordata/index.html.en

For reference, saimei on that graph does a third of the bandwidth of the other two hosts. It is not listed in mozilla bouncer and just handles the other things we mirror. It does 4 requests per second right now.

Anyway, I'd really appreciate it if you could do some different kind of throttling than splitting it up into tiny chunks. In the long run, the kind of manual intervention that we're doing to keep up with this load is not really sustainable, and it might make mirrors reconsider their commitment to mozilla mirroring.

It would also be rather sad to remove all the request logging for mozilla downloads; both we and some mozilla people seem to want to have that around.
Comment 15 Mike Schroepfer 2006-09-18 10:33:36 PDT
Darin,

My concern is that 100K doesn't do enough to substantially reduce the number of req/s for our mirrors.  At 300K we'd hold the line for 45s on dial-up.  It's not like it is totally unusable - just slow.  Combine that with 74% US broadband penetration (90% at work, and the US is 20th in the world in BB penetration) - and the fact that 300k chunks will, on average, reduce the number of requests by 4x - and we think we should use that number.  Given we are holding the line for 45s, the 10 minute interval also seems reasonable.  Make sense?

We should open a second bug to implement a better form of bandwidth throttling for future releases.   
Comment 16 Justin Dolske [:Dolske] 2006-09-18 10:33:58 PDT
Were there signs of this problem when the last update was released (which I wasn't around for), or on other mirrors? That was just two months ago, so one would think the traffic profile for this update would be very similar. Maybe we just reached a tipping point, but I'd want to rule out any kind of regression.

That's not to say we shouldn't try to improve the download behavior, though. I wonder if we could try to guesstimate the capability of the user's network connection by gathering some metrics at update time... For example: the time to establish the TCP connection, any delay before server responds to request, the download speed of the first chunk, etc. [The server's capabilities would affect these numbers, but that's not a bad thing since we want to be nice to both ends.] The updater could then decide if it should just go ahead and snarf the rest of the update immediately, or if it should backoff by trying again later.
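That guesstimate heuristic could look something like the following; every threshold and name here is invented for illustration, not a proposed API:

```javascript
// Hypothetical bandwidth-guessing heuristic per the comment above: time
// the TCP connect, the server's first-byte delay, and the first chunk's
// throughput, then either fetch the rest immediately or back off.
// All thresholds are invented for illustration.
function decideDownloadMode({ connectMs, firstByteMs, firstChunkBytes, firstChunkMs }) {
  const bytesPerSec = (firstChunkBytes / firstChunkMs) * 1000;
  const fastLink = bytesPerSec > 100000;                  // > ~100 KB/s: likely broadband
  const responsiveServer = connectMs < 500 && firstByteMs < 1000;
  return fastLink && responsiveServer ? "snarf-now" : "backoff";
}
```

A fast link and a responsive server would let the updater snarf the rest of the update immediately; anything else falls back to the slow chunked schedule, which is nice to both ends as the comment suggests.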
Comment 17 Justin Dolske [:Dolske] 2006-09-18 10:35:58 PDT
(oops, dunno how the "mark FIXED" field got checked. sorry!)
Comment 18 Darin Fisher 2006-09-18 10:46:00 PDT
Created attachment 239048 [details] [diff] [review]
v1.1 patch - 300k chunks

OK, per discussion at the bonecho triage meeting this morning, we're going to go with a 300k chunk size and an interval of 10 minutes between chunks.  (Sucks to be a modem user.)

I'm also going to pursue teaching the downloader to throttle downloads itself.
Comment 19 Mike Schroepfer 2006-09-18 10:48:44 PDT
Comment on attachment 239048 [details] [diff] [review]
v1.1 patch - 300k chunks

a=schrep - thanks Darin!
Comment 20 Darin Fisher 2006-09-18 10:49:24 PDT
I filed bug 353182 for making nsIncrementalDownload smarter.
Comment 21 Darin Fisher 2006-09-18 10:51:25 PDT
fixed-on-trunk, fixed1.8.1
Comment 22 Christian :Biesinger (don't email me, ping me on IRC) 2006-09-19 10:26:12 PDT
(In reply to comment #15)
> Combine that with 74% US
> broadband penetration (90% at work, and US is 20th in the world in BB
> penetration)

This product isn't a US-only product...
Comment 23 Christian :Biesinger (don't email me, ping me on IRC) 2006-09-19 10:27:20 PDT
(I'm not saying that I disagree with that change, I'm just saying that just considering US statistics is perhaps not such a good idea)
Comment 24 Myk Melez [:myk] [@mykmelez] 2006-09-19 11:42:35 PDT
> > Combine that with 74% US
> > broadband penetration (90% at work, and US is 20th in the world in BB
> > penetration)
> 
> This product isn't a US-only product...
> ...
> (I'm not saying that I disagree with that change, I'm just saying that just
> considering US statistics is perhaps not such a good idea)

We weren't considering only US statistics.  As schrep pointed out in his comment, the US is 20th in the world in broadband penetration, which means many other countries have an even higher percentage of Internet users on broadband, so the implemented fix is even better for users in those countries.
Comment 25 Christian :Biesinger (don't email me, ping me on IRC) 2006-09-23 21:29:22 PDT
Ah, ok. sorry, I didn't notice that part of the comment.
Comment 26 Mike Connor [:mconnor] 2006-10-24 21:53:32 PDT
We need this for 1.5.0.8 or the major update to 2.0 will kill mirrors.
Comment 27 (not reading, please use seth@sspitzer.org instead) 2006-10-24 22:07:56 PDT
Created attachment 243439 [details] [diff] [review]
patch for the MOZILLA_1_8_0_BRANCH
Comment 28 (not reading, please use seth@sspitzer.org instead) 2006-10-25 10:08:57 PDT
I have tested this patch on my 1508 tree, and it does what we expect:

From my javascript console:

before:

onProgress: http://www.sspitzer.org/darin/update-test-3/1.mar, 65536/7876331

after:

onProgress: http://www.sspitzer.org/darin/update-test-3/1.mar, 300000/7876331

waiting for approval before I land

Comment 29 (not reading, please use seth@sspitzer.org instead) 2006-10-25 10:21:57 PDT
To answer a question Dan raised in the 150x meeting:

See http://lxr.mozilla.org/mozilla1.8.0/source/toolkit/mozapps/update/src/nsUpdateService.js.in#2212

2212     var interval = this.background ? DOWNLOAD_BACKGROUND_INTERVAL
2213                                    : DOWNLOAD_FOREGROUND_INTERVAL;
2214     this._request.init(uri, patchFile, DOWNLOAD_CHUNK_SIZE, interval);

all that differs is the interval.
Comment 30 Daniel Veditz [:dveditz] 2006-10-25 10:39:58 PDT
Comment on attachment 239048 [details] [diff] [review]
v1.1 patch - 300k chunks

approved for 1.8.0 branch, a=dveditz for drivers
Comment 31 (not reading, please use seth@sspitzer.org instead) 2006-10-25 10:42:37 PDT
fix landed on the MOZILLA_1_8_0_BRANCH
Comment 32 (not reading, please use seth@sspitzer.org instead) 2006-10-25 16:30:27 PDT
added a litmus test case for this, see http://litmus.mozilla.org/show_test.cgi?id=2711
Comment 33 Dave Liebreich [:davel] 2006-10-26 10:53:10 PDT
Behaviour has changed from 1507 to 1508rc2.

I set prefs to check for update soon after app start, and download in the background.  

1507 downloads one 64K chunk each minute; 1508rc2 downloads one 300000-byte chunk each 10 minutes.  This is as expected.


I selected Help->Downloading ... Update from the menu.  When the UI appeared, the download went to line speed for both 1507 and 1508rc2.

I clicked on the "Hide" button.  After the dialog was dismissed, 1507 went back to the 1 64K chunk/min rate; 1508rc3 remained at line speed.

I measured the rate of the download by inspecting updates/0/update.mar at regular intervals (read: |ls -l| over and over)

Pref changes as follows:

app.update.interval: 60
app.update.timer: 600
app.update.url.override: file:///Users/davel/update.xml


update.xml:

<?xml version="1.0"?>
<updates>
    <update type="minor" version="2.0mt2" extensionVersion="2.0" buildID="2006092114" >
        <patch type="complete" URL="http://stage.mozilla.org/pub/mozilla.org/firefox/nightly/1.5.0.8-candidates/rc2/firefox-1.5.0.8.en-US.mac.dmg" hashFunction="SHA1" hashValue="10c287df3d0479b0579a61531498addc4c325746" size="16507599"/>
    </update>
</updates>

Comment 34 (not reading, please use seth@sspitzer.org instead) 2006-10-26 12:04:30 PDT
Dave, here's the answer:

In my backport from the MOZILLA_1_8_BRANCH to the MOZILLA_1_8_0_BRANCH I included jwalden's fix for https://bugzilla.mozilla.org/show_bug.cgi?id=304381

See http://lxr.mozilla.org/mozilla1.8.0/source/toolkit/mozapps/update/content/updates.js#1359
