Bug 352853 - 64k update chunks are maxing out mirrors connection limits before they use up their bandwidth
Status: RESOLVED FIXED
Keywords: fixed1.8.0.8, fixed1.8.1
Product: Toolkit
Classification: Components
Component: Application Update
Version: 1.8.0 Branch
Platform: All All
Importance: -- normal
Assigned To: Darin Fisher
http://bonsai.mozilla.org/cvsblame.cg...
Reported: 2006-09-15 11:45 PDT by Dave Miller [:justdave] (justdave@bugzilla.org)
Modified: 2009-07-14 13:50 PDT
CC: 23 users
mconnor: blocking‑firefox2+
mconnor: blocking1.8.0.8+


Attachments
v1 patch (1.29 KB, patch)
2006-09-17 18:49 PDT, Darin Fisher
robert.strong.bugs: review+
v1.1 patch - 300k chunks (657 bytes, patch)
2006-09-18 10:46 PDT, Darin Fisher
dveditz: approval1.8.0.8+
mtschrep: approval1.8.1+
patch for the MOZILLA_1_8_0_BRANCH (1.16 KB, patch)
2006-10-24 22:07 PDT, (not reading, please use seth@sspitzer.org instead)
no flags

Description Dave Miller [:justdave] (justdave@bugzilla.org) 2006-09-15 11:45:40 PDT
We had a complaint from one of the mirrors today about the 64k chunksize used by the update service.  Seems the quantity of people updating nowadays means the number of people hitting simultaneously to grab these little tiny chunks is growing fast, and because of the cost of setup/teardown on each connection, all of the available connection slots on the server are getting used up a long time before they max out their bandwidth.

From IRC:

10:12:09 < maswan> justdave: anyway, can you please pass on that 1-meg chunks might be fine, but 64k really is killing us and not sustainable for larger releases. If this goes on we'll probably have to start aiming lower for the ACC mirror, which would be sad given that we have plenty of bandwidth to spare.
Comment 1 Justin Fitzhugh 2006-09-15 11:47:28 PDT
I got the same request from multiple other mirrors so I second this...
Comment 2 Myk Melez [:myk] [@mykmelez] 2006-09-15 13:04:04 PDT
The fix itself would be trivial, but what's the risk?  Seems like something worth considering for Fx2.
Comment 3 Michael Marineau 2006-09-15 13:49:22 PDT
I don't see what the risk could be, but then I don't understand why small chunks are used in the first place. Anyone care to fill me in? I'm curious.
Comment 4 Darin Fisher 2006-09-15 14:03:19 PDT
The risk is the impact this may have on dial-up users.  If their modem connection is saturated with a 1-meg download, then they will be very unhappy users.  64k was as large as I thought we could go without impairing dial-up users significantly.  The downloader (nsIncrementalDownload.cpp) could really benefit from some adaptive logic that figures out the available bandwidth and adjusts accordingly.
Comment 5 Darin Fisher 2006-09-15 14:06:45 PDT
Actually, perhaps an easy way to solve this problem would be for the downloader to increase its fetch size but to read that at a slowed rate.  For example, after each 64k chunk, the downloader could suspend the network channel for a couple seconds to relieve pressure on the user's network connection.  This would cause some extra ACKs across the network as the two TCP ends recognize the slowdown, which might also suck.  Other ideas?
Comment 6 Mattias Wadenstein 2006-09-15 16:18:26 PDT
(I'm maswan in the IRC paste)

Well, instead of hanging up after 64k, how about sleeping half a second every second or some similar approach? Or for that matter, if you want to do it by bytes, read N k from the net, then sleep for as long as that took to read?

If you only do 64k before hanging up, you'll never get good rates for those of us on good bandwidth either, considering that it takes much more to get a decent-sized tcp window.

We were quite prepared to handle 1+ Gbit/s burst in traffic, but 1k+ requests per second was quite the surprise, and not what our mirror was tuned for.
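A rough sketch of the duty-cycle idea above (read a block, then sleep for as long as the read took); all names here are hypothetical stand-ins, not the actual nsIncrementalDownload code:

```javascript
// Hypothetical sketch of a ~50% duty-cycle throttle, per the suggestion
// above: after each read, sleep for as long as the read took, roughly
// halving average bandwidth use while keeping one long-lived connection.
// readChunk and sleep are stand-ins, not real downloader APIs.
async function throttledDownload(readChunk, sleep, totalBytes, blockBytes) {
  let downloaded = 0;
  while (downloaded < totalBytes) {
    const start = Date.now();
    downloaded += await readChunk(Math.min(blockBytes, totalBytes - downloaded));
    const elapsed = Date.now() - start;
    await sleep(elapsed); // idle for as long as the read took
  }
  return downloaded;
}
```

With a ~10 s read time per 64 KB block on a modem, this would keep the user's link free about half the time while still presenting one persistent HTTP connection to the mirror.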
Comment 7 Tomas Ögren 2006-09-15 18:33:07 PDT
(In reply to comment #6)
> We were quite prepared to handle 1+ Gbit/s burst in traffic, but 1k+ requests
> per second was quite the surprise, and not what our mirror was tuned for.

And we are getting log management problems due to a gazillion 64k requests (like 30 million or so in a day).
Comment 8 Darin Fisher 2006-09-17 18:40:16 PDT
So, currently FF fetches a 64k chunk every minute.  Over a 50 kbps modem, that takes about 10 seconds to download.  A typical partial update is around 500k, which takes only 8 chunks to download (or ~8 minutes).

Ignoring the logs issue for a moment, have you guys tried increasing the keep-alive interval of your HTTP connections to over a minute?  That would eliminate the setup and teardown cost associated with the TCP connections.

We can also very easily tune FF to fetch the 64k chunk less frequently.  If we aim for having partial updates delivered within the span of an hour, then we could wait about 7-8 minutes between chunks.

We can and should also make the downloader use a more sophisticated method of minimizing bandwidth utilization, so that we can avoid 64k chunks in the first place.  However, that will entail more risk.
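Spelling out the arithmetic behind those figures (a quick back-of-the-envelope sketch, not code from the tree):

```javascript
// Back-of-the-envelope check of the numbers above: a 64 KB chunk over a
// 50 kbps modem, and a ~500 KB partial update at one chunk per minute.
const CHUNK_BYTES = 64 * 1024;     // 65536
const MODEM_BPS = 50000;           // 50 kbps modem
const PARTIAL_BYTES = 500 * 1024;  // typical partial update, ~500 KB

const secondsPerChunk = (CHUNK_BYTES * 8) / MODEM_BPS;  // ~10.5 s per chunk
const chunks = Math.ceil(PARTIAL_BYTES / CHUNK_BYTES);  // 8 chunks
const minutesTotal = chunks;                            // one chunk/minute -> ~8 min
```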
Comment 9 Darin Fisher 2006-09-17 18:49:54 PDT
Created attachment 238950 [details] [diff] [review]
v1 patch

OK, this patch bumps the chunk size up to 100000 bytes and the interval up to 10 minutes.  This is hopefully a decent short-term solution.
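The v1 patch is essentially a constants change; a hypothetical sketch of the new tuning values (the identifier names follow DOWNLOAD_CHUNK_SIZE and the interval constants quoted later in comment 29; the actual diff is attachment 238950):

```javascript
// Hypothetical sketch of the v1 tuning in nsUpdateService.js.in, not the
// actual diff: chunk size 64 KB -> 100000 bytes, background fetch
// interval 1 minute -> 10 minutes.
const DOWNLOAD_CHUNK_SIZE = 100000;        // bytes per fetch; was 65536
const DOWNLOAD_BACKGROUND_INTERVAL = 600;  // seconds between chunks; was 60
```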
Comment 10 Mike Schroepfer 2006-09-17 19:56:27 PDT
Darin - why 100k?   Why not 256K or bigger? On Windows, update sizes and the resulting request counts per chunk size have been as follows (update: size, then requests at each chunk size):

          size    65K   100K   200K   300K
0.1:      751K     12      8      4      3
0.2:      557K      9      6      3      2
0.3:      232K      4      3      2      1
0.4:      511K      8      6      3      2
0.5:      565K      9      6      3      2
0.6:       49K      1      1      1      1
0.7:      531K      9      6      3      2

Do we deal gracefully if the partial chunk dl is aborted?   What's the advantage to keeping it small?
Comment 11 Mattias Wadenstein 2006-09-18 03:43:34 PDT
(In reply to comment #8)

> Ignoring the logs issue for a moment, have you guys tried increasing the
> keep-alive interval of your HTTP connections to over a minute?  That would
> eliminate the setup and teardown cost associated with the TCP connections.

We actually dropped max keep-alive from 5 seconds to 1 second, since we only have a finite number of threads. Over a minute would require something like 100k MaxClients (~1000-1200 requests/s), and I'm not sure either our machines or Apache would be happy with 100k threads, even if most of them are just idling in keep-alive.

> We can also very easily tune FF to fetch the 64k chunk less frequently.  If we
> aim for having partial updates delivered within the span of an hour, then we
> could wait about 7-8 minutes between chunks.

Does this really matter? In total the same number of chunks are going to be fetched; it will just be more spread out over time per client (but likely, more different clients will be fetching at any given moment).

/Mattias Wadenstein
Comment 12 Darin Fisher 2006-09-18 09:00:16 PDT
> Darin - why 100k?   Why not 256K or bigger?

100k takes about 15 seconds to download over a 50 kbps modem.  During that time the user's network is saturated with this download, and they will be unable to use their network connection for much of anything else.  Their OS will ensure that other TCP connections are not entirely blocked out, but it will certainly seem like their network connection is suddenly much slower for that interval of time.  So, I think we should strive to keep that interval small-ish.


> Do we deal gracefully if the partial chunk dl is aborted?

Yes, the downloader is designed for that.  It'll pick up where it left off the next time the browser is started.


> Does this really matter? In total the same number of chunks are going to be
> fetched, it will just be more spread out over time per client (but likely,
> more different clients will be fetching at every moment).

I was assuming that a big part of the problem was the swell of downloads after a release, so if we spread that swell out a little more, it would help.  Instead of a big spike in connections, with this change you'd see a softer spike in connections, no?
Comment 13 Myk Melez [:myk] [@mykmelez] 2006-09-18 10:24:21 PDT
> I was assuming that a big part of the problem was the swell of downloads after
> a release, so if we spread that swell out a little more, it would help. 
> Instead of a big spike in connections, with this change you'd see a softer
> spike in connections, no?

I think the primary problem is not the swell in downloads (as Mattias says in comment 6, they were "quite prepared to handle 1+ Gbit/s burst in traffic") but rather inefficient utilization of capacity.  Currently, a significant portion of mirror bandwidth isn't being used because the servers hit their connection limits before they hit their bandwidth capacity.

If you spread out the swell by pausing longer between requests, so the update gets downloaded more slowly, then more users will be able to start downloading the software sooner, but you still won't take any better advantage of the mirrors' additional bandwidth capacity.

The secondary problem is that many small requests create overlarge logs, increasing the log management burden for mirrors.  This problem also won't be mitigated by spreading out downloads, since the additional users who become able to connect will keep the servers at their connection limits and thus keep generating the same number of log entries.
Comment 14 Mattias Wadenstein 2006-09-18 10:26:06 PDT
> I was assuming that a big part of the problem was the swell of downloads after
> a release, so if we spread that swell out a little more, it would help. 
> Instead of a big spike in connections, with this change you'd see a softer
> spike in connections, no?

Spreading out the spike by a few hours won't make much of a difference. We're still seeing >500 requests per second today, several days after the release. And while this is less than half the peak, it is still way more than we find reasonable.

    http://www.acc.umu.se/technical/statistics/ftp/monitordata/index.html.en

For reference, saimei on that graph does a third of the bandwidth of the other two hosts. It is not listed in mozilla bouncer and just handles the other things we mirror. It does 4 requests per second right now.

Anyway, I'd really appreciate it if you could do some different kind of throttling than splitting it up into tiny chunks. In the long run, the kind of manual intervention that we're doing to keep up with this load is not really sustainable, and it might make mirrors reconsider their commitment to mozilla mirroring.

It would also be rather sad to remove all the request logging for mozilla downloads; both we and some mozilla people seem to want to have that around.
Comment 15 Mike Schroepfer 2006-09-18 10:33:36 PDT
Darin,

My concern is that 100K doesn't do enough to substantially reduce the number of req/s for our mirrors.  At 300K we'd hold the line for 45s on dial-up.  It's not like it is totally unusable - just slow.  Combine that with 74% US broadband penetration (90% at work, and the US is 20th in the world in BB penetration) - and the fact that 300k chunks will, on average, reduce the number of requests by 4x - and we think we should use that number.  Given we are holding the line for 45s, the 10 minute interval also seems reasonable.  Make sense?

We should open a second bug to implement a better form of bandwidth throttling for future releases.   
Comment 16 Justin Dolske [:Dolske] 2006-09-18 10:33:58 PDT
Were there signs of this problem when the last update was released (which I wasn't around for), or on other mirrors? That was just two months ago, so one would think the traffic profile for this update would be very similar. Maybe we just reached a tipping point, but I'd want to rule out any kind of regression.

That's not to say we shouldn't try to improve the download behavior, though. I wonder if we could try to guesstimate the capability of the user's network connection by gathering some metrics at update time... For example: the time to establish the TCP connection, any delay before server responds to request, the download speed of the first chunk, etc. [The server's capabilities would affect these numbers, but that's not a bad thing since we want to be nice to both ends.] The updater could then decide if it should just go ahead and snarf the rest of the update immediately, or if it should backoff by trying again later.
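That guesstimate heuristic could look something like the following; every threshold and name here is invented for illustration, not a proposed API:

```javascript
// Hypothetical bandwidth-guessing heuristic per the comment above: time
// the TCP connect, the server's first-byte delay, and the first chunk's
// throughput, then either fetch the rest immediately or back off.
// All thresholds are invented for illustration.
function decideDownloadMode({ connectMs, firstByteMs, firstChunkBytes, firstChunkMs }) {
  const bytesPerSec = (firstChunkBytes / firstChunkMs) * 1000;
  const fastLink = bytesPerSec > 100000;                  // > ~100 KB/s: likely broadband
  const responsiveServer = connectMs < 500 && firstByteMs < 1000;
  return fastLink && responsiveServer ? "snarf-now" : "backoff";
}
```

A fast link and a responsive server would let the updater snarf the rest of the update immediately; anything else falls back to the slow chunked schedule, which is nice to both ends as the comment suggests.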
Comment 17 Justin Dolske [:Dolske] 2006-09-18 10:35:58 PDT
(oops, dunno how the "mark FIXED" field got checked. sorry!)
Comment 18 Darin Fisher 2006-09-18 10:46:00 PDT
Created attachment 239048 [details] [diff] [review]
v1.1 patch - 300k chunks

OK, per discussion at the bonecho triage meeting this morning, we're going to go with a 300k chunk size and an interval of 10 minutes between chunks.  (Sucks to be a modem user.)

I'm also going to pursue teaching the downloader to throttle downloads itself.
Comment 19 Mike Schroepfer 2006-09-18 10:48:44 PDT
Comment on attachment 239048 [details] [diff] [review]
v1.1 patch - 300k chunks

a=schrep - thanks Darin!
Comment 20 Darin Fisher 2006-09-18 10:49:24 PDT
I filed bug 353182 for making nsIncrementalDownload smarter.
Comment 21 Darin Fisher 2006-09-18 10:51:25 PDT
fixed-on-trunk, fixed1.8.1
Comment 22 Christian :Biesinger (don't email me, ping me on IRC) 2006-09-19 10:26:12 PDT
(In reply to comment #15)
> Combine that with 74% US
> broadband penetration (90% at work, and US is 20th in the world in BB
> penetration)

This product isn't a US-only product...
Comment 23 Christian :Biesinger (don't email me, ping me on IRC) 2006-09-19 10:27:20 PDT
(I'm not saying that I disagree with that change, I'm just saying that just considering US statistics is perhaps not such a good idea)
Comment 24 Myk Melez [:myk] [@mykmelez] 2006-09-19 11:42:35 PDT
> > Combine that with 74% US
> > broadband penetration (90% at work, and US is 20th in the world in BB
> > penetration)
> 
> This product isn't a US-only product...
> ...
> (I'm not saying that I disagree with that change, I'm just saying that just
> considering US statistics is perhaps not such a good idea)

We weren't considering only US statistics.  As schrep pointed out in his comment, the US is 20th in the world in broadband penetration, which means many other countries have an even higher percentage of Internet users on broadband, so the implemented fix is even better for users in those countries.
Comment 25 Christian :Biesinger (don't email me, ping me on IRC) 2006-09-23 21:29:22 PDT
Ah, ok. sorry, I didn't notice that part of the comment.
Comment 26 Mike Connor [:mconnor] 2006-10-24 21:53:32 PDT
We need this for 1.5.0.8 or the major update to 2.0 will kill mirrors.
Comment 27 (not reading, please use seth@sspitzer.org instead) 2006-10-24 22:07:56 PDT
Created attachment 243439 [details] [diff] [review]
patch for the MOZILLA_1_8_0_BRANCH
Comment 28 (not reading, please use seth@sspitzer.org instead) 2006-10-25 10:08:57 PDT
I have tested this patch on my 1508 tree, and it does what we expect:

From my javascript console:

before:

onProgress: http://www.sspitzer.org/darin/update-test-3/1.mar, 65536/7876331

after:

onProgress: http://www.sspitzer.org/darin/update-test-3/1.mar, 300000/7876331

waiting for approval before I land

Comment 29 (not reading, please use seth@sspitzer.org instead) 2006-10-25 10:21:57 PDT
To answer a question Dan raised in the 150x meeting:

See http://lxr.mozilla.org/mozilla1.8.0/source/toolkit/mozapps/update/src/nsUpdateService.js.in#2212

2212     var interval = this.background ? DOWNLOAD_BACKGROUND_INTERVAL
2213                                    : DOWNLOAD_FOREGROUND_INTERVAL;
2214     this._request.init(uri, patchFile, DOWNLOAD_CHUNK_SIZE, interval);

all that differs is the interval.
Comment 30 Daniel Veditz [:dveditz] 2006-10-25 10:39:58 PDT
Comment on attachment 239048 [details] [diff] [review]
v1.1 patch - 300k chunks

approved for 1.8.0 branch, a=dveditz for drivers
Comment 31 (not reading, please use seth@sspitzer.org instead) 2006-10-25 10:42:37 PDT
fix landed on the MOZILLA_1_8_0_BRANCH
Comment 32 (not reading, please use seth@sspitzer.org instead) 2006-10-25 16:30:27 PDT
added a litmus test case for this, see http://litmus.mozilla.org/show_test.cgi?id=2711
Comment 33 Dave Liebreich [:davel] 2006-10-26 10:53:10 PDT
Behaviour has changed from 1507 to 1508rc2.

I set prefs to check for update soon after app start, and download in the background.  

1507 downloads one 64K chunk each minute; 1508rc2 downloads one 300000-byte chunk each 10 minutes.  This is as expected.


I selected Help->Downloading ... Update from the menu.  When the UI appeared, the download went to line speed for both 1507 and 1508rc2.

I clicked on the "Hide" button.  After the dialog was dismissed, 1507 went back to the 1 64K chunk/min rate; 1508rc3 remained at line speed.

I measured the rate of the download by inspecting updates/0/update.mar at regular intervals (read: |ls -l| over and over)

Pref changes as follows:

app.update.interval: 60
app.update.timer: 600
app.update.url.override: file:///Users/davel/update.xml


update.xml:

<?xml version="1.0"?>
<updates>
    <update type="minor" version="2.0mt2" extensionVersion="2.0" buildID="2006092114" >
        <patch type="complete" URL="http://stage.mozilla.org/pub/mozilla.org/firefox/nightly/1.5.0.8-candidates/rc2/firefox-1.5.0.8.en-US.mac.dmg" hashFunction="SHA1" hashValue="10c287df3d0479b0579a61531498addc4c325746" size="16507599"/>
    </update>
</updates>

Comment 34 (not reading, please use seth@sspitzer.org instead) 2006-10-26 12:04:30 PDT
Dave, here's the answer:

In my backport from the MOZILLA_1_8_BRANCH to the MOZILLA_1_8_0_BRANCH I included jwalden's fix for https://bugzilla.mozilla.org/show_bug.cgi?id=304381

See http://lxr.mozilla.org/mozilla1.8.0/source/toolkit/mozapps/update/content/updates.js#1359
