14.87 KB, text/plain
16.47 KB, patch
|Details | Diff | Splinter Review|
2.32 KB, patch
|Details | Diff | Splinter Review|
1.31 KB, patch
Justin Lebar (not reading bugmail): review+
|Details | Diff | Splinter Review|
STR: * Set image.mem.min_discard_timeout_ms to 3000 (or leave it at 10000 if you're patient). * Load . * Switch to another tab and wait for the images to be discarded. * Switch to the tab with the images, and then quickly try to switch to another tab. Switching away from the tab with the images takes a few seconds on my fast Linux machine. AIUI we have a heuristic to ensure that jpeg decoding periodically yields and doesn't hog the event loop. Maybe this heuristic needs to be tuned?  https://bug664642.bugzilla.mozilla.org/attachment.cgi?id=539707
Users observe this as "slow tab switching", so I think this is a pretty important issue. I just tried setting image.mem.max_ms_before_yield to 10 (down from 400) and it didn't make things noticeably faster. Also setting image.mem.decode_bytes_at_a_time down to 20000 or 2000 from 200000 didn't help either.
What I observe when I make the max_ms_before_yield small is that when I switch away from the tab with the images, the new tab is pretty quickly highlighted in the tab bar, but the content doesn't change until a second or two later.
Ah, I apparently needed to restart the browser to see the effect of this pref change? That's not what I would have expected from a cursory look at the code... In any case, with a restart, it seems to make a huge difference. I don't know how this will impact our decoding speed, but I'd be happy to accept even a 10 or 20% slowdown in exchange for this snappiness improvement.
In order to hit 5 or 10ms on a wide variety of platforms, it looks like we'd have to turn the block size down pretty small (say from 200000 to 1024 bytes). But this might compromise performance more than we want, particularly because the code currently calls gettimeofday once for every block. My idea is to use an additive increase / multiplicative decrease scheme to get close to the right block size with only one gettimeofday call for each time we yield to the event loop. One thing which I find a bit perplexing is that decoding the first block seems to take about 200 times longer than any other block. Is this expected? Can anyone help me understand why this might be?
cc'ing People Who Might Know re comment 4.
I have no immediate thoughts. If you continue to remind me of this issue and I might get some :)
Ah, I think the reason we didn't notice horrible (400ms+) lag times before is that when we retrieve an image over the network (and possibly via cache), it comes in in 32k chunks, and we yield after each one. On the other hand, when we decode on draw, we can decode in much larger chunks.
Created attachment 544257 [details] [diff] [review] WIP v1 This still needs to be cleaned up, but the idea is here.
Created attachment 544314 [details] [diff] [review] Patch v1
Comment on attachment 544314 [details] [diff] [review] Patch v1 Review of attachment 544314 [details] [diff] [review]: ----------------------------------------------------------------- ::: modules/libpr0n/src/RasterImage.cpp @@ +82,5 @@ > + * Use LOG_IMAGE_DECODE_TYPES to enable/disable logging about how long we hold > + * the event loop captive without yielding while we decode an image. This is > + * mainly interesting on release builds, but you're probably not building with > + * PR_LOGGING enabled. So rather than make you recompile the world to enable > + * these message, here's a #define. I'd much rather we continue to use the regular imagelib logging infrastructure. Forcing logging on is bug 481962. @@ +2673,5 @@ > } > > +imgDecodeWorker::imgDecodeWorker(imgIContainer* aContainer) > + : mContainer(do_GetWeakReference(aContainer)), > + mDecodeBlockSize(gInitialDecodeBlockSize) Storing this in the decode worker means we throw away all the work we've done to even slightly estimate the decode time once we're done decoding. A global variable, or one per decoder, or one per imgRequest, might be a better choice. @@ +2732,5 @@ > + > + PRUint32 bytesDecoded = image->mBytesDecoded - oldBytesDecoded; > + > + // Modify mDecodeBlockSize so that next time, decoding one block takes > + // approximately gMSBeforeYield. I don't think we can assume that the decoding time from earlier chunks will relate to the decoding time from later chunks. That being said, I'm not sure if there's a better way.
Also, I'd like some way of testing this to ensure we're decoding asynchronously. I recognize that that's hard. :)
> I'd much rather we continue to use the regular imagelib logging infrastructure. > Forcing logging on is bug 481962. That sounds great. I'll make some noise in that bug. > Storing this in the decode worker means we throw away all the work we've done > to even slightly estimate the decode time once we're done decoding. A global > variable, or one per decoder, or one per imgRequest, might be a better choice. I should have mentioned this, but what I found when I tested using  was that some images decoded 10x faster (measured in bits decoded / s) than others. I don't know why. We could keep a running average of how long it usually takes to decode a JPEG, but then when confronted with a particularly slow image, we'll bite off more than we can chew in the first pass. > I don't think we can assume that the decoding time from earlier chunks will > relate to the decoding time from later chunks. > > That being said, I'm not sure if there's a better way. Me either. The hope is that we set the yield time low enough that even if we take 10x as long as we expected occasionally, that's OK. But the only way to be sure that we're not spinning for too long is to use very small block sizes and check the time after every iteration -- certainly we don't want to do that. > Also, I'd like some way of testing this to ensure we're decoding > asynchronously. I recognize that that's hard. :) Yeesh. How much would you like this? :)  https://bugzilla.mozilla.org/attachment.cgi?id=539707
Created attachment 544511 [details] [diff] [review] Patch v2
Comment on attachment 544511 [details] [diff] [review] Patch v2 Switches to PR_LOG and a better Preferences interface.
I think it would probably be good to add telemetry on how long we spend in the image decoder.
Comment on attachment 544511 [details] [diff] [review] Patch v2 Review of attachment 544511 [details] [diff] [review]: ----------------------------------------------------------------- There are still some fundamental issues here. Jeff and I have talked about this, and we think: 1. Adjusting decode time should absolutely not be per imgDecodeWorker. This leads to us throwing data away much too frequently. 2. We are adjusting a lot of constants, and our behaviour, but we haven't done much measurement. We should measure a) how long it takes to decode some images on different machines; b) what the effect of your patch is on these same machines (eg what do we end up calculating for chunk size). ::: modules/libpr0n/src/RasterImage.cpp @@ +279,5 @@ > !(aFlags & INIT_FLAG_DECODE_ON_DRAW)), > "Can't be discardable or decode-on-draw for multipart"); > > + if (NS_UNLIKELY(!gRegisteredPrefObservers)) > + RegisterPrefObservers(); You should probably check gRegisteredPrefObservers in only one spot, not both.
Comment on attachment 544511 [details] [diff] [review] Patch v2 Switching from 400ms to 5 seems a bit drastic. 400ms is really high, so I'd be happy to cut that in half without doing any real measurement.
The UI team's responsiveness goal is 50ms to respond to any user action. Some actions take multiple trips through the event loop to complete. And sometimes we underestimate how long it's going to take to decode a block. Maybe it should be 10, or 15ms. I don't think we understand the tradeoff well right now.
> 1. Adjusting decode time should absolutely not be per imgDecodeWorker. This > leads to us throwing data away much too frequently. Do you have a better idea in mind? If one JPEG decodes 10x slower than another, I'm not sure how we might do better than recalculating the data for each worker. Even if we could use past data to approximate the decoding speed, we'd save at most one trip through the event loop compared to the current approach. If that's actually worth optimizing out (I'm not convinced), we could just time the 4096 byte (or however-large) block and then run a full-sized block with our remaining time budget before yielding. Can you envision sharing data between workers beating that approach in some way? > 2. We are adjusting a lot of constants, and our behaviour, but we haven't > done much measurement. We should measure a) how long it takes to decode some > images on different machines; b) what the effect of your patch is on these > same machines (eg what do we end up calculating for chunk size). I'm not sure what measuring specifically the number of bytes decoded in 5ms on a variety of machines would tell us. If I said we decode JPEGs at about 42kb/5ms on my Nexus-S, is there some action you'd take? I think it might make sense to measure how much longer it takes from initial decode to end of final decode with this patch versus without it. I'm happy to measure that on the machines I have so we can tune the 5ms versus 10ms yield time. Is there anything else you think we should measure? If so, what would you want to use those measurements for?
(In reply to comment #15) > I think it would probably be good to add telemetry on how long we spend in > the image decoder. I can think of a few different measurements along these lines: 1) Start the clock when we first begin decoding the image, and stop it once we finish decoding. Potentially divide by the number of bytes in the compressed image. This is problematic to report because it depends on how many other images are on the page (if there are twice as many images, we'll appear to decode any one image at half the speed). Also, if we're decoding as we receive the image over the network a busy disk, the value here will reflect not the decoder but the speed of the network/disk. We could exclude these, and just time decodes for previously-discarded images. 2) Start the clock just before we decode a block, and stop the clock just after we finish decoding it. Here we'd want to check how closely we're meeting our 5ms (or however-long) goal. One problem is that the final decode tends to take much longer than previous ones, I presume because some other processing is hidden in the decode step when we hit the end of an image. We could just exclude the final decode, I guess. 3) Time how long it takes to decode a block, and then report the number of bytes in that block divided by the time. I guess this would be interesting for the purposes of seeing whether we made a significant optimization in the JPEG decoder, but I have difficulty imagining us taking action based on these numbers. I think (2) might be worth including in telemetry, and maybe (3). Not so much (1). What do you think?
> Is there anything else you think we should measure? If so, what would you > want to use those measurements for? I think getting user data on how long we're spending in the decode loop would be valuable. Also, getting numbers on bytes/s would be good. This can help us know how valuable improving the performance of this pipeline can be. (There's at minimum the long hanging fruit of doing the R8G8B8 to X8R8G8B8 conversion during color conversion.)
> Switching from 400ms to 5 seems a bit drastic. It does. But AFAICT, except in the case where we decode-on-draw a discarded image, we decode in 32k chunks, which take about 1ms on my machine. So in comparison, the 400ms for this previously-uncommon class of decodes may be what's drastic.
> I think getting user data on how long we're spending in the decode loop would > be valuable. Also, getting numbers on bytes/s would be good. Just to be clear, this is (1) and (2) from comment 20, right?
> Just to be clear, this is (1) and (2) from comment 20, right? Erm, (2) and (3). (I'm not sure whether "how long we're spending in the decode loop" is per block or per image.)
(In reply to comment #20) > (In reply to comment #15) > > I think it would probably be good to add telemetry on how long we spend in > > the image decoder. > > I can think of a few different measurements along these lines: > > 1) Start the clock when we first begin decoding the image, and stop it once > we finish decoding. Potentially divide by the number of bytes in the > compressed image. I'm not particularly interested in this measurement. > > 2) Start the clock just before we decode a block, and stop the clock just > after we finish decoding it. Here we'd want to check how closely we're > meeting our 5ms (or however-long) goal. I'd like this one. > One problem is that the final decode tends to take much longer than previous > ones, I presume because some other processing is hidden in the decode step > when we hit the end of an image. We could just exclude the final decode, I > guess. That's weird. We should figure out why this is. > 3) Time how long it takes to decode a block, and then report the number of > bytes in that block divided by the time. I guess this would be interesting > for the purposes of seeing whether we made a significant optimization in the > JPEG decoder, but I have difficulty imagining us taking action based on > these numbers. I'd like this one too, for the reasons I've explained above. > I think (2) might be worth including in telemetry, and maybe (3). Not so > much (1). What do you think? I'd like (2) and (3). -Jeff
Nominating for tracking-ff-8. Bug 664290 reduced the image discard timer from 2m to 10s. Because of this change, we decode discarded images much more often than we did. AFAICT, the effects of the 400ms yield are felt mainly (perhaps only) when decoding discarded images (comment 22), so in some sense, this bug is attempting to address a snappiness regression from bug 664290.
I'll spin a new patch and get some measurements once we decide what's the right approach re sharing decoding speed data between workers (comment 19).
(In reply to comment #19) > Do you have a better idea in mind? If one JPEG decodes 10x slower than > another, I'm not sure how we might do better than recalculating the data for > each worker. For each worker is definitely wrong because we throw away the worker after we've decoded. RasterImage is a better place, though I still believe that Decoder is better still.
> For each worker is definitely wrong because we throw away the worker after > we've decoded. RasterImage is a better place, though I still believe that > Decoder is better still. Ah, I understand now. Storing in the RasterImage sounds fine to me. If we were to store in the decoder, do you have any idea how we'd deal with the fact that some images decode more slowly than others? It's important that we don't overestimate how quickly we're going to decode this particular image.
Created attachment 544866 [details] PNG decoding log This is the log of decoding , a large PNG. Notice that the first and last steps take much longer than the rest. (The patch I used here is different from v2 in that it clamps the minimum block size to the block size multiple, 1024.)  http://www.libpng.org/pub/png/img_png/16million-pschmidt.png
We'll need to figure out why the last step takes so much longer if we're to store the block size in the image rather than the decoder. Otherwise when we redecode, the last decode step will have been the final decode for the image, which is slow, and so we'll use a smaller block size than we should. Which defeats the whole purpose. We similarly need to understand why the first step is slow, because if we use an appropriately-sized block for the initial decode, we could end up taking much longer than we want.
> We'll need to figure out why the last step takes so much longer For PNGs at least, this is buried deep in the decoder. The slow function is png_push_have_end(). Still looking to see what's going on...
Ah, I did not expect that function pointer to take me back into Mozilla code. It's Decoder::PostDecodeDone that's slow. My guess is that it's the image re-scaler which is eating up cycles here.
The slow part at the end is imgFrame::Optimize. I'll try to figure out what's slow at the beginning now.
The expensive call at the beginning is, perhaps unsurprisingly, nsPNGDecoder::info_callback(). There are two hot functions inside: * Decoder::PostSize, and * Decoder::CreateFrame. I'm not sure how we want to proceed here. We can now point to what's expensive at the beginning and the end of decoding. It seems pretty benign to me, or at least, it doesn't seem that we should block these improvements on making those routines return to the event loop periodically. Given that there are expensive routines at the beginning and end of decoding (confirmed for PNG, likely for JPEG), we could try to be smart about this and not count the first and last measured decoding speeds towards our measurement of how fast we can decode the image. On the other hand, we've already agreed that the speed at which we decoded a previous block doesn't predict perfectly the speed at which we'll decode the next block. So another option would be to write code which is reasonably robust to blocks suddenly taking longer than we expected, and then not to worry about the first/last block issues. My inclination is to go with the latter approach. I think that's basically what patch v2 does. It's trivial to save the block size in the RasterImage rather than the worker, so I'm happy to do that, but since we know that the speed at which the last block is decoded isn't indicative of the speed at which the first block will be decoded -- and anyway, we can't do much to affect the speed of either of these blocks, since the critical path is not in the image decoder -- the extra information will just get thrown away. I still need to measure the degree to which yielding affects decoding speed. I'm pretty sure it's going to be minimal, however. If the image is gigantic and takes 1s to decode, we're going to yield ~200 times, which is almost nothing. If yielding every 5ms is indeed cheap, I'm planning to submit basically the same patch as v2 for review again. If you'd like me to do something different, please ask!
WRT to telemetry, I'm planning to handle that separately.
I did some testing of the decoding speed. The extra yields don't seem to impact decoding speed much in and of themselves. When I dump the decoded copy of  and then redraw it, it takes about 300ms to decode with 5000ms yields (so just one yield during the decoding, at the beginning) and with 5ms yields. (These speeds are heavily influenced by what I presume to be cache effects. I'm reporting warm numbers, say after decoding the image twice in quick succession, because the cold values are all over the place.) The story is a bit different when we decode from the network or cache. For this test, I saved  to my disk, opened it in a tab, closed the tab, and then restored the tab. Here, the image arrives in 32kb chunks. If I set the yield time low enough, we actually end up doing more work per yield *and* finishing slower. It's the worst of both worlds. We initially decode a few bytes to calculate our speed. We yield and then, if we have a large time-before-yield, we consume all available bytes in one fell swoop. All decodes after this are kicked off when we receive a block of 32kb, and decode only that much. Each 32kb decode takes about 1ms on my machine. When we have a short yield time (say 5ms), we don't do that initial consume-everything step after the first yield. Therefore there's always ample data in the buffer, and we always spend a full 5ms decoding. I don't know why the second one ends up taking longer (*), but my guess is that it has something to do with drawing the decoded image to the screen. Even though we end up yielding more often when we have a larger time-before-yield, the image appears more smoothly (and more slowly) when we have a 5ms yield. (In contrast, when re-decoding a discarded image, the image appears all at once, regardless of the time-to-yield.) (*) The amount by which it is slower is highly variable, and I've had trouble reproducing this result. It's occasionally 1.8x slower, but it's often approximately the same speed.
 http://upload.wikimedia.org/wikipedia/commons/a/aa/Yiddish_WWI_poster2.jpg The tests indicate that this change is an unmitigated win for discarded images, and perhaps harmful when loading large images out of the cache. We could turn this on just for discarded images. I'm a bit concerned that we might want to yield more than once every 32kb when decoding some images;  decodes at a rate of about 1kb / 5ms for me, so 32kb ends up being a long time. It seems like maybe I should look into rate-limiting the drawing step...  http://www.libpng.org/pub/png/img_png/16million-pschmidt.png
> It seems like maybe I should look into rate-limiting the drawing step... Does anyone have thoughts as to whether this is a good idea, or whether I should try something else? Also, I'm not sure where to look to do this.
I think we should use the same logic regardless of whether we're redecoding or decoding from necko, unless we have some very good reason not to. Both codepaths end up using imgDecodeWorker unless I'm misunderstanding, right? I'm not sure what you mean by rate-limiting the drawing step. We intentionally display progressively when we're first downloading, because that's what users expect (i.e., it can take a while to download big images, and users understand that, but want to see what's coming in as it's coming in), and we wait until we're finished redecoding after discard because discarding is completely invisible to users and they think we're redownloading an image if we display it progressively when redecoding.
(In reply to comment #40) > I think we should use the same logic regardless of whether we're redecoding > or decoding from necko, unless we have some very good reason not to. Both > codepaths end up using imgDecodeWorker unless I'm misunderstanding, right? Yes, they both end up in imgDecodeWorker. FWIW, we don't currently use exactly the same logic for the first versus the second decode, since, as you pointed out, we display progressively only on the first decode. But I agree it would be nice not to add more of that kind of thing. > I'm not sure what you mean by rate-limiting the drawing step. I mean, right now it appears to draw after each yield. We might make it draw once every 50ms or something, and that might mitigate the performance impact of this patch. The knob to tweak how often we draw to the screen was staring at me this whole time -- it's just that FlushInvalidations() call we avoid doing on the second decode. I just set it to flush invalidations at most once every 50ms, and the speeds are much closer now with and without the patch when I test with jpeg  from comment 38. Decoding png  from comment 38 takes longer (factor of ~1.25) with the patch because without it, we decode in one fell swoop (locking up the browser for half a second or so), while with the patch, we draw it two or three times. (The image's compressed size is very small compared to how large it is, which is why the current implementation doesn't yield at all.) I'll clean up this newest incarnation and post it to the bug so you can have a look.
(In reply to comment #41) > I just set it to flush invalidations at most once every 50ms, and the speeds > are much closer now with and without the patch when I test with jpeg  > from comment 38. Adding yet another knob to tweak doesn't thrill me. That being said, maybe we should wait for the animation rate limiting to land, then re-implement that part of this patch so we rely on the refresh driver.
Created attachment 545384 [details] [diff] [review] Patch v3 Now with even more knobs. Is the rate-limiting you speak of bug 666446, or something else?
(In reply to comment #43) > Is the rate-limiting you speak of bug 666446, or something else? That's the one.
Waiting on bug 666446 sounds like a good plan to me, since it looks like that bug is getting traction.
I think a big part of the problem is that we can have a large number of imgDecodeWorkers waiting on the event queue at once. Say there's a page with 100 large images loaded in a background tab. Switching to the tab will kick off 100 async decodes which will dispatch 100 imgDecodeWorkers to the event queue. The event queue is FIFO, so each of these workers will need to run an iteration before the browser can go back to processing other events. Even if each worker is guaranteed to never spend more than 5ms per iteration, the upper bound for the browser being unresponsive is still 500ms. This seems to be a tricky problem since tuning the chunk size or yield deadline will only help when we're decoding a small number of images. Perhaps the imgDecodeWorker class could be turned into a singleton that manages async decoding for all RasterImages. Internally it would manage a queue of images that are waiting to be decoded. When run from the event queue, it would take an image off the queue, do some decoding (yielding as necessary), then re-post itself to the event queue if there's still more work to do. This way there would be no more than one imgDecodeWorker on the event queue at any given moment. From there we can tune the knobs until we find the right balance of responsiveness and decode time. How does this approach sound? I should have time to write a proof-of-concept patch on Monday if no one gets to it before then.
In fact I would imagine a singleton worker might even run faster than 100 workers, and conceivably produce better user experience than 100 workers all trying to decode at once and finishing in random-ish order. It will need to be tested to verify this assumption; it's possible for example that 2 or 4 or some small number might be faster/better than 1.
> I think a big part of the problem is that we can have a large number of > imgDecodeWorkers waiting on the event queue at once. I think this is an excellent point. > How does this approach sound? I should have time to write a proof-of-concept > patch on Monday if no one gets to it before then. That sounds like a plan to me. :) I wasn't going to touch this again until bug 666446 lands.
Note that bug 650968 is related.
Also, it appears we used to decode in 4kb blocks, aiming to yield after 5ms, but this was increased in bug 590260. As a first step, can we back that change out by modifying the prefs? I think we should return to a state where we don't lock up the UI for seconds during tab switching, and then work on improving it, e.g. by decoding in a single worker, and drawing using the mechanism in bug 666446. The perf hit is relatively small, and it's only relevant when loading large images from cache.
Created attachment 546593 [details] [diff] [review] Hack for printing start-to-finish decoding times.
Created attachment 547266 [details] [diff] [review] Effectively back out bug 590260 Joe r+'ed this patch in bug 590260 (I didn't notice that bzexport took the last bug number instead of the first), but it belongs here. I think we should take this as a fix for this bug and then file separate bugs for: * Doing all image decodes in a single worker, so we avoid the case where we block the event loop for 500ms by decoding 100 images for 5ms each. * Reducing the 4096 block size when we notice that it takes longer than 5ms to decode that much.
(In reply to comment #53) > inbound: http://hg.mozilla.org/integration/mozilla-inbound/rev/fa82e3fa46b9 Woah, do we have any data that suggest that the new prefs are better? i.e where does the 4096 block size come from?
I suspect the 4096 was basically pulled out of a hat, just as the 400kb appears to have been. The question is, is it too large, or too small? I don't think it's too large. The slowest-to-decode image I could find is , which on my (fast) machine, decodes at a rate of 1kb/5ms (comment 38). All the large jpegs I tried decode much faster than this (I don't have numbers offhand). So in the worst case, where your machine is 10x slower than mine, we're decoding at .1kb/5ms, or 40 * 5ms = 200ms before yielding. This is much better than when we had a 400kb block. If it's too small, we end up calling gettimeofday() more often than we need to. But to solve that problem, we need the second bullet point from comment 52: Increase the 4096 byte block size when we notice it takes less than 5ms to decode that much. In either case, I think what we checked in is a step in the right direction...  http://www.libpng.org/pub/png/img_png/16million-pschmidt.png
(In reply to comment #55) > If it's too small, we end up calling gettimeofday() more often than we need > to. But to solve that problem, we need the second bullet point from comment > 52: Increase the 4096 byte block size when we notice it takes less than 5ms > to decode that much. My concern is that it's too low. I don't think it's worth tuning this for worst-cases. I think it's a bad pessimization to design around worst-case latency. For example, I think it's unrealistic to lower the slow script warning to have a 5ms timeout. We should aim for the block size to be the value such that it takes 5ms for the slowest typical image to decode on the slowest typical machine. I would have preferred we do more measurement before rushing to change it.
> My concern is that it's too low. > > We should aim for the block size to be the value such that it takes 5ms for the > slowest typical image to decode on the slowest typical machine. I guess it's hard to know what's the decode rate of the slowest typical image on the slowest typical machine without a few weeks of telemetry (and of course our nightly testers' machines will be different from the wider audiences'), but based on the experiments I did on my machine, I think 4096 bytes is probably on the high end. I can already demonstrate one image which my fast desktop machine decodes at a rate of 1024 bytes / 5ms, so I imagine there are many such images for, say, phones. More to the point, the experiments in this bug suggest that there isn't much overhead in 10x as many gettimeofday() calls or 10x as many trips through the event loop. It's painting that slows you down, at least on my machine. I doubt that there's a single block size which is suitable for both smartphones and Core i7s, which is why I was trying to adaptively set the block size in this bug. I'll back out the change if you think that's a hard requirement. I definitely don't want to rush into this; I just don't want to let the perfect be the enemy of the good.
http://hg.mozilla.org/mozilla-central/rev/fa82e3fa46b9 I'm marking fixed but the status of the bug is a bit unclear, please correct if it's wrong.
> My concern is that it's too low. On IRC, Jeff and I discussed that if we have different block sizes for different formats, we could probably raise the jpeg block size substantially. That work is blocked on telemetry, bug 673176.
Removed blocking of bug 666446 since it looks like this was fixed another way and no longer depends on 666446.
No need to track this for 8, it's already in there...
As an FYI, this really dragged us down in TenFourFox. We ended up reverting two of the prefs; see http://code.google.com/p/tenfourfox/issues/detail?id=112 Admittedly I imagine optimizing for a slow G3 may be out of your purview, but even on a 6-year old G5 these changes hurt, and that's certainly within reasonable range of what would still be running Fx.
We'll be increasing this value in bug 811467