Spinoff of 411379. STR: 1) Download the testcase at the URL 2) Load on today's trunk on MacOS 3) Load on 2.0.0.x on MacOS Note trunk takes ~7x as long as branch (on same hardware): 2011 Vista 3.6s 2011 MacOS 3.4s Trunk Vista 3s Trunk MacOS 24s <----
My testing showed that the Trunk took twice as long as 188.8.131.52 which is still pretty bad. One comment on doing JPEG benchmarking which I thought that I'd mention here: to use the script in the other bug, load the page in a browser and exit the browser so that the disk files are cached. Then open the target browser and run the page and you'll get the render time. Then open the baseline browser and run the page and find the percentage difference. The first run to cache the images will be very slow.
One thing that I love about Apple is that they put the developer tools on the install disk and that includes a code profiler. I think that most of the regression is due to the use of these two CoreGraphics routines: argb32_image_argb32 and argb32_image_mark They use 33% of the time based on my code sampling and don't show up in 184.108.40.206 code profiles. BTW, the top consumer on the 220.127.116.11 test was IDCT which I guess is expected. The next highest routines are in Huffman. It's interesting that one of the routines appears to be a memory allocation routine - might be some room for improvement there. Huffman itself is very hard to optimize. At any rate, something appears to have changed in the use of Apple low-level graphics from 2.0 to 3.0. I will attempt to get a stack dump of calls to this routine using Sampler (which I've never used before).
Interesting part is here that the format should actually be RGB24 (which is same as ARGB32, expect that the A byte is ignored).
Another big difference between FF2 and FF3 is that the imgContainer now keeps a copy of the source data. This is done a nsTArray<PRUint8>. This causes a realloc for every growth step (4K>8K>16K>32K...) until the whole source data is one big allocated chunk. This means that there are multiple realloc's to need to alloc a chunk big enough to keep the source data (which is later used to restore the image when it is discarded after a timeout). Allocating this mRestoreData as a 'segmented buffer' where each segment is separately allocated may be less constraining on realloc...
(In reply to comment #4) > Another big difference between FF2 and FF3 is that the imgContainer now keeps a > copy of the source data. This is done a nsTArray<PRUint8>. This causes a > realloc for every growth step (4K>8K>16K>32K...) until the whole source data is > one big allocated chunk. This means that there are multiple realloc's to need > to alloc a chunk big enough to keep the source data (which is later used to > restore the image when it is discarded after a timeout). > > Allocating this mRestoreData as a 'segmented buffer' where each segment is > separately allocated may be less constraining on realloc... That difference doesn't show up on Windows though. I took a 3 MB JPEG and converted it to a BMP and it came out to 22 MB. Adding a byte would make it about 28 MB. So any extra image processing for the source should be pretty small compared to the processing of the resulting four-byte bitmap.
40.3% CoreGraphics argb32_image_mark_rgb32 24.7% CoreGraphics resample_byte_h_3cpp_vector 8.5% CoreGraphics decode_byte_8bpc_3 6.0% CoreGraphics decode_data 5.1% XUL jpeg_idct_islow 4.1% XUL ycc_rgb_convert 2.8% XUL decode_mcu 1.1% XUL h2v1_fancy_upsample
So 80% of the time is spent in CoreGraphics, so of these 24s, only 5s are outside CoreGraphics... and nsJPEGDecoder itself doesn't show up in these stats
For those that have reported results, could you repost your results and hardware? I'm curious about the processor and graphics solution. My performance differential was 2x, CPU is Core 2 Duo (Merom) and graphics solution is nVidia 8600. The machine is a MacBook Pro purchased Summer 2007.
(In reply to comment #8) > For those that have reported results, could you repost your results and > hardware? I'm curious about the processor and graphics solution. > > My performance differential was 2x, CPU is Core 2 Duo (Merom) and graphics > solution is nVidia 8600. The machine is a MacBook Pro purchased Summer 2007. > Tests run on MacBook Pro Core 2 Duo (2.4Ghz) with ATI graphics. Also have a mac mini I can repeat tests with if it is helpful
Created attachment 297432 [details] alternate test driver One note about this test: it's testing image drawing speed while images are still being decoded. This is pretty different from the post-load case, as there's an optimization step that hasn't happened yet. Here's a driver for the same jpeg suite that's in the URL field, but which instead just tests pure decoding performance, and not rendering/scaling/etc. performance. That said, the mac was doing horrible things here; I added a workaround to make it do the same thing that happens on win32, and I caught an extra optimization that was missing on win32 as well. That patch will be coming up.
Created attachment 298078 [details] [diff] [review] optimize image loading on the mac Ok, here's the patch. This adds new Cairo Quartz API (equivalent to what exists on win32) so that we can load directly into a Quartz surface instead of always copying; the copy kills us in case we have to render the image multiple times while loading (or we try to render incrementally). This also optimizes win32 so that we draw from the native surface type and not the image surface. (In case anyone's wondering, note that a similar optimization is not possible for gtk2 -- we can only do this on win32/osx because the image data lives in the same process space. On gtk2/unix, due to the X11 disaster, we'll have to do a bunch of extra work to use XShm and that'll require some additional cairo surgery to make possible.)
Works fine on Windows. I noticed a small performance improvement of about a little over one percent. I would need a bigger test to get more accuracy.
Firefox crashes while starting up with this patch on OSX for me. I pulled the code yesterday afternoon and built it without the patch and it ran. After putting in the patch and starting it up, it does the compatibility check, puts Minefield in the bar at the top of the screen and the it closes and pops up a window asking me if I want to close, restart or do something else.
Can you get a stack trace for that? I'm having a hard time reproducing the crash -- I /think/ I've seen it once, but I didn't get any crash dialog or anything, so not sure what happened.
This was built with only -O1 and not with -g. If you need anything further, let me know. I can rebuild with -g if need be. #0 0x018242b3 in _cairo_quartz_get_image () #1 0x018244ee in _cairo_quartz_surface_acquire_source_image () #2 0x0181cb21 in _cairo_surface_acquire_source_image () #3 0x0184b64b in _cairo_surface_fallback_snapshot () #4 0x0181cdf0 in _cairo_surface_snapshot () #5 0x018236be in _init_pattern_with_snapshot () #6 0x01823941 in _cairo_quartz_cairo_repeating_surface_pattern_to_quartz () #7 0x018240b7 in _cairo_quartz_setup_source () #8 0x01824f25 in _cairo_quartz_surface_fill () #9 0x0181d550 in _cairo_surface_fill () #10 0x01843e53 in _cairo_gstate_fill () #11 0x018165b7 in _moz_cairo_fill_preserve () #12 0x017dc322 in gfxContext::Fill () #13 0x01723a84 in nsThebesImage::ThebesDrawTile () #14 0x01726970 in nsThebesRenderingContext::DrawTile () #15 0x01197379 in nsCSSRendering::PaintBackgroundWithSC () #16 0x0119744f in nsCSSRendering::PaintBackground () #17 0x0119b7fb in nsDisplayBackground::Paint () #18 0x0119a196 in nsDisplayList::Paint () #19 0x0119a1d3 in nsDisplayWrapList::Paint () #20 0x0119a24f in nsDisplayClip::Paint () #21 0x0119a196 in nsDisplayList::Paint () #22 0x011a9e22 in nsLayoutUtils::PaintFrame () #23 0x011aee7c in PresShell::Paint () #24 0x01419c0e in nsViewManager::RenderViews () #25 0x01419f95 in nsViewManager::Refresh () #26 0x0141ab6c in nsViewManager::DispatchEvent () #27 0x014151e2 in HandleEvent () #28 0x017413b3 in nsChildView::DispatchEvent () #29 0x0173f364 in nsChildView::DispatchWindowEvent () #30 0x01745349 in -[ChildView drawRect:] () #31 0x9331a3d1 in -[NSView _drawRect:clip:] () #32 0x9331942b in -[NSView _recursiveDisplayAllDirtyWithLockFocus:visRect:] () #33 0x9332b38f in _recursiveDisplayInRect2 () #34 0x9083ead0 in CFArrayApplyFunction () #35 0x93319633 in -[NSView _recursiveDisplayAllDirtyWithLockFocus:visRect:] () #36 0x9332b38f in _recursiveDisplayInRect2 () #37 0x9083ead0 in CFArrayApplyFunction () #38 0x93319633 in -[NSView _recursiveDisplayAllDirtyWithLockFocus:visRect:] () #39 0x93318493 in -[NSView _recursiveDisplayRectIfNeededIgnoringOpacity:isVisibleRect:rectIsVisibleRectForView:topView:] () #40 0x93317b98 in -[NSThemeFrame _recursiveDisplayRectIfNeededIgnoringOpacity:isVisibleRect:rectIsVisibleRectForView:topView:] () #41 0x93317382 in -[NSView _displayRectIgnoringOpacity:isVisibleRect:rectIsVisibleRectForView:] () #42 0x93316cae in -[NSView displayIfNeeded] () #43 0x93316a52 in -[NSWindow displayIfNeeded] () #44 0x932bc398 in -[NSWindow _reallyDoOrderWindow:relativeTo:findKey:forCounter:force:isModal:] () #45 0x932bbe7e in -[NSWindow orderWindow:relativeTo:] () #46 0x93312fc8 in -[NSWindow makeKeyAndOrderFront:] () #47 0x0173a970 in nsCocoaWindow::Show () #48 0x015ff95a in nsXULWindow::SetVisibility () #49 0x0160100f in nsXULWindow::OnChromeLoaded () #50 0x01604cdd in nsWebShellWindow::OnStateChange () #51 0x015bd108 in nsDocLoader::FireOnStateChange () #52 0x015bd4a9 in nsDocLoader::doStopDocumentLoad () #53 0x015bd598 in nsDocLoader::DocLoaderIsEmpty () #54 0x015bd9c7 in nsDocLoader::OnStopRequest () #55 0x01071202 in nsLoadGroup::RemoveRequest () #56 0x0116a7da in imgRequestProxy::RemoveFromLoadGroup () #57 0x0116a851 in imgRequestProxy::OnStopRequest () #58 0x01169fd4 in imgRequest::OnStopRequest () #59 0x01166f1e in ProxyListener::OnStopRequest () #60 0x01067659 in nsBaseChannel::OnStopRequest () #61 0x0106ce9c in nsInputStreamPump::OnStateStop () #62 0x0106d30c in nsInputStreamPump::OnInputStreamReady () #63 0x01959f6b in nsInputStreamReadyEvent::Run () #64 0x017a856c in nsThread::ProcessNextEvent () #65 0x01775e3a in NS_ProcessPendingEvents_P () #66 0x0174fbd5 in nsBaseAppShell::NativeEventCallback () #67 0x017369fc in nsAppShell::ProcessGeckoEvents () #68 0x9082cf32 in CFRunLoopRunSpecific () #69 0x9082ca6e in CFRunLoopRunInMode () #70 0x92df4878 in RunCurrentEventLoopInMode () #71 0x92df3eb9 in ReceiveNextEventCommon () #72 0x92df3dd9 in BlockUntilNextEventMatchingListInMode () #73 0x9329b485 in _DPSNextEvent () #74 0x9329b076 in -[NSApplication nextEventMatchingMask:untilDate:inMode:dequeue:] () #75 0x93294dfb in -[NSApplication run] () #76 0x017365a9 in nsAppShell::Run () #77 0x016140d3 in nsAppStartup::Run () #78 0x0101133f in XRE_main () #79 0x00002e02 in main ()
Ah, that gives me an idea of what's crashing; now the /why/ is a different issue (NULL surface being passed to cairo_quartz_surface_get_image). I'll track a few things down.
Created attachment 298845 [details] [diff] [review] patch, now with check and warning Ok, give this one a shot -- the only change is checking for NULL and printing a NS_WARNING in that case, and returning nsnull.
Created attachment 298849 [details] [diff] [review] one more Whoops, found one more issue. This is probably the real crash-causer, managed to reproduce it locally.
The latest patch doesn't crash (from my tests) and gets performance of about six percent slower than 18.104.22.168 which is a considerable improvement from before the patch.
(I'll land this tomorrow -- I have a locally merged version that comes after the cairo update that I'm going to land first)
Woot! 6% slower still sucks, but less so.
Alfred's next change should result in a net improvement in performance over 22.214.171.124.
Sure, but is that 6% recoverable on top of that?
No. The current patch for this doesn't apply due to other code changes.
Sorry, I have an updated/merged patch that I was going to check in, but was waiting for the tree to quiet down.
Could you email it to me (email@example.com)? I'm trying to do some performance tests on idct changes.
Created attachment 299528 [details] [diff] [review] updated patch Here's an updated patch; should apply cleanly. I'll get it checked in later tonight/tomorrow.
Comment on attachment 299528 [details] [diff] [review] updated patch wrong patch
Just checked in.
I backed this out because it apparently caused a really weird mochikit failure on the *linux* tinderbox. I'm having a very hard time believing that this is this patch's fault, but backing it out fixed the issue.
I saw the performance improvement with the earlier patch on the trunk on another maching but the machine that I've been using for the past few days has the trunk quite a bit slower than 3.0 beta 2. I'm not completely sure as to whether or not I have the patch in. I did pull the trunk last night but it may have been after you backed out the patch.
Relanded; the gremlins went away.
Hit enter too soon -- Michael, I would suggest that you use the alternate test driver (first attachment here) for benchmarking pure jpeg decoding performance (e.g. the IDCT test) -- there's no need to include rendering as part of those measurements.
I think that it's useful to test the complete time as that's what the user sees and it covers regressions. At the moment, the trunk seems to be 50% slower than Beta 2. Before, I think it was comparable to Beta 2. So I'm a bit worried that there's another problem. It's hard to tell given the fluidity of the code though. I have the idct code working on the Mac and shark reports a nice improvement in the amount of CPU spent in idct. I will look at your alternate test if I can get a hold of an OSX machine. The one that I was using is at college with my son right now.
(In reply to comment #35) > I think that it's useful to test the complete time as that's what the user sees > and it covers regressions. At the moment, the trunk seems to be 50% slower than > Beta 2. Before, I think it was comparable to Beta 2. So I'm a bit worried that > there's another problem. It's hard to tell given the fluidity of the code > though. Hm, I'm not sure what you mean -- you mean that with this patch (that is, current CVS trunk), it's 50% slower than b2? Or firefox 2? And that the previous time you tried the patch, it was comparable to b2 (fx2?)?
I think that the trunk is slower than B2 now. Not sure if your patch was in or not. I'm going to try to pull the trunk and build tonight and compare with B2.
Does bug 408029 bear any relation to the performance issues discussed here ? bug 408029 deals with slow downs / unresponsiveness when loading very large .jpg images. My humble tests show a slight improvement after the landing of this patch on Intel Macs, but not much so on PPC macs.
(In reply to comment #37) > I think that the trunk is slower than B2 now. Not sure if your patch was in or > not. I'm going to try to pull the trunk and build tonight and compare with B2. > Try with tomorrow's trunk (has this patch for sure). I will as well and will report results.
Just a note - I believe this is good for a 6% Tp win: http://graphs.mozilla.org/#spst=range&spss=1201547268.4328225&spse=1201561889.6438231&spstart=1200612371&spend=1201566190&bpst=Cursor&bpstart=1201538271&bpend=1201561889.6438231&m1tid=103002&m1bl=0&m1avg=0&m2tid=103067&m2bl=0&m2avg=0&m3tid=103117&m3bl=0&m3avg=0&m4tid=102925&m4bl=0&m4avg=0 Shows on mac only.
(In reply to comment #37) > I think that the trunk is slower than B2 now. Not sure if your patch was in or > not. I'm going to try to pull the trunk and build tonight and compare with B2. Just to confirm -- you are saying that the trunk is slower than *Firefox 3 Beta 2* (i.e. the previous beta), not *Firefox 2* ?
(In reply to comment #40) > Just a note - I believe this is good for a 6% Tp win: Neat! The previous dip is when the patch was briefly landed before I backed it out, I never even got around to looking at the numbers back then.
Sounds like a good time to do a build and try it out then.
I keep getting this while trying to build. I'm going to pull later this morning. make: *** No rule to make target `nsUnicodeToMacGurmukhi.cpp', needed by `nsUnicodeToMacGurmukhi.o'. Stop. make: *** Waiting for unfinished jobs.... nsUnicodeToMacGujarati.cpp
Today's Trunk. Test1 Test2 FF3 6.63 2.151 FF2 4.53 1.897 1.46x slower 1.13x slower Vista FF3 1.17 1.08 FF2 1.313 1.3 1.12x faster 1.2x faster This is on different hw so Vista/Mac numbers are not directly comparable. But something is still wrong on the mac.
(In reply to comment #45) > Today's Trunk. > > Test1 Test2 > FF3 6.63 2.151 > FF2 4.53 1.897 > 1.46x slower 1.13x slower > > Vista > > FF3 1.17 1.08 > FF2 1.313 1.3 > 1.12x faster 1.2x faster > > This is on different hw so Vista/Mac numbers are not directly comparable. But > something is still wrong on the mac. > On Same hardware Vista: FF3 1.65-1.84s 1400ms FF2 2s 1600ms So something is clearly wrong both on the JPEG side (test 2) and on the painting/scaling side (test 1) since both are faster on Windows and slower on mac (Test 1 by a much wider margin)
Windows has SSE2 IDCT so it gets a bump there on the JPEG side. I've always felt that you can't compare Mac numbers to Vista numbers because of all of the OS and GUI differences but the same OS comparison are definitely useful. I didn't get a chance to try a build this morning and will try to get the hardware tonight. Did you get a chance to run a test on Beta 2?
On Windows, the image scaling code is going through pixman. On the mac we're going through quartz. We should really split this bug up into two separate ones -- the JPEG decoding speed (sans drawing) is a separate beast from image scaling.
(In reply to comment #47) > Windows has SSE2 IDCT so it gets a bump there on the JPEG side. I've always > felt that you can't compare Mac numbers to Vista numbers because of all of the > OS and GUI differences but the same OS comparison are definitely useful. > Why don't we have SSE2 for the Mac?
> Why don't we have SSE2 for the Mac? The idct code is on my son's computer. The color code is waiting for Alfred's 10% checkin. I can do the sampling code but it's only a small improvement.
Here are my results on a number of builds. The Trunk before Vlad's change was around 18 or 19 seconds. Trunk: 13.2 seconds Trunk+IDCT: 11.2 seconds Trunk+IDCT+ Alfred: 10.6 seconds Trunk+IDCT+ Alfred+ jdcolor: 10.0 seconds Beta2: 12.3 seconds 126.96.36.199: 7.5 seconds mmoy 188.8.131.52 3.9 seconds At the moment, Shark shows about 30% in decode_mcu, 16% in idct, 15% in ycc_rgb_convert_argb (Alfred's new routine), 9% in h2v1_fancy_upsample, 6% in jpeg_fill_bit_buffer, $% in h2v2_fancy_upsample and the rest is trivial. It appears to me that the OS stuff to paint the windows (Quartz?) doesn't get tracked by Shark. It's also interesting that the time difference between Trunk to Trunk+IDCT+Alfred+jdcolor is pretty close to the difference between 184.108.40.206 and mmoy 220.127.116.11. It looks like more work is needed on the rendering side.
Test1=test that scales and paints the images Test2=test that just decodes the images Branch Test1 4.6s Test2 1.9s 2-5-2008 trunk Test1 6.2s Test2 2.1s 2-6-2008 trunk Test1 3.2s Test2 2.1s So Trunk scaling/painting is now ~2x faster than yesterday and 1.43x faster than branch (w00t) Trunk JPEG decoding is 1.10x slower than branch
Going to resolve this -- we should probably file a new bug for the pure jpeg decoding test (or is there an existing bug about getting the MMX etc. stuff working on the mac)?
One cause of being slower than FF2 in jpeg decoding is that the image source data is now stored for the 'discard/restore' function. This is slow because one nsTArray is used for the complete image source data, and adding every segment means reallocs of a large piece of memory (which will result in alloc/memcpy/free).
reallocs on mac and windows (with jemalloc) at least are quite fast.
But not when the chunk is already large (say 511KB, and another 4K needs to be added, and there is no room to do after that chunk), then whatever malloc library is used, it needs to allocate some new space, copy the 511KB of data to the new space, and free the old space. A simple test would be to disable the discarding (just do setenv MOZ_DISABLE_IMAGE_DISCARD="1"), and do the same performance tests on the decoding.
If the server sends a content-length for an image, shouldn't we use that as an initial alloc size (assuming the size is reasonable)?