412396 - JPEG decoding/painting 7x slower on trunk v.s. branch on mac

Reporter

Description

•

17 years ago

Spinoff of 411379.

STR:

1) Download the testcase at the URL
2) Load on today's trunk on MacOS 
3) Load on 2.0.0.x on MacOS

Note trunk takes ~7x as long as branch (on same hardware):

2011 Vista 3.6s
2011 MacOS 3.4s
Trunk Vista 3s
Trunk MacOS 24s <----

Mike Schroepfer

Reporter

Updated

•

17 years ago

Flags: blocking1.9+

Priority: -- → P2

Mike Schroepfer

Reporter

Updated

•

17 years ago

Blocks: 411379

Jeff Walden [:Waldo]

Updated

•

17 years ago

URL: people.mozilla.com/~schrep/jpeg.zip → http://people.mozilla.com/~schrep/jpe...

Jesse Ruderman

Updated

•

17 years ago

Keywords: perf, regression

Michael Moy

Comment 1

•

17 years ago

My testing showed that the Trunk took twice as long as 2.0.0.11 which is still pretty bad.

One comment on doing JPEG benchmarking which I thought that I'd mention here: to use the script in the other bug, load the page in a browser and exit the browser so that the disk files are cached. Then open the target browser and run the page and you'll get the render time. Then open the baseline browser and run the page and find the percentage difference. The first run to cache the images will be very slow.

Michael Moy

Comment 2

•

17 years ago

One thing that I love about Apple is that they put the developer tools on the install disk and that includes a code profiler.

I think that most of the regression is due to the use of these two CoreGraphics routines:

    argb32_image_argb32 and argb32_image_mark

They use 33% of the time based on my code sampling and don't show up in 2.0.0.11 code profiles. BTW, the top consumer on the 2.0.0.11 test was IDCT which I guess is expected. The next highest routines are in Huffman. It's interesting that one of the routines appears to be a memory allocation routine - might be some room for improvement there. Huffman itself is very hard to optimize.

At any rate, something appears to have changed in the use of Apple low-level graphics from 2.0 to 3.0. I will attempt to get a stack dump of calls to this routine using Sampler (which I've never used before).

Alfred Kayser

Comment 3

•

17 years ago

Interesting part is here that the format should actually be RGB24 (which is same as ARGB32, expect that the A byte is ignored).

Alfred Kayser

Comment 4

•

17 years ago

Another big difference between FF2 and FF3 is that the imgContainer now keeps a copy of the source data. This is done a nsTArray<PRUint8>. This causes a realloc for every growth step (4K>8K>16K>32K...) until the whole source data is one big allocated chunk. This means that there are multiple realloc's to need to alloc a chunk big enough to keep the source data (which is later used to restore the image when it is discarded after a timeout).

Allocating this mRestoreData as a 'segmented buffer' where each segment is separately allocated may be less constraining on realloc...

Michael Moy

Comment 5

•

17 years ago

(In reply to comment #4)
> Another big difference between FF2 and FF3 is that the imgContainer now keeps a
> copy of the source data. This is done a nsTArray<PRUint8>. This causes a
> realloc for every growth step (4K>8K>16K>32K...) until the whole source data is
> one big allocated chunk. This means that there are multiple realloc's to need
> to alloc a chunk big enough to keep the source data (which is later used to
> restore the image when it is discarded after a timeout).
> 
> Allocating this mRestoreData as a 'segmented buffer' where each segment is
> separately allocated may be less constraining on realloc...

That difference doesn't show up on Windows though.

I took a 3 MB JPEG and converted it to a BMP and it came out to 22 MB. Adding a byte would make it about 28 MB. So any extra image processing for the source should be pretty small compared to the processing of the resulting four-byte bitmap.

Vladimir Vukicevic [:vlad] [:vladv] (needinfo me, slow to respond)

Assignee

Updated

•

17 years ago

Assignee: nobody → vladimir

Robert Sayre

Comment 6

•

17 years ago

40.3% CoreGraphics argb32_image_mark_rgb32
24.7% CoreGraphics resample_byte_h_3cpp_vector
8.5%  CoreGraphics decode_byte_8bpc_3
6.0%  CoreGraphics decode_data
5.1%  XUL          jpeg_idct_islow
4.1%  XUL          ycc_rgb_convert
2.8%  XUL          decode_mcu
1.1%  XUL          h2v1_fancy_upsample

Alfred Kayser

Comment 7

•

17 years ago

So 80% of the time is spent in CoreGraphics, so of these 24s, only 5s are outside CoreGraphics... and nsJPEGDecoder itself doesn't show up in these stats

Michael Moy

Comment 8

•

17 years ago

For those that have reported results, could you repost your results and hardware? I'm curious about the processor and graphics solution. 

My performance differential was 2x, CPU is Core 2 Duo (Merom) and graphics solution is nVidia 8600. The machine is a MacBook Pro purchased Summer 2007.

Mike Schroepfer

Reporter

Comment 9

•

17 years ago

(In reply to comment #8)
> For those that have reported results, could you repost your results and
> hardware? I'm curious about the processor and graphics solution. 
> 
> My performance differential was 2x, CPU is Core 2 Duo (Merom) and graphics
> solution is nVidia 8600. The machine is a MacBook Pro purchased Summer 2007.
> 

Tests run on MacBook Pro Core 2 Duo (2.4Ghz) with ATI graphics.

Also have a mac mini I can repeat tests with if it is helpful

Vladimir Vukicevic [:vlad] [:vladv] (needinfo me, slow to respond)

Assignee

Comment 10

•

17 years ago

Attached file alternate test driver — Details

One note about this test: it's testing image drawing speed while images are still being decoded.  This is pretty different from the post-load case, as there's an optimization step that hasn't happened yet.

Here's a driver for the same jpeg suite that's in the URL field, but which instead just tests pure decoding performance, and not rendering/scaling/etc. performance.

That said, the mac was doing horrible things here; I added a workaround to make it do the same thing that happens on win32, and I caught an extra optimization that was missing on win32 as well.  That patch will be coming up.

Vladimir Vukicevic [:vlad] [:vladv] (needinfo me, slow to respond)

Assignee

Comment 11

•

17 years ago

Attached patch optimize image loading on the mac (obsolete) — Details — Splinter Review

Ok, here's the patch.  This adds new Cairo Quartz API (equivalent to what exists on win32) so that we can load directly into a Quartz surface instead of always copying; the copy kills us in case we have to render the image multiple times while loading (or we try to render incrementally).

This also optimizes win32 so that we draw from the native surface type and not the image surface.

(In case anyone's wondering, note that a similar optimization is not possible for gtk2 -- we can only do this on win32/osx because the image data lives in the same process space.  On gtk2/unix, due to the X11 disaster, we'll have to do a bunch of extra work to use XShm and that'll require some additional cairo surgery to make possible.)

Attachment #298078 - Flags: review?(pavlov)

Michael Moy

Comment 12

•

17 years ago

Works fine on Windows. I noticed a small performance improvement of about a little over one percent. I would need a bigger test to get more accuracy.

Michael Moy

Comment 13

•

17 years ago

Firefox crashes while starting up with this patch on OSX for me. I pulled the code yesterday afternoon and built it without the patch and it ran. After putting in the patch and starting it up, it does the compatibility check, puts Minefield in the bar at the top of the screen and the it closes and pops up a window asking me if I want to close, restart or do something else.

Vladimir Vukicevic [:vlad] [:vladv] (needinfo me, slow to respond)

Assignee

Comment 14

•

17 years ago

Can you get a stack trace for that?  I'm having a hard time reproducing the crash -- I /think/ I've seen it once, but I didn't get any crash dialog or anything, so not sure what happened.

Michael Moy

Comment 15

•

17 years ago

This was built with only -O1 and not with -g. If you need anything further, let me know. I can rebuild with -g if need be.

#0  0x018242b3 in _cairo_quartz_get_image ()
#1  0x018244ee in _cairo_quartz_surface_acquire_source_image ()
#2  0x0181cb21 in _cairo_surface_acquire_source_image ()
#3  0x0184b64b in _cairo_surface_fallback_snapshot ()
#4  0x0181cdf0 in _cairo_surface_snapshot ()
#5  0x018236be in _init_pattern_with_snapshot ()
#6  0x01823941 in _cairo_quartz_cairo_repeating_surface_pattern_to_quartz ()
#7  0x018240b7 in _cairo_quartz_setup_source ()
#8  0x01824f25 in _cairo_quartz_surface_fill ()
#9  0x0181d550 in _cairo_surface_fill ()
#10 0x01843e53 in _cairo_gstate_fill ()
#11 0x018165b7 in _moz_cairo_fill_preserve ()
#12 0x017dc322 in gfxContext::Fill ()
#13 0x01723a84 in nsThebesImage::ThebesDrawTile ()
#14 0x01726970 in nsThebesRenderingContext::DrawTile ()
#15 0x01197379 in nsCSSRendering::PaintBackgroundWithSC ()
#16 0x0119744f in nsCSSRendering::PaintBackground ()
#17 0x0119b7fb in nsDisplayBackground::Paint ()
#18 0x0119a196 in nsDisplayList::Paint ()
#19 0x0119a1d3 in nsDisplayWrapList::Paint ()
#20 0x0119a24f in nsDisplayClip::Paint ()
#21 0x0119a196 in nsDisplayList::Paint ()
#22 0x011a9e22 in nsLayoutUtils::PaintFrame ()
#23 0x011aee7c in PresShell::Paint ()
#24 0x01419c0e in nsViewManager::RenderViews ()
#25 0x01419f95 in nsViewManager::Refresh ()
#26 0x0141ab6c in nsViewManager::DispatchEvent ()
#27 0x014151e2 in HandleEvent ()
#28 0x017413b3 in nsChildView::DispatchEvent ()
#29 0x0173f364 in nsChildView::DispatchWindowEvent ()
#30 0x01745349 in -[ChildView drawRect:] ()
#31 0x9331a3d1 in -[NSView _drawRect:clip:] ()
#32 0x9331942b in -[NSView _recursiveDisplayAllDirtyWithLockFocus:visRect:] ()
#33 0x9332b38f in _recursiveDisplayInRect2 ()
#34 0x9083ead0 in CFArrayApplyFunction ()
#35 0x93319633 in -[NSView _recursiveDisplayAllDirtyWithLockFocus:visRect:] ()
#36 0x9332b38f in _recursiveDisplayInRect2 ()
#37 0x9083ead0 in CFArrayApplyFunction ()
#38 0x93319633 in -[NSView _recursiveDisplayAllDirtyWithLockFocus:visRect:] ()
#39 0x93318493 in -[NSView _recursiveDisplayRectIfNeededIgnoringOpacity:isVisibleRect:rectIsVisibleRectForView:topView:] ()
#40 0x93317b98 in -[NSThemeFrame _recursiveDisplayRectIfNeededIgnoringOpacity:isVisibleRect:rectIsVisibleRectForView:topView:] ()
#41 0x93317382 in -[NSView _displayRectIgnoringOpacity:isVisibleRect:rectIsVisibleRectForView:] ()
#42 0x93316cae in -[NSView displayIfNeeded] ()
#43 0x93316a52 in -[NSWindow displayIfNeeded] ()
#44 0x932bc398 in -[NSWindow _reallyDoOrderWindow:relativeTo:findKey:forCounter:force:isModal:] ()
#45 0x932bbe7e in -[NSWindow orderWindow:relativeTo:] ()
#46 0x93312fc8 in -[NSWindow makeKeyAndOrderFront:] ()
#47 0x0173a970 in nsCocoaWindow::Show ()
#48 0x015ff95a in nsXULWindow::SetVisibility ()
#49 0x0160100f in nsXULWindow::OnChromeLoaded ()
#50 0x01604cdd in nsWebShellWindow::OnStateChange ()
#51 0x015bd108 in nsDocLoader::FireOnStateChange ()
#52 0x015bd4a9 in nsDocLoader::doStopDocumentLoad ()
#53 0x015bd598 in nsDocLoader::DocLoaderIsEmpty ()
#54 0x015bd9c7 in nsDocLoader::OnStopRequest ()
#55 0x01071202 in nsLoadGroup::RemoveRequest ()
#56 0x0116a7da in imgRequestProxy::RemoveFromLoadGroup ()
#57 0x0116a851 in imgRequestProxy::OnStopRequest ()
#58 0x01169fd4 in imgRequest::OnStopRequest ()
#59 0x01166f1e in ProxyListener::OnStopRequest ()
#60 0x01067659 in nsBaseChannel::OnStopRequest ()
#61 0x0106ce9c in nsInputStreamPump::OnStateStop ()
#62 0x0106d30c in nsInputStreamPump::OnInputStreamReady ()
#63 0x01959f6b in nsInputStreamReadyEvent::Run ()
#64 0x017a856c in nsThread::ProcessNextEvent ()
#65 0x01775e3a in NS_ProcessPendingEvents_P ()
#66 0x0174fbd5 in nsBaseAppShell::NativeEventCallback ()
#67 0x017369fc in nsAppShell::ProcessGeckoEvents ()
#68 0x9082cf32 in CFRunLoopRunSpecific ()
#69 0x9082ca6e in CFRunLoopRunInMode ()
#70 0x92df4878 in RunCurrentEventLoopInMode ()
#71 0x92df3eb9 in ReceiveNextEventCommon ()
#72 0x92df3dd9 in BlockUntilNextEventMatchingListInMode ()
#73 0x9329b485 in _DPSNextEvent ()
#74 0x9329b076 in -[NSApplication nextEventMatchingMask:untilDate:inMode:dequeue:] ()
#75 0x93294dfb in -[NSApplication run] ()
#76 0x017365a9 in nsAppShell::Run ()
#77 0x016140d3 in nsAppStartup::Run ()
#78 0x0101133f in XRE_main ()
#79 0x00002e02 in main ()

Vladimir Vukicevic [:vlad] [:vladv] (needinfo me, slow to respond)

Assignee

Comment 16

•

17 years ago

Ah, that gives me an idea of what's crashing; now the /why/ is a different issue (NULL surface being passed to cairo_quartz_surface_get_image).  I'll track a few things down.

Vladimir Vukicevic [:vlad] [:vladv] (needinfo me, slow to respond)

Assignee

Comment 17

•

17 years ago

Attached patch patch, now with check and warning (obsolete) — Details — Splinter Review

Ok, give this one a shot -- the only change is checking for NULL and printing a NS_WARNING in that case, and returning nsnull.

Attachment #298078 - Attachment is obsolete: true

Attachment #298845 - Flags: review?(pavlov)

Attachment #298078 - Flags: review?(pavlov)

Vladimir Vukicevic [:vlad] [:vladv] (needinfo me, slow to respond)

Assignee

Comment 18

•

17 years ago

Attached patch one more — Details — Splinter Review

Whoops, found one more issue.  This is probably the real crash-causer, managed to reproduce it locally.

Attachment #298845 - Attachment is obsolete: true

Attachment #298849 - Flags: review?(pavlov)

Attachment #298845 - Flags: review?(pavlov)

Stuart Parmenter

Updated

•

17 years ago

Attachment #298849 - Flags: review?(pavlov) → review+

Michael Moy

Comment 19

•

17 years ago

The latest patch doesn't crash (from my tests) and gets performance of about six percent slower than 2.0.0.11 which is a considerable improvement from before the patch.

Mike Schroepfer

Reporter

Updated

•

17 years ago

Keywords: checkin-needed

Vladimir Vukicevic [:vlad] [:vladv] (needinfo me, slow to respond)

Assignee

Comment 20

•

17 years ago

(I'll land this tomorrow -- I have a locally merged version that comes after the cairo update that I'm going to land first)

jag (Peter Annema)

Comment 21

•

17 years ago

Woot! 6% slower still sucks, but less so.

Michael Moy

Comment 22

•

17 years ago

Alfred's next change should result in a net improvement in performance over 2.0.0.11.

jag (Peter Annema)

Comment 23

•

17 years ago

Sure, but is that 6% recoverable on top of that?

Michael Moy

Comment 24

•

17 years ago

No.

The current patch for this doesn't apply due to other code changes.

Vladimir Vukicevic [:vlad] [:vladv] (needinfo me, slow to respond)

Assignee

Comment 25

•

17 years ago

Sorry, I have an updated/merged patch that I was going to check in, but was waiting for the tree to quiet down.

Michael Moy

Comment 26

•

17 years ago

Could you email it to me (mmoy@yahoo.com)? I'm trying to do some performance tests on idct changes.

Vladimir Vukicevic [:vlad] [:vladv] (needinfo me, slow to respond)

Assignee

Comment 27

•

17 years ago

Attached patch updated patch (obsolete) — Details — Splinter Review

Here's an updated patch; should apply cleanly. I'll get it checked in later tonight/tomorrow.

Vladimir Vukicevic [:vlad] [:vladv] (needinfo me, slow to respond)

Assignee

Comment 28

•

17 years ago

Comment on attachment 299528 [details] [diff] [review]
updated patch

wrong patch

Attachment #299528 - Attachment is obsolete: true

Vladimir Vukicevic [:vlad] [:vladv] (needinfo me, slow to respond)

Assignee

Comment 29

•

17 years ago

Just checked in.

Status: NEW → RESOLVED

Closed: 17 years ago

Resolution: --- → FIXED

Vladimir Vukicevic [:vlad] [:vladv] (needinfo me, slow to respond)

Assignee

Comment 30

•

17 years ago

I backed this out because it apparently caused a really weird mochikit failure on the *linux* tinderbox.  I'm having a very hard time believing that this is this patch's fault, but backing it out fixed the issue.

Status: RESOLVED → REOPENED

Resolution: FIXED → ---

Vladimir Vukicevic [:vlad] [:vladv] (needinfo me, slow to respond)

Assignee

Comment 31

•

17 years ago

Attached patch the backout patch (reapply with -R) — Details — Splinter Review

Michael Moy

Comment 32

•

17 years ago

I saw the performance improvement with the earlier patch on the trunk on another maching but the machine that I've been using for the past few days has the trunk quite a bit slower than 3.0 beta 2. I'm not completely sure as to whether or not I have the patch in. I did pull the trunk last night but it may have been after you backed out the patch.

Vladimir Vukicevic [:vlad] [:vladv] (needinfo me, slow to respond)

Assignee

Comment 33

•

17 years ago

Relanded; the gremlins went away.

Status: REOPENED → RESOLVED

Closed: 17 years ago → 17 years ago

Resolution: --- → FIXED

Vladimir Vukicevic [:vlad] [:vladv] (needinfo me, slow to respond)

Assignee

Comment 34

•

17 years ago

Hit enter too soon -- Michael, I would suggest that you use the alternate test driver (first attachment here) for benchmarking pure jpeg decoding performance (e.g. the IDCT test) -- there's no need to include rendering as part of those measurements.

Michael Moy

Comment 35

•

17 years ago

I think that it's useful to test the complete time as that's what the user sees and it covers regressions. At the moment, the trunk seems to be 50% slower than Beta 2. Before, I think it was comparable to Beta 2. So I'm a bit worried that there's another problem. It's hard to tell given the fluidity of the code though.

I have the idct code working on the Mac and shark reports a nice improvement in the amount of CPU spent in idct. I will look at your alternate test if I can get a hold of an OSX machine. The one that I was using is at college with my son right now.

Reed Loden [:reed]

Updated

•

17 years ago

Keywords: checkin-needed

Vladimir Vukicevic [:vlad] [:vladv] (needinfo me, slow to respond)

Assignee

Comment 36

•

17 years ago

(In reply to comment #35)
> I think that it's useful to test the complete time as that's what the user sees
> and it covers regressions. At the moment, the trunk seems to be 50% slower than
> Beta 2. Before, I think it was comparable to Beta 2. So I'm a bit worried that
> there's another problem. It's hard to tell given the fluidity of the code
> though.

Hm, I'm not sure what you mean -- you mean that with this patch (that is, current CVS trunk), it's 50% slower than b2?  Or firefox 2?  And that the previous time you tried the patch, it was comparable to b2 (fx2?)?

Michael Moy

Comment 37

•

17 years ago

I think that the trunk is slower than B2 now. Not sure if your patch was in or not. I'm going to try to pull the trunk and build tonight and compare with B2.

philippe (part-time)

Comment 38

•

17 years ago

Does bug 408029 bear any relation to the performance issues discussed here ?

bug 408029 deals with slow downs / unresponsiveness when loading very large .jpg images. My humble tests show a slight improvement after the landing of this patch on Intel Macs, but not much so on PPC macs.

Mike Schroepfer

Reporter

Comment 39

•

17 years ago

(In reply to comment #37)
> I think that the trunk is slower than B2 now. Not sure if your patch was in or
> not. I'm going to try to pull the trunk and build tonight and compare with B2.
> 

Try with tomorrow's trunk (has this patch for sure).  I will as well and will report results.

Mike Schroepfer

Reporter

Comment 40

•

17 years ago

Just a note - I believe this is good for a 6% Tp win:

http://graphs.mozilla.org/#spst=range&spss=1201547268.4328225&spse=1201561889.6438231&spstart=1200612371&spend=1201566190&bpst=Cursor&bpstart=1201538271&bpend=1201561889.6438231&m1tid=103002&m1bl=0&m1avg=0&m2tid=103067&m2bl=0&m2avg=0&m3tid=103117&m3bl=0&m3avg=0&m4tid=102925&m4bl=0&m4avg=0

Shows on mac only.

Vladimir Vukicevic [:vlad] [:vladv] (needinfo me, slow to respond)

Assignee

Comment 41

•

17 years ago

(In reply to comment #37)
> I think that the trunk is slower than B2 now. Not sure if your patch was in or
> not. I'm going to try to pull the trunk and build tonight and compare with B2.

Just to confirm -- you are saying that the trunk is slower than *Firefox 3 Beta 2* (i.e. the previous beta), not *Firefox 2* ?

Vladimir Vukicevic [:vlad] [:vladv] (needinfo me, slow to respond)

Assignee

Comment 42

•

17 years ago

(In reply to comment #40)
> Just a note - I believe this is good for a 6% Tp win:

Neat!  The previous dip is when the patch was briefly landed before I backed it out, I never even got around to looking at the numbers back then.

Michael Moy

Comment 43

•

17 years ago

Sounds like a good time to do a build and try it out then.

Michael Moy

Comment 44

•

17 years ago

I keep getting this while trying to build. I'm going to pull later this morning.

make[6]: *** No rule to make target `nsUnicodeToMacGurmukhi.cpp', needed by `nsUnicodeToMacGurmukhi.o'.  Stop.
make[6]: *** Waiting for unfinished jobs....
nsUnicodeToMacGujarati.cpp

Mike Schroepfer

Reporter

Comment 45

•

17 years ago

Today's Trunk.

        Test1            Test2
FF3     6.63             2.151
FF2     4.53              1.897
        1.46x slower     1.13x slower

Vista

FF3     1.17            1.08
FF2     1.313           1.3
        1.12x faster    1.2x faster

This is on different hw so Vista/Mac numbers are not directly comparable.  But something is still wrong on the mac.

Status: RESOLVED → REOPENED

Resolution: FIXED → ---

Mike Schroepfer

Reporter

Comment 46

•

17 years ago

(In reply to comment #45)
> Today's Trunk.
> 
>         Test1            Test2
> FF3     6.63             2.151
> FF2     4.53              1.897
>         1.46x slower     1.13x slower
> 
> Vista
> 
> FF3     1.17            1.08
> FF2     1.313           1.3
>         1.12x faster    1.2x faster
> 
> This is on different hw so Vista/Mac numbers are not directly comparable.  But
> something is still wrong on the mac.
> 

On Same hardware Vista:

FF3    1.65-1.84s      1400ms
FF2    2s              1600ms

So something is clearly wrong both on the JPEG side (test 2) and on the painting/scaling side (test 1) since both are faster on Windows and slower on mac (Test 1 by a much wider margin)

Michael Moy

Comment 47

•

17 years ago

Windows has SSE2 IDCT so it gets a bump there on the JPEG side. I've always felt that you can't compare Mac numbers to Vista numbers because of all of the OS and GUI differences but the same OS comparison are definitely useful.

I didn't get a chance to try a build this morning and will try to get the hardware tonight. Did you get a chance to run a test on Beta 2?

Vladimir Vukicevic [:vlad] [:vladv] (needinfo me, slow to respond)

Assignee

Comment 48

•

17 years ago

On Windows, the image scaling code is going through pixman.  On the mac we're going through quartz.  We should really split this bug up into two separate ones -- the JPEG decoding speed (sans drawing) is a separate beast from image scaling.

Mike Schroepfer

Reporter

Comment 49

•

17 years ago

(In reply to comment #47)
> Windows has SSE2 IDCT so it gets a bump there on the JPEG side. I've always
> felt that you can't compare Mac numbers to Vista numbers because of all of the
> OS and GUI differences but the same OS comparison are definitely useful.
> 

Why don't we have SSE2 for the Mac?

Michael Moy

Comment 50

•

17 years ago

> Why don't we have SSE2 for the Mac?

The idct code is on my son's computer. The color code is waiting for Alfred's 10% checkin. I can do the sampling code but it's only a small improvement.

Vladimir Vukicevic [:vlad] [:vladv] (needinfo me, slow to respond)

Assignee

Updated

•

17 years ago

Depends on: 414685

Michael Moy

Comment 51

•

17 years ago

Here are my results on a number of builds. The Trunk before Vlad's change was around 18 or 19 seconds.

Trunk:       13.2 seconds
Trunk+IDCT:  11.2 seconds
Trunk+IDCT+
  Alfred:    10.6 seconds
Trunk+IDCT+
  Alfred+
  jdcolor:   10.0 seconds
Beta2:       12.3 seconds
2.0.0.11:     7.5 seconds
mmoy 2.0.0.6  3.9 seconds

At the moment, Shark shows about 30% in decode_mcu, 16% in idct, 15% in ycc_rgb_convert_argb (Alfred's new routine), 9% in h2v1_fancy_upsample, 6% in jpeg_fill_bit_buffer, $% in h2v2_fancy_upsample and the rest is trivial.

It appears to me that the OS stuff to paint the windows (Quartz?) doesn't get tracked by Shark. 

It's also interesting that the time difference between Trunk to Trunk+IDCT+Alfred+jdcolor is pretty close to the difference between 2.0.0.11 and mmoy 2.0.0.6.

It looks like more work is needed on the rendering side.

Mike Beltzner [:beltzner, not reading bugmail]

Updated

•

16 years ago

Target Milestone: --- → mozilla1.9beta4

Mike Schroepfer

Reporter

Comment 52

•

16 years ago

Test1=test that scales and paints the images 
Test2=test that just decodes the images

Branch
Test1 4.6s
Test2 1.9s

2-5-2008 trunk
Test1 6.2s
Test2 2.1s

2-6-2008 trunk
Test1 3.2s
Test2 2.1s

So Trunk scaling/painting is now ~2x faster than yesterday and 1.43x faster than branch (w00t)
Trunk JPEG decoding is 1.10x slower than branch

Vladimir Vukicevic [:vlad] [:vladv] (needinfo me, slow to respond)

Assignee

Comment 53

•

16 years ago

Going to resolve this -- we should probably file a new bug for the pure jpeg decoding test (or is there an existing bug about getting the MMX etc. stuff working on the mac)?

Status: REOPENED → RESOLVED

Closed: 17 years ago → 16 years ago

Resolution: --- → FIXED

Alfred Kayser

Comment 54

•

16 years ago

One cause of being slower than FF2 in jpeg decoding is that the image source data is now stored for the 'discard/restore' function. This is slow because one nsTArray is used for the complete image source data, and adding every segment means reallocs of a large piece of memory (which will result in alloc/memcpy/free).

Stuart Parmenter

Comment 55

•

16 years ago

reallocs on mac and windows (with jemalloc) at least are quite fast.

Alfred Kayser

Comment 56

•

16 years ago

But not when the chunk is already large (say 511KB, and another 4K needs to be added, and there is no room to do after that chunk), then whatever malloc library is used, it needs to allocate some new space, copy the 511KB of data to the new space, and free the old space. 

A simple test would be to disable the discarding (just do setenv MOZ_DISABLE_IMAGE_DISCARD="1"), and do the same performance tests on the decoding.

Vladimir Vukicevic [:vlad] [:vladv] (needinfo me, slow to respond)

Assignee

Comment 57

•

16 years ago

If the server sends a content-length for an image, shouldn't we use that as an initial alloc size (assuming the size is reasonable)?

alternate test driver 17 years ago Vladimir Vukicevic [:vlad] [:vladv] (needinfo me, slow to respond) 906 bytes, text/html		Details
optimize image loading on the mac 17 years ago Vladimir Vukicevic [:vlad] [:vladv] (needinfo me, slow to respond) 7.63 KB, patch		Details \| Diff \| Splinter Review
patch, now with check and warning 17 years ago Vladimir Vukicevic [:vlad] [:vladv] (needinfo me, slow to respond) 7.94 KB, patch		Details \| Diff \| Splinter Review
one more 17 years ago Vladimir Vukicevic [:vlad] [:vladv] (needinfo me, slow to respond) 8.84 KB, patch	pavlov : review+	Details \| Diff \| Splinter Review
updated patch 17 years ago Vladimir Vukicevic [:vlad] [:vladv] (needinfo me, slow to respond) 7.71 KB, patch		Details \| Diff \| Splinter Review
the backout patch (reapply with -R) 17 years ago Vladimir Vukicevic [:vlad] [:vladv] (needinfo me, slow to respond) 32.24 KB, patch		Details \| Diff \| Splinter Review