[meta] StretchDIBits takes too long

RESOLVED FIXED

Status

()

defect
--
critical
RESOLVED FIXED
11 years ago
8 years ago

People

(Reporter: dougt, Assigned: kwwaters)

Tracking

Trunk
All
Windows CE
Points:
---
Dependency tree / graph

Firefox Tracking Flags

(Not tracked)

Details

()

Attachments

(7 attachments, 2 obsolete attachments)

On windows mobile, we use StrechDIBits to draw to the screen:

http://mxr.mozilla.org/mozilla-central/source/widget/src/windows/nsWindow.cpp#6189

During testing of Fennec, panning appears very sluggish.  When I profiled drawing, i found that calls to StretchDIBits take, on average, 98ms to redraw the entire HTC Touch Pro screen (480 x 640 pixels).

We really need to get this time down as it has a huge impact on fennec usability.
tracking-fennec: --- → ?
tracking-fennec: ? → 1.0a1-wm+
Here's a patch that uses Direct Draw to blit from the offscreen buffer to the onscreen one.  Since Direct Draw blits are asynchronous, this should let the blit overlap the compute for the next frame (on hardware-accelerated platforms, at least).  The offscreen buffer is still locked during the blit, so if cairo trys to render before the blit is finished, it will stall.  There is (optional) triple-buffering code that can cover this delay as well, but its unlikely that the CPU will beat the blitter, so it's off by default.

This patch needs to use the Win Mobile 6 SDK - ddraw.lib doesn't seem to be in the WM 5 SDK (although ddraw.h is) and the ddraw interfaces are wildly different between the two.  All the code is in widget/src/windows/nsWindow.cpp, and there are a bunch of #defines with DIRECT_DRAW in them that control debugging, etc.  PAINT_USE_DIRECT_DRAW controls the whole thing; if you turn that off and use PAINT_USE_IMAGE_SURFACE you should get the old code.

I didn't know how to make the build link with ddraw.lib; I hacked it into the tool wrappers in build/wince/tools, but that doesn't feel right.  Someone with more build skillz should do the right thing please.

Very little testing on the patch; it almost certainly needs more debug.  I ran nvtrender and it seemed fine until it got to the yahoo-jp page (which I lack the unicode fonts for) and then it hung.  I will continue to investigate, but for now heres a patch to try out.
Oops, my bad.  It's not a hang; i just didn't notice that my debugger still had a breakpoint set at _cairo_error (which we hit, by the way).
I've tried this patch on my real device (HTC Touch Diamond), but IDirectDraw::CreateSurface() fails with DDERR_UNSUPPORTEDFORMAT and causes crash.
Although the crash report dialog appears, drawing thread continue running. So I'm not able to get stack trace or other evidences yet.

What is wrong?

BTW the patch includes a typo:

  - #define PAINT_DIRECT_DRAW_DEBUG_ERRORS
  + #define PAINT_DIRECT_DRAW_DEBUG_ERROR
It seems that Touch Diamond doesn't support 32bits-RGB surface, but RGB-24 and ARGB-32 surfaces can be created.

Because cairo doesn't support RGB-24, I've tried the patch with ARGB-32.
Now it works, but Blt() method takes longer time than StretchDIBits() on my device.

  StretchDIBits(): about 80 - 100ms
  Blt(): about 170 - 200ms
Cairo does support RGB24, but only if it is not packed (that is, 4 bytes, one of which is ignored).
I forgot to attach the patch.
I've tested with this.
Posted file blittest.cpp
I wrote the attached test program to benchmark blitting using different methods.  There's some weird stuff here; I only have results for CE6 on APX2600, but if someone with one of the windows mobile devices wants to poke me I can give you a binary.  The time is the average per blit out of 50; the numbers in parens are min-max for a single blit.

Surface size 1024*600:

StretchDIBits 16: 175 (173-213)
StretchDIBits 24: 50 (50-51)
StretchDIBits 32: 168 (167-175)
StretchDIBits w/32->24: 75 (75-77)
CompatibleDC/CompatibleBitmap: 100 (100-104)
CompatibleDC/DIB 16: 176 (173-252)
CompatibleDC/DIB 24: 50 (50-53)
CompatibleDC/DIB 32: 167 (165-181)
DDraw: 248 (243-280)

Surface size 320*240:

StretchDIBits 16: 34 (33-66)
StretchDIBits 24: 9 (9-12)
StretchDIBits 32: 32 (32-35)
StretchDIBits w/32->24: 12 (12-15)
CompatibleDC/CompatibleBitmap: 20 (19-23)
CompatibleDC/DIB 16: 33 (33-35)
CompatibleDC/DIB 24: 9 (8-12)
CompatibleDC/DIB 32: 32 (31-34)
DDraw: 47 (47-51)

There's a few pieces of interesting info here:

- CreateDIBSection performance is identical to StretchDIBits for our purposes, so we can ignore the CompatibleDCCreateDIBSection pieces.  This is what I thought form the desktop, so this is good confirmation.

- StretchDIBits with 24bpp (BGR packed data) is substantially faster than either 16 or 32.  This makes no sense to me, and leads me to think that we're hitting some unoptimized path for 16 and 32 -- the device is in 16bpp display mode, so the 16bpp case should be at least faster than 32!  I know that traditionally Win32 GDI prefers 24bpp BGR, though on the desktop 32bpp is faster than 24bpp (and both are faster than 16bpp).

- Using a compatible bitmap is slower than StretchDIBits with 24bpp -- this is also very strange, because using a compatible bitmap is supposed to provide you with matching formats and the fastest path, not to mention living inside GDI and the driver.

- DirectDraw was surprisingly slow.  I only benchmarked drawing direct to the primary, while waiting for each previous operation to complete to provide comparable numbers.  I should have implemented triple buffering here as well, but I don't think it'll help us much in this testcase -- we'll end up blocking on the final Blt, even though we wouldn't in practice unless we're painting a lot of things in quick succession.  However, note that the test doesn't actually write anything to the pointer returned by Lock; modifying the bits directly using Fill32bppBits or adding in a memcpy increase the numbers substantially.

- Benchmarking a 24bpp blit and a 32->24 conversion results in the fastest numbers for something that we can implement easily for a quick speedup.  My 32->24 code is very braindead, but my late-night attempt at doing something smarter didn't provide any speedups.  I would think we could get the conversion faster with some smarter code (takes 25ms out of the 75ms).

- The fastest number for 1024x600 was 50ms -- given that this is 20fps, this seems way too slow.

However... I just tried DirectDraw using 16bpp (by matching the primary surface format exactly), and this yielded 3ms per blit.  So that at least does end up being a straight fast copy, no conversion; so what we can do here is to render to an internal 32bpp RGB buffer, and then run our 32->16 code to the ddraw surface to get things to the screen.  That'll probably get us the fastest results, provided we can get the clipping and other pieces right.
Oh, as mfinkle points out, the source I uploaded has a bunch of the tests #if 0'd out -- search for did_tests and change #if 0 to #if 1.
Using Vlad's blittest.exe (4/10/2009  7:09:08 PM)

HTC Touch Pro: VGA (480 x 640)

StretchDIBits 16: 81 (79-99)
StretchDIBits 24: 20 (18-23)
StretchDIBits 32: 74 (73-76)
StretchDIBits w/32->24: 103 (101-107)
CompatibleDC/Bitmap: 41 (40-44)
CompatibleDC/DIB 16: 80 (79-82)
CompatibleDC/DIB 24: 20 (18-23)
CompatibleDC/DIB 32: 75 (73-82)
DirectDraw failed
DirectDraw(2) failed

Samsung Omnia: WQVGA (240 x 400)

StretchDIBits 16: 36 (31-49)
StretchDIBits 24: 28 (27-35)
StretchDIBits 32: 30 (29-33)
StretchDIBits w/32->24: 56 (55-61)
CompatibleDC/Bitmap: 20 (19-25)
CompatibleDC/DIB 16: 32 (31-34)
CompatibleDC/DIB 24: 27 (27-30)
CompatibleDC/DIB 32: 30 (29-36)
TestDDraw (32bpp): 33 (33-37)
DirectDraw(2) failed
I've tested with same binary with mfinkle (I read IRC log).
It's not so different with above results.

HTC TyTN II: QVGA (240 x 320)

StretchDIBits 16: 33 (30-84)
StretchDIBits 24: 23 (21-30)
StretchDIBits 32: 30 (26-43)
StretchDIBits w/32->24: 111 (106-126)
CompatibleDC/Bitmap: 16 (15-22)
CompatibleDC/DIB 16: 33 (28-65)
CompatibleDC/DIB 24: 24 (21-38)
CompatibleDC/DIB 32: 30 (29-41)
TestDDraw (32bpp): 34 (31-47)
DirectDraw(2) failed


HTC Touch Diamond: VGA (480 x 640)

StretchDIBits 16: 81 (79-90)
StretchDIBits 24: 26 (18-80)
StretchDIBits 32: 75 (73-80)
StretchDIBits w/32->24: 108 (97-140)
CompatibleDC/Bitmap: 46 (43-52)
CompatibleDC/DIB 16: 82 (80-110)
CompatibleDC/DIB 24: 25 (17-42)
CompatibleDC/DIB 32: 77 (73-111)
DirectDraw failed
DirectDraw(2) failed
I updated http://people.mozilla.com/~vladimir/misc/blittest.exe with a version that does 32->24 conversion on just the portion that's to be painted, rather than the entire thing.  On my device here, that makes it the fastest non-directdraw approach that's feasible for us.  From looking at the previous sets of numbers, I'd expect it to also be fast on the HTC Touch Diamond/Pro, but it would be a tossup on the TyTN and Omnia.  Could you guys redownload and rerun?
Here's a quick patch to do the 32->24 conversion.  This gives a decent speedup, though nowhere near as fast as the directdraw approach, but that option might not be available on all our target devices.

We may need to do some tests on startup to figure out what the best approach is for any given device.
Attachment #371818 - Attachment is patch: true
Attachment #371818 - Attachment mime type: application/octet-stream → text/plain
(In reply to comment #11)
> Could you guys redownload and rerun?

Samsung Omnia: WQVGA (240 x 400)

StretchDIBits 16: 47 (46-52)
StretchDIBits 24: 29 (28-34)
StretchDIBits 32: 31 (30-34)
StretchDIBits w/32bpp->24bpp (397x214): 33 (32-36)
CompatibleDC/Bitmap (16): 21 (21-23)
TestDDraw (32bpp): 35 (34-40)
DirectDraw(2) failed
HTC TyTN II: QVGA (320x240)

StretchDIBits 16: 42 (41-45)
StretchDIBits 24: 36 (35-39)
StretchDIBits 32: 39 (36-45)
StretchDIBits w/32bpp->24bpp (317x214): 45 (43-58)
CompatibleDC/Bitmap (16): 27 (23-50)
TestDDraw (32bpp): 44 (40-51)
DirectDraw(2) failed

HTC Touch Diamond: VGA (480x640)

StretchDIBits 16: 82 (79-85)
StretchDIBits 24: 24 (17-40)
StretchDIBits 32: 75 (72-80)
StretchDIBits w/32bpp->24bpp (474x542): 59 (52-73)
CompatibleDC/Bitmap (16): 48 (43-90)
DirectDraw failed
DirectDraw(2) failed
Attachment #372408 - Attachment is patch: true
Attachment #372408 - Attachment mime type: application/octet-stream → text/plain
So I tried a different approach to see if we could get additional API acceleration.  On our platform, now that the StretchDIBits cost was mostly covered, some of the rendering costs were coming to to the top of the profile, and these were slower now because they were rendering into an uncached ddraw surface instead of a cacheable image surface.

The new approach was to do a new cairo backend for ddraw surfaces.  The backbuffer becomes the only gfxDDrawSurface in the system and everything else uses cairo to render to it.  All other cairo surfaces are either image surfaces or windows DIBs.  The OnPaint uses ddraw Blt to copy it to the primary surface.

At the moment, the only hardware acceleration for the ddraw surface is fill_rectangles, since that was a large component of the drawing.  The surface uses lazy locking and unlocking.  I see a significant speed up for most pages, except that the pages that are missing font support (unicode chars w/o an approproate international font) are slower.  I presume this is because the DrawHexChar code is thrashing the locks/unlocks because it probably does small fills followed by sorfware rendering.

Since the lock/unlock of the surface is done at a lower level, it doesn't seem to have the issues with OnPaint recursion that the last patch had.  cairo will lock or unlock the surface as needed and as long as a cairo draw is not interrupted its fine (OnPaint recursion I presume happens due to synchronous calls into the message handler caused by windows API calls like UpdateRect, so I'm pretty sure cairo calls can't be interrupted).

There are still a number of things I'd like to try - I have other ideas how to improve the performance of this patch, but I'm throwing it over the wall for evaluation.  It will have the same problems as the previous patch on other devices --- if you don't support x8r8g8b8 surfaces it will fail.  If you don't have hardware support for x8r8g8b8 blt to your primary, then 24bpp might be faster, etc.  In addition I don't have the (broken) fallback path that the previous patch had - if you don't have direct draw or if ddraw fails it will die in an unfortuante way (ie there's no real error handling yet).  I agree with Vlad, we will probably need to do some runtime determination of the best strategy for each platform.  We might be able to get this from Caps bits, but probably not in any useful way.

I think there may be an easy fix for the DrawHexchar slowdown, which would basically be to not use hw accel on the fill if the surface is locked and the fill is small.  I'm going to do some profiling.  In addition there may be a way to use hw accel for compositing and masking which I think is leaving at least another 10% on the table.  Currently it may be doing read-modify-write to the uncached ddraw surface and we could probably do a copy of the source surface to a ddraw surface and then use the hardware to composite.  This might make unaccelerated platforms slower though.

Patch attached.
(In reply to comment #16)
> Created an attachment (id=372459) [details]
> cairo direct-draw backend

Only skimmed the patch initially, but so far it looks great.  One possible optimization that I see is the backend itself could forward calls to the image surface directly, instead of going through the acquire/release path... but at worst that should just be an extra few calls.  Jeff might have some feedback there.

> At the moment, the only hardware acceleration for the ddraw surface is
> fill_rectangles, since that was a large component of the drawing.  The surface
> uses lazy locking and unlocking.  I see a significant speed up for most pages,
> except that the pages that are missing font support (unicode chars w/o an
> approproate international font) are slower.  I presume this is because the
> DrawHexChar code is thrashing the locks/unlocks because it probably does small
> fills followed by sorfware rendering.

That makes sense, and I agree with your suggestion on how to fix DrawHexChars.. other options there are a) getting rid of the fancy version of DrawHexChars entirely, and just having it draw small solid filled rects; b) fixing it to draw to a temporary image surface and then blitting that in at once.

> Since the lock/unlock of the surface is done at a lower level, it doesn't seem
> to have the issues with OnPaint recursion that the last patch had.  cairo will
> lock or unlock the surface as needed and as long as a cairo draw is not
> interrupted its fine (OnPaint recursion I presume happens due to synchronous
> calls into the message handler caused by windows API calls like UpdateRect, so
> I'm pretty sure cairo calls can't be interrupted).

Right; cairo is threadsafe, but since we only do drawing on the main thread we disable all the threadsafe locking.  Even then, drawing to the same surface/cairo_t from different threads is never safe.

> There are still a number of things I'd like to try - I have other ideas how to
> improve the performance of this patch, but I'm throwing it over the wall for
> evaluation.  It will have the same problems as the previous patch on other
> devices --- if you don't support x8r8g8b8 surfaces it will fail.  If you don't
> have hardware support for x8r8g8b8 blt to your primary, then 24bpp might be
> faster, etc.  In addition I don't have the (broken) fallback path that the
> previous patch had - if you don't have direct draw or if ddraw fails it will
> die in an unfortuante way (ie there's no real error handling yet).  I agree
> with Vlad, we will probably need to do some runtime determination of the best
> strategy for each platform.  We might be able to get this from Caps bits, but
> probably not in any useful way.

Yep.. worst-case we can do a quick timing loop on first start, and then save the result in a pref.  But I think having the ddraw path as the primary approach and keeping the StretchDIBits as the fallback (with or without 32->24) should keep us compatible with all hardware.
Oh, also -- I wonder if we could get faster text rendering by implementing show_glyphs here, and caching text glyphs into DirectDraw surfaces?  Then we could be doing ddraw->ddraw blits for text, though that might hurt us if we have to render to a temporary non-ddraw image surface (such as when doing a surface with alpha).
Updated patch for the 32->24 optimization when using StretchDIBits.  The smarter code here isn't any faster or slower than the dumb code on some platforms, but on others it makes a huge difference.
Attachment #372357 - Attachment is obsolete: true
Attachment #372503 - Flags: review?(jmuizelaar)
Just tried the directdraw patch -- it's running fine here, though some colors seem to be off; e.g. the text highlight color should be blue (e.g. highlight the URL in the urlbar), but it's showing up as dark red.  This happens whether I use ddraw or the stretchdib path.

Also, on the apx at least, the ddraw patch is noticeably faster on simpler pages (e.g. the minefield start pages).  On more complex pages (I've been using cnet.com, since they don't send you off to a mobile version), it feels like a tossup between ddraw and the 24bpp stretchdib; I'm guessing that actual cairo rendering is dominating there.
Attachment #372503 - Flags: review?(jmuizelaar) → review+
Comment on attachment 372503 [details] [diff] [review]
updated 32->24 StretchDIBits patch

>b=484864, StretchDIBits takes too long (do 32->24 conversion on WinCE)
>
>diff -r a79d486e5195 widget/src/windows/nsWindow.cpp
>--- a/widget/src/windows/nsWindow.cpp    Mon Apr 13 14:11:28 2009 -0700
>+++ b/widget/src/windows/nsWindow.cpp    Mon Apr 13 17:13:02 2009 -0700
>@@ -180,6 +180,8 @@
>  */
> #ifdef WINCE
> 
>+#define PAINT_USE_IMAGE_SURFACE_24BPP
>+
> // used by nsIdleServiceWin.cpp
> DWORD gLastWindowMessageTime = 0;
> 
>@@ -6074,6 +6076,69 @@
>         bi.biBitCount = 32;
>         bi.biCompression = BI_RGB;
> 
>+#ifdef PAINT_USE_IMAGE_SURFACE_24BPP
>+        // On Windows CE/Windows Mobile, 24bpp packed-pixel sources
>+        // seem to be far faster to blit than 32bpp (see bug 484864).
>+        // So, convert the bits to 24bpp by stripping out the unused
>+        // alpha byte.  24bpp DIBs also have scanlines that are 4-byte
>+        // aligned though, so that must be taken into account.
>+        int srcstride = surfaceSize.width*4;
>+        int dststride = surfaceSize.width*3;
>+        if (dststride & 3) dststride = (dststride & ~3) + 4;

You could write this as:
dststride = (dststride + 3) & ~3;
and avoid the conditional

>+
>+        int width_over_4 = surfaceSize.width / 4;
>+        int width_left = surfaceSize.width % 4;
>+
>+        // Convert in place
>+        for (int j = 0; j < surfaceSize.height; ++j) {
>+          unsigned int *src = (unsigned int*) (targetSurface->Data() + j*srcstride);
>+          unsigned int *dst = (unsigned int*) (targetSurface->Data() + j*dststride);
>+
>+          // go 4 pixels at a time, since each 4 pixels
>+          // turns into 3 DWORDs when converted into BGR:
>+          // BGRx BGRx BGRx BGRx -> BGRB GRBG RBGR
>+          //
>+          // However, since we're dealing with little-endian ints, this is actually:
>+          // xRGB xrgb xRGB xrgb -> bRGB GBrg rgbR
>+          for (int i = 0; i < width_over_4; ++i) {

I find 'while' loops work well for unrolling like this. e.g:

while (width_left > 4) {
   do_stuff();
   width_left -= 4;
}

switch (width_left) {
   welcwc
}

Not a huge deal though.

In the long term we should probably move loops like this into pixman if possible.
Landed 32->24 code as http://hg.mozilla.org/mozilla-central/rev/2d9d7822862a with jeff's suggestions.  Let's keep this bug open for improvements & ddraw work.
There's something to be said for one-patch-per-bug, any reason we shouldn't just clone this to a new meta bug for that?
Update to the cairo direct-draw backend patch.

Fixes the red-blue swap in fill_rectangles.  Adds an optimization to use software for small fill_rectangles if the ddraw surface is currently locked, because locking and unlocking is somewhat expensive.
Summary: StretchDIBits takes too long → [meta] StretchDIBits takes too long
tracking-fennec: 1.0a1-wm+ → 1.0b1-wm+
This is a fix for the cairo ddraw backend which fixes a clipping issues when the window isn't contained in the screen; such as when the window is partially offscreen or when the window is larger than the screen.  This gets fennec drawing with ddraw again.
removing the blocking flag since most of what we care about has landed
tracking-fennec: 1.0b1-wm+ → ---
what blassey said. more or less fixed.  :-)
Status: NEW → RESOLVED
tracking-fennec: --- → ?
Closed: 10 years ago
Resolution: --- → FIXED
Assignee: nobody → kwwaters
tracking-fennec: ? → ---
You need to log in before you can comment on or make changes to this bug.