Speed up Canvas's getImageData() and putImageData()

RESOLVED FIXED in mozilla1.9.3a2

Status

()

Core
Canvas: 2D
RESOLVED FIXED
9 years ago
7 years ago

People

(Reporter: Justin Lebar (not reading bugmail), Assigned: Justin Lebar (not reading bugmail))

Tracking

(Depends on: 1 bug)

Trunk
mozilla1.9.3a2
x86
All
Points:
---

Firefox Tracking Flags

(Not tracked)

Details

Attachments

(5 attachments, 4 obsolete attachments)

(Assignee)

Description

9 years ago
nsCanvasRenderingContext2D::GetImageData has a tight loop in which it unpremultiplies the pixels on the canvas.  I'm looking at whether this loop can be sped up.
(Assignee)

Comment 1

9 years ago
Created attachment 403437 [details]
Standalone benchmark

I benchmarked on a Ubuntu 9.04 guest VM (allocated 1GB RAM) running in VirtualBox on a Win XP host (2GB, Intel Core 2 Duo T7200 @ 2.0Ghz).

I tried a few approaches to speeding up the loop.  Some of them worked:

  * extraCheck - Don't unpremultiply when alpha = 255.  This is the most trivial change to the algorithm; it's just an extra condition in one if statement.  The assumption here is that pixels with alpha = 255 (i.e., opaque pixels) are common.
  * tableLookup - Precompute the value of unpremultiplying pixel value p by alpha value a for all p and a, and store it in a table.  Unpremultiplying an image now becomes a set of table lookups.

And some of them didn't work:

  * twoAtOnce - Unrolling the loop so we process two pixels at once, in the hope that this might cause fewer pipeline stalls.  This didn't yield any speedup, and sometimes slowed things down.  I also tried combining it with both of the above approaches; again, no luck.  I didn't include these in the benchmark results below, but I left the implementations in the attachment.

  * SSE - I actually spent a while writing an SSE implementation, but I don't think it's a fruitful path to continue down, for two reasons:  First, afaik, there's no integer divide in SSE, so we have to convert to floats, divide, then convert back to integers.  This wasn't speedy in my tests.  But more importantly, there's no packed dword left bit shift instruction.  (There is a packed dword left *byte* shift instruction.)  We need a packed dword left bit shift in order to implement INT_TO_JSVAL on a vector.

The results from running my benchmark are below.  In a nutshell, we get no benefit from extraCheck if the alpha is random, as expected.  If the alpha is always 255, we get a 15-20% speedup.  The lookupTable performs 12-15% faster than the original with random alphas, and about as well as extraCheck when alpha = 255.

These results are pretty sensible.  lookupTable gets a speed boost when the alpha is fixed because we use less of the lookup table, leading to less cache contention as we read and write the pixel data.

I get the best results by compiling with -O2.  If I use -Os (which I understand may be the Mozilla default?), extraCheck performs about 5% worse in the alpha=255 case.

I benchmarked with different sizes of pixel arrays, those are the w/h parameters in the output.  I also ran the original code twice, at the beginning and end of every test run, just to be sure nothing funky was going on.  It's not.

** w=200, h=200, uniformly random pixels.
       original avg: 6.9262 ms
     extraCheck avg: 6.92021 ms
    lookupTable avg: 6.12208 ms
      original2 avg: 7.05881 ms
extraCheck is 1.02003 times faster than original2.
lookupTable is 1.15301 times faster than original2.

** w=200, h=200, a=255.
       original avg: 7.03998 ms
     extraCheck avg: 6.01677 ms
    lookupTable avg: 5.92034 ms
      original2 avg: 7.15211 ms
extraCheck is 1.18869 times faster than original2.
lookupTable is 1.20806 times faster than original2.

** w=400, h=400, uniformly random pixels.
       original avg: 28.0648 ms
     extraCheck avg: 27.622 ms
    lookupTable avg: 24.3644 ms
      original2 avg: 28.035 ms
extraCheck is 1.01495 times faster than original2.
lookupTable is 1.15065 times faster than original2.

** w=400, h=400, a=255.
       original avg: 27.5603 ms
     extraCheck avg: 23.4347 ms
    lookupTable avg: 23.1969 ms
      original2 avg: 28.03 ms
extraCheck is 1.19609 times faster than original2.
lookupTable is 1.20835 times faster than original2.

** w=600, h=600, uniformly random pixels.
       original avg: 62.9217 ms
     extraCheck avg: 62.9582 ms
    lookupTable avg: 55.4517 ms
      original2 avg: 62.2523 ms
extraCheck is 0.988787 times faster than original2.
lookupTable is 1.12264 times faster than original2.

** w=600, h=600, a=255.
       original avg: 63.5015 ms
     extraCheck avg: 53.7265 ms
    lookupTable avg: 53.0987 ms
      original2 avg: 63.1705 ms
extraCheck is 1.17578 times faster than original2.
lookupTable is 1.18968 times faster than original2.

** w=1000, h=1000, uniformly random pixels.
       original avg: 176.719 ms
     extraCheck avg: 175.625 ms
    lookupTable avg: 152.861 ms
      original2 avg: 173.205 ms
extraCheck is 0.986221 times faster than original2.
lookupTable is 1.13309 times faster than original2.

** w=1000, h=1000, a=255.
       original avg: 174.075 ms
     extraCheck avg: 146.091 ms
    lookupTable avg: 146.85 ms
      original2 avg: 175.632 ms
extraCheck is 1.20221 times faster than original2.
lookupTable is 1.196 times faster than original2.
(Assignee)

Updated

9 years ago
Attachment #403437 - Attachment mime type: application/octet-stream → text/plain
(Assignee)

Comment 2

9 years ago
The question now is: What should we do to the implementation?  I don't see any harm in implementing extraCheck, unless there are some pathological systems on which |(x * 255) / 255 != x|.

The case for the lookup table is, I think, somewhat harder to make.  It takes less than 2ms to compute, and it takes up only 128k of memory, and it speeds things up in the case of alpha != 255.  But we probably don't want to leave those 128k sitting around, and if we share the table between browsing contexts, deciding when to clean it up would be something of a pain.
Status: NEW → ASSIGNED
Is getImageData performance dominated by the unpremultiply loop, or by the JS object creation?
(Assignee)

Comment 4

9 years ago
They appear to be about the same.(In reply to comment #3)
> Is getImageData performance dominated by the unpremultiply loop, or by the JS
> object creation?
They appear to be about the same.  For a 500x500 canvas with random pixels, I time the unpremultiply loop at 46-55ms and newarrayobject at 55-60ms.

(I got these results by calling PR_IntervalNow() at various points in the GetImageData function.  I originally used PR_Now(), but that told me that the unpremultiply loop executed in 0us.  I'm not sure why that happened, and I figure it's probably not relevant, but I figured I'd mention it just in case it's a red flag.)
Created attachment 411450 [details]
javascript-based getImageData() performance test

Hello!

Sorry for bothering. While working on a web application which uses Canvas I bumped into this problem and I noticed that getImageData() is very much slower when compared to Safari/Chrome/Opera.

I am attaching a small TC which performs color inversion with get/putImageData() for every pixel (for illustrative purposes). This is obviously not a legitimate use of getImageData(). Still, the method should execute faster, given there are other legitimate use-cases for the method.

For an 110 x 110 pixels image Firefox 3.5 takes 22 seconds to perform the color inversion while Opera 10 and Chrome 4 take only 0.1 seconds. I tested this on an Ubuntu 9.10 Linux system on an AMD Athlon XP 1800+ 1.5 Ghz machine - official nvidia drivers from Ubuntu repos (geforce 6600gs card).

If I run the test inside the Firebug profiler it points out that most of the time is spent with the getImageData() execution.

There's something wrong - the difference between Gecko and the rest is far too high. I hope this can be fixed soon.

Thanks!
(Assignee)

Comment 6

9 years ago
Mihai,(In reply to comment #5)
> For an 110 x 110 pixels image Firefox 3.5 takes 22 seconds to perform the color
> inversion while Opera 10 and Chrome 4 take only 0.1 seconds. I tested this on
> an Ubuntu 9.10 Linux system on an AMD Athlon XP 1800+ 1.5 Ghz machine -
> official nvidia drivers from Ubuntu repos (geforce 6600gs card).

You may want to try turning off Firebug.  On my machine, FF 3.5 takes .5s do to the conversion, and Chrome takes .1s.  We're still slower, but not 200x.  :)
Hey Justin,

(In reply to comment #6)

> You may want to try turning off Firebug.  On my machine, FF 3.5 takes .5s do to
> the conversion, and Chrome takes .1s.  We're still slower, but not 200x.  :)

Your concern is valid, however during testing Firebug was turned of. I know Firebug causes an important performance penalty. Just to be sure, I now tested in a clean profile where Firebug is not even installed. The result is the same.

What OS did you use for testing? Windows by any chance? I just booted Windows XP in VirtualBox. I ran the tests with the 110x110 pixels image and I got the following results:

Opera 10: 40 ms
Chrome 4: 140 ms
Firefox 3.5: 430 ms

As you can see Firefox is a lot more optimized on Windows. Additionally, the results are pretty much consistent to what you've reported: FF takes around 0.5s to apply the filter.

It seems that even on Windows the getImageData() quickly becomes much slower. I increased the canvas image to 400x400 pixels. Results:

Opera: 0.6 seconds
Chrome: 2 seconds
Firefox: 8 seconds

For the same 400x400 pixels image on Linux, natively, in a clean profile, I get these results:

Opera: 1.9 seconds
Chrome: 2.3 seconds
Firefox: did not end. after 120 seconds it did complete about 1-2% of the image. I can make a screenshot. (it's something which makes me say "wtf?")

Based on testing it looks to me that the issue is valid on Windows, not only on Linux. However, on Linux the issue is much greater - I don't know why. I tested several builds (firefox 3.6 nightly, firefox 3.7 nightly, etc). Both nightlies take the same amount of time to execute the script.

Please retest and try increasing the image size, depending on your machine performance (say pick 600x600 or 1000x1000). If you have some Linux system at hand, please compare.

Thanks for your quick reply and for testing. ;)
(Assignee)

Comment 8

9 years ago
Created attachment 411530 [details]
perf test which calls get/putImageData just once

(In reply to comment #7)
Mihai,

I just reproduced the behavior you've described on my VirtualBox Ubuntu 9.10 install.   It indeed takes forever to run your script, even on a 110x110 image.

Your benchmark does test odd behavior, however.  I don't think repeatedly calling get/putImageData with single-pixel arguments is necessarily something we need to make really fast in general.

It's still interesting to me that the Linux build takes so much longer to perform these small get/putImageData calls.  Perhaps you could file another bug to that effect.  There may be a difficult-to-find and difficult-to-fix reason behind this, however.

In the meantime, I modified your benchmark so it calls get/putImageData only once on a 1000x1000 image.  My stats are:

Windows XP Chrome 4.0.223.16: 740ms
Windows XP Opera 10.01:       1780ms
Windows XP Firefox 3.5.4:     566ms

VirtualBox Ubuntu 9.10 Firefox 3.7a1: 2472ms

I imagine the Linux number would be better if it were run natively.
Note that on linux, the actual canvas buffer lives in the X server, so each get requires asking for the data from the X server, and each put requires pushing the data to the X server (though the put is async, while the get is not).   The other platforms just have an image they can poke at directly, and so don't suffer that penalty.  We could make this faster by not using an X drawable for rendering or something.
Justin: Thanks for testing.

The test is indeed odd, but that's the whole purpose: very *intensive* use of get/putImageData().

The modified test you have attached has shifted the balance from get/putImageData() performance testing to JS perf testing. The tight loops can run with JIT and such. This is reflected by the results:

Chrome 4: 2.8 seconds
Opera 10: 6.6 seconds (slower JS engine)
Firefox 3.6 nightly: 1.3 seconds

(test ran natively on Linux)

The problem at hand is: why it takes forever something which runs quite well in the other two browsers?

Another interesting test you might want to perform: modify the perf test I attached to only loop through the first 110 pixels (both for x and y). Then increase the canvas dimensions to 400x400. Run the perf test. Here are my results on native Linux:

Chrome 4: 0.1 seconds
Opera 10: 0.1 seconds

(the same as the initial test: color inversion for 110x110 takes 0.1 seconds)

Firefox 3.6: did not end. after 120 seconds it did the color inversion for about 60x60 pixels (about a half of what it has to do).

This test shows that the getImageData() performance is pretty constant in Chrome and Opera, as long as you perform the same reads, irrespective of the actual canvas dimensions. However, in Firefox, the performance of the individual reads goes down when the canvas size goes up.

Irrespective of how odd the perf test is, Gecko shouldn't be so much slower when compared to the other engines.

Vladimir: Perhaps the X server increases the performance gap between Gecko and Opera/Chrome. However, please note that with big images, the gap is noticeable in Windows as well. Therefore, it is possible there are two performance bottlenecks: in the common cross-platform code, *and* in the platform-specific code for Linux.
(Assignee)

Comment 11

9 years ago
(In reply to comment #10)
> Irrespective of how odd the perf test is, Gecko shouldn't be so much slower
> when compared to the other engines.
>  [snip]
> In Firefox, the performance of the individual reads goes down when the canvas 
> size goes up.

Mihai, could you file a new bug for this behavior?  I agree that it doesn't seem right.
(In reply to comment #11)
> (In reply to comment #10)
> > Irrespective of how odd the perf test is, Gecko shouldn't be so much slower
> > when compared to the other engines.
> >  [snip]
> > In Firefox, the performance of the individual reads goes down when the canvas 
> > size goes up.
> 
> Mihai, could you file a new bug for this behavior?  I agree that it doesn't
> seem right.

On Linux or on other platforms?  On Linux, that sounds right -- the constant time per read operation is based on the size of the image, since the whole thing generally ends up getting read from the X server.
(Assignee)

Comment 13

9 years ago
Oh, I see.  I misunderstood comment #10 to be suggesting that the time of a fixed-size read increased as the canvas size increased even on Windows.  I just tested it, and that's not the case.
Should I make any bug report? Two separate bug reports, one for Windows, and another for Linux?
(Assignee)

Comment 15

9 years ago
Created attachment 414209 [details] [diff] [review]
Patch for review v1

This simple change speeds up getImageData by about 20% on my benchmark (will attach in a sec) in the case where alpha=255.  For random a, it appears no slower than the original.  Tryserver builds at https://build.mozilla.org/tryserver-builds/justin.lebar@gmail.com-1259042030/

My testing indicates that we might be able to get a similar speedup for all alpha values by using a lookup table.  But I'm not sure we want to keep a 64k lookup table around just for GetImageData.
Attachment #414209 - Flags: review?(vladimir)
(Assignee)

Comment 16

9 years ago
Created attachment 414210 [details]
alpha=255 vs random alpha GetImageData benchmark

On my Windows XP machine, the unaltered build takes 1200ms in both the alpha=255 and random alpha cases, while the patched build takes 1000ms in the alpha=255 case and 1200ms in the random alpha case.
(In reply to comment #15)
> My testing indicates that we might be able to get a similar speedup for all
> alpha values by using a lookup table.  But I'm not sure we want to keep a 64k
> lookup table around just for GetImageData.

We can build it dynamically as needed, and then throw it away after a minute or so of disuse.  Though 64k isn't that much either way.
(Assignee)

Comment 18

9 years ago
Created attachment 416219 [details] [diff] [review]
Lookup table for GetImageData

This patch uses a lookup table which it populates on the first call to GetImageData.  I measure it to be as fast in the random alpha case as my previous patch was in the alpha=255 case.  (1000ms for both cases in attachment 414210 [details].)

I'm not sure how to time out the table after a period of no use; if we want to do that, I could use a pointer or two.  The overhead of creating the lookup table is about 2ms on my machine, so it might be feasible to use the lookup table even for small getImageData calls.

Once we settle on code for GetImageData that we like, I'll try to do the same thing for PutImageData -- I don't see why it couldn't be sped up in the same fashion.
Attachment #414209 - Attachment is obsolete: true
Ah ha, I knew there was a reason why I couldn't find this bug, wrong component!

I like the lookup table idea -- just create it globally on first use, and destroy it when the last 2D canvas rendering context goes away.  Just increment a static counter in the constructor and decrement in the destructor; if that goes to 0, delete the table.  This plus the work I have in bug 534467 should make getImageData/putImageData quite fast.
Component: Graphics → Canvas: 2D
QA Contact: thebes → canvas.2d
(Assignee)

Comment 20

9 years ago
Created attachment 426749 [details] [diff] [review]
Patch v1

This has a lookup table for both getImageData and putImageData.  The lookup tables are deleted after the last rendering context dies.

Benchmarks in a sec.
Attachment #416219 - Attachment is obsolete: true
(Assignee)

Comment 21

9 years ago
Created attachment 426752 [details]
get and put ImageData benchmarks

Benchmark for get and putImageData.

The numbers for getImageData should be comparable to the last benchmark (attachment 414210 [details]); I just added benchmarks for putImageData.

Results from today's nightly build:
alpha=255 put took 1062 ms.
alpha=255 get took 1194 ms.
Random alpha put took 1060 ms.
Random alpha get took 1230 ms.

With patch:
alpha=255 put took 591 ms.
alpha=255 get took 968 ms.
Random alpha put took 618 ms.
Random alpha get took 979 ms.
Attachment #414210 - Attachment is obsolete: true
(Assignee)

Updated

9 years ago
Attachment #426749 - Flags: review?(vladimir)
(Assignee)

Updated

9 years ago
Summary: Speed up Canvas's getImageData() → Speed up Canvas's getImageData() and putImageData()
Comment on attachment 426749 [details] [diff] [review]
Patch v1

Looks great, one problem to fix though:

>+    sNumLivingContexts--;
>+    if (!sNumLivingContexts) {
>+        delete sUnpremultiplyTable;
>+        delete sPremultiplyTable;
>+    }

Need to set these back to nsnull here as well, or they won't get allocated the next time they're needed.
Attachment #426749 - Flags: review?(vladimir) → review+
(Assignee)

Comment 23

9 years ago
Created attachment 426760 [details] [diff] [review]
Patch for checkin

Thanks for the review!
Attachment #426749 - Attachment is obsolete: true
(Assignee)

Updated

9 years ago
Keywords: checkin-needed

Updated

9 years ago
Assignee: nobody → justin.lebar+bug
http://hg.mozilla.org/mozilla-central/rev/5729dd173cab
Status: ASSIGNED → RESOLVED
Last Resolved: 9 years ago
Keywords: checkin-needed
Resolution: --- → FIXED
Target Milestone: --- → mozilla1.9.3a2
(Assignee)

Updated

7 years ago
Depends on: 659725
You need to log in before you can comment on or make changes to this bug.