Animated gifs use large amounts of CPU when not visible

VERIFIED FIXED in mozilla0.9.9



18 years ago
2 years ago


(Reporter: jesup, Assigned: jesup)


({perf, testcase})

Dependency tree / graph

Firefox Tracking Flags

(Not tracked)



(7 attachments, 4 obsolete attachments)

888 bytes, text/html
223.23 KB, text/html
3.24 KB, patch
Details | Diff | Splinter Review
4.76 KB, patch
: review+
: superreview+
Details | Diff | Splinter Review
3.79 KB, patch
: review+
Details | Diff | Splinter Review
8.40 KB, text/html
13.50 KB, text/html
When an animated gif is not visible (obscured or scrolled out of the visible
page) they still can cause major CPU use.  The example I'm using (from bug
86319) uses 60-70% cpu on a 233MMX even when it's not visible.  IE uses close to
0% when it's not visible.  (Both use 100% when it is visible on that system:
233MMX, 64MB, older cheap S3? video card.)  I'll upload my testcase and get a jprof.
Posted file Testcase
It also includes loading the page, but that's down in the noise.  84%+ in
nsImageGTK::DrawToImage() and imgContainer::BuildCompositeMask().  Are we
really rebuilding the mask on every frame?
tor, aren't you in the middle of whacking that code?
It isn't our actual problem, but since I was looking at the CompositeMask code,
I tweaked it for a speed improvement.  Rough calculation using jprof before and
after is that it's about 16% faster.  And yes, we are building the masks every time.
This should speed up the mask generation some.	Nothing fancy, just hoisting
things out of loops, and avoiding calculations of indexes and masks on the fly
- use pointers and shifts of mask vars instead.

Easier to read after applying; I'll post a -w -u as well.  This does NOT solve
the basic problem of this bug; it may help some.  Rough (very) calulation of
improvement is 16%; note that in this case CPU use remains at around 100% (it
just spins faster).
Comment on attachment 69091 [details] [diff] [review]
small perf patch to CompositeMask

Attachment #69091 - Flags: review+
The basic problem, as per expectations, is that on wakeup, we're calling DrawTo
to draw the next frame, and then calling BuildCompositeMask - all without any
idea if the frame is visible.  Both DrawToImage() and BuildCompositeMask are
expensive due to transparency issues.  Sounds like we either need to clip (may
be hard) or do a visibility check (easier I'd think), or find a way to cache the
alpha data, or suspend the actual animation when not visible except for keeping
the frame count (and correctly rebuild the on-screen data when it scrolls back
on screen! - not trivial).

If we totally suspend anims that are off-screen, there will probably be some
objections, though I don't think it violates any specs (it might, I suppose).  
Yes, I'm reworking a fair bit of nsImageGTK code.  I was hoping to move all the
work of nsImageGTK to the server side, but since imgContainer is fiddling with
the bits (image and alpha) inside a Lock/Unlock it can't be done.

Attached is an (untested) subset of my 121015 patch which removes some useless
work from DrawToImage and probaly makes it more correct (we'll always pick up
the locked bits modifications, which I think we could miss with the current

What some people have been talking about is caching the GIF frames
post-composite so that this expensive work only needs to be done once.
Posted patch rewritten patch (obsolete) — Splinter Review
Updated patch.	Now uses almost no time for CompositeMask.  Removed all
bit-twiddling from the common cases (no offset between alpha and overlay).  Do
multiple bits at a time (8 at a time for no alpha/overlay offset).  Don't write
to  alpha if overlay bits are 0x00.

Much, much much faster; probably close to optimal.  jprof with the patch shows
99% of CPU use is now in DrawToImage, 0 to 1% in CompositeMask.  Tried with
anim with colored background to make sure it was working correctly.

Also corrects a bug in previous patch when overlayXoffset != 0 (missing shift).
Attachment #69091 - Attachment is obsolete: true
Attachment #69092 - Attachment is obsolete: true
--> Randell  :-)  who is doing much appreciated work.
Assignee: pavlov → rjesup
Target Milestone: --- → mozilla0.9.9
r=/sr= for the patch?  Again, this still won't solve the problem with
DrawImageTo(), but it will speed up transparent anims in all cases including
reducing the overhead when they're offscreen.
Keywords: patch
Comment on attachment 69370 [details] [diff] [review]
rewritten patch

Attachment #69370 - Flags: review+
Blocks: 119597
Comment on attachment 69370 [details] [diff] [review]
rewritten patch

>+        for(PRUint32 i=0; 
>+            i < height;
>+            i++) {

Nit: put this on one line, it's a canonical for loop that fits.  The for loop
header it replaces didn't fit, but it's history and cvs diff shows it far
before this replacement.

>+          PRUint8 pixels;
>+          PRUint32 j;
>+          // use locals to avoid keeping track of how much we need to add
>+          // at the end of a line.  we don't really need this since we may
>+          // be able to calculate the ending offsets, but it's simpler and
>+          // cheap.
>+          PRUint8 *localOverlay = overlayLine-1;
>+          PRUint8 *localAlpha   = alphaLine;
>+          for (j = 0; j < width; j += 8) {
>+            // don't do in for(...) to avoid reference past end of buffer
>+            pixels = *++localOverlay;

So you're subtracting overlayLine - 1 before the loop, then preincrementing,
for an extra sub and as many adds as would obtain if you just post-incremented.

There's no inefficiency with post-incrementing vs. pre- here -- addi;load or
load;addi on RISCy architectures.  Please correct me if I'm wrong in Intel.

>+            if (pixels == 0) {
>+              // no bits to set - iterate and bump output pointer
>+              ++localAlpha;
>+            }
>+            else {
>+              // for the last few bits, we need to special-case it

"last few" or "first few"?  /me is confused.

>+              if (j >= 8) {
>+                if (mask_offset == 0) {
>+                  // simple case, no offset
>+                  *localAlpha++ |= pixels;
>+                }
>+                else {
>+                  *localAlpha++ |= (pixels >> mask_offset);
>+                  *localAlpha   |= (pixels << (8U-mask_offset));
>+                }
>+              }

The above case (inside the (j >= 8) condition) could be the body of this loop,
if you hoist this code:
>+              else {
>+                // last few bits have to be handled more carefully if
>+                // width is not a multiple of 8.
>+                // set bits we don't want to change to 0
>+                pixels = (pixels >> (8U-j)) << (8U-j);
>+                *localAlpha++ |= (pixels >> mask_offset);
>+                // don't touch this byte unless we have bits for it
>+                if (j > (8U - mask_offset))
>+                  *localAlpha |= (pixels << (8U-mask_offset));
>+              }

Then the loop's smaller, has only one (predicted taken, I hope) branch around
the (pixels == 0) ++localAlpha code, and runs a bit faster.

>+            }
>           }
>+// Windows and OS/2 have the funky bottom up data storage we need to account for
>+#if defined(XP_WIN) || defined(XP_OS2)
>+          alphaLine   -= abprComposite;
>+          overlayLine -= abprOverlay;
>+          alphaLine   += abprComposite;
>+          overlayLine += abprOverlay;
>         }
>       }
>       break;
>     default:
>       break;

If you could unroll the j < 8 case and re-test, I'll sr= quickly.  Thanks,

Nice patch, thanks!
>So you're subtracting overlayLine - 1 before the loop, then preincrementing,
>for an extra sub and as many adds as would obtain if you just post-incremented.
>There's no inefficiency with post-incrementing vs. pre- here -- addi;load or
>load;addi on RISCy architectures.  Please correct me if I'm wrong in Intel.

It was a left-over from an intermediate form of the patch. Fixed.

>+              // for the last few bits, we need to special-case it
>+              if (j >= 8) {
>"last few" or "first few"?  /me is confused.

   Last few bits of a line.  This is a bug in the logic of the code; at one
point I reversed the value of j to be decrementing but didn't reverse the for()
statement.  It should be:

          for (j = width; j > 0; j -= 8) {

(otherwise the handling of partials isn't correct, due to the (8U-j) term).

Revised, and retesting.  I'll upload after running some tests.

Perf tests show 0% in BuildCompositeMask for not-visible (all CPU time is now in
DrawToImage, used to be 50% BuildCompositeMask), 1.2% when it's visible (used to
be >20%).  Looks pretty good to me.  Patch to be attached shortly.
Updated patch as per comments.	Also fixes incorrect mask_offset calculation. 
For a good testcase, see (which
has offsets that vary from frame to frame).

r=/sr= please, and lets get it in.
Attachment #69370 - Attachment is obsolete: true
Comment on attachment 70147 [details] [diff] [review]
Updated patch, tested heavily

Carrying r= forward. if you use the prevailing if-{ style (the { on the same
line as the if, not on its own line).

Attachment #70147 - Flags: superreview+
Attachment #70147 - Flags: review+
Patch for imgContainer.cpp checked in.

I have a patch that does the same sort of hackery to DrawToImage to optimize
drawing transparent anims.  This should (mostly) solve the other half of the
equation, especially when combined with bug 125157 (limit animation rates - no
infinitely fast anims).  The patch is working but still has 1 bug I need to
ferret out for tomorrow.  That should solve the other bad actor in the heavy CPU
use for transparent anims.
Keywords: mozilla1.0mozilla0.9.9
Sorry, I meant bug 125137 for the animation-refresh-limiting bug.  Marking this
as dependent on that (since speeding up these functions merely wastes cpu faster
unless we limit the rate we try to animate).  Also, we still need to try to
shortcut all of this masking/drawing if the anim isn't visible.
Depends on: 125137
Similar rewrite to DrawToImage to avoid bit-twiddling.	We should also apply
the patch #69176 - it's not integrated here.  I can either provide an updated
copy of that after we get this patch in, or I can merge them if people think it
won't cause any danger of not getting this fix in.

Should greatly reduce CPU use in DrawToImage.  Tested heavily using similar
testcases as per above.

r/sr ASAP so this can make 0.9.9 please
I'm wondering if lookup tables might further improve speed. For example, 
where you have 
  alphaPixels &= 0xff << (8 - (aDWidth-x));
the right hand side presumable has only 8 possible values so it could be
  alphaPixels &= maskvals1[aDWidth-x];
There are other similar places.

Just an idea.
Maybe an array would be cheaper - but probably not, since you'd introduce a
memory read into the loop.  Right now mask can live entirely in a register, and
mask >>= 1 is cheap.  Memory references are bad, tight register loops are good

I'll remove the NULL and substitute nsnull (I was just silencing one of many
warnings already in the code) before I check in.
Memory is red-shifting away from the super-scalar pipeline on modern CPUs.  Even
on-CPU cache memory may take multiple cycles to access.  So i agree with rjesup
in general.  The Intel P4 seems, however, to have penalized non-constant shifts,
and we can't turn them into multiplies easily here.  But don't prematurely
optimize for P4, I say.  Make the compiler sweat it, measure first, etc.

Pav, can you r= (test if not read the code :-)?

Realtime jprof of the testcase, with the anim not visible.  The anim has a
repeat rate of 10ms (100Hz), which in this test was not limited (bug 125137 not
in place).  Note, dual-450 P3 Linux, remote X display (no bits are actually
being drawn).

Before latest patch, 25% hits in DrawToImage (75% poll)
After latest patch, 8% hits in DrawToImage, 90+% poll.

Before either this patch or the other one already committed, non-poll hits were

Looks like overhead is down 66% for DrawToImage, 80-90% down overall for both


I hope my code didn't make anyone's head hurt (pav).  ;-) I've thought of going
through and making it even faster, since it still shows up as an appreciable %.
 It would be pretty easy to extend to 32-bit chunks, and perhaps more
importantly I could gather runs of 0xff's and do larger memcpy's.  I'm not sure
these would produce a measurable improvement, however.

I'm ready to checkin once r/sr are in place.
Note: 1st patch applied, second not applied.
The "reduce work in DrawToImage" patch posted earlier
( appears to do
the opposite; even with the other 2 patches applied it raises CPU use from
11-14% to 45-50% (measured via 'top' using the testcase).  So I think we should
leave that one out for now, and just put in my patch.  Perhaps we can revisit
things later.  Tor might well fix some of this in his GTK rewrite.

As per the obvious, the biggest win would be to not do any (or much) of this
work for non-visible areas.
Interesting - do you have a way of measuring frame rate before/after the change?
Comment on attachment 70330 [details] [diff] [review]
Patch to majorly speed up DrawToImage

Attachment #70330 - Flags: superreview+
rate: I haven't put anything in.  I could, but honestly I don't care how close
to 0ms we can animate.  I do care how much CPU time animating uses - my
measurements of the testcase (big 10ms transparent anim, which has a lot of area
with mixed (not 0 or ff) mask bytes, showed a 85-90% total reduction in CPU use.
 I don't think we can do a lot more without making it not try to mask/draw
non-visible anims.  At a 10ms anim rate, nothing visible on-screen, we used 8%
of one CPU of a dual-450 PIII.  We used to use more than 60%.

We also need to approve 125137 to stop trying to animate at 0ms - we can argue
if it should max out at 10, 16, 20ms.  I can't see any reason to animate faster
than 10ms in our system.
Comment on attachment 70330 [details] [diff] [review]
Patch to majorly speed up DrawToImage

Do the other platforms need similar unrolling? Seems like there should be a way
to structure this based on the processor's preferred "native" register size.
For example, on Mac I'm thinking this could be even more intersting when using
the AltiVec ney Velocity Engine vector unit registers in full 128 bit mode.
Attachment #70330 - Flags: review+
a=roc+moz for 0.9.9

This isn't actually unrolling here. This is just smarter code. Per-processor
vectorized code for this and similar loops (e.g., in nsBlender) would be very
interesting but should definitely be spun off into a separate bug. We'll always
want a reasonably easy to understand, reasonably fast C path anyway.
Fix checked in.  If someone want to try to rehabilitate the other DrawToImage
patch (which _seemed_ to slow things down for me), feel free to reopen or add
another bug.
Closed: 18 years ago
Resolution: --- → FIXED
Does this fix the bug I filed a bit ago? If not, could the same fix be applied
to 120154's case?

Bug 120154 - animated GIF eating cpu when mozilla in another tab or minimized
Hmm.  resource:///res/samples/gear1.gif simply stopped
animating in Mozilla very recently.  Side-effect of this
That resource url animates fine for me with the patch, just tried it.
Sorry for the false alarm -- it actually looks like it's
tor's server-side scaling code that I'm testing in my tree
that caused gears1.gif to stop animating.  I'll comment
over on that bug.
Re-ran jprof with build from midday today.  99.5% of hits are in poll(), 0.4% in
DrawToImage() when the testcase is offscreen.  Verified.  Bigtime.  :-)  (tor's
changes may be involved in this too)
Hm.... I still see images in _minimized_ mozilla windows taking up lots of cpu.
Was this bug supposed to address that, or should I open a new one?
This was not directly targeted at "minimize", though it would have an effect on
that as well.  Note that this patch didn't stop us from updating invisible
images; it merely reduces the overhead of masking/drawing transparent anims
(which was way higher than it should have been).  It has no effect on
non-transparent anims, for example.  Bug 125137 would help with reducing anim
overhead in general.
Ah, OK.  I was dealing with a non-transparent gif.  :)
So, really, this was a good optimization but it didn't technically
address the bug in hand?

Ie. it helps with 'animated gifs use large amounts of CPU' but
not 'animated gifs use large amounts of CPU when not visible'
except as a side-effect of making (transparent) animated GIFs
faster overall.

How does moz avoid draws to updated images which are
(scrolled) off-screen?  Does it bother, at the client side?  If
it just sets a clip-region and lets X11 clip the image at the
server side then each frame still has to be dithered/remapped
and go 'over the wire', IIRC.

The fact that we still animate when not visible is bug 120154.  I'll post a
jprof there.
You need to log in before you can comment on or make changes to this bug.