Animated gifs use large amounts of CPU when not visible

VERIFIED FIXED in mozilla0.9.9

Status

()

--
major
VERIFIED FIXED
17 years ago
11 months ago

People

(Reporter: jesup, Assigned: jesup)

Tracking

({perf, testcase})

Trunk
mozilla0.9.9
perf, testcase
Points:
---
Dependency tree / graph

Firefox Tracking Flags

(Not tracked)

Details

Attachments

(7 attachments, 4 obsolete attachments)

(Assignee)

Description

17 years ago
When an animated gif is not visible (obscured or scrolled out of the visible
page) they still can cause major CPU use.  The example I'm using (from bug
86319) uses 60-70% cpu on a 233MMX even when it's not visible.  IE uses close to
0% when it's not visible.  (Both use 100% when it is visible on that system:
233MMX, 64MB, older cheap S3? video card.)  I'll upload my testcase and get a jprof.
(Assignee)

Comment 1

17 years ago
Created attachment 69045 [details]
Testcase
(Assignee)

Updated

17 years ago
Keywords: mozilla1.0, perf, testcase
(Assignee)

Comment 2

17 years ago
Created attachment 69046 [details]
jprof of the page (anim not visible)

It also includes loading the page, but that's down in the noise.  84%+ in
nsImageGTK::DrawToImage() and imgContainer::BuildCompositeMask().  Are we
really rebuilding the mask on every frame?
tor, aren't you in the middle of whacking that code?
(Assignee)

Comment 4

17 years ago
It isn't our actual problem, but since I was looking at the CompositeMask code,
I tweaked it for a speed improvement.  Rough calculation using jprof before and
after is that it's about 16% faster.  And yes, we are building the masks every time.
(Assignee)

Comment 5

17 years ago
Created attachment 69091 [details] [diff] [review]
small perf patch to CompositeMask

This should speed up the mask generation some.	Nothing fancy, just hoisting
things out of loops, and avoiding calculations of indexes and masks on the fly
- use pointers and shifts of mask vars instead.

Easier to read after applying; I'll post a -w -u as well.  This does NOT solve
the basic problem of this bug; it may help some.  Rough (very) calulation of
improvement is 16%; note that in this case CPU use remains at around 100% (it
just spins faster).
(Assignee)

Comment 6

17 years ago
Created attachment 69092 [details] [diff] [review]
-w -u patch - not actually much more readable

Comment 7

17 years ago
Comment on attachment 69091 [details] [diff] [review]
small perf patch to CompositeMask

r=pavlov
Attachment #69091 - Flags: review+
(Assignee)

Comment 8

17 years ago
The basic problem, as per expectations, is that on wakeup, we're calling DrawTo
to draw the next frame, and then calling BuildCompositeMask - all without any
idea if the frame is visible.  Both DrawToImage() and BuildCompositeMask are
expensive due to transparency issues.  Sounds like we either need to clip (may
be hard) or do a visibility check (easier I'd think), or find a way to cache the
alpha data, or suspend the actual animation when not visible except for keeping
the frame count (and correctly rebuild the on-screen data when it scrolls back
on screen! - not trivial).

If we totally suspend anims that are off-screen, there will probably be some
objections, though I don't think it violates any specs (it might, I suppose).  

Comment 9

17 years ago
Created attachment 69176 [details] [diff] [review]
reduce work in nsImageGTK::DrawToImage

Comment 10

17 years ago
Yes, I'm reworking a fair bit of nsImageGTK code.  I was hoping to move all the
work of nsImageGTK to the server side, but since imgContainer is fiddling with
the bits (image and alpha) inside a Lock/Unlock it can't be done.

Attached is an (untested) subset of my 121015 patch which removes some useless
work from DrawToImage and probaly makes it more correct (we'll always pick up
the locked bits modifications, which I think we could miss with the current
setup).

What some people have been talking about is caching the GIF frames
post-composite so that this expensive work only needs to be done once.
(Assignee)

Comment 11

17 years ago
Created attachment 69370 [details] [diff] [review]
rewritten patch

Updated patch.	Now uses almost no time for CompositeMask.  Removed all
bit-twiddling from the common cases (no offset between alpha and overlay).  Do
multiple bits at a time (8 at a time for no alpha/overlay offset).  Don't write
to  alpha if overlay bits are 0x00.

Much, much much faster; probably close to optimal.  jprof with the patch shows
99% of CPU use is now in DrawToImage, 0 to 1% in CompositeMask.  Tried with
anim with colored background to make sure it was working correctly.

Also corrects a bug in previous patch when overlayXoffset != 0 (missing shift).
Attachment #69091 - Attachment is obsolete: true
Attachment #69092 - Attachment is obsolete: true

Comment 12

17 years ago
--> Randell  :-)  who is doing much appreciated work.
Assignee: pavlov → rjesup
Target Milestone: --- → mozilla0.9.9
(Assignee)

Comment 13

17 years ago
r=/sr= for the patch?  Again, this still won't solve the problem with
DrawImageTo(), but it will speed up transparent anims in all cases including
reducing the overhead when they're offscreen.
Keywords: patch

Comment 14

17 years ago
Comment on attachment 69370 [details] [diff] [review]
rewritten patch

r=pavlov
Attachment #69370 - Flags: review+
(Assignee)

Updated

17 years ago
Blocks: 119597
Comment on attachment 69370 [details] [diff] [review]
rewritten patch

>+        for(PRUint32 i=0; 
>+            i < height;
>+            i++) {

Nit: put this on one line, it's a canonical for loop that fits.  The for loop
header it replaces didn't fit, but it's history and cvs diff shows it far
before this replacement.

>+
>+          PRUint8 pixels;
>+          PRUint32 j;
>+          // use locals to avoid keeping track of how much we need to add
>+          // at the end of a line.  we don't really need this since we may
>+          // be able to calculate the ending offsets, but it's simpler and
>+          // cheap.
>+          PRUint8 *localOverlay = overlayLine-1;
>+          PRUint8 *localAlpha   = alphaLine;
>+
>+          for (j = 0; j < width; j += 8) {
>+            // don't do in for(...) to avoid reference past end of buffer
>+            pixels = *++localOverlay;

So you're subtracting overlayLine - 1 before the loop, then preincrementing,
for an extra sub and as many adds as would obtain if you just post-incremented.

There's no inefficiency with post-incrementing vs. pre- here -- addi;load or
load;addi on RISCy architectures.  Please correct me if I'm wrong in Intel.

>+
>+            if (pixels == 0) {
>+              // no bits to set - iterate and bump output pointer
>+              ++localAlpha;
>+            }
>+            else {
>+              // for the last few bits, we need to special-case it

"last few" or "first few"?  /me is confused.

>+              if (j >= 8) {
>+                if (mask_offset == 0) {
>+                  // simple case, no offset
>+                  *localAlpha++ |= pixels;
>+                }
>+                else {
>+                  *localAlpha++ |= (pixels >> mask_offset);
>+                  *localAlpha   |= (pixels << (8U-mask_offset));
>+                }
>+              }

The above case (inside the (j >= 8) condition) could be the body of this loop,
if you hoist this code:
>+              else {
>+                // last few bits have to be handled more carefully if
>+                // width is not a multiple of 8.
>+
>+                // set bits we don't want to change to 0
>+                pixels = (pixels >> (8U-j)) << (8U-j);
>+                *localAlpha++ |= (pixels >> mask_offset);
>+                // don't touch this byte unless we have bits for it
>+                if (j > (8U - mask_offset))
>+                  *localAlpha |= (pixels << (8U-mask_offset));
>+              }

Then the loop's smaller, has only one (predicted taken, I hope) branch around
the (pixels == 0) ++localAlpha code, and runs a bit faster.

>+            }
>           }
>+
>+// Windows and OS/2 have the funky bottom up data storage we need to account for
>+#if defined(XP_WIN) || defined(XP_OS2)
>+          alphaLine   -= abprComposite;
>+          overlayLine -= abprOverlay;
>+#else
>+          alphaLine   += abprComposite;
>+          overlayLine += abprOverlay;
>+#endif
>         }
>       }
>-
>       break;
>     default:
>       break;

If you could unroll the j < 8 case and re-test, I'll sr= quickly.  Thanks,

/be

Comment 16

17 years ago
Nice patch, thanks!
(Assignee)

Comment 17

17 years ago
>So you're subtracting overlayLine - 1 before the loop, then preincrementing,
>for an extra sub and as many adds as would obtain if you just post-incremented.
>There's no inefficiency with post-incrementing vs. pre- here -- addi;load or
>load;addi on RISCy architectures.  Please correct me if I'm wrong in Intel.

It was a left-over from an intermediate form of the patch. Fixed.

>+              // for the last few bits, we need to special-case it
>+              if (j >= 8) {
>
>"last few" or "first few"?  /me is confused.

   Last few bits of a line.  This is a bug in the logic of the code; at one
point I reversed the value of j to be decrementing but didn't reverse the for()
statement.  It should be:

          for (j = width; j > 0; j -= 8) {

(otherwise the handling of partials isn't correct, due to the (8U-j) term).

Revised, and retesting.  I'll upload after running some tests.

(Assignee)

Comment 18

17 years ago
Perf tests show 0% in BuildCompositeMask for not-visible (all CPU time is now in
DrawToImage, used to be 50% BuildCompositeMask), 1.2% when it's visible (used to
be >20%).  Looks pretty good to me.  Patch to be attached shortly.
(Assignee)

Comment 19

17 years ago
Created attachment 70147 [details] [diff] [review]
Updated patch, tested heavily

Updated patch as per comments.	Also fixes incorrect mask_offset calculation. 
For a good testcase, see http://www.agag.com/gallery/toon/finnyluv.shtml (which
has offsets that vary from frame to frame).

r=/sr= please, and lets get it in.
Attachment #69370 - Attachment is obsolete: true
Comment on attachment 70147 [details] [diff] [review]
Updated patch, tested heavily

Carrying r= forward.

sr=brendan@mozilla.org if you use the prevailing if-{ style (the { on the same
line as the if, not on its own line).

/be
Attachment #70147 - Flags: superreview+
Attachment #70147 - Flags: review+
(Assignee)

Comment 21

17 years ago
Patch for imgContainer.cpp checked in.

I have a patch that does the same sort of hackery to DrawToImage to optimize
drawing transparent anims.  This should (mostly) solve the other half of the
equation, especially when combined with bug 125157 (limit animation rates - no
infinitely fast anims).  The patch is working but still has 1 bug I need to
ferret out for tomorrow.  That should solve the other bad actor in the heavy CPU
use for transparent anims.
Keywords: mozilla1.0 → mozilla0.9.9
(Assignee)

Comment 22

17 years ago
Sorry, I meant bug 125137 for the animation-refresh-limiting bug.  Marking this
as dependent on that (since speeding up these functions merely wastes cpu faster
unless we limit the rate we try to animate).  Also, we still need to try to
shortcut all of this masking/drawing if the anim isn't visible.
Depends on: 125137
(Assignee)

Comment 23

17 years ago
Created attachment 70330 [details] [diff] [review]
Patch to majorly speed up DrawToImage

Similar rewrite to DrawToImage to avoid bit-twiddling.	We should also apply
the patch #69176 - it's not integrated here.  I can either provide an updated
copy of that after we get this patch in, or I can merge them if people think it
won't cause any danger of not getting this fix in.

Should greatly reduce CPU use in DrawToImage.  Tested heavily using similar
testcases as per above.

r/sr ASAP so this can make 0.9.9 please

Comment 24

17 years ago
I'm wondering if lookup tables might further improve speed. For example, 
where you have 
  alphaPixels &= 0xff << (8 - (aDWidth-x));
the right hand side presumable has only 8 possible values so it could be
  alphaPixels &= maskvals1[aDWidth-x];
There are other similar places.

Just an idea.
(Assignee)

Comment 25

17 years ago
Maybe an array would be cheaper - but probably not, since you'd introduce a
memory read into the loop.  Right now mask can live entirely in a register, and
mask >>= 1 is cheap.  Memory references are bad, tight register loops are good
(generally).

I'll remove the NULL and substitute nsnull (I was just silencing one of many
warnings already in the code) before I check in.
Memory is red-shifting away from the super-scalar pipeline on modern CPUs.  Even
on-CPU cache memory may take multiple cycles to access.  So i agree with rjesup
in general.  The Intel P4 seems, however, to have penalized non-constant shifts,
and we can't turn them into multiplies easily here.  But don't prematurely
optimize for P4, I say.  Make the compiler sweat it, measure first, etc.

Pav, can you r= (test if not read the code :-)?

/be
(Assignee)

Comment 27

17 years ago
Realtime jprof of the testcase, with the anim not visible.  The anim has a
repeat rate of 10ms (100Hz), which in this test was not limited (bug 125137 not
in place).  Note, dual-450 P3 Linux, remote X display (no bits are actually
being drawn).

Before latest patch, 25% hits in DrawToImage (75% poll)
After latest patch, 8% hits in DrawToImage, 90+% poll.

Before either this patch or the other one already committed, non-poll hits were
>50%.

Looks like overhead is down 66% for DrawToImage, 80-90% down overall for both
patches.

Yum.

I hope my code didn't make anyone's head hurt (pav).  ;-) I've thought of going
through and making it even faster, since it still shows up as an appreciable %.
 It would be pretty easy to extend to 32-bit chunks, and perhaps more
importantly I could gather runs of 0xff's and do larger memcpy's.  I'm not sure
these would produce a measurable improvement, however.

I'm ready to checkin once r/sr are in place.
(Assignee)

Comment 28

17 years ago
Created attachment 70368 [details]
jprof before latest patch (realtime, dual-450 P3 linux) of Testcase

Note: 1st patch applied, second not applied.
(Assignee)

Comment 29

17 years ago
Created attachment 70369 [details]
real-time jprof, after DrawToImage patch
(Assignee)

Comment 30

17 years ago
The "reduce work in DrawToImage" patch posted earlier
(http://bugzilla.mozilla.org/attachment.cgi?id=69176&action=view) appears to do
the opposite; even with the other 2 patches applied it raises CPU use from
11-14% to 45-50% (measured via 'top' using the testcase).  So I think we should
leave that one out for now, and just put in my patch.  Perhaps we can revisit
things later.  Tor might well fix some of this in his GTK rewrite.

As per the obvious, the biggest win would be to not do any (or much) of this
work for non-visible areas.

Comment 31

17 years ago
Interesting - do you have a way of measuring frame rate before/after the change?

Comment 32

17 years ago
Comment on attachment 70330 [details] [diff] [review]
Patch to majorly speed up DrawToImage

sr=tor
Attachment #70330 - Flags: superreview+
(Assignee)

Comment 33

17 years ago
rate: I haven't put anything in.  I could, but honestly I don't care how close
to 0ms we can animate.  I do care how much CPU time animating uses - my
measurements of the testcase (big 10ms transparent anim, which has a lot of area
with mixed (not 0 or ff) mask bytes, showed a 85-90% total reduction in CPU use.
 I don't think we can do a lot more without making it not try to mask/draw
non-visible anims.  At a 10ms anim rate, nothing visible on-screen, we used 8%
of one CPU of a dual-450 PIII.  We used to use more than 60%.

We also need to approve 125137 to stop trying to animate at 0ms - we can argue
if it should max out at 10, 16, 20ms.  I can't see any reason to animate faster
than 10ms in our system.

Comment 34

17 years ago
Comment on attachment 70330 [details] [diff] [review]
Patch to majorly speed up DrawToImage

r=saari
Do the other platforms need similar unrolling? Seems like there should be a way
to structure this based on the processor's preferred "native" register size.
For example, on Mac I'm thinking this could be even more intersting when using
the AltiVec ney Velocity Engine vector unit registers in full 128 bit mode.
Attachment #70330 - Flags: review+
a=roc+moz for 0.9.9

This isn't actually unrolling here. This is just smarter code. Per-processor
vectorized code for this and similar loops (e.g., in nsBlender) would be very
interesting but should definitely be spun off into a separate bug. We'll always
want a reasonably easy to understand, reasonably fast C path anyway.
(Assignee)

Comment 36

17 years ago
Fix checked in.  If someone want to try to rehabilitate the other DrawToImage
patch (which _seemed_ to slow things down for me), feel free to reopen or add
another bug.
Status: NEW → RESOLVED
Last Resolved: 17 years ago
Resolution: --- → FIXED

Comment 37

17 years ago
Does this fix the bug I filed a bit ago? If not, could the same fix be applied
to 120154's case?

Bug 120154 - animated GIF eating cpu when mozilla in another tab or minimized

Comment 38

17 years ago
Hmm.  resource:///res/samples/gear1.gif simply stopped
animating in Mozilla very recently.  Side-effect of this
fix?
(Assignee)

Comment 39

17 years ago
That resource url animates fine for me with the patch, just tried it.

Comment 40

17 years ago
Sorry for the false alarm -- it actually looks like it's
tor's server-side scaling code that I'm testing in my tree
that caused gears1.gif to stop animating.  I'll comment
over on that bug.
Keywords: mozilla0.9.9 → mozilla0.9.9+
(Assignee)

Comment 41

17 years ago
Re-ran jprof with build from midday today.  99.5% of hits are in poll(), 0.4% in
DrawToImage() when the testcase is offscreen.  Verified.  Bigtime.  :-)  (tor's
changes may be involved in this too)
Status: RESOLVED → VERIFIED
Hm.... I still see images in _minimized_ mozilla windows taking up lots of cpu.
Was this bug supposed to address that, or should I open a new one?
(Assignee)

Comment 43

17 years ago
This was not directly targeted at "minimize", though it would have an effect on
that as well.  Note that this patch didn't stop us from updating invisible
images; it merely reduces the overhead of masking/drawing transparent anims
(which was way higher than it should have been).  It has no effect on
non-transparent anims, for example.  Bug 125137 would help with reducing anim
overhead in general.
Ah, OK.  I was dealing with a non-transparent gif.  :)

Comment 45

17 years ago
So, really, this was a good optimization but it didn't technically
address the bug in hand?

Ie. it helps with 'animated gifs use large amounts of CPU' but
not 'animated gifs use large amounts of CPU when not visible'
except as a side-effect of making (transparent) animated GIFs
faster overall.

How does moz avoid draws to updated images which are
(scrolled) off-screen?  Does it bother, at the client side?  If
it just sets a clip-region and lets X11 clip the image at the
server side then each frame still has to be dithered/remapped
and go 'over the wire', IIRC.

(Assignee)

Comment 46

17 years ago
The fact that we still animate when not visible is bug 120154.  I'll post a
jprof there.
Comment hidden (mozreview-request)
Attachment #8914375 - Attachment is obsolete: true
Attachment #8914375 - Flags: review?(gasolin)
You need to log in before you can comment on or make changes to this bug.