Bug 744309 - Random 100% CPU usage, due to image decoding
Status: RESOLVED FIXED
Whiteboard: [qa-]
Product: Core
Classification: Components
Component: ImageLib
Version: Trunk
Hardware: x86 All
Severity: major
Target Milestone: mozilla18
Assigned To: Justin Lebar (not reading bugmail)
Reported: 2012-04-10 21:32 PDT by Boris Zbarsky [:bz]
Modified: 2012-12-02 11:18 PST
CC: 14 users
Attachments
Patch, v1 (3.37 KB, patch)
2012-09-10 19:01 PDT, Justin Lebar (not reading bugmail)
joe: review+
akeybl: approval-mozilla-aurora+
akeybl: approval-mozilla-beta-

Description Boris Zbarsky [:bz] 2012-04-10 21:32:20 PDT
I know this is a crappy bug report, so please bear with me.

The last week or two my nightly (shark branch) builds have, every so often, started chewing up 100% of a CPU core; it would keep happening until I restarted the build.  Today I finally managed to get a profile of it when it happened.  The profile showed about 50% of the time under mozilla::image::imgDecodeRequestor::Run.  The rest was handling all the mac event loop gunk to deal with the resulting paint events from the invalidates.

The image(s) in question seem to be GIFs based on the profiler stacks.

While this is happening, memory usage slowly grows (where by "slowly" I mean about 0.5MB/s or so).

I have the profiles still open in case there are questions I can answer.

Nominating for tracking, because whatever the heck is going on we _really_ don't want to ship this.  :(
Comment 1 Bobby Holley (PTO through June 13) 2012-04-10 22:38:48 PDT
This rings a bell with bug 583069. I don't think that explains why we're thrashing, though.

This could happen if mInDecoder somehow got stuck to true, since then each call into RequestDecode would do an async dispatch, forever. But it's not clear to me how that could happen modulo memory corruption...
Comment 2 Justin Lebar (not reading bugmail) 2012-04-10 23:23:16 PDT
> This could happen if mInDecoder somehow got stuck to true, since then each call into RequestDecode 
> would do an async dispatch, forever.

But then we wouldn't decode anything, right?  We'd just spin and see that there's nothing to do?

This could also happen if we try to draw an image which isn't locked over and over.  The image isn't locked, we decode it, then we toss it (very quickly these days under some circumstances).

Next time this happens, you could try making image.mem.max_decoded_image_kb large.  If that immediately stops the loop, then we're thrashing there, somehow.
Comment 3 Scott Johnson (:jwir3) 2012-04-11 09:18:11 PDT
bz: Is this _only_ happening on trunk (shark branch)? Or have you noticed it in a release/beta/aurora?
Comment 4 Boris Zbarsky [:bz] 2012-04-11 09:58:12 PDT
I don't really browse with release/beta/aurora, so I have no idea whether it's happening there....

Justin, the next time it happens (likely to be a day or two), I'll try what you suggest.
Comment 5 Boris Zbarsky [:bz] 2012-04-11 19:58:17 PDT
Happened again.  I tried bumping that pref up by 100x.  No effect.  :(
Comment 6 Boris Zbarsky [:bz] 2012-04-17 13:03:29 PDT
This has been happening to me pretty reliably, averaging about once every 36 hours or so.  :(
Comment 7 Justin Lebar (not reading bugmail) 2012-04-17 17:01:16 PDT
I'm not sure what we should be doing here.  If you can catch it in a debugger and see what's causing the decodes, that would help, I guess.

It sure sounds like we're decoding and then immediately discarding an image.  If it's not because of image.mem.max_decoded_image_kb, I don't know what it would be due to.  You've been browsing with max_decoded_image_kb set high and still see this problem (as opposed to increasing the pref while you're infinite-looping)?
Comment 8 Boris Zbarsky [:bz] 2012-04-17 17:29:11 PDT
I can do that if I try browsing in a debug build, but that's been an exercise in pain when I've tried it recently...  The nightlies we ship, including the shark ones, don't seem to have usable debugging symbols that gdb can use.  If such symbols do exist, _please_ tell me!

> You've been browsing with max_decoded_image_kb set high

No.  I can try that if you want.  Let me know.
Comment 9 Justin Lebar (not reading bugmail) 2012-04-17 17:32:42 PDT
> You've been browsing with max_decoded_image_kb set high
>
> No.  I can try that if you want.  Let me know.

I can't speak to the debug symbols (our shark builds really *should* ship with symbols!), but yes, if we can just increase this max_decoded_image_kb pref, we should do so.  (See also bug 746055 comment 9.)
Comment 10 Boris Zbarsky [:bz] 2012-04-17 17:39:16 PDT
> our shark builds really *should* ship with symbols

They have function names, but nothing else.
Comment 11 Boris Zbarsky [:bz] 2012-04-17 17:41:02 PDT
OK.  Bumped the pref to 2000000 and restarted.  We'll see how it goes.
Comment 12 Justin Lebar (not reading bugmail) 2012-04-22 21:29:06 PDT
It's possible that we're touching the discard logic off main thread on Mac (we know we're doing it on Windows).  If so, the discard tracker's linked list could become corrupted, or our count of how much memory is devoted to allocations could go negative.  Either way, that would be Bad.  See bug 745662.
Comment 13 Boris Zbarsky [:bz] 2012-04-22 21:31:36 PDT
For what it's worth, since comment 11 it's been 6 days and the problem hasn't come up yet....
Comment 14 Bobby Holley (PTO through June 13) 2012-04-23 02:48:24 PDT
(In reply to Justin Lebar [:jlebar] from comment #12)
> It's possible that we're touching the discard logic off main thread on Mac
> (we know we're doing it on Windows).  If so, the discard tracker's linked
> list could become corrupted, or our count of how much memory is devoted to
> allocations could go negative.  Either way, that would be Bad.  See bug
> 745662.

Maybe we should reland that assert after fixing the peek thing?
Comment 15 Alex Keybl [:akeybl] 2012-05-02 15:48:02 PDT
(In reply to Justin Lebar [:jlebar] from comment #9)
> I can't speak to the debug symbols (our shark builds really *should* ship
> with symbols!), but yes, if we can just increase this max_decoded_image_kb
> pref, we should do so.  (See also bug 746055 comment 9.)

Given Comment 13, can we move forward with this fix?
Comment 16 Justin Lebar (not reading bugmail) 2012-05-02 23:18:01 PDT
> Given Comment 13, can we move forward with this fix?

I bumped the pref to 250MB in bug 746055.

That's a wallpaper fix, though, so I don't think we should close this bug.  I presume we can still get into this state -- whatever it is, somehow.

We could bump the pref higher, but it's serving a purpose (preventing OOMs), so I'm not sure we want to make it infinity.

My plan is to revisit this after we fix bug 745141.  Once that bug is fixed, we can add release-time main-thread assertions to the discard logic, so we can ensure that we're not hitting another race condition there.  With that ruled out, I'll be able to dig in here, I hope.

I've also been browsing with the pref set very low, in the hopes of triggering this bug, but so far, no dice.
Comment 17 Justin Lebar (not reading bugmail) 2012-05-02 23:27:01 PDT
With this bug wallpapered over, I'm OK no longer tracking for FF14, unless we hear more reports of this happening to people.
Comment 18 Boris Zbarsky [:bz] 2012-05-03 14:48:43 PDT
That seems totally reasonable to me.
Comment 19 Joe Drew (not getting mail) 2012-05-14 15:20:13 PDT
Justin, if you believe this is unaffected for 14, can you mark this bug as such?
Comment 20 Justin Lebar (not reading bugmail) 2012-05-14 15:33:07 PDT
I think this is fixed for FF14 by the patch for Aurora in bug 745141, yes.

The story for FF15 is cloudier; we're basically hoping that the 250mb cap is big enough not to trigger this problem.  Unfortunately I don't have room on my plate to dig in deeper anytime soon; b2g is eating me alive.  We could just disable the max-image cap everywhere.  Or we could hope that 250mb is good enough.
Comment 21 Justin Lebar (not reading bugmail) 2012-08-02 22:19:34 PDT
Has anyone seen this happen lately?  It's really a bad bug, and I'm afraid I only papered over it...
Comment 22 Boris Zbarsky [:bz] 2012-08-02 22:34:08 PDT
I haven't seen this recur, no...
Comment 23 Robert Lickenbrock [:rclick] 2012-08-06 15:21:59 PDT
I can sometimes see this on Windows with image.mem.max_decoded_image_kb set to 1, but I don't have a definitive way of reproducing beyond browsing around and hoping it happens.

I got a profile of it happening at [1]. It looks like roughly half the time is spent under either imgDecodeRequestor::Run or DiscardRunnable::Run.

[1] https://people.mozilla.com/~bgirard/cleopatra/?report=8d36a4b437652a22391d4c2ca40f745d34274b2f
Comment 24 Justin Lebar (not reading bugmail) 2012-09-10 13:43:23 PDT
I just saw this happening on my Linux box (where I have image.mem.max_decoded_image_kb set to very low, at 256, in an attempt to trigger this bug).

24% of my time is spent in nsGIFDecoder2::DoLzw.  Here's the stack (although it's kind of obvious):

            --- mozilla::image::nsGIFDecoder2::DoLzw(unsigned char const*)
               |          
               |--99.85%-- mozilla::image::nsGIFDecoder2::WriteInternal(char const*, unsigned int)
               |          mozilla::image::RasterImage::WriteToDecoder(char const*, unsigned int)
               |          mozilla::image::RasterImage::DecodeWorker::DecodeSomeOfImage(mozilla::image::RasterImage*, mozilla::image::RasterImage::DecodeWorker::DecodeType)
               |          mozilla::image::RasterImage::DecodeWorker::DecodeABitOf(mozilla::image::RasterImage*)
               |          mozilla::image::RasterImage::RequestDecode()
               |          mozilla::image::imgDecodeRequestor::Run()
               |          nsThread::ProcessNextEvent(bool, bool*)
               |          NS_ProcessNextEvent_P(nsIThread*, bool)
               |          mozilla::ipc::MessagePump::Run(base::MessagePump::Delegate*)
               |          MessageLoop::Run()
               |          nsBaseAppShell::Run()
               |          nsAppStartup::Run()
               |          XREMain::XRE_mainRun()
               |          XREMain::XRE_main(int, char**, nsXREAppData const*)
               |          XRE_main
               |          main
Comment 25 Justin Lebar (not reading bugmail) 2012-09-10 14:34:53 PDT
I caught this in gdb.  It looks like we're calling RequestDecode on the same image repeatedly.  On my machine, I actually have three or four images which are all getting repeatedly RequestDecoded()'ed.

In GDB, I see four calls in a row to RequestDecode() for the same image.  Then we move on to the next image.  Here are backtraces for one set of four calls:

== First hit ==

> Breakpoint 1, mozilla::image::RasterImage::RequestDecode (this=0x7f2c5be98b60)
>     at /builds/slave/m-cen-lnx64-ntly/build/image/src/RasterImage.cpp:2416
> 2416	  if (mError)
> (gdb) bt
> #0  mozilla::image::RasterImage::RequestDecode (this=0x7f2c5be98b60)
>     at /builds/slave/m-cen-lnx64-ntly/build/image/src/RasterImage.cpp:2416
> #1  0x00007f2c93321db5 in mozilla::image::imgDecodeRequestor::Run() () from /home/jlebar/code/moz/nightly/libxul.so
> #2  0x00007f2c93ec841c in nsThread::ProcessNextEvent(bool, bool*) () from /home/jlebar/code/moz/nightly/libxul.so
> #3  0x00007f2c93e92b3c in NS_ProcessNextEvent_P(nsIThread*, bool) () from /home/jlebar/code/moz/nightly/libxul.so
> #4  0x00007f2c93b751db in nsXULWindow::ShowModal() () from /home/jlebar/code/moz/nightly/libxul.so
> #5  0x00007f2c93b51f7a in nsWindowWatcher::OpenWindowInternal(nsIDOMWindow*, char const*, char const*, char const*, bool, bool, bool, nsIArray*, nsIDOMWindow**) () from /home/jlebar/code/moz/nightly/libxul.so
> 
Looks like we're in a nested event loop; I don't know if that's relevant.

This call does InitDecoder(/* aDoSizeDecode = */ false), then returns.

== Second hit ==

> Breakpoint 1, mozilla::image::RasterImage::RequestDecode (this=0x7f2c5be98b60)
>     at /builds/slave/m-cen-lnx64-ntly/build/image/src/RasterImage.cpp:2416
> 2416	  if (mError)
> (gdb) bt
> #0  mozilla::image::RasterImage::RequestDecode (this=0x7f2c5be98b60)
>     at /builds/slave/m-cen-lnx64-ntly/build/image/src/RasterImage.cpp:2416
> #1  0x00007f2c9331d1d1 in WantDecodedFrames (this=0x7f2c5be98b60)
>     at /builds/slave/m-cen-lnx64-ntly/build/image/src/RasterImage.cpp:2406
> #2  mozilla::image::RasterImage::GetImgFrame (this=0x7f2c5be98b60, framenum=0)
>     at /builds/slave/m-cen-lnx64-ntly/build/image/src/RasterImage.cpp:590
> #3  0x00007f2c933200a8 in mozilla::image::RasterImage::SetFrameDisposalMethod(unsigned int, int) ()
>    from /home/jlebar/code/moz/nightly/libxul.so
> #4  0x00007f2c93337c6c in mozilla::image::nsGIFDecoder2::BeginImageFrame (this=0x7f2c2e563000, 
>     aDepth=<optimized out>) at /builds/slave/m-cen-lnx64-ntly/build/image/decoders/nsGIFDecoder2.cpp:202
> #5  0x00007f2c93338648 in mozilla::image::nsGIFDecoder2::WriteInternal (this=0x7f2c2e563000, 
>     aBuffer=<optimized out>, aCount=<optimized out>)
>     at /builds/slave/m-cen-lnx64-ntly/build/image/decoders/nsGIFDecoder2.cpp:922
> #6  0x00007f2c9331f2bb in mozilla::image::RasterImage::WriteToDecoder(char const*, unsigned int) ()
>    from /home/jlebar/code/moz/nightly/libxul.so
> #7  0x00007f2c9331f58b in mozilla::image::RasterImage::DecodeWorker::DecodeSomeOfImage(mozilla::image::RasterImage*, mozilla::image::RasterImage::DecodeWorker::DecodeType) () from /home/jlebar/code/moz/nightly/libxul.so
> #8  0x00007f2c9331d860 in mozilla::image::RasterImage::DecodeWorker::DecodeABitOf (this=0x7f2c73ebfa10, 
>     aImg=0x7f2c5be98b60) at /builds/slave/m-cen-lnx64-ntly/build/image/src/RasterImage.cpp:2904
> #9  0x00007f2c9331d020 in mozilla::image::RasterImage::RequestDecode (this=0x7f2c5be98b60)
>     at /builds/slave/m-cen-lnx64-ntly/build/image/src/RasterImage.cpp:2467
> #10 0x00007f2c93321db5 in mozilla::image::imgDecodeRequestor::Run() () from /home/jlebar/code/moz/nightly/libxul.so
> #11 0x00007f2c93ec841c in nsThread::ProcessNextEvent(bool, bool*) () from /home/jlebar/code/moz/nightly/libxul.so
> #12 0x00007f2c93e92b3c in NS_ProcessNextEvent_P(nsIThread*, bool) () from /home/jlebar/code/moz/nightly/libxul.so
> #13 0x00007f2c93b751db in nsXULWindow::ShowModal() () from /home/jlebar/code/moz/nightly/libxul.so
> #14 0x00007f2c93b51f7a in nsWindowWatcher::OpenWindowInternal(nsIDOMWindow*, char const*, char const*, char const*, bool, bool, bool, nsIArray*, nsIDOMWindow**) () from /home/jlebar/code/moz/nightly/libxul.so
> #15 0x00007f2c93b4e6c2 in nsWindowWatcher::OpenWindow(nsIDOMWindow*, char const*, char const*, char const*, nsISupports*, nsIDOMWindow**) () from /home/jlebar/code/moz/nightly/libxul.so

This call evaluates

    if (mDecoder && !mDecoder->IsSizeDecode())

finds that the condition is true, and returns.

== Third hit ==

> Breakpoint 1, mozilla::image::RasterImage::RequestDecode (this=0x7f2c5be98b60)
>     at /builds/slave/m-cen-lnx64-ntly/build/image/src/RasterImage.cpp:2416
> 2416	  if (mError)
> (gdb) bt
> #0  mozilla::image::RasterImage::RequestDecode (this=0x7f2c5be98b60)
>     at /builds/slave/m-cen-lnx64-ntly/build/image/src/RasterImage.cpp:2416
> #1  0x00007f2c9331d1d1 in WantDecodedFrames (this=0x7f2c5be98b60)
>     at /builds/slave/m-cen-lnx64-ntly/build/image/src/RasterImage.cpp:2406
> #2  mozilla::image::RasterImage::GetImgFrame (this=0x7f2c5be98b60, framenum=0)
>     at /builds/slave/m-cen-lnx64-ntly/build/image/src/RasterImage.cpp:590
> #3  0x00007f2c93320058 in mozilla::image::RasterImage::SetFrameTimeout(unsigned int, int) ()
>    from /home/jlebar/code/moz/nightly/libxul.so
> #4  0x00007f2c93337db6 in mozilla::image::nsGIFDecoder2::EndImageFrame (this=0x7f2c2e563000)
>     at /builds/slave/m-cen-lnx64-ntly/build/image/decoders/nsGIFDecoder2.cpp:262
> #5  0x00007f2c93338a61 in mozilla::image::nsGIFDecoder2::WriteInternal (this=0x7f2c2e563000, 
>     aBuffer=<optimized out>, aCount=<optimized out>)
>     at /builds/slave/m-cen-lnx64-ntly/build/image/decoders/nsGIFDecoder2.cpp:1017
> #6  0x00007f2c9331f2bb in mozilla::image::RasterImage::WriteToDecoder(char const*, unsigned int) ()
>    from /home/jlebar/code/moz/nightly/libxul.so
> #7  0x00007f2c9331f58b in mozilla::image::RasterImage::DecodeWorker::DecodeSomeOfImage(mozilla::image::RasterImage*, mozilla::image::RasterImage::DecodeWorker::DecodeType) () from /home/jlebar/code/moz/nightly/libxul.so
> #8  0x00007f2c9331d860 in mozilla::image::RasterImage::DecodeWorker::DecodeABitOf (this=0x7f2c73ebfa10, 
>     aImg=0x7f2c5be98b60) at /builds/slave/m-cen-lnx64-ntly/build/image/src/RasterImage.cpp:2904
> #9  0x00007f2c9331d020 in mozilla::image::RasterImage::RequestDecode (this=0x7f2c5be98b60)
>     at /builds/slave/m-cen-lnx64-ntly/build/image/src/RasterImage.cpp:2467
> #10 0x00007f2c93321db5 in mozilla::image::imgDecodeRequestor::Run() () from /home/jlebar/code/moz/nightly/libxul.so
> #11 0x00007f2c93ec841c in nsThread::ProcessNextEvent(bool, bool*) () from /home/jlebar/code/moz/nightly/libxul.so
> #12 0x00007f2c93e92b3c in NS_ProcessNextEvent_P(nsIThread*, bool) () from /home/jlebar/code/moz/nightly/libxul.so
> #13 0x00007f2c93b751db in nsXULWindow::ShowModal() () from /home/jlebar/code/moz/nightly/libxul.so
> #14 0x00007f2c93b51f7a in nsWindowWatcher::OpenWindowInternal(nsIDOMWindow*, char const*, char const*, char const*, bool, bool, bool, nsIArray*, nsIDOMWindow**) () from /home/jlebar/code/moz/nightly/libxul.so
> #15 0x00007f2c93b4e6c2 in nsWindowWatcher::OpenWindow(nsIDOMWindow*, char const*, char const*, char const*, nsISupports*, nsIDOMWindow**) () from /home/jlebar/code/moz/nightly/libxul.so

Just like the second hit, this call evaluates

    if (mDecoder && !mDecoder->IsSizeDecode())

finds that the condition is true, and returns.

== Fourth hit ==

> Breakpoint 1, mozilla::image::RasterImage::RequestDecode (this=0x7f2c5be98b60)
>     at /builds/slave/m-cen-lnx64-ntly/build/image/src/RasterImage.cpp:2416
> 2416	  if (mError)
> (gdb) bt
> #0  mozilla::image::RasterImage::RequestDecode (this=0x7f2c5be98b60)
>     at /builds/slave/m-cen-lnx64-ntly/build/image/src/RasterImage.cpp:2416
> #1  0x00007f2c9331d1d1 in WantDecodedFrames (this=0x7f2c5be98b60)
>     at /builds/slave/m-cen-lnx64-ntly/build/image/src/RasterImage.cpp:2406
> #2  mozilla::image::RasterImage::GetImgFrame (this=0x7f2c5be98b60, framenum=0)
>     at /builds/slave/m-cen-lnx64-ntly/build/image/src/RasterImage.cpp:590
> #3  0x00007f2c9331ffb8 in mozilla::image::RasterImage::SetFrameAsNonPremult(unsigned int, bool) ()
>    from /home/jlebar/code/moz/nightly/libxul.so
> #4  0x00007f2c9331c29e in mozilla::image::Decoder::PostDecodeDone (this=0x7f2c2e563000)
>     at /builds/slave/m-cen-lnx64-ntly/build/image/src/Decoder.cpp:275
> #5  0x00007f2c93337e8b in mozilla::image::nsGIFDecoder2::FinishInternal (this=0x7f2c2e563000)
>     at /builds/slave/m-cen-lnx64-ntly/build/image/decoders/nsGIFDecoder2.cpp:115
> #6  0x00007f2c9331c353 in mozilla::image::Decoder::Finish (this=0x7f2c2e563000)
>     at /builds/slave/m-cen-lnx64-ntly/build/image/src/Decoder.cpp:88
> #7  0x00007f2c9331f1d7 in mozilla::image::RasterImage::ShutdownDecoder(mozilla::image::RasterImage::eShutdownIntent) () from /home/jlebar/code/moz/nightly/libxul.so
> #8  0x00007f2c9331f66e in mozilla::image::RasterImage::DecodeWorker::DecodeSomeOfImage(mozilla::image::RasterImage*, mozilla::image::RasterImage::DecodeWorker::DecodeType) () from /home/jlebar/code/moz/nightly/libxul.so
> #9  0x00007f2c9331d860 in mozilla::image::RasterImage::DecodeWorker::DecodeABitOf (this=0x7f2c73ebfa10, 
>     aImg=0x7f2c5be98b60) at /builds/slave/m-cen-lnx64-ntly/build/image/src/RasterImage.cpp:2904
> #10 0x00007f2c9331d020 in mozilla::image::RasterImage::RequestDecode (this=0x7f2c5be98b60)
>     at /builds/slave/m-cen-lnx64-ntly/build/image/src/RasterImage.cpp:2467
> #11 0x00007f2c93321db5 in mozilla::image::imgDecodeRequestor::Run() () from /home/jlebar/code/moz/nightly/libxul.so
> #12 0x00007f2c93ec841c in nsThread::ProcessNextEvent(bool, bool*) () from /home/jlebar/code/moz/nightly/libxul.so
> #13 0x00007f2c93e92b3c in NS_ProcessNextEvent_P(nsIThread*, bool) () from /home/jlebar/code/moz/nightly/libxul.so
> #14 0x00007f2c93b751db in nsXULWindow::ShowModal() () from /home/jlebar/code/moz/nightly/libxul.so
> #15 0x00007f2c93b51f7a in nsWindowWatcher::OpenWindowInternal(nsIDOMWindow*, char const*, char const*, char const*, bool, bool, bool, nsIArray*, nsIDOMWindow**) () from /home/jlebar/code/moz/nightly/libxul.so

This call does |if (mInDecoder)|, sees that the condition is true, and then does

  nsRefPtr<imgDecodeRequestor> requestor = new imgDecodeRequestor(this);

and re-enqueues itself into the event loop!
Comment 26 Justin Lebar (not reading bugmail) 2012-09-10 15:00:04 PDT
I may understand what's going on here.

> // Simplified a bit
> RasterImage::RequestDecode()
> {
>   if (mError || mDecoded || !StoringSourceData() || (mDecoder && !mDecoder->IsSizeDecode()))
>     return;
> 
>   if (mInDecoder) {
>     nsRefPtr<imgDecodeRequestor> requestor = new imgDecodeRequestor(this);
>     return NS_DispatchToCurrentThread(requestor);
>     // asynchronously calls this->RequestDecode();
>   }

What I think is happening is, after Decoder::Finish() finishes, we're setting
neither mError nor mDecoded.

Therefore, when the imgDecodeRequestor runs, it starts a new decode.  This
process repeats ad infinitum.

Of course, we shouldn't even be spawning this imgDecodeRequestor in the first
place.  But I don't think that's the main bug.
Comment 27 Justin Lebar (not reading bugmail) 2012-09-10 15:18:06 PDT
> we're setting neither mError nor mDecoded.

Although that's pretty odd, because Decoder::PostDecodeDone (in the last stack trace in comment 25) should call RasterImage::DecodingComplete, which will ensure that either mError or mDecoded is set.

Unfortunately GDB seems to have died on me, so the above stacks are probably all we're going to get from this session.
Comment 28 Joe Drew (not getting mail) 2012-09-10 15:51:47 PDT
PostDecodeDone calls DecodingComplete _after_ calling SetFrameAsNonPremult, which is what's calling GetImgFrame that calls WantDecodedFrames that calls RequestDecode that actually re-enqueues the decode request.

However, that shouldn't have any effect if the image is actually already decoded. AFAICT the only way this should not be a no-op is if a request for the image to be discarded or for the image in a format (nonpremult vs premult, etc) that it isn't already in.

I'd be very interested to see where/whether there is a call to RasterImage::Discard for this image after the 4th RequestDecode call; that is, if you can ever reproduce this in a debugger again. :)
Comment 29 Justin Lebar (not reading bugmail) 2012-09-10 16:31:49 PDT
> or for the image in a format (nonpremult vs premult, etc) that it isn't already in.

I think you a word here, but I'm not quite sure what it is.

> I'd be very interested to see where/whether there is a call to RasterImage::Discard for this image 
> after the 4th RequestDecode call; that is, if you can ever reproduce this in a debugger again. :)

That discarding has something to do with this would make sense, since it seems that it may be easier to reproduce this issue by decreasing the max-decoded-image-bytes threshold.

I can make the imgDecodeRequestor not get enqueued when we're in the process of finishing a decode.  And I can modify the discard heuristics to be flexible with the max-decoded-image-bytes limit, although I'm afraid that could only paper over this problem.  (Instead of decoding in a loop, we could end up decoding once per second, which would still be awful for battery life on B2G!)

Any other ideas for how to fix this, assuming I don't get another shot at this in gdb?
Comment 30 Justin Lebar (not reading bugmail) 2012-09-10 19:01:59 PDT
Created attachment 659932 [details] [diff] [review]
Patch, v1

I'm not at all pleased with this.
Comment 31 Joe Drew (not getting mail) 2012-09-10 19:32:26 PDT
Comment on attachment 659932 [details] [diff] [review]
Patch, v1

o god o god

You don't actually need this extra flag - (mDecoder && mDecoder->GetDecodeDone()) should be enough, I think.

I can't quite convince myself that this is the fundamentally wrong way to solve this bug, sadly. I might even r+ a fix that used the above :(
Comment 32 Joe Drew (not getting mail) 2012-09-10 19:33:54 PDT
(In reply to Justin Lebar [:jlebar] from comment #29)
> > or for the image in a format (nonpremult vs premult, etc) that it isn't already in.
> 
> I think you a word here, but I'm not quite sure what it is.

Slightly rewritten:

"Ask for the image in a format different from the one it's stored in."

As for reproducing this in gdb - maybe I can start browsing with the set of values you've tried, since they clearly work!
Comment 33 Justin Lebar (not reading bugmail) 2012-09-10 19:51:09 PDT
> You don't actually need this extra flag - (mDecoder && mDecoder->GetDecodeDone()) should be enough, 
> I think.

We set mDecoder to null before calling decoder->Finish(), in ShutdownDecoder.

> As for reproducing this in gdb - maybe I can start browsing with the set of values you've tried, 
> since they clearly work!

They may work, but it had been more than a month since I last saw this bug.
Comment 34 Justin Lebar (not reading bugmail) 2012-09-10 20:49:04 PDT
Comment on attachment 659932 [details] [diff] [review]
Patch, v1

Back to r? because I don't think comment 31 works.
Comment 35 Joe Drew (not getting mail) 2012-09-17 14:45:42 PDT
(In reply to Justin Lebar [:jlebar] from comment #33)
> > You don't actually need this extra flag - (mDecoder && mDecoder->GetDecodeDone()) should be enough, 
> > I think.
> 
> We set mDecoder to null before calling decoder->Finish(), in ShutdownDecoder.

I don't understand why this would affect things. Surely we would check just after IsSizeDecode(), and hence ShutdownDecoder won't have been called?
Comment 36 Justin Lebar (not reading bugmail) 2012-09-17 18:29:44 PDT
Maybe I don't understand comment 31 correctly.  To be really concrete about it:

> You don't actually need this extra flag - (mDecoder &&
> mDecoder->GetDecodeDone()) should be enough, I think.

I thought/think you were/are suggesting that I should replace 

> @@ -2423,16 +2426,22 @@ RasterImage::RequestDecode()
>    if (mDecoder && !mDecoder->IsSizeDecode())
>      return NS_OK;
> +  if (mFinishing)
> +    return NS_OK;

with

> @@ -2423,16 +2426,22 @@ RasterImage::RequestDecode()
>    if (mDecoder && !mDecoder->IsSizeDecode())
>      return NS_OK;
> +  if (mDecoder && mDecoder->GetDecodeDone())
> +    return NS_OK;

What I was saying is that the RequestDecode() call we want to be affected by this early return happens during the call to decoder->Finish() inside ShutdownDecoder:

> RasterImage::ShutdownDecoder() {
>   [...]
>   nsRefPtr<Decoder> decoder = mDecoder;
>   mDecoder = nullptr;
> 
>   mInDecoder = true;
>   decoder->Finish();
>   mInDecoder = false;
>   [...]
> }

But notice that before we call decoder->Finish(), we've already set mDecoder = nullptr.  So |mDecoder && mDecoder->GetDecodeDone()| will not cause the desired early return.
Comment 37 Joe Drew (not getting mail) 2012-09-17 18:32:34 PDT
Oh, I missed that in the callstack.

..

this sucks.
Comment 38 Joe Drew (not getting mail) 2012-09-17 18:32:59 PDT
Comment on attachment 659932 [details] [diff] [review]
Patch, v1

i hope you're happy ahhhhh
Comment 39 Justin Lebar (not reading bugmail) 2012-09-17 18:55:16 PDT
> i hope you're happy ahhhhh

I am not at all happy, but at least we may finally have fixed this bug.

I may or may not have had a try run for this, but try is currently very backed up, so ima push to inbound.

https://hg.mozilla.org/integration/mozilla-inbound/rev/62a48b2712fb
Comment 40 Justin Lebar (not reading bugmail) 2012-09-17 18:59:54 PDT
Comment on attachment 659932 [details] [diff] [review]
Patch, v1

[Approval Request Comment]
Bug caused by (feature/regressing bug #): maybe bug 732820?

User impact if declined: Occasional 100% CPU usage by Firefox until browser is restarted.  Browser remains relatively responsive throughout, in my tests.

Testing completed (on m-c, etc.): We need to let this bake on m-c, of course.  In fact, it may even get backed out.  But I didn't want to forget to come back in a few days and request approval.

Risk to taking this patch (and alternatives if risky): This code is so messed up...there's a possibility I regressed something else.  I think it's unlikely we wouldn't notice pretty quickly if I did break something.

String or UUID changes made by this patch: None
Comment 41 Graeme McCutcheon [:graememcc] 2012-09-18 05:02:58 PDT
https://hg.mozilla.org/mozilla-central/rev/62a48b2712fb
Comment 42 Alex Keybl [:akeybl] 2012-09-18 09:35:52 PDT
Comment on attachment 659932 [details] [diff] [review]
Patch, v1

[Triage Comment]
Given where we are in the cycle, the risk here, and the fact that FF15 was affected (without significant user feedback), let's fix in FF17 first.
Comment 43 Justin Lebar (not reading bugmail) 2012-09-20 09:50:55 PDT
Landed for Aurora 17: https://hg.mozilla.org/releases/mozilla-aurora/rev/847afafd1c6d
Comment 44 Anthony Hughes (:ashughes) [GFX][QA][Mentor] 2012-10-16 16:01:09 PDT
Boris, can you please check if this is now fixed for you with the Firefox 17 Beta builds?
Comment 45 Boris Zbarsky [:bz] 2012-10-17 08:17:12 PDT
I have no way to really check this.  See "Random" in summary...
Comment 46 Anthony Hughes (:ashughes) [GFX][QA][Mentor] 2012-10-17 11:37:24 PDT
(In reply to Boris Zbarsky (:bz) from comment #45)
> I have no way to really check this.  See "Random" in summary...

Okay, thanks. I think we'll just need to trust the fix and keep an eye out on our feedback channels after we release.

Note You need to log in before you can comment on or make changes to this bug.