Bug 824658 (Closed) — Opened 7 years ago, Closed 7 years ago

Long-lived loop decoding a PNG


(Core :: ImageLib, defect)




Tracking Status
firefox19 --- wontfix
firefox20 --- fixed
firefox21 --- fixed
b2g18 --- fixed


(Reporter: justin.lebar+bug, Assigned: khuey)



(Keywords: regression, Whiteboard: [Snappy:P1])


(1 file)

We've seen this before, and I remember that it took us months to figure out what was going on.  Unfortunately I just saw it again.  :(

This loop was not infinite; it went away after a while.

I guess I'll try to catch it in gdb.
This seems to be happening quite often.  The jank is awful.
Whiteboard: [Snappy:P1]
This bug also prevents Firefox from cleanly shutting down.  So a user will have to kill it using the task manager before starting Firefox up again.

erm, sorry, I guess that's not a profile.  :)  Anyway, it looked just like the one in comment 0.
Keywords: regression
I caught this in GDB and, before it segfaulted Firefox, I observed that we appeared to be discarding the image that we'd then re-decode.  We were discarding from the discard manager, which indicates that the image we were drawing was not locked.
Assignee: nobody → justin.lebar+bug
Blocks: 824583
I've verified that some images are ::Drawn while discardable.  I suspect such images may be the problem here.

If so, my plan is to notice when an image is Drawn while discardable and discard it less aggressively.  (In particular, drawn-while-discardable images won't be subject to our hard cap on decoded image memory.)

But I need to test for a bit.
This is bad, right?

> #0  mozilla::image::RasterImage::RequestDecodeCore (this=0x12c667b80, aDecodeType=<value temporarily unavailable, due to optimizations>) at ../../../../image/src/RasterImage.cpp:2784
> #1  0x0000000101362a45 in mozilla::image::RasterImage::GetImgFrame (this=0x12c667b80, framenum=0) at ../../../../image/src/RasterImage.cpp:817
> #2  0x0000000101363f17 in mozilla::image::RasterImage::SetFrameAsNonPremult (this=0x18, aFrameNum=1, aIsNonPremult=false) at ../../../../image/src/RasterImage.cpp:1567
> #3  0x000000010135ff90 in mozilla::image::Decoder::PostDecodeDone (this=0x10b2a6870) at ../../../../image/src/Decoder.cpp:282
> #4  0x000000010259a96b in MOZ_PNG_push_have_end [inlined] () at /Users/jlebar/code/moz/ff-inbound/src/media/libpng/pngpread.c:1453
> #5  0x000000010259a96b in MOZ_PNG_push_read_chunk (png_ptr=0x10fe9d800, info_ptr=0x127119a80) at ../../../../media/libpng/pngpread.c:2784
> #6  0x000000010259a52b in MOZ_PNG_proc_some_data [inlined] () at /Users/jlebar/code/moz/ff-inbound/src/media/libpng/pngpread.c:121
> #7  0x000000010259a52b in MOZ_PNG_process_data (png_ptr=0x10fe9d800, info_ptr=0x127119a80, buffer=<value temporarily unavailable, due to optimizations>, buffer_size=<value temporarily unavailable, due to optimizations>) at ../../../../media/libpng/pngpread.c:2784
> #8  0x000000010137f941 in mozilla::image::nsPNGDecoder::WriteInternal (this=<value temporarily unavailable, due to optimizations>, aBuffer=<value temporarily unavailable, due to optimizations>, aCount=<value temporarily unavailable, due to optimizations>) at ../../../../image/decoders/nsPNGDecoder.cpp:348
> #9  0x00000001013646e5 in mozilla::image::RasterImage::WriteToDecoder (this=0x12c667b80, aBuffer=0x12d647008 "?PNG\r\n\032\n", aCount=7867) at ../../../../image/src/RasterImage.cpp:2688
> #10 0x0000000101366aca in mozilla::image::RasterImage::DecodeWorker::DecodeSomeOfImage (this=<value temporarily unavailable, due to optimizations>, aImg=0x12c667b80, aDecodeType=mozilla::image::RasterImage::DecodeWorker::DECODE_TYPE_NORMAL) at ../../../../image/src/RasterImage.cpp:3200
Yeah, and the image's lock count is 0, so as soon as it finishes decoding, we discard it.  Then the runnable comes 'round the mountain, and we're screwed.
Here is my hypothesis as to how this is happening.

1. Image A's decode finishes.  Image A sets mDecoded = true.  Image A's lock
   count is 0, so it calls DiscardTracker::Reset.
2. DiscardTracker::Reset notices that we're low on memory and calls
   MaybeDiscardSoon, which enqueues a task to run DiscardTracker::DiscardNow.

3. From image B's Decoder::PostDecodeDone, we end up in RequestDecodeCore.
   This notices that mInDecoder is true and dispatches a new imgDecodeRequestor
   to the main thread.

4. Image B's decode finishes.  Its lock count is also 0.

5. DiscardTracker::DiscardNow runs.  This happens before image B's
   imgDecodeRequestor runs, because DiscardNow was enqueued (in step 2) before
   the imgDecodeRequestor (in step 3).  DiscardNow discards image B (and maybe
   also image A, but we don't care).

6. Image B's imgDecodeRequestor runs and decodes image B.
It really looks like we should modify SetFrameAsNonPremult (and similar) to call GetImgFrameNoDecode(), since they're called from within decoders.  If we do, we no longer call RequestDecode at all, AFAICT.
> It really looks like we should modify SetFrameAsNonPremult (and similar) to call GetImgFrameNoDecode(), since they're called from 
> within decoders.

I can do that.  But we've already had another bug just like this, so I also want to change the discard tracker so that it's more forgiving of this sort of thing.
Heh, so: If you ever load a tab in the background, we'll decode images in that tab while they're unlocked.  And that's of course all you need to trigger this bug.

I'm becoming increasingly skeptical that this logic, where we willingly keep only X KB of decoded image data, can work properly.

If we're over the limit, should we even decode an unlocked image?  We're going to throw it away immediately...
I'm probably seeing this problem when nobody else is because I've set image.mem.max_decoded_kb to 1024.  (I set this specifically to tickle bugs like this.)

But we set this value to 5 MB on B2G, so it's not inconceivable that we'd hit this bug there.
Attached patch Patch (Splinter Review)
Assignee: justin.lebar+bug → khuey
Attachment #700350 - Flags: review?(justin.lebar+bug)
Comment on attachment 700350 [details] [diff] [review]

Whatchathink, joe?

khuey is also reworking the discard tracker to avoid this class of issue (per comment 10), but he says he's going to do this in a separate bug.
Attachment #700350 - Flags: review?(justin.lebar+bug) → review?(joe)
Comment on attachment 700350 [details] [diff] [review]

Review of attachment 700350 [details] [diff] [review]:

This looks lovely.
Attachment #700350 - Flags: review?(joe) → review+
Comment on attachment 700350 [details] [diff] [review]

[Approval Request Comment]
Per comment 15, I'm afraid B2G is particularly susceptible to this bug.  It's hard to notice that it's occurring on the phone, except that it will run down your battery and make the phone sluggish.

Imagelib changes are always scary, but I think we should take this.

No string changes, etc.
Attachment #700350 - Flags: approval-mozilla-b2g18?
> Per comment 15, I'm afraid B2G is particularly susceptible to this bug.

Sorry, comment 12.
blocking-b2g: --- → tef?
khuey is away, so I landed this with Joe's blessing so that we could get as much bake time on nightly as possible before porting to b2g.  Hope that's OK, Kyle!

(I realized that this may not have been pushed to try only after pushing this to inbound.  If it burns, let it be known that it's my fault, not khuey's.)
Yep that's fine.  I was on a plane :-P
Attachment #700350 - Flags: approval-mozilla-b2g18? → approval-mozilla-b2g18+
Closed: 7 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla21
(In reply to Ryan VanderMeulen from comment #21)
> Should this be nominated for Aurora/Beta as well?

0.5% of users (well, 0.5% of sessions) are hitting this bug.  I'm not sure if it's worth taking on Aurora and Beta.  Maybe Aurora, though.
Comment on attachment 700350 [details] [diff] [review]

[Approval Request Comment]
* Bug caused by (feature/regressing bug #): Probably bug 732820.
* User impact if declined: 0.5% of users will experience high CPU usage and a janky browser for at least some minutes, possibly until they restart the browser.  This may also cause shutdown hangs.
* Testing completed (on m-c, etc.): On m-c.
* Risk to taking this patch (and alternatives if risky): I'd say moderate risk; imagelib changes often have regressions.  I'm not convinced this is worth taking on beta, but maybe it's worthwhile for Aurora.
* String or UUID changes made by this patch: none
Attachment #700350 - Flags: approval-mozilla-beta?
Attachment #700350 - Flags: approval-mozilla-aurora?
Comment on attachment 700350 [details] [diff] [review]

Agreed on taking this on Aurora only given the risk profile. We're early enough in the Aurora cycle for this to get the necessary bake time for release.
Attachment #700350 - Flags: approval-mozilla-beta?
Attachment #700350 - Flags: approval-mozilla-beta-
Attachment #700350 - Flags: approval-mozilla-aurora?
Attachment #700350 - Flags: approval-mozilla-aurora+
blocking-b2g: tef? → ---
Oops, sorry guys.  I just pushed an empty cset for this to Aurora.

(It claims to have modified some files, but there's no diff in those files; this is a known bug in hg.)