100% CPU usage due to what looks like GIF image decoding stuck in a loop

RESOLVED FIXED in Firefox 60

Status

()

defect
P3
normal
RESOLVED FIXED
Last year
11 months ago

People

(Reporter: mstange, Assigned: aosmond)

Tracking

({power, regression})

Trunk
mozilla61
Points:
---
Dependency tree / graph
Bug Flags:
qe-verify -

Firefox Tracking Flags

(firefox-esr52 unaffected, firefox59 unaffected, firefox60 fixed, firefox61 fixed)

Details

(crash signature)

Attachments

(4 attachments, 1 obsolete attachment)

Profile: https://perfht.ml/2GfPgbL

I only have one window open and was looking at a blank tab while taking this profile. My battery percentage had been going down at an alarming rate for at least 20 minutes before I started to investigate.
Probably from bug 523950.
Blocks: 523950
I'm investigating this. I suspect it is related to some recent shutdown crashes, e.g.:

https://crash-stats.mozilla.com/report/index/4c97739d-7eb7-4d9a-952d-c10be0180309#allthreads

My best guess at this point is that it never releases the thread, hence the shutdown hang / 100% CPU. This could in theory happen because of how advancing and discarding interact.
Assignee: nobody → aosmond
Crash Signature: [@ shutdownhang | nsThread::Shutdown | mozilla::image::DecodePoolImpl::Shutdown ]
I'm not convinced this is the correct solution yet... but in theory, it would make these crash reports go away if it is simply a matter of the refresh ticks advancing the animation.
Can we get a status here?
Flags: needinfo?(aosmond)
Priority: -- → P3
Comment on attachment 8958189 [details] [diff] [review]
0002-Bug-1444537-Part-1.-Ensure-discarding-of-animated-im.patch, v1

I haven't been able to figure out any other root cause besides the shutdown path. Let's just land and see if it happens again.
Flags: needinfo?(aosmond)
Attachment #8958189 - Flags: review?(tnikkel)
Attachment #8958192 - Flags: review?(tnikkel)
Attachment #8958189 - Flags: review?(tnikkel) → review+
Attachment #8958192 - Flags: review?(tnikkel) → review+
Pushed by aosmond@gmail.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/4d66a4a931e4
Part 1. Ensure discarding of animated image frames is clear. r=tnikkel
https://hg.mozilla.org/integration/mozilla-inbound/rev/1c5d4e965209
Part 2. Shutting down the decode pool should make animated decoders bail early. r=tnikkel
https://hg.mozilla.org/mozilla-central/rev/4d66a4a931e4
https://hg.mozilla.org/mozilla-central/rev/1c5d4e965209
Status: NEW → RESOLVED
Closed: Last year
Resolution: --- → FIXED
Target Milestone: --- → mozilla61
I want to see the crashes go away, will then request uplift.
Flags: needinfo?(aosmond)
Crash Signature: [@ shutdownhang | nsThread::Shutdown | mozilla::image::DecodePoolImpl::Shutdown ] → [@ shutdownhang | nsThread::Shutdown | mozilla::image::DecodePoolImpl::Shutdown ] [@ shutdownhang | mozilla::SpinEventLoopUntil<T> | nsThread::Shutdown | mozilla::image::DecodePoolImpl::Shutdown ] [@ shutdownhang | __psynch_cvwait | <name omitted> | nsT…
Comment on attachment 8958189 [details] [diff] [review]
0002-Bug-1444537-Part-1.-Ensure-discarding-of-animated-im.patch, v1

Approval Request Comment
[Feature/Bug causing the regression]: Bug 523950
[User impact if declined]: May hang on shutdown due to stuck decoding an animated image.
[Is this code covered by automated tests?]: Yes.
[Has the fix been verified in Nightly?]: The reproduction rate was low on nightly, but it has not reoccurred since landing.
[Needs manual test from QE? If yes, steps to reproduce]: No.
[List of other uplifts needed for the feature/fix]: Both patches in this bug (part 1 and part 2).
[Is the change risky?]: No.
[Why is the change risky/not risky?]: The changes are small and well understood. They should only have an functional impact during shutdown. 
[String changes made/needed]: None.
Flags: needinfo?(aosmond)
Attachment #8958189 - Flags: approval-mozilla-beta?
Comment on attachment 8958189 [details] [diff] [review]
0002-Bug-1444537-Part-1.-Ensure-discarding-of-animated-im.patch, v1

gotta have our gifs.  approved for 60.0b8
Attachment #8958189 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
(In reply to Andrew Osmond [:aosmond] from comment #10)
> [User impact if declined]: May hang on shutdown due to stuck decoding an
> animated image.
> [Is this code covered by automated tests?]: Yes.
> [Has the fix been verified in Nightly?]: The reproduction rate was low on
> nightly, but it has not reoccurred since landing.
> [Needs manual test from QE? If yes, steps to reproduce]: No.

Setting qe-verify- based on Andrew's assessment on manual testing needs and the fact that this fix has automated tests.
Flags: qe-verify-
This is still happening. I see a path I missed, ugh.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
In two places we should have also checked if we were shutting down:

https://searchfox.org/mozilla-central/rev/b55e1a1cbcaee34878e133fbac20c4c2af6e11b5/image/AnimationSurfaceProvider.cpp#275
https://searchfox.org/mozilla-central/rev/b55e1a1cbcaee34878e133fbac20c4c2af6e11b5/image/AnimationSurfaceProvider.cpp#332

But this begs the question, why are we so frequently in always in AnimationSurfaceProvider::FinishDecoding in the crash reports, or a similar teardown method? Most of the time during decode should be spent producing frames, not destroying the decoder. It doesn't seem like it is coincidence.

In theory we could hit an infinite loop if we recreate the decoder, and it terminates before producing a frame. It takes the early return at line 332 and loops again without giving up the thread context.

If a decoder doesn't produce the same number of frames on a redecode as the first decode, we should probably abort and stop the animation. We've already forgotten about the image at this point, so it is difficult for us to signal it and update the number of frames, get back in sync with FrameAnimator. Ugly.
Comment on attachment 8966978 [details] [diff] [review]
0002-Bug-1444537-Part-4.-Add-gtests-for-AnimatedFrameBuff.patch, v1

I realize the behaviour is less than ideal but we need something that can be uplifted to beta. The crash rate is high enough that I do worry what the release population will look like. Open to alternatives!

As we move towards generating full frames, and ditching FrameAnimator, I can promise things will get better :).
Attachment #8966978 - Flags: review?(tnikkel)
Attachment #8966977 - Flags: review?(tnikkel)
Attachment #8966977 - Flags: review?(tnikkel) → review+
Attachment #8966978 - Flags: review?(tnikkel) → review+
Pushed by aosmond@gmail.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/07f5d48b90cb
Part 3. Fix how redecode errors could cause animated image state inconsistencies. r=tnikkel
https://hg.mozilla.org/integration/mozilla-inbound/rev/0d23d74448c8
Part 4. Add gtests for AnimatedFrameBuffer for redecode errors. r=tnikkel
https://hg.mozilla.org/mozilla-central/rev/07f5d48b90cb
https://hg.mozilla.org/mozilla-central/rev/0d23d74448c8
Status: REOPENED → RESOLVED
Closed: Last yearLast year
Resolution: --- → FIXED
Backout by csabou@mozilla.com:
https://hg.mozilla.org/mozilla-central/rev/a0c455e036df
Backed out 2 changesets for causing crashes on test_discardFramesAnimatedImage.html. a=backout
See Also: → 1454824
This took more time than I thought it would. The stack trace was a bit misleading, as it felt like a nullptr deref, but I believe it was actually an assertion failure. It should happen on other platforms besides Android however as I don't believe it to be a timing problem in general. It was a result of a missing continue I removed in error.

https://treeherder.mozilla.org/#/jobs?repo=try&revision=d0c46e142e953fae02989b2cd7939c3f62b57b3c

If the try is green, I will reland.
Attachment #8966977 - Attachment is obsolete: true
Flags: needinfo?(aosmond)
Attachment #8970557 - Flags: review+
Pushed by aosmond@gmail.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/1173dccd114e
Part 3. Fix how redecode errors could cause animated image state inconsistencies. r=tnikkel
https://hg.mozilla.org/integration/mozilla-inbound/rev/96a90528f1cd
Part 4. Add gtests for AnimatedFrameBuffer for redecode errors. r=tnikkel
https://hg.mozilla.org/mozilla-central/rev/1173dccd114e
https://hg.mozilla.org/mozilla-central/rev/96a90528f1cd
Status: REOPENED → RESOLVED
Closed: Last yearLast year
Resolution: --- → FIXED
Target Milestone: --- → mozilla61
Depends on: 1454130
Product: Core → Core Graveyard
Product: Core Graveyard → Core
You need to log in before you can comment on or make changes to this bug.