Closed Bug 1120279 Opened 9 years ago Closed 8 years ago

crash in mozilla::image::imgFrame::GetAnimationData()

Categories

(Core :: Graphics: ImageLib, defect)

x86
All
defect
Not set
critical

Tracking

()

RESOLVED FIXED
mozilla52
Tracking Status
firefox50 --- fixed
firefox51 --- fixed
firefox52 --- fixed

People

(Reporter: away, Assigned: aosmond)

References

Details

(Keywords: crash, Whiteboard: gfx-noted)

Crash Data

Attachments

(1 file, 2 obsolete files)

This bug was filed from the Socorro interface and is 
report bp-e801057b-cd50-4aa7-999f-018fe2150109.
=============================================================

Null |frame| in GetTimeoutForFrame. This began with build 20150109030224. Bug 1116733 is definitely a suspect but it's not clear whether this is a new issue or just a new signature due to refactoring.
Flags: needinfo?(seth)
I'm pretty certain that this bug, bug 1108034, and bug 1119518 are dupes that only look different due to refactoring. I will do some more detailed triage and confirm that soon.
Whiteboard: gfx-noted
Crash Signature: [@ mozilla::image::imgFrame::GetAnimationData()] → [@ mozilla::image::imgFrame::GetAnimationData()] [@ mozilla::image::imgFrame::GetAnimationData() const ]
OS: Windows NT → All
This should be fixed by bug 1079627. The underlying problem will be fixed in bug 1126739. The aim is to uplift both of these to Aurora. I cannot set the dependencies correctly here because of a circular dependency, unfortunately.
Depends on: 1126739
I think I hit this too: https://crash-stats.mozilla.com/report/index/6a40a5b3-2f74-415c-bfaa-d6eb32150205

Updating to see if I just don't have the latest version.
Flags: needinfo?(seth)
Crash Signature: [@ mozilla::image::imgFrame::GetAnimationData()] [@ mozilla::image::imgFrame::GetAnimationData() const ] → [@ mozilla::image::imgFrame::GetAnimationData()] [@ mozilla::image::imgFrame::GetAnimationData() const ] [@ mozilla::image::imgFrame::GetAnimationData] [@ mozilla::image::imgFrame::GetAnimationData const ]
Given that we just have a steady stream of these crashes coming in, I'd say none of the listed possible dependencies fully resolved the issue.
Assignee: nobody → aosmond
It fails trying to get the pointer for imgFrame::mMonitor::mMutex::mLock (at offset 0x4 or 0x8 depending on arch), which means it is imgFrame pointer itself that is null. As such, the signature would be different in a debug build as we would trip the assertion at https://dxr.mozilla.org/mozilla-central/rev/b1156b0eb96fcb49966b20e5fcf6a01f634ea2ee/image/FrameAnimator.cpp#419 but I don't see any bugs filed against it, so thus far it has only been observed in release builds.
We failed to get the previous frame for the blend. We don't know about the status for the next frame (for the purposes of FrameAnimator::DoBlend, aNextFrameIndex is always aPrevFrameIndex + 1.)

We know that FrameAnimator::GetRawFrame failed gracefully because if it thought it successfully acquired an imgFrame, it would have failed on the exact same error earlier when creating an RawAccessFrameRef from the DrawableSurface's imgFrame pointer, as imgFrame::LockImageData acquires mMonitor. So https://dxr.mozilla.org/mozilla-central/rev/b1156b0eb96fcb49966b20e5fcf6a01f634ea2ee/image/FrameAnimator.cpp#405 was never reached.

Thus either the DrawableSurface::Seek or SurfaceCache::Lookup failed.
Before FrameAnimator::DoBlend is called, FrameAnimator::AdvanceFrame acquires the RawAccessFrameRef for the next frame. In order to continue, the decoding must either be finished, or we have the complete next frame.

https://dxr.mozilla.org/mozilla-central/rev/b1156b0eb96fcb49966b20e5fcf6a01f634ea2ee/image/FrameAnimator.cpp#206

It might be a problem with just the previous/current frame then rather than both *or* the decoding completed, updated the frame count but did not supply the necessary frames.
Analysis of a minidump indicates the following:
1) aPrevFrameIndex is 0, aNextFrameIndex is 1.
2) FrameAnimator::GetRawFrame returns null for both indices.
3) The stack suggests the possibility that SurfaceCache::Lookup found the image key but failed on the surface key step.

The implications of 1 and 2 while in FrameAnimator::DoBlend
1) AnimationState::mDoneDecoding is true, because we did not already return (https://dxr.mozilla.org/mozilla-central/rev/82d0a583a9a39bf0b0000bccbf6d5c9ec2596bcc/image/FrameAnimator.cpp#206).
2) AnimationState::KnownFrameCount > 1, because we did not already return (https://dxr.mozilla.org/mozilla-central/rev/82d0a583a9a39bf0b0000bccbf6d5c9ec2596bcc/image/FrameAnimator.cpp#180).

This suggests the animated image successfully decoded, but its frames failed to be put in SurfaceCache or they somehow got evicted.
One possible sequence of events:

1) Precondition: Animated image is fully decoded.

2) imgFrame::Draw fails (maybe the surface/volatile buffer was purged, for example).

3) RasterImage::RecoverFromInvalidFrames is called, triggers the removal of the image/surfaces from SurfaceCache and re-adds only the image entry (https://dxr.mozilla.org/mozilla-central/rev/8a494adbc5cced90a4edf0c98cffde906bf7f3ae/image/RasterImage.cpp#1220).

4) One of the following happened
   a) RasterImage::mHasSourceData remains false. Perhaps the final RasterImage::OnImageDataAvailable (off main thread) has triggered, we finished decoding, and a Draw came in before OnImageDataComplete (main thread) could trigger. As a result, our requested RasterImage::Decode would happen asynchronously instead of synchronously (https://dxr.mozilla.org/mozilla-central/rev/8a494adbc5cced90a4edf0c98cffde906bf7f3ae/image/RasterImage.cpp#1092).

   b) Decoder::InitInternal failed (specifically nsPNGDecoder::InitInternal is in theory but unlikely to be fallible), decode aborted silently (https://dxr.mozilla.org/mozilla-central/rev/8a494adbc5cced90a4edf0c98cffde906bf7f3ae/image/DecoderFactory.cpp#188).

   c) SurfaceCache::Insert failed, decode aborted silently (https://dxr.mozilla.org/mozilla-central/rev/8a494adbc5cced90a4edf0c98cffde906bf7f3ae/image/DecoderFactory.cpp#203).

5) Image still sits in the nsRefreshDriver list, it attempts to advance the frame, AnimationState::mDoneDecoding remains true (because we didn't reset it) but there are no frames in SurfaceCache (because we deleted them, but either failed to start the redecode, or haven't yet finished it).

6) Boom, crash.
It is difficult to know if I really captured the crash sequence given I have not been able to reproduce myself :). Thankfully (??) we see this each day on recent nightlies from the crash reports, so if they stop after landing, that should be sufficient confirmation.

try: https://treeherder.mozilla.org/#/jobs?repo=try&revision=6ae0310a037a
Attachment #8791739 - Flags: review?(edwin)
Just for posterity; if the above patch doesn't fix it, this has a good chance of avoiding the crash (but doesn't answer why we are missing the next frame when decoding says we are done).
Attachment #8791741 - Attachment is obsolete: true
Status: NEW → ASSIGNED
It is possible bug 1296706 is related to this. The assert implies that AnimationState::mDoneDecoding is true but we are adding more frames. Perhaps the original decode ended early for some transient condition, but on a redecode, it was able to complete. Attachment 8791739 [details] [diff] would potentially fix this, although I think we would need to also reset the frame count if true, because if a redecode can give more frames, it could also give fewer in theory.
See Also: → 1296706
Comment on attachment 8791739 [details] [diff] [review]
Ensure consistent animated image state when redecoding, v1

Review of attachment 8791739 [details] [diff] [review]:
-----------------------------------------------------------------

Sounds plausible.
Attachment #8791739 - Flags: review?(edwin) → review+
Pushed by aosmond@gmail.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/c0021d576b3a
Ensure the animated image state is consistent if redecoding fails or is asynchronous. r=edwin
https://hg.mozilla.org/mozilla-central/rev/c0021d576b3a
Status: ASSIGNED → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla51
Well, at least you got one nightly, maybe that'll be enough to tell whether you're on the right track for fixing the crash.

Backed out in https://hg.mozilla.org/mozilla-central/rev/17242152a8fb for reftest failures, I'm not sure I'll find all of the ones we've filed but certainly all of bug 1303449 bug 1303490 bug 1303492 bug 1303505 bug 1303506 bug 1303491 and my favorite for the assertions overflowing the maximum log size bug 1303502.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Target Milestone: mozilla51 → ---
The crash reports confirmed that the issue (at least with my other patch breaking things) that this issue can happen on Linux as well as Mac OS X and Windows. The intermittent failures were likely a result of breaking multiple decode requests (mea culpa) in the normal case, not the failure case.

I have a new theory which extends from comment 9 which is probably the most likely scenario.

4) d) extends a) where by the first decoder is still active when the Draw fails (i.e. we started the animation because we have some frames but not all of them yet). As a result, we evict the old surface provider and create a new one, however the original decoding is still happening in the background. When the first decoder finishes, it sets AnimationState::mDoneDecoding to true (shared state between multiple decoders!) even though the surface provider for the new decode attempt may not have processed *any* frames yet.

The silver lining here is that the present state of AnimationState living in RasterImage instead of FrameAnimator and AnimationState living in AnimatedSurfaceProvider is temporary, as per Bug 1288040 Comment 0. That would be its very nature solve these races on the shared state between multiple decoders.

Given a much better understanding now of how we get into this state (and the fact that we know the code needs to be refactored anyways), I am much more comfortable with removing the mDoneDecoding check which is a very minor optimization given we already incurred the cost of hitting SurfaceCache for the next frame beforehand (which can also tell us what we want to know in a much safer manner).
In addition to the above, this change should be much safer and *not* break half the tests which require image decoding... ;).

try: https://treeherder.mozilla.org/#/jobs?repo=try&revision=e44ce3071ef8
Attachment #8791739 - Attachment is obsolete: true
Attachment #8793336 - Flags: review?(edwin)
Let's do a try run this time to make sure, eh. ;-)
Pushed by aosmond@gmail.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/6a41bd18fe6c
Always check if the next frame is available before advancing an animation.
(In reply to Edwin Flores [:eflores] [:edwin] from comment #19)
> Let's do a try run this time to make sure, eh. ;-)

Yes. It had a clean run this time!
https://hg.mozilla.org/mozilla-central/rev/6a41bd18fe6c
Status: REOPENED → RESOLVED
Closed: 8 years ago8 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla52
Comment on attachment 8793336 [details] [diff] [review]
Ignore decoding complete and always check next frame exists, v2

Approval Request Comment
[Feature/regressing bug #]: N/A
[User impact if declined]: May crash when displaying through an animation.
[Describe test coverage new/current, TreeHerder]: Only has been observed via crash reports as condition necessary to reproduce depends on the system releasing graphics buffers (known possibility but poorly handled for animated images). Crash reports have ceased in builds containing the patch since landing.
[Risks and why]: Very low risk. We crashed because we assumed if we have finished decoding, we must have all of the necessary frames, rather than check each frame before use. Now we no longer make that assumption and always check the frame before use. This makes the code path before and after finishing decoding the same, so it was already well tested.
[String/UUID change made/needed]: N/A
Attachment #8793336 - Flags: approval-mozilla-aurora?
Comment on attachment 8793336 [details] [diff] [review]
Ignore decoding complete and always check next frame exists, v2

Fix a crash. Take it in 51 aurora.
Attachment #8793336 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Comment on attachment 8793336 [details] [diff] [review]
Ignore decoding complete and always check next frame exists, v2

Further soaking in aurora and nightly hasn't revealed any problems as a result of this, and the crashes have stopped for aurora since uplifting. Still seeing recent reports from beta users.

Approval Request Comment
[Feature/regressing bug #]: N/A
[User impact if declined]: May crash when displaying through an animation.
[Describe test coverage new/current, TreeHerder]: Only has been observed via crash reports as condition necessary to reproduce depends on the system releasing graphics buffers (known possibility but poorly handled for animated images). Crash reports have ceased in nightly and aurora builds containing the patch since landing.
[Risks and why]: Very low risk. We crashed because we assumed if we have finished decoding, we must have all of the necessary frames, rather than check each frame before use. Now we no longer make that assumption and always check the frame before use. This makes the code path before and after finishing decoding the same, so it was already well tested.
[String/UUID change made/needed]: N/A
Attachment #8793336 - Flags: approval-mozilla-beta?
I highly suspect the crash signature [@ mozilla::image::FrameAnimator::AdvanceFrame ] is a variant of the crash tracked here, but is also fixed by this patch.
Crash Signature: [@ mozilla::image::imgFrame::GetAnimationData()] [@ mozilla::image::imgFrame::GetAnimationData() const ] [@ mozilla::image::imgFrame::GetAnimationData] [@ mozilla::image::imgFrame::GetAnimationData const ] → [@ mozilla::image::imgFrame::GetAnimationData()] [@ mozilla::image::imgFrame::GetAnimationData() const ] [@ mozilla::image::imgFrame::GetAnimationData] [@ mozilla::image::imgFrame::GetAnimationData const ] [@ mozilla::image::FrameAnimator::AdvanceFram…
Comment on attachment 8793336 [details] [diff] [review]
Ignore decoding complete and always check next frame exists, v2

Fixes a crash, has stabilized on Nightly and Aurora, Beta50+
Attachment #8793336 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
Depends on: 1346510
Bug 1346510 provides a more likely seeming explanation for how we were hitting this case.
You need to log in before you can comment on or make changes to this bug.