Closed Bug 1293577 Opened 3 years ago Closed 3 years ago

Crash in mozilla::image::Decoder::HasProgress

Categories

(Core :: ImageLib, defect, critical)

x86
Windows XP
defect
Not set
critical

Tracking

()

RESOLVED FIXED
mozilla51
Tracking Status
firefox51 --- fixed

People

(Reporter: ting, Assigned: aosmond)

References

Details

(Keywords: crash)

Crash Data

Attachments

(2 files, 1 obsolete file)

This bug was filed from the Socorro interface and is 
report bp-1ba88cc4-d70a-4916-aecd-733e12160809.
=============================================================

#42 of 0807 Nightly on Windows, 3 crashes from 3 installations.
From the minidump, |mDecoder| [1] is null. But there's a null check in the beginning of DecodedSurfaceProvider::Run(), also only FinishDecoding() can set mDecoder null. Is it possible that the task is executed by two DecodePoolWorker, or do you have any other ideas?

[1] https://hg.mozilla.org/mozilla-central/annotate/d42aacfe34af/image/DecodedSurfaceProvider.cpp#l144
Flags: needinfo?(seth.bugzilla)
Duplicate of this bug: 1293556
Crash Signature: [@ mozilla::image::Decoder::HasProgress] → [@ mozilla::image::Decoder::HasProgress] [@ mozilla::image::DecodedSurfaceProvider::Run]
I believe you are correct, the same task gets executed by two workers.

In DecodedSurfaceProvider::Run, we call Decoder::Decode:

https://dxr.mozilla.org/mozilla-central/rev/720b5d2c84d5b253d4dfde4897e13384dc97a46a/image/DecodedSurfaceProvider.cpp#133

Decoder::Decode in turn will call DoDecode for a specific decoder which can call into StreamingLexer::Lex. SourceBufferIterator::AdvanceOrScheduleResume is called which can schedule IDecodingTask::Resume to be called later when there is more data.

https://dxr.mozilla.org/mozilla-central/rev/720b5d2c84d5b253d4dfde4897e13384dc97a46a/image/StreamingLexer.h#431

IDecodingTask::Resume itself will put the same DecodedSurfaceProvider object into the decode work queue. In theory all of this can happen before we return from the original Decoder::Decode call. The new DecodedSurfaceProvider::Run call can then finish processing the image and clear our reference.
Yep, that's right! Thanks for breaking it down, Andrew. Now that you made the mistake of posting in the bug, I'm going to ask you to review the patch. =p

This was an oversight on my part, which is troubling because I wrote all the code involved here and theoretically knew about this issue. It's pretty clear that after all the refactoring that has happened recently, this API (I'm specifically talking about the way IResumable works in conjunction with SourceBufferIterator) doesn't really match our current needs cleanly and has become a bit of a footgun. I have an idea of how to make it better, but it's probably better to wait until the other changes happening lately have died down before tackling that, as it's kinda invasive.

In the short term the simplest fix is just to add a lock, so that's what we'll do. It shouldn't be an issue from a performance point of view, as I expect the lock to be uncontended the vast, vast majority of the time.
Flags: needinfo?(seth.bugzilla)
Assignee: nobody → aosmond
Status: NEW → ASSIGNED
Attachment #8779508 - Flags: review?(seth.bugzilla)
Comment on attachment 8779508 [details] [diff] [review]
add mutex to DecodedSurfaceProvider and AnimationDecodingTask

Review of attachment 8779508 [details] [diff] [review]:
-----------------------------------------------------------------

Looks good! Thanks for tackling this, Andrew!

Just had some style nits:

::: image/DecodedSurfaceProvider.h
@@ +81,5 @@
>  
>    /// The key under which we're stored as a cache entry in the surface cache.
>    SurfaceKey mSurfaceKey;
> +
> +  /// Mutex protecting access to the decoder

Nit: please end comments with a period. =)

Non-nit: rather than saying "the decoder", it's much better to say "mDecoder". That way, searching the code for mDecoder will turn this up. Also, please move this next to mDecoder (the usual pattern is right above it), again to make it more noticeable.

::: image/IDecodingTask.h
@@ +83,5 @@
>    virtual ~AnimationDecodingTask() { }
>  
>    NotNull<RefPtr<Decoder>> mDecoder;
> +
> +  Mutex mMutex;

Follow the same pattern here. I realize it's obvious in this case since there's only one other member, but it's important to always document which members a mutex is protecting. (Especially since things may change later.)
Attachment #8779508 - Flags: review?(seth.bugzilla) → review+
https://hg.mozilla.org/integration/mozilla-inbound/rev/64863b1fcf6470de45a6f3979dd3b1709e4b10e8
Bug 1293577 - Protect the image decoding task path with a mutex to avoid race conditions. r=seth
https://hg.mozilla.org/mozilla-central/rev/64863b1fcf64
Status: ASSIGNED → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla51
Andrew, I realized after reviewing this that MetadataDecodingTask needs this treatment as well. I believe this might need to be uplifted to aurora (though I'd need to double check) so it'd probably be good to fix them all in one bug. Could you please add a followup to fix this for MetadataDecodingTask as well?
Flags: needinfo?(aosmond)
Seth, is it just for consistency purposes? I don't see a race in MetadataDecodingTask::Run, because if it reaches a terminal state, it should not have a pending task reference to run, and if it isn't terminal, there is nothing to race for.
Flags: needinfo?(aosmond) → needinfo?(seth.bugzilla)
(In reply to Andrew Osmond [:aosmond] from comment #11)
> Seth, is it just for consistency purposes? I don't see a race in
> MetadataDecodingTask::Run, because if it reaches a terminal state, it should
> not have a pending task reference to run, and if it isn't terminal, there is
> nothing to race for.

It's not just for consistency purposes. Decoder does no synchronization on its own and its methods aren't reentrant. If we allow reentrant calls to Decoder::Decode(), it will be extremely easy to introduce subtle issues that will potentially be a nightmare to debug. For example, consider what would happen if someone introduces code that reads or writes any Decoder state whatsoever here:

https://dxr.mozilla.org/mozilla-central/source/image/Decoder.cpp?q=Decoder.cpp&redirect_type=direct#133

Or someone adds code to read or write some StreamingLexer state here:

https://dxr.mozilla.org/mozilla-central/source/image/StreamingLexer.h?q=StreamingLexer.h&redirect_type=direct#436

Instant data race.

This is really not something we can allow to lurk as a footgun in the tree.
Flags: needinfo?(seth.bugzilla) → needinfo?(aosmond)
Exactly the same change to MetadataDecodingTask as to AnimationDecodingTask, as per comment 10 (so I took the liberty of carrying r+ :)).
Flags: needinfo?(aosmond)
Pushed by aosmond@gmail.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/3eade32e18c4
Part 2. Add mutex to MetadataDecodingTask to protect from decoding races. r=seth
Thank you for taking care of that, Andrew. I really appreciate it!
You need to log in before you can comment on or make changes to this bug.