Assertion failure: NS_IsMainThread() and crash [@mozilla::image::Decoder::AllocateFrame]

RESOLVED FIXED in Firefox 28

Status

()

defect
--
critical
RESOLVED FIXED
6 years ago
4 years ago

People

(Reporter: posidron, Assigned: seth)

Tracking

(6 keywords)

Trunk
mozilla29
x86_64
macOS
Points:
---
Dependency tree / graph

Firefox Tracking Flags

(firefox27 wontfix, firefox28+ verified, firefox29 fixed, firefox-esr2428+ fixed, b2g18 unaffected, b2g-v1.1hd unaffected, b2g-v1.2 fixed, b2g-v1.3 fixed, b2g-v1.3T fixed, b2g-v1.4 unaffected)

Details

(Whiteboard: [fuzzblocker][adv-main28+][adv-esr24.4+], crash signature)

Attachments

(5 attachments)

Posted file callstack.txt
Load the attached testcase and click the "Ok" button for every two popup boxes, then wait a few seconds.

Assertion failure: NS_IsMainThread(), at /Users/cdiehl/dev/repos/mozilla/mozilla-inbound/image/src/Decoder.cpp:197

Tested with http://hg.mozilla.org/integration/mozilla-inbound/rev/9fd04fd70fc2
Posted file testcase.html
Posted file media.zip
Crash Signature: [@ mozilla::image::Decoder::AllocateFrame()]
If this only happens in ASan+debug builds, it might be related to bug 895845.
Posted file testcase-simplified
This is a very simplified testcase, the stack is the same.

During the fuzzing run I see this crash as:

###!!! ABORT: Calling InitDecoder() but already decoded!: '!mDecoded', file /Users/cdiehl/dev/repos/mozilla/mozilla-inbound/image/src/RasterImage.cpp, line 1963


Tested against http://hg.mozilla.org/integration/mozilla-inbound/rev/f85f06761d52
Whiteboard: [fuzzblocker]
This also blocks WebGL fuzzing with Textures. Any chance we can get this bug some care?
Flags: needinfo?(milan)
Seth, how're you doing for time?
Flags: needinfo?(seth)
I can take this.
Assignee: nobody → seth
Flags: needinfo?(seth)
Flags: needinfo?(milan)
(In reply to Christoph Diehl [:cdiehl] from comment #5)
> This also blocks WebGL fuzzing with Textures. Any chance we can get this bug
> some care?

I've had the testcase running for quite some time without generating a crash.

This is something I should be able to fix without actually reproducing it, just by doing a conservative check in Decoder::Write that we're on the main thread before attempting to allocate frames.

However, I'd like to be able to reproduce this so I can get a better sense of what the sequence of events was leading up to this point. I suspect this is a case where we're calling SetSynchronous(true) on the decoder and either failing to call SetSynchronous(false) or wrongly doing this outside of the decoding lock.

Right now my suspicions fall right here:

https://mxr.mozilla.org/mozilla-central/source/image/src/RasterImage.cpp#2409

This looks obviously unsafe to me but it's hard to be sure this is actually what's causing _this_ issue. I'll write a patch for this regardless.
Heh, sorry, I never got around to my question. Christoph, can you reproduce this outside of the fuzzer or ASAN, just by viewing the testcase on a debug build?
Flags: needinfo?(cdiehl)
(In reply to Seth Fowler [:seth] from comment #8)
> I've had the testcase running for quite some time without generating a crash.

Try hitting cmd+r a couple of times.

(In reply to Seth Fowler [:seth] from comment #9)
> Heh, sorry, I never got around to my question. Christoph, can you reproduce
> this outside of the fuzzer or ASAN, just by viewing the testcase on a debug
> build?

Outside of the fuzzer - yes, outside of ASan - not sure haven't tested that.
Flags: needinfo?(cdiehl)
OK, I tried reloading a good deal and still didn't have luck. I'll try this on an ASan build tomorrow; I don't have access to a linux machine today.
Depends on: 940714
Bug 940714, which just hit m-c, took care of one possible cause. I still haven't gotten in front of a linux machine to test an ASan build, but I'll try to do so soon.
I have now diagnosed the problem. It boils down to one line, RasterImage.cpp:3032:

>  MutexAutoUnlock(mDecodingMutex);

When we unlock the mutex to send out notifications, all of our invariants are lost. Since we have semi-global variables in imagelib (for example, boolean determining whether a decoder is synchronous), things can go haywire when this happens.

I've pushed a try job with a simple fix for this problem, with the intention of generating an ASan build to verify a fix. The better solution is to eliminate 'mSynchronous' in the Decoder and simply pass it in. I'm going to try that now, but the structure of the code may make that difficult.

We are probably going to keep having problems until we get rid of this call to MutexAutoUnlock, though. The right thing is probably just to send those notifications asynchronously and take the perf hit. =(

 https://tbpl.mozilla.org/?tree=Try&rev=31a44ff30d09
This is a test of the patch to remove 'mSynchronous' from imagelib's Decoder class altogether. Synchronous decoding is instead determined statelessly by function arguments. Inter-thread interference thus can't happen.

This is one I'd be more interested in landing if the tests look green.

https://tbpl.mozilla.org/?tree=Try&rev=eeb99256c0c1
Nice! I'm pleased that removing mSynchronous turned out so well. I'm going to use try to generate an ASan build based upon the above patch and verify the fix.

https://tbpl.mozilla.org/?tree=Try&rev=a5e2d5ea01cf
I can no longer reproduce the problem with the build above. Will post a review-ready version of the patch shortly.
This patches removes the 'mSynchronous' variable from the Decoder class entirely, and all of the supporting machinery. Instead, callers to Decode::Write pass in a DecodeStrategy, which is either DECODE_SYNC or DECODE_ASYNC. This provides the same functionality, but it prevents different threads from interfering with each other.

This fixes the specific issue discussed in this bug, and IMO it makes the code cleaner. Work in other bugs will fix the more general issue described in comment #13.
Attachment #8346961 - Flags: review?(josh)
Now that this bug is secure, I'll state more clearly that bug 943804 is where the general issue in comment #13 will get fixed. This bug should've been hidden earlier, but I didn't immediately realize the connection between it and some of the other issues in imagelib lately.
Sorry, bug 943803.
Comment on attachment 8346961 [details] [diff] [review]
Use a stateless approach to synchronous image decoding.

Nice.
Attachment #8346961 - Flags: review?(josh) → review+
https://hg.mozilla.org/mozilla-central/rev/772ce6463fd8
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla29
Was this trunk only? Otherwise, it should have got a security rating before going in and we may need to take it on branches.
Flags: needinfo?(seth)
(In reply to Al Billings [:abillings] from comment #23)
> Was this trunk only? Otherwise, it should have got a security rating before
> going in and we may need to take it on branches.

Apologies for the slow response! Finally getting my backlog cleared out.

Given that it was reported 7 months ago, I'd say it's not trunk-only. =) This is actually fallout from bug 716140, meaning AFAICT it goes all the way back to FF *22*.

Presumably that means the patch should go at least to ESR 24.
Flags: needinfo?(seth)
It can't get approved for ESR24 without a rating though (and probably not branches). 

It doesn't seem that the crash is exploitable, when it can be triggered. What are the security implications here?
(Never mind about branches, I didn't see when we'd fixed it.)
What's the sec rating here? Also, can we get an esr24 nom please? :)
Flags: needinfo?(seth)
Looks like this isn't going to be going to ESR24.
Whiteboard: [fuzzblocker] → [fuzzblocker][adv-main28+]
(In reply to Al Billings [:abillings] from comment #25)
> It can't get approved for ESR24 without a rating though (and probably not
> branches). 
> 
> It doesn't seem that the crash is exploitable, when it can be triggered.
> What are the security implications here?

I don't *think* this is exploitable. Bug 943803 fixed the more fundamental issue.
Flags: needinfo?(seth)
I guess what I just posted does not constitute a sec rating, though. Whenever I'm asked to assign a sec rating, I feel like I'm basically choosing arbitrarily. I don't really understand the nuances that exist between the various ratings other than sec-critical (and this isn't sec-critical, AFAICT).
(In reply to Ryan VanderMeulen [:RyanVM UTC-5] from comment #31)
> Does this help?
> https://wiki.mozilla.org/Security_Severity_Ratings

That does help, but a bit of frustration remains for me because if I am not sure of the exact ramifications of a bug (as is frequently the case, because once you lose your invariants it's very difficult to reason about what else can happen) then the only reasonable rating I can give is sec-critical. By these definitions, very few security bugs I've dealt with could be rated any other way.

So I suppose sec-critical is my rating for this bug as well. In the future I will treat that as my default rating unless I am *extremely* convinced that a lower rating is warranted.
That's a valid conservative rating strategy. It is usually easier to fix the bug than "prove" it is or isn't exploitable in practice, and unless you're convinced a bug is not exploitable it's better to fix it than have someone else prove us wrong.

> Whenever I'm asked to assign a sec rating, I feel like I'm basically choosing arbitrarily.

We all feel that way. Our use of the word "triage" with its war-time connotations is not accidental. Take your best guess and we hope we're right more often than not. Don't spend too much time thinking about the rating because your time is more valuable spent on fixing.
If this is sec-crit, we're going to want it on esr24. Please nominate it as such :). Up to the release drivers if it's landed for this cycle or the next at this point.
It is too late for this cycle.
Al, if we're going to be landing Pwn2Own fixes for 28/24.4 still, should we take this while we're at it since we're treating it as a sec-crit?
Flags: needinfo?(abillings)
This applies with a little bit of care to 28, but <28 will need a more targeted patch.
Flags: needinfo?(seth)
The beta Try run was green.
Yes, please get this in, especially since it was supposed to go in. At least for 28.
Flags: needinfo?(abillings)
If it is critical, I would be happy to have it for ESR 24.
It missed the boat for this release since it is in a few days.
Keywords: sec-critical
Let's get this sec-critical fix landed on mozilla-release and esr24 branch so that it can ride along with the pwn2own landings happening today for the 28 rc respin.
Comment on attachment 8346961 [details] [diff] [review]
Use a stateless approach to synchronous image decoding.

[Approval Request Comment]
Bug caused by (feature/regressing bug #): bug 716140
User impact if declined: sec-crit
Testing completed (on m-c, etc.): on trunk for 3 months
Risk to taking this patch (and alternatives if risky): passes on try
String or IDL/UUID changes made by this patch: none
Attachment #8346961 - Flags: approval-mozilla-beta?
This applies much better to <28 with bug 940714 also applied. I've got Try runs going for that right now.
Blocks: 716140
Keywords: regression
(In reply to Ryan VanderMeulen [:RyanVM UTC-4] from comment #45)
> This applies much better to <28 with bug 940714 also applied. I've got Try
> runs going for that right now.

I'll request uplift for that too. It's kinda a strange situation because bug 940714 is pretty much completely obliterated by the changes in this bug, so we're mainly uplifting it to make this patch easier to apply, but I think it makes sense to do that.
Flags: needinfo?(seth)
Comment on attachment 8346961 [details] [diff] [review]
Use a stateless approach to synchronous image decoding.

[Approval Request Comment]
If this is not a sec:{high,crit} bug, please state case for ESR consideration: It's sec-crit.
User impact if declined: Crashes which may be exploitable.
Fix Landed on Version: 29
Risk to taking this patch (and alternatives if risky): Very low. On m-c for 3 months with no issues. Affects a section of the code which hasn't churned much since the ESR release, so surprising interactions are unlikely.
String or UUID changes made by this patch: None.
Attachment #8346961 - Flags: approval-mozilla-release?
Attachment #8346961 - Flags: approval-mozilla-esr24?
Comment on attachment 8346961 [details] [diff] [review]
Use a stateless approach to synchronous image decoding.

Approving per lsblakk comments.
Attachment #8346961 - Flags: approval-mozilla-release?
Attachment #8346961 - Flags: approval-mozilla-release+
Attachment #8346961 - Flags: approval-mozilla-esr24?
Attachment #8346961 - Flags: approval-mozilla-esr24+
Attachment #8346961 - Flags: approval-mozilla-beta?
Attachment #8346961 - Flags: approval-mozilla-beta+
I tried to reproduce this issue on an ASAN build from a couple of days ago but I don't get any assertion failures or crashes. I don't get any "ok" buttons either, pop-up dialogs just keep appearing and disappearing very fast, but I don't see any buttons on them (the text is something like this: "Generating a private key. Please wait...").

I am testing on Ubuntu 12.10 64bit. Is this Mac OS X specific? (if it is, please direct me to some builds to reproduce/verify this fix with).
Flags: needinfo?(seth)
(In reply to Ioana Budnar, QA [:ioana] from comment #50)
> I tried to reproduce this issue on an ASAN build from a couple of days ago
This is actually "a couple of months ago". To be more specific, I tested with the Firefox 29.0a1 ASAN build from 12/10.

I tested this again today with an RC ASAN from 03/15. The difference compared to the pre-fix build when using the attached testcase is that here I get a dialog for setting a password for the Software Security Device. The pop-up dialog that asks for the password is then re-displayed after each "ok" or "cancel". The simplified testcase freezes the browser. It doesn't get to any crashes or other issues, it just stays frozen until the user closes it (it can be closed from the [x] button, you don't need to kill the process).
Matt, can you please help in verifying this is fixed ahead of release tomorrow?
Flags: needinfo?(mwobensmith)
It takes a minute or two of running with these test cases to see a crash. I wasn't able to do it with a non-ASan build - these are local ASan builds.

Confirmed crash in 24.3.0esrpre, 2014-03-12.
Verified fixed in Fx28, 2014-03-17.

Still testing other branches.
Flags: needinfo?(mwobensmith)
Whiteboard: [fuzzblocker][adv-main28+] → [fuzzblocker][adv-main28+][adv-esr24.4+]
Flags: needinfo?(seth)
QA Contact: mwobensmith
Group: core-security
This issue appears to be an issue Qanalyst is unable to verify. Marking QAExclude in QA Whiteboard.
QA Whiteboard: QAExclude
Flags: needinfo?(jmercado)
Flags: needinfo?(jmercado)
You need to log in before you can comment on or make changes to this bug.