Assertion failure: NS_IsMainThread() and crash [@mozilla::image::Decoder::AllocateFrame]

RESOLVED FIXED in Firefox 28, Firefox OS v1.2

Status

()

Core
ImageLib
--
critical
RESOLVED FIXED
5 years ago
3 years ago

People

(Reporter: posidron, Assigned: seth)

Tracking

(6 keywords)

Trunk
mozilla29
x86_64
Mac OS X
assertion, crash, regression, sec-critical, testcase, verifyme
Points:
---
Dependency tree / graph

Firefox Tracking Flags

(firefox27 wontfix, firefox28+ verified, firefox29 fixed, firefox-esr2428+ fixed, b2g18 unaffected, b2g-v1.1hd unaffected, b2g-v1.2 fixed, b2g-v1.3 fixed, b2g-v1.3T fixed, b2g-v1.4 unaffected)

Details

(Whiteboard: [fuzzblocker][adv-main28+][adv-esr24.4+], crash signature)

Attachments

(5 attachments)

(Reporter)

Description

5 years ago
Created attachment 778962 [details]
callstack.txt

Load the attached testcase and click the "Ok" button for every two popup boxes, then wait a few seconds.

Assertion failure: NS_IsMainThread(), at /Users/cdiehl/dev/repos/mozilla/mozilla-inbound/image/src/Decoder.cpp:197

Tested with http://hg.mozilla.org/integration/mozilla-inbound/rev/9fd04fd70fc2
(Reporter)

Comment 1

5 years ago
Created attachment 778963 [details]
testcase.html
(Reporter)

Comment 2

5 years ago
Created attachment 778964 [details]
media.zip

Updated

5 years ago
Crash Signature: [@ mozilla::image::Decoder::AllocateFrame()]
Blocks: 896470

Comment 3

5 years ago
If this only happens in ASan+debug builds, it might be related to bug 895845.
(Reporter)

Comment 4

5 years ago
Created attachment 784928 [details]
testcase-simplified

This is a very simplified testcase, the stack is the same.

During the fuzzing run I see this crash as:

###!!! ABORT: Calling InitDecoder() but already decoded!: '!mDecoded', file /Users/cdiehl/dev/repos/mozilla/mozilla-inbound/image/src/RasterImage.cpp, line 1963


Tested against http://hg.mozilla.org/integration/mozilla-inbound/rev/f85f06761d52

Updated

5 years ago
Whiteboard: [fuzzblocker]
(Reporter)

Comment 5

5 years ago
This also blocks WebGL fuzzing with Textures. Any chance we can get this bug some care?
(Reporter)

Updated

5 years ago
Flags: needinfo?(milan)
Seth, how're you doing for time?
Flags: needinfo?(seth)
(Assignee)

Comment 7

5 years ago
I can take this.
Assignee: nobody → seth
Flags: needinfo?(seth)
Flags: needinfo?(milan)
(Assignee)

Comment 8

5 years ago
(In reply to Christoph Diehl [:cdiehl] from comment #5)
> This also blocks WebGL fuzzing with Textures. Any chance we can get this bug
> some care?

I've had the testcase running for quite some time without generating a crash.

This is something I should be able to fix without actually reproducing it, just by doing a conservative check in Decoder::Write that we're on the main thread before attempting to allocate frames.

However, I'd like to be able to reproduce this so I can get a better sense of what the sequence of events was leading up to this point. I suspect this is a case where we're calling SetSynchronous(true) on the decoder and either failing to call SetSynchronous(false) or wrongly doing this outside of the decoding lock.

Right now my suspicions fall right here:

https://mxr.mozilla.org/mozilla-central/source/image/src/RasterImage.cpp#2409

This looks obviously unsafe to me but it's hard to be sure this is actually what's causing _this_ issue. I'll write a patch for this regardless.
(Assignee)

Comment 9

5 years ago
Heh, sorry, I never got around to my question. Christoph, can you reproduce this outside of the fuzzer or ASAN, just by viewing the testcase on a debug build?
Flags: needinfo?(cdiehl)
(Reporter)

Comment 10

5 years ago
(In reply to Seth Fowler [:seth] from comment #8)
> I've had the testcase running for quite some time without generating a crash.

Try hitting cmd+r a couple of times.

(In reply to Seth Fowler [:seth] from comment #9)
> Heh, sorry, I never got around to my question. Christoph, can you reproduce
> this outside of the fuzzer or ASAN, just by viewing the testcase on a debug
> build?

Outside of the fuzzer - yes, outside of ASan - not sure haven't tested that.
Flags: needinfo?(cdiehl)
(Assignee)

Comment 11

5 years ago
OK, I tried reloading a good deal and still didn't have luck. I'll try this on an ASan build tomorrow; I don't have access to a linux machine today.
(Assignee)

Updated

5 years ago
Depends on: 940714
(Assignee)

Comment 12

5 years ago
Bug 940714, which just hit m-c, took care of one possible cause. I still haven't gotten in front of a linux machine to test an ASan build, but I'll try to do so soon.
(Assignee)

Comment 13

5 years ago
I have now diagnosed the problem. It boils down to one line, RasterImage.cpp:3032:

>  MutexAutoUnlock(mDecodingMutex);

When we unlock the mutex to send out notifications, all of our invariants are lost. Since we have semi-global variables in imagelib (for example, boolean determining whether a decoder is synchronous), things can go haywire when this happens.

I've pushed a try job with a simple fix for this problem, with the intention of generating an ASan build to verify a fix. The better solution is to eliminate 'mSynchronous' in the Decoder and simply pass it in. I'm going to try that now, but the structure of the code may make that difficult.

We are probably going to keep having problems until we get rid of this call to MutexAutoUnlock, though. The right thing is probably just to send those notifications asynchronously and take the perf hit. =(

 https://tbpl.mozilla.org/?tree=Try&rev=31a44ff30d09
(Assignee)

Comment 14

5 years ago
This is a test of the patch to remove 'mSynchronous' from imagelib's Decoder class altogether. Synchronous decoding is instead determined statelessly by function arguments. Inter-thread interference thus can't happen.

This is one I'd be more interested in landing if the tests look green.

https://tbpl.mozilla.org/?tree=Try&rev=eeb99256c0c1
(Assignee)

Comment 15

5 years ago
Nice! I'm pleased that removing mSynchronous turned out so well. I'm going to use try to generate an ASan build based upon the above patch and verify the fix.

https://tbpl.mozilla.org/?tree=Try&rev=a5e2d5ea01cf
(Assignee)

Comment 16

5 years ago
I can no longer reproduce the problem with the build above. Will post a review-ready version of the patch shortly.
(Assignee)

Comment 17

5 years ago
Created attachment 8346961 [details] [diff] [review]
Use a stateless approach to synchronous image decoding.

This patches removes the 'mSynchronous' variable from the Decoder class entirely, and all of the supporting machinery. Instead, callers to Decode::Write pass in a DecodeStrategy, which is either DECODE_SYNC or DECODE_ASYNC. This provides the same functionality, but it prevents different threads from interfering with each other.

This fixes the specific issue discussed in this bug, and IMO it makes the code cleaner. Work in other bugs will fix the more general issue described in comment #13.
Attachment #8346961 - Flags: review?(josh)
(Assignee)

Comment 18

5 years ago
Now that this bug is secure, I'll state more clearly that bug 943804 is where the general issue in comment #13 will get fixed. This bug should've been hidden earlier, but I didn't immediately realize the connection between it and some of the other issues in imagelib lately.
(Assignee)

Comment 19

5 years ago
Sorry, bug 943803.
Comment on attachment 8346961 [details] [diff] [review]
Use a stateless approach to synchronous image decoding.

Nice.
Attachment #8346961 - Flags: review?(josh) → review+
https://hg.mozilla.org/mozilla-central/rev/772ce6463fd8
Status: NEW → RESOLVED
Last Resolved: 5 years ago
status-firefox28: --- → fixed
Resolution: --- → FIXED
Target Milestone: --- → mozilla29
Was this trunk only? Otherwise, it should have got a security rating before going in and we may need to take it on branches.
Flags: needinfo?(seth)
status-b2g18: --- → ?
status-b2g-v1.2: --- → ?
status-b2g-v1.3: --- → fixed
status-b2g-v1.3T: --- → fixed
status-b2g-v1.4: --- → unaffected
status-firefox-esr24: --- → ?
(Assignee)

Comment 24

4 years ago
(In reply to Al Billings [:abillings] from comment #23)
> Was this trunk only? Otherwise, it should have got a security rating before
> going in and we may need to take it on branches.

Apologies for the slow response! Finally getting my backlog cleared out.

Given that it was reported 7 months ago, I'd say it's not trunk-only. =) This is actually fallout from bug 716140, meaning AFAICT it goes all the way back to FF *22*.

Presumably that means the patch should go at least to ESR 24.
Flags: needinfo?(seth)
It can't get approved for ESR24 without a rating though (and probably not branches). 

It doesn't seem that the crash is exploitable, when it can be triggered. What are the security implications here?
(Never mind about branches, I didn't see when we'd fixed it.)
status-b2g18: ? → affected
status-b2g-v1.1hd: --- → affected
status-b2g-v1.2: ? → affected
status-firefox27: --- → wontfix
status-firefox-esr24: ? → affected
What's the sec rating here? Also, can we get an esr24 nom please? :)
status-b2g18: affected → unaffected
status-b2g-v1.1hd: affected → unaffected
status-firefox29: --- → fixed
Flags: needinfo?(seth)
Looks like this isn't going to be going to ESR24.
Whiteboard: [fuzzblocker] → [fuzzblocker][adv-main28+]
(Assignee)

Comment 29

4 years ago
(In reply to Al Billings [:abillings] from comment #25)
> It can't get approved for ESR24 without a rating though (and probably not
> branches). 
> 
> It doesn't seem that the crash is exploitable, when it can be triggered.
> What are the security implications here?

I don't *think* this is exploitable. Bug 943803 fixed the more fundamental issue.
Flags: needinfo?(seth)
(Assignee)

Comment 30

4 years ago
I guess what I just posted does not constitute a sec rating, though. Whenever I'm asked to assign a sec rating, I feel like I'm basically choosing arbitrarily. I don't really understand the nuances that exist between the various ratings other than sec-critical (and this isn't sec-critical, AFAICT).
(Assignee)

Comment 32

4 years ago
(In reply to Ryan VanderMeulen [:RyanVM UTC-5] from comment #31)
> Does this help?
> https://wiki.mozilla.org/Security_Severity_Ratings

That does help, but a bit of frustration remains for me because if I am not sure of the exact ramifications of a bug (as is frequently the case, because once you lose your invariants it's very difficult to reason about what else can happen) then the only reasonable rating I can give is sec-critical. By these definitions, very few security bugs I've dealt with could be rated any other way.

So I suppose sec-critical is my rating for this bug as well. In the future I will treat that as my default rating unless I am *extremely* convinced that a lower rating is warranted.
That's a valid conservative rating strategy. It is usually easier to fix the bug than "prove" it is or isn't exploitable in practice, and unless you're convinced a bug is not exploitable it's better to fix it than have someone else prove us wrong.

> Whenever I'm asked to assign a sec rating, I feel like I'm basically choosing arbitrarily.

We all feel that way. Our use of the word "triage" with its war-time connotations is not accidental. Take your best guess and we hope we're right more often than not. Don't spend too much time thinking about the rating because your time is more valuable spent on fixing.
If this is sec-crit, we're going to want it on esr24. Please nominate it as such :). Up to the release drivers if it's landed for this cycle or the next at this point.
It is too late for this cycle.
BTW, this never landed on 28.
status-b2g-v1.3: fixed → affected
status-b2g-v1.3T: fixed → affected
status-firefox28: fixed → wontfix
Al, if we're going to be landing Pwn2Own fixes for 28/24.4 still, should we take this while we're at it since we're treating it as a sec-crit?
Flags: needinfo?(abillings)
This applies with a little bit of care to 28, but <28 will need a more targeted patch.
Flags: needinfo?(seth)
Keywords: branch-patch-needed
The beta Try run was green.
Yes, please get this in, especially since it was supposed to go in. At least for 28.
Flags: needinfo?(abillings)
If it is critical, I would be happy to have it for ESR 24.
It missed the boat for this release since it is in a few days.
Keywords: sec-critical
Let's get this sec-critical fix landed on mozilla-release and esr24 branch so that it can ride along with the pwn2own landings happening today for the 28 rc respin.
status-firefox28: wontfix → affected
tracking-firefox28: --- → +
tracking-firefox-esr24: --- → 28+
Comment on attachment 8346961 [details] [diff] [review]
Use a stateless approach to synchronous image decoding.

[Approval Request Comment]
Bug caused by (feature/regressing bug #): bug 716140
User impact if declined: sec-crit
Testing completed (on m-c, etc.): on trunk for 3 months
Risk to taking this patch (and alternatives if risky): passes on try
String or IDL/UUID changes made by this patch: none
Attachment #8346961 - Flags: approval-mozilla-beta?
This applies much better to <28 with bug 940714 also applied. I've got Try runs going for that right now.
Blocks: 716140
Keywords: regression
(Assignee)

Comment 46

4 years ago
(In reply to Ryan VanderMeulen [:RyanVM UTC-4] from comment #45)
> This applies much better to <28 with bug 940714 also applied. I've got Try
> runs going for that right now.

I'll request uplift for that too. It's kinda a strange situation because bug 940714 is pretty much completely obliterated by the changes in this bug, so we're mainly uplifting it to make this patch easier to apply, but I think it makes sense to do that.
Flags: needinfo?(seth)
(Assignee)

Comment 47

4 years ago
Comment on attachment 8346961 [details] [diff] [review]
Use a stateless approach to synchronous image decoding.

[Approval Request Comment]
If this is not a sec:{high,crit} bug, please state case for ESR consideration: It's sec-crit.
User impact if declined: Crashes which may be exploitable.
Fix Landed on Version: 29
Risk to taking this patch (and alternatives if risky): Very low. On m-c for 3 months with no issues. Affects a section of the code which hasn't churned much since the ESR release, so surprising interactions are unlikely.
String or UUID changes made by this patch: None.
Attachment #8346961 - Flags: approval-mozilla-release?
Attachment #8346961 - Flags: approval-mozilla-esr24?
Comment on attachment 8346961 [details] [diff] [review]
Use a stateless approach to synchronous image decoding.

Approving per lsblakk comments.
Attachment #8346961 - Flags: approval-mozilla-release?
Attachment #8346961 - Flags: approval-mozilla-release+
Attachment #8346961 - Flags: approval-mozilla-esr24?
Attachment #8346961 - Flags: approval-mozilla-esr24+
Attachment #8346961 - Flags: approval-mozilla-beta?
Attachment #8346961 - Flags: approval-mozilla-beta+
Keywords: verifyme

Comment 50

4 years ago
I tried to reproduce this issue on an ASAN build from a couple of days ago but I don't get any assertion failures or crashes. I don't get any "ok" buttons either, pop-up dialogs just keep appearing and disappearing very fast, but I don't see any buttons on them (the text is something like this: "Generating a private key. Please wait...").

I am testing on Ubuntu 12.10 64bit. Is this Mac OS X specific? (if it is, please direct me to some builds to reproduce/verify this fix with).
status-b2g-v1.3T: affected → fixed

Updated

4 years ago
Flags: needinfo?(seth)

Comment 51

4 years ago
(In reply to Ioana Budnar, QA [:ioana] from comment #50)
> I tried to reproduce this issue on an ASAN build from a couple of days ago
This is actually "a couple of months ago". To be more specific, I tested with the Firefox 29.0a1 ASAN build from 12/10.

I tested this again today with an RC ASAN from 03/15. The difference compared to the pre-fix build when using the attached testcase is that here I get a dialog for setting a password for the Software Security Device. The pop-up dialog that asks for the password is then re-displayed after each "ok" or "cancel". The simplified testcase freezes the browser. It doesn't get to any crashes or other issues, it just stays frozen until the user closes it (it can be closed from the [x] button, you don't need to kill the process).
Matt, can you please help in verifying this is fixed ahead of release tomorrow?
Flags: needinfo?(mwobensmith)
It takes a minute or two of running with these test cases to see a crash. I wasn't able to do it with a non-ASan build - these are local ASan builds.

Confirmed crash in 24.3.0esrpre, 2014-03-12.
Verified fixed in Fx28, 2014-03-17.

Still testing other branches.
status-firefox28: fixed → verified
Flags: needinfo?(mwobensmith)
Whiteboard: [fuzzblocker][adv-main28+] → [fuzzblocker][adv-main28+][adv-esr24.4+]
(Assignee)

Updated

4 years ago
Flags: needinfo?(seth)

Updated

4 years ago
QA Contact: mwobensmith
Group: core-security
This issue appears to be an issue Qanalyst is unable to verify. Marking QAExclude in QA Whiteboard.
QA Whiteboard: QAExclude
Flags: needinfo?(jmercado)
Flags: needinfo?(jmercado)
You need to log in before you can comment on or make changes to this bug.