Closed Bug 1425930 Opened 3 years ago Closed 2 years ago

Crash in @0x0 | nsScreenAndroid::GetDensity

Categories

(GeckoView :: General, defect, P1)

58 Branch
Unspecified
Android
defect

Tracking

(firefox-esr6061+ fixed, firefox57 unaffected, firefox58 wontfix, firefox59 wontfix, firefox60 wontfix, firefox61+ fixed, firefox62+ fixed)

RESOLVED FIXED
mozilla62
Tracking Status
firefox-esr60 61+ fixed
firefox57 --- unaffected
firefox58 --- wontfix
firefox59 --- wontfix
firefox60 --- wontfix
firefox61 + fixed
firefox62 + fixed

People

(Reporter: philipp, Assigned: jesup)

References

Details

(4 keywords, Whiteboard: [adv-main61+][adv-esr60.1+][post-critsmash-triage])

Crash Data

Attachments

(2 files, 2 obsolete files)

This bug was filed from the Socorro interface and is
report bp-59230ead-9e9c-43d7-9664-b70fe0171206.
=============================================================

Top 10 frames of crashing thread:

0  @0x0 
1 libxul.so nsScreenAndroid::GetDensity widget/android/nsScreenManagerAndroid.cpp
2 libxul.so mozilla::GeckoScreenOrientation::OnOrientationChange widget/android/GeckoScreenOrientation.h:47
3 libxul.so mozilla::jni::detail::ProxyNativeCall<mozilla::GeckoScreenOrientation, mozilla::java::GeckoScreenOrientation, true, false, short, short>::operator widget/android/jni/Natives.h:418
4 libxul.so mozilla::ThreadEventQueue<mozilla::PrioritizedEventQueue<mozilla::EventQueue> >::PutEventInternal xpcom/threads/ThreadEventQueue.cpp:116
5 libxul.so mozilla::detail::RunnableFunction<mozilla::jni::detail::ProxyNativeCall<mozilla::GeckoScreenOrientation, mozilla::java::GeckoScreenOrientation, true, false, short, short> >::Run xpcom/threads/nsThreadUtils.h:529
6 libxul.so nsThread::ProcessNextEvent xpcom/threads/nsThread.cpp:1037
7 libxul.so mozilla::detail::VariantImplementation<unsigned char, 0, const int, const char*, void > > mfbt/Variant.h:219
8 libxul.so nsTArray_Impl<nsAutoPtr<nsComponentManagerImpl::KnownModule>, nsTArrayInfallibleAllocator>::DestructRange xpcom/ds/nsTArray.h:562
9 libxul.so wcsrtombs 

=============================================================

crashes with this signature are newly appearing in fennec 58 - they are fairly low volume though.

the user comment to this crash report says: "occurred when the phone unlocked and rotated simultaneously"
the stack of this signature looks similar
Crash Signature: [@ @0x0 | nsScreenAndroid::GetDensity] → [@ @0x0 | nsScreenAndroid::GetDensity] [@ mozilla::ObserverList<T>::Broadcast ]
http://bit.ly/2BmOHxA shows a mix of devices, including some Samsung. ni on Ioana to see if rotating the device after unlocking reproduces the issue (assuming you have one of the devices listed in the crash reports)
Flags: needinfo?(ioana.chiorean)
Tried with 58.0b14 build 2 on the following devices: 
- Galaxy J7 ( saw J5 in the report but we don't have it)  Android 6.0.1 
- Galaxy note 3 ( saw Note 4 in the report but we don't have it) Android 5.0 
but I was not able to reproduce it at all. 

Steps: 
- rotate the device immediately after unlocking phone
Flags: needinfo?(ioana.chiorean)
I was able to reproduce this with Samsung Galaxy Note 4 (Android 5.0.1) on latest Nightly build (01/10) and beta 58.0b15.

Steps to reproduce:
1. Go to https://browserengine.net/category/firefox/ and wait for page to load;
2. Change device orientation. (1-2 times) 

Expected result:
Firefox opens the page without issues.

Actual result:
After changing device orientation the page becomes white and Firefox crashes. 

Notes:
- crash report Nightly: https://crash-stats.mozilla.com/report/index/2ac0dbb2-d710-43e8-a546-9d78b0180111
- crash report Beta: https://crash-stats.mozilla.com/report/index/90a808f0-44b7-41bd-af78-030d00180111
Assignee: nobody → snorp
Neither of the crashes in comment #4 are the crash signature for this bug, and I can't seem to reproduce it locally.

GeckoScreenOrientation::OnOrientationChange does not call nsScreenManagerAndroid::GetDensity(), so I think the stack may not quite be right. The stack also indicates that we're trying to deref 0x0, which is not possible in GetDensity() as it only dereferences this+mDensity and this+mDisplayType. Also, we guard against a null screen in OnOrientationChange().

Investigating some more...
Crash Signature: [@ @0x0 | nsScreenAndroid::GetDensity] [@ mozilla::ObserverList<T>::Broadcast ] → [@ @0x0 | nsScreenAndroid::GetDensity] [@ mozilla::ObserverList<T>::Broadcast ] [@ mozilla::GeckoScreenOrientation::OnOrientationChange ]
The crash signature I just added has more strange stacks. It sorta looks like hal::NotifyScreenConfigurationChange() goes off into the weeds somehow?
We need industrial strength debugging support in the form of jchen...
Flags: needinfo?(nchen)
I see a lot of possible use-after-free in the crash reports.
Group: firefox-core-security
I think mozilla::ObserverList<T> is just prone to use-after-free. If an observer is ever removed from the list during a broadcast, we could end up using the observer after it's been freed.
Flags: needinfo?(nchen)
Crash Signature: [@ @0x0 | nsScreenAndroid::GetDensity] [@ mozilla::ObserverList<T>::Broadcast ] [@ mozilla::GeckoScreenOrientation::OnOrientationChange ] → [@ @0x0 | nsScreenAndroid::GetDensity] [@ mozilla::ObserverList<T>::Broadcast ] [@ mozilla::GeckoScreenOrientation::OnOrientationChange ] [@ @0x0 | mozilla::GeckoScreenOrientation::OnOrientationChange]
I'll note the crash rate seems to be spiking... jchen?  thoughts?
Flags: needinfo?(nchen)
I haven't looked at this recently. Someone should verify if comment 9 is accurate or not.
Flags: needinfo?(nchen)
(In reply to Jim Chen [:jchen] [:darchons] from comment #9)
> I think mozilla::ObserverList<T> is just prone to use-after-free. If an
> observer is ever removed from the list during a broadcast, we could end up
> using the observer after it's been freed.

ObserverList<T> clones the list into another nsCOMArray before walking it to do notifies - i.e. removal from the list during NotifyObservers() is totally safe.
Flags: needinfo?(nchen)
I think you're talking about `nsObserverList`? `ObserverList<T>` is defined in xpcom/ds/Observer.h. The problem with `ObserverList<T>` is that it doesn't ensure validity of the observer pointers, so even though it makes a copy of the array during broadcast [1], the pointers themselves are not guaranteed to be valid throughout the broadcast.

[1] https://searchfox.org/mozilla-central/rev/00dd116638001772fe354b081353b73f1cad405d/xpcom/ds/Observer.h#70
Flags: needinfo?(nchen)
Ah.  Wonderful naming there..... :-(

and none of this is refcounted...  It's only used in midi and Hal.cpp.  

There are two options: block RemoveObserver() from within Notify(), and handle it safely.  (And it's not sure this is the cause of the bug, but it's probably good to do one of those.)
Note that blocking it will likely result in a bunch of assertion crashes, given the signature for Broadcast... very clear UAFs.  Though it would be nice to try to provoke it and see why we're looping through to RemoveObserver, and if it's a bug or expected/correct.
Note: patches aren't tested yet, but they compile (modulo I changed some comments since compiling)
Sigh, bug 678694 comment 52 and bug 678694 comment 53 said we were aware of this sort of problem, but we didn't do anything about it.

I'm assuming the crashes are coming because we're iterating through the copy, and somebody RemoveObserver()'s an Observer that's not been notified yet.  Then we get to the removed observer in our copy of the observers, and we crash?

I'm not super keen on attachment 8976406 [details] [diff] [review] blocking notifications, because we'll just keep crashing in this scenario, AIUI.  Unless we want to commit it as a way of figuring out what exactly is going on (and protect ourselves from future misuses of the API?) and then we can either decide to fix the uses of ObserverList and/or fix ObserverList itself to not UAF?  Or can we push that patch to try and attempt to reproduce the crash with a patched build so we can understand the problem?

I think attachment 8976407 [details] [diff] [review] (or something very much like it) is the right way to go (or is at least a reasonable attempt at fixing the bug), but I'm not quite sure what we're trying to do here...
Flags: needinfo?(rjesup)
This is currently the #10 topcrash on Fennec 61 Beta, FWIW.
Blocking (first patch) would be if we decide the API is "you aren't allowed to call RemoveObserver() inside Notify()", and would help us catch any mis-use (and be safe if one escapes to the wild).

Handling RemoveObserver in Notify (second patch) would make it allowed and safe.  Probably this is a better solution, overall.  It does seem likely that the bug is, in fact, RemoveObserver() called within Notify from looking at the reports, then we crash.

There is one other possibility here: this is not a thread-safe interface in any way.  If another thread removes an observer while we're broadcasting, this fails badly in all sorts of ways.  The Blocking patch might help diagnose how that's occurring, if it is.  We could also add thread-safety checks - however, all the Remove/AddObserver calls in Hal.cpp have an AssertMainThread(), so I don't think this is happening.  Also most (but not all) Broadcast calls are preceded by an AssertMainThread().

One comment that does concern me however is in NotifySwitchChange: // When callback this notification, main thread may call unregister function first. We should check if this pointer is valid.
Others that don't assert: NotifyScreenConfigurationChange(), NotifyNetworkChange().

We could assert mainthread on those and see what happens....

I haven't looked at the Midi code
Flags: needinfo?(rjesup)
This affects ESR60 (do we do ESR for Fennec??).  I suppose it might affect the Midi code, if that's enabled.  We should fix this in 61 beta.
Asserting on removals during iterating so we could get more information sounds attractive, but given that this isn't really happening on nightly or beta, I think we should just deal with removal during iteration.  If we want to add some thread assertions somewhere (here or in a separate bug) that's fine with me.
Comment on attachment 8976407 [details] [diff] [review]
Handle Broadcast()->Notify() calling RemoveObserver()

Review of attachment 8976407 [details] [diff] [review]:
-----------------------------------------------------------------

::: tools/profiler/core/platform-linux-android.cpp
@@ +306,5 @@
>    if (mSamplerTid == -1) {
>      mSamplerTid = gettid();
>    }
>    int sampleeTid = aRegisteredThread.Info()->ThreadId();
> +  MOZ_ASSERT(sampleeTid != mSamplerTid);

I don't understand this change; was this for a separate bug?

::: xpcom/ds/Observer.h
@@ +56,5 @@
>     * @return Whether the observer has been found in the list.
>     */
>    bool RemoveObserver(Observer<T>* aObserver)
>    {
> +    if (mObservers.RemoveElement(aObserver)) {

Do we care if somebody adds the same observer twice?  (Presumably they would remove it twice as well...?
Attachment #8976407 - Flags: review?(nfroyd) → review+
Comment on attachment 8976406 [details] [diff] [review]
Block Broadcast()->Notify() calling RemoveObserver()

See comment 23 for why we're not taking this path.
Attachment #8976406 - Flags: review?(nfroyd)
> ::: tools/profiler/core/platform-linux-android.cpp

> > +  MOZ_ASSERT(sampleeTid != mSamplerTid);
> 
> I don't understand this change; was this for a separate bug?

Oops, ignore, change got swept up in qnew.  I'll split that off to the other patch I was working on.

> > +    if (mObservers.RemoveElement(aObserver)) {
> 
> Do we care if somebody adds the same observer twice?  (Presumably they would
> remove it twice as well...?

I don't see a need to arbitrarily block it, though it's unlikely to be useful barring a few edge cases.  If they add it twice, they need to remove it twice, so all the logic holds.  It isn't blocked today.  In general with other lists of this nature we're dealing with reference-counted objects, so people generally assume it's safe to have the same item in the list twice (and remove it twice), and these are unordered lists.
Attachment #8976407 - Attachment is obsolete: true
Assignee: snorp → rjesup
Status: NEW → ASSIGNED
Comment on attachment 8977030 [details] [diff] [review]
Handle Broadcast()->Notify() calling RemoveObserver()

[Approval Request Comment]
If this is not a sec:{high,crit} bug, please state case for ESR consideration:
User impact if declined: UAFs

Fix Landed on Version: will be 62 and 61

Risk to taking this patch (and alternatives if risky): Low

String or UUID changes made by this patch: none


Approval Request Comment
[Feature/Bug causing the regression]: bug 678694 (Battery API)

[User impact if declined]: Android UAF; might be hard to exploit if it's on shutdown

[Is this code covered by automated tests?]: yes (never caught this)

[Has the fix been verified in Nightly?]: no

[Needs manual test from QE? If yes, steps to reproduce]: no str

[List of other uplifts needed for the feature/fix]: none

[Is the change risky?]: No

[Why is the change risky/not risky?]: Adds safety checks, simple patch.

[String changes made/needed]: none


[Security approval request comment]
How easily could an exploit be constructed based on the patch? Not clear; they'd have to figure out how to leverage a Remove-during-Notify.  If these are only on shutdown, tough.  If causable at other times, might be possible especially given lack of e10s.

Do comments in the patch, the check-in comment, or tests included in the patch paint a bulls-eye on the security problem?  More or less they make it clear what can happen (especially the comment, but not hard to infer from the code why it's doing this).

Which older supported branches are affected by this flaw? All

If not all supported branches, which bug introduced the flaw? 

Do you have backports for the affected branches? If not, how different, hard to create, and risky will they be?  Should apply directly

How likely is this patch to cause regressions; how much testing does it need?  Low risk.  Should get some normal field exposure before release, which will also help verify the fix.
Attachment #8977030 - Flags: sec-approval?
Attachment #8977030 - Flags: review+
Attachment #8977030 - Flags: approval-mozilla-esr60?
Attachment #8977030 - Flags: approval-mozilla-beta?
sec-approval+ and I'll give the other approvals.
Attachment #8977030 - Flags: sec-approval?
Attachment #8977030 - Flags: sec-approval+
Attachment #8977030 - Flags: approval-mozilla-esr60?
Attachment #8977030 - Flags: approval-mozilla-esr60+
Attachment #8977030 - Flags: approval-mozilla-beta?
Attachment #8977030 - Flags: approval-mozilla-beta+
Attachment #8976406 - Attachment is obsolete: true
Backed out for build bustages at ../testing/xpcshell/selftest.py::XPCShellTestsTests::testAddTaskSki

Push that caused the failures: https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&revision=9d111a95cfbbe2dea1427d22ab342a584d9c8af6

Failure log: https://treeherder.mozilla.org/logviewer.html#?job_id=179510872&repo=mozilla-inbound&lineNumber=36667

Backout: https://hg.mozilla.org/integration/mozilla-inbound/rev/1f0b7e9a6450c28f816ca52d8809c41b6b87b3e5

The XPCShell test logs seem to indicate that we're removing the element from mObservers, and then we're trapping in an invalid array index access in mBroadcastCopy.  Maybe we're *also* adding observers during Broadcast? (!)

I think the straightforward fix is to check for NoIndex.  But I'd like to understand whether the HAL code is doing normal things or dodgy things before applying that fix, which is why apavel and I decided to back it out, rather than apply the bandaid.
Flags: needinfo?(rjesup)
Also a lot of Android test crashes, at the new MOZ_ASSERT and related places.
Obviously gets merged when relanding.  Passes xpcshell tests now locally
Attachment #8979310 - Flags: review?(nfroyd)
Comment on attachment 8979310 [details] [diff] [review]
add missing check that we're in a Broadcast

Review of attachment 8979310 [details] [diff] [review]:
-----------------------------------------------------------------

Doh.  Should have thought of that myself.
Attachment #8979310 - Flags: review?(nfroyd) → review+
Since we don't ship ESR android builds, and (if I'm correct) won't ship MIDI enabled in ESR 60, we actually wouldn't need to land this on ESR 60.  It's only used in Hal.h/cpp and dom/midi/* (for two lists).  This would potentially affect desktop MIDI, if it's enabled (depending on what the observers do).
https://hg.mozilla.org/mozilla-central/rev/a314710b0acd
Status: ASSIGNED → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED
Target Milestone: --- → Firefox 62
Sorina, as you were able to reproduce it, can you test it for verification too?
Thanks!
Flags: needinfo?(sorina.florean)
Depends on: 1463494
Tested with Samsung Galaxy S6(Android 7.0), Samsung Galaxy Tab S3 (Android 7.0), Samsung Galaxy Note 8 (Android 8.0) and the issue is not reproducible on latest Nightly build. Followed the steps from comment 4 on 58 as well, but wasn't able to reproduce with the devices mentioned above. 

Tested also with Samsung Galaxy Note 4 (Android 5.0.1) on latest Nightly from Play Store (62.0a1 - 05/23) and after changing device orientation 2-3 times, Fennec stops and no crash report is sent. 
Here is a video: https://drive.google.com/open?id=1w0Rhj7Mc8U6Ld8mf8SPEEeDstf0MxgE-

Tested on a different build downloaded from archive.mozilla.org - http://archive.mozilla.org/pub/mobile/nightly/2018/05/2018-05-23-11-03-27-oak-android-api-16/ and after changing device orientation 7-8 times, Fennec crashes and this is the report: https://crash-stats.mozilla.com/report/index/48fed957-dcde-4d39-8558-636b00180524
Flags: needinfo?(sorina.florean)
Whiteboard: [adv-main61+][adv-esr60.1+]
Flags: qe-verify-
Whiteboard: [adv-main61+][adv-esr60.1+] → [adv-main61+][adv-esr60.1+][post-critsmash-triage]
Product: Firefox for Android → GeckoView
Version: Firefox 58 → 58 Branch
Target Milestone: Firefox 62 → mozilla62
Group: core-security-release
You need to log in before you can comment on or make changes to this bug.