Closed Bug 1476250 Opened Last year Closed Last year

Crash in mozilla::ObserverList<T>::Broadcast

Categories

(Core :: Hardware Abstraction Layer (HAL), defect, critical)

62 Branch
Unspecified
Android
defect
Not set
critical

Tracking

()

RESOLVED FIXED
mozilla64
Tracking Status
firefox-esr52 --- unaffected
firefox-esr60 --- unaffected
firefox61 --- unaffected
firefox62 --- wontfix
firefox63 --- fixed
firefox64 --- fixed

People

(Reporter: calixte, Assigned: gsvelto)

References

(Blocks 1 open bug)

Details

(Keywords: crash, regression)

Crash Data

Attachments

(1 file)

This bug was filed from the Socorro interface and is
report bp-8f0f1550-2cbb-4bf2-9c44-042630180715.
=============================================================

Top 10 frames of crashing thread:

0 libxul.so mozilla::ObserverList<mozilla::dom::MIDIPortList>::Broadcast xpcom/ds/Observer.h:76
1 libxul.so mozilla::GeckoScreenOrientation::OnOrientationChange widget/android/GeckoScreenOrientation.h:47
2 libxul.so wcsrtombs 
3 libxul.so mozilla::jni::detail::ProxyNativeCall<mozilla::GeckoScreenOrientation, mozilla::java::GeckoScreenOrientation, true, false, short, short>::operator widget/android/jni/Natives.h:418
4 libxul.so mozilla::detail::RunnableFunction<mozilla::jni::detail::ProxyNativeCall<mozilla::GeckoScreenOrientation, mozilla::java::GeckoScreenOrientation, true, false, short, short> >::Run xpcom/threads/nsThreadUtils.h:552
5 libxul.so nsThread::ProcessNextEvent xpcom/threads/nsThread.cpp:1051
6 libc.so libc.so@0x195ab 
7 libxul.so wcsrtombs 
8 libxul.so void mozilla::detail::VariantImplementation<unsigned char, 0u, int const, char const*, void  mfbt/Variant.h:219
9 libxul.so mozilla::CooperativeThreadPool::CooperativeThread::CooperativeThread xpcom/threads/CooperativeThreadPool.cpp:138

=============================================================

There are 11 crashes (from 11 installations) in 62.0b7 with buildid 20180709172241. In analyzing the backtrace, the regression may have been introduced by patch [1] to fix bug 1469914.
Since :gsvelto is in pto, :froydnj, could you investigate please ?

[1] https://hg.mozilla.org/releases/mozilla-beta/rev?node=3da147dac512
Flags: needinfo?(nfroyd)
snorp might have some ideas here?
Flags: needinfo?(nfroyd) → needinfo?(snorp)
This is so low volume in 62 and 61 (though it was very high in 60 release) that I'm marking it fix-optional for 62.
I don't see an obvious problem, but doesn't look like many folks are hitting it.
Flags: needinfo?(snorp)
I'm back so I'm looking into this. There seems to be an use-after-free issue as many reports are segfaulting on an address that looks like '0xe5e5e5e5' which IIRC is our poison value for freed memory. I don't have my Android development device handy but I'll have it tomorrow so I'll see if I can reproduce.
If the stack traces on the bug are to be believed this is the place where we're crashing:

https://hg.mozilla.org/releases/mozilla-beta/annotate/88222b8b526d8a6a0f2e4ca099f425af17f7d10d/xpcom/ds/nsTArray.h#l371

That would indicate that the observer array has been reallocated and the contents moved while we were iterating over it. It's weird because it's exactly the kind of situation that nsTObserverArray should prevent.
I think I know what's going on. This code is rather quirky because of the way observers are unregistered, see bug 1473824 for more information. What I believe is happening is the following. We're calling ObserversManager::Broadcast() is being called for the ScreenConfigurationObserversManager and one of the callback that is invoked removes the last observer. This deletes the observer list in [1]. Once we come back from the callback we try to iterate over the array, but the array is not there anymore so we hit poisoned memory, or memory that has been assigned to another object. The solution here is to fix bug 1473824 and then make sure that the arrays are only deleted during shutdown and not as soon as they're empty.

[1] https://searchfox.org/mozilla-central/rev/196560b95f191b48ff7cba7c2ba9237bba6b5b6a/hal/Hal.cpp#205
Assignee: nobody → gsvelto
Status: NEW → ASSIGNED
Note that the sensor observers have some "magic" precisely to avoid this issue, which the screen configuration observers haven't:

https://searchfox.org/mozilla-central/rev/196560b95f191b48ff7cba7c2ba9237bba6b5b6a/hal/Hal.cpp#445
This patch initializes some HAL components greedily so that we can get rid of
lazy initializers within the code. Observers are still lazily initialized
because they can be instanced within content processes but that doesn't always
happen and we don't want to pay the memory price for structures we don't use.

Shutdown is now happening at a fixed time for all HAL components save
WakeLocks. This ensures that we don't destroy an object while still iterating
over it, something that could happen before.

Finally a workaround for a compiler limitation has been removed.
Depends on: 1482753
Attachment #8999139 - Attachment is obsolete: true
Attachment #8999139 - Attachment is obsolete: false
Comment on attachment 8999139 [details]
Bug 1476250 - Simplify HAL initialization and shutdown to reduce the chance of leaks and UAFs

Nathan Froyd [:froydnj] has approved the revision.
Attachment #8999139 - Flags: review+
As I expected once I added the assertions to ensure we don't register/unregister observers too late things started blowing up:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=b8e1d6b18b35ee4d0a65a1e4f64399c46fdefec4

I'll have to rework the patch.
As I investigated the problem above I found out that the issue is not always what I thought: the GPU process does not initialize XPCOM and thus never gets to initialize the HAL, but uses it. This is one of those typical cases where we used lazy initialization (and no explicit shutdown) because we didn't know where to stick it. While this is going to take a while more I intend to fix this properly, otherwise we're going to run into this kind of issues more and more.
This bug turned out a lot more complicated than I had anticipated. I'm filing two dependent bugs that will need to be fixed first before I can finish the cleanup here. Unfortunately every nsIAppShell implementation is handling the HAL in a slightly different way and this makes both startup and shutdown tricky.
Depends on: 1486772
I've got all startup and shutdown issues ironed out on Windows and Linux but I'm still encountering problems with Mac marionette tests and Android mochitests.
Nathan can you have a look at the updated patch? It contains assertions to ensure that we can't register observers before startup and after shutdown, that we don't init or shutdown twice, etc... Additionally the initialization and shutdown calls have been moved to the platform-specific nsIAppShell implementations. I had to do it this way because every platform use the HAL in a slightly different way and thus needed slightly different startup and shutdown points.

I'll also do some further cleanup because on some platforms weird stuff is happening: for example the Android nsIAppShell implementation is keeping (and using) an XPCOM object alive well past XPCOM shutdown, ugh.
Flags: needinfo?(nfroyd)
(In reply to Gabriele Svelto [:gsvelto] from comment #14)
> Nathan can you have a look at the updated patch? It contains assertions to
> ensure that we can't register observers before startup and after shutdown,
> that we don't init or shutdown twice, etc... Additionally the initialization
> and shutdown calls have been moved to the platform-specific nsIAppShell
> implementations. I had to do it this way because every platform use the HAL
> in a slightly different way and thus needed slightly different startup and
> shutdown points.

Where is the updated patch?  The only patches I see are the r+'d one and the one or try that turned a bunch of tests orange.
Flags: needinfo?(nfroyd)
The one on phabricator, it still shows up as "Accepted" but it's been updated, if you click on the Lando link you'll see that it says that the current version hasn't been accepted yet, it's kind of confusing. This is the new diff with your review comments addressed and the new initialization/shutdown couples in the widget code:

https://phabricator.services.mozilla.com/D3100?id=12236
Sigh, phabricator.

OK, I think all of that looks reasonable, and it might even be OK to have things in the app shell anyway--processes not running an app shell for some reason don't need the HAL overhead.
Thanks! I'll land this right away and hopefully this (and another crasher) will go away.
Pushed by gsvelto@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/1616abd4c12f
Simplify HAL initialization and shutdown to reduce the chance of leaks and UAFs r=froydnj
https://hg.mozilla.org/mozilla-central/rev/1616abd4c12f
Status: ASSIGNED → RESOLVED
Closed: Last year
Resolution: --- → FIXED
Target Milestone: --- → mozilla64
We weren't seeing these crashes on Nightly, so I think a Beta uplift approval is appropriate here to see if it helps.
Flags: needinfo?(gsvelto)
Yeah, but we'll need to pull in bug 1482753 too.
Flags: needinfo?(gsvelto)
That already landed for 63, no?
Yeah, I had missed that.
Comment on attachment 8999139 [details]
Bug 1476250 - Simplify HAL initialization and shutdown to reduce the chance of leaks and UAFs

Approval Request Comment
[Feature/Bug causing the regression]: The issue was always present in this code but was made more apparent by bug 1469914
[User impact if declined]: Fennec users might run into crashes when changing the orientation of their phones
[Is this code covered by automated tests?]: No
[Has the fix been verified in Nightly?]: Yes
[Needs manual test from QE? If yes, steps to reproduce]: No
[List of other uplifts needed for the feature/fix]: None
[Is the change risky?]: The change is moderately risky
[Why is the change risky/not risky?]: The change affects a significant amount of code (~200 lines) so while the patch was thoroughly tested it still carries some risk to the breadth of the changes it contains.
[String changes made/needed]: None
Attachment #8999139 - Flags: approval-mozilla-beta?
Comment on attachment 8999139 [details]
Bug 1476250 - Simplify HAL initialization and shutdown to reduce the chance of leaks and UAFs

Uplift approved for 63 beta 5
Attachment #8999139 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
There haven't been crash reports with this signature for the beta release which includes the fix. It's probably to early to be sure it's gone though.
You need to log in before you can comment on or make changes to this bug.