Closed
Bug 1476250
Opened 6 years ago
Closed 6 years ago
Crash in mozilla::ObserverList<T>::Broadcast
Categories
(Core :: Hardware Abstraction Layer (HAL), defect)
Tracking
()
RESOLVED
FIXED
mozilla64
Tracking | Status | |
---|---|---|
firefox-esr52 | --- | unaffected |
firefox-esr60 | --- | unaffected |
firefox61 | --- | unaffected |
firefox62 | --- | wontfix |
firefox63 | --- | fixed |
firefox64 | --- | fixed |
People
(Reporter: calixte, Assigned: gsvelto)
References
(Blocks 1 open bug)
Details
(Keywords: crash, regression)
Crash Data
Attachments
(1 file)
46 bytes,
text/x-phabricator-request
|
froydnj
:
review+
pascalc
:
approval-mozilla-beta+
|
Details | Review |
This bug was filed from the Socorro interface and is
report bp-8f0f1550-2cbb-4bf2-9c44-042630180715.
=============================================================
Top 10 frames of crashing thread:
0 libxul.so mozilla::ObserverList<mozilla::dom::MIDIPortList>::Broadcast xpcom/ds/Observer.h:76
1 libxul.so mozilla::GeckoScreenOrientation::OnOrientationChange widget/android/GeckoScreenOrientation.h:47
2 libxul.so wcsrtombs
3 libxul.so mozilla::jni::detail::ProxyNativeCall<mozilla::GeckoScreenOrientation, mozilla::java::GeckoScreenOrientation, true, false, short, short>::operator widget/android/jni/Natives.h:418
4 libxul.so mozilla::detail::RunnableFunction<mozilla::jni::detail::ProxyNativeCall<mozilla::GeckoScreenOrientation, mozilla::java::GeckoScreenOrientation, true, false, short, short> >::Run xpcom/threads/nsThreadUtils.h:552
5 libxul.so nsThread::ProcessNextEvent xpcom/threads/nsThread.cpp:1051
6 libc.so libc.so@0x195ab
7 libxul.so wcsrtombs
8 libxul.so void mozilla::detail::VariantImplementation<unsigned char, 0u, int const, char const*, void mfbt/Variant.h:219
9 libxul.so mozilla::CooperativeThreadPool::CooperativeThread::CooperativeThread xpcom/threads/CooperativeThreadPool.cpp:138
=============================================================
There are 11 crashes (from 11 installations) in 62.0b7 with buildid 20180709172241. In analyzing the backtrace, the regression may have been introduced by patch [1] to fix bug 1469914.
Since :gsvelto is in pto, :froydnj, could you investigate please ?
[1] https://hg.mozilla.org/releases/mozilla-beta/rev?node=3da147dac512
Flags: needinfo?(nfroyd)
Comment 1•6 years ago
|
||
snorp might have some ideas here?
Flags: needinfo?(nfroyd) → needinfo?(snorp)
Comment 2•6 years ago
|
||
This is so low volume in 62 and 61 (though it was very high in 60 release) that I'm marking it fix-optional for 62.
I don't see an obvious problem, but doesn't look like many folks are hitting it.
Flags: needinfo?(snorp)
Assignee | ||
Comment 4•6 years ago
|
||
I'm back so I'm looking into this. There seems to be an use-after-free issue as many reports are segfaulting on an address that looks like '0xe5e5e5e5' which IIRC is our poison value for freed memory. I don't have my Android development device handy but I'll have it tomorrow so I'll see if I can reproduce.
Assignee | ||
Comment 5•6 years ago
|
||
If the stack traces on the bug are to be believed this is the place where we're crashing:
https://hg.mozilla.org/releases/mozilla-beta/annotate/88222b8b526d8a6a0f2e4ca099f425af17f7d10d/xpcom/ds/nsTArray.h#l371
That would indicate that the observer array has been reallocated and the contents moved while we were iterating over it. It's weird because it's exactly the kind of situation that nsTObserverArray should prevent.
Assignee | ||
Comment 6•6 years ago
|
||
I think I know what's going on. This code is rather quirky because of the way observers are unregistered, see bug 1473824 for more information. What I believe is happening is the following. We're calling ObserversManager::Broadcast() is being called for the ScreenConfigurationObserversManager and one of the callback that is invoked removes the last observer. This deletes the observer list in [1]. Once we come back from the callback we try to iterate over the array, but the array is not there anymore so we hit poisoned memory, or memory that has been assigned to another object. The solution here is to fix bug 1473824 and then make sure that the arrays are only deleted during shutdown and not as soon as they're empty.
[1] https://searchfox.org/mozilla-central/rev/196560b95f191b48ff7cba7c2ba9237bba6b5b6a/hal/Hal.cpp#205
Assignee: nobody → gsvelto
Status: NEW → ASSIGNED
Assignee | ||
Comment 7•6 years ago
|
||
Note that the sensor observers have some "magic" precisely to avoid this issue, which the screen configuration observers haven't:
https://searchfox.org/mozilla-central/rev/196560b95f191b48ff7cba7c2ba9237bba6b5b6a/hal/Hal.cpp#445
Assignee | ||
Comment 8•6 years ago
|
||
This patch initializes some HAL components greedily so that we can get rid of
lazy initializers within the code. Observers are still lazily initialized
because they can be instanced within content processes but that doesn't always
happen and we don't want to pay the memory price for structures we don't use.
Shutdown is now happening at a fixed time for all HAL components save
WakeLocks. This ensures that we don't destroy an object while still iterating
over it, something that could happen before.
Finally a workaround for a compiler limitation has been removed.
Updated•6 years ago
|
Attachment #8999139 -
Attachment is obsolete: true
Updated•6 years ago
|
Attachment #8999139 -
Attachment is obsolete: false
Comment 9•6 years ago
|
||
Comment on attachment 8999139 [details]
Bug 1476250 - Simplify HAL initialization and shutdown to reduce the chance of leaks and UAFs
Nathan Froyd [:froydnj] has approved the revision.
Attachment #8999139 -
Flags: review+
Assignee | ||
Comment 10•6 years ago
|
||
As I expected once I added the assertions to ensure we don't register/unregister observers too late things started blowing up:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=b8e1d6b18b35ee4d0a65a1e4f64399c46fdefec4
I'll have to rework the patch.
Assignee | ||
Comment 11•6 years ago
|
||
As I investigated the problem above I found out that the issue is not always what I thought: the GPU process does not initialize XPCOM and thus never gets to initialize the HAL, but uses it. This is one of those typical cases where we used lazy initialization (and no explicit shutdown) because we didn't know where to stick it. While this is going to take a while more I intend to fix this properly, otherwise we're going to run into this kind of issues more and more.
Assignee | ||
Comment 12•6 years ago
|
||
This bug turned out a lot more complicated than I had anticipated. I'm filing two dependent bugs that will need to be fixed first before I can finish the cleanup here. Unfortunately every nsIAppShell implementation is handling the HAL in a slightly different way and this makes both startup and shutdown tricky.
Assignee | ||
Comment 13•6 years ago
|
||
I've got all startup and shutdown issues ironed out on Windows and Linux but I'm still encountering problems with Mac marionette tests and Android mochitests.
Assignee | ||
Comment 14•6 years ago
|
||
Nathan can you have a look at the updated patch? It contains assertions to ensure that we can't register observers before startup and after shutdown, that we don't init or shutdown twice, etc... Additionally the initialization and shutdown calls have been moved to the platform-specific nsIAppShell implementations. I had to do it this way because every platform use the HAL in a slightly different way and thus needed slightly different startup and shutdown points.
I'll also do some further cleanup because on some platforms weird stuff is happening: for example the Android nsIAppShell implementation is keeping (and using) an XPCOM object alive well past XPCOM shutdown, ugh.
Flags: needinfo?(nfroyd)
Comment 15•6 years ago
|
||
(In reply to Gabriele Svelto [:gsvelto] from comment #14)
> Nathan can you have a look at the updated patch? It contains assertions to
> ensure that we can't register observers before startup and after shutdown,
> that we don't init or shutdown twice, etc... Additionally the initialization
> and shutdown calls have been moved to the platform-specific nsIAppShell
> implementations. I had to do it this way because every platform use the HAL
> in a slightly different way and thus needed slightly different startup and
> shutdown points.
Where is the updated patch? The only patches I see are the r+'d one and the one or try that turned a bunch of tests orange.
Flags: needinfo?(nfroyd)
Assignee | ||
Comment 16•6 years ago
|
||
The one on phabricator, it still shows up as "Accepted" but it's been updated, if you click on the Lando link you'll see that it says that the current version hasn't been accepted yet, it's kind of confusing. This is the new diff with your review comments addressed and the new initialization/shutdown couples in the widget code:
https://phabricator.services.mozilla.com/D3100?id=12236
Comment 17•6 years ago
|
||
Sigh, phabricator.
OK, I think all of that looks reasonable, and it might even be OK to have things in the app shell anyway--processes not running an app shell for some reason don't need the HAL overhead.
Assignee | ||
Comment 18•6 years ago
|
||
Thanks! I'll land this right away and hopefully this (and another crasher) will go away.
Comment 19•6 years ago
|
||
Pushed by gsvelto@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/1616abd4c12f
Simplify HAL initialization and shutdown to reduce the chance of leaks and UAFs r=froydnj
Comment 20•6 years ago
|
||
bugherder |
Status: ASSIGNED → RESOLVED
Closed: 6 years ago
status-firefox64:
--- → fixed
Resolution: --- → FIXED
Target Milestone: --- → mozilla64
Updated•6 years ago
|
Comment 21•6 years ago
|
||
We weren't seeing these crashes on Nightly, so I think a Beta uplift approval is appropriate here to see if it helps.
Flags: needinfo?(gsvelto)
Assignee | ||
Comment 22•6 years ago
|
||
Yeah, but we'll need to pull in bug 1482753 too.
Flags: needinfo?(gsvelto)
Updated•6 years ago
|
Comment 23•6 years ago
|
||
That already landed for 63, no?
Assignee | ||
Comment 24•6 years ago
|
||
Yeah, I had missed that.
Assignee | ||
Comment 25•6 years ago
|
||
Comment on attachment 8999139 [details]
Bug 1476250 - Simplify HAL initialization and shutdown to reduce the chance of leaks and UAFs
Approval Request Comment
[Feature/Bug causing the regression]: The issue was always present in this code but was made more apparent by bug 1469914
[User impact if declined]: Fennec users might run into crashes when changing the orientation of their phones
[Is this code covered by automated tests?]: No
[Has the fix been verified in Nightly?]: Yes
[Needs manual test from QE? If yes, steps to reproduce]: No
[List of other uplifts needed for the feature/fix]: None
[Is the change risky?]: The change is moderately risky
[Why is the change risky/not risky?]: The change affects a significant amount of code (~200 lines) so while the patch was thoroughly tested it still carries some risk to the breadth of the changes it contains.
[String changes made/needed]: None
Attachment #8999139 -
Flags: approval-mozilla-beta?
Comment 26•6 years ago
|
||
Comment on attachment 8999139 [details]
Bug 1476250 - Simplify HAL initialization and shutdown to reduce the chance of leaks and UAFs
Uplift approved for 63 beta 5
Attachment #8999139 -
Flags: approval-mozilla-beta? → approval-mozilla-beta+
Comment 27•6 years ago
|
||
bugherder uplift |
Assignee | ||
Comment 28•6 years ago
|
||
There haven't been crash reports with this signature for the beta release which includes the fix. It's probably to early to be sure it's gone though.
Assignee | ||
Comment 29•6 years ago
|
||
Also note that this signature:
https://crash-stats.mozilla.com/signature/?signature=wcsrtombs%20%7C%20InvalidArrayIndex_CRASH%20%7C%20nsTArray_Impl%3CT%3E%3A%3AElementAt%20%7C%20mozilla%3A%3AObserverList%3CT%3E%3A%3ABroadcast
And this one:
https://crash-stats.mozilla.com/signature/?signature=InvalidArrayIndex_CRASH%20%7C%20nsTArray_Impl%3CT%3E%3A%3AElementAt%20%7C%20mozilla%3A%3AObserverList%3CT%3E%3A%3ARemoveObserver
Are also most likely manifestations of the issue that was fixed here so the aggregated crash rate was probably higher than what we saw in this bug.
You need to log in
before you can comment on or make changes to this bug.
Description
•