Crash in [@ IPCError-browser | ShutDownKill] in mozilla::mscom::Interceptor::~Interceptor()

VERIFIED FIXED in Firefox -esr60

Status

()

defect
--
critical
VERIFIED FIXED
Last year
11 months ago

People

(Reporter: philipp, Assigned: Jamie)

Tracking

({crash, regression})

61 Branch
mozilla63
All
Windows
Points:
---
Dependency tree / graph

Firefox Tracking Flags

(firefox-esr52 unaffected, firefox-esr6062+ verified, firefox61+ verified, firefox62+ verified, firefox63+ verified)

Details

(crash signature)

Attachments

(1 attachment)

This bug was filed from the Socorro interface and is
report bp-46a52cd3-77c9-4787-9791-3e4750180629.
=============================================================

Top 10 frames of crashing thread:

0 ntdll.dll ZwWaitForKeyedEvent 
1 ntdll.dll RtlAcquireSRWLockExclusive 
2 xul.dll mozilla::mscom::Interceptor::~Interceptor ipc/mscom/Interceptor.cpp:289
3 xul.dll mozilla::mscom::Interceptor::`scalar deleting destructor' 
4 xul.dll mozilla::mscom::WeakReferenceSupport::Release ipc/mscom/WeakRef.cpp:175
5 xul.dll mozilla::mscom::Interceptor::Create ipc/mscom/Interceptor.cpp:265
6 xul.dll mozilla::mscom::CreateInterceptor<IAccessible> ipc/mscom/Interceptor.h:186
7 xul.dll mozilla::mscom::MainThreadHandoff::WrapInterface<IAccessible> ipc/mscom/MainThreadHandoff.h:50
8 xul.dll mozilla::a11y::CreateHolderFromAccessible accessible/ipc/win/COMPtrTypes.cpp:52
9 xul.dll mozilla::a11y::DocAccessible::DoInitialUpdate accessible/generic/DocAccessible.cpp:1476

=============================================================

the changes from bug 1364624 seem to have caused an increase in the [@ IPCError-browser | ShutDownKill] content shutdown signature with stacks similar like the one above. on the beta channel this started in 61.0b6 when the patch got uplifted, on 61 release it's currently accounting for 2.5% of content crash reports.

since the [@ IPCError-browser | ShutDownKill] signature is a bit hard to track, here are queries for this particular issue:
*crash reports: https://crash-stats.mozilla.com/signature/?product=Firefox&proto_signature=~mozilla%3A%3Amscom%3A%3AInterceptor%3A%3A~Interceptor&signature=IPCError-browser%20%7C%20ShutDownKill&date=%3E%3D2018-05-01#reports
*graph: https://crash-stats.mozilla.com/signature/?product=Firefox&proto_signature=~mozilla%3A%3Amscom%3A%3AInterceptor%3A%3A~Interceptor&signature=IPCError-browser%20%7C%20ShutDownKill&date=%3E%3D2018-05-01#graphs
We might need help with these graphs from a sighted peer. Jamie, Aaron, I have not seen this one myself at all. Would be interesting to see which clients are instanciating accessibility here.
Flags: needinfo?(jteh)
Flags: needinfo?(aklotz)
the "Accessibility in proc client" field in these crash reports has the following make-up:
0x400 	690 	83.33 %
0x800 	134 	16.18 %
0xA00 	2 	0.24 %
0x200 	1 	0.12 %
0x801 	1 	0.12 %
Bug 1364624 switched over to using SRWLock on Windows which does not allow re-entrancy.

It looks like we grab a lock in `Interceptor::Create` [1], and continue to hold it when `RefPtr<Interceptor> intcp` [2] is destructed when the function exits. `Interceptor::~Interceptor()` then tries to acquire the lock [3] leading to a deadlock.

[1] https://searchfox.org/mozilla-central/rev/d2966246905102b36ef5221b0e3cbccf7ea15a86/ipc/mscom/Interceptor.cpp#245
[2] https://searchfox.org/mozilla-central/rev/d2966246905102b36ef5221b0e3cbccf7ea15a86/ipc/mscom/Interceptor.cpp#264
[3] https://searchfox.org/mozilla-central/rev/d2966246905102b36ef5221b0e3cbccf7ea15a86/ipc/mscom/Interceptor.cpp#289
I don't suppose it'd be overly optimistic of me to think that this is possibly related to bug 1471824?
(In reply to Eric Rahm [:erahm] (please no mozreview requests) from comment #3)
> Bug 1364624 switched over to using SRWLock on Windows which does not allow
> re-entrancy.
> 
> [2] is destructed when the function exits.
> 

That isn't actually the case. GetInitialInterceptorForIID has QueryInterface semantics, so the outparam is a new reference to that interceptor.
Though I suppose that is correct in the event of failure, which might be why the count is relatively low...
(In reply to Aaron Klotz [:aklotz] from comment #6)
> Though I suppose that is correct in the event of failure, which might be why
> the count is relatively low...

Eek, yeah, that makes sense.

(In reply to Ryan VanderMeulen [:RyanVM] from comment #4)
> I don't suppose it'd be overly optimistic of me to think that this is
> possibly related to bug 1471824?

Bug 1471824 comment 8 suggests that the affected users hit the diagnostic crashes removed in bug 1455745 when they run 60 nightly builds. So, it's reasonable to think they might hit failures when creating interceptors, thus exposing this deadlock. Thus, I'd say there's a pretty good chance these bugs are related.
Flags: needinfo?(jteh)
Assignee: nobody → jteh
Component: Disability Access APIs → IPC: MSCOM
One thing worth noting with this patch is that there are theoretically cases where PublishTarget succeeds (and releases the lock), but a subsequent call (e.g. QueryInterface or GetInterceptorForIID) fails. In that case, the ExecuteWhen cleanup code will call Unlock again, thus triggering MOZ_ASSERT(mLiveSet) in LiveSetAutoLock::Unlock. I figure ENSURE_HR_SUCCEEDED would already have asserted in that case, so we're already "broken". However, if this is a concern, I guess I could add a flag which prevents the cleanup if PublishTarget succeeded.
Oh. Also, I tested this locally and everything seems to work as expected, but I can't really confirm whether it fixes the issue, since I can't reproduce the issue in the first place.
(In reply to James Teh [:Jamie] from comment #10)
> Oh. Also, I tested this locally and everything seems to work as expected,
> but I can't really confirm whether it fixes the issue, since I can't
> reproduce the issue in the first place.

Hi James, could you help provide a build version? If it related to bug 1471824 I think I can find the user to help test.
I guess a simpler fix might be to set a flag on the instance telling the destructor that the target was never published and thus not to acquire the lock, etc., but that feels like tackling the problem from the wrong point, since we really should have released the lock in the first place.
(In reply to yxu from comment #11)
> Hi James, could you help provide a build version?

I've just triggered a try build:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=46064a000fba9488d1a7edbaad03564b3a6ba0a3&selectedJob=185895727
It should be done in an hour or so. Note that users will need to install the build (target.exe), not just extract it. Otherwise, an accessibility component won't get registered and it won't be an equivalent test. Just to be sure, they should also verify that "Accessibility handler used" reports "true" in about:support. Thanks very much for your help (and to the relevant users for their help) in tracking this down.
Once this lands, we'll want QA to spend some time looking for regressions also.
Flags: qe-verify+
I used the repackage-signing-win64/opt target.installer.exe provided by James in Comment 13 to install the user who had the problem before. Comfirmed that "Accessibility handler used" reports "true" in about:support after installation. They didn't found problem during use.
That's great news, thanks for your help with this, yxu! Once this gets r+, we can proceed with getting this uplifted for the 61.0.1 dot release and start the builds.
Comment on attachment 8989103 [details]
Bug 1472137: Prevent mutex reentry in mscom::Interceptor::Create if GetInitialInterceptorForIID fails.

https://reviewboard.mozilla.org/r/254166/#review261350

::: ipc/mscom/Utils.h:74
(Diff revision 1)
> + * whenever a call returns a failure HRESULT.
> + * Both the condition and cleanup code are provided as functions (usually
> + * lambdas).
> + */
> +template <typename CondFnT, typename ExeFnT>
> +class MOZ_RAII ExecuteWhen final

We need to include mozilla/Attributes.h
Attachment #8989103 - Flags: review?(aklotz) → review+
Pushed by ryanvm@gmail.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/3100524d3bcc
Prevent mutex reentry in mscom::Interceptor::Create if GetInitialInterceptorForIID fails. r=aklotz
In the interests of keeping this bug moving for the dot release, I applied Aaron's review comment and pushed this to inbound. Jamie, can you please nominate this for Beta and Release approval at your convenience? Thanks!
Flags: needinfo?(aklotz) → needinfo?(jteh)
https://hg.mozilla.org/mozilla-central/rev/3100524d3bcc
Status: NEW → RESOLVED
Closed: Last year
Resolution: --- → FIXED
Target Milestone: --- → mozilla63
Comment on attachment 8989103 [details]
Bug 1472137: Prevent mutex reentry in mscom::Interceptor::Create if GetInitialInterceptorForIID fails.

Approval Request Comment
[Feature/Bug causing the regression]: Bug 1364624.
[User impact if declined]: Some users using software which enables Firefox accessibility (even if it isn't traditional "accessibility" software) will be unable to load pages and/or will experience crashes/hangs.
[Is this code covered by automated tests?]: No, because we have no mechanism for testing platform accessibility.
[Has the fix been verified in Nightly?]: No, but someone experiencing the issue did verify it in a try build.
[Needs manual test from QE? If yes, steps to reproduce]: Ideally needs to be tested by Chinese users experiencing this in bug 1471824. See bug 1471824 comment 0.
[List of other uplifts needed for the feature/fix]: None.
[Is the change risky?]: No.
[Why is the change risky/not risky?]: "Correctness" patch which ensures a lock is released under certain conditions such that it isn't reentered.
[String changes made/needed]: None.
Flags: needinfo?(jteh)
Attachment #8989103 - Flags: approval-mozilla-beta?
Comment on attachment 8989103 [details]
Bug 1472137: Prevent mutex reentry in mscom::Interceptor::Create if GetInitialInterceptorForIID fails.

Approval Request Comment
[Feature/Bug causing the regression]: Bug 1364624.
[User impact if declined]: Some users using software which enables Firefox accessibility (even if it isn't traditional "accessibility" software) will be unable to load pages and/or will experience crashes/hangs.
[Is this code covered by automated tests?]: No, because we have no mechanism for testing platform accessibility.
[Has the fix been verified in Nightly?]: No, but someone experiencing the issue did verify it in a try build.
[Needs manual test from QE? If yes, steps to reproduce]: Ideally needs to be tested by Chinese users experiencing this in bug 1471824. See bug 1471824 comment 0.
[List of other uplifts needed for the feature/fix]: None.
[Is the change risky?]: No.
[Why is the change risky/not risky?]: "Correctness" patch which ensures a lock is released under certain conditions such that it isn't reentered.
[String changes made/needed]: None.
Attachment #8989103 - Flags: approval-mozilla-release?
Comment on attachment 8989103 [details]
Bug 1472137: Prevent mutex reentry in mscom::Interceptor::Create if GetInitialInterceptorForIID fails.

Approved for 62.0b6 and 60.0.1. Thanks everyone!
Attachment #8989103 - Flags: approval-mozilla-release?
Attachment #8989103 - Flags: approval-mozilla-release+
Attachment #8989103 - Flags: approval-mozilla-esr60?
Attachment #8989103 - Flags: approval-mozilla-beta?
Attachment #8989103 - Flags: approval-mozilla-beta+
We have tried to reproduce the issue without success.
Tests were performed under Windows 10x64 and Windows 7x64, updating from Firefox 59.0.2, Firefox 60.0 and Firefox 60.0.2 and STR from bug 1471824 comment 0. 
Note that both x86 and x64 builds were used for this testing.
Also, we used the en-US, zh-CN and zh-TW localizations. 
Since the bug was verified in Bug 1472137 Comment 15 and by Bug 1471824 comment 12, I'm removing the qe-verify+ flag.

We managed to performed a smoke testing focusing on a11y tools and screen readers under Windows 10x64 and Windows 7x64, using Firefox 61. 
Note that for this round of testing two localizations were used, en-US and zh-CN and no issues were encountered.
Status: RESOLVED → VERIFIED
Flags: qe-verify+
I managed to verify this issue also on Firefox 62.0b5 and Firefox 63.0a1(2018-07-05) under Windows 7x64 and Windows 10x64.
Tests were performed using both x86 and x64 build versions. Also the tests were performed on en-US and zh-CN localizations using both .zip and .exe installation files.
This round of testing was focused on a11y tools and screen readers.
Since no issues were encountered, I'm marking the Tracking Flags accordingly.
See Also: → 1474007
Comment on attachment 8989103 [details]
Bug 1472137: Prevent mutex reentry in mscom::Interceptor::Create if GetInitialInterceptorForIID fails.

Approved for ESR 60.2 also.
Attachment #8989103 - Flags: approval-mozilla-esr60? → approval-mozilla-esr60+
I've also tested this issue on Firefox 60.2.0ESR, under Windows 10x86 and Windows 10x64 without being able to reproduce the issue. 
The updates were performed from Firefox 60.0.2.
You need to log in before you can comment on or make changes to this bug.