Closed Bug 1424505 Opened 6 years ago Closed 6 years ago

Crash in I_RpcNegotiateTransferSyntax

Categories

(Core :: IPC: MSCOM, defect, P1)

x86_64
Windows 7
defect

Tracking

()

VERIFIED FIXED
mozilla61
Tracking Status
relnote-firefox --- 59+
firefox-esr52 --- unaffected
firefox-esr60 --- verified
firefox57 --- wontfix
firefox58 --- wontfix
firefox59 blocking verified
firefox60 blocking verified
firefox61 blocking verified

People

(Reporter: jseward, Assigned: Jamie)

References

Details

(Keywords: crash, regression, Whiteboard: [necko-triaged])

Crash Data

Attachments

(10 files)

This bug was filed from the Socorro interface and is
report bp-3dba594f-d3ce-47b4-a785-6a92f0171209.
=============================================================

This is topcrash #16 in the windows nightly 20171207220423, with
14 crashes from 8 different installations.  Unfortunately, there
are no frames in Gecko code AFAICS, so I can't guess what part of
the system this has to do with.  dmajor, do you know?

It's thread #72 in the process, but doesn't have a name.  #72 is nestled in
between a mixture of "DNS Resolver #<n>", "StreamTrans #<n>", and "SSL Cert
<n>".  Maybe it is related to one of those?

Top 10 frames of crashing thread:

0 rpcrt4.dll I_RpcNegotiateTransferSyntax 
1 rpcrt4.dll LRPC_BIND_CCALL::BaseBind 
2 rpcrt4.dll LRPC_BIND_CCALL::Bind 
3 rpcrt4.dll LRPC_BASE_BINDING_HANDLE::DriveStateForwardAsyncCallback 
4 rpcrt4.dll LrpcBHDriveStateForwardRoutine 
5 rpcrt4.dll CheckLrpcBinding 
6 rpcrt4.dll LRPC_BASE_RETRY_THUNK::CompleteOnProvider 
7 rpcrt4.dll LRPC_CLIENT_IO::ThreadPoolInvokeCallback 
8 ntdll.dll TppSimplepExecuteCallback 
9 ntdll.dll TpPostTask 

=============================================================
Flags: needinfo?(dmajor)
> Unfortunately, there
> are no frames in Gecko code AFAICS, so I can't guess what part of
> the system this has to do with.  dmajor, do you know?

It looks like a dedicated Windows RPC thread (but I have no idea what it's doing).

These are 100% accessibility active... maybe Aaron can find some clues.
Flags: needinfo?(dmajor) → needinfo?(aklotz)
Priority: -- → P2
Whiteboard: [necko-triaged]
this issue seems to be regressing in volume in firefox 59. all the crashes are from 64bit builds on windows 7.

Accessibility client facet
1 	osk.exe|6.1.7601.18512 	99 	38.52 %
2 	osk.exe|6.1.7600.16385 	30 	11.67 %
3 	nvda.exe|2017.3.0.14335 	3 	1.17 %
4 	osk.exe|6.1.7601.23403 	3 	1.17 %
Keywords: regression
Hardware: Unspecified → x86_64
Tracking to keep an eye on this for beta 59.
Too late to fix in 58 though.
#3 overall browser crash in early Beta 3 returns.
Component: Networking → IPC: MSCOM
Flags: needinfo?(aklotz)
December 26. A Tuesday. My guess is a Patch Tuesday push caused this spike.
90% Windows 7 SP1, 10% Windows 7 RTM.

I didn't land anything during the break. I don't think that Jamie did either (but ni? to confirm).

I can probably open a discussion on our MSFT list.
Flags: needinfo?(jteh)
the graph of this signature is just starting on december 26, because we purged all crash data before that date in bug 1427111 (that's the same with all crash signature), so i don't think it's due to any patch day. the signature starts increasing in volume during the 59 cycle on nightly & now that it has gone into beta there as well. this indicates that something on our part made the crash more common.
Flags: needinfo?(aklotz)
Ah crap, I forgot about that.
Flags: needinfo?(jteh)
Flags: needinfo?(aklotz)
Aaron, did you have any luck contacting MSFT? This is over 10% of our browser crashes on Beta at the moment.
Flags: needinfo?(aklotz)
(In reply to Ryan VanderMeulen [:RyanVM] from comment #9)
> Aaron, did you have any luck contacting MSFT? This is over 10% of our
> browser crashes on Beta at the moment.

Awaiting response.
Flags: needinfo?(aklotz)
Note: Even though this crash shows up in Socorro as rpcrt4!I_RpcNegotiateTransferSyntax, when you examine the actual dump files, the stack is as follows:

rpcrt4!LRPC_CASSOCIATION::Bind+0x3d2
rpcrt4!LRPC_BIND_CCALL::BaseBind+0x85
rpcrt4!LRPC_BIND_CCALL::Bind+0x1a
rpcrt4!LRPC_BASE_BINDING_HANDLE::DriveStateForwardAsyncCallback+0x4a
rpcrt4!LrpcBHDriveStateForwardRoutine+0x5c
rpcrt4!CLIENT_IO_PROVIDER::Complete+0x1ce
rpcrt4!LRPC_BASE_RETRY_THUNK::CompleteOnProvider+0x22
rpcrt4!LRPC_CLIENT_IO::ThreadPoolInvokeCallback+0x16
ntdll!TppSimplepExecuteCallback+0x91
ntdll!TppWorkerThread+0x893

Communication with MSFT is obviously using the "real" stack.
This is very high volume, enough that we should block the release on it. 
It seems to affect users from Italy (based on the swearing in the crash-stats comments)

Is there anything we can do to mitigate the problem until we hear from MSFT?
Flags: needinfo?(aklotz)
The vast majority of crashes are triggered by osk.exe. Based on some local testing, that program still works even when a11y is force-disabled, so I'd suggest trying it out on the UIA blocklist.
Flags: needinfo?(aklotz)
Attachment #8952478 - Flags: review?(jteh)
Comment on attachment 8952478 [details] [diff] [review]
Potential work around: block osk.exe from UIA instantiation

Review of attachment 8952478 [details] [diff] [review]:
-----------------------------------------------------------------

::: accessible/windows/msaa/CompatibilityUIA.cpp
@@ +133,5 @@
>    return false;
>  }
>  
> +static const char* gBlockedUiaClients[] = {
> +  "osk.exe"

You noted that the on-screen keyboard seems to work even when blocked. Just to clarify, did you check this on both Windows 7 and Windows 10? I'm just concerned they might be different; a lot of Microsoft a11y stuff changed significantly between 7 and 10.

::: accessible/windows/msaa/LazyInstantiator.cpp
@@ +232,5 @@
>    nsCOMPtr<nsIFile> clientExe;
>    if (!a11y::GetInstantiator(getter_AddRefs(clientExe))) {
>      return true;
>    }
> +

nit: I'm all for white space fixes in the surrounding area :), but this is the only change in this file, which makes the patch look like it touches this file when it doesn't really.
Attachment #8952478 - Flags: review?(jteh) → review+
Yep, works on Win10.
Can you request uplift ? Thanks
Flags: needinfo?(aklotz)
Comment on attachment 8952478 [details] [diff] [review]
Potential work around: block osk.exe from UIA instantiation

Approval Request Comment
[Feature/Bug causing the regression]: a11y+e10s on Windows
[User impact if declined]: Crashes
[Is this code covered by automated tests?]: No
[Has the fix been verified in Nightly?]: It's in Nightly, but it needs to spend some time "in the wild" for us to know whether or not it is effective.
[Needs manual test from QE? If yes, steps to reproduce]: No
[List of other uplifts needed for the feature/fix]: None
[Is the change risky?]: Low
[Why is the change risky/not risky?]: Simple patch that activates existing mechanisms.
[String changes made/needed]: None
Flags: needinfo?(aklotz)
Attachment #8952478 - Flags: approval-mozilla-beta?
Comment on attachment 8952478 [details] [diff] [review]
Potential work around: block osk.exe from UIA instantiation

Approved and uplifted for 59b12. Thanks!

https://hg.mozilla.org/releases/mozilla-beta/rev/abc5f0d35082
Attachment #8952478 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
Noting that since we didn't see a lot of crashes here in nightly, we wanted to rush this fix directly to beta. That way we can see clear result over the weekend.
it doesn't look like the workaround was effective - the crash signature is still around 10% of browser crashes in early data for 59.0b12 and osk.exe is recorded as a11y clients in most of these reports.
Flags: needinfo?(aklotz)
Assignee: nobody → jmathies
Priority: P2 → P1
(In reply to Jim Mathies [:jimm] from comment #23)
> - UIAUTOMATION api use
> - common clients - osk.exe|6.1.7600.16385, osk.exe|6.1.7601.18512

Do these stats only include builds with the osk.exe UIA blocking (59b12 and whatever Nightly)? If so, does this mean that patch isn't working for some reason?

I wonder if UIA just repeatedly tries to go in-proc for some reason. From what we've seen, it resorts to out-proc if it can't go in-proc. However, if it *did* keep trying for some reason, it would exceed kMaxUiaAttemps (currently 5) in UiaHookProc and then it wouldn't get blocked after that. (The reason for kMaxUiaAttempts is performance, so we can't just get rid of it.)
That's a good point. Perhaps we should only charge an attempt against kMaxUiaAttempts if the result was None as opposed to Some(true) or Some(false).
Flags: needinfo?(aklotz)
(In reply to Aaron Klotz [:aklotz] from comment #25)
> That's a good point. Perhaps we should only charge an attempt against
> kMaxUiaAttempts if the result was None as opposed to Some(true) or
> Some(false).

The only problem is that if UIA just keeps re-trying and getting blocked every time, each re-try will hurt performance. I guess it depends how many times this thing re-tries, and unfortunately, we don't have an answer to that question.
(In reply to James Teh [:Jamie] from comment #24)
> (In reply to Jim Mathies [:jimm] from comment #23)
> > - UIAUTOMATION api use
> > - common clients - osk.exe|6.1.7600.16385, osk.exe|6.1.7601.18512
> 
> Do these stats only include builds with the osk.exe UIA blocking (59b12 and
> whatever Nightly)? If so, does this mean that patch isn't working for some
> reason?

My stats were across all builds and channels. Here's just 59b12, which still has the crash in it for both osk versions -

https://crash-stats.mozilla.com/search/?signature=%3DI_RpcNegotiateTransferSyntax&version=59.0b12&product=Firefox&date=%3E%3D2018-02-01T10%3A50%3A00.000Z&date=%3C2018-02-26T10%3A50%3A30.000Z&_sort=-date&_facets=signature&_facets=version&_facets=accessibility&_facets=accessibility_client&_facets=accessibility_in_proc_client&_facets=cpu_count&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform#facet-accessibility_client


So basically the patch that landed here doesn't appear to be blocking this client.
STR:

1) Install the free desktop version of Sticky Password
2) Launch Sticky Password
3) launch a instance of Firefox
4) In the Sticky Password main app window, select menu on the upper right, then select settings
5) In the web accounts, Find Firefox and click Install
6) After the Sticky Addon is installed and running, go to a search engine and submit a search

result: this crash

https://crash-stats.mozilla.com/report/index/97eb55b8-f09a-4e5e-9f7d-7f4770180228
https://crash-stats.mozilla.com/report/index/71b1fb38-ce3b-46e9-aca9-15d460180228
https://crash-stats.mozilla.com/report/index/cbcb9fdf-cc6d-421b-8e78-7221a0180228
(In reply to Jim Mathies [:jimm] from comment #28)
> STR:
> 
> 1) Install the free desktop version of Sticky Password
> 2) Launch Sticky Password
> 3) launch a instance of Firefox
> 4) In the Sticky Password main app window, select menu on the upper right,
> then select settings
> 5) In the web accounts, Find Firefox and click Install
> 6) After the Sticky Addon is installed and running, go to a search engine
> and submit a search
> 
> result: this crash
> 
> https://crash-stats.mozilla.com/report/index/97eb55b8-f09a-4e5e-9f7d-
> 7f4770180228
> https://crash-stats.mozilla.com/report/index/71b1fb38-ce3b-46e9-aca9-
> 15d460180228
> https://crash-stats.mozilla.com/report/index/cbcb9fdf-cc6d-421b-8e78-
> 7221a0180228

#5 - correction, choose the 'web browsers' menu item, not 'web accounts'.
(In reply to Aaron Klotz [:aklotz] from comment #11)
> Note: Even though this crash shows up in Socorro as
> rpcrt4!I_RpcNegotiateTransferSyntax, when you examine the actual dump files,
> the stack is as follows:
> 
> rpcrt4!LRPC_CASSOCIATION::Bind+0x3d2
> rpcrt4!LRPC_BIND_CCALL::BaseBind+0x85
> rpcrt4!LRPC_BIND_CCALL::Bind+0x1a
> rpcrt4!LRPC_BASE_BINDING_HANDLE::DriveStateForwardAsyncCallback+0x4a
> rpcrt4!LrpcBHDriveStateForwardRoutine+0x5c
> rpcrt4!CLIENT_IO_PROVIDER::Complete+0x1ce
> rpcrt4!LRPC_BASE_RETRY_THUNK::CompleteOnProvider+0x22
> rpcrt4!LRPC_CLIENT_IO::ThreadPoolInvokeCallback+0x16
> ntdll!TppSimplepExecuteCallback+0x91
> ntdll!TppWorkerThread+0x893
> 
> Communication with MSFT is obviously using the "real" stack.

Caught this finally after multiple tries - 

rpcrt4.dll!LRPC_CASSOCIATION::Bind(struct _RPC_CLIENT_INTERFACE *,int,class RPCP_ALPC_MESSAGE_ATTRIBUTES *,class LRPC_BASE_CCALL *,unsigned long,int,unsigned __int64,int,class LRPC_BINDING * *,struct tagLRPC_BIND_DATA *)  Unknown
rpcrt4.dll!LRPC_BIND_CCALL::BaseBind(int,unsigned long,int,unsigned __int64,int)  Unknown
rpcrt4.dll!LRPC_BIND_CCALL::Bind(int,unsigned long,int)  Unknown
rpcrt4.dll!LRPC_BASE_BINDING_HANDLE::DriveStateForwardAsyncCallback(int,unsigned long,class LRPC_BIND_CCALL *,long,int)  Unknown
rpcrt4.dll!LrpcBHDriveStateForwardRoutine(class LRPC_CLIENT_IO *)  Unknown
rpcrt4.dll!CLIENT_IO_PROVIDER::Complete(class MUTEX *,class LRPC_CLIENT_IO *,int,int,long)  Unknown
rpcrt4.dll!LRPC_BASE_RETRY_THUNK::CompleteOnProvider(class LRPC_CLIENT_IO *,int)  Unknown
rpcrt4.dll!LRPC_CLIENT_IO::ThreadPoolInvokeCallback(struct _TP_CALLBACK_INSTANCE *,void *)  Unknown
ntdll.dll!TppSimplepExecuteCallback()  Unknown
ntdll.dll!TppWorkerThread()  Unknown
00000000000f00a3()  Unknown
0000000000000008()  Unknown
0000000100010007()  Unknown
00000000003bdc70()  Unknown
00000000127da460()  Unknown
Not having much luck reproducing in a local release build.
STR updates:

pre-setup:

1) Make sure the pre-allocated process manager is enabled via prefs (dom.ipc.processPrelaunch.enabled)
2) Make sure you have dom.ipc.processCount set to a value > 1
3) Disable session restore, and set your browser up to open a new tab page on launch

Setting a new profile as your default for this may help.

str:

1) Install the free desktop version of Sticky Password
2) Launch the desktop client for Sticky Password
3) In the Sticky Password main app window, select menu on the upper right, then select settings
5) Choose the 'web browsers' menu item and click the Install button for Firefox

This step will either install an extension directly or navigate Firefox to the web site to install the Sticky Passwords web extension. Go ahead and install this.

6) After the Sticky Addon is installed, restart Firefox
7) Launch Firefox (new tab page should be open)
8) Attach a debugger
9) Navigate the open new tab page to any remote web site (triggering child process launch)

*crash*

This works in local builds as well. 

While debugging, I managed to catch this next exception right before the crash we end up reporting to crash reporter. This exception was caught and dealt with internally.

Aaron, any idea what's going on here?

ntdll.dll!RtlPcToFileHeader() Unknown
KernelBase.dll!BasepGetModuleHandleExW() Unknown
KernelBase.dll!GetModuleHandleExW() Unknown
UIAutomationCore.dll!GetProviderInfo(struct IUnknown *) Unknown
UIAutomationCore.dll!Error::GeneralProviderError(struct IUnknown *,unsigned short const *,long,struct HWND__ *,int,int) Unknown
UIAutomationCore.dll!AccNavUtils::HitTestOrFocus(struct IAccessible *,int,struct tagPOINT,struct HWND__ *,struct AccPair *) Unknown
UIAutomationCore.dll!MsaaProxy::GetFocus(struct IRawElementProviderFragment * *) Unknown
UIAutomationCore.dll!UiaNode::ProviderDrillForPointOrFocus(int,int,int,double,double,struct IRawElementProviderSimple * *,class IUiaNode * *,struct UiaCacheRequest *,struct UiaCacheResponse *,int *) Unknown
UIAutomationCore.dll!UiaNode::CrossProcess_DrillForPointOrFocus(int,int,int,double,double,struct UiaCacheRequest *,struct UiaCacheResponse *) Unknown
UIAutomationCore.dll!RemoteUiaNodeStub::Incoming_DrillForPointOrFocus(class UiaNode *,class ITargetContextInvoker *,class IServerConnection *,class MessageParser &,class MessageBuilder &) Unknown
UIAutomationCore.dll!RemoteUiaNodeStub::OnMessage(struct IUnknown *,class ITargetContextInvoker *,class IServerConnection *,enum Protocol_MethodId,class MessageParser &,class MessageBuilder &) Unknown
UIAutomationCore.dll!InvokeOnCorrectContext_Callback(void *) Unknown
UIAutomationCore.dll!ProcessIncomingRequest(class MessageParser &,class MessageBuilder &,class IServerConnection *) Unknown
UIAutomationCore.dll!HookBasedServerConnectionManager::HookCallback(void *,unsigned long,void * *,unsigned long *,void * *) Unknown
UIAutomationCore.dll!HookUtil<&HookBasedClientConnection::HookCallback(void *,unsigned long,void * *,unsigned long *,void * *),0>::CallOut(void *,unsigned long,void * *,unsigned long *,void * *) Unknown
UIAutomationCore.dll!HandleHookMessage(struct tagCWPSTRUCT *,unsigned long,void (*)(void *,unsigned long,void * *,unsigned long *,void * *),void (*)(int,void *)) Unknown
UIAutomationCore.dll!HookUtil<&HookBasedClientConnection::HookCallback(void *,unsigned long,void * *,unsigned long *,void * *),0>::CallWndProc(int,unsigned __int64,__int64) Unknown
user32.dll!DispatchHookW() Unknown
user32.dll!fnHkINLPCWPSTRUCTW() Unknown
user32.dll!__fnDWORD() Unknown
ntdll.dll!KiUserCallbackDispatcherContinue() Unknown
user32.dll!ZwUserPeekMessage() Unknown
user32.dll!PeekMessageW() Unknown
msctf.dll!CThreadInputMgr::PeekMessageW(struct tagMSG *,struct HWND__ *,unsigned int,unsigned int,unsigned int,int *) Unknown
> xul.dll!mozilla::widget::WinUtils::PeekMessageW(tagMSG * aMsg, HWND__ * aWnd, unsigned int aFirstMessage, unsigned int aLastMessage, unsigned int aOption) Line 699 C++
xul.dll!nsAppShell::ProcessNextNativeEvent(bool mayWait) Line 504 C++
xul.dll!nsBaseAppShell::DoProcessNextNativeEvent(bool mayWait) Line 140 C++
xul.dll!nsBaseAppShell::OnProcessNextEvent(nsIThreadInternal * thr, bool mayWait) Line 290 C++
xul.dll!nsThread::ProcessNextEvent(bool aMayWait, bool * aResult) Line 955 C++
xul.dll!NS_ProcessNextEvent(nsIThread * aThread, bool aMayWait) Line 517 C++
xul.dll!mozilla::ipc::MessagePump::Run(base::MessagePump::Delegate * aDelegate) Line 97 C++
xul.dll!MessageLoop::RunHandler() Line 320 C++
xul.dll!MessageLoop::Run() Line 300 C++
xul.dll!nsBaseAppShell::Run() Line 159 C++
xul.dll!nsAppShell::Run() Line 403 C++
xul.dll!nsAppStartup::Run() Line 289 C++
xul.dll!XREMain::XRE_mainRun() Line 4686 C++
xul.dll!XREMain::XRE_main(int argc, char * * argv, const mozilla::BootstrapConfig & aConfig) Line 4814 C++
xul.dll!XRE_main(int argc, char * * argv, const mozilla::BootstrapConfig & aConfig) Line 4906 C++
firefox.exe!do_main(int argc, char * * argv, char * * envp) Line 232 C++
firefox.exe!NS_internal_main(int argc, char * * argv, char * * envp) Line 306 C++
firefox.exe!wmain(int argc, wchar_t * * argv) Line 132 C++
firefox.exe!__scrt_common_main_seh() Line 253 C++
kernel32.dll!BaseThreadInitThunk() Unknown
ntdll.dll!RtlUserThreadStart() Unknown
Flags: needinfo?(aklotz)
My best guess is that UIAutomationCore.dll!AccNavUtils::HitTestOrFocus probably calls AccessibleWrap::accHitTest and/or AccessibleWrap::get_accFocus, and then something bad happened.

I wonder what triggers that UIAutomationCore.dll!Error::GeneralProviderError code path. Was that because of an error code or does UIA wrap its calls into us with a SEH handler or something...

ni? Jamie since he has some UIA contacts. Maybe they can give us a hint?
Flags: needinfo?(aklotz) → needinfo?(jteh)
(In reply to Aaron Klotz [:aklotz] from comment #33)
> I wonder what triggers that UIAutomationCore.dll!Error::GeneralProviderError
> code path. Was that because of an error code or does UIA wrap its calls into
> us with a SEH handler or something...

Looks like it's due to FAILED(HRESULT).

From what I see it looks like GeneralProviderError just logs information about the provider to ETW.
More detail on how to reproduce.

Setup:

1) Make sure the pre-allocated process manager is enabled via prefs (dom.ipc.processPrelaunch.enabled)
2) Make sure you have dom.ipc.processCount set to a value 1
3) Disable session restore, and set your browser up to open a new tab page on launch
4) Install the free desktop version of Sticky Password and set up a local password database.
5) In the Sticky Password main app window, select menu on the upper right, then select settings
6) Choose the 'web browsers' menu item and click the Install button for Firefox

This step will either install an extension directly or navigate Firefox to the web site to install the Sticky Passwords web extension. Go ahead and install this. Note if you have multiple browsers installed, appears the app tries to pick the default install for launch.

7) After the Sticky Addon is installed, close Firefox

STR:

1) Launch Firefox -
 check: new tab page should be open
 check: Sticky Password extension toolbar button should indicate an 'inactive' session via a grey icon.
2) Attach a debugger to all Firefox processes (helps if you use one that auto attaches to new child processes).
3) Click on the Sticky Password button in the Firefox toolbar
 check: desktop client should launch
 check: toolbar icon should change to a 'active' (colorful) icons
4) Navigate the open new tab page to any remote web site (triggering child process launch)

*crash*
I sent an email to my UIA contact at Microsoft, but haven't seen a response yet.

I tried reproducing this myself in case there's some additional info I can offer. Unfortunately, the Sticky Password app is horribly inaccessible and I can't even get to the menu to install the browser extension. Can someone perhaps provide the direct URL for the extension?
Flags: needinfo?(jteh)
Crash Signature: [@ I_RpcNegotiateTransferSyntax] → [@ I_RpcNegotiateTransferSyntax] [@ rpcrt4.dll@0x499ca] [@ rpcrt4.dll@0x49518]
(In reply to [:philipp] from comment #38)
> Created attachment 8956535 [details]
> sticky passwords webextension

This doesn't install correctly, fyi.
So far I've managed to reproduce this about three times. Currently trying again today.

Aaron, would you mind taking a crack at reproducing as well please. This might block 59.
Flags: needinfo?(aklotz)
Yeah, I'll dust off a Win7 box and see what I can find...
Flags: needinfo?(aklotz)
I have everything set up via Comment 35 but I have yet to reproduce the crash...
Also note as an addendum to Comment 35: Sticky Password does not play well with multiple instances of Firefox (any channel) or multiple profiles.

I could not get it to recognize the extension installation without removing all other instances of Firefox and Nightly except for the one I was working with, and I had to set a default profile.
Wild thought: I wonder if this is similar to bug 1294903, where file picker stuff was running in the COM MTA and the MTA was unexpectedly shutdown while code was still running. That bug was also specific to Windows 7.

In the case of the parent process, the MTA is not currently guaranteed to remain running for the entire lifetime of the COM process. In particular, I see a lot of CoInitializeEx/CoUninitialize pairs on background threads that will come and go. If the timing is just right, the MTA could be completely torn down.

This could explain the difficulty of reproducing this problem: The various players need to race such that something is running in the MTA concurrently with the last reference to the MTA being CoUninitialize'd.

The easiest way to land an experimental patch would be to instantiate an instance of mozilla::mscom::EnsureMTA sometime after XPCOM and the observer service are fired up (EnsureMTA depends on those components). That would guarantee that at least one thread is maintaining the MTA for the entire time that the process is running.
I think we should try this. Feel free to try locally first.
Flags: needinfo?(jmathies)
Attachment #8956659 - Flags: review?(jmathies)
Attachment #8956659 - Flags: review?(jmathies) → review+
(In reply to Aaron Klotz [:aklotz] from comment #45)
> Created attachment 8956659 [details] [diff] [review]
> Keep MTA continuously running in the background for entire life of parent
> process
> 
> I think we should try this. Feel free to try locally first.

Running a local build with this patch. Unfortunately I've not had much luck reproducing the crash in a local build but FWIW so far no crash. Will keep testing.
Flags: needinfo?(jmathies)
Hey Aaron, lets get this into nightly, I should be able to confirm pretty easily then.
Flags: needinfo?(aklotz)
https://hg.mozilla.org/integration/mozilla-inbound/rev/35129c889938eeefd8ccc81cf78a223e045e7f00
Bug 1424505: Ensure at least one consistent reference to the COM MTA at all times on Windows 7; r=jimm
The above push is inside a NIGHTLY_BUILD include guard until we've had a chance to assess its impact on Nightly.
Flags: needinfo?(aklotz)
(In reply to Cosmin Sabou [:cosmin_s] from comment #50)
> https://hg.mozilla.org/mozilla-central/rev/35129c889938

This didn't fix the issue, still crashing this morning.
Lets try this. In testing I found that the pre-allocated process manager [1] seemed to exacerbate this. That class has one unique feature, it recycles [2] content processes. My theory is that this is probably bad practice when we have accessibility clients reaching into these processes via COM doing weird unexpected things. Disabling the ppm fixed the crash locally as well, which seems to confirm this.

Note I think this could create negative side effects due to higher content process turnover rates.

[1] https://searchfox.org/mozilla-central/search?q=PreallocatedProcessManager&path=
[2] https://searchfox.org/mozilla-central/rev/588d8120aa11738657da93e09a03378bcd1ba8ec/dom/ipc/ContentParent.cpp#1809
Attachment #8957390 - Flags: review?(aklotz)
I eventually managed to get this installed, but can't repro the crash. :(

(In reply to Jim Mathies [:jimm] from comment #52)
> it recycles
> [2] content processes. My theory is that this is probably bad practice when
> we have accessibility clients reaching into these processes via COM doing
> weird unexpected things.

The thing is that those clients should only be able to make accessibility calls, and if we had a bug in our a11y code somewhere, I'd expect that to crash the content process, not the parent.

Are you able to tell me the last COM call you see with COM logging enabled just before the crash occurs?

The other thing to try might be to log every single call that hits the Handler. Unfortunately, that would be manual work - we don't have a framework for that - but it might be enlightening. It's possible a call hits the Handler but never ends up hitting the content process, at which point we'd know what call was causing the problem and be able to narrow it down at least.
This tab process had been recycled about three times before it crashed loading a bugzilla page.
Doesn't look like much in that log. Jamie, any thoughts?
Flags: needinfo?(jteh)
Attachment #8957412 - Attachment mime type: text/x-log → text/plain
Nope. :( Just a few QueryInterfaces on two distinct objects, probably just for retrieval of the top level document. (We do quite a few QueryInterfaces internally to cache common interfaces.)

Jim, can you set a parent process breakpoint for AccessibleHandler!mozilla::a11y::AccessibleHandler::ResolveIA2, perhaps just outputting the first few frames of the stack rather than breaking? Something like:
bp AccessibleHandler!mozilla::a11y::AccessibleHandler::ResolveIA2 "kp 5; g"
Most handler calls end up calling that function, so it's a good way to catch most of them. The last one before the crash may (or may not) be interesting.
Flags: needinfo?(jteh)
Comment on attachment 8957390 [details] [diff] [review]
disable process recycling v.1

Review of attachment 8957390 [details] [diff] [review]:
-----------------------------------------------------------------

::: dom/ipc/ContentParent.cpp
@@ +2916,5 @@
>    }
>  #ifdef ACCESSIBILITY
>    else if (aData && !strcmp(aTopic, "a11y-init-or-shutdown")) {
>      if (*aData == '1') {
> +      // Shut down the preallocated process manager to avoid recycled

Given that this bug is specific to Win7, should we gate this on !IsWin8OrNewer()?
Attachment #8957390 - Flags: review?(aklotz) → review+
(In reply to Aaron Klotz [:aklotz] from comment #58)
> Comment on attachment 8957390 [details] [diff] [review]
> disable process recycling v.1
> 
> Review of attachment 8957390 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> ::: dom/ipc/ContentParent.cpp
> @@ +2916,5 @@
> >    }
> >  #ifdef ACCESSIBILITY
> >    else if (aData && !strcmp(aTopic, "a11y-init-or-shutdown")) {
> >      if (*aData == '1') {
> > +      // Shut down the preallocated process manager to avoid recycled
> 
> Given that this bug is specific to Win7, should we gate this on
> !IsWin8OrNewer()?

Considered this. My guess is this is showing up on 8 and up just under different signatures so I didn't gate.
That's a fair point.
Here ya go!
Attachment #8957639 - Flags: feedback?(jteh)
Attachment #8957639 - Attachment description: accessible handler tracve 1 → accessible handler trace 1
Attachment #8957639 - Attachment is patch: false
Spoke with Aaron, the plan is to land the work around for confirmation over the weekend. If we come up with something else early next week we can back it out.
Pushed by jmathies@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/cb808626c8df
Avoid recycling content process when accessibility is active. r=aklotz
We can leave this as something to try and fix in a dot release, but at this point, I'm not going to block the 59.0 release. 

Our overall crash rate for the 59 release is low.  The crash rate for this signature is much higher on 59 than on 58, but for Win 7 only, and happens repeatedly for particular users,  rather than affecting all Win7 users. So let's aim for a 59.0.x dot release to give us time for testing. 

Aaron, is there any workaround for users we can add in SUMO or can you suggest wording for a "known issue" release note?
Flags: needinfo?(aklotz)
(In reply to Liz Henry (:lizzard) (needinfo? me) from comment #64)
> We can leave this as something to try and fix in a dot release, but at this
> point, I'm not going to block the 59.0 release. 
> 
> Our overall crash rate for the 59 release is low.  The crash rate for this
> signature is much higher on 59 than on 58, but for Win 7 only, and happens
> repeatedly for particular users,  rather than affecting all Win7 users. So
> let's aim for a 59.0.x dot release to give us time for testing. 

SGTM!

> 
> Aaron, is there any workaround for users we can add in SUMO or can you
> suggest wording for a "known issue" release note?

Yes. If the users are not using a legit a11y client (and most of our crash reports indicate this to be the case), they can disable accessibility in their preferences.
Flags: needinfo?(aklotz)
Adding to release notes for 59 as a known issue:

On-screen keyboard on some Windows 7 devices may cause a crash. Users can work around this issue by turning off accessibility features in Preferences, in the Privacy and Security Permissions section 

Joni, I know it's last minute, but can you add a page in SUMO for us to link to? That way, I can write a shorter release note and link to the page for the details.
Flags: needinfo?(jsavage)
Actually, never mind, I'll link to this existing page: https://support.mozilla.org/en-US/kb/accessibility-services#w_how-do-i-disable-firefox-accessibility-service
Flags: needinfo?(jsavage)
(In reply to Noemi Erli[:noemi_erli] from comment #70)
> https://hg.mozilla.org/mozilla-central/rev/cb808626c8df

Rats, still hit this today. 

The stacks in the report below are pretty good, the main thread is clearly digging for focus related information. The COM thread crashes in the background.

https://crash-stats.mozilla.com/report/index/99ab4ca0-cbcf-4aad-8549-c3f190180311#allthreads
One of the things I've noticed about the sticky passwords app, they mess with focus by placing non-firefox process UI on top of the firefox window in cases where they generate drop down menus and such. This tends to trigger top level window focus changes.
Took a look at the trace and the crash dump. Both are definitely related to calls to the content process to answer a focus query. However, the trace seems to suggest it was an accParent call after an accFocus call, where the dump suggests the crash occurred on the accFocus call.

Jim, in your STR (comment 35), you noted:

> 4) Navigate the open new tab page to any remote web site (triggering child process launch)

However, from what I can see, the new tab page is itself loaded in a content process. In fact, the same content process gets used for me (though I guess it probably doesn't with the process recycling disabled).

Anyway, it would seem that this is somehow related to a focus query hitting a remote document when it is either in the process of dying or in the process of being created. The former seems more likely, though I don't really understand why it would happen. Even if the document was shut down just before or while we were marshaling it (which is possible because shut down happens in the main thread while marshaling occurs in an MTA thread), we're still holding a reference, so it shouldn't be destroyed... and any calls to fill the handler cache should return CO_E_OBJNOTCONNECTED.
Comment on attachment 8957639 [details]
accessible handler trace 1

Thanks. See comment 73.
Attachment #8957639 - Flags: feedback?(jteh) → feedback+
(In reply to James Teh [:Jamie] from comment #73)
> Took a look at the trace and the crash dump. Both are definitely related to
> calls to the content process to answer a focus query. However, the trace
> seems to suggest it was an accParent call after an accFocus call, where the
> dump suggests the crash occurred on the accFocus call.
> 
> Jim, in your STR (comment 35), you noted:
> 
> > 4) Navigate the open new tab page to any remote web site (triggering child process launch)
> 
> However, from what I can see, the new tab page is itself loaded in a content
> process. In fact, the same content process gets used for me (though I guess
> it probably doesn't with the process recycling disabled).
> 

Doesn't make any difference I guess, process recycling doesn't seem to be involved.

Shorter STR that I use now:

1) have the sp apps installed but not logged in
2) open firefox, open a new tab page, close other tabs
3) click the sb toolbar icon, this should launch the desktop app
4) log into the desktop app
5) Navigate the new tab page to a bugzilla bug page

*crash*
Crash Signature: [@ I_RpcNegotiateTransferSyntax] [@ rpcrt4.dll@0x499ca] [@ rpcrt4.dll@0x49518] → [@ I_RpcNegotiateTransferSyntax] [@ rpcrt4.dll@0x499ca] [@ rpcrt4.dll@0x49518] [@ rpcrt4.dll@0x45e6a]
different approach:
this signature is clearly regressing in volume during the 59 cycle and is pretty common on nightly too - unfortunately we've purged all crash reports submitted before december 26, so it's not possible to get a regression range by looking at crash stats.

however this bug was filed as part of the nightly crash triage roster for build 20171207220423 and i'd assume the signature was caught just shortly after something regressed it (because it's occurring rather frequently).

in the week before build 20171207220423, there were just four patches landing in Core::Disability Access AP 
https://bugzilla.mozilla.org/buglist.cgi?f1=cf_status_firefox58&list_id=14059141&o1=notequals&resolution=FIXED&chfieldto=2017-12-07&chfield=target_milestone&query_format=advanced&chfieldfrom=2017-12-01&chfieldvalue=mozilla59&bug_status=RESOLVED&bug_status=VERIFIED&v1=fixed&component=Disability%20Access%20APIs&product=Core

because comment #73 mentioned a focus query, perhaps bug 1421144 is related?
See Comment 76.
Flags: needinfo?(jteh)
It's related in the sense that before bug 1421144 landed, it wasn't possible for an accessibility client to query the focus on demand. If a client missed the event (or if one wasn't fired as is the case in some obscure circumstances), tough luck; it couldn't retrieve the focus. So, before that point, the call to accFocus would simply return nothing. Now, if a remote document is focused, it correctly returns the focused remote accessible.

I'd already started to investigate this avenue, which is why I was speculating about a focus query hitting a dying document in comment 73:

> Anyway, it would seem that this is somehow related to a focus query hitting a remote document when it is either in the process of dying or in the process of being created. The former seems more likely, though I don't really understand why it would happen. Even if the document was shut down just before or while we were marshaling it (which is possible because shut down happens in the main thread while marshaling occurs in an MTA thread), we're still holding a reference, so it shouldn't be destroyed... and any calls to fill the handler cache should return CO_E_OBJNOTCONNECTED.
Flags: needinfo?(jteh)
I discussed this with Jamie and Jim on IRC. In the interest of hopefully stopping the bleeding on 59, I created a Try push of bug 1421144 backed out on top of mozilla-release so we can see if that helps. If it does, that would at least give us a path forward for 59.0.2 while we work to determine the root cause for 60+. For anybody that can consistently reproduce these crashes, feedback on the below builds would be greatly appreciated!

Win32:
https://queue.taskcluster.net/v1/task/ZfFIt3G8Rfq7AZaSMDysYQ/runs/0/artifacts/public/build/target.zip

Win64:
https://queue.taskcluster.net/v1/task/Nz7NUddlSfqnAguwd-jh1A/runs/0/artifacts/public/build/target.zip
(In reply to Ryan VanderMeulen [:RyanVM] from comment #79)
> I discussed this with Jamie and Jim on IRC. In the interest of hopefully
> stopping the bleeding on 59, I created a Try push of bug 1421144 backed out
> on top of mozilla-release so we can see if that helps. If it does, that
> would at least give us a path forward for 59.0.2 while we work to determine
> the root cause for 60+. For anybody that can consistently reproduce these
> crashes, feedback on the below builds would be greatly appreciated!
> 
> Win32:
> https://queue.taskcluster.net/v1/task/ZfFIt3G8Rfq7AZaSMDysYQ/runs/0/
> artifacts/public/build/target.zip
> 
> Win64:
> https://queue.taskcluster.net/v1/task/Nz7NUddlSfqnAguwd-jh1A/runs/0/
> artifacts/public/build/target.zip

Surfing in this build now, no crashes yet. Will report back in a bit.
> > Win64:
> > https://queue.taskcluster.net/v1/task/Nz7NUddlSfqnAguwd-jh1A/runs/0/
> > artifacts/public/build/target.zip
> 
> Surfing in this build now, no crashes yet. Will report back in a bit.

Newer hopefully more reliable STR:

Setup:
1) Install Sticky Passwords desktop app and the Firefox WebExtension (see comment 32)
2) Launch Firefox and open this bug page
3) Log into Sticky Passwords in Firefox
4) Click on the toolbar icon and select bookmarks
5) Add a Sticky Passwords bookmark to this bugzilla page
6) Lock the Sticky Passwords desktop app by right clicking on the desktop tray icon and select Lock
7) Configure Firefox to open a single new tab page, disable session restore and all pinned tabs

Repro:
1) Open Firefox 
2) Open a simple remote web page (www.techmeme.com)
3) Open a new tab page tab
4) Close the remote tab
6) click on the SB toolbar button, select bookmarks, and navigate to the bugzilla bookmark

result: *crash*

I've reproduced using this in nightly and tested with the try builds here. So far I haven't reproduced in the try build.
a user at https://www.reddit.com/r/firefox/comments/85zfc8/5901_64bit_win_7_random_crashes_since_last_week/ also reported back that the instability stopped when using the try build from comment 79
Let's move forward with the revert option for 59 then. Jamie, can you please give this your official blessing per our IRC conversation yesterday?
Attachment #8961084 - Flags: review?(jteh)
Attachment #8961084 - Flags: approval-mozilla-release?
Just realized I've been surfing with this build now for about eight hours. No issues.
(In reply to Noemi Erli[:noemi_erli] from comment #70)
> https://hg.mozilla.org/mozilla-central/rev/cb808626c8df

Just a note to self, I need to back this change out.
Flags: needinfo?(jmathies)
Comment on attachment 8961084 [details] [diff] [review]
revert bug 1421144

Consider it blessed. I realise no one is suggesting otherwise, but just to be super clear, this must only be considered a short-term bandaid. Reverting this long-term is not an acceptable option and I'm probably going to need some help fixing this properly, since I still can't reproduce it myself.
Attachment #8961084 - Flags: review?(jteh)
Attachment #8961084 - Flags: review+
Attachment #8961084 - Flags: review+
Comment on attachment 8961084 [details] [diff] [review]
revert bug 1421144

Yeah, I'm only treating this as a stop-gap for 59 for now. We can decide what to do for 60+ later in the cycle depending on how the investigation goes.
Attachment #8961084 - Flags: approval-mozilla-release? → approval-mozilla-release+
I'm going to track this as blocking for 60 too, to make sure we address this in some way before we ship.
So far, it appears that the backout had the intended affect on the crash rate for 59.0.2.
(In reply to Jim Mathies [:jimm] from comment #85)
> (In reply to Noemi Erli[:noemi_erli] from comment #70)
> > https://hg.mozilla.org/mozilla-central/rev/cb808626c8df
> 
> Just a note to self, I need to back this change out.

Try push:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=1735c496b8abd67254a6d125596f1bb2bed57588
Should we land the stop-gap on 60 as well if we don't have a resolution by, say, a week from now?
Hello, new non-tech person here.  I followed the link from this page:

  https://www.mozilla.org/en-US/firefox/59.0.2/releasenotes/

...regarding issues with the extension "Sticky Password".  Hoping to help, here's a link from the forum on the StickyPassword.com site that may be related:

  https://www.stickypassword.com/forum/viewtopic.php?f=133&t=91447

...especially/possibly the final comment (as of this writing).

That's it -- thanks to all for your significant efforts.
Per IRC discussion with Jim, we're going to backout rev cb808626c8df from all affected channels (Beta/Trunk) and also bug 1421144 from Beta to work around the crashes before 60 goes to release given where we are in the cycle.

https://hg.mozilla.org/mozilla-central/rev/d5bf9378b0802f7b10c551594bb6bd516f5040c3

https://hg.mozilla.org/releases/mozilla-beta/rev/8e7c044b2e71af8b7dba06ced0298fe50446c575
https://hg.mozilla.org/releases/mozilla-beta/rev/7d6400c2038cf35259e42d30541654a94e7a7b9a
Attachment #8961084 - Flags: approval-mozilla-beta+
Given that this crash only occurs on Windows 7, would it be possible to get bug 1421144 reinstated for 60 but disabled for Windows 7? (I would provide a trivial patch on top of bug 1421144.) That way, users on later versions of Windows would get the fix for bug 1421144, but we wouldn't crash on Windows 7. Especially given that 60 is an ESR, it'd be nice if users not running Windows 7 didn't have to wait for another release to get this fix.

Jim, what do you think?
(In reply to James Teh [:Jamie] from comment #97)
> Given that this crash only occurs on Windows 7, would it be possible to get
> bug 1421144 reinstated for 60 but disabled for Windows 7? (I would provide a
> trivial patch on top of bug 1421144.) That way, users on later versions of
> Windows would get the fix for bug 1421144, but we wouldn't crash on Windows
> 7. Especially given that 60 is an ESR, it'd be nice if users not running
> Windows 7 didn't have to wait for another release to get this fix.
> 
> Jim, what do you think?

Sure assuming that fix we keep backing out was valuable vs. trivial.

What we really need do here is figure out a way to debug this. My attempts to reproduce are mixed but if I try long enough I can. Problem is there's not much you can tell from the actual crash.

I wonder about these users who appear to be on Windows 7, with active on-screen keyboards. What type of user has this setup?
(In reply to Jim Mathies [:jimm] from comment #98)
> > Given that this crash only occurs on Windows 7, would it be possible to get
> > bug 1421144 reinstated for 60 but disabled for Windows 7?
> Sure assuming that fix we keep backing out was valuable vs. trivial.

It fixes a bug that users won't encounter that often (users don't tend to use the System menu that much), but it's rather annoying/confusing when they do hit it. Having said that, I don't know of a user aside from me who has reported this, so that might suggest it's not worth the effort for 60. I still think it needs to be fixed going forward, though.

The fact that this is so difficult to reproduce is super frustrating. I'm going to put some thought into how we can possibly reproduce this without Sticky Password, I think.
I'm trying to find a way to repro this without Sticky Password, since that's out of our control. Bug 1421144 introduced code which makes accFocus work for remote documents, and it is this which seems to expose these crashes on Windows 7. So, I just hacked together this test code which runs accFocus on the root accessible every 50 ms.

Unfortunately, I still can't reproduce this. In addition, my vm was single core, so I tried changing it to have 4 cores, but still couldn't reproduce it.

Compile with:
cl accFocusLoop.cpp /EHsc oleacc.lib
Then, run:
accFocusLoop hwnd
where hwnd is the HWND of the Firefox main window (in decimal).

Jim, I'm curious as to whether you can repro with this test code?
We need to make a decision this week about what to do for 61+ here. We've gone with the backout for the previous 2 cycles and can't keep doing this every cycle.
Flags: needinfo?(jteh)
Similar to the accFocus test, but this one uses the UI Automation client API to retrieve the focus in a loop. We believe this is triggered by a UIA client, so I thought this worth a shot. Compile with:
cl uiaFocusLoop.cpp /EHsc
Unfortunately, no joy: I still can't reproduce it. :(

I did notice that UIA doesn't behave correctly at all for accFocus in remote documents in Windows 7. It just returns the top level accessible, as if the IAccessible::accFocus call failed. It works fine for documents hosted in the parent process, though. This is certainly suspicious. Perhaps this has something to do with RPC_E_CANTCALLOUT_ININPUTSYNCCALL, since we don't enable that hack for UIA. It works correctly in Windows 10, though.

I also tried having Firefox return RPC_E_DISCONNECTED whenever it had to do a remote accFocus, since that error code could show up if a remote document dies while we're handling accFocus and I thought perhaps UIA couldn't handle it properly. Nope; still no crash.
Flags: needinfo?(jteh)
Assignee: jmathies → jteh
Comment on attachment 8972245 [details]
Bug 1424505: Don't try to retrieve the accessible focus from remote documents on Windows 7.

https://reviewboard.mozilla.org/r/240916/#review246740

Looks OK. Sorry if this impacts Windows 7 users.
Attachment #8972245 - Flags: review?(eitan) → review+
Pushed by jteh@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/68bfd841265f
Don't try to retrieve the accessible focus from remote documents on Windows 7. r=eeejay
Looks like Jamie's patch is holding up well for 61/62. This bug has become a bit of a mess - should we consider closing this one out and moving the follow-up work to a new one?
See Also: → 1461237
 (In reply to Ryan VanderMeulen [:RyanVM] from comment #107)
> Looks like Jamie's patch is holding up well for 61/62. This bug has become a
> bit of a mess - should we consider closing this one out and moving the
> follow-up work to a new one?

I guess that makes sense. It'd essentially be the same bug, just with less urgency. :)

Filed bug 1461237.
Status: NEW → RESOLVED
Closed: 6 years ago
Flags: needinfo?(jteh)
Resolution: --- → FIXED
Target Milestone: --- → mozilla61
I think we can call this verified despite the low ongoing crash rate given how it's still a couple orders of magnitude lower than it was previously.
Status: RESOLVED → VERIFIED
Flags: needinfo?(jmathies)
You need to log in before you can comment on or make changes to this bug.