Closed Bug 1308397 Opened 3 years ago Closed 3 years ago

Crash in mozilla::a11y::AccessibleWrap::GetIAccessibleFor

Categories

(Core :: Disability Access APIs, defect, critical)

52 Branch
Unspecified
Windows
defect
Not set
critical

Tracking

()

RESOLVED FIXED
mozilla52
Tracking Status
firefox50 --- unaffected
firefox51 --- fixed
firefox52 --- fixed

People

(Reporter: MarcoZ, Assigned: aklotz)

References

(Blocks 1 open bug)

Details

(Keywords: crash)

Crash Data

Attachments

(1 file, 2 obsolete files)

This bug was filed from the Socorro interface and is 
report bp-bebe2999-04df-4933-8c3f-7c0d02161007.
=============================================================

I have been encountering this crash while attempting to switch to my IRCCloud tab. The situation was as follows:

1. Started Nightly, and it restored two tabs from my previous session: Gmail and IRCCloud.
2. Worked in the Gmail tab, opened a bunch of bugs in new tabs and closed these when done.
3. After finishing, pressed Ctrl+Tab to switch to the IRCCloud tab.
4. Crash as indicated above.

This all is with E10S on. First build I'm seeing this in is the 2016-10-06 Nightly build, which contains the fix for bug 1304449.

More reports:
bp-5aaf0269-aaa6-4a4b-a56d-d9ecd2161007
bp-c3298423-d526-40ed-a21f-972d62161007

And the last one I got before I could restore with just a tab that didn't crash me, which is the Gmmail tab, is this, which actually crashed in the injected NVDA module:
bp-07e9a725-71e1-4902-82c7-93d712161007
Needless to say, I could load IRCCloud in a fresh tab without problems and haven't yet encountered that stack overflow crash any more. And in case this wasn't obvious from the crashes: It's a browser chrome crash, not a content one. Firefox completely shuts down, not just the content tab.
Attached patch Fix (obsolete) — Splinter Review
We only need to lazily resolve the COM proxy in the case where a ProxyAccessible method is called in a test. In the general case, we don't want that. I have separated that code out so that the public GetCOMInterface method just produces the interface, while the LazyResolveCOMInterface does the lazy resolution and is private.
Assignee: nobody → aklotz
Status: NEW → ASSIGNED
Attachment #8798942 - Flags: review?(tbsaunde+mozbugs)
OS: Windows 10 → Windows
(In reply to Aaron Klotz [:aklotz] from comment #2)
> Created attachment 8798942 [details] [diff] [review]
> Fix
> 
> We only need to lazily resolve the COM proxy in the case where a
> ProxyAccessible method is called in a test. In the general case, we don't
> want that. I have separated that code out so that the public GetCOMInterface
> method just produces the interface, while the LazyResolveCOMInterface does
> the lazy resolution and is private.

It seems worth being clear that we only need to get at the native interface for top level documents, and in that case we already have it in mCOMInterface because of SendCOMProxy() so we don't need the lazy resolution.

Given that and this will make calling GetNativeInterface() on proxy accessibles usually, but not always fail it seems like it would be good to assert it is only called on top level document accessible proxies.
(In reply to Trevor Saunders (:tbsaunde) from comment #3)
> It seems worth being clear that we only need to get at the native interface
> for top level documents, and in that case we already have it in
> mCOMInterface because of SendCOMProxy() so we don't need the lazy resolution.

With the new MSAA ID implementation we still create ProxyAccessibles for content subtrees that are sent up with ShowEvent, we just don't attach COM pointers to them and navigate with them. My motivation with the lazy acquisition of the COM interfaces is for the case that an XPCOM test could call a method on one of those ProxyAccessibles, which then needs to resolve that interface in order to proceed. In other words, for general use, only ProxyAccessibles represent top-level docs require the interface, but as soon as XPCOM enters the mix, then they all do.

Unless I'm missing something here. Feel free to pick this apart! ;-)
Flags: needinfo?(tbsaunde+mozbugs)
(In reply to Trevor Saunders (:tbsaunde) from comment #3)
> (In reply to Aaron Klotz [:aklotz] from comment #2)
> > Created attachment 8798942 [details] [diff] [review]
> > Fix
> > 
> > We only need to lazily resolve the COM proxy in the case where a
> > ProxyAccessible method is called in a test. In the general case, we don't
> > want that. I have separated that code out so that the public GetCOMInterface
> > method just produces the interface, while the LazyResolveCOMInterface does
> > the lazy resolution and is private.
> 
> It seems worth being clear that we only need to get at the native interface
> for top level documents, and in that case we already have it in
> mCOMInterface because of SendCOMProxy() so we don't need the lazy resolution.

actually this raises a question given the above this patch should be unnecessary because the problematic thing calling GetCOMInterface on a ProxyAccessible with null mCOMInterface should never happen.  However that would seem to be untrue, so what's going on?
ga bugzilla should have noticed my comment didn't actually go through.
Flags: needinfo?(tbsaunde+mozbugs)
(In reply to Aaron Klotz [:aklotz] from comment #4)
> (In reply to Trevor Saunders (:tbsaunde) from comment #3)
> > It seems worth being clear that we only need to get at the native interface
> > for top level documents, and in that case we already have it in
> > mCOMInterface because of SendCOMProxy() so we don't need the lazy resolution.
> 
> With the new MSAA ID implementation we still create ProxyAccessibles for
> content subtrees that are sent up with ShowEvent, we just don't attach COM
> pointers to them and navigate with them. My motivation with the lazy
> acquisition of the COM interfaces is for the case that an XPCOM test could
> call a method on one of those ProxyAccessibles, which then needs to resolve
> that interface in order to proceed. In other words, for general use, only
> ProxyAccessibles represent top-level docs require the interface, but as soon
> as XPCOM enters the mix, then they all do.
> 
> Unless I'm missing something here. Feel free to pick this apart! ;-)

that sounds right, I think this patch is safe, but I'm not sure I understand how we got into the problem case in the first place.
Flags: needinfo?(aklotz)
(In reply to Trevor Saunders (:tbsaunde) from comment #5)
> 
> actually this raises a question given the above this patch should be
> unnecessary because the problematic thing calling GetCOMInterface on a
> ProxyAccessible with null mCOMInterface should never happen.  However that
> would seem to be untrue, so what's going on?

Good question... Unclear.

Here is the bottom of the stack before we start recursing:


6504 xul!mozilla::a11y::AccessibleWrap::GetIAccessibleFor
6505 xul!mozilla::a11y::ProxyAccessible::GetCOMInterface
6506 xul!GetProxiedAccessibleInSubtree
6507 xul!mozilla::a11y::AccessibleWrap::GetRemoteIAccessibleFor
6508 xul!mozilla::a11y::AccessibleWrap::GetIAccessibleFor
6509 xul!mozilla::a11y::AccessibleWrap::get_accChild
650a xul!mozilla::a11y::ChildIDThunk::OnCall
650b ole32!Interceptor::CallIndirect
650c ole32!COMInterceptor
650d oleacc!AccWrap_Annotate::get_accState
650e UIAutomationCore!AccUtils::get_accState
650f UIAutomationCore!MsaaProxy::GetPropertyValue
6510 UIAutomationCore!UiaNode::ProviderGetPropertyValue
6511 UIAutomationCore!InProcClientAPIStub::UiaNode_GetPropertyValues
6512 UIAutomationCore!ComInvoker::CallTarget
6513 UIAutomationCore!InProcClientAPIStub::InvokeInProcAPI
6514 UIAutomationCore!UiaNode::CrossProcess_GetPropertyValues
6515 UIAutomationCore!RemoteUiaNodeStub::Incoming_GetPropertyValues
6516 UIAutomationCore!RemoteUiaNodeStub::OnMessage
6517 UIAutomationCore!InvokeOnCorrectContext_Callback
6518 UIAutomationCore!ComInvoker::CallTarget
6519 UIAutomationCore!ProcessIncomingRequest
651a UIAutomationCore!HookBasedServerConnectionManager::HookCallback
651b UIAutomationCore!HookUtil<&HookBasedClientConnection::HookCallback,0>::CallOut
651c UIAutomationCore!HandleHookMessage
651d UIAutomationCore!HookUtil<&HookBasedClientConnection::HookCallback,0>::CallWndProc
651e user32!GetMessageA
651f user32!MapWindowPoints
6520 ntdll!KiUserCallbackDispatcherContinue
6521 win32u!NtUserPeekMessage
6522 user32!PeekMessageW
6523 msctf!CThreadInputMgr::PeekMessageW
6524 xul!mozilla::widget::WinUtils::PeekMessageW
6525 xul!nsAppShell::ProcessNextNativeEvent
6526 xul!nsBaseAppShell::DoProcessNextNativeEvent
6527 xul!nsBaseAppShell::OnProcessNextEvent
6528 xul!nsThread::ProcessNextEvent
6529 xul!NS_ProcessNextEvent
652a xul!mozilla::ipc::MessagePump::Run
652b xul!MessageLoop::RunHandler
652c xul!MessageLoop::Run
652d xul!nsBaseAppShell::Run
652e xul!nsAppShell::Run
652f xul!nsAppStartup::Run
6530 xul!XREMain::XRE_mainRun
6531 xul!XREMain::XRE_main
6532 xul!XRE_main
6533 firefox!do_main
6534 firefox!NS_internal_main
6535 firefox!wmain
6536 firefox!invoke_main
6537 firefox!__scrt_common_main_seh
6538 kernel32!BaseThreadInitThunk
6539 ntdll!RtlUserThreadStart

So we've gone through ChildIDThunk whose removal hadn't landed yet. The debugger is giving me an incorrect value for the child ID -- I know this because its VARIANT type is invalid. OTOH, we check for invalid variant types in our code, so if that VARIANT type was actually what the debugger said it was, we wouldn't recurse.
Flags: needinfo?(aklotz)
Okay, I looked at a more recent dump from a crash that occurred after we removed ChildIDThunk. In this case the problem is that IEnumVARIANT must have been queried on a top-level DocAccessible in the chrome process. The children are all ProxyAccessibles with null COM proxies, so when the client calls IEnumVARIANT::Next, we need to lazily resolve the children's COM proxies.

This patch as-is will fix the stack overflow; any COM interfaces returned by the enumerator will point directly to content. This patch is also nice because it will properly handle any other instances of resolving children using the generic Accessible* interface.

A further mitigation specific to this particular problem could be to have any QueryInterface(IID_IEnumVARIANT) on a top-level DocAccessible in chrome resolve to an enumerator that originates from content, but I think if we wanted that I would do so in a follow-up bug.
Flags: needinfo?(tbsaunde+mozbugs)
(In reply to Aaron Klotz [:aklotz] from comment #9)
> Okay, I looked at a more recent dump from a crash that occurred after we
> removed ChildIDThunk. In this case the problem is that IEnumVARIANT must
> have been queried on a top-level DocAccessible in the chrome process. The
> children are all ProxyAccessibles with null COM proxies, so when the client
> calls IEnumVARIANT::Next, we need to lazily resolve the children's COM
> proxies.

ok, so your saying that a client has a proxy to to a remote doc accessible right? then it QIs that to IEnumVariant, and calls next() on the result?  Shouldn't the QI result in another COM proxy to the remote process, and so next() should be handled in the child?

> This patch as-is will fix the stack overflow; any COM interfaces returned by
> the enumerator will point directly to content. This patch is also nice
> because it will properly handle any other instances of resolving children
> using the generic Accessible* interface.

which Accessible* interface do you mean?
I'm not sure what other instances you mean, or how they might be broken.

> A further mitigation specific to this particular problem could be to have
> any QueryInterface(IID_IEnumVARIANT) on a top-level DocAccessible in chrome
> resolve to an enumerator that originates from content, but I think if we
> wanted that I would do so in a follow-up bug.

ok, so that answers the first question, but I don't understand why the QI isn't returning a COM proxy, given that it was called on one...
Flags: needinfo?(tbsaunde+mozbugs)
Flags: needinfo?(aklotz)
Another crash bp-a961e54b-f113-4d61-a809-c67842161012

This time, the situation was as follows:

1. In the last session, I had 3 or 4 tabs open: Gmail, IRCCloud, a news site, and Github. When I closed Nightly, the Github page, the last tab, was in the foreground.
2. I have Nightly set to restore tabs upon restart. So when I next started it, it returned to the Github page. I worked there for a bit, since I wanted to collect information I needed to attach to a Github issue.
3. After I was done, I pressed Ctrl+Tab, which would wrap around to the Gmail tab, the *first* in the previous series.
4. This is when the above crash occurred.

The other crashes may have also occurred when I tried to go from the last tab of the previous session to the first one. Because when I next restarted Nightly after the above crash, I pressed Ctrl+Shift+Tab from the Github page instead, and this time it didn't crash, and I could in turn restore all tabs without crashing.
Crash Signature: [@ mozilla::a11y::AccessibleWrap::GetIAccessibleFor] → [@ mozilla::a11y::AccessibleWrap::GetIAccessibleFor], [@ mozilla::a11y::ProxyAccessible::GetCOMInterface]
Flags: needinfo?(aklotz)
Oops, re-requesting NI from Aaron, don't know why this got cancelled upon my previous comment. :-( Didn't purposely do that.
Flags: needinfo?(aklotz)
Mozilla's Gmail instance started permanently causing that crash for me, so I had to disable E10S to be able to continue working. More reports:

bp-9c938a42-5800-4e48-859f-8d88e2161012
bp-40d33a73-a4ee-4749-acfa-2b00e2161012
bp-55301746-3c6b-4fb9-8714-eb86c2161012
bp-2eaa20ac-f50f-4567-abe7-0cf0b2161012
bp-487fb470-e47b-4696-b75d-29a3c2161012
(In reply to Trevor Saunders (:tbsaunde) from comment #10)
> > This patch as-is will fix the stack overflow; any COM interfaces returned by
> > the enumerator will point directly to content. This patch is also nice
> > because it will properly handle any other instances of resolving children
> > using the generic Accessible* interface.
> 
> which Accessible* interface do you mean?
> I'm not sure what other instances you mean, or how they might be broken.

I'm referring to any code that would call Accessible::GetChildAt(index) to obtain a child instead of using IAccessible::get_accChild(). The returned child is a ProxyAccessible whose COM proxy would need to be resolved.

> 
> > A further mitigation specific to this particular problem could be to have
> > any QueryInterface(IID_IEnumVARIANT) on a top-level DocAccessible in chrome
> > resolve to an enumerator that originates from content, but I think if we
> > wanted that I would do so in a follow-up bug.
> 
> ok, so that answers the first question, but I don't understand why the QI
> isn't returning a COM proxy, given that it was called on one...

OK, now I'm questioning this theory too... you're right, the QI should have been invoked against the COM proxy which should have returned an enumerator from content... OK, more investigation needed.

(Thanks for your patience, BTW... I'm still loaded up on cold meds and my brain is obviously not exactly running at full capacity ATM ;-)
Flags: needinfo?(aklotz)
(In reply to Marco Zehe (:MarcoZ) from comment #13)
> Mozilla's Gmail instance started permanently causing that crash for me, so I
> had to disable E10S to be able to continue working. More reports:
> 
> bp-9c938a42-5800-4e48-859f-8d88e2161012

This one is from a call to AccessibleWrap::accHitTest. We call ChildAtPoint to get the XPAccessible, and then we need to resolve its proxy to return as the result.

I think that this patch needs further work.
Attached patch Fix (r2) (obsolete) — Splinter Review
From looking at multiple dumps, it becomes clear how this should work:

At all times but for one exception, we should try to lazily resolve a ProxyAcccessible's COM proxy if it is null.

That exception is when calling AccessibleWrap::GetIAccessibleFor with CHILDID_SELF. In that case, attempting to resolve will cause the stack overflow because the lazy COM proxy resolution itself calls AccessibleWrap::GetIAccessibleFor(CHILDID_SELF). Instead, we should be resolving the absolute MSAA ID of the object and searching on that (which we already do, provided that we avoid lazy resolution in this exceptional case).

Unfortunately I don't have enough information to determine how an enumerator ended up walking ProxyAccessibles. I'll probably have to add some asserts to a local build to try to figure that one out.
Attachment #8798942 - Attachment is obsolete: true
Attachment #8798942 - Flags: review?(tbsaunde+mozbugs)
Attachment #8800358 - Flags: review?(tbsaunde+mozbugs)
(In reply to Aaron Klotz [:aklotz] from comment #15)
> (In reply to Marco Zehe (:MarcoZ) from comment #13)
> > Mozilla's Gmail instance started permanently causing that crash for me, so I
> > had to disable E10S to be able to continue working. More reports:
> > 
> > bp-9c938a42-5800-4e48-859f-8d88e2161012
> 
> This one is from a call to AccessibleWrap::accHitTest. We call ChildAtPoint
> to get the XPAccessible, and then we need to resolve its proxy to return as
> the result.

Oh, the mouse pointer! Since I don't use the mouse, but touch the trackpad randomly during the day, or sometimes need to emulate mouse clicks, the mouse can end up anywhere on the screen, either triggering hit testing because NVDA wants to see what's actually currently under the mouse pointer, or not if the mouse pointer points to the chrome, or even outside the Firefox window.
> At all times but for one exception, we should try to lazily resolve a
> ProxyAcccessible's COM proxy if it is null.

I'm not convinced there is a case where we should have to do that, but it shouldn't hurt.

> That exception is when calling AccessibleWrap::GetIAccessibleFor with
> CHILDID_SELF. In that case, attempting to resolve will cause the stack
> overflow because the lazy COM proxy resolution itself calls
> AccessibleWrap::GetIAccessibleFor(CHILDID_SELF). Instead, we should be
> resolving the absolute MSAA ID of the object and searching on that (which we
> already do, provided that we avoid lazy resolution in this exceptional case).

yeah, that seems reasonable, and I can see why it would fix this.

> Unfortunately I don't have enough information to determine how an enumerator
> ended up walking ProxyAccessibles. I'll probably have to add some asserts to
> a local build to try to figure that one out.

that'd be nice, because that probably shouldn't be happening.
(In reply to Aaron Klotz [:aklotz] from comment #14)
> (In reply to Trevor Saunders (:tbsaunde) from comment #10)
> > > This patch as-is will fix the stack overflow; any COM interfaces returned by
> > > the enumerator will point directly to content. This patch is also nice
> > > because it will properly handle any other instances of resolving children
> > > using the generic Accessible* interface.
> > 
> > which Accessible* interface do you mean?
> > I'm not sure what other instances you mean, or how they might be broken.
> 
> I'm referring to any code that would call Accessible::GetChildAt(index) to
> obtain a child instead of using IAccessible::get_accChild(). The returned
> child is a ProxyAccessible whose COM proxy would need to be resolved.

I don't think Acccessible::GetChildAt() will do something useful if you call it on a proxy, so code that needs that to work is probably broken anyway.

> > 
> > > A further mitigation specific to this particular problem could be to have
> > > any QueryInterface(IID_IEnumVARIANT) on a top-level DocAccessible in chrome
> > > resolve to an enumerator that originates from content, but I think if we
> > > wanted that I would do so in a follow-up bug.
> > 
> > ok, so that answers the first question, but I don't understand why the QI
> > isn't returning a COM proxy, given that it was called on one...
> 
> OK, now I'm questioning this theory too... you're right, the QI should have
> been invoked against the COM proxy which should have returned an enumerator
> from content... OK, more investigation needed.
> 
> (Thanks for your patience, BTW... I'm still loaded up on cold meds and my
> brain is obviously not exactly running at full capacity ATM ;-)

no worries, I've been slow and this is kind of tricky.
Comment on attachment 8800358 [details] [diff] [review]
Fix (r2)

># HG changeset patch
># User Aaron Klotz <aklotz@mozilla.com>
># Date 1475857893 21600
>#      Fri Oct 07 10:31:33 2016 -0600
># Node ID 4967881a08cab78e82b48c3e8e56e6addb734a88
># Parent  445f114323f9b32bb225367513bb83d30dd6b174
>Bug 1308397: Fix stack overflow by ensuring that AccessibleWrap::GetIAccessibleFor does not attempt to lazily resolve a COM proxy when the child id is CHILDID_SELF; r=tbsaunde

well, it does resolve it, it just fixes the id first.

>-ProxyAccessible::GetCOMInterface(void** aOutAccessible) const
>+ProxyAccessible::GetCOMInterfaceNoLazyResolve(void** aOutAccessible) const

I might name it GetCachedCOMInterface, but whatever.

>   if (varChild.lVal == CHILDID_SELF) {
>     *aIsDefunct = IsDefunct();
>     if (*aIsDefunct) {
>       return nullptr;
>     }
>-    GetNativeInterface(getter_AddRefs(result));
>+    if (IsProxy()) {
>+      // We do not want to attempt to lazily resolve a null COM proxy in the
>+      // case where we're looking for CHILDID_SELF, as that would call back
>+      // into this function and we'd eventually overflow the stack.
>+      mBits.proxy->GetCOMInterfaceNoLazyResolve(getter_AddRefs(result));
>+    } else {
>+      GetNativeInterface(getter_AddRefs(result));
>+    }

kind of gross special casing, but meh.

Looking at the callers of GetIAccessible there are 3.  ResolveChild() doesn't try calling GetIAccessible() with CHILDID_SELF.  I think, but am not sure accChild will never need to resolve a proxy for CHILDID_SELF, and shouldn't be called on a proxyied accessible anyway.  The other caller is ProxyAccessible::GetCOMinterface() which always passes CHILDID_SELF, which now works, but feels slightly hacky? maybe we can just change that caller to pass wrapper->GetExistingID()?
>     if (result) {
>       return result.forget();
>     }
>     // If we're not a proxy, there's nothing more we can do to attempt to
>     // resolve the IAccessible, so we just fail.
>     if (!IsProxy()) {
>       return nullptr;
>     }
Attachment #8800358 - Flags: review?(tbsaunde+mozbugs) → review+
Attached patch Fix (r3)Splinter Review
(In reply to Trevor Saunders (:tbsaunde) from comment #20)
> Comment on attachment 8800358 [details] [diff] [review]
> Fix (r2)
> The other caller is
> ProxyAccessible::GetCOMinterface() which always passes CHILDID_SELF, which
> now works, but feels slightly hacky? maybe we can just change that caller to
> pass wrapper->GetExistingID()?

You're right, that's the real fix. As long as we pass the real ID we shouldn't need to have separate functions and any of that other stuff. Patch updated to reflect that.
Attachment #8800358 - Attachment is obsolete: true
Attachment #8801829 - Flags: review+
(In reply to Aaron Klotz [:aklotz] from comment #21)
> Created attachment 8801829 [details] [diff] [review]
> Fix (r3)
> 
> (In reply to Trevor Saunders (:tbsaunde) from comment #20)
> > Comment on attachment 8800358 [details] [diff] [review]
> > Fix (r2)
> > The other caller is
> > ProxyAccessible::GetCOMinterface() which always passes CHILDID_SELF, which
> > now works, but feels slightly hacky? maybe we can just change that caller to
> > pass wrapper->GetExistingID()?
> 
> You're right, that's the real fix. As long as we pass the real ID we
> shouldn't need to have separate functions and any of that other stuff. Patch
> updated to reflect that.

yeah, that's more better.
https://hg.mozilla.org/integration/mozilla-inbound/rev/6edec02ffc983c7a1ef65b7e7e4ce61c084e0b67
Bug 1308397: Fix stack overflow by ensuring that ProxyAccessible::GetCOMInterface uses its real MSAA ID to lazily resolve its COM proxy, instead of using CHILDID_SELF; r=tbsaunde
https://hg.mozilla.org/mozilla-central/rev/6edec02ffc98
Status: ASSIGNED → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla52
Comment on attachment 8801829 [details] [diff] [review]
Fix (r3)

Approval Request Comment
[Feature/regressing bug #]: bug 1304449
[User impact if declined]: This needs to ride along with bug 1304449. If not, users running a11y on e10s on Windows will be hit with stability problems.
[Describe test coverage new/current, TreeHerder]: Landed on m-c, no more crashes for this signature.
[Risks and why]: Low, trivial patch.
[String/UUID change made/needed]: None
Attachment #8801829 - Flags: approval-mozilla-aurora?
Comment on attachment 8801829 [details] [diff] [review]
Fix (r3)

Fix a crash. Take it in 51 aurora.
Attachment #8801829 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
You need to log in before you can comment on or make changes to this bug.