Closed Bug 857348 Opened 11 years ago Closed 5 years ago

crash in mozilla::a11y::XULButtonAccessible::ContainsMenu

Categories

(Core :: Disability Access APIs, defect, P5)

x86
Windows NT
defect

Tracking

()

RESOLVED WORKSFORME
mozilla60
Tracking Status
firefox60 --- fixed

People

(Reporter: wsmwk, Unassigned)

References

Details

(Keywords: crash, Whiteboard: [tbird crash] a11y:crash-tree)

Crash Data

similar to one of the other bug reports regarding windows file dialog

This bug was filed from the Socorro interface and is 
report bp-bb4584e0-0f5f-4f8f-b986-dcce12130329 .
============================================================= 
0 	xul.dll 	mozilla::a11y::XULButtonAccessible::ContainsMenu 	accessible/src/xul/XULFormControlAccessible.cpp:233
1 	xul.dll 	mozilla::a11y::XULButtonAccessible::NativeState 	accessible/src/xul/XULFormControlAccessible.cpp:118
2 	xul.dll 	Accessible::State 	accessible/src/generic/Accessible.cpp:1440
3 	xul.dll 	AccessibleWrap::get_accState 	accessible/src/msaa/AccessibleWrap.cpp:468
4 	rpcrt4.dll 	Invoke 	
5 	rpcrt4.dll 	NdrStubCall2 	
6 	ole32.dll 	NdrpCreateStub 	
7 	oleaut32.dll 	CUnivStubWrapper::Invoke 	
8 	ole32.dll 	SyncStubInvoke
Component: Disability Access → Disability Access APIs
Product: Thunderbird → Core
Version: 17 → unspecified
this looks like our old friend the mysterious null mContent that we can't come up with an explaination for...
(In reply to Trevor Saunders (:tbsaunde) from comment #1)
> this looks like our old friend the mysterious null mContent that we can't
> come up with an explaination for...

if that was cycle collected object again then cc unlink should call Shutdown (i.e. set up eIsDefunct bit). Trev?
Whiteboard: [tbird crash]
(In reply to alexander :surkov from comment #2)
> (In reply to Trevor Saunders (:tbsaunde) from comment #1)
> > this looks like our old friend the mysterious null mContent that we can't
> > come up with an explaination for...
> 
> if that was cycle collected object again then cc unlink should call Shutdown
> (i.e. set up eIsDefunct bit). Trev?

as I may have said on irc I can't see how we'd be here if the accessible had been unlinked by cc since if the  cc decided it was garbage then nobody outside could have held a ref, but clearly they did to make the call into us that crashes.
it's reasonable but I don't see other way how we can null out mContent and stay not defunct. Wouldn't it safer/correct to call Shutdown on unlink?
Crash Signature: [@ mozilla::a11y::XULButtonAccessible::ContainsMenu()] → [@ mozilla::a11y::XULButtonAccessible::ContainsMenu()] [@ mozilla::a11y::XULButtonAccessible::ContainsMenu]
still there - 58 [1], still no explanation, how we get run out of a content.

1) It'd be nice to make State/NativeState a const methods, so we are guaranteed that we don't Shutdown() after IsDefunct() call. I don't think it's the case though.
2) We should probably also add a guard that guarantees (or asserts) us that mContent and eIsDefunct flag are always in sync.

Jamie, anything else in approaching this?

[1] https://crash-stats.mozilla.com/signature/?signature=mozilla%3A%3Aa11y%3A%3AXULButtonAccessible%3A%3AContainsMenu&date=%3E%3D2018-02-06T13%3A20%3A48.000Z&date=%3C2018-02-13T13%3A20%3A48.000Z&_columns=date&_columns=product&_columns=version&_columns=build_id&_columns=platform&_columns=reason&_columns=address&_columns=install_time&_sort=-date&page=1#reports
Flags: needinfo?(jteh)
(In reply to alexander :surkov from comment #5)
> 1) It'd be nice to make State/NativeState a const methods, so we are
> guaranteed that we don't Shutdown() after IsDefunct() call. I don't think
> it's the case though.

Agreed that it's unlikely, but the const makes sense.

> 2) We should probably also add a guard that guarantees (or asserts) us that
> mContent and eIsDefunct flag are always in sync.

Any thoughts on where this would be added?

> Jamie, anything else in approaching this?

Am I right that your suggestion in comment 4 about calling Shutdown on unlink got implemented in 0fe2b3363c9e (bug 1240893)?

I took a look at the raw dump in WinDBG. I wanted to inspect mStateFlags and mContent, but unfortunately, those values were unavailable in the dump.

I really don't have a clue, but here are a few points for thought:
1. I wonder whether |this| actually got deleted already. Of course, that raises the question of how - that should be impossible - but it's another possible explanation for mContent being null I guess. Or not, maybe... :)
2. I see this is a re-entrant call. A client is fetching a remote (content) accessible, and while that is happening, UIA comes along and tries to query a local XUL accessible. I don't see how that could break anything in this case, but it makes me wonder, for example, whether there's any mutation that doesn't complete fully until a platform a11y call returns. Again, I very much doubt it.
Flags: needinfo?(jteh)
(In reply to James Teh [:Jamie] from comment #6)
> (In reply to alexander :surkov from comment #5)
> > 1) It'd be nice to make State/NativeState a const methods, so we are
> > guaranteed that we don't Shutdown() after IsDefunct() call. I don't think
> > it's the case though.
> 
> Agreed that it's unlikely, but the const makes sense.

ok, filed bug 1438193

> > 2) We should probably also add a guard that guarantees (or asserts) us that
> > mContent and eIsDefunct flag are always in sync.
> 
> Any thoughts on where this would be added?

If that's the case then it either mContent was null from beginning, or it was nullified without eIsDefunct flag setting on. Both cases are unlikely, because I believe we don't pass null into constructor, and we nullify mContent in Shutdown() only. So at least it makes sense to assert in IsDefunct() if we've got out of sync. If we hit it, then we can dig it into more.

> > Jamie, anything else in approaching this?
> 
> Am I right that your suggestion in comment 4 about calling Shutdown on
> unlink got implemented in 0fe2b3363c9e (bug 1240893)?

yes

> I took a look at the raw dump in WinDBG. I wanted to inspect mStateFlags and
> mContent, but unfortunately, those values were unavailable in the dump.
> 
> I really don't have a clue, but here are a few points for thought:
> 1. I wonder whether |this| actually got deleted already. Of course, that
> raises the question of how - that should be impossible - but it's another
> possible explanation for mContent being null I guess. Or not, maybe... :)

so we killed an object while somebody keeps it, which would point to either UIA bug, which keeps a raw not addrefed pointer, or we kill non cyclely-zero addrefed object? Btw, there's ole32 crash on Firefox 52, which probably doesn't speaks in favor of UIA bug theory.

> 2. I see this is a re-entrant call. A client is fetching a remote (content)
> accessible, and while that is happening, UIA comes along and tries to query
> a local XUL accessible. I don't see how that could break anything in this
> case, but it makes me wonder, for example, whether there's any mutation that
> doesn't complete fully until a platform a11y call returns. Again, I very
> much doubt it.

Not sure, I caught the not yet completed mutation case, could you elaborate it a bit please? Also curious if overlapping of these crashes tells anything to you.
(In reply to alexander :surkov from comment #7)
> So at least it makes sense to assert in
> IsDefunct() if we've got out of sync. If we hit it, then we can dig it into
> more.

Makes perfect sense. Ship it. :)

> > 1. I wonder whether |this| actually got deleted already. Of course, that
> > raises the question of how - that should be impossible - but it's another
> > possible explanation for mContent being null I guess. Or not, maybe... :)
> so we killed an object while somebody keeps it, which would point to either
> UIA bug, which keeps a raw not addrefed pointer,

It's not possible for it to get a pointer without it being addRefed, so the bug would have to be that Release got called inappropriately. Same thing at the end of the day; just clarifying.

> or we kill non cyclely-zero addrefed object?

Yeah, though it does seem unlikely.

> > 2. I see this is a re-entrant call. A client is fetching a remote (content)
> > accessible, and while that is happening, UIA comes along and tries to query
> > a local XUL accessible. I don't see how that could break anything in this
> > case, but it makes me wonder, for example, whether there's any mutation that
> > doesn't complete fully until a platform a11y call returns.
> Not sure, I caught the not yet completed mutation case, could you elaborate
> it a bit please?

It's tricky because I don't have any ideas about what this might be, so it's hard to come up with an example. The best example I can think of is bug 1336971. The issue there was that during a call to AccessibleWrap::GetRemoteIAccessibleFor, the array of remote docs mutated while we were walking it. So, we made a COM call for one of the remote docs, and during that call, COM pumped messages. One of those messages somehow caused the remote docs array to mutate. We still don't really understand what caused it, even though a band-aid patch was landed to work around it. As Aaron noted in bug 1336971 comment 2:

> I wouldn't expect that to happen given that the parent main thread is blocked on a remote COM call. ie, it shouldn't be able to dispatch any incoming IPDL while waiting for a reply from COM.
> OTOH, since the main thread is in the single-threaded apartment, COM is continuously pumping messages while waiting for that reply. If we handle a message in such a way that we inadvertently trigger a nested event loop that ends up dispatching IPDL, then I guess it could happen.

So, this raises two questions:
1. What message (or COM call) could possibly be (perhaps indirectly) dispatching ipdl messages during our outer COM call?
2. How could some ipdl message end up causing this non-defunct null mContent behaviour? If an ipdl message arrived telling us to shut down an accessible, well, the defunct bit should have been set. Is there some call that only partially mutates state and then requires a subsequent queued call to finish up?
Another disturbing possibility: could something that executes as part of the call to Accessible::State (after the call to IsDefunct) end up triggering a nested event loop which dispatches ipdl? If that were to happen, the accessible might get shut down *after* IsDefunct is called but before Accessible::State returns. That would, unfortunately, also probably mean that making State const wouldn't solve the problem.
(In reply to James Teh [:Jamie] from comment #9)
> Another disturbing possibility: could something that executes as part of the
> call to Accessible::State (after the call to IsDefunct) end up triggering a
> nested event loop which dispatches ipdl? If that were to happen, the
> accessible might get shut down *after* IsDefunct is called but before
> Accessible::State returns. That would, unfortunately, also probably mean
> that making State const wouldn't solve the problem.

Yep, a good and scary theory. Being unfamiliar to outer calls and inner loops, do 1 and 2 from comment 8 suggest different scenarios from this one?

I guess we could set wrap each Accessible class made by MSAA, to set a flag before a call and unset it after, and crash if we hit Shutdown() in between, which is ugly, but option.

Is there a single entry point to a triggering a nested event loop, and if so, then can we ban all 'unauthorized' callers threre? I don't know the subject well, so sorry if it sounds vague.
(In reply to alexander :surkov from comment #10)
> > Another disturbing possibility: could something that executes as part of the
> > call to Accessible::State (after the call to IsDefunct) end up triggering a
> > nested event loop which dispatches ipdl? If that were to happen, the
> > accessible might get shut down *after* IsDefunct is called but before
> > Accessible::State returns.
> Yep, a good and scary theory. Being unfamiliar to outer calls and inner
> loops, do 1 and 2 from comment 8 suggest different scenarios from this one?

I'm not that familiar with IPDL dispatch in Gecko either, unfortunately, so a lot of this is partially educated speculation. That said, I'm starting to think my idea from comment 8 isn't possible. As you've noted, I don't see how mContent could become null without a call to Shutdown, which would also set the defunct flag.

> I guess we could set wrap each Accessible class made by MSAA, to set a flag
> before a call and unset it after, and crash if we hit Shutdown() in between,
> which is ugly, but option.

I think that might be the only way we're going to track this down. Note that this would have to be state shared across Accessibles, not just the same Accessible instance. We really don't want *any* Accessible being shut down during an accessibility call, since it could be a related accessible (e.g. a child) that we're about to query.

> Is there a single entry point to a triggering a nested event loop, and if
> so, then can we ban all 'unauthorized' callers threre?

Unfortunately, I don't know either. I only know that dispatching IPDL within a nested event loop is perhaps possible because of Aaron's bug 1336971 comment 2. However, even Aaron didn't seem to know what might actually trigger this.
https://hg.mozilla.org/mozilla-central/rev/1db3a600ea01
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla60
not yet fixed, just one assertion from comment #8
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
(In reply to alexander :surkov from comment #15)
> here's another reincarnation of the bug I believe: bug 857339.
 
not bug 857339, because that bug has os x crash stacks, these bugs just have same signature.

> https://crash-stats.mozilla.com/signature/
> ?signature=mozilla%3A%3Aa11y%3A%3AFocusManager%3A%3AIsFocused&date=%3E%3D2017
> -12-22T10%3A39%3A21.000Z&date=%3C2018-03-22T12%3A39%3A21.
> 000Z&_columns=date&_columns=product&_columns=version&_columns=build_id&_colum
> ns=platform&_columns=reason&_columns=address&_columns=install_time&_sort=-
> date&page=1

the latest crash is 58, taking into account that esr is 60 now, then it probably should also go to wontfixes
Priority: -- → P5
Whiteboard: [tbird crash] → [tbird crash] a11y:crash-tree

Closing because no crashes reported for 12 weeks.

Status: REOPENED → RESOLVED
Closed: 6 years ago5 years ago
Resolution: --- → WORKSFORME
You need to log in before you can comment on or make changes to this bug.