Closed Bug 1332690 Opened 7 years ago Closed 7 years ago

Crash in IPCError-browser | PDocAccessibleParent::AddChildDoc binding to nonexistant proxy!

Categories

(Core :: Disability Access APIs, defect)

Unspecified
Windows 10
defect
Not set
critical

Tracking

()

RESOLVED FIXED
mozilla55
Tracking Status
firefox52 --- unaffected
firefox-esr52 --- unaffected
firefox53 + verified
firefox54 + fixed
firefox55 + fixed

People

(Reporter: marcia, Assigned: handyman)

References

Details

(Keywords: crash, topcrash, Whiteboard: aes+)

Crash Data

Attachments

(4 files, 5 obsolete files)

This bug was filed from the Socorro interface and is 
report bp-f65b09a1-69ac-4fe8-9216-9e1832170120.
=============================================================

Seen while looking at crash stats: http://bit.ly/2jxcQFm. Crashes started using the 20170120030214 build.
Whiteboard: aes+
Looks like a more verbose version of bug 1331255. Closing that one as a dup, as this one now provides more clarity.
(In reply to Trevor Saunders (:tbsaunde) from comment #3)
> Created attachment 8828926 [details] [diff] [review]
> return failure from DocAccessibleParent::RecvShow() when appropriate

we shouldn't really tolerate that anyway, and it seems like the most likely way for this to happen is failing to add accessibles we should, so lets add more checking for that and see what happens.
Comment on attachment 8828926 [details] [diff] [review]
return failure from DocAccessibleParent::RecvShow() when appropriate

Review of attachment 8828926 [details] [diff] [review]:
-----------------------------------------------------------------

thanks
Attachment #8828926 - Flags: review?(yzenevich) → review+
Pushed by tsaunders@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/6160d7ce24e8
return failure from DocAccessibleParent::RecvShow() when appropriate r=yzen
Depends on: 1334673
Crash volume for signature 'IPCError-browser | PDocAccessibleParent::AddChildDoc binding to nonexistant proxy!':
 - nightly (version 54): 4196 crashes from 2017-01-23.
 - aurora  (version 53): 777 crashes from 2017-01-23.
 - beta    (version 52): 0 crashes from 2017-01-23.
 - release (version 51): 0 crashes from 2017-01-16.
 - esr     (version 45): 0 crashes from 2016-08-03.

Crash volume on the last weeks (Week N is from 01-30 to 02-05):
            W. N-1  W. N-2  W. N-3  W. N-4  W. N-5  W. N-6  W. N-7
 - nightly    3283
 - aurora      556
 - beta          0
 - release       0       0
 - esr           0       0       0       0       0       0       0

Affected platforms: Windows, Mac OS X, Linux

Crash rank on the last 7 days:
           Browser   Content   Plugin
 - nightly           #2
 - aurora            #2
 - beta
 - release
 - esr
This is still the #5 topcrash in Windows Nightly 20170203094345, with 211 occurrences. tbsaunde, seems like the landed patch didn't fix the problem?
Flags: needinfo?(tbsaunde+mozbugs)
This crash appears to be on pages with heavy flash advertising, or at least mine appear to be. Whereas none on SUMO at all.
Assignee: nobody → tbsaunde+mozbugs
Pushed by tsaunders@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/15d237d39fb8
remove diagnostic for nonexistant proxies
This is ranked #3 in the Windows crashes for Nightly 20170209030214,
with 271 occurrences.
Depends on: 1337983
#9 Windows topcrash in Aurora 20170210004018.
Per meeting today, Trevor will try relanding here.
#4 Windows topcrash in Nightly 20170217030229.
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Flags: needinfo?(tbsaunde+mozbugs)
Keywords: leave-open
Please request Aurora approval on this when you get a chance.
Flags: needinfo?(tbsaunde+mozbugs)
Target Milestone: --- → mozilla54
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Flags: needinfo?(tbsaunde+mozbugs)
Assignee: tbsaunde+mozbugs → jmathies
I think you will find https://crash-stats.mozilla.com/report/index/20f686cd-7da8-4c01-b383-57ae32170225 replaced if for that time frame.  At least that was my experience.

Crashes continued but the id changed.
a BindChildDoc message can race with the parent process shutting down a tab.
The result of that is that RecvBindChildDoc() can be called on a
DocAccessibleParent that has already been shut down by the destruction of the
owning TabParent.
Attachment #8843158 - Flags: review?(eitan)
Attachment #8843158 - Flags: review?(eitan) → review?(yzenevich)
Attachment #8843158 - Flags: review?(yzenevich) → review+
Trevor, it doesn't look like you landed this one yet.
Flags: needinfo?(tbsaunde+mozbugs)
#2 Windows topcrash in Nightly 20170303030202, behind only ShutdownKill crashes.
Keywords: topcrash
Pushed by tsaunders@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/86ba78dc87ce
don't try and bind child docs to a shutdown DocAccessibleParent r=yzen
https://hg.mozilla.org/mozilla-central/rev/86ba78dc87ce
Status: REOPENED → RESOLVED
Closed: 7 years ago7 years ago
Resolution: --- → FIXED
Setting the status for 53 to disabled since e10s+a11y is riding the 54 train IIUC.
Flags: needinfo?(tbsaunde+mozbugs)
Assignee: jmathies → tbsaunde+mozbugs
I'm seeing crashes on nightly build 20170307030205. Not sure if that has the fix...
Crash-stats is making me reopen this. Sorry.

David do we think your work will help here?
Status: RESOLVED → REOPENED
Flags: needinfo?(tbsaunde+mozbugs)
Flags: needinfo?(davidp99)
Resolution: FIXED → ---
I'm still curating some chaos that I wrote while I tracked down test failures but these two patches include the changes I made that are likely relevant to this crash (I think).

This simply quick-outs when AddChildDoc is called on a shutdown document.  Another patch from this bug did the same for BindChildDoc, which is one place where AddChildDoc is called -- but it could also be called from TabParent::RecvPDocAccessibleConstructor.  Given the wreckage that are the call stacks connected to this crash, I cant say if this is coming from TabParent or from BindChildDoc... but its probably still valuable.

I think the next patch is more likely the (main) cause.
Flags: needinfo?(davidp99)
Attachment #8846117 - Flags: review?(tbsaunde+mozbugs)
Attachment #8846119 - Flags: review?(tbsaunde+mozbugs)
...

This fixes a name change that affected non-Windows builds.

Tests:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=af830bd611823817d81e8ea4d916679deadd922f
Attachment #8846127 - Flags: review?(tbsaunde+mozbugs)
Attachment #8846119 - Attachment is obsolete: true
Attachment #8846119 - Flags: review?(tbsaunde+mozbugs)
Rushing never works for me.

Non-windows builds work finally.

Tests:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=8e430f8ff125ca257b37f9451e220805f4b1c227
Attachment #8846127 - Attachment is obsolete: true
Attachment #8846127 - Flags: review?(tbsaunde+mozbugs)
Attachment #8846151 - Flags: review?(tbsaunde+mozbugs)
Comment on attachment 8846151 [details] [diff] [review]
2. Delay sending BindChildDoc and PDocAccessibleConstructor until ready

># HG changeset patch
># User David Parks <dparks@mozilla.com>
># Date 1489180384 28800
>#      Fri Mar 10 13:13:04 2017 -0800
># Node ID 9d2cfccf08bd0385c07668f108ba36c00a1b4775
># Parent  a2fd6b4e698f633196c7342402d36381209ef3f2
>Bug 1332690 - Delay sending BindChildDoc and PDocAccessibleConstructor until ready
>
>Those messages should not be sent to the parent process until the objects they are binding/parenting to have been created in the parent.

I agree with this, but the implementation confuses me.

>       Accessible* parent = childDoc->Parent();
>       DocAccessibleChild* parentIPCDoc = mDocument->IPCDoc();
>       MOZ_DIAGNOSTIC_ASSERT(parentIPCDoc);
>       uint64_t id = reinterpret_cast<uintptr_t>(parent->UniqueID());
>       MOZ_DIAGNOSTIC_ASSERT(id);
>       DocAccessibleChild* ipcDoc = childDoc->IPCDoc();
>+      DocAccessibleChild* parentDoc = childDoc->ParentDocument()->IPCDoc();

this parentDoc really should be the same as parentIPCDoc a few lines up, are you seeing cases where its different?

That said gecko is crazy I suppose it could be different, and yeah that might lead to problems.

>       if (ipcDoc) {
>-        parentIPCDoc->SendBindChildDoc(ipcDoc, id);
>+        parentIPCDoc->BindChildDocInParentProcess(parentDoc, ipcDoc, id);

this extra function seems like unhelpful abstraction, and the third document is highly confusing since there are fundimentally only two docs involved in this operation.

> 
> #if defined(XP_WIN)
>-      parentIPCDoc->ConstructChildDocInParentProcess(ipcDoc, id,
>+      parentIPCDoc->ConstructChildDocInParentProcess(parentDoc, ipcDoc, id,
>                                                      AccessibleWrap::GetChildIDFor(childDoc));

same comment here about three documents, I'm not sure there's really more to say without understanding the parentDoc vs parentIPCDoc question above.
Flags: needinfo?(tbsaunde+mozbugs)
Comment on attachment 8846117 [details] [diff] [review]
1. Do not add children to shutdown doc

>Bug 1332690 - Do not add children to a shutdown accessible doc

hrm, so the only way this can matter is if TabChild::mDestroyed is false, but aParentDoc->mShutdown is true when we recieve a PDocAccessibleConstructor message.

lets consider the possible ways a document could get marked as shutdown.
1. TabParent::DestroyInternal() calls DocAccessibleParent::Destroy() on the root DocAccessibleParent, but clearly that isn't the case here because then the TabParent would have mDestroyed being true and this wouldn't matter.
2. the DocAccessibleParent could have set mShutdown because RecvShutdown() was called, but the one place that calls SendShutdown() immediately clears the DocAccessible's pointer to the DocAccessibleChild, so its not clear how someone  could get hold of the DocAccessibleChild to send a PDocAccessibleConstructor message with it as the parent doc.
3. *waves hands in general direction of ActorDestroy()* possible, but then why are we getting messages involving a destroyed actor?

So I'm not going to say it can't happen, but I don't see how and it seems kind of unlikely.

>DocAccessibleParent::AddChildDoc should (silently) ignore requests if it is already shutdown.

I'm fine with it being silent for now, but since its not clear why its a good thing please don't say it should be that way in the commit message. Lets not confuse people into thinking there was some good reason we didn't write down for doing this.

also might be better to add it to RecvPDocAccessibleConstructor() as the one place that actually needs it, or clean this up, but whatever that can happen another day.

so r=me if you drop the should part from the commit message
Attachment #8846117 - Flags: review?(tbsaunde+mozbugs) → review+
> >       Accessible* parent = childDoc->Parent();
> >       DocAccessibleChild* parentIPCDoc = mDocument->IPCDoc();
> >       MOZ_DIAGNOSTIC_ASSERT(parentIPCDoc);
> >       uint64_t id = reinterpret_cast<uintptr_t>(parent->UniqueID());
> >       MOZ_DIAGNOSTIC_ASSERT(id);
> >       DocAccessibleChild* ipcDoc = childDoc->IPCDoc();
> >+      DocAccessibleChild* parentDoc = childDoc->ParentDocument()->IPCDoc();
> 
> this parentDoc really should be the same as parentIPCDoc a few lines up, are
> you seeing cases where its different?
> 
> That said gecko is crazy I suppose it could be different, and yeah that
> might lead to problems.

to be clear this patch working makes me suspect there are cases where parentDoc != parentIPCDoc, but if that is actually the case it seems like we should be able to do something more principled here.
(In reply to Trevor Saunders (:tbsaunde) from comment #37)
> 
> to be clear this patch working makes me suspect there are cases where
> parentDoc != parentIPCDoc, but if that is actually the case it seems like we
> should be able to do something more principled here.

+1. That looks to me like the real problem.
Assignee: tbsaunde+mozbugs → davidp99
#2 topcrash in Windows Nightly 20170316030211, with 136 occurrences.
(In reply to Alice0775 White from comment #40)

> Reproducible: almost 100% in Nightly55 clean new profile. but Aurora54 not
> crash

if forcible enabled e10s, Aurora54 also crash. bp-f205fc76-5951-4ae2-a0ff-11c502170326
I've just gone back to playing with mochitests.  The bug this patch fixes looks at least to be related to this crash.  Its worth a shot.

The DocAccessibleChild (DAC) delays messages until it receives a ParentCOMProxy message on the root doc-tree node from the parent.  When it gets that message, it drains the delayed message queue on the root.  When a child of the root of the doc-tree needs to delay a message, it posts it to the root's queue with DAC::PushDeferredEvent.  The relevant line is here [1], where that method gets the root node using the DAC's (IPC) Manager().  The problem is that, with PDocAccessibles, the 'management' is on the parent side (DAPs are created by IPDL -- DACs are created with 'new').  IPDL doesn't set a Manager for us (it wouldn't even know what to use) so I added it in the constructor.

This means that the delayed message system hasn't worked for more than 1 level of child documents -- grandchildren were always shafted.

Tests: https://treeherder.mozilla.org/#/jobs?repo=try&revision=bcb84677c7e112a121661b40bf62ccd8ea1a1abd

(To eliminate confusion, I obsoleted the prior 2 patches.)

[1] https://dxr.mozilla.org/mozilla-central/rev/e03e0c60462c775c7558a1dc9d5cf2076c3cd1f9/accessible/ipc/win/DocAccessibleChild.cpp#86
Attachment #8846117 - Attachment is obsolete: true
Attachment #8846151 - Attachment is obsolete: true
Attachment #8846151 - Flags: review?(tbsaunde+mozbugs)
Attachment #8851675 - Flags: review?(tbsaunde+mozbugs)
I hadn't seen Alice's STR in comment 20.  When run with my patch, I don't see the crash in this bug but I hit a different ASSERT in the DocAccessibleParent so this may not be a fix to the issue.  I'll keep investigating, which should be much easier with the STR.
Alice, I'm actually having a hard time reproducing the issue with your STR, even with today's Nightly.  Its clearly timing related so this is probably a machine thing.  Can you take a look with the try build from the test run above:

https://archive.mozilla.org/pub/firefox/try-builds/davidp99@gmail.com-e8b0a7dab2a7310dfc397e072168cf74e1136102/try-win32-debug/

On my end, I'll keep playing with the STR to see if I can repro the crash myself.
Flags: needinfo?(alice0775)
Attached file console.log
(In reply to David Parks (dparks) [:handyman] from comment #45)
> Alice, I'm actually having a hard time reproducing the issue with your STR,
> even with today's Nightly.  Its clearly timing related so this is probably a
> machine thing.  Can you take a look with the try build from the test run
> above:
> 
> https://archive.mozilla.org/pub/firefox/try-builds/davidp99@gmail.com-
> e8b0a7dab2a7310dfc397e072168cf74e1136102/try-win32-debug/
> 
> On my end, I'll keep playing with the STR to see if I can repro the crash
> myself.

The try build crashes. Console out put is attached here.


It may be necessary to add the following extra step to comment#40.
(ATOK2016(Japanese IME) activates a11y in my case)

4. Switch between tabs while page is still loading
5. Repeat from step3
Flags: needinfo?(alice0775)
Thanks.  I'll try that next.

I gave you the link to the wrong build -- that was the debug build (I saw the same assert once... but its not the one we're looking for and wouldnt crash release).  Would you try this build (the release build)?

https://archive.mozilla.org/pub/firefox/try-builds/davidp99@gmail.com-e8b0a7dab2a7310dfc397e072168cf74e1136102/try-win32/
Flags: needinfo?(alice0775)
(In reply to David Parks (dparks) [:handyman] from comment #47)
> Thanks.  I'll try that next.
> 
> I gave you the link to the wrong build -- that was the debug build (I saw
> the same assert once... but its not the one we're looking for and wouldnt
> crash release).  Would you try this build (the release build)?
> 
> https://archive.mozilla.org/pub/firefox/try-builds/davidp99@gmail.com-
> e8b0a7dab2a7310dfc397e072168cf74e1136102/try-win32/

The try build is no longer crashed. It seems to fix the crash.
Flags: needinfo?(alice0775)
> The try build is no longer crashed. It seems to fix the crash.

Joy!  Thanks for checking it (and big thanks for the STR in the first place).
Comment on attachment 8851675 [details] [diff] [review]
1. Set Manager for DocAccessibleChildren

># HG changeset patch
># User David Parks <dparks@mozilla.com>
># Date 1490634223 25200
>#      Mon Mar 27 10:03:43 2017 -0700
># Node ID ad9854adc4307adad5371614dc8219ac7802d70f
># Parent  200182ef115692c4ed2909f1a8beae8a6f19d127
>Bug 1332690 - Assign Manager to DocAccessibleChilds

I think managers would be better english?

>The Manager is set by IPDL for remotely constructed objects but our DocAccessibleChilds are created on the child process side.  Assign a manager in the constructor so that we can find it when needed.

I would probably start that second sentence as "so we need ..."

>-      ipcDoc = new DocAccessibleChild(childDoc);
>+      ipcDoc = new DocAccessibleChild(childDoc,
>+                                      static_cast<dom::TabChild*>(parentIPCDoc->Manager()));

the argument is a IProtocol*  right? so why cast?

> public:
>-  explicit DocAccessibleChild(DocAccessible* aDoc)
>+  explicit DocAccessibleChild(DocAccessible* aDoc, dom::TabChild* aManager)

you can drop the explicit now

> public:
>-  explicit DocAccessibleChild(DocAccessible* aDoc);
>+  explicit DocAccessibleChild(DocAccessible* aDoc, dom::TabChild* aManager);

same
Attachment #8851675 - Flags: review?(tbsaunde+mozbugs) → review+
Tracking this too for 53 - it would be nice to uplift this if the fix works in m-c and if you think it worth the risk to bring into late beta.
Includes tbsaunde's nits.
Attachment #8851675 - Attachment is obsolete: true
Checkin for patch "Set Manager for DocAccessibleChildren (r=tbsaunde)"
Keywords: checkin-needed
Pushed by ryanvm@gmail.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/3c911506e8c7
Assign Managers to DocAccessibleChilds. r=tbsaunde
Keywords: checkin-needed
Hanging a friendly needinfo on dparks as a reminder to request uplift for branch landings.
Flags: needinfo?(davidp99)
https://hg.mozilla.org/mozilla-central/rev/3c911506e8c7
Status: REOPENED → RESOLVED
Closed: 7 years ago7 years ago
Resolution: --- → FIXED
Looks like this is fixed, I don't see any reports after the 20170329071901 build.
\o/ handyman++
Comment on attachment 8852231 [details] [diff] [review]
1. Set Manager for DocAccessibleChildren (r=tbsaunde)

Try spreading the love to beta...

Approval Request Comment
[Feature/Bug causing the regression]:
Formerly #3 Top Windows crasher 

[User impact if declined]:
Many crashes

[Is this code covered by automated tests?]:
Not really.  Most testing is currently not done with a11y turned on.

[Has the fix been verified in Nightly?]:
We've seen the crash signature disappear.

[Needs manual test from QE? If yes, steps to reproduce]: 
Alice has a test in comment 40.  This patch worked in that case.

[List of other uplifts needed for the feature/fix]:
N/A

[Is the change risky?]:
Not more than the present state.

[Why is the change risky/not risky?]:
Previously, we were not sending messages to some documents if they came in too early.  Any danger would come from properly processing those messages, which certainly seems no more dangerous than ignoring them, as we do now.

[String changes made/needed]:
N/A
Flags: needinfo?(davidp99)
Attachment #8852231 - Flags: approval-mozilla-beta?
Target Milestone: mozilla54 → mozilla55
Attachment #8852231 - Flags: approval-mozilla-aurora?
Comment on attachment 8852231 [details] [diff] [review]
1. Set Manager for DocAccessibleChildren (r=tbsaunde)

Fix for a topcrash, let's take this. It should land in time for tomorrow's beta 9 build.
Attachment #8852231 - Flags: approval-mozilla-beta?
Attachment #8852231 - Flags: approval-mozilla-beta+
Attachment #8852231 - Flags: approval-mozilla-aurora?
Attachment #8852231 - Flags: approval-mozilla-aurora+
Flags: qe-verify+
Unfortunately, I wasn't able to reproduce the crash on Nightly 2017-01-20 and 2017-03-27 using the STR in comment 40 and 46.
Alice, could you please verify this on latest Beta?
https://archive.mozilla.org/pub/firefox/candidates/53.0b9-candidates/
Flags: needinfo?(alice0775)
(In reply to Paul Silaghi, QA [:pauly] from comment #63)
> Unfortunately, I wasn't able to reproduce the crash on Nightly 2017-01-20
> and 2017-03-27 using the STR in comment 40 and 46.
> Alice, could you please verify this on latest Beta?
> https://archive.mozilla.org/pub/firefox/candidates/53.0b9-candidates/

I can verify that latest beta does not crash.
Flags: needinfo?(alice0775)
Thanks, Alice.
Flags: qe-verify+
Blocks: 1359129
You need to log in before you can comment on or make changes to this bug.