Closed Bug 1124821 Opened 7 years ago Closed 7 years ago

[e10s] crash in mozilla::a11y::ProxyDestroyed(mozilla::a11y::ProxyAccessible*) when viewing article on forbes.com, if a11y is enabled in Gnome

Categories

(Core :: Disability Access APIs, defect)

All
Linux
defect
Not set
critical

Tracking

()

RESOLVED FIXED
mozilla38
Tracking Status
e10s + ---

People

(Reporter: dholbert, Assigned: tbsaunde)

References

()

Details

(Keywords: crash, crashreportid, Whiteboard: a11y enabled in Gnome via e.g. "gsettings set org.gnome.desktop.interface toolkit-accessibility true")

Crash Data

Attachments

(5 files)

This bug was filed from the Socorro interface and is 
report bp-0c2a1aab-e081-42bf-85a9-4a0ce2150122.
=============================================================

STR:
 0. Have e10s enabled (as it is by default, in Nightly)

 1. Load http://www.forbes.com/sites/sarahjeong/2015/01/22/the-dread-pirates-diary/

 2. Click through the ad splash page (click "Continue to site" at upper-right)

ACTUAL RESULTS:
Crash (or abort, rather?), with crash report like any of these:
bp-a4b13915-0501-4203-9c00-8323a2150122
bp-0c2a1aab-e081-42bf-85a9-4a0ce2150122
bp-d7e9157d-6b01-4a2c-b5e5-a40782150122
bp-398ad307-3b0f-4a59-a499-910882150122
...and with this in my terminal:
{
[16261] ###!!! ABORT: Aborting on channel error.: file /builds/slave/m-cen-l64-ntly-000000000000000/build/src/ipc/glue/MessageChannel.cpp, line 1604
[16261] ###!!! ABORT: Aborting on channel error.: file /builds/slave/m-cen-l64-ntly-000000000000000/build/src/ipc/glue/MessageChannel.cpp, line 1604

###!!! [Child][MessageChannel] Error: Channel error: cannot send/recv

[Child 16054] ###!!! ABORT: constructor for actor failed: file /builds/slave/m-cen-l64-ntly-000000000000000/build/src/obj-firefox/ipc/ipdl/PNeckoChild.cpp, line 313
[Child 16054] ###!!! ABORT: constructor for actor failed: file /builds/slave/m-cen-l64-ntly-000000000000000/build/src/obj-firefox/ipc/ipdl/PNeckoChild.cpp, line 313
}


The page loads fine if I disable e10s.
(This might be a dupe of bug 1099153, which has the same crash signature, but given that that bug's been open for a few months, I'm guessing this is something new & distinct, given that it's an insta-crash on a high-profile site.)
tracking-e10s: --- → ?
Summary: crash in mozilla::a11y::ProxyDestroyed(mozilla::a11y::ProxyAccessible*) when viewing article on forbes.com → [e10s] crash in mozilla::a11y::ProxyDestroyed(mozilla::a11y::ProxyAccessible*) when viewing article on forbes.com
Blocks: e10sa11y2
mozregression gives me this range:
Last good revision: b3f84cf78dc2 (2015-01-09)
First bad revision: bb8d6034f5f2 (2015-01-10)
Pushlog:
https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=b3f84cf78dc2&tochange=bb8d6034f5f2

(I was careful to make sure e10s was enabled at each mozregression step, since it's enabled by default in some nightlies & not in others.)

That pushlog includes mozilla-central changeset 06ea411856f7 from tbsaunde, with commit message:
> no bug - reenable ipc accessibility yet again

This seems likely to be the proximate cause of this crash.


ALSO: for some reason, I can only reproduce this on my Lenovo linux laptop -- not my linux desktop. (using a fresh profile on both, & both machines are running 64-bit Ubuntu 14.04) And on my laptop, the page does occasionally load correctly, though it crashes at least 50% of the time. And occasionally, I'll get the content-process-crashed crash page, instead of a full-browser crash. But if I reload at that point, I'll usually get a full crash.
If I perform the STR on my laptop, I abort from tripping a fatal MOZ_ASSERT:
> Assertion failure: !mOuterDoc, at $SRC/accessible/ipc/ProxyAccessible.cpp:18

Backtrace attached.
(That's the parent process aborting, BTW. The child process aborts shortly after, with "constructor for actor failed", the first child-process error output quoted in comment 0. I'm not sure, but I suspect that's just saying it couldn't reach the parent, because the parent has crashed.)
(In reply to Daniel Holbert [:dholbert] from comment #1)
> (This might be a dupe of bug 1099153, which has the same crash signature,
> but given that that bug's been open for a few months, I'm guessing this is
> something new & distinct, given that it's an insta-crash on a high-profile
> site.)

(following up on this -- per comment 2, it sounds like IPC accessibility was off for a while and was only recently re-enabled, which increases the likelihood that this is really just a dupe of bug 1099153 after all. In any case -- we have a URL here, whereas it looks like we don't on bug 1099153, so maybe we've got something more actionable here than we did when that other bug was originally filed.)
See Also: → 1099153
(Also, as I discovered in bug 1125442 comment 6, my laptop is apparently configured such that mozilla::a11y::ShouldA11yBeEnabled() returns 'true', which means we start the a11y service. This likely explains why I can specifically reproduce this on my laptop, but not on my desktop, as noted at the end of comment 2.)
(In reply to Daniel Holbert [:dholbert] from comment #5)
> (In reply to Daniel Holbert [:dholbert] from comment #1)
> > (This might be a dupe of bug 1099153, which has the same crash signature,
> > but given that that bug's been open for a few months, I'm guessing this is
> > something new & distinct, given that it's an insta-crash on a high-profile
> > site.)
> 
> (following up on this -- per comment 2, it sounds like IPC accessibility was
> off for a while and was only recently re-enabled, which increases the
> likelihood that this is really just a dupe of bug 1099153 after all. In any
> case -- we have a URL here, whereas it looks like we don't on bug 1099153,
> so maybe we've got something more actionable here than we did when that
> other bug was originally filed.)

I hope its the same issue, and more data is always nice.  Unfortunately I'm not having a easy time reproducing this in a consistant way, but I guess I'll keep trying.
Comment on attachment 8554699 [details] [diff] [review]
make shutdown of attached documents more robust

My debug build still aborts with this patch applied, now with "Assertion failure: aIndex < Length() (invalid array index), at ../../dist/include/nsTArray.h:936"

I get a whole bunch of other assertion output before that abort -- pruned, it's basically:
{
> [Parent 23591] ###!!! ASSERTION: Why do we still have a child doc?: '!mOuterDoc', file $SRC/accessible/ipc/ProxyAccessible.cpp, line 18
> [Parent 23591] ###!!! ASSERTION: invalid root being removed!: 'Error', file $SRC/accessible/ipc/DocAccessibleParent.cpp, line 100
> [Parent 23591] ###!!! ASSERTION: no proxy for event!: 'Error', file $SRC/accessible/ipc/DocAccessibleParent.cpp, line 127
[...SNIP -- Last two assertions repeat a bunch...]
> [Parent 23591] ###!!! ASSERTION: adding child to unknown accessible: 'Error', file $SRC/accessible/ipc/DocAccessibleParent.cpp, line 35
> IPDL protocol error: Handler for ShowEvent returned error code
> 
> ###!!! [Parent][DispatchAsyncMessage] Error: (msgtype=0x3E0004,name=PDocAccessible::Msg_ShowEvent) Processing error: message was deserialized, but the handler returned false (indicating failure)
[...SNIP -- last two assertions and "IPDL protocol error:" and "Error" repeat a bunch...]
> [Parent 23591] ###!!! ASSERTION: no proxy for event!: 'Error', file $SRC/accessible/ipc/DocAccessibleParent.cpp, line 127
> ###!!! [Child][MessageChannel::Call] Error: Channel error: cannot send/recv
[...SNIP -- send/recv error repeats a few times...]
> [Parent 23591] ###!!! ASSERTION: why wheren't the child docs destroyed already?: 'mChildDocs.IsEmpty()', file $SRC/accessible/ipc/DocAccessibleParent.cpp, line 161
> Assertion failure: aIndex < Length() (invalid array index), at ../../dist/include/nsTArray.h:936
}
Attachment #8554699 - Flags: feedback?(dholbert) → feedback-
Here's the backtrace of that abort.

Somehow, this loop in DocAccessibleParent::Destroy()...
  uint32_t childDocCount = mChildDocs.Length();
  for (uint32_t i = 0; i < childDocCount; i++)
    mChildDocs[i]->Destroy();
... is ending up out of bounds, at the mChildDocs[i] access.  Perhaps Destroy() removes the child doc from the array?

At the point where we crash, childDocCount = 4, i = 2, and  mChildDocs->mHdr->mLength = 2.
(In reply to Daniel Holbert [:dholbert] from comment #9)
> Comment on attachment 8554699 [details] [diff] [review]
> make shutdown of attached documents more robust
> 
> My debug build still aborts with this patch applied, now with "Assertion
> failure: aIndex < Length() (invalid array index), at
> ../../dist/include/nsTArray.h:936"
> 
> I get a whole bunch of other assertion output before that abort -- pruned,
> it's basically:

yeah, I think there's basically a cascade of errors.  We may need to deal with the initial one if it causes tabs to crash, but lets deal with the array one first.

The assertions about documents not being shutdown already are expected to stay around but as NS_ASSERTIONS because I do consider them bugs, but bugs that we unfortunately need to work around for now.
(In reply to Daniel Holbert [:dholbert] from comment #10)
> Created attachment 8554765 [details]
> backtrace of abort, after applying attachment 8554699 [details] [diff] [review]
> [review] and loading forbes URL
> 
> Here's the backtrace of that abort.
> 
> Somehow, this loop in DocAccessibleParent::Destroy()...
>   uint32_t childDocCount = mChildDocs.Length();
>   for (uint32_t i = 0; i < childDocCount; i++)
>     mChildDocs[i]->Destroy();
> ... is ending up out of bounds, at the mChildDocs[i] access.  Perhaps
> Destroy() removes the child doc from the array?


oops it does.  New patch comming
Comment on attachment 8554808 [details] [diff] [review]
make shutdown of attached documents more robust

With this patch, this article just ends up turning into a "Tab crashed" page -- it doesn't take down my whole browser anymore.

I'm not sure if that's what you were going for, but it's a step in the right direction. :)

The crash happens shortly after the "why wheren't the child docs destroyed already" assertion quoted at the end of comment 9.
Attachment #8554808 - Flags: feedback?(dholbert) → feedback+
(In reply to Daniel Holbert [:dholbert] from comment #13)
> Comment on attachment 8554808 [details] [diff] [review]
> make shutdown of attached documents more robust
> 
> With this patch, this article just ends up turning into a "Tab crashed" page
> -- it doesn't take down my whole browser anymore.
> 
> I'm not sure if that's what you were going for, but it's a step in the right
> direction. :)

I'd like to fix the tab crashing part too, but I agree the parent not crashing is an improvement.

> The crash happens shortly after the "why wheren't the child docs destroyed
> already" assertion quoted at the end of comment 9.

does it crash because an IPC message handler returns false, or does it actually crash?
(In reply to Trevor Saunders (:tbsaunde) from comment #14)
> does it crash because an IPC message handler returns false, or does it
> actually crash?

I think the former, though I'm not 100% sure.

I do get a pile of "Processing error: message was deserialized, but the handler returned false" assertions, as quoted in comment 9. And I don't see a backtrace printed to my terminal when the child process dies -- just something like this:
{
[Parent 8220] WARNING: '!aObserver', file /scratch/work/builds/mozilla-central/mozilla-central.13-11-22.16-03/mozilla/xpcom/ds/nsObserverService.cpp, line 283
nsStringStats
 => mAllocCount:            129
 => mReallocCount:            1
 => mFreeCount:             129
 => mShareCount:            232
 => mAdoptCount:              0
 => mAdoptFreeCount:          0
 => Process ID: 8379, Thread ID: 140383081974336
}
which could (?) represent an intentional quit from the child process.
(Also, FWIW: as of today I can now reproduce this on my linux desktop at the office (where I formerly couldn't), and I've confirmed that  bug 1125442 comment 5 reproduces on this machine as well. Presumably there was some gnome update that turned on accessibility or something, as Trevor suggested over on that other bug, and I just received that update on my laptop first, which is why I hit these things there first.)
(In reply to Daniel Holbert [:dholbert] from comment #15)
> (In reply to Trevor Saunders (:tbsaunde) from comment #14)
> > does it crash because an IPC message handler returns false, or does it
> > actually crash?
> 
> I think the former, though I'm not 100% sure.
> 
> I do get a pile of "Processing error: message was deserialized, but the
> handler returned false" assertions, as quoted in comment 9. And I don't see
> a backtrace printed to my terminal when the child process dies -- just
> something like this:
> {
> [Parent 8220] WARNING: '!aObserver', file
> /scratch/work/builds/mozilla-central/mozilla-central.13-11-22.16-03/mozilla/
> xpcom/ds/nsObserverService.cpp, line 283
> nsStringStats
>  => mAllocCount:            129
>  => mReallocCount:            1
>  => mFreeCount:             129
>  => mShareCount:            232
>  => mAdoptCount:              0
>  => mAdoptFreeCount:          0
>  => Process ID: 8379, Thread ID: 140383081974336
> }
> which could (?) represent an intentional quit from the child process.

I'm not really sure either then, but I'd guess ipc message failed.  Can you tell which was the first of those messages?
(In reply to Trevor Saunders (:tbsaunde) from comment #17)
> I'm not really sure either then, but I'd guess ipc message failed.  Can you
> tell which was the first of those messages?

Yeah -- I put a breakpoint at MessageChannel.cpp:1233 (where we print the error for "handler returned false"). When I hit that (for the first time after loading this URL), aMsg.name_ is "PDocAccessible::Msg_ShowEvent".

(I'm not 100% sure that this answers your question, but I think it does)

This happens after a big pile of assertion/error output, the most recent of which is:
{
[Parent 5123] ###!!! ASSERTION: invalid root being removed!: 'Error', file $SRC/accessible/ipc/DocAccessibleParent.cpp, line 106
[Parent 5123] ###!!! ASSERTION: no proxy for event!: 'Error', file $SRC/accessible/ipc/DocAccessibleParent.cpp, line 133
[Parent 5123] ###!!! ASSERTION: adding child to unknown accessible: 'Error', file $SRC/accessible/ipc/DocAccessibleParent.cpp, line 38
IPDL protocol error: Handler for ShowEvent returned error code
}

Note the "IPDL protocol error: Handler for ShowEvent returned error code" error-message there, too, which indicates the handler that's having trouble here. I placed a breakpoint at that, and I think I might have found the (or a) bug. More on that in next comment.
So the "Handler for ShowEvent returned error code" message comes after RecvShowEvent fails here, in PDocAccessibleParent.cpp (generated in my objdir):
>            if ((!(RecvShowEvent(mozilla::Move(data))))) {
>                mozilla::ipc::ProtocolErrorBreakpoint("Handler for ShowEvent returned error code");
>                return MsgProcessingError;
>            }

The "RecvShowEvent" function is here:
  http://mxr.mozilla.org/mozilla-central/source/accessible/ipc/DocAccessibleParent.cpp#16
and it returns type 'bool', and yet its final return statement is "return consumed", which is a uint32_t. That seems like a bug. (It's not clear to me that "consumed == 0" is really an error condition here. If it is, we should probably "return consumed != 0;" instead of "return consumed;", to make the bool conversion more intentional/explicit.

I had a hunch that this was what was responsible for RecvShowEvent failing, but I was wrong -- we're actually taking this early-return-false:
> 32   // XXX This should really never happen, but sometimes we fail to fire the
> 33   // required show events.
> 34   if (!parent) {
> 35     NS_ERROR("adding child to unknown accessible");
> 36     return false;
> 37   }
http://mxr.mozilla.org/mozilla-central/source/accessible/ipc/DocAccessibleParent.cpp#32
(tbsaunde, I'm tempted to let you take it from here, if you can reproduce, since I don't really know this code at all, and async debugging over bugzilla is not very efficient. Have you been able to reproduce, or are you still having trouble per comment 7?)
(In reply to Daniel Holbert [:dholbert] from comment #19)
> So the "Handler for ShowEvent returned error code" message comes after
> RecvShowEvent fails here, in PDocAccessibleParent.cpp (generated in my
> objdir):
> >            if ((!(RecvShowEvent(mozilla::Move(data))))) {
> >                mozilla::ipc::ProtocolErrorBreakpoint("Handler for ShowEvent returned error code");
> >                return MsgProcessingError;
> >            }
> 
> The "RecvShowEvent" function is here:
>  
> http://mxr.mozilla.org/mozilla-central/source/accessible/ipc/
> DocAccessibleParent.cpp#16
> and it returns type 'bool', and yet its final return statement is "return
> consumed", which is a uint32_t. That seems like a bug. (It's not clear to me
> that "consumed == 0" is really an error condition here. If it is, we should
> probably "return consumed != 0;" instead of "return consumed;", to make the
> bool conversion more intentional/explicit.

yeah, return consumed is correct (though apparently more confusing than I found it writing this)

> I had a hunch that this was what was responsible for RecvShowEvent failing,
> but I was wrong -- we're actually taking this early-return-false:
> > 32   // XXX This should really never happen, but sometimes we fail to fire the
> > 33   // required show events.
> > 34   if (!parent) {

ok, so this still sometimes fails which is unfun, but its good to have a test case.

(In reply to Daniel Holbert [:dholbert] from comment #20)
> (tbsaunde, I'm tempted to let you take it from here, if you can reproduce,
> since I don't really know this code at all, and async debugging over
> bugzilla is not very efficient. Have you been able to reproduce, or are you
> still having trouble per comment 7?)

I haven't had useful luck debugging this yet, but if you don't have rr setup you should absolutely stop at this point trying to debug from this point with gdb will be miserable.  I believe all I need to see is where the accessible with address aData.ID() is created, but maybe that won't quiet be it so its probably best I find a way to reproduce this (I suspect part of the problem with reproducing is that the a11y ipc stuff is far from complete so its hard for me to use the browser, but that can be worked around).
Comment on attachment 8554808 [details] [diff] [review]
make shutdown of attached documents more robust

maybe smaug wants to learn this stuff better by reviewing it ;) if not punting is fine.

We need to handle these cases correctly because a misbehaving content process can cause them.  While a well behaved content process won't trip them there are currently some bugs in core a11y code that do trip them so fatal asserts don't really help.
Attachment #8554808 - Flags: review?(bugs)
(In reply to Trevor Saunders (:tbsaunde) from comment #21)
> yeah, return consumed is correct (though apparently more confusing than I
> found it writing this)

Cool, thanks for confirming -- so that was a red herring. I filed bug 1126541 on clarifying that return statement to make it less suspicious-looking.

> I haven't had useful luck debugging this yet, but if you don't have rr setup
> you should absolutely stop at this point trying to debug from this point
> with gdb will be miserable.

Heh, ok. I don't have rr set up, but I probably should get it running at some point, now that it supports 64-bit systems (IIRC). Anyway, I'll stop worrying about this bug for now, but feel free to needinfo me down the line if you still can't repro & need me to poke at someting.
(In reply to Trevor Saunders (:tbsaunde) from comment #22)
> Comment on attachment 8554808 [details] [diff] [review]
> make shutdown of attached documents more robust
> 
> maybe smaug wants to learn this stuff better by reviewing it ;) if not
> punting is fine.

right, it'd be cool to see Olli on-board.
So comment 13 says the patch doesn't fix the child side crash, so why to review the patch yet?

I can't reproduce this on linux.

But looking at the code anyhow.
ok, just enabling a11y in FF isn't enough. Gnome a11y needs to be enabled too.
...via something like "gsettings set org.gnome.desktop.interface toolkit-accessibility true" -- that's the only a11y thing I had enabled on my system (without realizing it) when I hit this, at least.
Summary: [e10s] crash in mozilla::a11y::ProxyDestroyed(mozilla::a11y::ProxyAccessible*) when viewing article on forbes.com → [e10s] crash in mozilla::a11y::ProxyDestroyed(mozilla::a11y::ProxyAccessible*) when viewing article on forbes.com, if a11y is enabled in Gnome
Whiteboard: a11y enabled in Gnome via e.g. "gsettings set org.gnome.desktop.interface toolkit-accessibility true"
So with the patch I got a child process crash, but that one is a necko crash.
Comment on attachment 8554808 [details] [diff] [review]
make shutdown of attached documents more robust

> a11y::ProxyCreated(ProxyAccessible* aProxy, uint32_t aInterfaces)
> {
>   GType type = GetMaiAtkType(GetInterfacesForProxy(aProxy, aInterfaces));
>   NS_ASSERTION(type, "why don't we have a type!");
> 
>   AtkObject* obj =
>     reinterpret_cast<AtkObject *>
>     (g_object_new(type, nullptr));
>-  if (!obj)
>-    return;
unrelated change.

> DocAccessibleParent::RecvShowEvent(const ShowEventData& aData)
> {
>+  if (mShutdown)
>+    return true;
I don't know a11y coding style rules, but I'd say {} should be used with 'if'


> DocAccessibleParent::RecvHideEvent(const uint64_t& aRootID)
> {
>+  if (mShutdown)
>+    return true;
ditto


> DocAccessibleParent::Destroy()
> {
>-  MOZ_ASSERT(mChildDocs.IsEmpty(),
>-      "why wheren't the child docs destroyed already?");
>+  NS_ASSERTION(mChildDocs.IsEmpty(),
>+               "why wheren't the child docs destroyed already?");
s/wheren't/weren't/

>+  uint32_t childDocCount = mChildDocs.Length();
>+  for (uint32_t i = childDocCount - 1; i < childDocCount; i--)
>+    mChildDocs[i]->Destroy();
>+

{} with 'for'.


So this looks hackish to me, but looks like ProxyAccessible::Shutdown() isn't any better.
So, fine for now.

And I guess separate fix for the possible child process crash.
Attachment #8554808 - Flags: review?(bugs) → review+
(In reply to Olli Pettay [:smaug] from comment #29)
> > DocAccessibleParent::Destroy()
> > {
> >-  MOZ_ASSERT(mChildDocs.IsEmpty(),
> >-      "why wheren't the child docs destroyed already?");
> >+  NS_ASSERTION(mChildDocs.IsEmpty(),
> >+               "why wheren't the child docs destroyed already?");
> s/wheren't/weren't/

(FWIW: I fixed this typo in the existing assertion-message in bug 1110510 comment 10 yesterday, though we need to make sure this patch doesn't add back the typo. :) Sorry if that one-off causes bitrot here.)
> 
> > DocAccessibleParent::RecvShowEvent(const ShowEventData& aData)
> > {
> >+  if (mShutdown)
> >+    return true;
> I don't know a11y coding style rules, but I'd say {} should be used with 'if'

302 surkov if you don't like it ;-)

> >+  uint32_t childDocCount = mChildDocs.Length();
> >+  for (uint32_t i = childDocCount - 1; i < childDocCount; i--)
> >+    mChildDocs[i]->Destroy();
> >+
> 
> {} with 'for'.
> 
> 
> So this looks hackish to me, but looks like ProxyAccessible::Shutdown()
> isn't any better.
> So, fine for now.

why? if we're being shutdown doesn't it make sense we shutdown the child documents?


> 
> And I guess separate fix for the possible child process crash.

yeah, we need to handle trees of actors getting shutdown which this patch does irespective of fixing the child issue.
make shutdown of attached documents more robust
https://hg.mozilla.org/mozilla-central/rev/c46b0a95c754
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla38
I spun off bug 1128751 for the remaining content-process crash here, FWIW.
Blocks: 1128751
Duplicate of this bug: 1138700
Assignee: nobody → tbsaunde+mozbugs
You need to log in before you can comment on or make changes to this bug.