Closed Bug 1330484 Opened 3 years ago Closed 3 years ago

crash near NULL [mozilla::a11y::NotificationController::WillRefresh]

Categories

(Core :: Disability Access APIs, defect, critical)

defect
Not set
critical

Tracking

()

RESOLVED FIXED
mozilla55
Tracking Status
firefox-esr52 --- unaffected
firefox53 --- disabled
firefox54 --- disabled
firefox55 + fixed

People

(Reporter: tsmith, Assigned: handyman)

References

(Blocks 1 open bug)

Details

(Keywords: crash, csectype-nullptr, testcase, Whiteboard: aes+)

Crash Data

Attachments

(4 files, 3 obsolete files)

No description provided.
Bah, darn trackpad submitted before I was done.

==8687==ERROR: AddressSanitizer: SEGV on unknown address 0x000000000102 (pc 0x7fc5ba03d0fa bp 0x7ffdc31a0210 sp 0x7ffdc319fca0 T0)
    #0 0x7fc5ba03d0f9 in mozilla::a11y::NotificationController::WillRefresh(mozilla::TimeStamp) /home/worker/workspace/build/src/accessible/base/NotificationController.cpp:585
    #1 0x7fc5b8e6d8e6 in nsRefreshDriver::Tick(long, mozilla::TimeStamp) /home/worker/workspace/build/src/layout/base/nsRefreshDriver.cpp:1796:7
    #2 0x7fc5b8e7b575 in mozilla::RefreshDriverTimer::TickRefreshDrivers(long, mozilla::TimeStamp, nsTArray<RefPtr<nsRefreshDriver> >&) /home/worker/workspace/build/src/layout/base/nsRefreshDriver.cpp:295:7
    #3 0x7fc5b8e7b1da in mozilla::RefreshDriverTimer::Tick(long, mozilla::TimeStamp) /home/worker/workspace/build/src/layout/base/nsRefreshDriver.cpp:316:5
    #4 0x7fc5b8e7d8de in applyImpl<mozilla::VsyncRefreshDriverTimer::RefreshDriverVsyncObserver, void (mozilla::VsyncRefreshDriverTimer::RefreshDriverVsyncObserver::*)(mozilla::TimeStamp), StoreCopyPassByConstLRef<mozilla::TimeStamp> , 0> /home/worker/workspace/build/src/obj-firefox/dist/include/nsThreadUtils.h:768:12
    #5 0x7fc5b8e7d8de in apply<mozilla::VsyncRefreshDriverTimer::RefreshDriverVsyncObserver, void (mozilla::VsyncRefreshDriverTimer::RefreshDriverVsyncObserver::*)(mozilla::TimeStamp)> /home/worker/workspace/build/src/obj-firefox/dist/include/nsThreadUtils.h:774
    #6 0x7fc5b8e7d8de in mozilla::detail::RunnableMethodImpl<void (mozilla::VsyncRefreshDriverTimer::RefreshDriverVsyncObserver::*)(mozilla::TimeStamp), true, false, mozilla::TimeStamp>::Run() /home/worker/workspace/build/src/obj-firefox/dist/include/nsThreadUtils.h:803
    #7 0x7fc5b255142b in nsThread::ProcessNextEvent(bool, bool*) /home/worker/workspace/build/src/xpcom/threads/nsThread.cpp:1240:7
    #8 0x7fc5b25d52fc in NS_ProcessNextEvent(nsIThread*, bool) /home/worker/workspace/build/src/xpcom/glue/nsThreadUtils.cpp:390:10
    #9 0x7fc5b338d2df in mozilla::ipc::MessagePump::Run(base::MessagePump::Delegate*) /home/worker/workspace/build/src/ipc/glue/MessagePump.cpp:96:21
    #10 0x7fc5b32fb2a8 in RunInternal /home/worker/workspace/build/src/ipc/chromium/src/base/message_loop.cc:238:3
...
see log.txt
Blocks: grizzly
Severity: normal → critical
Attached file test_case.html (obsolete) —
(In reply to Tyson Smith [:tsmith] from comment #2)
> Created attachment 8826025 [details]
> test_case.html

That doesn't seem to crash for me even if I let it run for a while.  Any particular str?
Requires fuzzPriv extension to reproduce:
https://github.com/MozillaSecurity/funfuzz/tree/master/dom/extension

This seems to be timing related. e10s was disabled.
See Also: → 1325258
Attached file test_case.html (obsolete) —
Try to make the test case more reliable.

I can still repro very quickly on ASan opt builds with e10s disabled.

https://tools.taskcluster.net/index/artifacts/#gecko.v2.mozilla-central.latest.firefox/gecko.v2.mozilla-central.latest.firefox.linux64-asan-opt
Attached file test_case.html
I missed marking the old one as obsolete.
Attachment #8826025 - Attachment is obsolete: true
Attachment #8837737 - Attachment is obsolete: true
No luck on debug asan build with e10s disabled for me as well.

Could you probably confirm where the problem exactly, since line numbers from the original report seems out of dated. Is it crashed because nsRefreshDriver keeps a dead pointer to NotificationController, or is it crashed somewhere inside NotificationController::WillRefresh?
(In reply to alexander :surkov from comment #7)
> No luck on debug asan build with e10s disabled for me as well.

Please use an ASan opt build not debug.

> Could you probably confirm where the problem exactly, since line numbers
> from the original report seems out of dated. Is it crashed because
> nsRefreshDriver keeps a dead pointer to NotificationController, or is it
> crashed somewhere inside NotificationController::WillRefresh?

I'll attached an updated log.
Crash Signature: [@ mozilla::a11y::NotificationController::WillRefresh]
#9 Nightly topcrash in Nightly 20170224030232.

tbsaunde, this signature was put into bug 1341731 but perhaps it shouldn't have been.l
Flags: needinfo?(tbsaunde+mozbugs)
#6 Windows topcrash in Nightly 20170303030202.
Alex, ping. Comment 9 provided a log.
Flags: needinfo?(tbsaunde+mozbugs) → needinfo?(surkov.alexander)
Attached patch stop bleeding (obsolete) — Splinter Review
more debugging is needed, here's a crash stopper
Assignee: nobody → surkov.alexander
Flags: needinfo?(surkov.alexander)
Attachment #8850610 - Flags: review?(yzenevich)
Comment on attachment 8850610 [details] [diff] [review]
stop bleeding

Review of attachment 8850610 [details] [diff] [review]:
-----------------------------------------------------------------

looks reasonable. thanks
Attachment #8850610 - Flags: review?(yzenevich) → review+
leave open until we figure out what's going on there; the patch has a debug assert for easier crash catching
Keywords: leave-open
We're hitting this assert a lot:
https://hg.mozilla.org/mozilla-central/annotate/45692c884fdd/accessible/base/NotificationController.cpp#l861

Is that relevant to this bug or should we file a new report?
Flags: needinfo?(tbsaunde+mozbugs)
I'm not sure I've seen a good stack for this, so I'm not sure.  Can't hurt to fix it though.
Flags: needinfo?(tbsaunde+mozbugs)
Assignee: surkov.alexander → tbsaunde+mozbugs
Flags: needinfo?(tbsaunde+mozbugs)
For recent builds this is #3 on nightly and should be given commensurate attention.
Still #3 Windows topcrash on Nightly 20170421030241.

tbsaunde, any updates?
[Tracking Requested - why for this release]:

#3 topcrash on Nightly.
I tried reproducing some on monday and tuesday, but didn't have any luck.  I've been pto since then.
Flags: needinfo?(tbsaunde+mozbugs)
Tracking 55+ as this is a top crash.
(In reply to Jim Mathies [:jimm] from comment #20)
> - High crash rate in Nightly builds (both 54 and 55) (diagnostic assert)
> - Startup crash
> - Linux and Windows

> https://hg.mozilla.org/mozilla-central/annotate/201231223cd4/accessible/base/
> NotificationController.cpp#l858

it feels different from the original bug, it might be worth to have another bug open for that

(In reply to Trevor Saunders (:tbsaunde) from comment #24)
> I tried reproducing some on monday and tuesday, but didn't have any luck. 

(In reply to Marcia Knous [:marcia - use ni] from comment #25)
> Tracking 55+ as this is a top crash.

would it be reasonable to add a null check for parentIPCDoc then and replace diagnostic_assert or ordinal assert?
Depends on: 1359129
#5 Windows topcrash in Nightly 20170428030259.
Reminder for Trevor to see if we can catch this in try.
Flags: needinfo?(tbsaunde+mozbugs)
Trevor I think this was hit here: https://treeherder.mozilla.org/logviewer.html#?job_id=98491161&repo=try&lineNumber=16399
In reftest/tests/layout/reftests/forms/legend/shadow-dom.html
I didn't have much luck triggering this in windows mochitest runs, ended up triggering a number of other asserts though:

test run:
https://treeherder.mozilla.org/logviewer.html#?job_id=98394381&repo=try&lineNumber=8494

errors:
* [Child 5656] ###!!! ASSERTION: We must reach document accessible implementing text interface!: 'Not Reached'
* [Parent 5464] ###!!! ASSERTION: no outer doc for accessible remote tab!: 'outerDoc'
* PROCESS-CRASH | dom/html/test/test_fullscreen-api.html | application crashed [@ mozilla::a11y::DocAccessibleParent::RemoveChildDoc(mozilla::a11y::DocAccessibleParent *)]

test run: https://treeherder.mozilla.org/logviewer.html#?job_id=98394416&repo=try&lineNumber=35666

errors:
* Child 3452] ###!!! ASSERTION: We must reach document accessible implementing text interface!: 'Not Reached'
This seems like it should be fixable. It's a startup crash for one thing, which should provide some clues.

typical crash:
https://crash-stats.mozilla.com/report/index/7b318eb1-2e19-40ae-8b00-e03d10170517

code:
https://hg.mozilla.org/mozilla-central/annotate/b8e9b674033b/accessible/base/NotificationController.cpp#l857

I looked at DocAccessible's IPCDoc() to see how it could be null - 

http://searchfox.org/mozilla-central/search?q=SetIPCDoc&path=

We set this in three places, each of which is during some form of tear down - 

// found in mozilla::a11y::DocAccessibleChildBase::~DocAccessibleChildBase
// found in mozilla::a11y::DocAccessibleChildBase::ActorDestroy
// found in mozilla::a11y::DocAccessibleChildBase::DetachDocument

Seems to me that when we run across this in NotificationController, this object is in the process of getting torn down. When we do have this pointer, looks like we are setting up a new child doc so that we can call SendBindChildDoc on it. Can we bail early here like we do for the check of Defunct just above this code block? 

https://hg.mozilla.org/mozilla-central/annotate/b8e9b674033b/accessible/base/NotificationController.cpp#l851
Flags: needinfo?(tbsaunde+mozbugs)
Assignee: tbsaunde+mozbugs → davidp99
This is the #7 Windows topcrash in Nightly 20170519030205.
Still no STR or anything like that.  We were going to add crash-stats to learn more about the conditions of the bug but I ended up resolving most of my questions manually.  For example, I am convinced that the missing IPC actor at the root of this bug must have existed at _some_ point -- to be in this part of a DocAccessible's NotificationController's timer handler, the DocAccessible would need to have had a parent [1][2] (enforced by state machine), and would have generated its actor in the same call where it got that parent [3].

Instead, I think this patch works to avoid an edge condition.  I'm not 100% that this is happening but I do believe its safe and is a good candidate.

A loose example of what could be the cause:

1. A DocAccessible DA1 binds as parent of another DA2.  In the process, DA1 and DA2 obtain DocAccessibleChild IPC objects DAC1 and DAC2.
2. DA2 queues a request to bind as parent of a third document, DA3.
3. DAC2 (somehow) gets an ActorDestroy message, causing it to remove itself.  Now DAC2 has no actor.  
4. DA2's NotificationController processes the queued bind request from step 2, hitting the assert.

---

I shutdown the child document that failed to bind because it is now improperly represented remotely.  This is what we currently do when the binding fails for other reasons.

[1] https://dxr.mozilla.org/mozilla-central/rev/5b74bbf20e803e299790d266fc6ebf5d53b7a1b7/accessible/base/NotificationController.cpp#745
[2] https://dxr.mozilla.org/mozilla-central/rev/5b74bbf20e803e299790d266fc6ebf5d53b7a1b7/accessible/base/NotificationController.cpp#615
[3] https://dxr.mozilla.org/mozilla-central/rev/5b74bbf20e803e299790d266fc6ebf5d53b7a1b7/accessible/base/NotificationController.cpp#867
Attachment #8870093 - Flags: review?(aklotz)
Apropos of nothing except that I ran into this a few times while working on this bug.  The patch removes an assert that is invalid.  Here's an example where it trips:

RecvHideEvent is received on a DocAccessibleParent DAP1 that has a child document DAP2... that in turn has a child document DAP3.  DAP1::RecvHideEvent calls Shutdown on its root ProxyAccessible.  This recursively calls Shutdown on child ProxyAccessibles until it reaches the one that is the parent of the child document DAP2.  ProxyAccessibleBase::Shutdown then attempts to Unbind DAP2, which calls DAP1::RemoveChildDoc, which trips the assert because DAP2 has child document DAP3.

---

I traced the assert back to the earliest a11y e10s work (bug 982842).  The assert really says that RemoveChildDoc can only be used as part of DocAccessibleParent Shutdown, which it initially was but is not now -- here it is being used by DocAccessibleParent::Unbind as part of a normal accessible element's Shutdown.  I dont think the assert condition is required _but_ the code involved [1] suggests that the issue may have been introduced from a hack.  It also suggests that no one may really know the right answer.  Quoting ProxyAccessibleBase::Shutdown:

> // XXX Ideally  this wouldn't be necessary, but it seems OuterDoc accessibles
> // can be destroyed before the doc they own.

We could change the code to e.g. recursively unbind child documents in post-order instead of just unbinding the immediate child so that the condition is met but I don't know that unbinding those documents is the right result.

[1] https://dxr.mozilla.org/mozilla-central/rev/9851fcb0bf4d855c36729d7de19f0fa5c9f69776/accessible/ipc/ProxyAccessibleBase.cpp#34
Attachment #8870102 - Flags: review?(aklotz)
Comment on attachment 8850610 [details] [diff] [review]
stop bleeding

>diff --git a/accessible/base/NotificationController.cpp b/accessible/base/NotificationController.cpp
>--- a/accessible/base/NotificationController.cpp
>+++ b/accessible/base/NotificationController.cpp
>@@ -750,18 +750,21 @@ NotificationController::WillRefresh(mozi
> 
>         outerDocAcc->RemoveChild(childDoc);
>       }
> 
>       // Failed to bind the child document, destroy it.
>       childDoc->Shutdown();
>     }
>   }
>-
>   mHangingChildDocuments.Clear();
>+  MOZ_ASSERT(mDocument, "Illicit document shutdown");
>+  if (!mDocument) {
>+    return;
>+  }
> 
>   // If the document is ready and all its subdocuments are completely loaded
>   // then process the document load.
>   if (mDocument->HasLoadState(DocAccessible::eReady) &&
>       !mDocument->HasLoadState(DocAccessible::eCompletelyLoaded) &&
>       hangingDocCnt == 0) {
>     uint32_t childDocCnt = mDocument->ChildDocumentCount(), childDocIdx = 0;
>     for (; childDocIdx < childDocCnt; childDocIdx++) {
Attachment #8850610 - Attachment is obsolete: true
Attachment #8870093 - Flags: review?(aklotz) → review+
Comment on attachment 8870102 [details] [diff] [review]
2. Part 2: Remove invalid assert that checked that removed a11y child documents have no children

r+ing these for aklotz, Aaron if you have any additional feedback dparks can address in follow ups.
Attachment #8870102 - Flags: review?(aklotz) → review+
Pushed by ryanvm@gmail.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/1257d5a66ed0
Part 1: Detect and Shutdown binding accessibles that have lost their actors. r=jimm
https://hg.mozilla.org/integration/mozilla-inbound/rev/d2c50b0bab73
Part 2: Remove invalid assert that checked that removed a11y child documents have no children. r=jimm
Keywords: checkin-needed
With mozilla-central 20170524-37d777d87200 and the attached test case I can trigger the following assertion, Should this be a separate bug?

Assertion failure: mDocument (Illicit document shutdown), at src/accessible/base/NotificationController.cpp:768

==58708==ERROR: AddressSanitizer: SEGV on unknown address 0x000000000000 (pc 0x7f8eb94de661 bp 0x7ffffdb4ceb0 sp 0x7ffffdb4c8e0 T0)
==58708==The signal is caused by a WRITE memory access.
==58708==Hint: address points to the zero page.
    #0 0x7f8eb94de660 in mozilla::a11y::NotificationController::WillRefresh(mozilla::TimeStamp) src/accessible/base/NotificationController.cpp:705:75
    #1 0x7f8eb736b35f in nsRefreshDriver::Tick(long, mozilla::TimeStamp) src/layout/base/nsRefreshDriver.cpp:1790:12
    #2 0x7f8eb737414e in mozilla::RefreshDriverTimer::TickRefreshDrivers(long, mozilla::TimeStamp, nsTArray<RefPtr<nsRefreshDriver> >&) src/layout/base/nsRefreshDriver.cpp:300:7
    #3 0x7f8eb7373f1d in mozilla::RefreshDriverTimer::Tick(long, mozilla::TimeStamp) src/layout/base/nsRefreshDriver.cpp:321:5
    #4 0x7f8eb7377d65 in mozilla::VsyncRefreshDriverTimer::RunRefreshDrivers(mozilla::TimeStamp) src/layout/base/nsRefreshDriver.cpp:753:5
    #5 0x7f8eb7376886 in mozilla::VsyncRefreshDriverTimer::RefreshDriverVsyncObserver::TickRefreshDriver(mozilla::TimeStamp) src/layout/base/nsRefreshDriver.cpp:666:35
    #6 0x7f8eb7372537 in mozilla::VsyncRefreshDriverTimer::RefreshDriverVsyncObserver::ParentProcessVsyncNotifier::Run() src/layout/base/nsRefreshDriver.cpp:512:20
    #7 0x7f8eb1db4ff1 in nsThread::ProcessNextEvent(bool, bool*) src/xpcom/threads/nsThread.cpp:1302:14
    #8 0x7f8eb1dbf1a0 in NS_ProcessNextEvent(nsIThread*, bool) src/xpcom/threads/nsThreadUtils.cpp:393:10
    #9 0x7f8eb28e9235 in mozilla::ipc::MessagePump::Run(base::MessagePump::Delegate*) src/ipc/glue/MessagePump.cpp:96:21
    #10 0x7f8eb28361d7 in MessageLoop::RunInternal() src/ipc/chromium/src/base/message_loop.cc:238:10
    #11 0x7f8eb2836069 in MessageLoop::Run() src/ipc/chromium/src/base/message_loop.cc:211:3
    #12 0x7f8eb6e9a07a in nsBaseAppShell::Run() src/widget/nsBaseAppShell.cpp:156:27
    #13 0x7f8eb9a7a471 in nsAppStartup::Run() src/toolkit/components/startup/nsAppStartup.cpp:283:30
    #14 0x7f8eb9bd3a52 in XREMain::XRE_mainRun() src/toolkit/xre/nsAppRunner.cpp:4568:22
    #15 0x7f8eb9bd568b in XREMain::XRE_main(int, char**, mozilla::BootstrapConfig const&) src/toolkit/xre/nsAppRunner.cpp:4748:8
    #16 0x7f8eb9bd6578 in XRE_main(int, char**, mozilla::BootstrapConfig const&) src/toolkit/xre/nsAppRunner.cpp:4843:21
    #17 0x4eca88 in do_main(int, char**, char**) src/browser/app/nsBrowserApp.cpp:236:22
    #18 0x4ec3a0 in main src/browser/app/nsBrowserApp.cpp:309:16
    #19 0x7f8ece87482f in __libc_start_main /build/glibc-9tT8Do/glibc-2.23/csu/../csu/libc-start.c:291
    #20 0x41e0d4 in _start (m-c-1495643560-asan-debug/firefox+0x41e0d4)
Flags: in-testsuite?
This looks related to me.  We can leave it in this bug for now.  I want to see what happens with these patches for another day or so before deciding what to do about this.

The assert would seem to suggest that the DocAccessible tree is developing a loop...!
fixed in the 5-23 build.
Status: NEW → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
See Also: → 1367799
Is it possible to land the attached test in a useful fashion?
Flags: needinfo?(davidp99)
Keywords: leave-open
Target Milestone: --- → mozilla55
Would've been great but unfortunately no.  I actually haven't seen the test case cause this crash or the issue Tyson brought up in comment 42.  Its very timing dependent, which is why the patches were kind of a stab in the dark to begin with.
Flags: needinfo?(davidp99)
You need to log in before you can comment on or make changes to this bug.