Closed Bug 1185726 Opened 6 years ago Closed 4 years ago

Mysterious crashes in nsIFrame in Nightly 42

Categories

(Core :: Disability Access APIs, defect)

42 Branch
All
Windows
defect
Not set
critical

Tracking

()

RESOLVED WORKSFORME
mozilla42
Tracking Status
firefox42 + wontfix
firefox43 --- ?
firefox44 --- ?
firefox45 --- ?

People

(Reporter: away, Unassigned)

References

Details

(Keywords: crash, Whiteboard: ShutDownKill)

Crash Data

Attachments

(1 file)

This bug was filed from the Socorro interface and is 
report bp-48a4d38d-841c-4c1c-8db8-7bb282150715.
=============================================================

The signatures are all over nsIFrame (I've only listed a few). In total it's one of the top crashes on nightly right now.

The reason code is EXCEPTION_BREAKPOINT which normally means an "int 3" assembly instruction from a MOZ_CRASH, but I don't see those instructions in the minidumps. This seems to have started in 20150714030206.

(I checekd, and it's not the AMD bug)

https://crash-stats.mozilla.com/search/?signature=%24nsIFrame&release_channel=nightly&product=Firefox&platform=Windows+NT&_facets=signature&_facets=build_id&_facets=version&_facets=release_channel&_facets=reason&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform#facet-signature
Needs an owner.
Flags: needinfo?(bugs)
[Tracking Requested - why for this release]:
This is a huge portion of the currently very high crash rate on Nightly. I think most of the bold items in https://crash-analysis.mozilla.com/rkaiser/2015-07-19/2015-07-19.firefox.42.explosiveness.html belong to this issue.
I think these are actually crashes in the parent process, with this stack:
xul!mozilla::dom::CrashReporterParent::GenerateCompleteMinidump<mozilla::dom::ContentParent>
xul!mozilla::dom::ContentParent::KillHard
xul!mozilla::dom::ContentParent::ProcessingError (*)
xul!mozilla::ipc::MessageChannel::MaybeHandleError
xul!mozilla::ipc::MessageChannel::DispatchMessageW
xul!mozilla::ipc::MessageChannel::OnMaybeDequeueOne
xul!RunnableMethod<mozilla::ipc::MessageChannel,void (__cdecl mozilla::ipc::MessageChannel::*)(void) __ptr64,Tuple0>::Run
xul!MessageLoop::DoWork

Near (*) I see these strings on the stack
00feed40  6c746e54 "DispatchAsyncMessage"
00feed44  00feede8 "(msgtype=0x4A0003,name=PDocAccessible::Msg_ShowEvent) P"

I suspect that these EXCEPTION_BREAKPOINTS in the child process are artificial and coming from here: https://dxr.mozilla.org/mozilla-central/source/toolkit/crashreporter/google-breakpad/src/client/windows/handler/exception_handler.cc#767

And I bet that the reason we see so much layout is just due to the hotness of that code.

The regression range is https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=38d03bf4616e&tochange=283b5f38ce57 which makes me suspect bug 1172525. Trevor can you please investigate?
Flags: needinfo?(tbsaunde+mozbugs)
Blocks: 1172525
More complete version of the error string:
           aReason = 0x00feede8 "(msgtype=0x4A0003,name=PDocAccessible::Msg_ShowEvent) Processing error: message was deserialized, but the handler returned false (indicating failure)"
I filled bug 1183913 for a huge amount of crashes I have with e10s ON, starting with the build from July 14th. I didn't do much investigations.
Could you file a bug on getting crash reporter to report this better?
Flags: needinfo?(dmajor)
Duplicate of this bug: 1183913
Out of the ~180 signatures we saw on Nightly yesterday, I guess that >150 are this crash, and probably 80-90% of the total crash volume. We really need to get this resolved in some form.
Wondering if Billm might shed light on comment 4.
Flags: needinfo?(wmccloskey)
The problem is that DocAccessibleParent::RecvShowEvent is returning false. That causes us to kill the child process. Since the Show message is async, the child process might have been doing almost anything at this point, so that's why the signatures are so random.

It seems like bug 1172525 is a likely candidate for regressing this given the range. I'm afraid I don't really understand the patch.

In general, the only reason to return false from these functions is if the child process is likely to be misbehaving or exploited. It shouldn't just be for expected errors.
Flags: needinfo?(wmccloskey)
Also, you can use Super Search to look for these based on the "ipc channel error" field. There are an enormous number of them! Out of all our "return false" crashes, 97% are from PDocAccessible. 78% are from RecvShowEvent, 11% from TextChangeEvent, and 7% from StateChangeEvent.
(In reply to Bill McCloskey (:billm) from comment #10)
> The problem is that DocAccessibleParent::RecvShowEvent is returning false.
> That causes us to kill the child process. Since the Show message is async,
> the child process might have been doing almost anything at this point, so
> that's why the signatures are so random.
> 
> It seems like bug 1172525 is a likely candidate for regressing this given
> the range. I'm afraid I don't really understand the patch.

could be your patch from bug 1100602 too (i haven't looked at the range) but I'd give that a fair chance of causing some of these.

> In general, the only reason to return false from these functions is if the
> child process is likely to be misbehaving or exploited. It shouldn't just be
> for expected errors.

I'm pretty sure all of these are cases of the child process being buggy.  That said for now I'll change these to assert and return true without doing anything.
Assignee: nobody → tbsaunde+mozbugs
Flags: needinfo?(tbsaunde+mozbugs)
(In reply to Trevor Saunders (:tbsaunde) from comment #12)
> could be your patch from bug 1100602 too (i haven't looked at the range) but
> I'd give that a fair chance of causing some of these.

I thought that too, but the crashes predate my patch. They start on 7/14 and the earliest nightly that would have had my patch is 7/17.

> I'm pretty sure all of these are cases of the child process being buggy. 
> That said for now I'll change these to assert and return true without doing
> anything.

Okay, as long as the parent isn't going to just crash based on that.
(In reply to Bill McCloskey (:billm) from comment #13)
> (In reply to Trevor Saunders (:tbsaunde) from comment #12)
> > could be your patch from bug 1100602 too (i haven't looked at the range) but
> > I'd give that a fair chance of causing some of these.
> 
> I thought that too, but the crashes predate my patch. They start on 7/14 and
> the earliest nightly that would have had my patch is 7/17.

hmm

I'm working on getting the dom fuzzer to work with e10s hopefully that will get me a test case and I can see what's happening here.

> > I'm pretty sure all of these are cases of the child process being buggy. 
> > That said for now I'll change these to assert and return true without doing
> > anything.
> 
> Okay, as long as the parent isn't going to just crash based on that.

I believe the parent should be fine.  It should behave just like if those events never happened.
Attachment #8636604 - Flags: review?(lorien) → review+
Component: Layout → Disability Access APIs
Adding a tracking flag for FF42 as it is a topcrash.
(In reply to David Baron [:dbaron] ⏰UTC-7 from comment #6)
> Could you file a bug on getting crash reporter to report this better?

Yeah, this was a suboptimal debugging experience.
Flags: needinfo?(dmajor)
Flags: needinfo?(bugs)
See Also: → 1186092
Duplicate of this bug: 1186097
Crash Signature: , mozilla::Sides, int* const)] [@ nsIFrame::MarkAbsoluteFramesForDisplayList(nsDisplayListBuilder*, nsRect const&)] [@ nsIFrame::IsVisibleConsideringAncestors(unsigned int)] [@ nsIFrame::HasOpacity()] [@ nsIFrame::GetOverflowAreasProperty()] → , mozilla::Sides, int* const)] [@ nsIFrame::MarkAbsoluteFramesForDisplayList(nsDisplayListBuilder*, nsRect const&)] [@ nsIFrame::IsVisibleConsideringAncestors(unsigned int)] [@ nsIFrame::HasOpacity()] [@ nsIFrame::GetOverflowAreasProperty()] [@ NtSet…
The fix is looking good. I'm not seeing any reports of these crashes on the latest nightly so far.
Nightly crash volume is way down, but still I noticed this on the 0723 build: bp-d97663da-d8e1-4a7a-84d4-c98c82150723

         reason = char [512] "(msgtype=0x4A0008,name=PDocAccessible::Msg_BindChildDoc) Processing error: message was deserialized, but the handler returned false (indicating failure)"

Is it another manifestation of the same issue?
Flags: needinfo?(tbsaunde+mozbugs)
in some sense yeah, since we don't correctly keep the tree in the parent up to date any message that rellies on that can fail, and BindChildDoc is another such message.

I'll prepare a patch to return true from that one :(
Flags: needinfo?(tbsaunde+mozbugs)
BindDoc isn't very common, that ShoweEvent was the bad one. Here's a report showing how these failures get associated with child calls.

http://www.mathies.com/mozilla/client-abort-report-nightly.txt
(In reply to Trevor Saunders (:tbsaunde) from comment #25)
> in some sense yeah, since we don't correctly keep the tree in the parent up
> to date any message that rellies on that can fail, and BindChildDoc is
> another such message.
> 
> I'll prepare a patch to return true from that one :(

I'm crashing a lot recently, and almost all of my crash reports have "(msgtype=0x4A0008,name=PDocAccessible::Msg_BindChildDoc) Processing error: message was deserialized, but the handler returned false (indicating failure)".
Flags: needinfo?(tbsaunde+mozbugs)
(In reply to Guilherme Lima from comment #27)
> (In reply to Trevor Saunders (:tbsaunde) from comment #25)
> > in some sense yeah, since we don't correctly keep the tree in the parent up
> > to date any message that rellies on that can fail, and BindChildDoc is
> > another such message.
> > 
> > I'll prepare a patch to return true from that one :(
> 
> I'm crashing a lot recently, and almost all of my crash reports have
> "(msgtype=0x4A0008,name=PDocAccessible::Msg_BindChildDoc) Processing error:
> message was deserialized, but the handler returned false (indicating
> failure)".

BindChildDoc is bug 1189277.
(In reply to Jim Mathies [:jimm] from comment #28)
> (In reply to Guilherme Lima from comment #27)
> > (In reply to Trevor Saunders (:tbsaunde) from comment #25)
> > > in some sense yeah, since we don't correctly keep the tree in the parent up
> > > to date any message that rellies on that can fail, and BindChildDoc is
> > > another such message.
> > > 
> > > I'll prepare a patch to return true from that one :(
> > 
> > I'm crashing a lot recently, and almost all of my crash reports have
> > "(msgtype=0x4A0008,name=PDocAccessible::Msg_BindChildDoc) Processing error:
> > message was deserialized, but the handler returned false (indicating
> > failure)".
> 
> BindChildDoc is bug 1189277.

Ah, thanks!
Flags: needinfo?(tbsaunde+mozbugs)
OK, seems the crash is anyhow back.
I use Win7, 64-bit and FF42.0a2, 64-bit with e10s.
Seems I got this signature on shutdown of FF from the plugin-container.exe.
As this crashes are ATM normally not send out, I guess it happens a lot more often then just the ~150 times that are in Socorro for now ...

Here a overview about the crashes in Socorro:
https://crash-stats.mozilla.com/report/list?product=Firefox&signature=nsIFrame%3A%3AStyleDisplay%28%29

Here is my one:
https://crash-stats.mozilla.com/report/index/34df7073-ea7f-4ee9-a8e2-a48ef2150906

And here the Crashing Thread from it:
0 	xul.dll 	nsIFrame::StyleDisplay() 	obj-firefox/dist/include/nsStyleStructList.h
1 	xul.dll 	nsFrame::DestroyFrom(nsIFrame*) 	layout/generic/nsFrame.cpp
2 	xul.dll 	nsBoxFrame::DestroyFrom(nsIFrame*) 	layout/xul/nsBoxFrame.cpp
3 	xul.dll 	nsContainerFrame::DestroyFrom(nsIFrame*) 	layout/generic/nsContainerFrame.cpp
4 	xul.dll 	nsBoxFrame::DestroyFrom(nsIFrame*) 	layout/xul/nsBoxFrame.cpp
5 	xul.dll 	nsContainerFrame::DestroyFrom(nsIFrame*) 	layout/generic/nsContainerFrame.cpp
6 	xul.dll 	nsContainerFrame::DestroyFrom(nsIFrame*) 	layout/generic/nsContainerFrame.cpp
7 	xul.dll 	nsFrameManager::Destroy() 	layout/base/nsFrameManager.cpp
8 	xul.dll 	PresShell::Destroy() 	layout/base/nsPresShell.cpp
9 	xul.dll 	nsDocumentViewer::DestroyPresShell() 	layout/base/nsDocumentViewer.cpp
10 	xul.dll 	nsDocumentViewer::Destroy() 	layout/base/nsDocumentViewer.cpp
11 	xul.dll 	nsDocShell::Destroy() 	docshell/base/nsDocShell.cpp
12 	xul.dll 	nsWebBrowser::SetDocShell(nsIDocShell*) 	embedding/browser/nsWebBrowser.cpp
13 	xul.dll 	nsWebBrowser::InternalDestroy() 	embedding/browser/nsWebBrowser.cpp
14 	xul.dll 	nsWebBrowser::Destroy() 	embedding/browser/nsWebBrowser.cpp
15 	xul.dll 	mozilla::dom::TabChild::DestroyWindow() 	dom/ipc/TabChild.cpp
16 	xul.dll 	mozilla::dom::TabChild::RecvDestroy() 	dom/ipc/TabChild.cpp
17 	xul.dll 	mozilla::dom::PBrowserChild::OnMessageReceived(IPC::Message const&) 	obj-firefox/ipc/ipdl/PBrowserChild.cpp
18 	xul.dll 	mozilla::dom::PContentChild::OnMessageReceived(IPC::Message const&) 	obj-firefox/ipc/ipdl/PContentChild.cpp
19 	xul.dll 	mozilla::ipc::MessageChannel::DispatchMessageW(IPC::Message const&) 	ipc/glue/MessageChannel.cpp
20 	xul.dll 	mozilla::ipc::MessageChannel::OnMaybeDequeueOne() 	ipc/glue/MessageChannel.cpp
21 	xul.dll 	RunnableMethod<mozilla::ipc::MessageChannel, void ( mozilla::ipc::MessageChannel::*)(void), Tuple0>::Run() 	ipc/chromium/src/base/task.h
22 	xul.dll 	MessageLoop::DoWork() 	ipc/chromium/src/base/message_loop.cc
23 	xul.dll 	mozilla::ipc::DoWorkRunnable::Run() 	ipc/glue/MessagePump.cpp
24 	xul.dll 	nsThread::ProcessNextEvent(bool, bool*) 	xpcom/threads/nsThread.cpp
25 	xul.dll 	XPTC__InvokebyIndex 	xpcom/reflect/xptcall/md/win32/xptcinvoke_asm_x86_64.asm
26 		@0xfff8ffffffffffff 	
27 	xul.dll 	nsXPConnect::XPConnect() 	js/xpconnect/src/xpcprivate.h
28 	xul.dll 	XPCWrappedNative::CallMethod(XPCCallContext&, XPCWrappedNative::CallMode) 	js/xpconnect/src/XPCWrappedNative.cpp
29 	xul.dll 	XPC_WN_CallMethod(JSContext*, unsigned int, JS::Value*) 	js/xpconnect/src/XPCWrappedNativeJSOps.cpp
30 		@0x9da2121c4b
Status: RESOLVED → REOPENED
Hardware: x86 → All
Resolution: FIXED → ---
Crash Signature: , nsRect const&)] [@ nsIFrame::IsVisibleConsideringAncestors(unsigned int)] [@ nsIFrame::HasOpacity()] [@ nsIFrame::GetOverflowAreasProperty()] [@ NtSetIoCompletion ] [@ ZwSetIoCompletion ] → , nsRect const&)] [@ nsIFrame::IsVisibleConsideringAncestors(unsigned int)] [@ nsIFrame::HasOpacity()] [@ nsIFrame::GetOverflowAreasProperty()] [@ NtSetIoCompletion ] [@ ZwSetIoCompletion ] [@ nsIFrame::StyleDisplay] [@ nsIFrame::BuildDisplayListFo…
Blocks: shutdownkill
OS: Windows NT → Windows
Whiteboard: ShutDownKill
From the crash signature nsIFrame::GetOverflowAreasProperty, the current affected versions are: 
- Nightly: 47
- Aurora: 44, 46
- Beta: 44.0b1, 44.0b99, 45.0b1
- Release: 44.0

From the crash signature nsIFrame::IsVisibleConsideringAncestors, the current affected versions are:
- Beta: 44.0b9, 44.0b99, 45.0b1, 45.0b2
- Release: 44.0

From the crash signature nsIFrame::HasOpacity, the current affected versions are:
- Beta: 44.0b1, 44.0b7, 44.0b99, 45.0b1
- Release: 44.0

From the crash signature nsIFrame::StyleDisplay, the current affected versions are:
- Aurora: 44, 46
- Beta: 44.0b1, 44.0b6, 44.0b8, 44.0b9, 44.0b99, 45.0b1, 45.0b2
- Release: 44.0

From the crash signature nsIFrame::MarkAbsoluteFramesForDisplayList, the current affected versions are:
- Beta: 45.0b1, 45.0b2

From the crash signature nsIFrame::BuildDisplayListForChild, the current affected versions are:
- Nightly: 46, 47
- Aurora: 44, 46
- Beta: 44.0b1, 44.0b2, 44.0b8, 44.0b99, 45.0b1, 45.0b2
- Release: 44.0

From the crash signature nsIFrame::ComputeBorderRadii, the current affected versions are:
- Nightly: 45
- Beta: 44.0b99
- Release: 44.0

From the crash signature NtSetIoCompletion, the current affected versions are:
- Nightly: 44, 45
- Aurora: 44
- Beta: 44.0b9

From the crash signature ZwSetIoCompletion and nsIFrame::InlinePrefISizeData::ForceBreak the current affected versions are:
- Nightly: 45

From the crash signature nsIFrame::BuildDisplayListForChild, nsIFrame::GetOffsetTo, nsIFrame::GetOffsetToCrossDoc, nsIFrame::GetPositionOfChildIgnoringScrolling and nsIFrame::SetStyleContext the current affected versions are:
- Nightly: 47
No longer blocks: shutdownkill
Assignee: tbsaunde+mozbugs → nobody
I do see some crashes like nsIFrame::IsVisibleConsideringAncestors, but I don't longer see a11y on stack, do you? If not, then it's worth to open a new bug for those crashes, and close this one.
Flags: needinfo?(brindusa.tot)
Hi Alexander,

From the list of the crashes from the bug, under the mozilla crash reports, the followings are still reproducing in the last 7 days:
  - nsIFrame::StyleDisplay
 - nsIFrame::BuildDisplayListForChild
 - nsIFrame::ComputeBorderRadii
 - nsIFrame::IsVisibleConsideringAncestors
 - nsIFrame::HasOpacity
 - nsIFrame::GetOverflowAreasProperty
 - ZwSetIoCompletion
 - nsIFrame::MarkAbsoluteFramesForDisplayList

For me, it's not clear why we should open a new bug for these crashes, considering that the above-mentioned crashes are still reproducible. Couldn't we continue tracking these failures in this bug?
Flags: needinfo?(brindusa.tot) → needinfo?(surkov.alexander)
(In reply to Brindusa Tot[:brindusat] from comment #34)
> Hi Alexander,
> 
> From the list of the crashes from the bug, under the mozilla crash reports,
> the followings are still reproducing in the last 7 days:
>   - nsIFrame::StyleDisplay
>  - nsIFrame::BuildDisplayListForChild
>  - nsIFrame::ComputeBorderRadii
>  - nsIFrame::IsVisibleConsideringAncestors
>  - nsIFrame::HasOpacity
>  - nsIFrame::GetOverflowAreasProperty
>  - ZwSetIoCompletion
>  - nsIFrame::MarkAbsoluteFramesForDisplayList
> 
> For me, it's not clear why we should open a new bug for these crashes,
> considering that the above-mentioned crashes are still reproducible.
> Couldn't we continue tracking these failures in this bug?

I suggested another bug because these crashes look different from those reported initially. Granted, they have same signatures, but their stacks are different, because I don't see a11y on the stacks any longer.

If you like to keep this bug open, then it's fine for sure, just the bug's component should be changed.
Flags: needinfo?(surkov.alexander) → needinfo?(brindusa.tot)
I would say, just update the component with the correct one and keep this bug opened.
Flags: needinfo?(brindusa.tot)
Component: Disability Access APIs → Layout
I don't see much value in keeping this bug open and tracking the crashes here.

The bug was originally opened for accessibility issue (see comment 3 onward). As far as the crashes no longer have a11y bits in their reports, the new crashes seem to be unrelated, and thus the discussion in this bug may not provide useful information anymore.

Furthermore, this bug contains lots of effectively unrelated signatures, which may actually belong to different bugs, and they happen to be connected to this bug just because there was a bug which can end up causing crashes with these signatures. So counting new crashes into this bug is not helpful, and can be confusing.

Given comment 30 and the fact that we don't see a11y bits in those crashes recently, I'd say this is a WORKSFORME since we don't actually know what fixed the a11y issue to have it stop triggering crashes here.

Please open new bugs for any signature which are still outstanding here.
Status: REOPENED → RESOLVED
Closed: 6 years ago4 years ago
Component: Layout → Disability Access APIs
Resolution: --- → WORKSFORME
You need to log in before you can comment on or make changes to this bug.