Closed Bug 1258312 Opened 5 years ago Closed 5 years ago

crash in mozalloc_abort | NS_DebugBreak | mozilla::ipc::FatalError | mozilla::dom::PBrowserChild::FatalError | mozilla::dom::PBrowserChild::OnMessageReceived

Categories

(Core :: DOM: Content Processes, defect, P1)

x86
Windows NT
defect

Tracking

()

RESOLVED FIXED
Tracking Status
e10s + ---
firefox47 --- fixed

People

(Reporter: kanru, Assigned: kanru)

References

Details

(Keywords: crash, Whiteboard: btpp-active)

Crash Data

Attachments

(2 files)

This bug was filed from the Socorro interface and is 
report bp-8c873afc-9585-404c-a46f-9bc4d2160320.
=============================================================

This crash is #2, with 296 crashes, +2.09% compared with the control group on the beta46 a/b experiment.

Frame 	Module 	Signature 	Source
0 	mozglue.dll 	mozalloc_abort(char const* const) 	memory/mozalloc/mozalloc_abort.cpp
1 	xul.dll 	NS_DebugBreak 	xpcom/base/nsDebugImpl.cpp
2 	xul.dll 	mozilla::ipc::FatalError(char const*, char const*, unsigned long, bool) 	ipc/glue/ProtocolUtils.cpp
3 	xul.dll 	mozilla::dom::PBrowserChild::FatalError(char const* const) 	obj-firefox/ipc/ipdl/PBrowserChild.cpp
4 	xul.dll 	mozilla::dom::PBrowserChild::OnMessageReceived(IPC::Message const&) 	obj-firefox/ipc/ipdl/PBrowserChild.cpp
5 	xul.dll 	mozilla::dom::PContentChild::OnMessageReceived(IPC::Message const&) 	obj-firefox/ipc/ipdl/PContentChild.cpp
6 	xul.dll 	mozilla::ipc::MessageChannel::DispatchAsyncMessage(IPC::Message const&) 	ipc/glue/MessageChannel.cpp
7 	xul.dll 	mozilla::ipc::MessageChannel::DispatchMessageW(IPC::Message const&) 	ipc/glue/MessageChannel.cpp
8 	xul.dll 	mozilla::ipc::MessageChannel::OnMaybeDequeueOne() 	ipc/glue/MessageChannel.cpp
9 	xul.dll 	MessageLoop::DoWork() 	ipc/chromium/src/base/message_loop.cc
10 	xul.dll 	mozilla::ipc::MessagePumpForChildProcess::Run(base::MessagePump::Delegate*) 	ipc/glue/MessagePump.cpp
Assignee: nobody → kchen
1) appear to be content process protocol FatalAborts - 

http://hg.mozilla.org/releases/mozilla-beta/annotate/5904e3eb711d/ipc/glue/ProtocolUtils.cpp#l332

2) All appear to be startup crashes

< 1 min 	98.67% 	446
1-5 min 	1.33% 	6

3) All signatures have DispatchAsyncMessage on the stack

We need to build beta up to get a look at where this happens in auto generated code.
I checked one report https://crash-stats.mozilla.com/report/index/f221d2f6-e2cf-4002-9336-7df922160317 and a few others.

The error is "Error deserializing nsSizeMode". Maybe we send some initialized value the content process. Since we only assert the validity in debug build we don't catch this on the sender side.
See Also: → 1259827
https://dxr.mozilla.org/mozilla-central/rev/63be002b4a803df1122823841ef7633b7561d873/ipc/chromium/src/base/pickle.cc#508-509

If Resize failed the content will crash in the exact same signature so this might be an OOM issue. Note bug 897870 added annotation to initial message construction but not WriteBegin. I'm not sure if writing a single bool could trigger the Resize yet. Our initial capacity_ is 64 on both x86 and x86_64.
(In reply to Kan-Ru Chen [:kanru] (UTC+8) from comment #3)
> If Resize failed the content will crash in the exact same signature so this
> might be an OOM issue. Note bug 897870 added annotation to initial message
> construction but not WriteBegin. I'm not sure if writing a single bool could
> trigger the Resize yet. Our initial capacity_ is 64 on both x86 and x86_64.

That's a great point. That could explain a number of these deserialization crashes we see (bug 1259192 is another). While the lower level code that calls Pickle::BeginWrite() (like Pickle::WriteBytes()) seems to properly do error checking, I don't think that's happening in the higher-level IPDL-generated code such as 
PBrowserParent::SendUpdateDimensions. Do you want to write a patch to check the last Resize() or should I?
Flags: needinfo?(kchen)
Priority: -- → P1
See Also: → 1259192
See Also: → 1259198
(In reply to Andrew McCreight [:mccr8] from comment #4)
> (In reply to Kan-Ru Chen [:kanru] (UTC+8) from comment #3)
> > If Resize failed the content will crash in the exact same signature so this
> > might be an OOM issue. Note bug 897870 added annotation to initial message
> > construction but not WriteBegin. I'm not sure if writing a single bool could
> > trigger the Resize yet. Our initial capacity_ is 64 on both x86 and x86_64.
> 
> That's a great point. That could explain a number of these deserialization
> crashes we see (bug 1259192 is another). While the lower level code that
> calls Pickle::BeginWrite() (like Pickle::WriteBytes()) seems to properly do
> error checking, I don't think that's happening in the higher-level
> IPDL-generated code such as 
> PBrowserParent::SendUpdateDimensions. Do you want to write a patch to check
> the last Resize() or should I?

I have a patch that turns Resize() to be infallible. What do you think? Although the Pickle methods is geared towards allocation failure reporting, none of the higher-level IPDL code checks the result.
Flags: needinfo?(kchen)
Not sure if we need the second patch but if we will want to know the real deserialization error in case it's not OOM.
The patches looks good to me, but you should get an IPC peer to review it. Maybe jld.
Attachment #8736137 - Flags: review?(continuation) → review?(jld)
Attachment #8736138 - Flags: review?(continuation) → review?(jld)
This particular signature does not appear in builds after 46 experiment 1, but that may be because it morphed to a different signature because of inlining or COMDAT-folding. I think we should proceed with getting the better debugging data.
Flags: needinfo?(jld)
review ping
Comment on attachment 8736137 [details]
MozReview Request: Bug 1258312 - Make Pickle::Resize infallible r?mccr8

https://reviewboard.mozilla.org/r/43123/#review40537

Longer-term it could be useful to be able to kill only the child process that caused the parent to fail an allocation, but that's not too meaningful in the immediate future.
Attachment #8736137 - Flags: review?(jld) → review+
Attachment #8736138 - Flags: review?(jld) → review+
Comment on attachment 8736138 [details]
MozReview Request: Bug 1258312 - Add crash annotation to EnumSerializer r?mccr8

https://reviewboard.mozilla.org/r/43125/#review40539
Flags: needinfo?(jld)
Is this ready to land now? It would be nice to see what happens with these patches.
Flags: needinfo?(kchen)
Flags: needinfo?(kchen)
Keywords: leave-open
Comment on attachment 8736137 [details]
MozReview Request: Bug 1258312 - Make Pickle::Resize infallible r?mccr8

Approval Request Comment
[Feature/regressing bug #]: N/A
Uplift this to 47 in preparation for the next e10s experiment and diagnose the problem. 
[User impact if declined]: no
[Describe test coverage new/current, TreeHerder]: cooked on m-c
[Risks and why]: Little. This turns a potential content crash to a main process OOM crash. The main process wouldn't be able to survive long after that anyway.
[String/UUID change made/needed]: no
Attachment #8736137 - Flags: approval-mozilla-aurora?
Comment on attachment 8736138 [details]
MozReview Request: Bug 1258312 - Add crash annotation to EnumSerializer r?mccr8

Approval Request Comment
Same as above
Attachment #8736138 - Flags: approval-mozilla-aurora?
See Also: → 1259480
I see a couple of deserialization FatalError crashes on April 2 through April 6, but none after this patch landed. These crashes aren't super common on Nightly, so it is hard to say for sure if they are gone entirely, but we should consider duping all of the other deserialization crashes in the "see also" section over to this one.

https://crash-stats.mozilla.com/search/?product=Firefox&version=48.0a1&signature=~mozilla%3A%3Aipc%3A%3AFatalError&build_id=%3E%3D20160311030233&_facets=signature&_columns=date&_columns=signature&_columns=build_id&_columns=ipc_fatal_error_msg&_columns=ipc_fatal_error_protocol#crash-reports
Whiteboard: btpp-active
Flags: needinfo?(rkothari)
Duplicate of this bug: 1259192
Duplicate of this bug: 1259198
These patches modify IPC OOM errors such that they show up as OOM signatures vs. FatalError signatures. It's really important we get this uplifted for the beta 47 experiment.
Comment on attachment 8736137 [details]
MozReview Request: Bug 1258312 - Make Pickle::Resize infallible r?mccr8

e10s stability related, Aurora47+
Attachment #8736137 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Attachment #8736138 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
The reason why I didn't prioritize this over the other bugs in the aurora nom queue is the bug resolution is "New". I was under the impression that the fix hasn't landed in Nightly yet and therefore not ready for uplift to Aurora. Sorry about the delay!
Duplicate of this bug: 1259827
As expected, these crashes seem to have gone away on Aurora after 4-13, which is about when this landed.
I do actually see a few in the parent process after the landing, so we may have some lingering bugs in our deserialization code. Hopefully they aren't common enough to impact stability. If so, we can file new bugs for them.
Status: NEW → RESOLVED
Closed: 5 years ago
Keywords: leave-open
Resolution: --- → FIXED
I saw some new signatures on Nightly which are deserialization on nsTArray. One of the messages is PBrowser::Msg_HandleAccessKey. I haven't check the others. Totally disappeared on Aurora.
Blocks: 1265680
Duplicate of this bug: 1248788
(In reply to Kan-Ru Chen [:kanru] (UTC+8) from comment #19)
> Some results are coming in:
> 
> https://crash-stats.mozilla.com/search/
> ?product=Firefox&proto_signature=%5Emozalloc_abort+%7C+mozalloc_handle_oom+%7
> C+moz_xrealloc+%7C+Pickle%3A%3AResize&_facets=signature&_columns=date&_column
> s=signature&_columns=product&_columns=version&_columns=build_id&_columns=plat
> form#facet-signature

FWIW, that search is broken because it puts '+' characters before and after the '|' characters. This one is better:
> https://crash-stats.mozilla.com/search/?product=Firefox&proto_signature=%5Emozalloc_abort%20%7C%20mozalloc_handle_oom%20%7C%20moz_xrealloc%20%7C%20Pickle%3A%3AResize&_sort=-date&_facets=signature&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform#facet-signature

(Socorro inserts the '+' chars sometimes. I haven't worked out exactly why or when.)
You need to log in before you can comment on or make changes to this bug.