Closed Bug 1271102 Opened 4 years ago Closed 2 years ago

Release assert when sending large (>128MB) IPC messages

Categories: Core :: IPC, defect, P1, critical
Platform: Unspecified / Windows 8

Status: RESOLVED FIXED
Tracking Status:
  e10s + ---
  firefox48 --- unaffected
  firefox49 + fixed
  firefox50 + fixed

Reporter: dbaron; Unassigned
References: Depends on 1 open bug
Keywords: 4 keywords; Whiteboard: btpp-meta
Attachments: 1 file

This bug was filed from the Socorro interface and is report bp-5b3756e3-1086-4cd8-94b2-2a98e2160506.
=============================================================

A new topcrash started in build 20160506052823.
https://crash-stats.mozilla.com/signature/?product=Firefox&release_channel=nightly&platform=Windows&date=%3E%3D2016-04-01&signature=mozilla%3A%3Aipc%3A%3AProcessLink%3A%3ASendMessageW

It's clearly a regression from:
https://hg.mozilla.org/mozilla-central/rev/5cdbf605f972

It's not clear to me how the crashes should usefully be categorized, though.

Are initial segments of the stack (but with more than one function) a useful way to categorize them?  (i.e., should we add things to prefix_signature_re?)  Or something else?


FWIW, I looked at 5 reports, and there were two pairs of reports that were similar.  Two reports had the stack top:


0 	xul.dll 	mozilla::ipc::ProcessLink::SendMessageW(IPC::Message*) 	ipc/glue/MessageLink.cpp:161
1 	xul.dll 	mozilla::ipc::MessageChannel::DispatchMessageW(IPC::Message const&) 	ipc/glue/MessageChannel.cpp:1607
2 	xul.dll 	mozilla::ipc::MessageChannel::OnMaybeDequeueOne() 	ipc/glue/MessageChannel.cpp:1560
3 	xul.dll 	nsRunnableMethodImpl<bool ( mozilla::ipc::MessageChannel::*)(void), 0, 1>::Run() 	obj-firefox/dist/include/nsThreadUtils.h:741


and two had:


0 	xul.dll 	mozilla::ipc::ProcessLink::SendMessageW(IPC::Message*) 	ipc/glue/MessageLink.cpp:161
1 	xul.dll 	mozilla::ipc::MessageChannel::Send(IPC::Message*) 	ipc/glue/MessageChannel.cpp:780
2 	xul.dll 	mozilla::dom::PBrowserChild::SendInvokeDragSession(nsTArray<mozilla::dom::IPCDataTransfer> const&, unsigned int const&, nsCString const&, unsigned int const&, unsigned int const&, unsigned int const&, unsigned char const&, int const&, int const&) 	obj-firefox/ipc/ipdl/PBrowserChild.cpp:2127
3 	xul.dll 	nsDragServiceProxy::InvokeDragSessionImpl(nsISupportsArray*, nsIScriptableRegion*, unsigned int) 	widget/nsDragServiceProxy.cpp:61
Flags: needinfo?(erahm)
To the extent that we can't fix it quickly, we should classify these by the message type, I think. That's probably not available in any annotations, and perhaps not in the minidump at all currently.
We changed the behavior here to enforce the maximum message size on the sender side rather than the receiver side. In theory we're not causing more crashes, just crashing sooner.
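The sender-side enforcement amounts to a hard size check at Send() time. A minimal sketch of that check, assuming the 128 MiB limit from this bug; the constant name follows IPC::Channel::kMaximumMessageSize from the crash signature, but everything else here is illustrative, not the actual Mozilla code:

```cpp
#include <cstddef>

// Illustrative limit, matching the 128 MiB value discussed in this bug.
constexpr std::size_t kMaximumMessageSize = 128 * 1024 * 1024;

// In the real code an oversized message trips a MOZ_RELEASE_ASSERT in
// ProcessLink::SendMessage and crashes the process; here we just report
// whether a message of the given size would pass the sender-side check.
bool WouldSendSucceed(std::size_t aMessageSize) {
  return aMessageSize < kMaximumMessageSize;
}
```

Before this change the same limit was enforced on the receiving side, so the sender could complete Send() and the crash happened in the other process.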

I could either add the message name to the assertion message or I could add an annotation (I'm not sure what the process for that is though).

I also saw structured clone in some of the stacks:

> 0 	xul.dll 	mozilla::ipc::ProcessLink::SendMessageW(IPC::Message*) 	ipc/glue/MessageLink.cpp:161
> 1 	xul.dll 	mozilla::ipc::MessageChannel::Send(IPC::Message*) 	ipc/glue/MessageChannel.cpp:780
> 2 	xul.dll 	mozilla::dom::PBrowserParent::SendAsyncMessage(nsString const&, nsTArray<mozilla::jsipc::CpowEntry> const&, IPC::Principal const&, mozilla::dom::ClonedMessageData const&) 	obj-firefox/ipc/ipdl/PBrowserParent.cpp:231
> 3 	xul.dll 	nsFrameLoader::DoSendAsyncMessage(JSContext*, nsAString_internal const&, mozilla::dom::ipc::StructuredCloneData&, JS::Handle<JSObject*>, nsIPrincipal*) 	dom/base/nsFrameLoader.cpp:2681
Flags: needinfo?(erahm)
If you do add an annotation, it would be nice to have the size of the message in there, like we do for NS_ABORT_OOM.
Annotating has a problem: it's main-thread-only in content processes. I would guess that the IPC messaging layer runs off the main thread (I need to confirm this); if that's the case, there's not much I can do.
If the above page prints

"progress 1"
"The transfer is complete."

then the crash did not occur.

Running in non-e10s mode (setting browser.tabs.remote.autostart.2 to false) avoids the crash, so this looks e10s-related.
Yes, this assertion is caused by an IPC message being very large, so they are mostly going to be e10s-only. Your particular crash is in message manager, for what it is worth. The other recognizable one I see in this thread is in drag and drop code, which sends images serialized as a string. I should file a bug about using shared memory for that.

Message manager will be harder to fix. Maybe the message manager could reject large messages somehow, and throw an exception.
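The "reject large messages and throw an exception" idea could look roughly like this. This is a hypothetical sketch, not the message manager's actual API (the real follow-up was pursued in bug 1272423); the function and constant names are invented:

```cpp
#include <cstddef>
#include <stdexcept>

// Hypothetical limit mirroring the IPC channel's maximum message size.
constexpr std::size_t kMaxMessageSize = 128 * 1024 * 1024;

// Hypothetical message-manager entry point: reject an oversized payload
// with an exception the calling script can catch, instead of letting it
// reach the IPC layer and trip the process-killing release assert.
void SendMessageOrThrow(std::size_t aPayloadSize) {
  if (aPayloadSize >= kMaxMessageSize) {
    throw std::length_error("message manager: message too large");
  }
  // ... serialize and hand the message to the IPC channel ...
}
```

The design point is simply to move the failure from an unrecoverable process abort to a recoverable script-visible error.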

(In reply to Eric Rahm [:erahm] from comment #2)
> We changed the behavior here to enforce the maximum message size on the
> sender side rather than the receiver side. In theory we're not causing more
> crashes, just crashing sooner.

You also decreased the message size limit. Maybe we should back that out until some of these cases are more under control? Though sending messages between 128 MB and 256 MB is often going to result in an OOM crash on 32-bit systems.
Depends on: 1272018
I'm hitting this crash when trying to analyze something with the gecko profiler addon, using custom profiler settings with an increased sample buffer.
not blocking on this, we prefer to fix the individual cases where we allocate large blocks for individual messages.
Priority: -- → P1
Keywords: meta
Depends on: 1272423
Yeah, I think bug 1093357 will let us avoid giant IPC messages for a lot of the various IPC users that are sending giant gobs of data.
Depends on: 1093357
(In reply to Andrew McCreight [:mccr8] from comment #11)
> Yeah, I think bug 1093357 will let us avoid giant IPC messages for a lot of
> the various IPC users that are sending giant gobs of data.

Keep in mind that bug 1093357 is only supporting pipes from child-to-parent at first.  At least one of the stacks above seems to be sending from parent-to-child.
Summary: Crashes in mozilla::ipc::ProcessLink::SendMessageW, starting 2016-05-06, from MOZ_RELEASE_ASSERT(msg->size() < IPC::Channel::kMaximumMessageSize) → Release assert when sending large (>128MB) IPC messages
This crash seems to show up without the W on OSX. It is the number one crash on OSX for 05-11 builds (with only 5 crashes).
Crash Signature: [@ mozilla::ipc::ProcessLink::SendMessageW] → [@ mozilla::ipc::ProcessLink::SendMessageW] [@ mozilla::ipc::ProcessLink::SendMessage ]
Blocks: 1273258
Crash Signature: [@ mozilla::ipc::ProcessLink::SendMessageW] [@ mozilla::ipc::ProcessLink::SendMessage ] → [@ mozilla::ipc::ProcessLink::SendMessageW] [@ mozilla::ipc::ProcessLink::SendMessage] [@ xul.dll@0xccfa63 | strspn]
I'm hitting this crash on Windows during a lot of different activities: trying to view a JS file in the debugger, visiting a page that stores asset files in the IndexedDB cache, or running other pages that call postMessage.

In comment 9, it is stated "we prefer to fix the individual cases ...". Does that mean we prefer to open a separate bug entry for each distinct crashing call stack leading to SendMessage?
Assuming yes to my above question, I went ahead and filed the two separate call stack crashes as

bug 1274074 - console.log()ging too much data can crash e10s via IPC call from mozilla::dom::ProcessGlobal::SendAsyncMessage

and

bug 1274075 - IndexedDB ObjectStore AddOrPut operation crashes when attempting to send too much data via an IPC pipe

The call stacks are separate from comment 0, so they likely warrant reports of their own.
Depends on: 1274404
Depends on: 1274706
Also marked down a third detected call site in addition to those two from comment 15:

bug 1275062 - IndexedDB IDB Request Send__delete__ operation crashes when attempting to send too much data via an IPC pipe
Depends on: 1275062
Depends on: 1274075
Depends on: 1274074
Duplicate of this bug: 1119836
Depends on: 1275398
(In reply to Jim Mathies [:jimm] from comment #9)
> not blocking on this, we prefer to fix the individual cases where we
> allocate large blocks for individual messages.

I would like to make a second attempt at getting this to block.

The size of the data that the parent process sends to the child can be controlled from a web page, as the IndexedDB bugs that Jukka mentioned demonstrate.  This means that you can write a web page that causes the parent to send more than 128 megs of data and crash the whole browser, which seems very bad.

Also, this API is exposed to chrome JS as this call stack shows https://crash-stats.mozilla.com/report/index/ddafba04-04f3-41d5-a810-cde2c2160602 which means that add-ons can also crash Firefox in this way.

I think we need to fix this issue.
On branches without bug 1262671, which just landed, I can't imagine that many users on 32-bit systems will actually survive trying to send a 128MB message.

We could also revert to what should have been the prior behavior by bumping the limit to 256MB. This will still let a webpage crash the browser by sending a lot of data, but of course there are lots of other ways to OOM kill the browser from content.

(In reply to :Ehsan Akhgari (busy, don't ask for review please) from comment #18)
> Also, this API is exposed to chrome JS as this call stack shows
> https://crash-stats.mozilla.com/report/index/ddafba04-04f3-41d5-a810-
> cde2c2160602 which means that add-ons can also crash Firefox in this way.

That should be addressed in bug 1272423, by just throwing an exception.
Depends on: 1277664
Depends on: 1277681
Depends on: 1277744
Whiteboard: btpp-meta
It seems like indexeddb blobs are the last remaining issue here. That will probably be hard to fix quickly. I think a good short-term fix here is to boost the limit back to 256MB. That can be done as an M9 bug.
It appears to be hard to fix some sources of >128 MiB messages (e.g. IndexedDB),
so revert back to a 256MiB limit for the short term.
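Schematically, the revert is a one-constant change. The constant names below are illustrative (the real limit lives in IPC::Channel), and the transient-footprint estimate is an assumption sketching the 32-bit OOM concern raised in comment 19, where a message may be held in more than one buffer while being serialized and sent:

```cpp
#include <cstddef>

// The two limits discussed in this bug; names here are illustrative.
constexpr std::size_t kOldLimit = 128 * 1024 * 1024;       // from bug 1268616
constexpr std::size_t kRevertedLimit = 256 * 1024 * 1024;  // this patch

// Assumption: a message can transiently exist in ~2 copies during
// serialization and send, so a 256 MiB message can need ~512 MiB of
// contiguous allocations, a large slice of a 32-bit address space.
constexpr std::size_t TransientFootprint(std::size_t aSize) {
  return 2 * aSize;
}
```

This is why raising the limit only trades the release assert for a likely OOM on 32-bit builds without the allocation fixes from bug 1262671.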

Review commit: https://reviewboard.mozilla.org/r/58308/diff/#index_header
See other reviews: https://reviewboard.mozilla.org/r/58308/
Attachment #8760913 - Flags: review?(wmccloskey)
Attachment #8760913 - Flags: review?(wmccloskey) → review+
Pushed by jryans@gmail.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/04beffefe3a7
Revert back to 256 MiB message limit. r=billm
Comment on attachment 8760913 [details]
Bug 1271102 - Revert back to 256 MiB message limit.

Approval Request Comment
[Feature/regressing bug #]:
bug 1268616
[User impact if declined]:
occasional crashes in games and mail sites
[Describe test coverage new/current, TreeHerder]:
reverting to the previous setting. well tested.
[Risks and why]: 
low, setting a maximum message size limit back to its original value.
[String/UUID change made/needed]:
none
Attachment #8760913 - Flags: approval-mozilla-beta?
Attachment #8760913 - Flags: approval-mozilla-aurora?
This isn't on beta.
Attachment #8760913 - Flags: approval-mozilla-beta?
Looks like this was fixed in 50, comment 24. Tracking since this looks like a recent regression.
Comment on attachment 8760913 [details]
Bug 1271102 - Revert back to 256 MiB message limit.

Revert to previous size limit to prevent crashes, please uplift to aurora
Attachment #8760913 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Since uplift to aurora, there are still new crash reports coming in with this signature. Are there still remaining issues for a followup bug?
Flags: needinfo?(dbaron)
:dbaron is out, let's redirect comment 30.
Flags: needinfo?(dbaron) → needinfo?(continuation)
(In reply to Liz Henry (:lizzard) (needinfo? me) from comment #30)
> Since uplift to aurora, there are still new crash reports coming in with
> this signature. Are there still remaining issues for a followup bug?

The patch wasn't expected to fix all of these crashes. It just increases a limit from 128MB to 256MB. The open bugs blocking this one address some of the common remaining issues.
Flags: needinfo?(continuation)
(In reply to Andrew McCreight [:mccr8] from comment #32)
> (In reply to Liz Henry (:lizzard) (needinfo? me) from comment #30)
> > Since uplift to aurora, there are still new crash reports coming in with
> > this signature. Are there still remaining issues for a followup bug?
> 
> The patch wasn't expected to fix all of these crashes. It just increases a
> limit from 128MB to 256MB. The open bugs blocking this one address some of
> the common remaining issues.

I guess this is why I'm not able to download files > 256 MB from mega.nz without crashing.
Do we know when we regressed this?
This bug makes Cleopatra (Gecko Profiler) cause reproducible tab crashes in many cases. It stops me from using the profiler. Should I file a bug against Cleopatra?
I wonder if there's been a regression in the last few days, I remember being able to download the files a few days ago, but I can't anymore.

kael, when did the crashes you're experiencing start to happen?
(In reply to Marco Castelluccio [:marco] from comment #33)
> (In reply to Andrew McCreight [:mccr8] from comment #32)
> > (In reply to Liz Henry (:lizzard) (needinfo? me) from comment #30)
> > > Since uplift to aurora, there are still new crash reports coming in with
> > > this signature. Are there still remaining issues for a followup bug?
> > 
> > The patch wasn't expected to fix all of these crashes. It just increases a
> > limit from 128MB to 256MB. The open bugs blocking this one address some of
> > the common remaining issues.
> 
> I guess this is why I'm not able to download files > 256 MB from mega.nz
> without crashing.
> Do we know when we regressed this?

I haven't had any trouble with large file downloads. I've brought down ISOs from MSDN, for example, without issue. I'll check again to be sure. Can you post crash report IDs for this?

(In reply to K. Gadd (:kael) from comment #34)
> This bug makes Cleopatra (Gecko Profiler) cause reproducible tab crashes in
> many cases. It stops me from using the profiler. Should I file a bug against
> Cleopatra?

Yes, please do.
(In reply to Marco Castelluccio [:marco] from comment #35)
> I wonder if there's been a regression in the last few days, I remember being
> able to download the files a few days ago, but I can't anymore.
> 
> kael, when did the crashes you're experiencing start to happen?

I haven't used Cleopatra in months before today, sorry. If I don't have the GPU and Main Thread I/O options turned on, it doesn't crash, presumably because the profiles are smaller. And it seems to be intermittent, unfortunately. Bug 1294712 filed against it. Thanks.
(In reply to Jim Mathies [:jimm] from comment #36)
> I haven't had any trouble with large file downloads. I've brought down isos
> from msdn for example without issue. I'll check again to be sure. Can you
> post crash report ids for this?

I believe file downloads from mega.nz aren't "normal" downloads. They're stored in a blob first.

Here's a report: https://crash-stats.mozilla.com/report/index/e405c34e-e7fd-42fd-b30d-ef6a02160812.
See Also: → 1288997
Depends on: 1288997
See Also: 1288997
Depends on: 1306331
Could we augment the mozilla::ipc::ProcessLink::SendMessage and mozilla::ipc::ProcessLink::SendMessageW signatures with the ipc_message_name?
It would give us a better matching between signatures and bugs blocking this bug.
Flags: needinfo?(ted)
Not if we don't have it available as a crash annotation, no. We'd have to add that first. Ting-Yu did something similar in bug 1301022, but I don't think it's exactly what you'd want here.
Flags: needinfo?(ted)
marco pointed out that we did in fact add that annotation in bug 1274404. I don't see any reason why we couldn't use that for signatures. You'd want to file a Socorro bug on that.
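Using that annotation to augment the signature could be sketched as below. All names here are hypothetical: the real annotation landed in bug 1274404 and the real signature logic lives in Socorro's server-side processor, not in Firefox:

```cpp
#include <map>
#include <string>

// Toy annotation store standing in for the crash reporter's
// annotation API; names are illustrative only.
std::map<std::string, std::string> gCrashAnnotations;

void AnnotateCrashReport(const std::string& aKey, const std::string& aValue) {
  gCrashAnnotations[aKey] = aValue;
}

// Sketch of the proposal: append the annotated IPC message name to the
// base crash signature so each oversized message type buckets separately.
std::string AugmentedSignature(const std::string& aBase) {
  auto it = gCrashAnnotations.find("IPCMessageName");
  return it == gCrashAnnotations.end() ? aBase : aBase + " | " + it->second;
}
```

With per-message-name buckets, each blocking bug can be matched to its own signature instead of all crashes collapsing into ProcessLink::SendMessage.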
Depends on: 1306449
(In reply to Ted Mielczarek [:ted.mielczarek] from comment #41)
> You'd want to file a Socorro bug on that.

I've now filed bug 1306449 for that.
Bug 1306449 was fixed, so I've added the new signatures to the blocking bugs.
This was part of our memory reduction work for shipping e10s. Closing this meta out.
Status: NEW → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED
Removing leave-open keyword from resolved bugs, per :sylvestre.
Keywords: leave-open