Very large allocations in call to input_overflow_buf_.reserve()

RESOLVED FIXED in Firefox 49

Status

RESOLVED FIXED
Priority: P1
Severity: normal
Reported: 3 years ago
Last modified: 2 years ago

People

(Reporter: mccr8, Assigned: erahm)

Tracking

Keywords: regression
Version: Trunk
Target Milestone: mozilla49
Points: ---

Firefox Tracking Flags

(e10s+, firefox46 unaffected, firefox47 wontfix, firefox48 wontfix, firefox49 fixed)

Details

(Whiteboard: btpp-active, crash signature)

Attachments

(3 attachments, 1 obsolete attachment)

(Reporter)

Description

3 years ago
I've come across 7 OOM crashes on this line:
  input_overflow_buf_.reserve(length + kReadBufferSize);
...where try_realloc() is attempting to allocate over 1GB.

Here are two examples:

4GB:
https://crash-stats.mozilla.com/report/index/dffd132e-0d84-4dfa-a878-071802160428

2GB:
https://crash-stats.mozilla.com/report/index/4ff040cf-0d63-4863-9d33-400f92160428

I have no idea what is going wrong to cause this.
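For reference, a minimal sketch of the allocation pattern at the crash site, with simplified names (in the real ipc_channel_win.cc code, input_overflow_buf_ is a std::string and kReadBufferSize is 4K):

  #include <cstddef>
  #include <cstdint>
  #include <string>

  const size_t kReadBufferSize = 4 * 1024;  // 4 KiB read chunk, as in the real code

  // `length` comes straight from a message header read off the pipe. With no
  // sanity check, a corrupted header claiming 2-4 GB turns this reserve() into
  // a huge try_realloc() and an OOM abort.
  void GrowOverflowBuffer(std::string& input_overflow_buf, uint32_t length) {
    input_overflow_buf.reserve(length + kReadBufferSize);
  }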
(Reporter)

Updated

3 years ago
Crash Signature: [@ OOM | large | NS_ABORT_OOM | Buffer::try_realloc ]
Comment 1

3 years ago
The length in the message must be corrupted somehow. We probably shouldn't be so trusting of it. I think it would make sense to limit length to 32MB or something. If the message is actually bigger, it just means we'll have to grow the buffer.
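A minimal sketch of the clamp being suggested here; the 32 MB cap constant and helper name are illustrative, not the fix that eventually landed:

  #include <algorithm>
  #include <cstddef>
  #include <cstdint>
  #include <string>

  const size_t kReadBufferSize = 4 * 1024;
  const size_t kMaxReserveHint = 32 * 1024 * 1024;  // hypothetical 32 MB cap

  void GrowOverflowBuffer(std::string& buf, uint32_t claimed_length) {
    // Don't trust the wire length for the up-front reservation; if the message
    // really is bigger, the buffer simply grows as more data arrives.
    size_t hint = std::min<size_t>(claimed_length, kMaxReserveHint);
    buf.reserve(hint + kReadBufferSize);
  }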
(Reporter)

Comment 2

3 years ago
Eric said he'd look at this.
Assignee: nobody → erahm
(Assignee)

Comment 3

3 years ago
Several issues here:

#1) kMaximumMessageSize is too large [1]
#2) We check the size *after* growing the buffer [2], rather than before [3]
#3) We don't enforce the max message size when *sending*
#4) The steal-the-buffer-if-the-message-is-big insanity is unsurprisingly broken in the more than one message in a buffer case [4]. We steal the buffer and assume the message was at the front. But that assumption is incorrect if there happened to have been a smaller message first. This would result in a malformed message as well as the next message in the buffer pointing to junk.
#5) The nesting and density of this code probably led to #4

[1] https://dxr.mozilla.org/mozilla-central/rev/86730d0a82093d705e44f33a34973d28b269f1ea/ipc/chromium/src/chrome/common/ipc_channel.h#51
[2] https://dxr.mozilla.org/mozilla-central/rev/86730d0a82093d705e44f33a34973d28b269f1ea/ipc/chromium/src/chrome/common/ipc_channel_win.cc#358
[3] https://dxr.mozilla.org/mozilla-central/rev/86730d0a82093d705e44f33a34973d28b269f1ea/ipc/chromium/src/chrome/common/ipc_channel_win.cc#358
[4] https://dxr.mozilla.org/mozilla-central/rev/86730d0a82093d705e44f33a34973d28b269f1ea/ipc/chromium/src/chrome/common/ipc_channel_win.cc#388-408
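For #2 above, the fix is to validate the claimed size before touching the allocator, which is what eventually landed as Part 1. A sketch with simplified names, using 256 MiB as the pre-fix maximum (inferred from the later mention of the "256->128 change"):

  #include <cstddef>
  #include <cstdint>
  #include <string>

  const size_t kReadBufferSize = 4 * 1024;
  const size_t kMaximumMessageSize = 256 * 1024 * 1024;  // pre-fix value

  // Check the claimed message size *before* growing the buffer, so an
  // oversized or corrupted length never reaches the allocator.
  bool GrowOverflowBuffer(std::string& buf, uint32_t message_length) {
    if (message_length > kMaximumMessageSize) {
      return false;  // caller treats this as a channel error and closes it
    }
    buf.reserve(message_length + kReadBufferSize);
    return true;
  }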
(Assignee)

Comment 4

3 years ago
I guess there's technically another, although I think we always kill the connection if |ProcessIncomingMessages| fails.

#6) We clear the buffer if the message was too big, but we leave the pipe with a potentially half-read message [1]

[1] https://dxr.mozilla.org/mozilla-central/rev/86730d0a82093d705e44f33a34973d28b269f1ea/ipc/chromium/src/chrome/common/ipc_channel_win.cc#358-362
Comment 6

3 years ago
(In reply to Eric Rahm [:erahm] from comment #3)
> Several issues here:
> 
> #1) kMaximumMessageSize is too large [1]

It is sort of big, but I'm not convinced we should make it smaller. Any reason?

> #2) We check the size *after* growing the buffer [2], rather than before [3]
> #3) We don't enforce the max message size when *sending*
> #4) The steal-the-buffer-if-the-message-is-big insanity is unsurprisingly
> broken in the more than one message in a buffer case [4]. We steal the
> buffer and assume the message was at the front. But that assumption is
> incorrect if there happened to have been a smaller message first. This would
> result in a malformed message as well as the next message in the buffer
> pointing to junk.

I don't think that can happen. See the comment:
https://dxr.mozilla.org/mozilla-central/rev/86730d0a82093d705e44f33a34973d28b269f1ea/ipc/chromium/src/chrome/common/ipc_channel_win.cc#393

> #5) The nesting and density of this code probably led to #4

This code is going to change significantly in bug 1262671, so please hold off on any big cleanups for now. In general I agree the code is overly complicated, though. Especially the duplication between the posix and windows implementations.
(Assignee)

Comment 7

3 years ago
#7) If we had a partial message on the previous iteration for some reason we grow the overflow buffer by kReadBufferSize (4K) [1], but then we just truncate it further down [2].

[1] https://dxr.mozilla.org/mozilla-central/rev/86730d0a82093d705e44f33a34973d28b269f1ea/ipc/chromium/src/chrome/common/ipc_channel_win.cc#374
[2] https://dxr.mozilla.org/mozilla-central/rev/86730d0a82093d705e44f33a34973d28b269f1ea/ipc/chromium/src/chrome/common/ipc_channel_win.cc#441
Comment 8

3 years ago
(In reply to Eric Rahm [:erahm] from comment #7)
> #7) If we had a partial message on the previous iteration for some reason we
> grow the overflow buffer by kReadBufferSize (4K) [1], but then we just
> truncate it further down [2].

In [1], we're sizing the buffer to the full size of the message plus some slop in case there are any other messages read in at the same time. That happens the first time we start reading a big message. We shouldn't resize the buffer down again until that message is processed. That's why the if condition is there in [2].
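A sketch of that invariant using a hypothetical helper and flag; the real code keys the shrink off the if condition at [2]:

  #include <cstddef>
  #include <string>

  const size_t kReadBufferSize = 4 * 1024;

  // Hypothetical helper mirroring the flow described above: the buffer is
  // sized to the full message (plus slop) when a big message starts, and is
  // only shrunk back once no partial message is pending.
  void MaybeShrinkOverflowBuffer(std::string& overflow_buf,
                                 bool partial_message_pending) {
    if (!partial_message_pending && overflow_buf.capacity() > kReadBufferSize) {
      overflow_buf.shrink_to_fit();  // release the oversized backing store
    }
  }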
(Assignee)

Comment 9

3 years ago
(In reply to Bill McCloskey (:billm) from comment #6)
> (In reply to Eric Rahm [:erahm] from comment #3)
> > Several issues here:
> > 
> > #1) kMaximumMessageSize is too large [1]
> 
> It is sort of big, but I'm not convinced we should make it smaller. Any
> reason?

OOMs; as mccr8 noted, it's smaller upstream. But if we're going to segment things, maybe it doesn't matter as much.

> > #4) The steal-the-buffer-if-the-message-is-big insanity is unsurprisingly
> > broken in the more than one message in a buffer case [4]. We steal the
> > buffer and assume the message was at the front. But that assumption is
> > incorrect if there happened to have been a smaller message first. This would
> > result in a malformed message as well as the next message in the buffer
> > pointing to junk.
> 
> I don't think that can happen. See the comment:
> https://dxr.mozilla.org/mozilla-central/rev/
> 86730d0a82093d705e44f33a34973d28b269f1ea/ipc/chromium/src/chrome/common/
> ipc_channel_win.cc#393

Ah right, a completed large message couldn't come after another message due to the size being greater than the read buffer. Looks like the posix version has a release assert for that.

> > #5) The nesting and density of this code probably led to #4
> 
> This code is going to change significantly in bug 1262671, so please hold
> off on any big cleanups for now. In general I agree the code is overly
> complicated, though. Especially the duplication between the posix and
> windows implementations.

Sure I can revert that (I just added some newlines and // end if/while/for (foo) comments).

(In reply to Bill McCloskey (:billm) from comment #8)
> (In reply to Eric Rahm [:erahm] from comment #7)
> > #7) If we had a partial message on the previous iteration for some reason we
> > grow the overflow buffer by kReadBufferSize (4K) [1], but then we just
> > truncate it further down [2].
> 
> In [1], we're sizing the buffer to the full size of the message plus some
> slop in case there are any other messages read in at the same time. That
> happens the first time we start reading a big message. We shouldn't resize
> the buffer down again until that message is processed. That's why the if
> condition is there in [2].

Yeah the flow here is a bit weird:
#1) We append the message data just received
#2) We reserve enough memory for the full message (we might already have a full message) + 4K
#3) If we get a full message, we process it, truncate the buffer

It seems like instead we'd want to reverse 1) and 2) if we have enough data to get the length. And for 2) I guess what we really want is reserve(round_up(len, 4k)); see the sketch below.

It also seems silly to shrink overflow_buf to anything less than kReadBufferSize. Well, unless there are a bunch of them, but there's just one, right?

Anyhow, if you're changing all of this around perhaps it's not worth getting into this further.
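A sketch of the reserve(round_up(len, 4k)) idea from above; RoundUp is a hypothetical helper and the surrounding flow is simplified:

  #include <algorithm>
  #include <cstddef>
  #include <string>

  const size_t kReadBufferSize = 4 * 1024;

  // Hypothetical helper: round n up to the next multiple of align, which
  // must be a power of two (4 KiB is).
  size_t RoundUp(size_t n, size_t align) {
    return (n + align - 1) & ~(align - 1);
  }

  // Reserve before appending (reversing steps 1 and 2 above), and never
  // reserve less than one read chunk.
  void ReserveForMessage(std::string& overflow_buf, size_t message_length) {
    overflow_buf.reserve(
        RoundUp(std::max(message_length, kReadBufferSize), kReadBufferSize));
  }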
Whiteboard: btpp-active
(Assignee)

Comment 11

3 years ago
Created attachment 8747130 [details] [diff] [review]
Part 1: Check max message size before resizing
Attachment #8747130 - Flags: review?(wmccloskey)
(Assignee)

Comment 12

3 years ago
Created attachment 8747131 [details] [diff] [review]
Part 2: Stop sending messages that are too large
Attachment #8747131 - Flags: review?(wmccloskey)
(Assignee)

Comment 13

3 years ago
Created attachment 8747132 [details] [diff] [review]
Part 3: Reduce the maximum IPC message size

This reduces the maximum message size to 128MiB which matches what is done
upstream. The goal is to help reduce OOMs due to overly large messages.
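The change itself amounts to a one-line constant update in ipc_channel.h; the old 256 MiB value is inferred from the later mention of the "256->128 change":

  // ipc/chromium/src/chrome/common/ipc_channel.h
  // Previously:
  //   static const size_t kMaximumMessageSize = 256 * 1024 * 1024;
  static const size_t kMaximumMessageSize = 128 * 1024 * 1024;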
Attachment #8747132 - Flags: review?(wmccloskey)
(Assignee)

Comment 14

3 years ago
Bill, I understand if you'd rather not land these, given how much they overlap with what you're doing; I stuck with the simplest changes. Otherwise, maybe you can incorporate them into what you have.
Attachment #8747130 - Flags: review?(wmccloskey) → review+
Comment 15

3 years ago
Comment on attachment 8747131 [details] [diff] [review]
Part 2: Stop sending messages that are too large

Review of attachment 8747131 [details] [diff] [review]:
-----------------------------------------------------------------

Could you instead stick a check in MessageLink that will crash if the message is too big?
http://searchfox.org/mozilla-central/source/ipc/glue/MessageLink.cpp#174
That way, we'll get a stack telling us exactly who sent the message.
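A sketch of roughly what that check could look like; MOZ_RELEASE_ASSERT is the real Gecko macro, but the placement, includes, and exact condition here are approximations:

  #include "chrome/common/ipc_channel.h"  // IPC::Channel::kMaximumMessageSize
  #include "chrome/common/ipc_message.h"  // IPC::Message
  #include "mozilla/Assertions.h"         // MOZ_RELEASE_ASSERT

  // Hypothetical shape of the check inside ProcessLink::SendMessage
  // (ipc/glue/MessageLink.cpp): crashing on the *sending* side means the
  // crash report's stack names whoever built the oversized message.
  void CheckMessageSize(const IPC::Message* msg) {
    MOZ_RELEASE_ASSERT(msg->size() <= IPC::Channel::kMaximumMessageSize,
                       "IPC message size is too large");
  }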
Attachment #8747131 - Flags: review?(wmccloskey)
Attachment #8747132 - Flags: review?(wmccloskey) → review+
(Assignee)

Comment 16

3 years ago
Created attachment 8748945 [details] [diff] [review]
Part 2: Stop sending messages that are too large

This adds an assert rather than just failing to send. I'm not sure if we would have just asserted on failing to send previously; I think we probably just closed the connection.
Attachment #8748945 - Flags: review?(wmccloskey)
(Assignee)

Updated

3 years ago
Attachment #8747131 - Attachment is obsolete: true
Comment on attachment 8748945 [details] [diff] [review]
Part 2: Stop sending messages that are too large

Review of attachment 8748945 [details] [diff] [review]:
-----------------------------------------------------------------

Thanks Eric.
Attachment #8748945 - Flags: review?(wmccloskey) → review+

Comment 20

3 years ago
bugherder
https://hg.mozilla.org/mozilla-central/rev/9d579653d733
https://hg.mozilla.org/mozilla-central/rev/5cdbf605f972
https://hg.mozilla.org/mozilla-central/rev/bd5c9cfed83d
Status: NEW → RESOLVED
Last Resolved: 3 years ago
status-firefox49: affected → fixed
Resolution: --- → FIXED
Target Milestone: --- → mozilla49

Updated

3 years ago
tracking-e10s: --- → +
Priority: -- → P1

Updated

3 years ago
Blocks: 1273258

Comment 21

3 years ago
It looks like the above commits caused a crash regression; see https://bugzilla.mozilla.org/show_bug.cgi?id=1273258.

Updated

3 years ago
No longer blocks: 1273258
Depends on: 1273258

Updated

3 years ago
Depends on: 1274074

Updated

3 years ago
Depends on: 1274075
Depends on: 1277744

Comment 22

2 years ago
Bill, should we uplift this to 48?
status-firefox46: --- → wontfix
status-firefox47: --- → wontfix
status-firefox48: --- → affected
Flags: needinfo?(wmccloskey)
Comment 23

2 years ago
I don't think so. It seems to have caused problems for some people. I'm not sure if that's because of the 256->128 change or something else.
Flags: needinfo?(wmccloskey)

Comment 24

2 years ago
Hi, I'm not sure this crash is fixed. Crashes with the signature [@ OOM | large | NS_ABORT_OOM | Buffer::try_realloc] were continuing on Nightly after the patch in this bug landed.
Also, the signature wasn't showing up in builds before 47. In early crash data this issue is now at #20 with 0.5% of all crashes.
status-firefox46: wontfix → unaffected
Keywords: regression
(Reporter)

Comment 25

2 years ago
This was not intended to fix all crashes with this signature, only those where the OOM allocation size is very large (say, over 1GB). It does seem to have done that, though I only found two signatures on 38 (where this is not fixed), so it is hard to say what the rate of this crash really is.
status-firefox48: affected → wontfix

Comment 26

2 years ago
(In reply to [:philipp] from comment #24)
> In early crash data this issue is now at #20 with 0.5% of all crashes.
Oops, comment 24 should have said: in early crash data for 47.0, this signature is ~0.5% of all crashes.
Depends on: 1288997