Closed Bug 1268616 Opened 9 years ago Closed 9 years ago

Very large allocations in call to input_overflow_buf_.reserve()

Categories

(Core :: IPC, defect, P1)


Tracking


RESOLVED FIXED
mozilla49
Tracking Status
e10s + ---
firefox46 --- unaffected
firefox47 --- wontfix
firefox48 --- wontfix
firefox49 --- fixed

People

(Reporter: mccr8, Assigned: erahm)

References

Details

(Keywords: regression, Whiteboard: btpp-active)

Crash Data

Attachments

(3 files, 1 obsolete file)

I've come across 7 OOM crashes on this line:

  input_overflow_buf_.reserve(length + kReadBufferSize);

...where try_realloc() is attempting to allocate over 1GB. Here are two examples:

4GB: https://crash-stats.mozilla.com/report/index/dffd132e-0d84-4dfa-a878-071802160428
2GB: https://crash-stats.mozilla.com/report/index/4ff040cf-0d63-4863-9d33-400f92160428

I have no idea what is going wrong to cause this.
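For context, here is a minimal sketch of the read-path pattern involved, with simplified names (AppendIncomingData is hypothetical, and the overflow buffer is modeled as a std::string): the reserve size is computed directly from a length field pulled out of the message header, so a corrupt or hostile length turns into a multi-gigabyte allocation request.

  #include <cstddef>
  #include <cstdint>
  #include <string>

  static const size_t kReadBufferSize = 4 * 1024;  // the channel reads in 4K chunks

  // header_length comes straight off the wire; if it is garbage (e.g. ~4GB),
  // the reserve() below is the oversized allocation request.
  void AppendIncomingData(std::string& input_overflow_buf,
                          const char* data, size_t data_len,
                          uint32_t header_length) {
    input_overflow_buf.append(data, data_len);
    input_overflow_buf.reserve(header_length + kReadBufferSize);
  }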
Crash Signature: [@ OOM | large | NS_ABORT_OOM | Buffer::try_realloc ]
The length in the message must be corrupted somehow. We probably shouldn't be so trusting of it. I think it would make sense to limit length to 32MB or something. If the message is actually bigger, it just means we'll have to grow the buffer.
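A rough sketch of the clamping suggested above (the 32MB cap is the example figure from this comment, not a value from the tree; ReserveForIncomingMessage is a hypothetical helper): cap how much of the claimed length is trusted for the up-front reserve, and let the buffer grow as data actually arrives if the message really is bigger.

  #include <algorithm>
  #include <cstddef>
  #include <string>

  static const size_t kMaxTrustedReserve = 32 * 1024 * 1024;  // 32MB cap
  static const size_t kReadBufferSize = 4 * 1024;

  void ReserveForIncomingMessage(std::string& input_overflow_buf,
                                 size_t claimed_length) {
    // Never hand the wire-supplied length directly to reserve(); cap it.
    size_t trusted = std::min(claimed_length, kMaxTrustedReserve);
    input_overflow_buf.reserve(trusted + kReadBufferSize);
    // If the message honestly exceeds the cap, later append() calls simply
    // reallocate incrementally as the remaining chunks are read.
  }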
Eric said he'd look at this.
Assignee: nobody → erahm
Several issues here:

#1) kMaximumMessageSize is too large [1]
#2) We check the size *after* growing the buffer [2], rather than before [3] (see the sketch below)
#3) We don't enforce the max message size when *sending*
#4) The steal-the-buffer-if-the-message-is-big insanity is unsurprisingly broken in the more-than-one-message-in-a-buffer case [4]. We steal the buffer and assume the message was at the front. But that assumption is incorrect if there happened to have been a smaller message first. This would result in a malformed message as well as the next message in the buffer pointing to junk.
#5) The nesting and density of this code probably led to #4

[1] https://dxr.mozilla.org/mozilla-central/rev/86730d0a82093d705e44f33a34973d28b269f1ea/ipc/chromium/src/chrome/common/ipc_channel.h#51
[2] https://dxr.mozilla.org/mozilla-central/rev/86730d0a82093d705e44f33a34973d28b269f1ea/ipc/chromium/src/chrome/common/ipc_channel_win.cc#358
[3] https://dxr.mozilla.org/mozilla-central/rev/86730d0a82093d705e44f33a34973d28b269f1ea/ipc/chromium/src/chrome/common/ipc_channel_win.cc#358
[4] https://dxr.mozilla.org/mozilla-central/rev/86730d0a82093d705e44f33a34973d28b269f1ea/ipc/chromium/src/chrome/common/ipc_channel_win.cc#388-408
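A sketch of the ordering fix for #2, under simplifying assumptions (GrowOverflowBuffer is a hypothetical helper, and the limit shown is the 128MiB value the later patch lands): validate the claimed length before growing the buffer, so the oversized reserve is never attempted.

  #include <cstddef>
  #include <string>

  static const size_t kMaximumMessageSize = 128 * 1024 * 1024;

  // Returns false if the claimed message length is absurd; the caller should
  // treat that as a protocol error and drop the connection.
  bool GrowOverflowBuffer(std::string& input_overflow_buf,
                          size_t message_length, size_t read_buffer_size) {
    // Check first...
    if (message_length > kMaximumMessageSize) {
      input_overflow_buf.clear();
      return false;
    }
    // ...then grow. The existing code performs these steps in the opposite
    // order, so the huge reserve happens before the size check.
    input_overflow_buf.reserve(message_length + read_buffer_size);
    return true;
  }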
I guess there's technically another, although I think we always kill the connection if |ProcessIncomingMessages| fails.

#6) We clear the buffer if the message was too big, but we leave the pipe with a potentially half-read message [1]

[1] https://dxr.mozilla.org/mozilla-central/rev/86730d0a82093d705e44f33a34973d28b269f1ea/ipc/chromium/src/chrome/common/ipc_channel_win.cc#358-362
(In reply to Eric Rahm [:erahm] from comment #3)
> Several issues here:
>
> #1) kMaximumMessageSize is too large [1]

It is sort of big, but I'm not convinced we should make it smaller. Any reason?

> #2) We check the size *after* growing the buffer [2], rather than before [3]
> #3) We don't enforce the max message size when *sending*
> #4) The steal-the-buffer-if-the-message-is-big insanity is unsurprisingly
> broken in the more than one message in a buffer case [4]. We steal the
> buffer and assume the message was at the front. But that assumption is
> incorrect if there happened to have been a smaller message first. This would
> result in a malformed message as well as the next message in the buffer
> pointing to junk.

I don't think that can happen. See the comment:
https://dxr.mozilla.org/mozilla-central/rev/86730d0a82093d705e44f33a34973d28b269f1ea/ipc/chromium/src/chrome/common/ipc_channel_win.cc#393

> #5) The nesting and density of this code probably led to #4

This code is going to change significantly in bug 1262671, so please hold off on any big cleanups for now. In general I agree the code is overly complicated, though. Especially the duplication between the posix and windows implementations.
#7) If we had a partial message on the previous iteration, for some reason we grow the overflow buffer by kReadBufferSize (4K) [1], but then we just truncate it further down [2].

[1] https://dxr.mozilla.org/mozilla-central/rev/86730d0a82093d705e44f33a34973d28b269f1ea/ipc/chromium/src/chrome/common/ipc_channel_win.cc#374
[2] https://dxr.mozilla.org/mozilla-central/rev/86730d0a82093d705e44f33a34973d28b269f1ea/ipc/chromium/src/chrome/common/ipc_channel_win.cc#441
(In reply to Eric Rahm [:erahm] from comment #7)
> #7) If we had a partial message on the previous iteration for some reason we
> grow the overflow buffer by kReadBufferSize (4K) [1], but then we just
> truncate it further down [2].

In [1], we're sizing the buffer to the full size of the message plus some slop in case there are any other messages read in at the same time. That happens the first time we start reading a big message. We shouldn't resize the buffer down again until that message is processed. That's why the if condition is there in [2].
(In reply to Bill McCloskey (:billm) from comment #6)
> (In reply to Eric Rahm [:erahm] from comment #3)
> > Several issues here:
> >
> > #1) kMaximumMessageSize is too large [1]
>
> It is sort of big, but I'm not convinced we should make it smaller. Any
> reason?

OOMs, as mccr8 noted it's less in upstream, but if we're going to segment things maybe it doesn't matter as much.

> > #4) The steal-the-buffer-if-the-message-is-big insanity is unsurprisingly
> > broken in the more than one message in a buffer case [4]. We steal the
> > buffer and assume the message was at the front. But that assumption is
> > incorrect if there happened to have been a smaller message first. This would
> > result in a malformed message as well as the next message in the buffer
> > pointing to junk.
>
> I don't think that can happen. See the comment:
> https://dxr.mozilla.org/mozilla-central/rev/
> 86730d0a82093d705e44f33a34973d28b269f1ea/ipc/chromium/src/chrome/common/
> ipc_channel_win.cc#393

Ah right, a completed large message couldn't come after another message due to the size being greater than the read buffer. Looks like the posix version has a release assert for that.

> > #5) The nesting and density of this code probably led to #4
>
> This code is going to change significantly in bug 1262671, so please hold
> off on any big cleanups for now. In general I agree the code is overly
> complicated, though. Especially the duplication between the posix and
> windows implementations.

Sure, I can revert that (I just added some newlines and // end if/while/for (foo) comments).

(In reply to Bill McCloskey (:billm) from comment #8)
> (In reply to Eric Rahm [:erahm] from comment #7)
> > #7) If we had a partial message on the previous iteration for some reason we
> > grow the overflow buffer by kReadBufferSize (4K) [1], but then we just
> > truncate it further down [2].
>
> In [1], we're sizing the buffer to the full size of the message plus some
> slop in case there are any other messages read in at the same time. That
> happens the first time we start reading a big message. We shouldn't resize
> the buffer down again until that message is processed. That's why the if
> condition is there in [2].

Yeah, the flow here is a bit weird:

#1) We append the message data just received
#2) We reserve enough memory for the full message (we might already have a full message) + 4K
#3) If we get a full message, we process it, truncate the buffer

It seems like instead we'd want to reverse 1) and 2) if we have enough data to get the length. And for 2) I guess what we really want is reserve(round_up(len, 4k)) (see the sketch below). It also seems silly to shrink overflow_buf to anything less than kReadBufferSize. Well, unless there are a bunch of them, but there's just one, right? Anyhow, if you're changing all of this around perhaps it's not worth getting into this further.
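A compact sketch of the reservation policy floated in that last paragraph (RoundUp and ReserveForMessage are hypothetical helpers; rounding to the 4K read granularity and never shrinking below kReadBufferSize are suggestions from this comment, not landed code):

  #include <algorithm>
  #include <cstddef>
  #include <string>

  static const size_t kReadBufferSize = 4 * 1024;

  static size_t RoundUp(size_t n, size_t multiple) {
    return ((n + multiple - 1) / multiple) * multiple;
  }

  // Reserve the whole message rounded up to the read granularity, rather than
  // full length + a flat 4K of slop, and keep at least one read buffer's worth
  // of capacity since there is only a single overflow buffer per channel.
  void ReserveForMessage(std::string& overflow_buf, size_t full_message_length) {
    size_t target = std::max(RoundUp(full_message_length, kReadBufferSize),
                             kReadBufferSize);
    overflow_buf.reserve(target);
  }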
Whiteboard: btpp-active
Attachment #8747130 - Flags: review?(wmccloskey)
Attachment #8747131 - Flags: review?(wmccloskey)
This reduces the maximum message size to 128MiB, which matches what is done upstream. The goal is to help reduce OOMs due to overly large messages.
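In code terms, the patch amounts to lowering the limit in ipc_channel.h to something like the following (only the constant is shown as a sketch; the exact declaration in the tree is not reproduced here):

  static const size_t kMaximumMessageSize = 128 * 1024 * 1024;  // 128 MiB, matching upstream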
Attachment #8747132 - Flags: review?(wmccloskey)
Bill, I understand if you'd rather not land these if they overlap too much with what you're doing; I stuck with the simplest changes. If you don't land them, maybe you can incorporate them into what you have.
Attachment #8747130 - Flags: review?(wmccloskey) → review+
Comment on attachment 8747131 [details] [diff] [review]
Part 2: Stop sending messages that are too large

Review of attachment 8747131 [details] [diff] [review]:
-----------------------------------------------------------------

Could you instead stick a check in MessageLink that will crash if the message is too big?
http://searchfox.org/mozilla-central/source/ipc/glue/MessageLink.cpp#174

That way, we'll get a stack telling us exactly who sent the message.
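A sketch of the send-side check being requested, with hypothetical names (CheckOutgoingMessageSize is not a real function in the tree; the actual send path is in ipc/glue/MessageLink.cpp): crashing at send time means the crash report's stack identifies whoever built the oversized message.

  #include <cstddef>
  #include <cstdio>
  #include <cstdlib>

  static const size_t kMaximumMessageSize = 128 * 1024 * 1024;

  void CheckOutgoingMessageSize(size_t serialized_size) {
    if (serialized_size > kMaximumMessageSize) {
      // Stand-in for a release assert (e.g. MOZ_CRASH): abort at the send
      // site rather than silently failing, so the sender shows up in the stack.
      fprintf(stderr, "IPC message too large: %zu bytes\n", serialized_size);
      abort();
    }
  }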
Attachment #8747131 - Flags: review?(wmccloskey)
Attachment #8747132 - Flags: review?(wmccloskey) → review+
This adds an assert rather than just failing to send. I'm not sure if we would have just asserted on failing to send previously; I think we probably just closed the connection.
Attachment #8748945 - Flags: review?(wmccloskey)
Attachment #8747131 - Attachment is obsolete: true
Comment on attachment 8748945 [details] [diff] [review]
Part 2: Stop sending messages that are too large

Review of attachment 8748945 [details] [diff] [review]:
-----------------------------------------------------------------

Thanks Eric.
Attachment #8748945 - Flags: review?(wmccloskey) → review+
tracking-e10s: --- → +
Priority: -- → P1
Blocks: 1273258
It looks like the above commits have caused a crash regression; see https://bugzilla.mozilla.org/show_bug.cgi?id=1273258.
No longer blocks: 1273258
Depends on: 1273258
Depends on: 1274074
Depends on: 1274075
Depends on: 1277744
Bill, should we uplift this to 48?
Flags: needinfo?(wmccloskey)
I don't think so. It seems to have caused problems for some people. I'm not sure if that's because of the 256->128 change or something else.
Flags: needinfo?(wmccloskey)
Hi, I'm not sure if this crash is fixed. Crashes with the signature [@ OOM | large | NS_ABORT_OOM | Buffer::try_realloc] were continuing on Nightly after the patch in this bug landed. Also, the signature wasn't showing up in builds before 47. In early crash data this issue is now on #20 with 0.5% of all crashes.
This was not intended to fix all crashes with this signature, only those where the OOM allocation size is very large (say, over 1GB). It does seem to have done that, though I only found two signatures on 38 (where this is not fixed), so it is hard to say what the rate of this crash really is.
(In reply to [:philipp] from comment #24)
> in early crash data this issue is now on #20 with 0.5% of all crashes.

Oops, comment 24 should have said: in early crash data for 47.0 this signature is ~0.5% of all crashes.
Depends on: 1288997