Closed Bug 1235633 Opened 7 years ago Closed 6 years ago

[e10s] crash in OOM | unknown | NS_ABORT_OOM | mozilla::ipc::MessageChannel::OnMessageReceivedFromLink


(Core :: IPC, defect)

44 Branch
Windows NT
Not set



Tracking Status
e10s m9+ ---
firefox46 --- wontfix
firefox47 --- fixed
firefox48 --- fixed


(Reporter: tracy, Assigned: billm)



(Keywords: crash, topcrash-win, Whiteboard: [MemShrink:P1])

Crash Data


(1 file, 1 obsolete file)

This bug was filed from the Socorro interface and is 
report bp-00b84fbe-0ca0-4be2-b160-ccca52151219.

This crash is #10, with 132 crashes, on the 44beta a/b experiment.

Frame 	Module 	Signature 	Source
0 	xul.dll 	NS_ABORT_OOM(unsigned int) 	xpcom/base/nsDebugImpl.cpp
1 	xul.dll 	mozilla::ipc::MessageChannel::OnMessageReceivedFromLink(IPC::Message const&) 	ipc/glue/MessageChannel.cpp
2 	kernelbase.dll 	ReadFile 	
3 	kernel32.dll 	ReadFileImplementation 	
4 	kernel32.dll 	ReadFileImplementation 	
5 	nss3.dll 	_MD_CURRENT_THREAD 	nsprpub/pr/src/md/windows/w95thred.c
6 	xul.dll 	mozilla::ipc::ProcessLink::OnMessageReceived(IPC::Message const&) 	ipc/glue/MessageLink.cpp
7 	xul.dll 	IPC::Channel::ChannelImpl::ProcessIncomingMessages(base::MessagePumpForIO::IOContext*, unsigned long) 	ipc/chromium/src/chrome/common/
8 	xul.dll 	IPC::Channel::ChannelImpl::OnIOCompleted(base::MessagePumpForIO::IOContext*, unsigned long, unsigned long) 	ipc/chromium/src/chrome/common/
No longer blocks: 1234647
Blocks: 1234647
This is happening on the line:
IIRC, this is the point where a message is copied out of the raw buffer. I think this is just a fundamental problem where things that send huge messages are going to hit heap fragmentation issues.

These crash reports are also missing all of the nice information about system memory usage that you usually see (Total Virtual Memory, Available Virtual Memory, System Memory Use Percentage). Maybe that doesn't work in a content process? I'll file a bug about that.

This particular allocation should also get its size reported. I don't know if that's related to the previous issue or not.
Component: XPCOM → IPC
Bug 1156484 is in the same part of the code, except in that crash report the memory information (and allocation size) is showing up properly.
Depends on: 1236108
Blocks: 1246180
Also, why isn't the allocation size annotated for this case? (that's what the "unknown" in the signature means)
Ah, that seems to be bug 1236108. We really need to fix that to be able to analyze whether e10s is nearing ship-readiness.
renom'ing. Top crash in beta experiment. Poiru, can you indicate where this is in the top crash list?
Flags: needinfo?(birunthan)
(In reply to Brad Lassey [:blassey] (use needinfo?) from comment #5)
> renom'ing. Top crash in beta experiment. Poiru, can you indicate where this
> is in the top crash list?

This is #4 (2.08%) for content processes. See bug 1249209 comment 2 for full list.
Flags: needinfo?(birunthan)
Assignee: nobody → wmccloskey
Message's copy constructor goes to Pickle's copy constructor, which calls realloc() to get enough memory to store the payload. If realloc() fails, NS_ABORT_OOM() is called, and that is where this crash happens.

realloc() winds up calling imalloc(). Assuming the size is a large number, huge_malloc() is used. For chunk_alloc() to return null, both chunk_recycle() and chunk_alloc_mmap() have to fail. Note that chunk_alloc_mmap() goes to VirtualAlloc() on Windows. I wonder how it failed.
The most common cause of OOM with large allocations is a lack of contiguous address space. Because of bug 1236108, these crash reports do not contain information about contiguous address space or memory usage, so it is hard to be sure if that is happening here.
I wonder how hard it would be to get data on IPC protocols using large messages, broken out by IPDL actor class and message (or the low-level message ID, although the mapping from that back to something meaningful varies with IPDL changes).
I think we can avoid the copy here. I'm working on a patch.
(In reply to Bill McCloskey (:billm) from comment #10)
> I think we can avoid the copy here. I'm working on a patch.

That'd be awesome!
Attached patch patch (obsolete) — Splinter Review
How does this look, Jed? This patch changes the code so that the Message passed to OnMessageReceivedFromLink owns its data. Therefore we can just move it to the queue rather than copying it. The goal of the patch is to ensure that we only have one copy of the message data around at a time, at least for large messages.

I also added some code to properly size the buffer for the message once we have received enough data to know how big it will be. Right now we rely on std::string to do exponential growth, which is good, but not as good as allocating with the right size from the beginning.

If this looks okay, then I'll fix up the Windows code in the same way. I made the new Buffer class as much like std::string as possible so that it fits into both the POSIX and Windows implementations with few changes. Perhaps in the future we could do a better job sharing the code in these two classes, but I'm not sure it's worth it.
Attachment #8726014 - Flags: feedback?(jld)
Whiteboard: [MemShrink]
Whiteboard: [MemShrink] → [MemShrink:P1]
Comment on attachment 8726014 [details] [diff] [review]

Review of attachment 8726014 [details] [diff] [review]:

Sorry for the delay.  I haven't looked at this quite as much as for r?, but it seems reasonable and a definite improvement for the case where the first message is large and the rest of the buffer is small.

::: ipc/chromium/src/base/
@@ +92,5 @@
> +Buffer::trade_bytes(size_t count)
> +{
> +  char* result = mBuffer;
> +  mSize = mReserved = mSize - count;
> +  mBuffer = (char*)malloc(mReserved);

What happens if mSize == count?

@@ +95,5 @@
> +  mSize = mReserved = mSize - count;
> +  mBuffer = (char*)malloc(mReserved);
> +  memcpy(mBuffer, result + count, mSize);
> +  return result;

Could the buffer be realloc'ed down to the requested size here?  (Specifically: does jemalloc do something reasonable with that in cases where the part to be dropped is small?)  I'm concerned about the case where the remaining part of the buffer is also large, and possibility of making things worse in some other case while trying to improve this one.

::: ipc/glue/MessageChannel.cpp
@@ +638,5 @@
>      }
>  };
>  void
> +MessageChannel::OnMessageReceivedFromLink(Message& aMsg)

It would be nice if this could be an rvalue reference to avoid the Move(), or at least to move the Move() farther up the stack to where it's more obvious that the Message isn't used afterwards.

This would apply to the other const removals too, I think.
Attachment #8726014 - Flags: feedback?(jld) → feedback+
Depends on: 1257486
Assignee: wmccloskey → cyu
Depends on: 1256541
[@ OOM | large | NS_ABORT_OOM | Pickle::Pickle ] looks like the same issue: during the call to push_back in MessageChannel::OnMessageReceivedFromLink(), the copy constructor for Pickle() attempts to allocate a new block of memory and fails. The allocation sizes I see in the handful of reports on Nightly so far range from about 262KB to 1.3MB.
Crash Signature: [@ OOM | unknown | NS_ABORT_OOM | mozilla::ipc::MessageChannel::OnMessageReceivedFromLink] → [@ OOM | unknown | NS_ABORT_OOM | mozilla::ipc::MessageChannel::OnMessageReceivedFromLink] [@ OOM | large | NS_ABORT_OOM | Pickle::Pickle ]
Blocks: e10s-oom
No longer blocks: e10s-crashes
Assignee: cyu → wmccloskey
On further examination, most (but not all) of the crashes in the Pickle ctor are under PPluginScriptableObjectParent::CallInvoke(), in the parent process, so they aren't really e10s-specific. I'm not sure if I should split that into a separate bug or what.
Attached patch patchSplinter Review
I still don't have a green try run on Windows, but I think I'm close. This should be ready to review.
Attachment #8726014 - Attachment is obsolete: true
Attachment #8736506 - Flags: review?(jld)
See Also: → 1261094
Actually, the try push for this patch came back and it's green on Windows now.
Comment on attachment 8736506 [details] [diff] [review]

Review of attachment 8736506 [details] [diff] [review]:

Some minor things, one not-so-minor thing, but otherwise looks good.

::: ipc/chromium/src/base/
@@ +76,5 @@
> +void
> +Buffer::assign(const char* bytes, size_t length)
> +{
> +  if (bytes >= mBuffer && bytes < mBuffer + mReserved) {
> +    MOZ_RELEASE_ASSERT(length <= mSize);

Nit: this assertion could be stronger — bytes + length <= mBuffer + mSize should hold, I think?

@@ +91,5 @@
> +void
> +Buffer::erase(size_t start, size_t count)
> +{
> +  mSize -= count;
> +  memmove(mBuffer + start, mBuffer + start + count, mSize);

The third argument should be `mSize - start`.  (This won't have shown up in testing because the only call site passes 0 for start.)

@@ +110,5 @@
> +  char* result = mBuffer;
> +  mSize = mReserved = mSize - count;
> +  mBuffer = mReserved ? (char*)malloc(mReserved) : nullptr;
> +  MOZ_RELEASE_ASSERT(!mReserved || mBuffer);
> +  memcpy(mBuffer, result + count, mSize);

Nit: You might want to skip the memcpy if mBuffer is nullptr and mSize is 0.  This is one of those cases that's technically undefined behavior even though there's not much reason for it to do anything weird, but this part might be cleaner anyway with a single `if (mReserved)`.

::: ipc/chromium/src/chrome/common/
@@ +439,5 @@
> +          // overflow buffer.
> +          MOZ_RELEASE_ASSERT(p == overflowp);
> +          buf = input_overflow_buf_.trade_bytes(len);
> +
> +          // At this point the remaining data is at the from of

Typo nit: “at the front of”?
(Also in ipc_channel_win.)
Attachment #8736506 - Flags: review?(jld) → review+
Duplicate of this bug: 1156484
Closed: 6 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla48
Bill, should we uplift this fix to Aurora 47 in preparation for our Beta 47 test? This was the top e10s crash from our Beta 46 test.
Flags: needinfo?(wmccloskey)
Depends on: 1263292
Let's wait a few days to backport. This needs to stabilize. We have until 4/25, right?
Flags: needinfo?(wmccloskey)
Depends on: 1263457
Depends on: 1263763
Depends on: 1264398
Depends on: 1265036
Comment on attachment 8736506 [details] [diff] [review]

I'd like to get this into Aurora so that we get data for the next e10s experiment. It has been on Nightly for a week and I think we've shaken out all the bugs. It's not particularly configuration-dependent, so I wouldn't expect new problems to appear in other channels.

Approval Request Comment
[Feature/regressing bug #]: e10s
[User impact if declined]: more OOMs with e10s
[Describe test coverage new/current, TreeHerder]:
[Risks and why]: has been on Nightly for a while. risk is moderate and mostly restricted to e10s.
[String/UUID change made/needed]: none
Attachment #8736506 - Flags: approval-mozilla-aurora?
Comment on attachment 8736506 [details] [diff] [review]

Fixes some e10s OOMs, Aurora47+
Attachment #8736506 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Ryan, please let me land this. Setting needinfo so you see this :-).
Flags: needinfo?(wmccloskey)
Flags: needinfo?(ryanvm)
Flags: needinfo?(ryanvm) → needinfo?(cbook)
(I rarely do uplifts anymore, FWIW)
Flags: needinfo?(wkocher)
Bill, this has conflicts when uplifting, can you take a look, thanks!
Flags: needinfo?(wkocher)
Flags: needinfo?(cbook)
I believe that Bill's intent was for him to take care of it himself ;)
(In reply to Ryan VanderMeulen [:RyanVM] from comment #29)
> I believe that Bill's intent was for him to take care of it himself ;)

oh sorry, I misread this as meaning that we should check this in :)
Flags: needinfo?(wmccloskey)
Depends on: 1266578
Depends on: 1267106
Depends on: 1268616
Depends on: 1273258
Bill, just curious here: we are seeing around 300 crashes a week with this signature on 46.0.1 release. I wondered if we have a bunch of people running 46.0.1 with e10s enabled. I don't think so, so in that case, we may still be seeing this crash signature in some other context. Do you think it's worth filing a followup bug? Or might this be related to whatever is going on in bug 1273258?
Flags: needinfo?(wmccloskey)
Liz, that has nothing to do with e10s (although I'm seeing only about 40 crashes per week):

These are crashes in the plugin subsystem where somebody is passing a very large string to a scriptable plugin via JS; see e.g. the linked report, which is a representative example. I don't know if we can avoid crashing in this case and just reject the call back through MessageChannel::Call.
Flags: needinfo?(wmccloskey)
You need to log in before you can comment on or make changes to this bug.