Closed Bug 1235633 Opened 7 years ago Closed 6 years ago

[e10s] crash in OOM | unknown | NS_ABORT_OOM | mozilla::ipc::MessageChannel::OnMessageReceivedFromLink


(Core :: IPC, defect)

44 Branch
Windows NT
Not set



Tracking Status
e10s m9+ ---
firefox46 --- wontfix
firefox47 --- fixed
firefox48 --- fixed


(Reporter: tracy, Assigned: billm)



(Keywords: crash, topcrash-win, Whiteboard: [MemShrink:P1])

Crash Data


(1 file, 1 obsolete file)

This bug was filed from the Socorro interface and is 
report bp-00b84fbe-0ca0-4be2-b160-ccca52151219.

This crash is #10, with 132 crashes, on the 44beta a/b experiment.

Frame 	Module 	Signature 	Source
0 	xul.dll 	NS_ABORT_OOM(unsigned int) 	xpcom/base/nsDebugImpl.cpp
1 	xul.dll 	mozilla::ipc::MessageChannel::OnMessageReceivedFromLink(IPC::Message const&) 	ipc/glue/MessageChannel.cpp
2 	kernelbase.dll 	ReadFile 	
3 	kernel32.dll 	ReadFileImplementation 	
4 	kernel32.dll 	ReadFileImplementation 	
5 	nss3.dll 	_MD_CURRENT_THREAD 	nsprpub/pr/src/md/windows/w95thred.c
6 	xul.dll 	mozilla::ipc::ProcessLink::OnMessageReceived(IPC::Message const&) 	ipc/glue/MessageLink.cpp
7 	xul.dll 	IPC::Channel::ChannelImpl::ProcessIncomingMessages(base::MessagePumpForIO::IOContext*, unsigned long) 	ipc/chromium/src/chrome/common/
8 	xul.dll 	IPC::Channel::ChannelImpl::OnIOCompleted(base::MessagePumpForIO::IOContext*, unsigned long, unsigned long) 	ipc/chromium/src/chrome/common/
No longer blocks: 1234647
Blocks: 1234647
This is happening on the line:
IIRC, this is the point where a message is copied out of the raw buffer. I think this is just a fundamental problem where things that send huge messages are going to hit heap fragmentation issues.

These crash reports are also missing all of the nice information about system memory usage that you usually see (Total Virtual Memory, Available Virtual Memory, System Memory Use Percentage). Maybe that doesn't work in a content process? I'll file a bug about that.

This particular allocation should also get its size reported. I don't know if that's related to the previous issue or not.
Component: XPCOM → IPC
Bug 1156484 is in the same part of the code, except in that crash report the memory information (and allocation size) is showing up properly.
Depends on: 1236108
Blocks: 1246180
Also, why isn't the allocation size annotated for this case? (that's what the "unknown" in the signature means)
Ah, that seems to be bug 1236108. We really need to fix that to be able to analyze whether e10s is nearing ship-readiness.
renom'ing. Top crash in beta experiment. Poiru, can you indicate where this is in the top crash list?
Flags: needinfo?(birunthan)
(In reply to Brad Lassey [:blassey] (use needinfo?) from comment #5)
> renom'ing. Top crash in beta experiment. Poiru, can you indicate where this
> is in the top crash list?

This is #4 (2.08%) for content processes. See bug 1249209 comment 2 for full list.
Flags: needinfo?(birunthan)
Assignee: nobody → wmccloskey
Message's copy constructor goes to Pickle's copy constructor, which calls realloc() to get enough memory to store the payload. If realloc() fails, NS_ABORT_OOM() is called, and that is where this crash happens.

realloc() winds up calling imalloc(). Assuming the size is a large number, huge_malloc() is used. For chunk_alloc() to return null, both chunk_recycle() and chunk_alloc_mmap() have to fail. Note that chunk_alloc_mmap() goes to VirtualAlloc() on Windows. I wonder how it failed.
The most common cause of OOM with large allocations is a lack of contiguous address space. Because of bug 1236108, these crash reports do not contain information about contiguous address space or memory usage, so it is hard to be sure if that is happening here.
I wonder how hard it would be to get data on IPC protocols using large messages, broken out by IPDL actor class and message (or the low-level message ID, although the mapping from that back to something meaningful varies with IPDL changes).
I think we can avoid the copy here. I'm working on a patch.
(In reply to Bill McCloskey (:billm) from comment #10)
> I think we can avoid the copy here. I'm working on a patch.

That'd be awesome!
Attached patch patch (obsolete) — Splinter Review
How does this look, Jed? This patch changes the code so that the Message passed to OnMessageReceivedFromLink owns its data. Therefore we can just move it to the queue rather than copying it. The goal of the patch is to ensure that we only have one copy of the message data around at a time, at least for large messages.

I also added some code to properly size the buffer for the message once we have received enough data to know how big it will be. Right now we rely on std::string to do exponential growth, which is good, but not as good as allocating with the right size from the beginning.

If this looks okay, then I'll fix up the Windows code in the same way. I made the new Buffer class as much like std::string as possible so that it fits into both the POSIX and Windows implementations with few changes. Perhaps in the future we could do a better job sharing the code in these two classes, but I'm not sure it's worth it.
Attachment #8726014 - Flags: feedback?(jld)
Whiteboard: [MemShrink]
Whiteboard: [MemShrink] → [MemShrink:P1]
Comment on attachment 8726014 [details] [diff] [review]

Review of attachment 8726014 [details] [diff] [review]:

Sorry for the delay.  I haven't looked at this quite as much as for r?, but it seems reasonable and a definite improvement for the case where the first message is large and the rest of the buffer is small.

::: ipc/chromium/src/base/
@@ +92,5 @@
> +Buffer::trade_bytes(size_t count)
> +{
> +  char* result = mBuffer;
> +  mSize = mReserved = mSize - count;
> +  mBuffer = (char*)malloc(mReserved);

What happens if mSize == count?

@@ +95,5 @@
> +  mSize = mReserved = mSize - count;
> +  mBuffer = (char*)malloc(mReserved);
> +  memcpy(mBuffer, result + count, mSize);
> +  return result;

Could the buffer be realloc'ed down to the requested size here?  (Specifically: does jemalloc do something reasonable with that in cases where the part to be dropped is small?)  I'm concerned about the case where the remaining part of the buffer is also large, and possibility of making things worse in some other case while trying to improve this one.

::: ipc/glue/MessageChannel.cpp
@@ +638,5 @@
>      }
>  };
>  void
> +MessageChannel::OnMessageReceivedFromLink(Message& aMsg)

It would be nice if this could be an rvalue reference to avoid the Move(), or at least to move the Move() farther up the stack to where it's more obvious that the Message isn't used afterwards.

This would apply to the other const removals too, I think.
Attachment #8726014 - Flags: feedback?(jld) → feedback+
Depends on: 1257486
Assignee: wmccloskey → cyu
Depends on: 1256541
[@ OOM | large | NS_ABORT_OOM | Pickle::Pickle ] looks like the same issue: during the call to push_back in MessageChannel::OnMessageReceivedFromLink(), the copy constructor for Pickle() attempts to allocate a new block of memory and fails. The allocation sizes I see in the handful of reports on Nightly so far range from about 262KB to 1.3MB.
Crash Signature: [@ OOM | unknown | NS_ABORT_OOM | mozilla::ipc::MessageChannel::OnMessageReceivedFromLink] → [@ OOM | unknown | NS_ABORT_OOM | mozilla::ipc::MessageChannel::OnMessageReceivedFromLink] [@ OOM | large | NS_ABORT_OOM | Pickle::Pickle ]
Blocks: e10s-oom
No longer blocks: e10s-crashes
Assignee: cyu → wmccloskey
On further examination, most (but not all) of the crashes in the Pickle ctor are under PPluginScriptableObjectParent::CallInvoke(), in the parent process, so they aren't really e10s-specific. I'm not sure if I should split that into a separate bug or what.
Attached patch patchSplinter Review
I still don't have a green try run on Windows, but I think I'm close. This should be ready to review.
Attachment #8726014 - Attachment is obsolete: true
Attachment #8736506 - Flags: review?(jld)
See Also: → 1261094
Actually, the try push for this patch came back and it's green on Windows now.
Comment on attachment 8736506 [details] [diff] [review]

Review of attachment 8736506 [details] [diff] [review]:

Some minor things, one not-so-minor thing, but otherwise looks good.

::: ipc/chromium/src/base/
@@ +76,5 @@
> +void
> +Buffer::assign(const char* bytes, size_t length)
> +{
> +  if (bytes >= mBuffer && bytes < mBuffer + mReserved) {
> +    MOZ_RELEASE_ASSERT(length <= mSize);

Nit: this assertion could be stronger — bytes + length <= mBuffer + mSize should hold, I think?

@@ +91,5 @@
> +void
> +Buffer::erase(size_t start, size_t count)
> +{
> +  mSize -= count;
> +  memmove(mBuffer + start, mBuffer + start + count, mSize);

The third argument should be `mSize - start`.  (This won't have shown up in testing because the only call site passes 0 for start.)

@@ +110,5 @@
> +  char* result = mBuffer;
> +  mSize = mReserved = mSize - count;
> +  mBuffer = mReserved ? (char*)malloc(mReserved) : nullptr;
> +  MOZ_RELEASE_ASSERT(!mReserved || mBuffer);
> +  memcpy(mBuffer, result + count, mSize);

Nit: You might want to skip the memcpy if mBuffer is nullptr and mSize is 0.  This is one of those cases that's technically undefined behavior even though there's not much reason for it to do anything weird, but this part might be cleaner anyway with a single `if (mReserved)`.

::: ipc/chromium/src/chrome/common/
@@ +439,5 @@
> +          // overflow buffer.
> +          MOZ_RELEASE_ASSERT(p == overflowp);
> +          buf = input_overflow_buf_.trade_bytes(len);
> +
> +          // At this point the remaining data is at the from of

Typo nit: “at the front of”?
(Also in ipc_channel_win.)
Attachment #8736506 - Flags: review?(jld) → review+
Duplicate of this bug: 1156484
Closed: 6 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla48
Bill, should we uplift this fix to Aurora 47 in preparation for our Beta 47 test? This was the top e10s crash from our Beta 46 test.
Flags: needinfo?(wmccloskey)
Depends on: 1263292
Let's wait a few days to backport. This needs to stabilize. We have until 4/25, right?
Flags: needinfo?(wmccloskey)
Depends on: 1263457
Depends on: 1263763
Depends on: 1264398
Depends on: 1265036
Comment on attachment 8736506 [details] [diff] [review]

I'd like to get this into Aurora so that we get data for the next e10s experiment. It has been on Nightly for a week and I think we've shaken out all the bugs. It's not particularly configuration-dependent, so I wouldn't expect new problems to appear in other channels.

Approval Request Comment
[Feature/regressing bug #]: e10s
[User impact if declined]: more OOMs with e10s
[Describe test coverage new/current, TreeHerder]:
[Risks and why]: has been on Nightly for a while. risk is moderate and mostly restricted to e10s.
[String/UUID change made/needed]: none
Attachment #8736506 - Flags: approval-mozilla-aurora?
Comment on attachment 8736506 [details] [diff] [review]

Fixes some e10s OOMs, Aurora47+
Attachment #8736506 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Ryan, please let me land this. Setting needinfo so you see this :-).
Flags: needinfo?(wmccloskey)
Flags: needinfo?(ryanvm)
Flags: needinfo?(ryanvm) → needinfo?(cbook)
(I rarely do uplifts anymore, FWIW)
Flags: needinfo?(wkocher)
Bill, this has conflicts when uplifting, can you take a look, thanks!
Flags: needinfo?(wkocher)
Flags: needinfo?(cbook)
I believe that Bill's intent was for him to take care of it himself ;)
(In reply to Ryan VanderMeulen [:RyanVM] from comment #29)
> I believe that Bill's intent was for him to take care of it himself ;)

oh sorry, I misread this as meaning that we should check this in :)
Flags: needinfo?(wmccloskey)
Depends on: 1266578
Depends on: 1267106
Depends on: 1268616
Depends on: 1273258
Bill, just curious here: we are seeing around 300 crashes a week with this signature on 46.0.1 release. I wondered if we have a bunch of people running 46.0.1 with e10s enabled. I don't think so, so in that case, we may still be seeing this crash signature in some other context. Do you think it's worth filing a followup bug? Or might this be related to whatever is going on in bug 1273258?
Flags: needinfo?(wmccloskey)
Liz, that has nothing to do with e10s (although I'm seeing only about 40 crashes per week):

These are crashes in the plugin subsystem where somebody is passing a very large string to a scriptable plugin via JS; see e.g. the linked report, which is a representative example. I don't know if we can avoid crashing in this case and just reject the call back through MessageChannel::Call.
Flags: needinfo?(wmccloskey)
You need to log in before you can comment on or make changes to this bug.