Closed Bug 999351 Opened 6 years ago Closed 6 years ago

Some kind of IPC leak in b2g

Categories

(Core :: IPC, defect, P1)

ARM
Gonk (Firefox OS)
defect

Tracking

()

RESOLVED FIXED
mozilla32
blocking-b2g 1.3T+
Tracking Status
b2g-v1.3T --- fixed
b2g-v1.4 --- ?

People

(Reporter: ting, Assigned: ting)

References

Details

(Keywords: memory-footprint, memory-leak, perf, Whiteboard: [c=memory p= s=2014.05.09.t u=tarako] [MemShrink:P1])

Attachments

(5 files, 1 obsolete file)

Attached file about:memory
STR:
  1. Enable and connect WiFi
  2. Launch browser, open http://www.pinterest.com
  3. Monitor adb logcat for script error messages, such as:

       E/GeckoConsole( 1362): [JavaScript Warning: "Unknown property 'box-sizing'.  Declaration dropped." {file: "http://passets-ec.pinterest.com/webapp/style/app/mobile/bundle.e0b47832.css" line: 1 column: 22740 source: ...

  4. Browser, and Homescreen process may be killed if 3 is happened
  5. Check about:memory, heap-unclassified of b2g process will be around 2x~4x MB.
     From DMD report, the memory was allocated from Pickle::Resize() when
     OnMessageReceivedFromLink() calls mPending.push_back().

Reproduce rate:
  ~60%

Version:
  v1.3t
  gecko = a887469f184a0153e5e398a7878d50ed54353f9c
  gaia  = fed1cc94e76abccea85ae998329430bb46cb7e53

What happened:
  Content child send multiple script error IPC to b2g process contains the
  source line, which here it is a one line css with length 190049.

Things to clarify:
  - How is the script error happened, is it because of incomplete download?
  - Why the pending messages are not digested and free?
Attached file dmd-b2g
The dmd of b2g process comes together with attachment 8410140 [details].
I know we've discussed this problem in the MemShrink meetings before in relation to other bugs.  I can't find those bugs at the moment, though.

Weren't we going to truncate these strings at some reasonable length?
Whiteboard: [MemShrink]
(In reply to Ben Kelly [:bkelly] from comment #2)
> Weren't we going to truncate these strings at some reasonable length?

That should be helpful, but I'd like to figure out why the messages are kept in pending, and how is the script error triggered. It seems the css is not downloaded completely, but if this is the case, why the css is still parsed and processed.
Ah, I did not see that it was also being leaked.  Yea, that seems bad.
noming based on comment 4.
blocking-b2g: --- → 1.3T?
Keywords: footprint, mlk, perf
Whiteboard: [MemShrink] → [c=memory p= s= u=] [MemShrink]
Yeah, it sounds like we should backport that.

I'm removing Tarako from the subject, because 25MB of Pickle data is pretty awful anywhere on B2G.

Ben, the DMD log here shows that the parent process has 25MB of Pickle data.  One theory we discussed in the MemShrink meeting is that maybe the child process is sending us a number of huge messages, then we kill it and fail to clean up the pending messages.  Does that sound plausible?
Flags: needinfo?(bent.mozilla)
Summary: [Tarako] b2g process takes too much memory when there're script errors → b2g process takes too much memory when there're script errors
Whiteboard: [c=memory p= s= u=] [MemShrink] → [c=memory p= s= u=] [MemShrink:P1]
Just to clarify, there appear to be two problems occurring here.

- We're sending absurdly large messages. Hopefully backporting bug 976777 will fix that.

- We seem to be leaking some stuff in the parent process. So this bug should probably be about that, since bug 976777 already exists for the first problem. I've renamed it accordingly.
Summary: b2g process takes too much memory when there're script errors → Some kind of IPC leak in b2g
Attached file log of IPC handling
Processes:
  84   = b2g
  1009 = Browser

Troublesome channel:
  0x4039a030

Found:
  - The channel works fine at the beginning, message enqueued and dequeued, but
    start from line 7802, the messages are not dequeued anymore. For the pending
    script error, each one resizes to 380352 bytes, and the number of pending
    messages grows to 90 in the end.
  - From #8110, the DequeueOne task starts working, and reports channel error.
  - PContentParent::OnChannelError() is called in the end, but I didn't find
    code freeing pending messages from following functions, I'll keep look around.

      MessageChannel::OnNotifyMaybeChannelError(), and
      MessageChannel::NotifyMaybeChannelError()
(In reply to Ting-Yu Chou [:ting] from comment #9)
>   - PContentParent::OnChannelError() is called in the end, but I didn't find
>     code freeing pending messages

The memory should be freed from Pickle destructor, but it is not called, neither ~MessageChannel.
triage: 1.3T+ to avoid mem leak on tarako
blocking-b2g: 1.3T? → 1.3T+
The ContentParent is instantiated from ContentParent::RecvAddNewProcess(), but its reference count is not 0 after ContentParent::OnChannelError() finishes.
(In reply to Ting-Yu Chou [:ting] from comment #12)
> The ContentParent is instantiated from ContentParent::RecvAddNewProcess(),
> but its reference count is not 0 after ContentParent::OnChannelError()
> finishes.

It seems PreallocatedProcessManager::PublishSpareProcess() and PreallocatedProcessManager::MaybeForgetSpare() update the reference count, am trying to understand how.
Component: General → IPC
Product: Firefox OS → Core
Assignee: nobody → tchou
Priority: -- → P1
It looks to me like once we have a channel error (like, child process dies) we don't immediately do anything with additional messages that were pending from that child. They stay queued and eventually get freed when the channel is destroyed. Since ContentParent is reference counted that can keep the channel alive longer than needed.

I would suggest trying to clear mPending, mPendingUrgentRequest, mPendingRPCCall, mDeferred, and mOutOfTurnReplies in MessageChannel::Clear(). That should get you memory back sooner.
Flags: needinfo?(bent.mozilla)
Is ContentParent expected to be freed after its DelayedDeleteContentParentTask dies?
No, it is reference-counted and held alive by lots of things... I don't think we can guarantee its destruction time.
I was trying to make sure the references from Init() and PublishSpareProcess() are released after ContentParent::OnChannelError() finishes, seems I don't have to check it as we can't guarantee DelayedDeleteContentParentTask is the last referece?

  ContentParent::RecvAddNewProcess(const uint32_t& aPid,
                                   const InfallibleTArray<ProtocolFdMapping>& aFds)
  {
      nsRefPtr<ContentParent> content;
      content = new ContentParent(...);
      content->Init();
      PreallocatedProcessManager::PublishSpareProcess(content);
      return true;
  }
Yeah, I don't think it's worth trying to hunt down all references to ContentParent here. I know of several, some even attached to reference-counted DOM objects...

Let's just try clearing the MessageChannel properly like I suggested in comment 14. If that doesn't work or isn't enough then we can start worrying about the other refs I guess.
Ben, I have created a patch as you suggested, please let me know if I should ask someone else for reviewing.

I'll still try to understand why the css is broken and parsed, as it prevents user from browsering.
Attachment #8411627 - Flags: review?(bent.mozilla)
BTW, I found Init() will be done twice on a ContentParent instance if it is from preallocated process manager in ContentParent::GetNewOrUsed(), is this expected? Note Init() increases reference count.
Comment on attachment 8411627 [details] [diff] [review]
WIP - Clear MessageChannel resource before ContentParent dies

Review of attachment 8411627 [details] [diff] [review]:
-----------------------------------------------------------------

This looks like I expected, but it sounds like it breaks something? We should get a handle on that before checking this in.

::: ipc/glue/MessageChannel.cpp
@@ +284,5 @@
>          mChannelErrorTask = nullptr;
>      }
> +
> +    // Clear resource now since ContentParent is reference counted and may stay
> +    // alive longer than needed.

This isn't a problem specific to ContentParent so I wouldn't mention it. Just say:

  // Free up any memory used by pending messages.

@@ +290,5 @@
> +    mPendingUrgentRequest = nullptr;
> +    mPendingRPCCall = nullptr;
> +    mOutOfTurnReplies.clear();
> +    while (!mDeferred.empty())
> +        mDeferred.pop();

Please add braces here.
Attachment #8411627 - Flags: review?(bent.mozilla)
Ben, I updated the patch as you commented, please help review again.

For the "broken css", sorry I made you confused, it is not about this patch, but how I ran into the leak. The patch fixes the leak, but not the "broken css", so I was just want to let other people know I will still try to figure out why the css is broken. I'll be careful not to mention things cause interference next time.
Attachment #8411627 - Attachment is obsolete: true
Attachment #8412339 - Flags: review?(bent.mozilla)
Comment on attachment 8412339 [details] [diff] [review]
WIPv2 - Clear MessageChannel resource before ContentParent dies

Review of attachment 8412339 [details] [diff] [review]:
-----------------------------------------------------------------

Ok, this looks good, thanks!
Attachment #8412339 - Flags: review?(bent.mozilla) → review+
Try results: https://tbpl.mozilla.org/?tree=Try&rev=4fccea6c3166

Following cases are failed with timout but without known bug, how should I check them?

  Linux Debug: bc1
  Linux x64 Opt: bc3 oth
  Linux x64 Debug: bc1
  Android 2.2 Armv6 Opt: X
Some of them look like known intermittent failures.  You can certainly check if the failure is intermittent by retriggering that particular test (click the blue "+" icon in TBPL).
Thanks dbaron. Yes, they're passed after retriggering. I guess I still need to finish fllowing things to have the patch landed:

  1. Star the remaining failed cases
  2. Update the patch with commit log r=bent, and maybe a more detailed explanation
  3. Add tag checkin-needed

Please correct me if I am wrong.
You don't really need to add the r=bent - the sheriff doing the landing will do that for you - and starring try isn't really required unless you pushed on top of a very orange changeset (and you already have passing retriggers).

For the patch description, on the first line you want something describing the goal of the patch*, and something about how if it's short enough - in this case something like "Bug 999351 - Fix an IPC leak by clearing MessageChannel resource before ContentParent dies." On subsequent lines you can put a more detailed description if you think it's warranted (these won't show in the commit log), but remember to wrap to 80 characters (this doesn't apply to the first line).

* this doesn't make as much sense in bugs with multiple patches - there you probably just want to describe what each patch does.
> You don't really need to add the r=bent - the sheriff doing the landing will
> do that for you

I bet the sheriffs prefer patches that require no modification.

> - and starring try isn't really required

But it can be useful, as the results come in, to keep track of which oranges you've already looked at.
IIRC they use a script to add the reviewer automatically, but noted :)
I have tried to star the failed cases, note I mistakenly star Bug 982100 on B2G ICS Emultator Opt R1.
Keywords: checkin-needed
It's great to see this parent process leak diagnosed and fixed.  I think I had seen signs of this problem a number of times, but I didn't piece together what was happening or realize the severity.  Thanks.
Thank you all for helping me go through the process. :)

# BTW, I coulnd't reproduce the broken CSS and script errors on today's Tarako build, so I'll just stop hunting it down.

    gecko: e0401abed8c2cf397b5963b23fecb7fd3b9b7c54
    gaia:  8895b180ed636069473703d0e7b73086989601ce
https://hg.mozilla.org/mozilla-central/rev/78e0fd0ed33f
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla32
We need the patch also land on b2g v1.3t, anything else I should do to make it happen?
(In reply to Ting-Yu Chou [:ting] from comment #20)
> BTW, I found Init() will be done twice on a ContentParent instance if it is
> from preallocated process manager in ContentParent::GetNewOrUsed(), is this
> expected? Note Init() increases reference count.

Filed bug 1002527 for this.
See Also: → 1002527
Whiteboard: [c=memory p= s= u=] [MemShrink:P1] → [c=memory p= s=2014.05.09.t u=tarako] [MemShrink:P1]
You need to log in before you can comment on or make changes to this bug.