Closed Bug 999351 Opened 11 years ago Closed 11 years ago

Some kind of IPC leak in b2g

Tracking

()

Status:

RESOLVED FIXED

Milestone:

mozilla32

Project Flags:

blocking-b2g

1.3T+

Tracking Flags:

Tracking

Status

b2g-v1.3T

---

fixed

b2g-v1.4

---

People

(Reporter: ting, Assigned: ting)

References

Details

(Keywords: memory-footprint, memory-leak, perf, Whiteboard: [c=memory p= s=2014.05.09.t u=tarako] [MemShrink:P1])

Attachments

(5 files, 1 obsolete file)

about:memory 11 years ago Ting-Yu Chou [:ting] (away) 30.07 KB, text/plain		Details
dmd-b2g 11 years ago Ting-Yu Chou [:ting] (away) 1005.34 KB, application/gzip		Details
log of IPC handling 11 years ago Ting-Yu Chou [:ting] (away) 792.90 KB, text/plain		Details
WIP - Clear MessageChannel resource before ContentParent dies 11 years ago Ting-Yu Chou [:ting] (away) 1.16 KB, patch		Details \| Diff \| Splinter Review
WIPv2 - Clear MessageChannel resource before ContentParent dies 11 years ago Ting-Yu Chou [:ting] (away) 1.11 KB, patch	bent.mozilla : review+	Details \| Diff \| Splinter Review
Fix an IPC leak by clearing MessageChannel resource before destruction. r=bent 11 years ago Ting-Yu Chou [:ting] (away) 1.41 KB, patch		Details \| Diff \| Splinter Review

Ting-Yu Chou [:ting] (away)

Assignee

Description

•

11 years ago

Attached file about:memory — Details

STR: 1. Enable and connect WiFi 2. Launch browser, open http://www.pinterest.com 3. Monitor adb logcat for script error messages, such as: E/GeckoConsole( 1362): [JavaScript Warning: "Unknown property 'box-sizing'. Declaration dropped." {file: "http://passets-ec.pinterest.com/webapp/style/app/mobile/bundle.e0b47832.css" line: 1 column: 22740 source: ... 4. Browser, and Homescreen process may be killed if 3 is happened 5. Check about:memory, heap-unclassified of b2g process will be around 2x~4x MB. From DMD report, the memory was allocated from Pickle::Resize() when OnMessageReceivedFromLink() calls mPending.push_back(). Reproduce rate: ~60% Version: v1.3t gecko = a887469f184a0153e5e398a7878d50ed54353f9c gaia = fed1cc94e76abccea85ae998329430bb46cb7e53 What happened: Content child send multiple script error IPC to b2g process contains the source line, which here it is a one line css with length 190049. Things to clarify: - How is the script error happened, is it because of incomplete download? - Why the pending messages are not digested and free?

Ting-Yu Chou [:ting] (away)

Assignee

Comment 1

•

11 years ago

Attached file dmd-b2g — Details

The dmd of b2g process comes together with attachment 8410140 [details].

Ben Kelly [:bkelly, not reviewing]

Comment 2

•

11 years ago

I know we've discussed this problem in the MemShrink meetings before in relation to other bugs. I can't find those bugs at the moment, though. Weren't we going to truncate these strings at some reasonable length?

Whiteboard: [MemShrink]

Ting-Yu Chou [:ting] (away)

Assignee

Comment 3

•

11 years ago

(In reply to Ben Kelly [:bkelly] from comment #2) > Weren't we going to truncate these strings at some reasonable length? That should be helpful, but I'd like to figure out why the messages are kept in pending, and how is the script error triggered. It seems the css is not downloaded completely, but if this is the case, why the css is still parsed and processed.

Ben Kelly [:bkelly, not reviewing]

Comment 4

•

11 years ago

Ah, I did not see that it was also being leaked. Yea, that seems bad.

Naoki Hirata :nhirata (please use needinfo instead of cc)

Comment 5

•

11 years ago

noming based on comment 4.

blocking-b2g: --- → 1.3T?

status-b2g-v1.3T: --- → affected

Keywords: footprint, mlk, perf

Whiteboard: [MemShrink] → [c=memory p= s= u=] [MemShrink]

David Baron :dbaron: (⌚️UTC-5, no longer working on Mozilla)

Comment 6

•

11 years ago

Does porting bug 976777 to 1.3t help?

Andrew McCreight [:mccr8]

Comment 7

•

11 years ago

Yeah, it sounds like we should backport that. I'm removing Tarako from the subject, because 25MB of Pickle data is pretty awful anywhere on B2G. Ben, the DMD log here shows that the parent process has 25MB of Pickle data. One theory we discussed in the MemShrink meeting is that maybe the child process is sending us a number of huge messages, then we kill it and fail to clean up the pending messages. Does that sound plausible?

Flags: needinfo?(bent.mozilla)

Summary: [Tarako] b2g process takes too much memory when there're script errors → b2g process takes too much memory when there're script errors

Whiteboard: [c=memory p= s= u=] [MemShrink] → [c=memory p= s= u=] [MemShrink:P1]

Nicholas Nethercote [inactive]

Comment 8

•

11 years ago

Just to clarify, there appear to be two problems occurring here. - We're sending absurdly large messages. Hopefully backporting bug 976777 will fix that. - We seem to be leaking some stuff in the parent process. So this bug should probably be about that, since bug 976777 already exists for the first problem. I've renamed it accordingly.

Summary: b2g process takes too much memory when there're script errors → Some kind of IPC leak in b2g

Ting-Yu Chou [:ting] (away)

Assignee

Comment 9

•

11 years ago

Attached file log of IPC handling — Details

Processes: 84 = b2g 1009 = Browser Troublesome channel: 0x4039a030 Found: - The channel works fine at the beginning, message enqueued and dequeued, but start from line 7802, the messages are not dequeued anymore. For the pending script error, each one resizes to 380352 bytes, and the number of pending messages grows to 90 in the end. - From #8110, the DequeueOne task starts working, and reports channel error. - PContentParent::OnChannelError() is called in the end, but I didn't find code freeing pending messages from following functions, I'll keep look around. MessageChannel::OnNotifyMaybeChannelError(), and MessageChannel::NotifyMaybeChannelError()

Ting-Yu Chou [:ting] (away)

Assignee

Comment 10

•

11 years ago

(In reply to Ting-Yu Chou [:ting] from comment #9) > - PContentParent::OnChannelError() is called in the end, but I didn't find > code freeing pending messages The memory should be freed from Pickle destructor, but it is not called, neither ~MessageChannel.

Joe Cheng [:jcheng] (please needinfo)

Comment 11

•

11 years ago

triage: 1.3T+ to avoid mem leak on tarako

blocking-b2g: 1.3T? → 1.3T+

Ting-Yu Chou [:ting] (away)

Assignee

Comment 12

•

11 years ago

The ContentParent is instantiated from ContentParent::RecvAddNewProcess(), but its reference count is not 0 after ContentParent::OnChannelError() finishes.

Ting-Yu Chou [:ting] (away)

Assignee

Comment 13

•

11 years ago

(In reply to Ting-Yu Chou [:ting] from comment #12) > The ContentParent is instantiated from ContentParent::RecvAddNewProcess(), > but its reference count is not 0 after ContentParent::OnChannelError() > finishes. It seems PreallocatedProcessManager::PublishSpareProcess() and PreallocatedProcessManager::MaybeForgetSpare() update the reference count, am trying to understand how.

Lawrence Mandel [:lmandel] (use needinfo)

Updated

•

11 years ago

Component: General → IPC

Product: Firefox OS → Core

Ben Kelly [:bkelly, not reviewing]

Updated

•

11 years ago

Assignee: nobody → tchou

Priority: -- → P1

Ben Turner (not reading bugmail, use the needinfo flag!)

Comment 14

•

11 years ago

It looks to me like once we have a channel error (like, child process dies) we don't immediately do anything with additional messages that were pending from that child. They stay queued and eventually get freed when the channel is destroyed. Since ContentParent is reference counted that can keep the channel alive longer than needed. I would suggest trying to clear mPending, mPendingUrgentRequest, mPendingRPCCall, mDeferred, and mOutOfTurnReplies in MessageChannel::Clear(). That should get you memory back sooner.

Flags: needinfo?(bent.mozilla)

Ting-Yu Chou [:ting] (away)

Assignee

Comment 15

•

11 years ago

Is ContentParent expected to be freed after its DelayedDeleteContentParentTask dies?

Ben Turner (not reading bugmail, use the needinfo flag!)

Comment 16

•

11 years ago

No, it is reference-counted and held alive by lots of things... I don't think we can guarantee its destruction time.

Ting-Yu Chou [:ting] (away)

Assignee

Comment 17

•

11 years ago

I was trying to make sure the references from Init() and PublishSpareProcess() are released after ContentParent::OnChannelError() finishes, seems I don't have to check it as we can't guarantee DelayedDeleteContentParentTask is the last referece? ContentParent::RecvAddNewProcess(const uint32_t& aPid, const InfallibleTArray<ProtocolFdMapping>& aFds) { nsRefPtr<ContentParent> content; content = new ContentParent(...); content->Init(); PreallocatedProcessManager::PublishSpareProcess(content); return true; }

Ben Turner (not reading bugmail, use the needinfo flag!)

Comment 18

•

11 years ago

Yeah, I don't think it's worth trying to hunt down all references to ContentParent here. I know of several, some even attached to reference-counted DOM objects... Let's just try clearing the MessageChannel properly like I suggested in comment 14. If that doesn't work or isn't enough then we can start worrying about the other refs I guess.

Ting-Yu Chou [:ting] (away)

Assignee

Comment 19

•

11 years ago

Attached patch WIP - Clear MessageChannel resource before ContentParent dies (obsolete) — Details — Splinter Review

Ben, I have created a patch as you suggested, please let me know if I should ask someone else for reviewing. I'll still try to understand why the css is broken and parsed, as it prevents user from browsering.

Attachment #8411627 - Flags: review?(bent.mozilla)

Ting-Yu Chou [:ting] (away)

Assignee

Comment 20

•

11 years ago

BTW, I found Init() will be done twice on a ContentParent instance if it is from preallocated process manager in ContentParent::GetNewOrUsed(), is this expected? Note Init() increases reference count.

Ben Turner (not reading bugmail, use the needinfo flag!)

Comment 21

•

11 years ago

Comment on attachment 8411627 [details] [diff] [review] WIP - Clear MessageChannel resource before ContentParent dies Review of attachment 8411627 [details] [diff] [review]: ----------------------------------------------------------------- This looks like I expected, but it sounds like it breaks something? We should get a handle on that before checking this in. ::: ipc/glue/MessageChannel.cpp @@ +284,5 @@ > mChannelErrorTask = nullptr; > } > + > + // Clear resource now since ContentParent is reference counted and may stay > + // alive longer than needed. This isn't a problem specific to ContentParent so I wouldn't mention it. Just say: // Free up any memory used by pending messages. @@ +290,5 @@ > + mPendingUrgentRequest = nullptr; > + mPendingRPCCall = nullptr; > + mOutOfTurnReplies.clear(); > + while (!mDeferred.empty()) > + mDeferred.pop(); Please add braces here.

Attachment #8411627 - Flags: review?(bent.mozilla)

Ting-Yu Chou [:ting] (away)

Assignee

Comment 22

•

11 years ago

Attached patch WIPv2 - Clear MessageChannel resource before ContentParent dies — Details — Splinter Review

Ben, I updated the patch as you commented, please help review again. For the "broken css", sorry I made you confused, it is not about this patch, but how I ran into the leak. The patch fixes the leak, but not the "broken css", so I was just want to let other people know I will still try to figure out why the css is broken. I'll be careful not to mention things cause interference next time.

Attachment #8411627 - Attachment is obsolete: true

Attachment #8412339 - Flags: review?(bent.mozilla)

Ben Turner (not reading bugmail, use the needinfo flag!)

Comment 23

•

11 years ago

Comment on attachment 8412339 [details] [diff] [review] WIPv2 - Clear MessageChannel resource before ContentParent dies Review of attachment 8412339 [details] [diff] [review]: ----------------------------------------------------------------- Ok, this looks good, thanks!

Attachment #8412339 - Flags: review?(bent.mozilla) → review+

Ting-Yu Chou [:ting] (away)

Assignee

Comment 24

•

11 years ago

Try results: https://tbpl.mozilla.org/?tree=Try&rev=4fccea6c3166 Following cases are failed with timout but without known bug, how should I check them? Linux Debug: bc1 Linux x64 Opt: bc3 oth Linux x64 Debug: bc1 Android 2.2 Armv6 Opt: X

David Baron :dbaron: (⌚️UTC-5, no longer working on Mozilla)

Comment 25

•

11 years ago

Some of them look like known intermittent failures. You can certainly check if the failure is intermittent by retriggering that particular test (click the blue "+" icon in TBPL).

Ting-Yu Chou [:ting] (away)

Assignee

Comment 26

•

11 years ago

Thanks dbaron. Yes, they're passed after retriggering. I guess I still need to finish fllowing things to have the patch landed: 1. Star the remaining failed cases 2. Update the patch with commit log r=bent, and maybe a more detailed explanation 3. Add tag checkin-needed Please correct me if I am wrong.

Emanuel Hoogeveen [:ehoogeveen]

Comment 27

•

11 years ago

You don't really need to add the r=bent - the sheriff doing the landing will do that for you - and starring try isn't really required unless you pushed on top of a very orange changeset (and you already have passing retriggers). For the patch description, on the first line you want something describing the goal of the patch*, and something about how if it's short enough - in this case something like "Bug 999351 - Fix an IPC leak by clearing MessageChannel resource before ContentParent dies." On subsequent lines you can put a more detailed description if you think it's warranted (these won't show in the commit log), but remember to wrap to 80 characters (this doesn't apply to the first line). * this doesn't make as much sense in bugs with multiple patches - there you probably just want to describe what each patch does.

Nicholas Nethercote [inactive]

Comment 28

•

11 years ago

> You don't really need to add the r=bent - the sheriff doing the landing will > do that for you I bet the sheriffs prefer patches that require no modification. > - and starring try isn't really required But it can be useful, as the results come in, to keep track of which oranges you've already looked at.

Emanuel Hoogeveen [:ehoogeveen]

Comment 29

•

11 years ago

IIRC they use a script to add the reviewer automatically, but noted :)

Ting-Yu Chou [:ting] (away)

Assignee

Comment 30

•

11 years ago

I have tried to star the failed cases, note I mistakenly star Bug 982100 on B2G ICS Emultator Opt R1.

Ting-Yu Chou [:ting] (away)

Assignee

Comment 31

•

11 years ago

Attached patch Fix an IPC leak by clearing MessageChannel resource before destruction. r=bent — Details — Splinter Review

Updated patch description.

Ting-Yu Chou [:ting] (away)

Assignee

Updated

•

11 years ago

Keywords: checkin-needed

Carsten Book [:Tomcat]

Comment 32

•

11 years ago

https://hg.mozilla.org/integration/mozilla-inbound/rev/78e0fd0ed33f

Keywords: checkin-needed

David Baron :dbaron: (⌚️UTC-5, no longer working on Mozilla)

Comment 33

•

11 years ago

It's great to see this parent process leak diagnosed and fixed. I think I had seen signs of this problem a number of times, but I didn't piece together what was happening or realize the severity. Thanks.

Ting-Yu Chou [:ting] (away)

Assignee

Comment 34

•

11 years ago

Thank you all for helping me go through the process. :) # BTW, I coulnd't reproduce the broken CSS and script errors on today's Tarako build, so I'll just stop hunting it down. gecko: e0401abed8c2cf397b5963b23fecb7fd3b9b7c54 gaia: 8895b180ed636069473703d0e7b73086989601ce

Carsten Book [:Tomcat]

Comment 35

•

11 years ago

https://hg.mozilla.org/mozilla-central/rev/78e0fd0ed33f

Status: NEW → RESOLVED

Closed: 11 years ago

Resolution: --- → FIXED

Target Milestone: --- → mozilla32

Ting-Yu Chou [:ting] (away)

Assignee

Comment 36

•

11 years ago

We need the patch also land on b2g v1.3t, anything else I should do to make it happen?

Ting-Yu Chou [:ting] (away)

Assignee

Comment 37

•

11 years ago

(In reply to Ting-Yu Chou [:ting] from comment #20) > BTW, I found Init() will be done twice on a ContentParent instance if it is > from preallocated process manager in ContentParent::GetNewOrUsed(), is this > expected? Note Init() increases reference count. Filed bug 1002527 for this.

[:fabrice] Fabrice Desré

Comment 38

•

11 years ago

https://hg.mozilla.org/releases/mozilla-b2g28_v1_3t/rev/945ba6d7bff3

status-b2g-v1.3T: affected → fixed

Ting-Yu Chou [:ting] (away)

Assignee

Updated

•

11 years ago

Updated

•

11 years ago

Whiteboard: [c=memory p= s= u=] [MemShrink:P1] → [c=memory p= s=2014.05.09.t u=tarako] [MemShrink:P1]

James Zhang (Spreadtrum)

Updated

•

11 years ago

status-b2g-v1.4: --- → ?

You need to log in before you can comment on or make changes to this bug.