Closed
Bug 999351
Opened 10 years ago
Closed 10 years ago
Some kind of IPC leak in b2g
Categories
(Core :: IPC, defect, P1)
Tracking
()
People
(Reporter: ting, Assigned: ting)
References
Details
(Keywords: memory-footprint, memory-leak, perf, Whiteboard: [c=memory p= s=2014.05.09.t u=tarako] [MemShrink:P1])
Attachments
(5 files, 1 obsolete file)
30.07 KB,
text/plain
|
Details | |
1005.34 KB,
application/gzip
|
Details | |
792.90 KB,
text/plain
|
Details | |
1.11 KB,
patch
|
bent.mozilla
:
review+
|
Details | Diff | Splinter Review |
1.41 KB,
patch
|
Details | Diff | Splinter Review |
STR: 1. Enable and connect WiFi 2. Launch browser, open http://www.pinterest.com 3. Monitor adb logcat for script error messages, such as: E/GeckoConsole( 1362): [JavaScript Warning: "Unknown property 'box-sizing'. Declaration dropped." {file: "http://passets-ec.pinterest.com/webapp/style/app/mobile/bundle.e0b47832.css" line: 1 column: 22740 source: ... 4. Browser, and Homescreen process may be killed if 3 is happened 5. Check about:memory, heap-unclassified of b2g process will be around 2x~4x MB. From DMD report, the memory was allocated from Pickle::Resize() when OnMessageReceivedFromLink() calls mPending.push_back(). Reproduce rate: ~60% Version: v1.3t gecko = a887469f184a0153e5e398a7878d50ed54353f9c gaia = fed1cc94e76abccea85ae998329430bb46cb7e53 What happened: Content child send multiple script error IPC to b2g process contains the source line, which here it is a one line css with length 190049. Things to clarify: - How is the script error happened, is it because of incomplete download? - Why the pending messages are not digested and free?
Assignee | ||
Comment 1•10 years ago
|
||
The dmd of b2g process comes together with attachment 8410140 [details].
Comment 2•10 years ago
|
||
I know we've discussed this problem in the MemShrink meetings before in relation to other bugs. I can't find those bugs at the moment, though. Weren't we going to truncate these strings at some reasonable length?
Whiteboard: [MemShrink]
Assignee | ||
Comment 3•10 years ago
|
||
(In reply to Ben Kelly [:bkelly] from comment #2) > Weren't we going to truncate these strings at some reasonable length? That should be helpful, but I'd like to figure out why the messages are kept in pending, and how is the script error triggered. It seems the css is not downloaded completely, but if this is the case, why the css is still parsed and processed.
Comment 4•10 years ago
|
||
Ah, I did not see that it was also being leaked. Yea, that seems bad.
noming based on comment 4.
blocking-b2g: --- → 1.3T?
status-b2g-v1.3T:
--- → affected
Whiteboard: [MemShrink] → [c=memory p= s= u=] [MemShrink]
Does porting bug 976777 to 1.3t help?
Comment 7•10 years ago
|
||
Yeah, it sounds like we should backport that. I'm removing Tarako from the subject, because 25MB of Pickle data is pretty awful anywhere on B2G. Ben, the DMD log here shows that the parent process has 25MB of Pickle data. One theory we discussed in the MemShrink meeting is that maybe the child process is sending us a number of huge messages, then we kill it and fail to clean up the pending messages. Does that sound plausible?
Flags: needinfo?(bent.mozilla)
Summary: [Tarako] b2g process takes too much memory when there're script errors → b2g process takes too much memory when there're script errors
Whiteboard: [c=memory p= s= u=] [MemShrink] → [c=memory p= s= u=] [MemShrink:P1]
Comment 8•10 years ago
|
||
Just to clarify, there appear to be two problems occurring here. - We're sending absurdly large messages. Hopefully backporting bug 976777 will fix that. - We seem to be leaking some stuff in the parent process. So this bug should probably be about that, since bug 976777 already exists for the first problem. I've renamed it accordingly.
Summary: b2g process takes too much memory when there're script errors → Some kind of IPC leak in b2g
Assignee | ||
Comment 9•10 years ago
|
||
Processes: 84 = b2g 1009 = Browser Troublesome channel: 0x4039a030 Found: - The channel works fine at the beginning, message enqueued and dequeued, but start from line 7802, the messages are not dequeued anymore. For the pending script error, each one resizes to 380352 bytes, and the number of pending messages grows to 90 in the end. - From #8110, the DequeueOne task starts working, and reports channel error. - PContentParent::OnChannelError() is called in the end, but I didn't find code freeing pending messages from following functions, I'll keep look around. MessageChannel::OnNotifyMaybeChannelError(), and MessageChannel::NotifyMaybeChannelError()
Assignee | ||
Comment 10•10 years ago
|
||
(In reply to Ting-Yu Chou [:ting] from comment #9) > - PContentParent::OnChannelError() is called in the end, but I didn't find > code freeing pending messages The memory should be freed from Pickle destructor, but it is not called, neither ~MessageChannel.
Assignee | ||
Comment 12•10 years ago
|
||
The ContentParent is instantiated from ContentParent::RecvAddNewProcess(), but its reference count is not 0 after ContentParent::OnChannelError() finishes.
Assignee | ||
Comment 13•10 years ago
|
||
(In reply to Ting-Yu Chou [:ting] from comment #12) > The ContentParent is instantiated from ContentParent::RecvAddNewProcess(), > but its reference count is not 0 after ContentParent::OnChannelError() > finishes. It seems PreallocatedProcessManager::PublishSpareProcess() and PreallocatedProcessManager::MaybeForgetSpare() update the reference count, am trying to understand how.
Updated•10 years ago
|
Component: General → IPC
Product: Firefox OS → Core
Updated•10 years ago
|
Assignee: nobody → tchou
Priority: -- → P1
It looks to me like once we have a channel error (like, child process dies) we don't immediately do anything with additional messages that were pending from that child. They stay queued and eventually get freed when the channel is destroyed. Since ContentParent is reference counted that can keep the channel alive longer than needed. I would suggest trying to clear mPending, mPendingUrgentRequest, mPendingRPCCall, mDeferred, and mOutOfTurnReplies in MessageChannel::Clear(). That should get you memory back sooner.
Flags: needinfo?(bent.mozilla)
Assignee | ||
Comment 15•10 years ago
|
||
Is ContentParent expected to be freed after its DelayedDeleteContentParentTask dies?
No, it is reference-counted and held alive by lots of things... I don't think we can guarantee its destruction time.
Assignee | ||
Comment 17•10 years ago
|
||
I was trying to make sure the references from Init() and PublishSpareProcess() are released after ContentParent::OnChannelError() finishes, seems I don't have to check it as we can't guarantee DelayedDeleteContentParentTask is the last referece? ContentParent::RecvAddNewProcess(const uint32_t& aPid, const InfallibleTArray<ProtocolFdMapping>& aFds) { nsRefPtr<ContentParent> content; content = new ContentParent(...); content->Init(); PreallocatedProcessManager::PublishSpareProcess(content); return true; }
Yeah, I don't think it's worth trying to hunt down all references to ContentParent here. I know of several, some even attached to reference-counted DOM objects... Let's just try clearing the MessageChannel properly like I suggested in comment 14. If that doesn't work or isn't enough then we can start worrying about the other refs I guess.
Assignee | ||
Comment 19•10 years ago
|
||
Ben, I have created a patch as you suggested, please let me know if I should ask someone else for reviewing. I'll still try to understand why the css is broken and parsed, as it prevents user from browsering.
Attachment #8411627 -
Flags: review?(bent.mozilla)
Assignee | ||
Comment 20•10 years ago
|
||
BTW, I found Init() will be done twice on a ContentParent instance if it is from preallocated process manager in ContentParent::GetNewOrUsed(), is this expected? Note Init() increases reference count.
Comment on attachment 8411627 [details] [diff] [review] WIP - Clear MessageChannel resource before ContentParent dies Review of attachment 8411627 [details] [diff] [review]: ----------------------------------------------------------------- This looks like I expected, but it sounds like it breaks something? We should get a handle on that before checking this in. ::: ipc/glue/MessageChannel.cpp @@ +284,5 @@ > mChannelErrorTask = nullptr; > } > + > + // Clear resource now since ContentParent is reference counted and may stay > + // alive longer than needed. This isn't a problem specific to ContentParent so I wouldn't mention it. Just say: // Free up any memory used by pending messages. @@ +290,5 @@ > + mPendingUrgentRequest = nullptr; > + mPendingRPCCall = nullptr; > + mOutOfTurnReplies.clear(); > + while (!mDeferred.empty()) > + mDeferred.pop(); Please add braces here.
Attachment #8411627 -
Flags: review?(bent.mozilla)
Assignee | ||
Comment 22•10 years ago
|
||
Ben, I updated the patch as you commented, please help review again. For the "broken css", sorry I made you confused, it is not about this patch, but how I ran into the leak. The patch fixes the leak, but not the "broken css", so I was just want to let other people know I will still try to figure out why the css is broken. I'll be careful not to mention things cause interference next time.
Attachment #8411627 -
Attachment is obsolete: true
Attachment #8412339 -
Flags: review?(bent.mozilla)
Comment on attachment 8412339 [details] [diff] [review] WIPv2 - Clear MessageChannel resource before ContentParent dies Review of attachment 8412339 [details] [diff] [review]: ----------------------------------------------------------------- Ok, this looks good, thanks!
Attachment #8412339 -
Flags: review?(bent.mozilla) → review+
Assignee | ||
Comment 24•10 years ago
|
||
Try results: https://tbpl.mozilla.org/?tree=Try&rev=4fccea6c3166 Following cases are failed with timout but without known bug, how should I check them? Linux Debug: bc1 Linux x64 Opt: bc3 oth Linux x64 Debug: bc1 Android 2.2 Armv6 Opt: X
Some of them look like known intermittent failures. You can certainly check if the failure is intermittent by retriggering that particular test (click the blue "+" icon in TBPL).
Assignee | ||
Comment 26•10 years ago
|
||
Thanks dbaron. Yes, they're passed after retriggering. I guess I still need to finish fllowing things to have the patch landed: 1. Star the remaining failed cases 2. Update the patch with commit log r=bent, and maybe a more detailed explanation 3. Add tag checkin-needed Please correct me if I am wrong.
Comment 27•10 years ago
|
||
You don't really need to add the r=bent - the sheriff doing the landing will do that for you - and starring try isn't really required unless you pushed on top of a very orange changeset (and you already have passing retriggers). For the patch description, on the first line you want something describing the goal of the patch*, and something about how if it's short enough - in this case something like "Bug 999351 - Fix an IPC leak by clearing MessageChannel resource before ContentParent dies." On subsequent lines you can put a more detailed description if you think it's warranted (these won't show in the commit log), but remember to wrap to 80 characters (this doesn't apply to the first line). * this doesn't make as much sense in bugs with multiple patches - there you probably just want to describe what each patch does.
Comment 28•10 years ago
|
||
> You don't really need to add the r=bent - the sheriff doing the landing will > do that for you I bet the sheriffs prefer patches that require no modification. > - and starring try isn't really required But it can be useful, as the results come in, to keep track of which oranges you've already looked at.
Comment 29•10 years ago
|
||
IIRC they use a script to add the reviewer automatically, but noted :)
Assignee | ||
Comment 30•10 years ago
|
||
I have tried to star the failed cases, note I mistakenly star Bug 982100 on B2G ICS Emultator Opt R1.
Assignee | ||
Comment 31•10 years ago
|
||
Updated patch description.
Assignee | ||
Updated•10 years ago
|
Keywords: checkin-needed
Comment 32•10 years ago
|
||
https://hg.mozilla.org/integration/mozilla-inbound/rev/78e0fd0ed33f
Keywords: checkin-needed
It's great to see this parent process leak diagnosed and fixed. I think I had seen signs of this problem a number of times, but I didn't piece together what was happening or realize the severity. Thanks.
Assignee | ||
Comment 34•10 years ago
|
||
Thank you all for helping me go through the process. :) # BTW, I coulnd't reproduce the broken CSS and script errors on today's Tarako build, so I'll just stop hunting it down. gecko: e0401abed8c2cf397b5963b23fecb7fd3b9b7c54 gaia: 8895b180ed636069473703d0e7b73086989601ce
Comment 35•10 years ago
|
||
https://hg.mozilla.org/mozilla-central/rev/78e0fd0ed33f
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla32
Assignee | ||
Comment 36•10 years ago
|
||
We need the patch also land on b2g v1.3t, anything else I should do to make it happen?
Assignee | ||
Comment 37•10 years ago
|
||
(In reply to Ting-Yu Chou [:ting] from comment #20) > BTW, I found Init() will be done twice on a ContentParent instance if it is > from preallocated process manager in ContentParent::GetNewOrUsed(), is this > expected? Note Init() increases reference count. Filed bug 1002527 for this.
Updated•10 years ago
|
Whiteboard: [c=memory p= s= u=] [MemShrink:P1] → [c=memory p= s=2014.05.09.t u=tarako] [MemShrink:P1]
Updated•10 years ago
|
status-b2g-v1.4:
--- → ?
You need to log in
before you can comment on or make changes to this bug.
Description
•