Closed Bug 859149 Opened 9 years ago Closed 2 years ago

<iframe mozbrowser>s' processes don't seem to get killed if the iframe is removed from the DOM too early

Categories

(Core :: DOM: Content Processes, defect, P5)

defect

Tracking

()

RESOLVED INVALID

People

(Reporter: justin.lebar+bug, Unassigned)

References

Details

(Whiteboard: [MemShrink:P2])

Attachments

(3 files)

In writing testcases for bug 844323, I noticed that if I remove an <iframe mozbrowser> from the DOM at the right time, its process is never killed!

I'll post a link to my WIP testcase which causes this problem, but first I want this bug number so I can reference it in the test.
Whiteboard: [MemShrink]
This is most likely the problem that causes a bunch of intermittent oranges in the ipc+idb tests.
Can you link to some of those bugs?
Okay, cool.

I don't want to get your hopes up too much: I observed this happening only when I removed the frame very soon after I inserted it into the DOM.  Your tests may be hitting something unfortunately different; I guess it remains to be seen.
Blocks: 783913
jlebar, can you assign a MemShrink priority to this?  We're not sure what to give it.

Also, would https://bugzilla.mozilla.org/show_bug.cgi?id=842003 help with this?
> Also, would https://bugzilla.mozilla.org/show_bug.cgi?id=842003 help with this?

If we're hitting this case in our tests, I'm pretty sure it would turn the tests orange anyway (something about a zombie process).

Now that I know more about this bug, I think this is likely a P3, because I've never seen evidence that we're hitting it in B2G.  I also think if we were hitting it we'd probably set the oom priority of the process to bg first, which would make it not a huge deal.
Whiteboard: [MemShrink] → [MemShrink:P3]
Blocks: 867739
This is probably now a P2 or a P1, because I think I'm seeing this in bug 867739.
Whiteboard: [MemShrink:P3] → [MemShrink]
What's worst about this is that the zombie created in bug 867739 has foreground oom_adj, which means it will essentially never get killed!
Whiteboard: [MemShrink] → [MemShrink:P2]
We have observed a Browser zombie process in Leo, that might be related to this issue.
We run the Orangutan based tests, that consist of opening an app and put it in background after 10 seconds. We do it recursively with Marketplace, E-mail and Contacts.
The zombie process appears after 2 to 10 hours of tests, with memeater consuming 200M.
The zombie Browser has priority of Foreground app. And its size is something between 10M to 24M Rss).
blocking-b2g: --- → leo?
Moving the comment above to a new bug 899437
Blocks: 899437
No longer blocks: 899437
Leo, can you clarify whether you've seen zombie processes that aren't related to bug 899437?

Such zombie processes would probably show up as app processes, not browser processes, as in bug 899437.  You could test for this by launching a bunch of apps but not launching apps which use Persona/Mozilla Identity.
Triage - not blocking for release, but leo would like to leoVB this when it gets fixed.
Noming for koi blocking.
blocking-b2g: leo? → koi?
(In reply to Wayne Chang [:wchang] from comment #13)
> Triage - not blocking for release, but leo would like to leoVB this when it
> gets fixed.
> Noming for koi blocking.

Wayne, this means that we'd take this change in Leo's vendor build, but not in Mozilla's 1.1 build?

That seems extremely unsafe to me.  It also seems logistically very complicated, since Leo would have to resolve any merge conflicts themselves, with no oversight from us.

Could you please clarify?  Thanks.
Flags: needinfo?(wchang)
Hi Justin,

Sorry if my comment was a bit unclear.

Essentially they're saying this wouldn't block the release but if there is a patch that they can take (i.e. a patch ported for v1.1), they will. 

I'll put forward your concern too to make sure they dont attempt to do what you've described.

Thanks for highlighting this.

(In reply to Justin Lebar [:jlebar] from comment #14)
> (In reply to Wayne Chang [:wchang] from comment #13)
> > Triage - not blocking for release, but leo would like to leoVB this when it
> > gets fixed.
> > Noming for koi blocking.
> 
> Wayne, this means that we'd take this change in Leo's vendor build, but not
> in Mozilla's 1.1 build?
> 
> That seems extremely unsafe to me.  It also seems logistically very
> complicated, since Leo would have to resolve any merge conflicts themselves,
> with no oversight from us.
> 
> Could you please clarify?  Thanks.
Flags: needinfo?(wchang)
So do we have permission to land this in the b2g18 repository even though this is not leo+?  Or do we need to come back and re-nom for blocking leo at that point and wait for triage to re-consider the bug?
Flags: needinfo?(wchang)
Triage would need to reconsider it, and if that happens and lands on b2g18, leo will take this when it happens.
Flags: needinfo?(wchang)
If release drivers are going to ask that we re-triage bugs like this, I think it should be release drivers' responsibility to make sure that we actually do re-triage bugs like this after they land.  If it's up to engineers to remember to renom a bug, we're going to miss things, and that itself is risky to the product.

One way you could do this is to keep a list (e.g. an etherpad) of bugs that are in the status of "we'd take a safe patch."  Then on a regular basis you'd see if any bugs on this list were closed recently, and follow up asking if the patch is safe enough to take on Leo.

This really shouldn't be up to engineers to remember; we're going to forget often enough for it to be problematic.
Flags: needinfo?(wchang)
Whenever possible, I'd like to see us block on something (and thus take patches) if it's indeed a release blocker issue and not consider the "safety" of a real or theoretical patch when making blocking decisions.
> Whenever possible, I'd like to see us block on something (and thus take patches)

I understand and in fact share release drivers' hesitancy to mark things that don't strictly block release as blockers.  As Jonas and I have argued ad nauseum, what's missing is a formal will-take-a-safe-patch status.  In the absence of such a formal status, I totally agree with Andrew that it's better to mark bugs as blockers than use an informal system, as is being done here.

But just to be clear, I don't think it would be a satisfactory conclusion to this discussion to say that we're just going to mark these bugs as blockers.  We've had this discussion about "will-take-a-safe-patch" bugs many times and the conclusion has always been a commitment not to give bugs the status that this bug was just given.  I would really like us to stop going in circles like this.

Given the history of triage rules changing in this way, I would like release drivers to make a commitment to follow up with bugs that they give this will-take-a-safe-patch status, contingent on their actually giving any bugs this status.  Then the question of whether any bugs should have this status in the first place can be a separate issue, which we may or may not discuss again.

Does that sound OK?
Attached patch zombie.diffSplinter Review
Not sure if this is related, but I can create zombie processes with this test. By applying this patch, attaching to the b2g process, and then setting base::gKillApp = true, I can create zombie processes in trying to open new apps (and then fail).
This kills the zombie created in the above test.
Adding the following lines

if(mExistingListener)
        mExistingListener->OnChannelError();

in |AsyncChannel::ProcessLink::OnChannelError()| can spot the error when connecting channels, I'll provide more details with a fix patch later.
Assignee: nobody → slin
(In reply to Cervantes Yu from comment #21)
> Created attachment 792139 [details] [diff] [review]
> zombie.diff
> 
> Not sure if this is related, but I can create zombie processes with this
> test. By applying this patch, attaching to the b2g process, and then setting
> base::gKillApp = true, I can create zombie processes in trying to open new
> apps (and then fail).

Well, seems like we couldn't reproduce the zombie processes anymore, unfortunately.

More closely, |mExistingListener->OnChannelError();| calls the OnChannelError() in GeckoChildProcessHost, which is an empty function now, so I don't see how it can fix the problem. However, this should relate to bug 773925.
We can only reproduce the hang or zombie processes through gdb, bug 911009 might be a possible fix.
Assignee: slin → nobody
(In reply to Shelly Lin [:shelly] from comment #25)
> We can only reproduce the hang or zombie processes through gdb, bug 911009
> might be a possible fix.

This makes me think we should at least remove the koi nom here and potentially either dupe this or WONTFIX it.  What do you think, Shelly?
Flags: needinfo?(slin)
Yeah, I think it make sense to remove the koi nom here, but maybe we should keep this bug open, tag this bug with IPC related issue.
Flags: needinfo?(slin)
blocking-b2g: koi? → ---
Flags: needinfo?(wchang)
Component: IPC → DOM: Content Processes
https://bugzilla.mozilla.org/show_bug.cgi?id=1472046

Move all DOM bugs that haven't been updated in more than 3 years and has no one currently assigned to P5.

If you have questions, please contact :mdaly.
Priority: -- → P5

mozbrowser is gone.

Status: NEW → RESOLVED
Closed: 2 years ago
Resolution: --- → INVALID
You need to log in before you can comment on or make changes to this bug.