
windows intermittently leaked until shutdown via Cpows with e10s-multi

RESOLVED FIXED in Firefox 52

Status


Core
JavaScript Engine
RESOLVED FIXED
2 months ago
2 months ago

People

(Reporter: krizsa, Assigned: mccr8)

Tracking

({mlk, regression})

Trunk
mozilla53
mlk, regression
Points:
---
Dependency tree / graph

Firefox Tracking Flags

(firefox51 unaffected, firefox52 fixed, firefox53 fixed, firefox-esr45 unaffected)

Details

(Whiteboard: [e10s-multi:M1][MemShrink:P1])


Attachments

(4 attachments, 2 obsolete attachments)

(Reporter)

Description

2 months ago
For some reason, with multiple content processes enabled, I saw some tests leak. After changing one of the leaking tests a bit, it seems like the problem is with the leak detector.

This modified test creates a bunch of windows and closes them. I use BrowserTestUtils.closeWindow, which waits for the final session storage message from the children (in the case of this test, one per window) plus the "domwindowclosed" event on the parent side. After this we start the GC on the parent side and send messages to the child processes to do the same.

The problem is that at this point there is no guarantee that the GC/CC will be able to collect the windows in the child process: we're in the middle of destroying the frame loader, and depending on how busy the CPU is, in debug mode it can take a while until we get to the point where it makes sense to run GC/CC.

Multiple content processes make things worse here, but in debug mode this modified test leaks even with a single content process (executed with --repeat 5).

side notes:
- on the parent side we are quite aggressive with GC: http://searchfox.org/mozilla-central/source/testing/mochitest/browser-test.js#644
compared to the child side... should we do more on the child side too?

- seems like we don't wait for the child processes to finish their GC before launching the next test: http://searchfox.org/mozilla-central/source/testing/mochitest/browser-test.js#647 Can that cause a problem?
(Reporter)

Comment 1

2 months ago
Created attachment 8825442 [details] [diff] [review]
leaky test
(Reporter)

Comment 2

2 months ago
This issue is blocking me from landing multiple content processes on Nightly. Could you please help me figure out what to do about it?
Flags: needinfo?(continuation)
(Reporter)

Updated

2 months ago
Blocks: 1303113
Whiteboard: [e10s-multi:M1]
(Assignee)

Comment 3

2 months ago
Sure, I can take a look. The basic way this works is that the parent process sends the async browser-test:collect-request message, which the child gets. Then the child runs some GCs synchronously, and prints out a message with its pid. Then the leak detector (which is a Python script that runs on the log) looks for that message, checking that windows created during the test in that process were destroyed before the message.

Here are the various pieces of that: http://searchfox.org/mozilla-central/search?q=Completed%20ShutdownLeaks%20collections%20in%20process&path=
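In case it helps, here's a simplified sketch of the kind of check described above (hypothetical Python; the real leaks.py parses the full harness log and handles more cases): any window created in a process with no --DOMWINDOW line before that process's "Completed ShutdownLeaks" marker is flagged as leaked.

```python
import re

# Log-line patterns, modeled on the ++DOMWINDOW / --DOMWINDOW lines
# shown later in this bug; capture groups: address, pid, serial.
CREATE = re.compile(r"\+\+DOMWINDOW == \d+ \((0x[0-9a-f]+)\) \[pid = (\d+)\] \[serial = (\d+)\]")
DESTROY = re.compile(r"--DOMWINDOW == \d+ \((0x[0-9a-f]+)\) \[pid = (\d+)\] \[serial = (\d+)\]")
DONE = re.compile(r"Completed ShutdownLeaks collections in process (\d+)")

def find_leaked_windows(log_lines):
    """Return (pid, serial) pairs for windows still alive when their
    process finished its shutdown GC/CC collections."""
    alive = {}   # pid -> set of serials currently alive in that process
    leaked = []
    for line in log_lines:
        m = CREATE.search(line)
        if m:
            alive.setdefault(m.group(2), set()).add(m.group(3))
            continue
        m = DESTROY.search(line)
        if m:
            alive.get(m.group(2), set()).discard(m.group(3))
            continue
        m = DONE.search(line)
        if m:
            # Anything not yet destroyed at the marker counts as leaked,
            # even if it is eventually torn down later in shutdown.
            leaked.extend((m.group(1), s) for s in sorted(alive.get(m.group(1), ())))
    return leaked
```

For example, a window whose --DOMWINDOW line only appears after the marker (the failure mode discussed below) would be reported, while one destroyed before the marker would not.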

I wasn't thinking at all about e10s multi when I was converting this test to work with e10s, so it is possible something doesn't work.

It is possible that running the same test repeatedly confuses the leak detector, which could explain why you'd see issues with --repeat.
(Assignee)

Comment 4

2 months ago
How do I run mochitests locally with e10s multi enabled?
Assignee: nobody → continuation
Flags: needinfo?(continuation) → needinfo?(gkrizsanits)
I don't know the pref offhand, but you can pass in --setpref <pref>=<value> to the mach command. If that doesn't work for some reason, let me know and just add it to testing/profiles/prefs_general.js in the meantime.
(Reporter)

Comment 6

2 months ago
(In reply to Andrew McCreight [:mccr8] from comment #3)
> Sure, I can take a look. The basic way this works is that the parent process

Thanks!

> sends the async browser-test:collect-request message, which the child gets.
> Then the child runs some GCs synchronously, and prints out a message with
> its pid. Then the leak detector (which is a Python script that runs on the
> log) looks for that message, checking that windows created during the test
> in that process were destroyed before the message.

Yeah, I think I've understood that part. I'm just not sure when the right time is
to do a GC in the child... if we do it too early, something might still hold a reference
to the window and we fail to collect, no? I fail to see why BrowserTestUtils' way of
waiting for those messages/events is a sure way to know that we can finish the test and start
the sync GC work. And I think in some cases it is too early.

> 
> I wasn't thinking at all about e10s multi when I was converting this test to
> work with e10s, so it is possible something doesn't work.
> 
> It is possible that running the same test repeatedly confuses the leak
> detector, which could explain why you'd see issues with --repeat.

Oh, I thought that since we have the window IDs / raw pointers that should not be a problem,
but I have not looked into that. The original test fails intermittently if you set the process
count to 2 in all.js.

Some more leaky tests are under bug 1303113.

(In reply to Andrew McCreight [:mccr8] from comment #4)
> How do I run mochitests locally with e10s multi enabled?
dom.ipc.processCount in all.js
Flags: needinfo?(gkrizsanits)
(Assignee)

Comment 7

2 months ago
(In reply to Gabor Krizsanits [:krizsa :gabor] from comment #6)
> Yeah, I think I've understood that part. I'm just not sure when is the right
> time to do a GC in the child... if we do it too early something might still hold
> a reference to the window and we fail to collect, no? I fail to see why BrowserTestUtils
> way of waiting for those messages/events is a sure way to know that we can finish
> the test and start the sync GC work. And I think in some cases it is too early.

I'm not too clear on how we decide to do this, but the current approach has worked for a few years, so I don't think it can be too terrible. The entire point of this test is that if you close all the tabs a test opens, and GC a bunch, the windows the test creates should go away. I think there's even some kind of observer service notification code can listen for to drop stuff if it really needs to.

> Oh, I though since we have the windows ID's / raw pointers that should not
> be a problem but I have not looked into that.

I can reproduce the issue with --repeat. For at least one window, I see the "Completed ShutdownLeaks collections in process" message for the process before the window is even created. Then it never happens again. So we may not be doing the full cleanup procedure in the --repeat case. I looked over the code in leaks.py and it does look like it should be able to handle repeated testing.

> The original test fails intermittently if you set the process count to 2 in all.js

I'm not able to reproduce a leak with the patch you attached here, and running it alone using
  ./mach mochitest dom/tests/browser/browser_test_toolbars_visibility.js

Does that consistently leak for you, or only intermittently? Also, if you have a log of a full run that fails I can look at that and try to analyze if something is going wrong at the test harness level.
(Assignee)

Comment 8

2 months ago
I'll look around in the bugs blocking bug 1303113 and see if I can figure anything out.
(Assignee)

Updated

2 months ago
Depends on: 1330085
(Assignee)

Updated

2 months ago
No longer depends on: 1330085
(Assignee)

Comment 9

2 months ago
Gabor, do you have any steps to reproduce this locally, without --repeat?
Flags: needinfo?(gkrizsanits)
(Assignee)

Comment 10

2 months ago
Here's what a log from bug 1328396 looks like (with many intermediate lines removed):

11:58:39 INFO - TEST-START | toolkit/mozapps/extensions/test/browser/browser_updateid.js
11:58:39 INFO - ++DOMWINDOW == 3 (0x111d64800) [pid = 2536] [serial = 249] [outer = 0x0]
11:58:41 INFO - TEST-OK | toolkit/mozapps/extensions/test/browser/browser_updateid.js | took 2071ms
11:58:46 INFO - Completed ShutdownLeaks collections in process 2536
11:58:53 (some warnings from nsThread::Shutdown() in process 2536, indicating we're probably pretty late in shut down)
11:58:53 INFO - --DOMWINDOW == 0 (0x111d64800) [pid = 2536] [serial = 249] [outer = 0x0] [url = about:blank]

So, it looks like we are properly running the ShutdownLeaks collection for this process, and I only see it for this process once, so this isn't leaks.py getting confused. The test is in fact just keeping an inner/outer window pair alive until very late in shutdown. Note that the window isn't destroyed until 7 seconds after the shutdown collection, which seems like a long time. Though maybe it actually became garbage quickly, and that's just the next time we ran the GC.

I'm not sure how easy it will be to figure out without a way to reproduce it. Maybe rr is the best hope here.
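For what it's worth, the 7-second gap above can be read straight off the log timestamps; a throwaway helper (hypothetical, not part of the harness) for quantifying that gap:

```python
from datetime import datetime

def seconds_between(log_lines, marker_a, marker_b, fmt="%H:%M:%S"):
    """Given log lines that start with an HH:MM:SS timestamp, return the
    seconds elapsed between the first line containing marker_a and the
    first line containing marker_b."""
    def stamp(marker):
        for line in log_lines:
            if marker in line:
                # The timestamp is the first whitespace-separated token.
                return datetime.strptime(line.split()[0], fmt)
        raise ValueError("marker not found: %s" % marker)
    return (stamp(marker_b) - stamp(marker_a)).total_seconds()
```

Applied to the excerpt above, the delta between the "Completed ShutdownLeaks" line and the final --DOMWINDOW line is 7 seconds.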
(Reporter)

Comment 11

2 months ago
(In reply to Andrew McCreight [:mccr8] from comment #7)
> I'm not too clear on how we decide to do this, but the current approach has
> worked for a few years, so I don't think it can be too terrible. The entire
> point of this test is that if you close all the tabs a tests open, and GC a
> bunch, the windows the test creates should go away. I think there's even
> some kind of observer service thing code can listen for to drop stuff if it
> really needs to.

Yes. The "memory-pressure" thing. Speaking of which, I did some logging for another
test (browser_captivePortal_certErrorUI.js), and I think there
in the non-leaking case, it's not the GC initiated from the ShutdownLeakCollector that
picks up the window but the one that is requested from here from the parent side
http://searchfox.org/mozilla-central/source/dom/ipc/ContentParent.cpp#2360
and then handled here on child side:
http://searchfox.org/mozilla-central/source/dom/ipc/ContentChild.cpp#2355
and then starts somewhere an async CC...

> I can reproduce the issue with --repeat. For at least one window, I see the
> "Completed ShutdownLeaks collections in process" message for the process
> before the window is even created. Then it never happens again. So we may
> not be doing the full cleanup procedure in the --repeat case. I looked over
> the code in leaks.py and it does look like it should be able to handle
> repeated testing.

Let's forget about the --repeat case then. I created 5 different copies of the
test above and ran them in a suite (instead of using --repeat on the same test file),
and that did not report any leaks... So you're probably right about --repeat.

> Does that consistently leak for you, or only intermittently? Also, if you

Intermittently, in the most annoying way. Sometimes I can reproduce it several times in a row;
sometimes I cannot reproduce it at all no matter how much I try. You could try browser_captivePortal_certErrorUI.js; it is pretty frequent with two content processes, especially, for some reason, right after a fresh build (./mach build && ./mach mochitest browser/base/content/test/captivePortal/browser_captivePortal_certErrorUI.js). I know it does not make much sense, but the annoying part is that if I can reproduce it then I can usually reproduce it consistently (and the other way around), so it is worth trying a couple of builds. You could also put some extra stress on the CPU while running, or use a VM with a single core.

> have a log of a full run that fails I can look at that and try to analyze if
> something is going wrong at the test harness level.

Yeah, I'll attach one. It probably won't help much, though, since if I run a single test, the window gets cleaned up only during process shutdown. The output seems alright; just something seems to hold a reference to the window _sometimes_. Btw, I tried using setTimeout before the GC (to see if it's just some event), and it helped locally (I wouldn't trust it because of the intermittent nature) but did not solve the problem on try.

(In reply to Andrew McCreight [:mccr8] from comment #10)
> Here's what a log from bug 1328396 looks like (with many intermediate lines
> removed):
> 
> 11:58:39 INFO - TEST-START |
> toolkit/mozapps/extensions/test/browser/browser_updateid.js
> 11:58:39 INFO - ++DOMWINDOW == 3 (0x111d64800) [pid = 2536] [serial = 249]
> [outer = 0x0]
> 11:58:41 INFO - TEST-OK |
> toolkit/mozapps/extensions/test/browser/browser_updateid.js | took 2071ms
> 11:58:46 INFO - Completed ShutdownLeaks collections in process 2536
> 11:58:53 (some warnings from nsThread::Shutdown() in process 2536,
> indicating we're probably pretty late in shut down)
> 11:58:53 INFO - --DOMWINDOW == 0 (0x111d64800) [pid = 2536] [serial = 249]
> [outer = 0x0] [url = about:blank]
> 
> So, it looks like we are properly running the ShutdownLeaks collection for
> this process, and I only see it for this process once, so this isn't
> leaks.py getting confused. The test is in fact just keeping alive an
> inner/outer window pair alive until very late in shutdown. Note that the
> window isn't destroyed until 7 seconds after the shutdown collection, which
> seems like a long time. Though maybe it actually became garbage quickly, and
> that's just the next time we ran the GC.

This means that we are able to GC the window before the final cleanup during the
process shutdown, right?

> 
> I'm not sure how easy it will be to figure out without a way to reproduce
> it. Maybe rr is the best hope here.

On the try server it might be easier to reproduce... My problem is that even if
I attached a debugger or rr, I don't think I would know what to do / look for. I mean,
I don't see what guarantees that after window.close() the destruction of the DOM
tree / JS scopes / docshell tree has gotten to the point where a GC/CC can free the window,
other than that it has worked so far with a single content process. If you have some
hints for me on where to look, I can give it a try.
Flags: needinfo?(gkrizsanits) → needinfo?(continuation)
(Reporter)

Comment 12

2 months ago
Created attachment 8825820 [details]
leakoutput
(Assignee)

Comment 13

2 months ago
(In reply to Gabor Krizsanits [:krizsa :gabor] from comment #11)
> Yes. The "memory-pressure" thing.

I was thinking of the "shutdown-leaks-before-check" barrier thing in browser-test.js.

> can reproduce it then I can usually reproduce it consistently (and the other
> way around), so it is worth to try a couple of builds. You could also give
> some extra stress to the CPU while running maybe, or use a VM with single
> core.

That's quite odd!

> > have a log of a full run that fails I can look at that and try to analyze if
> > something is going wrong at the test harness level.
> 
> Yeah I'll attach one. It probably does not help much though since if I run a
> single test, then the window gets cleaned up only during the process
> shutdown. The output seems alright, just something seems to hold a reference
> to the window _sometimes_. Btw I tried using setTimeout before the GC (to
> see if it's just some event), and it helped locally (I wouldn't trust it
> because of the intermittent nature) but did not solve to problem on try.

Hmm strange.

> This means that we are able to GC the window before the final cleanup during
> the process shutdown right?

Yup.

> On the try server it might be easier to reproduce it... My problem is that
> even if I attached a debugger or rr, I don't think I would know what to do / look
> for.

The basic idea is to get a CC log at the point where the window is alive, but you wish it was cleaned up. This will show you that there's some missing reference to an object. Then you look at the releases of that object after that point, and compare it to the known references, to figure out what the missing reference was. Or I have another heap-scanning based approach that does something similar. It is not very easy to do, in a case like this where even reproducing it is hard.

> I mean, I don't see what guarantees that after the window.close() the destruction of
> the DOM tree / JS scopes / docshell tree got to the point where a GC/CC can free the
> window, other than that it's worked so far with a single content process. If you
> have some hints for me where to look, I can give it a try.

I'm not really sure what the guarantees are.

One question is, why is e10s multi different? One possibility is that browser-test.js sets dom.ipc.keepProcessesAlive. It might be worth seeing if setting that causes these leaks even in the regular e10s case, and if not setting it results in leaks in the e10s-multi case (assuming the timeouts are not too much of an issue). Maybe we discard things more aggressively if we're not keeping the process alive?

It might be interesting to try running with MOZ_IPC_MESSAGE_LOG=1 and see if there's any obvious difference in messages sent during shutdown. Maybe it is too noisy.

I've been looking over the various bugs blocking bug 1303113 to see if I can figure out any kind of pattern. One interesting log I found was in bug 1328374 comment 0. Here the test that creates the leaked window is actually the second-to-last one. The next test does not reuse the same process, so you'd think it would just be sitting around for a while, so weird races would be less of an issue.
Flags: needinfo?(continuation)
(Assignee)

Updated

2 months ago
Whiteboard: [e10s-multi:M1] → [e10s-multi:M1][MemShrink]

Updated

2 months ago
Whiteboard: [e10s-multi:M1][MemShrink] → [e10s-multi:M1][MemShrink:P1]
(Reporter)

Comment 14

2 months ago
(In reply to Andrew McCreight [:mccr8] from comment #13)
> The basic idea is to get a CC log at the point where the window is alive,
> but you wish it was cleaned up. This will show you that there's some missing
> reference to an object. Then you look at the releases of that object after
> that point, and compare it to the known references, to figure out what the
> missing reference was. Or I have another heap-scanning based approach that
> does something similar. It is not very easy to do, in a case like this where
> even reproducing it is hard.

Thanks, I'll try this approach.

> One question is, why is e10s multi different? One possibility is that
> browser-test.js sets dom.ipc.keepProcessesAlive. It might be worth seeing if
> setting that causes these leaks even in the regular e10s case, and if not
> setting it results in leaks in the e10s-multi case (assuming the timeouts
> are not too much of an issue). Maybe we discard things more aggressively if
> we're not keeping the process alive?

I suspected that as well, but I could not detect any difference when it's turned on
or off.

> 
> It might be interesting to try running with MOZ_IPC_MESSAGE_LOG=1 and see if
> there's any obvious difference in messages sent during shutdown. Maybe it is
> too noisy.
> 
> I've been looking over the various bugs blocking bug 1303113 to see if I can
> figure out any kind of pattern. One interesting log I found was in bug
> 1328374 comment 0. Here the test that creates the leaked window is actually
> the second-to-last one. The next test does not reuse the same process, so
> you'd think it would just be sitting around for a while, so weird races
> would be less of an issue.

Interesting. On a side note, I wonder if we should just nuke all the windows if
a content process is kept alive but the last tab/window is closed that it has
been hosting.
(Reporter)

Comment 15

2 months ago
Created attachment 8826221 [details]
cclogcrash.txt

I'm trying to get the CC logger to work but it's crashing on me reliably after this scary assertion... Is there something I'm doing wrong? (more errors in the full output)

command:
---------
MOZ_CC_LOG_ALL=1 MOZ_CC_LOG_PROCESS="all" MOZ_CC_LOG_DIRECTORY="~/Documents/log" MOZ_DISABLE_CONTENT_SANDBOX=1 ./mach mochitest browser/base/content/test/captivePortal/browser_captivePortal_certErrorUI.js


stack:
-------
Assertion failure: !DebugFileIDs.Contains(aHandle), at /Users/gkrizsanits/Documents/development/mozilla-central/xpcom/build/PoisonIOInterposerBase.cpp:231
#01: MozillaRegisterDebugHandle (PoisonIOInterposerBase.cpp:231, in XUL)
#02: MozillaRegisterDebugFD (PoisonIOInterposerBase.cpp:239, in XUL)
#03: MozillaRegisterDebugFILE (PoisonIOInterposerBase.cpp:249, in XUL)
#04: nsCycleCollectorLogSinkToFile::OpenLog(nsCycleCollectorLogSinkToFile::FileInfo*) (nsCycleCollector.cpp:1687, in XUL)
#05: nsCycleCollectorLogSinkToFile::Open(__sFILE**, __sFILE**) (nsCycleCollector.cpp:1579, in XUL)
#06: nsCycleCollectorLogger::Begin() (nsCycleCollector.cpp:1832, in XUL)
#07: nsCycleCollector::BeginCollection(ccType, nsICycleCollectorListener*) (nsCycleCollector.cpp:3828, in XUL)
#08: nsCycleCollector::Collect(ccType, js::SliceBudget&, nsICycleCollectorListener*, bool) (nsCycleCollector.cpp:3651, in XUL)
#09: nsCycleCollector_collect(nsICycleCollectorListener*) (nsCycleCollector.cpp:4143, in XUL)
#10: nsJSContext::CycleCollectNow(nsICycleCollectorListener*, int) (nsJSEnvironment.cpp:1439, in XUL)
#11: nsXPCComponents_Utils::ForceCC(nsICycleCollectorListener*) (XPCComponents.cpp:2582, in XUL)
(Assignee)

Updated

2 months ago
Depends on: 1330712
(Assignee)

Comment 16

2 months ago
(In reply to Gabor Krizsanits [:krizsa :gabor] from comment #15)
> I'm trying to get the CC logger to work but it's crashing on me reliably
> after this scary assertion... Is there something I'm doing wrong? (more
> errors in the full output)

What OS are you on? We have code that checks if you do IO late in shutdown. The fds used for CC/GC logging are supposed to be whitelisted, but I think bkelly or somebody hit the same issue as you on Windows. Maybe it is broken there?

I managed to reproduce the issue with a variation of what you posted, so thanks for putting that up:
MOZ_DISABLE_CONTENT_SANDBOX=1 MOZ_CC_LOG_THREAD=main MOZ_CC_LOG_ALL=1 MOZ_CC_LOG_PROCESS=content MOZ_CC_LOG_DIRECTORY="~/logs/multileaks/1" ./mach mochitest -f browser browser/base/content/test/captivePortal/
(Assignee)

Comment 17

2 months ago
Ok, so that wasn't useful, but by adding all traces I was able to come up with something more useful.

MOZ_CC_ALL_TRACES=all MOZ_DISABLE_CONTENT_SANDBOX=1 MOZ_CC_LOG_THREAD=main MOZ_CC_LOG_ALL=1 MOZ_CC_LOG_PROCESS=content MOZ_CC_LOG_DIRECTORY="~/logs/multileaks/2" ./mach mochitest -f browser browser/base/content/test/captivePortal/

The INFO line after the UNEXPECTED-FAIL gives you the serial numbers for the leaking windows:
  windows(s) leaked: [pid = 13260] [serial = 6], [pid = 13260] [serial = 3]
You can use that to find out the pointer value of the window by looking earlier in the log:
  --DOMWINDOW == 1 (0x7f777b714000) [pid = 13260] [serial = 6]  ...
From this, you can use find_roots.py to see what is holding the window alive:
  python ~/tools/heapgraph/find_roots.py cc-edges.13260-3.log 0x7f777b714000
This just said that the window reflector for the window (0x7f777c861240) was holding it alive, because it was marked black. You then find out why that is alive in the GC graph (with the -bro option to ignore gray roots that don't result in black marking):
  python ~/tools/heapgraph/find_roots.py gc-edges.13260-3.log -bro 0x7f777c861240
The output from that is this:
  via ipc-object :
  0x7f777c872240 [Proxy <no private>]
      --[private]--> 0x7f777c861240 [Window <no private>]

This means that the window reflector is held alive via some proxy, and the proxy is held alive by a root named "ipc-object". If you look for that in Searchfox you get the method IdToObjectMap::trace(). I think this means that a CPOW is holding the window alive.

I don't know about the lifetime of cpows, unfortunately.
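Conceptually, find_roots.py searches the heap graph from the roots toward the target address and prints one retaining path, like the ipc-object chain above. A minimal sketch of that search (hypothetical Python; the real heapgraph tool also parses the CC/GC log format and handles weak roots):

```python
from collections import deque

def find_retaining_path(roots, edges, target):
    """Breadth-first search from the given roots through the edge map
    (node -> list of referenced nodes); returns one shortest path ending
    at `target`, or None if the target is unreachable (i.e. garbage)."""
    queue = deque([root] for root in roots)  # each queue entry is a path
    seen = set(roots)
    while queue:
        path = queue.popleft()
        if path[-1] == target:
            return path
        for nxt in edges.get(path[-1], ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None
```

On a graph shaped like the output above (ipc-object root -> Proxy -> Window), this recovers the same root -> proxy -> reflector chain.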
(Assignee)

Updated

2 months ago
Summary: Leak detector gives false positives → windows intermittently leaked until shutdown via Cpows with e10s-multi
(Assignee)

Comment 18

2 months ago
Blake, maybe you have some idea of how cleanup of cpows is supposed to work here? I'll try reading the source code in the interim.
Flags: needinfo?(mrbkap)
(Assignee)

Comment 19

2 months ago
Also, the GC log analysis tool is slightly flaky when it comes to deciding what a root is, so it is possible that this is a weak root and the tool just can't tell. If that's the case, we should be able to poke around in the logs for the real root.
(Assignee)

Comment 20

2 months ago
Created attachment 8826357 [details]
GC log for 0x7f777c861240
(Assignee)

Comment 21

2 months ago
Bug 1311212 made these references strong "until the next turn of the event loop". Maybe that's related?
(Assignee)

Comment 22

2 months ago
If I'm reading the comments correctly, the strong references are only really needed when MOZ_DEBUG_DEAD_CPOWS is set, so maybe I can avoid the strong rooting unless that is set.
(Assignee)

Comment 23

2 months ago
Yeah that seems to work. With my patch, I was unable to reproduce the leak in browser browser/base/content/test/captivePortal/ with around 10 repetitions, whereas it was reproducing around 1/3 of the time before.
(Assignee)

Updated

2 months ago
Flags: needinfo?(mrbkap)
(Assignee)

Updated

2 months ago
Attachment #8825442 - Attachment is obsolete: true
(Assignee)

Comment 24

2 months ago
try run: https://treeherder.mozilla.org/#/jobs?repo=try&revision=99a52be61191eb7cf57d910548d8f62869aed512
try run with e10s multi enabled: https://treeherder.mozilla.org/#/jobs?repo=try&revision=33c909d1348ae666e05d6fdb1d2098d3dbe212a4

I'll put my patches up for review even though the try runs have not finished.
(Assignee)

Updated

2 months ago
Component: Mochitest → JavaScript Engine
Product: Testing → Core
Version: Version 3 → Trunk
Comment hidden (mozreview-request)
Comment hidden (mozreview-request)
(Reporter)

Comment 27

2 months ago
(In reply to Andrew McCreight [:mccr8] from comment #16)
> (In reply to Gabor Krizsanits [:krizsa :gabor] from comment #15)
> > I'm trying to get the CC logger to work but it's crashing on me reliably
> > after this scary assertion... Is there something I'm doing wrong? (more
> > errors in the full output)
> 
> What OS are you on? We have code that checks if you do IO late in shutdown.
> The fds used for CC/GC logging are supposed to be whitelisted, but I think
> bkelly or somebody hit the same issue as you on Windows. Maybe it is broken
> there?

This was on OSX.

By the way, I found the detailed steps you posted for the leak tracing super
useful; next time I won't have to bother you when I have to track down a leak :)
Is there any documentation for this? If not, we could ask Will to help us put
something together...

Anyway, thanks a lot for fixing this!
(Assignee)

Updated

2 months ago
Attachment #8826394 - Attachment is obsolete: true
Attachment #8826394 - Flags: review?(mrbkap)
(Assignee)

Updated

2 months ago
Attachment #8826395 - Attachment is obsolete: true
Attachment #8826395 - Flags: review?(mrbkap)
(Assignee)

Updated

2 months ago
Attachment #8826394 - Attachment is obsolete: false
Attachment #8826395 - Attachment is obsolete: false
(Assignee)

Comment 28

2 months ago
Ryan said that the original patch fixed a bunch of intermittent CPOW failures, so I guess I can't just revert it. I can probably add some CPOW kill switch that we fire off right before we do the ShutdownLeaks collection in the child.
(Assignee)

Comment 29

2 months ago
> Is there any documentation for this?

Not really. I've been slowly working on writing some up, but I should really finish that off.

Anyways, I spent some more time looking into this, and talking it over with Bill, and what seems to be happening here is that under some circumstances the message manager gets a message, but then doesn't unwrap the CPOWs (like if nobody is listening for messages or there's no frame loader, which I could imagine happening after all tabs are removed maybe).

Without the unwrap, the parent is never aware that the CPOW was sent to it, so nextCPOWNumber_ never gets set, so we end up keeping alive the CPOW until another one is created. Bill said that in some other circumstances (I think when a CPOW is sent in the other direction) the CPOW is strongly held for real, so you'd actually end up leaking something forever.

We want to make sure the CPOWs are always unwrapped, even if we're not going to use them (I had a hacky version that just did the nextCPOWNumber_ thing, but Bill said due to the other problem we should just always unwrap them fully.)
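A toy model of that invariant (hypothetical Python; the real fix is the C++ ~CrossProcessCpowHolder destructor shown in the review below): the CPOW entries attached to a message must be unwrapped exactly once, even when the message is ignored, so that the sender-side bookkeeping (e.g. advancing nextCPOWNumber_) runs and the strong reference can be dropped.

```python
class CpowHolder:
    """Toy stand-in for CrossProcessCpowHolder. `notify_sender` models the
    bookkeeping that tells the sending side its CPOWs were received, so it
    can stop strongly rooting them."""

    def __init__(self, cpows, notify_sender):
        self._cpows = cpows
        self._notify_sender = notify_sender
        self._unwrapped = False

    def to_object(self):
        # Normal path: a message listener asks for the CPOWs.
        self._unwrapped = True
        self._notify_sender()
        return list(self._cpows)

    def close(self):
        # The fix: if the message was ignored (no listener, no frame
        # loader), still unwrap on destruction so nothing stays rooted.
        if self._cpows and not self._unwrapped:
            self.to_object()
```

Before the fix, the "close without unwrap" path simply dropped the entries, leaving the sender rooting the CPOW until the next one was created (or forever, in the strongly-held direction).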
(Assignee)

Updated

2 months ago
Blocks: 1331647
(Assignee)

Comment 30

2 months ago
I have two followups. First, in 6 places, such as ContentChild::RecvAsyncMessage, the CrossProcessCpowHolder is not created unless a message manager exists. Second, my patch only deals with the case where we send a CpowEntry, but there are many other ways we can send a RemoteObject, so I should audit those to see if a similar failure to unwrap can occur.

try: https://treeherder.mozilla.org/#/jobs?repo=try&revision=82eaf33b73f19bd4e7575ad4aa8befdd034d0456
(Assignee)

Comment 31

2 months ago
I'm leaving those as followups because they don't seem to be needed to fix the intermittent leak issue with e10s-multi, which is my main priority. Moving the holders is simple, but I worry it may introduce weird null pointer crashes, so I want to separate it. Auditing will require a bit of time. I looked at a few cases, and they seemed fine to me, so hopefully nothing will need to be done for the other cases.
(Assignee)

Comment 32

2 months ago
This is a regression from bug 1311212. Bug 1331647 suggests this can result in leaks for users.
status-firefox51: --- → unaffected
status-firefox52: --- → affected
status-firefox53: --- → affected
status-firefox-esr45: --- → unaffected
(Assignee)

Updated

2 months ago
Keywords: mlk, regression
Comment hidden (mozreview-request)
(Assignee)

Updated

2 months ago
Attachment #8826395 - Attachment is obsolete: true
(Assignee)

Comment 34

2 months ago
I'm asking for review from Bobby as XPConnect module owner to approve the additional use of the junk scope. The basic idea here is that we need to go through the steps of creating this JS object and destroying it to avoid a leak, but the object is then immediately garbage, and we may not have a handy global sitting around to allocate it in.
(Assignee)

Updated

2 months ago
Blocks: 1331743
(Assignee)

Updated

2 months ago
See Also: → bug 1331748

Comment 35

2 months ago
mozreview-review
Comment on attachment 8826394 [details]
Bug 1330018 - Ensure we always unwrap CpowEntries.

https://reviewboard.mozilla.org/r/104334/#review106088

::: js/ipc/JavaScriptShared.cpp:674
(Diff revision 2)
>          js_ = managerGetter->GetCPOWManager();
>  }
>  
> +CrossProcessCpowHolder::~CrossProcessCpowHolder()
> +{
> +    if (cpows_.Length() && !unwrapped_) {

Please comment that this should only happen if a message manager message gets ignored somehow. We just need to create the CPOW on our side so that it can be GCed on the other side.

::: js/ipc/JavaScriptShared.cpp:676
(Diff revision 2)
>  
> +CrossProcessCpowHolder::~CrossProcessCpowHolder()
> +{
> +    if (cpows_.Length() && !unwrapped_) {
> +        AutoJSAPI jsapi;
> +        if (!jsapi.Init(xpc::PrivilegedJunkScope()))

Perhaps comment that we're creating an object that will immediately become garbage, so the scope doesn't matter too much.
Attachment #8826394 - Flags: review?(wmccloskey) → review+

Comment 36

2 months ago
mozreview-review
Comment on attachment 8826394 [details]
Bug 1330018 - Ensure we always unwrap CpowEntries.

https://reviewboard.mozilla.org/r/104334/#review106094

Fine by me. Thanks for flagging, though if there's anyone I trust to use the junk scopes responsibly, it's you and bill. :-)
Attachment #8826394 - Flags: review?(bobbyholley) → review+
Comment hidden (mozreview-request)

Comment 38

2 months ago
Pushed by amccreight@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/9ca73fb60c36
Ensure we always unwrap CpowEntries. r=bholley,billm

Comment 39

2 months ago
bugherder
https://hg.mozilla.org/mozilla-central/rev/9ca73fb60c36
Status: NEW → RESOLVED
Last Resolved: 2 months ago
status-firefox53: affected → fixed
Resolution: --- → FIXED
Target Milestone: --- → mozilla53

Updated

2 months ago
Blocks: 1320395
(Assignee)

Updated

2 months ago
Blocks: 1311212
(Assignee)

Comment 40

2 months ago
Comment on attachment 8826394 [details]
Bug 1330018 - Ensure we always unwrap CpowEntries.

Approval Request Comment
[Feature/Bug causing the regression]: bug 1311212
[User impact if declined]: bad leaks, though maybe only rarely (bug 1331647 is an example of a user leak that looks like this)
[Is this code covered by automated tests?]: yes, though without e10s multi enabled I'm not sure how much it has been tested.
[Has the fix been verified in Nightly?]: it only landed on m-c earlier today
[Needs manual test from QE? If yes, steps to reproduce]: no
[List of other uplifts needed for the feature/fix]: none
[Is the change risky?]: not really
[Why is the change risky/not risky?]: it just creates and destroys some objects in rare cases, rather than leaking them, so the running code shouldn't be too different.
[String changes made/needed]: none
Attachment #8826394 - Flags: approval-mozilla-aurora?
(Assignee)

Comment 41

2 months ago
I did a bunch of retriggers for the e10s-multi try run in comment 30 without any leaks, so I'm going to optimistically assume this fixes them all.
(Assignee)

Updated

2 months ago
No longer blocks: 1331647
Duplicate of this bug: 1331647
Comment on attachment 8826394 [details]
Bug 1330018 - Ensure we always unwrap CpowEntries.

fix leaks with e10s-multi, aurora52+
Attachment #8826394 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+

Comment 44

2 months ago
bugherderuplift
https://hg.mozilla.org/releases/mozilla-aurora/rev/5a6e8c89b306
status-firefox52: affected → fixed
(Assignee)

Updated

2 months ago
Duplicate of this bug: 1328428
(Assignee)

Updated

2 months ago
Duplicate of this bug: 1328427
(Assignee)

Updated

2 months ago
Duplicate of this bug: 1328426
(Assignee)

Updated

2 months ago
Duplicate of this bug: 1328358
(Assignee)

Updated

2 months ago
Duplicate of this bug: 1328359
(Assignee)

Updated

2 months ago
Duplicate of this bug: 1328360
(Assignee)

Updated

2 months ago
Duplicate of this bug: 1328366
(Assignee)

Updated

2 months ago
Duplicate of this bug: 1328368
(Assignee)

Updated

2 months ago
Duplicate of this bug: 1328371
(Assignee)

Updated

2 months ago
Duplicate of this bug: 1328372
(Assignee)

Updated

2 months ago
Duplicate of this bug: 1328376
(Assignee)

Updated

2 months ago
Duplicate of this bug: 1328377
(Assignee)

Updated

2 months ago
Duplicate of this bug: 1328379
(Assignee)

Updated

2 months ago
Duplicate of this bug: 1328380
(Assignee)

Updated

2 months ago
Duplicate of this bug: 1328381
(Assignee)

Updated

2 months ago
Duplicate of this bug: 1328382
(Assignee)

Updated

2 months ago
Duplicate of this bug: 1328384
(Assignee)

Updated

2 months ago
Duplicate of this bug: 1328387
(Assignee)

Updated

2 months ago
Duplicate of this bug: 1328389
(Assignee)

Updated

2 months ago
Duplicate of this bug: 1328390
(Assignee)

Updated

2 months ago
Duplicate of this bug: 1328392
(Assignee)

Updated

2 months ago
Duplicate of this bug: 1328395
(Assignee)

Updated

2 months ago
Duplicate of this bug: 1328396