Closed
Bug 1330018
Opened 7 years ago
Closed 7 years ago
windows intermittently leaked until shutdown via Cpows with e10s-multi
Categories
(Core :: JavaScript Engine, defect)
Tracking
RESOLVED
FIXED
mozilla53
Tracking | Status
---|---
firefox-esr45 | unaffected
firefox51 | unaffected
firefox52 | fixed
firefox53 | fixed
People
(Reporter: gkrizsanits, Assigned: mccr8)
References
Details
(Keywords: memory-leak, regression, Whiteboard: [e10s-multi:M1][MemShrink:P1])
Attachments
(4 files, 2 obsolete files)
For some reason, with multiple content processes enabled, I saw some tests leak. After changing one of the leaking tests a bit, it seems the problem is with the leak detector. The modified test creates a bunch of windows and closes them. I use BrowserTestUtils.closeWindow, which waits for the final session storage message from the children (in the case of this test, one per window) plus the "domwindowclosed" event on the parent side. After this we start the GC on the parent side and send messages to the child processes to do the same.

The problem is that at this point there is no guarantee that the GC/CC will be able to collect the windows in the child process: we're in the middle of destroying the frame loader, and depending on how busy the CPU is, in debug mode it can take a while until we get to the point where it makes sense to run GC/CC. Multiple content processes make things worse here, but the modified test in debug mode leaks even with a single content process (executed with --repeat 5).

Side notes:
- On the parent side we are quite aggressive with GC (http://searchfox.org/mozilla-central/source/testing/mochitest/browser-test.js#644) compared to the child side... should we do more on the child side too?
- It seems we don't wait for the child processes to finish their GC before launching the next test (http://searchfox.org/mozilla-central/source/testing/mochitest/browser-test.js#647). Can that cause a problem?
Reporter
Comment 1•7 years ago

Reporter
Comment 2•7 years ago
I'm blocked by this issue from landing multiple content processes on nightly. Could you please help me figure out what to do about it?
Flags: needinfo?(continuation)
Assignee
Comment 3•7 years ago
Sure, I can take a look. The basic way this works is that the parent process sends the async browser-test:collect-request message, which the child gets. Then the child runs some GCs synchronously and prints out a message with its pid. Then the leak detector (which is a Python script that runs on the log) looks for that message, checking that windows created during the test in that process were destroyed before the message. Here are the various pieces of that: http://searchfox.org/mozilla-central/search?q=Completed%20ShutdownLeaks%20collections%20in%20process&path=

I wasn't thinking at all about e10s-multi when I was converting this check to work with e10s, so it is possible something doesn't work. It is also possible that running the same test repeatedly confuses the leak detector, which could explain why you'd see issues with --repeat.
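The log-scanning step can be sketched roughly like this. This is a simplified stand-in for the real leaks.py logic, not a copy of it; the function name and the exact regexes are illustrative:

```python
import re

# Log-line patterns based on the ++DOMWINDOW / --DOMWINDOW lines that appear
# in mochitest logs, plus the per-process ShutdownLeaks marker.
CREATED = re.compile(r"\+\+DOMWINDOW == \d+ \(0x[0-9a-f]+\) \[pid = (\d+)\] \[serial = (\d+)\]")
DESTROYED = re.compile(r"--DOMWINDOW == \d+ \(0x[0-9a-f]+\) \[pid = (\d+)\] \[serial = (\d+)\]")
COLLECTED = re.compile(r"Completed ShutdownLeaks collections in process (\d+)")

def find_leaked_windows(log_lines):
    """Return (pid, serial) pairs for windows created in a process but not
    destroyed before that process printed its shutdown-collection marker."""
    live = {}       # (pid, serial) -> alive
    leaked = set()
    for line in log_lines:
        m = CREATED.search(line)
        if m:
            live[(m.group(1), m.group(2))] = True
            continue
        m = DESTROYED.search(line)
        if m:
            live.pop((m.group(1), m.group(2)), None)
            continue
        m = COLLECTED.search(line)
        if m:
            pid = m.group(1)
            # Anything still alive in this process at its marker is a leak.
            leaked |= {key for key in live if key[0] == pid}
    return leaked
```

A window destroyed only after the marker (as in the logs later in this bug) is still reported, which matches the harness behavior described above.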
Assignee
Comment 4•7 years ago
How do I run mochitests locally with e10s multi enabled?
Assignee: nobody → continuation
Flags: needinfo?(continuation) → needinfo?(gkrizsanits)
Comment 5•7 years ago
I don't know the pref offhand, but you can pass --setpref <pref>=<value> to the mach command. If that doesn't work for some reason, let me know, and just add it to testing/profiles/prefs_general.js in the meantime.
Reporter
Comment 6•7 years ago
(In reply to Andrew McCreight [:mccr8] from comment #3)
> Sure, I can take a look.

Thanks!

> The basic way this works is that the parent process sends the async
> browser-test:collect-request message, which the child gets. Then the child
> runs some GCs synchronously, and prints out a message with its pid. Then the
> leak detector (which is a Python script that runs on the log) looks for that
> message, checking that windows created during the test in that process were
> destroyed before the message.

Yeah, I think I've understood that part. I'm just not sure when the right time is to do a GC in the child... if we do it too early, something might still hold a reference to the window and we fail to collect, no? I fail to see why the BrowserTestUtils way of waiting for those messages/events is a sure way to know that we can finish the test and start the sync GC work. And I think in some cases it is too early.

> I wasn't thinking at all about e10s multi when I was converting this test to
> work with e10s, so it is possible something doesn't work.
>
> It is possible that running the same test repeatedly confuses the leak
> detector, which could explain why you'd see issues with --repeat.

Oh, I thought that since we have the window IDs / raw pointers, that should not be a problem, but I have not looked into it. The original test fails intermittently if you set the process count to 2 in all.js. Some more leaky tests are under bug 1303113.

(In reply to Andrew McCreight [:mccr8] from comment #4)
> How do I run mochitests locally with e10s multi enabled?

dom.ipc.processCount in all.js
Flags: needinfo?(gkrizsanits)
Assignee
Comment 7•7 years ago
(In reply to Gabor Krizsanits [:krizsa :gabor] from comment #6)
> Yeah, I think I've understood that part. I'm just not sure when is the right
> time to do a GC in the child... if we do it too early something might still hold
> a reference to the window and we fail to collect, no?

I'm not too clear on how we decide to do this, but the current approach has worked for a few years, so I don't think it can be too terrible. The entire point of this check is that if you close all the tabs a test opens, and GC a bunch, the windows the test creates should go away. I think there's even some kind of observer service notification code can listen for to drop stuff if it really needs to.

> Oh, I thought since we have the window IDs / raw pointers that should not
> be a problem but I have not looked into that.

I can reproduce the issue with --repeat. For at least one window, I see the "Completed ShutdownLeaks collections in process" message for the process before the window is even created. Then it never happens again. So we may not be doing the full cleanup procedure in the --repeat case. I looked over the code in leaks.py, and it does look like it should be able to handle repeated testing.

> The original test fails intermittently if you set the process count to 2 in all.js

I'm not able to reproduce a leak with the patch you attached here, running it alone using:

./mach mochitest dom/tests/browser/browser_test_toolbars_visibility.js

Does that consistently leak for you, or only intermittently? Also, if you have a log of a full run that fails, I can look at it and try to analyze whether something is going wrong at the test harness level.
Assignee
Comment 8•7 years ago
I'll look around in the bugs blocking bug 1303113 and see if I can figure anything out.
Assignee
Comment 9•7 years ago
Gabor, do you have any steps to reproduce this locally, without --repeat?
Flags: needinfo?(gkrizsanits)
Assignee
Comment 10•7 years ago
Here's what a log from bug 1328396 looks like (with many intermediate lines removed):

11:58:39 INFO - TEST-START | toolkit/mozapps/extensions/test/browser/browser_updateid.js
11:58:39 INFO - ++DOMWINDOW == 3 (0x111d64800) [pid = 2536] [serial = 249] [outer = 0x0]
11:58:41 INFO - TEST-OK | toolkit/mozapps/extensions/test/browser/browser_updateid.js | took 2071ms
11:58:46 INFO - Completed ShutdownLeaks collections in process 2536
11:58:53 (some warnings from nsThread::Shutdown() in process 2536, indicating we're probably pretty late in shutdown)
11:58:53 INFO - --DOMWINDOW == 0 (0x111d64800) [pid = 2536] [serial = 249] [outer = 0x0] [url = about:blank]

So it looks like we are properly running the ShutdownLeaks collection for this process, and I only see it for this process once, so this isn't leaks.py getting confused. The test is in fact just keeping an inner/outer window pair alive until very late in shutdown. Note that the window isn't destroyed until 7 seconds after the shutdown collection, which seems like a long time. Though maybe it actually became garbage quickly, and that's just the next time we ran the GC.

I'm not sure how easy it will be to figure out without a way to reproduce it. Maybe rr is the best hope here.
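The 7-second gap noted above can be read straight off the leading HH:MM:SS timestamps in the log. A quick sketch, assuming both lines fall on the same day (the function name is illustrative):

```python
import re
from datetime import datetime

TS = re.compile(r"^(\d\d:\d\d:\d\d)")

def log_gap_seconds(line_a, line_b):
    """Seconds elapsed between two mochitest log lines, using the leading
    HH:MM:SS timestamps. Assumes line_b is later on the same day."""
    a = datetime.strptime(TS.match(line_a).group(1), "%H:%M:%S")
    b = datetime.strptime(TS.match(line_b).group(1), "%H:%M:%S")
    return (b - a).total_seconds()
```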
Reporter
Comment 11•7 years ago
(In reply to Andrew McCreight [:mccr8] from comment #7)
> I'm not too clear on how we decide to do this, but the current approach has
> worked for a few years, so I don't think it can be too terrible. [...] I think
> there's even some kind of observer service thing code can listen for to drop
> stuff if it really needs to.

Yes, the "memory-pressure" thing. Speaking of which, I did some logging for another test (browser_captivePortal_certErrorUI.js), and I think that in the non-leaking case it's not the GC initiated from the ShutdownLeakCollector that picks up the window, but the one requested from the parent side here: http://searchfox.org/mozilla-central/source/dom/ipc/ContentParent.cpp#2360 which is then handled on the child side here: http://searchfox.org/mozilla-central/source/dom/ipc/ContentChild.cpp#2355 and then starts an async CC somewhere...

> I can reproduce the issue with --repeat. [...] I looked over
> the code in leaks.py and it does look like it should be able to handle
> repeated testing.

Let's forget about the --repeat case then. I created 5 different copies of the test above and ran them in a suite (instead of using --repeat on the same test file), and that did not report any leaks... So you're probably right about --repeat.

> Does that consistently leak for you, or only intermittently?

Intermittently, in the most annoying way. Sometimes I can reproduce it several times in a row; sometimes I cannot reproduce it at all no matter how much I try.

You could try browser_captivePortal_certErrorUI.js; it is pretty frequent with two content processes, especially, for some reason, right after a fresh build (./mach build && ./mach mochitest browser/base/content/test/captivePortal/browser_captivePortal_certErrorUI.js). I know it does not make much sense, but the annoying part is that if I can reproduce it then I can usually reproduce it consistently (and the other way around), so it is worth trying a couple of builds. You could also give some extra stress to the CPU while running, maybe, or use a VM with a single core.

> have a log of a full run that fails I can look at that and try to analyze if
> something is going wrong at the test harness level.

Yeah, I'll attach one. It probably does not help much, though, since if I run a single test, the window gets cleaned up only during process shutdown. The output seems alright; just something seems to hold a reference to the window _sometimes_. By the way, I tried using setTimeout before the GC (to see if it's just some event), and it helped locally (I wouldn't trust it because of the intermittent nature) but did not solve the problem on try.

(In reply to Andrew McCreight [:mccr8] from comment #10)
> So, it looks like we are properly running the ShutdownLeaks collection for
> this process [...] Though maybe it actually became garbage quickly, and
> that's just the next time we ran the GC.

This means that we are able to GC the window before the final cleanup during process shutdown, right?

> I'm not sure how easy it will be to figure out without a way to reproduce
> it. Maybe rr is the best hope here.

On the try server it might be easier to reproduce... My problem is that even if I attached a debugger or rr, I don't think I would know what to do or what to look for. I mean, I don't see what guarantees that after window.close() the destruction of the DOM tree / JS scopes / docshell tree gets to the point where a GC/CC can free the window, other than that it has worked so far with a single content process. If you have some hints for me on where to look, I can give it a try.
Flags: needinfo?(gkrizsanits) → needinfo?(continuation)
Reporter
Comment 12•7 years ago
Assignee
Comment 13•7 years ago
(In reply to Gabor Krizsanits [:krizsa :gabor] from comment #11)
> Yes. The "memory-pressure" thing.

I was thinking of the "shutdown-leaks-before-check" barrier thing in browser-test.js.

> ... if I can reproduce it then I can usually reproduce it consistently (and the other
> way around), so it is worth trying a couple of builds.

That's quite odd!

> Yeah I'll attach one. [...] it helped locally (I wouldn't trust it
> because of the intermittent nature) but did not solve the problem on try.

Hmm, strange.

> This means that we are able to GC the window before the final cleanup during
> the process shutdown right?

Yup.

> On the try server it might be easier to reproduce it... My problem is that
> even if I attached a debugger or rr, I don't think I would know what to do / look
> for.

The basic idea is to get a CC log at the point where the window is alive but you wish it were cleaned up. This will show you that there's some missing reference to an object. Then you look at the releases of that object after that point, and compare them to the known references, to figure out what the missing reference was. Or I have another heap-scanning-based approach that does something similar. It is not very easy to do in a case like this where even reproducing the problem is hard.

> I mean, I don't see what guarantees that after the window.close() the destruction of
> the DOM tree / JS scopes / docshell tree got to the point where a GC/CC can free the
> window, other than that it's worked so far with a single content process.

I'm not really sure what the guarantees are.

One question is: why is e10s-multi different? One possibility is that browser-test.js sets dom.ipc.keepProcessesAlive. It might be worth seeing if setting that causes these leaks even in the regular e10s case, and if not setting it results in leaks in the e10s-multi case (assuming the timeouts are not too much of an issue). Maybe we discard things more aggressively if we're not keeping the process alive?

It might be interesting to try running with MOZ_IPC_MESSAGE_LOG=1 and see if there's any obvious difference in the messages sent during shutdown. Maybe it is too noisy.

I've been looking over the various bugs blocking bug 1303113 to see if I can figure out any kind of pattern. One interesting log I found was in bug 1328374 comment 0. There, the test that creates the leaked window is actually the second-to-last one. The next test does not reuse the same process, so you'd think the window would just be sitting around for a while, making weird races less of an issue.
Flags: needinfo?(continuation)
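One way to tame the noise of a MOZ_IPC_MESSAGE_LOG comparison is to diff per-message-name counts between a leaking run and a non-leaking run. A rough sketch; the log-line format assumed by the regex is illustrative and would need adjusting to the real output:

```python
import re
from collections import Counter

# Assumed shape of an IPC log line, e.g.:
#   "[1234->5678] Sending Msg_AsyncMessage"
# The real MOZ_IPC_MESSAGE_LOG format may differ.
MSG = re.compile(r"(Sending|Received)\s+(Msg_\w+)")

def message_counts(log_lines):
    """Count occurrences of each IPC message name in a log."""
    counts = Counter()
    for line in log_lines:
        m = MSG.search(line)
        if m:
            counts[m.group(2)] += 1
    return counts

def diff_counts(a, b):
    """Message names whose counts differ between two runs, with both counts."""
    return {name: (a.get(name, 0), b.get(name, 0))
            for name in set(a) | set(b) if a.get(name, 0) != b.get(name, 0)}
```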
Assignee
Updated•7 years ago
Whiteboard: [e10s-multi:M1] → [e10s-multi:M1][MemShrink]
Updated•7 years ago
Whiteboard: [e10s-multi:M1][MemShrink] → [e10s-multi:M1][MemShrink:P1]
Reporter
Comment 14•7 years ago
(In reply to Andrew McCreight [:mccr8] from comment #13)
> The basic idea is to get a CC log at the point where the window is alive,
> but you wish it was cleaned up. [...] It is not very easy to do, in a case like this where
> even reproducing it is hard.

Thanks, I'll try this approach.

> One question is, why is e10s multi different? One possibility is that
> browser-test.js sets dom.ipc.keepProcessesAlive. [...] Maybe we discard things more aggressively if
> we're not keeping the process alive?

I suspected that as well, but I could not detect any difference whether it's turned on or off.

> I've been looking over the various bugs blocking bug 1303113 to see if I can
> figure out any kind of pattern. [...] The next test does not reuse the same process, so
> you'd think it would just be sitting around for a while, so weird races
> would be less of an issue.

Interesting. On a side note, I wonder if we should just nuke all the windows when a content process is kept alive but the last tab/window it has been hosting is closed.
Reporter
Comment 15•7 years ago
I'm trying to get the CC logger to work, but it's crashing on me reliably after this scary assertion... Is there something I'm doing wrong? (More errors in the full output.)

command:
MOZ_CC_LOG_ALL=1 MOZ_CC_LOG_PROCESS="all" MOZ_CC_LOG_DIRECTORY="~/Documents/log" MOZ_DISABLE_CONTENT_SANDBOX=1 ./mach mochitest browser/base/content/test/captivePortal/browser_captivePortal_certErrorUI.js

stack:
Assertion failure: !DebugFileIDs.Contains(aHandle), at /Users/gkrizsanits/Documents/development/mozilla-central/xpcom/build/PoisonIOInterposerBase.cpp:231
#01: MozillaRegisterDebugHandle (PoisonIOInterposerBase.cpp:231, in XUL)
#02: MozillaRegisterDebugFD (PoisonIOInterposerBase.cpp:239, in XUL)
#03: MozillaRegisterDebugFILE (PoisonIOInterposerBase.cpp:249, in XUL)
#04: nsCycleCollectorLogSinkToFile::OpenLog(nsCycleCollectorLogSinkToFile::FileInfo*) (nsCycleCollector.cpp:1687, in XUL)
#05: nsCycleCollectorLogSinkToFile::Open(__sFILE**, __sFILE**) (nsCycleCollector.cpp:1579, in XUL)
#06: nsCycleCollectorLogger::Begin() (nsCycleCollector.cpp:1832, in XUL)
#07: nsCycleCollector::BeginCollection(ccType, nsICycleCollectorListener*) (nsCycleCollector.cpp:3828, in XUL)
#08: nsCycleCollector::Collect(ccType, js::SliceBudget&, nsICycleCollectorListener*, bool) (nsCycleCollector.cpp:3651, in XUL)
#09: nsCycleCollector_collect(nsICycleCollectorListener*) (nsCycleCollector.cpp:4143, in XUL)
#10: nsJSContext::CycleCollectNow(nsICycleCollectorListener*, int) (nsJSEnvironment.cpp:1439, in XUL)
#11: nsXPCComponents_Utils::ForceCC(nsICycleCollectorListener*) (XPCComponents.cpp:2582, in XUL)
Assignee
Comment 16•7 years ago
(In reply to Gabor Krizsanits [:krizsa :gabor] from comment #15)
> I'm trying to get the CC logger to work but it's crashing on me reliably
> after this scary assertion... Is there something I'm doing wrong?

What OS are you on? We have code that checks if you do IO late in shutdown. The fds used for CC/GC logging are supposed to be whitelisted, but I think bkelly or somebody hit the same issue as you on Windows. Maybe it is broken there?

I managed to reproduce the issue with a variation of what you posted, so thanks for putting that up:

MOZ_DISABLE_CONTENT_SANDBOX=1 MOZ_CC_LOG_THREAD=main MOZ_CC_LOG_ALL=1 MOZ_CC_LOG_PROCESS=content MOZ_CC_LOG_DIRECTORY="~/logs/multileaks/1" ./mach mochitest -f browser browser/base/content/test/captivePortal/
Assignee
Comment 17•7 years ago
Ok, so that wasn't useful, but by adding all traces I was able to come up with something more useful:

MOZ_CC_ALL_TRACES=all MOZ_DISABLE_CONTENT_SANDBOX=1 MOZ_CC_LOG_THREAD=main MOZ_CC_LOG_ALL=1 MOZ_CC_LOG_PROCESS=content MOZ_CC_LOG_DIRECTORY="~/logs/multileaks/2" ./mach mochitest -f browser browser/base/content/test/captivePortal/

The INFO line after the UNEXPECTED-FAIL gives you the serial numbers of the leaking windows:

windows(s) leaked: [pid = 13260] [serial = 6], [pid = 13260] [serial = 3]

You can use that to find the pointer value of the window by looking earlier in the log:

--DOMWINDOW == 1 (0x7f777b714000) [pid = 13260] [serial = 6] ...

From this, you can use find_roots.py to see what is holding the window alive:

python ~/tools/heapgraph/find_roots.py cc-edges.13260-3.log 0x7f777b714000

This just said that the window reflector for the window (0x7f777c861240) was holding it alive, because it was marked black. You then find out why that is alive in the GC graph (with the -bro option to ignore gray roots that don't result in black marking):

python ~/tools/heapgraph/find_roots.py gc-edges.13260-3.log -bro 0x7f777c861240

The output from that is this:

via ipc-object : 0x7f777c872240 [Proxy <no private>] --[private]--> 0x7f777c861240 [Window <no private>]

This means that the window reflector is held alive via some proxy, and the proxy is held alive by a root named "ipc-object". If you look for that in Searchfox you get the method IdToObjectMap::trace(). I think this means that a CPOW is holding the window alive. I don't know about the lifetime of CPOWs, unfortunately.
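The core of what a tool like find_roots.py does can be sketched as a breadth-first search from roots to the target address. This is a toy version under the assumption that the heap-graph edges have already been parsed; the real tool also handles the CC/GC log file format:

```python
from collections import deque

def find_retaining_path(edges, roots, target):
    """edges: {addr: [(edge_name, child_addr), ...]}
    roots: {addr: root_name}
    Returns a path [(root_name, addr), (edge_name, addr), ...] from some root
    to target, or None if target is not reachable (i.e. it is garbage)."""
    for root, root_name in roots.items():
        parent = {root: None}           # BFS tree: child -> parent
        via = {root: root_name}         # label on the edge into each node
        queue = deque([root])
        while queue:
            node = queue.popleft()
            if node == target:
                # Walk back up the BFS tree to reconstruct the path.
                path = []
                while node is not None:
                    path.append((via[node], node))
                    node = parent[node]
                return list(reversed(path))
            for name, child in edges.get(node, []):
                if child not in parent:
                    parent[child] = node
                    via[child] = name
                    queue.append(child)
    return None
```

On the graph from the log above, this reproduces the "via ipc-object ... --[private]--> Window" chain.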
Assignee
Updated•7 years ago
Summary: Leak detector gives false positives → windows intermittently leaked until shutdown via Cpows with e10s-multi
Assignee
Comment 18•7 years ago
Blake, maybe you have some idea of how cleanup of cpows is supposed to work here? I'll try reading the source code in the interim.
Flags: needinfo?(mrbkap)
Assignee
Comment 19•7 years ago
Also, the GC log analysis tool is slightly flaky when it comes to deciding what a root is, so it is possible that this is a weak root and the tool just can't tell. If that's the case, we should be able to poke around in the logs for the real root.
Assignee
Comment 20•7 years ago

Assignee
Comment 21•7 years ago
Bug 1311212 made these references strong "until the next turn of the event loop". Maybe that's related?
Assignee
Comment 22•7 years ago
If I'm reading the comments correctly, the strong references are only really needed when MOZ_DEBUG_DEAD_CPOWS is set, so maybe I can avoid the strong rooting unless that is set.
Assignee
Comment 23•7 years ago
Yeah, that seems to work. With my patch, I was unable to reproduce the leak in browser/base/content/test/captivePortal/ with around 10 repetitions, whereas it was reproducing around 1/3 of the time before.
Assignee
Updated•7 years ago
Flags: needinfo?(mrbkap)
Assignee
Updated•7 years ago

Attachment #8825442 - Attachment is obsolete: true
Assignee
Comment 24•7 years ago
try run: https://treeherder.mozilla.org/#/jobs?repo=try&revision=99a52be61191eb7cf57d910548d8f62869aed512
try run with e10s multi enabled: https://treeherder.mozilla.org/#/jobs?repo=try&revision=33c909d1348ae666e05d6fdb1d2098d3dbe212a4

I'll put my patches up for review even though the try runs have not finished.
Assignee
Updated•7 years ago
Component: Mochitest → JavaScript Engine
Product: Testing → Core
Version: Version 3 → Trunk
Comment hidden (mozreview-request)
Comment hidden (mozreview-request)
Reporter
Comment 27•7 years ago
(In reply to Andrew McCreight [:mccr8] from comment #16)
> What OS are you on? We have code that checks if you do IO late in shutdown.
> The fds used for CC/GC logging are supposed to be whitelisted, but I think
> bkelly or somebody hit the same issue as you on Windows. Maybe it is broken
> there?

This was on OSX.

By the way, I find the detailed steps you posted for the leak tracing super useful; next time I won't have to bother you if I have to track down a leak :) Is there any documentation for this? If not, we could ask Will to help us put something together... Anyway, thanks a lot for fixing this!
Assignee
Updated•7 years ago

Attachment #8826394 - Attachment is obsolete: true
Attachment #8826394 - Flags: review?(mrbkap)
Assignee
Updated•7 years ago

Attachment #8826395 - Attachment is obsolete: true
Attachment #8826395 - Flags: review?(mrbkap)
Assignee
Updated•7 years ago

Attachment #8826394 - Attachment is obsolete: false
Attachment #8826395 - Attachment is obsolete: false
Assignee
Comment 28•7 years ago
Ryan said that the original patch fixed a bunch of intermittent CPOW failures, so I guess I can't just revert it. I can probably add some CPOW kill switch that we fire off right before we do the ShutdownLeaks collection in the child.
Assignee
Comment 29•7 years ago
> Is there any documentation for this?
Not really. I've been slowly working on writing some up, but I should really finish that off.
Anyways, I spent some more time looking into this, and talking it over with Bill, and what seems to be happening here is that under some circumstances the message manager gets a message, but then doesn't unwrap the CPOWs (like if nobody is listening for messages or there's no frame loader, which I could imagine happening after all tabs are removed maybe).
Without the unwrap, the parent is never aware that the CPOW was sent to it, so nextCPOWNumber_ never gets set, so we end up keeping alive the CPOW until another one is created. Bill said that in some other circumstances (I think when a CPOW is sent in the other direction) the CPOW is strongly held for real, so you'd actually end up leaking something forever.
We want to make sure the CPOWs are always unwrapped, even if we're not going to use them (I had a hacky version that just did the nextCPOWNumber_ thing, but Bill said due to the other problem we should just always unwrap them fully.)
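The lifecycle described above can be illustrated with a toy model. All names here are hypothetical; the real bookkeeping is C++ in js/ipc/ (IdToObjectMap, nextCPOWNumber_). The point is only that the sender roots each CPOW in an id-to-object map, and it is the receiver's unwrap step that eventually lets those entries be released:

```python
# Toy model of the CPOW bookkeeping; names are illustrative, not the real API.
class CpowMap:
    def __init__(self):
        self.id_to_object = {}  # strongly roots objects (the "ipc-object" root)
        self.next_id = 0

    def send_cpow(self, obj):
        # Sender registers the object so the peer's proxy can reach it.
        self.next_id += 1
        self.id_to_object[self.next_id] = obj
        return self.next_id

    def on_peer_unwrap(self, up_to_id):
        # Only when the peer unwraps the message does the sender learn it can
        # drop entries up to that id. If the message is never unwrapped, the
        # entries stay rooted: that is the leak discussed in this bug.
        for cpow_id in [i for i in self.id_to_object if i <= up_to_id]:
            del self.id_to_object[cpow_id]
```

In the dropped-message scenario described above, on_peer_unwrap is simply never called, so the map (and every window a CPOW points at) stays alive.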
Assignee
Comment 30•7 years ago
I have two followups. First, in 6 places, such as ContentChild::RecvAsyncMessage, the CrossProcessCpowHolder is not created unless a message manager exists. Second, my patch only deals with the case where we send a CpowEntry, but there are many other ways we can send a RemoteObject, so I should audit those to see if a similar failure to unwrap can occur. try: https://treeherder.mozilla.org/#/jobs?repo=try&revision=82eaf33b73f19bd4e7575ad4aa8befdd034d0456
Assignee
Comment 31•7 years ago
I'm leaving those to followups because they don't seem to be needed to fix the intermittent leak issue with e10s-multi, which is my main priority. Moving the holders is simple, but I worry it may introduce weird null pointer crashes, so I want to separate it. Auditing will require a bit of time. I looked at a few cases and they seemed fine to me, so hopefully nothing will need to be done for the other cases.
Assignee
Comment 32•7 years ago
This is a regression from bug 1311212. Bug 1331647 suggests this can result in leaks for users.
status-firefox51: --- → unaffected
status-firefox52: --- → affected
status-firefox53: --- → affected
status-firefox-esr45: --- → unaffected
Assignee
Updated•7 years ago

Keywords: mlk, regression
Comment hidden (mozreview-request)
Assignee
Updated•7 years ago

Attachment #8826395 - Attachment is obsolete: true
Assignee
Comment 34•7 years ago
I'm asking for review from Bobby as XPConnect module owner to approve the additional use of the junk scope. The basic idea here is that we need to go through the steps of creating this JS object and destroying it to avoid a leak, but the object is then immediately garbage, and we may not have a handy global sitting around to allocate it in.
Comment 35•7 years ago
mozreview-review
Comment on attachment 8826394 [details]
Bug 1330018 - Ensure we always unwrap CpowEntries.

https://reviewboard.mozilla.org/r/104334/#review106088

::: js/ipc/JavaScriptShared.cpp:674 (Diff revision 2)
> js_ = managerGetter->GetCPOWManager();
> }
>
> +CrossProcessCpowHolder::~CrossProcessCpowHolder()
> +{
> + if (cpows_.Length() && !unwrapped_) {

Please comment that this should only happen if a message manager message gets ignored somehow. We just need to create the CPOW on our side so that it can be GCed on the other side.

::: js/ipc/JavaScriptShared.cpp:676 (Diff revision 2)
>
> +CrossProcessCpowHolder::~CrossProcessCpowHolder()
> +{
> + if (cpows_.Length() && !unwrapped_) {
> + AutoJSAPI jsapi;
> + if (!jsapi.Init(xpc::PrivilegedJunkScope()))

Perhaps comment that we're creating an object that will immediately become garbage, so the scope doesn't matter too much.
Attachment #8826394 - Flags: review?(wmccloskey) → review+
Comment 36•7 years ago
mozreview-review
Comment on attachment 8826394 [details]
Bug 1330018 - Ensure we always unwrap CpowEntries.

https://reviewboard.mozilla.org/r/104334/#review106094

Fine by me. Thanks for flagging, though if there's anyone I trust to use the junk scopes responsibly, it's you and bill. :-)
Attachment #8826394 - Flags: review?(bobbyholley) → review+
Comment hidden (mozreview-request)
Comment 38•7 years ago
Pushed by amccreight@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/9ca73fb60c36
Ensure we always unwrap CpowEntries. r=bholley,billm
Comment 39•7 years ago
bugherder
https://hg.mozilla.org/mozilla-central/rev/9ca73fb60c36
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla53
Assignee
Comment 40•7 years ago
Comment on attachment 8826394 [details]
Bug 1330018 - Ensure we always unwrap CpowEntries.

Approval Request Comment
[Feature/Bug causing the regression]: bug 1311212
[User impact if declined]: bad leaks, though maybe only rarely (bug 1331647 is an example of a user leak that looks like this)
[Is this code covered by automated tests?]: yes, though without e10s-multi enabled I'm not sure how much it has been tested.
[Has the fix been verified in Nightly?]: it only landed on m-c earlier today
[Needs manual test from QE? If yes, steps to reproduce]: no
[List of other uplifts needed for the feature/fix]: none
[Is the change risky?]: not really
[Why is the change risky/not risky?]: it just creates and destroys some objects in rare cases, rather than leaking them, so the code running shouldn't be too different.
[String changes made/needed]: none
Attachment #8826394 - Flags: approval-mozilla-aurora?
Assignee
Comment 41•7 years ago
I did a bunch of retriggers for the e10s-multi try run in comment 30 without any leaks, so I'm going to optimistically assume this fixes them all.
Comment 43•7 years ago
Comment on attachment 8826394 [details]
Bug 1330018 - Ensure we always unwrap CpowEntries.

fix leaks with e10s-multi, aurora52+
Attachment #8826394 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Comment 44•7 years ago
bugherder uplift
https://hg.mozilla.org/releases/mozilla-aurora/rev/5a6e8c89b306