Closed Bug 1612364 Opened 6 years ago Closed 5 years ago

Intermittent JSM shared global leaks when Fission is enabled

Categories

(Core :: XPConnect, task, P2)

task

Tracking

()

RESOLVED FIXED
mozilla77
Fission Milestone M6
Tracking Status
firefox-esr68 --- wontfix
firefox74 --- disabled
firefox75 --- disabled
firefox76 --- fixed
firefox77 --- fixed

People

(Reporter: mccr8, Assigned: mccr8)

References

(Blocks 1 open bug)

Details

(Keywords: memory-leak, Whiteboard: [MemShrink:P2])

Attachments

(3 files)

As seen in bug 1576498, in some Fission tests we frequently see intermittent leaks of a few XPConnect related objects. I've papered these over by setting a pref to reuse Fission processes (for same-origin). I suspect this is due to some kind of race, maybe where shutting down a content process is racing with starting it up, in some kind of way, because the tests that hit this leak most often seem to create a lot of child processes.

Investigation is difficult because I have not been able to reproduce it locally. I have managed to figure out that it appears to be the JSM shared global that is leaking.

This patch removes most of the leak-related places I set the Fission process reuse prefs. With this patch, the failure rate is high enough to make try-based investigation feasible.

I haven't spent a ton of time investigating this, because this is not a "real leak", as we expect the JSM global to always be held alive until shutdown. This is mostly a matter of avoiding TreeHerder failures, which are a hassle for people working on developing Firefox.

These XPConnect objects that are leaking could potentially form a cycle, and thus require the cycle collector to destroy them. XPCJSRuntime::mLoaderGlobal is a reference to the loader global that is not cleared until nsXPConnect is being destroyed, which is after the cycle collector shuts down.

Maybe these objects are normally acyclic, but under whatever unknown race conditions they are cyclic. Thus maybe clearing mLoaderGlobal earlier would fix the leak, was one theory I had.

mozJSComponentLoader::mLoaderGlobal is another strong reference to the loader global, but this one is cleared before we do our final CC.

However, another weird thing is that XPCJSRuntime::mLoaderGlobal and mozJSComponentLoader::mLoaderGlobal are lazily created, and as far as I can see there is no guarantee that once we clear either of those references, they can't be created again if something asks for them. Another theory I had was that maybe something is calling the accessor methods for these fields late in shutdown, thus creating a new cyclic object that isn't freed.

To investigate that, I wrote the attached hacky patch. This adds a new state variable, gCCState, that records the current state of the main thread's cycle collector. In state 0, the CC hasn't been initialized. In state 1, the CC has been initialized (the normal state). In state 2, we've begun the shutdown CC, and in state 3, we've shut down the CC.

If we're in state 2 or later, and somebody calls XPCJSRuntime::LoaderGlobal() or mozJSComponentLoader::GetSharedGlobal, then we crash. Also, when we begin state 2, it clears XPCJSRuntime::mLoaderGlobal.

Anyways, with my patch, we don't hit either of those crashes (which is good, and maybe we could try enforcing that more in debug builds), but we still leak. I should figure out if I can get a CC log on try without disrupting whatever condition is causing the leak.

One possible startup/shutdown race related to the JSM loader is the ScriptPreloader, but I don't know how this could cause a leak. I see that it already watches for XPCOM shutdown and does some cleanup.

Tracking for Fission Nightly (M6)

Fission Milestone: --- → M6

I wrote a patch to simplify XPConnect shutdown a tiny bit, so that XPConnect shouldn't be holding onto the JSM global when we start doing the shutdown CCs. Then I added in some hacky patches to record shutdown CCs on treeherder (which ended up recording them for way too many processes, but that's another story...) and managed to get a log where the global is leaking. Here's what is keeping alive the XPCWN for the BackstagePass:

0x7f7c8e21bf70 [JSStackFrame]
--[mStack]--> 0x32b570b78900 [JS Object (SavedFrame)]
--[group_global]--> 0x11eeef679100 [JS Object (BackstagePass)]
--[js::GetObjectPrivate(obj)]--> 0x7f7c9639d940 [XPCWrappedNative (BackstagePass)]
Root 0x7f7c8e21bf70 is a ref counted object with 1 unknown edge(s).

I could only find a few heap references to nsIStackFrame. ConsoleCallData looks pretty dodgy. It isn't cycle collected because it can be used either to hold worker stuff or main thread stuff.

I thought this was about Console, because we clear the Consoles for windows when the windows hit their destroy event, but Consoles are also available to JSMs, and we never clear that Console, which seemed mighty suspicious. However, clearing all Consoles at xpcom-shutdown didn't fix the leak.

However, I did find another heap reference to nsIStackFrame, CycleCollectedJSContext::mPendingException (well, the dom::Exception holds an mLocation). Clearing that before we do the shutdown CC seems to fix the leak. I'm not sure if that's the best fix, but at least I seem to have figured out where the leak is coming from.

Try looks green: https://treeherder.mozilla.org/#/jobs?repo=try&revision=e08188fbab98eaece26099199afe6f74c6fc4afe

I removed all of the uses of the shared content process pref. I'll land that separately if the intermitents on TreeHerder clear up.

Blocks: 1628052

This could be causing transient leaks in other places besides shutdown.

Keywords: memory-leak
Whiteboard: [MemShrink:P2]
Pushed by amccreight@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/38deef4ef9c5 Clear mPendingException when returning to the top level event loop. r=nika

I forgot to update the patch summary when I updated the patch. The "top level" part shouldn't be there.

Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla77
Blocks: 1594985

Comment on attachment 9138914 [details]
Bug 1612364 - Clear mPendingException when returning to the top level event loop.

Beta/Release Uplift Approval Request

  • User impact if declined: This fixes an intermittent leak that shows up on TreeHerder. There should be no user impact, because it is more of a measurement issue than a real leak.
  • Is this code covered by automated tests?: Yes
  • Has the fix been verified in Nightly?: Yes
  • Needs manual test from QE?: No
  • If yes, steps to reproduce:
  • List of other uplifts needed: None
  • Risk to taking this patch: Low
  • Why is the change risky/not risky? (and alternatives if risky): It just clears out a pointer while we're shutting down.
  • String changes made/needed: none
Attachment #9138914 - Flags: approval-mozilla-beta?

Comment on attachment 9138914 [details]
Bug 1612364 - Clear mPendingException when returning to the top level event loop.

Fixes an intermittent leak seen in Treeherder. Approved for 76.0b3.

Attachment #9138914 - Flags: approval-mozilla-beta? → approval-mozilla-beta+

This issue is likely extremely old (maybe dating back to 2004). We're only just noticing it with Fission because it creates so many processes.

Blocks: 1629646
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: