Closed Bug 1612364 Opened 6 years ago Closed 5 years ago

Intermittent JSM shared global leaks when Fission is enabled

Tracking

()

Status:

RESOLVED FIXED

Milestone:

mozilla77

Project Flags:

Fission Milestone

Tracking Flags:

Tracking

Status

firefox-esr68

---

wontfix

firefox74

---

disabled

firefox75

---

disabled

firefox76

---

fixed

firefox77

---

fixed

People

(Reporter: mccr8, Assigned: mccr8)

References

(Blocks 1 open bug)

Details

(Keywords: memory-leak, Whiteboard: [MemShrink:P2])

Attachments

(3 files)

Remove most of the fission process reusing prefs. DO NOT LAND 6 years ago Andrew McCreight [:mccr8] 4.16 KB, patch		Details \| Diff \| Splinter Review
Hacky investigation of dropping XPConnect references to the shared loader global earlier. DO NOT LAND 6 years ago Andrew McCreight [:mccr8] 6.31 KB, patch		Details \| Diff \| Splinter Review
Bug 1612364 - Clear mPendingException when returning to the top level event loop. 5 years ago Andrew McCreight [:mccr8] 47 bytes, text/x-phabricator-request	RyanVM : approval-mozilla-beta+	Details \| Review

Andrew McCreight [:mccr8]

Assignee

Description

•

6 years ago

As seen in bug 1576498, in some Fission tests we frequently see intermittent leaks of a few XPConnect related objects. I've papered these over by setting a pref to reuse Fission processes (for same-origin). I suspect this is due to some kind of race, maybe where shutting down a content process is racing with starting it up, in some kind of way, because the tests that hit this leak most often seem to create a lot of child processes.

Investigation is difficult because I have not been able to reproduce it locally. I have managed to figure out that it appears to be the JSM shared global that is leaking.

Andrew McCreight [:mccr8]

Assignee

Comment 1

•

6 years ago

Attached patch Remove most of the fission process reusing prefs. DO NOT LAND — Details — Splinter Review

This patch removes most of the leak-related places I set the Fission process reuse prefs. With this patch, the failure rate is high enough to make try-based investigation feasible.

Andrew McCreight [:mccr8]

Assignee

Comment 2

•

6 years ago

I haven't spent a ton of time investigating this, because this is not a "real leak", as we expect the JSM global to always be held alive until shutdown. This is mostly a matter of avoiding TreeHerder failures, which are a hassle for people working on developing Firefox.

Andrew McCreight [:mccr8]

Assignee

Comment 3

•

6 years ago

Attached patch Hacky investigation of dropping XPConnect references to the shared loader global earlier. DO NOT LAND — Details — Splinter Review

Andrew McCreight [:mccr8]

Assignee

Comment 4

•

6 years ago

These XPConnect objects that are leaking could potentially form a cycle, and thus require the cycle collector to destroy them. XPCJSRuntime::mLoaderGlobal is a reference to the loader global that is not cleared until nsXPConnect is being destroyed, which is after the cycle collector shuts down.

Maybe these objects are normally acyclic, but under whatever unknown race conditions they are cyclic. Thus maybe clearing mLoaderGlobal earlier would fix the leak, was one theory I had.

mozJSComponentLoader::mLoaderGlobal is another strong reference to the loader global, but this one is cleared before we do our final CC.

However, another weird thing is that XPCJSRuntime::mLoaderGlobal and mozJSComponentLoader::mLoaderGlobal are lazily created, and as far as I can see there is no guarantee that once we clear either of those references, they can't be created again if something asks for them. Another theory I had was that maybe something is calling the accessor methods for these fields late in shutdown, thus creating a new cyclic object that isn't freed.

To investigate that, I wrote the attached hacky patch. This adds a new state variable, gCCState, that records the current state of the main thread's cycle collector. In state 0, the CC hasn't been initialized. In state 1, the CC has been initialized (the normal state). In state 2, we've begun the shutdown CC, and in state 3, we've shut down the CC.

If we're in state 2 or later, and somebody calls XPCJSRuntime::LoaderGlobal() or mozJSComponentLoader::GetSharedGlobal, then we crash. Also, when we begin state 2, it clears XPCJSRuntime::mLoaderGlobal.

Anyways, with my patch, we don't hit either of those crashes (which is good, and maybe we could try enforcing that more in debug builds), but we still leak. I should figure out if I can get a CC log on try without disrupting whatever condition is causing the leak.

Andrew McCreight [:mccr8]

Assignee

Comment 5

•

6 years ago

One possible startup/shutdown race related to the JSM loader is the ScriptPreloader, but I don't know how this could cause a leak. I see that it already watches for XPCOM shutdown and does some cleanup.

Chris Peterson [:cpeterson]

Comment 6

•

5 years ago

Tracking for Fission Nightly (M6)

Fission Milestone: --- → M6

Andrew McCreight [:mccr8]

Assignee

Comment 7

•

5 years ago

I wrote a patch to simplify XPConnect shutdown a tiny bit, so that XPConnect shouldn't be holding onto the JSM global when we start doing the shutdown CCs. Then I added in some hacky patches to record shutdown CCs on treeherder (which ended up recording them for way too many processes, but that's another story...) and managed to get a log where the global is leaking. Here's what is keeping alive the XPCWN for the BackstagePass:

0x7f7c8e21bf70 [JSStackFrame]
--[mStack]--> 0x32b570b78900 [JS Object (SavedFrame)]
--[group_global]--> 0x11eeef679100 [JS Object (BackstagePass)]
--[js::GetObjectPrivate(obj)]--> 0x7f7c9639d940 [XPCWrappedNative (BackstagePass)]
Root 0x7f7c8e21bf70 is a ref counted object with 1 unknown edge(s).

Andrew McCreight [:mccr8]

Assignee

Comment 8

•

5 years ago

I could only find a few heap references to nsIStackFrame. ConsoleCallData looks pretty dodgy. It isn't cycle collected because it can be used either to hold worker stuff or main thread stuff.

Andrew McCreight [:mccr8]

Assignee

Comment 9

•

5 years ago

I thought this was about Console, because we clear the Consoles for windows when the windows hit their destroy event, but Consoles are also available to JSMs, and we never clear that Console, which seemed mighty suspicious. However, clearing all Consoles at xpcom-shutdown didn't fix the leak.

However, I did find another heap reference to nsIStackFrame, CycleCollectedJSContext::mPendingException (well, the dom::Exception holds an mLocation). Clearing that before we do the shutdown CC seems to fix the leak. I'm not sure if that's the best fix, but at least I seem to have figured out where the leak is coming from.

Andrew McCreight [:mccr8]

Assignee

Comment 10

•

5 years ago

Attached file Bug 1612364 - Clear mPendingException when returning to the top level event loop. — Details

Andrew McCreight [:mccr8]

Assignee

Comment 11

•

5 years ago

Try looks green: https://treeherder.mozilla.org/#/jobs?repo=try&revision=e08188fbab98eaece26099199afe6f74c6fc4afe

I removed all of the uses of the shared content process pref. I'll land that separately if the intermitents on TreeHerder clear up.

Andrew McCreight [:mccr8]

Assignee

Updated

•

5 years ago

Blocks: 1628052

Andrew McCreight [:mccr8]

Assignee

Comment 12

•

5 years ago

This could be causing transient leaks in other places besides shutdown.

Keywords: memory-leak

Whiteboard: [MemShrink:P2]

Pulsebot

Comment 13

•

5 years ago

Pushed by amccreight@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/38deef4ef9c5 Clear mPendingException when returning to the top level event loop. r=nika

Andrew McCreight [:mccr8]

Assignee

Comment 14

•

5 years ago

I forgot to update the patch summary when I updated the patch. The "top level" part shouldn't be there.

Andreea Pavel [:apavel]

Comment 15

•

5 years ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/38deef4ef9c5

Status: NEW → RESOLVED

Closed: 5 years ago

status-firefox77: --- → fixed

Resolution: --- → FIXED

Target Milestone: --- → mozilla77

Andrew McCreight [:mccr8]

Assignee

Updated

•

5 years ago

Blocks: 1594985

Andrew McCreight [:mccr8]

Assignee

Comment 16

•

5 years ago

Comment on attachment 9138914 [details]
Bug 1612364 - Clear mPendingException when returning to the top level event loop.

Beta/Release Uplift Approval Request

User impact if declined: This fixes an intermittent leak that shows up on TreeHerder. There should be no user impact, because it is more of a measurement issue than a real leak.
Is this code covered by automated tests?: Yes
Has the fix been verified in Nightly?: Yes
Needs manual test from QE?: No
If yes, steps to reproduce:
List of other uplifts needed: None
Risk to taking this patch: Low
Why is the change risky/not risky? (and alternatives if risky): It just clears out a pointer while we're shutting down.
String changes made/needed: none

Attachment #9138914 - Flags: approval-mozilla-beta?

Ryan VanderMeulen [:RyanVM]

Comment 17

•

5 years ago

Comment on attachment 9138914 [details]
Bug 1612364 - Clear mPendingException when returning to the top level event loop.

Fixes an intermittent leak seen in Treeherder. Approved for 76.0b3.

Attachment #9138914 - Flags: approval-mozilla-beta? → approval-mozilla-beta+

Ryan VanderMeulen [:RyanVM]

Comment 18

•

5 years ago

bugherder uplift

https://hg.mozilla.org/releases/mozilla-beta/rev/db877849d8ba

status-firefox76: --- → fixed

Ryan VanderMeulen [:RyanVM]

Updated

•

5 years ago

status-firefox74: --- → disabled

status-firefox75: --- → disabled

status-firefox-esr68: --- → unaffected

Andrew McCreight [:mccr8]

Assignee

Comment 19

•

5 years ago

This issue is likely extremely old (maybe dating back to 2004). We're only just noticing it with Fission because it creates so many processes.

status-firefox-esr68: unaffected → wontfix

Andrew McCreight [:mccr8]

Assignee

Updated

•

5 years ago

Blocks: 1629646

You need to log in before you can comment on or make changes to this bug.