Closed Bug 1561887 Opened 4 months ago Closed 3 months ago

Memory leak in iframe when attaching functions to events

Categories

(Core :: DOM: Core & HTML, defect)

67 Branch
defect
Not set

Tracking

()

VERIFIED FIXED
mozilla70
Tracking Status
firefox-esr60 --- wontfix
firefox-esr68 69+ fixed
firefox67 --- wontfix
firefox68 --- wontfix
firefox69 + verified
firefox70 + verified

People

(Reporter: grzegorz.w.1597, Assigned: bzbarsky)

References

(Regression)

Details

(Keywords: regression)

Attachments

(4 files)

Attached file index.html

User Agent: Mozilla/5.0 (X11; Fedora; Linux x86_64; rv:67.0) Gecko/20100101 Firefox/67.0

Steps to reproduce:

  1. create IFrame with button
  2. attach function from global context to buttons onclick event, with local function work ok
  3. remove iframe

Reproduce witch attached file:
Press "single" button, it will create iframe, attach action to button and remove it from body after 450 ms. "start" button will do it repeatedly in 500ms interval, pressing it again will stop it.
Unchecking "useBug" will change attaching global function to local function.

Leak start to appear after commit
https://hg.mozilla.org/mozilla-central/rev/1b528e3cac94793db32568726080b6cc1209b0e7
from bug 1523843. I bisected sources with attached script.
Commenting line

return aOptions.setExistingCompartment(data.compartment);
prevents memory leak

Workaround is using addEventListener or wrapping function call in closure.

Actual results:

Data from iframe is not freed.

Expected results:

Memory freed after removing iframe.

:bz, according to the comment 0 bisection, this looks like a regression introduced by bug 1523843, can you please take a look?

Blocks: 1523843
Component: Untriaged → DOM: Core & HTML
Flags: needinfo?(bzbarsky)
Keywords: regression
Product: Firefox → Core
Regressed by: 1523843

So just to make sure we're on the same page, here's what I see.

  1. I load the testcase from a file:// URL; the "useBug" and "useEventListener" checkboxes are both checked.
  2. I click the "start" button and wait for the iframe to flicker 20 times, then click the "stop" button.
  3. I load about:memory in a separate tab, click "Minimize memory usage", then click "Measure".
  4. I click on the "file:// content" link over on the right of about:memory.
  5. I see a bunch of top(none)/detached/window(file:///Users/bzbarsky/test.html) window objects still live.

If I uncheck "useEventListener" before step 2, that does not seem to affect the behavior; I still get the live detached windows.

If I uncheck "useBug", then the detached windows get collected when I minimize memory, no matter whether "useEventListener" is checked or not.

This doesn't quite match my understanding of the comment 0 description, so I would like to understand whether we're really seeing different things or whether I misunderstood the description.

Flags: needinfo?(grzegorz.w.1597)

Oh, and I do see that creating a new compartment for the subframe while following steps 1-5 with "useBug" checked makes the problem go away, so this definitely seems like a regression from bug 1523843.

No longer blocks: 1523843
Status: UNCONFIRMED → NEW
Ever confirmed: true

OK, if I ask for a "full" GC/CC log in about:memory, the buttons inside the iframe show up as "garbage" in the trace after the ========== bit and get collected.

If I ask for a "concise" GC/CC log, then they don't show up in the trace after ========== at all (though they do show up earlier, via the GetParent() edge from their child CharacterData and via the mFirstChild edge from the <body>. Which you'd think would be enough to collect them, since their refcount is 2 (!).

Flags: needinfo?(bzbarsky)

If an object doesn't show up after ====, that means the number of references is equal to the refcount (as you noticed) and the object is not garbage. It isn't garbage because some other object holding onto it is being held alive. You need to use the find_roots.py script in my heapgraph repo to figure out what. https://github.com/amccreight/heapgraph/

If it is being collected from a verbose GC log but not a concise that likely means some CC optimization is incorrectly being applied. Probably the easiest way to figure this out is to run the find_roots.py script on the concise log to figure out what object is holding the buttons alive, then find the equivalent object in the full CC log and see what the extra reference is (assuming that it is a refcounted object acting as the root.). If a GCed object is holding the button alive, then you can see what find_roots.py on the GC log, and the same rooting object, says, with the GC log from the concise case.

If I uncheck "useEventListener" before step 2, that does not seem to affect the behavior; I still get the live detached windows.

Ah, I see. This is due to a typo in the testcase. useEventListener starts out false, but the checkbox state is initially set to useBug, so it's checked. If I fix that state initialization, then the bug reproduces precisely when useEventListener is false and useBug is true.

I think we're definitely on the same page here. Thank you for finding this and for the bug report! Analysis coming up.

Flags: needinfo?(grzegorz.w.1597)

Thank you for finding this and for the bug report!

And for bisecting it to the right change, of course!

The issue in the provided testcase is that after bug 1523843 the callback and callback global of a CallbackObject may actually be from different (same-compartment) globals on the web. Before bug 1523843 this could only happen in system code with compartment-sharing...

In particular, this line in the testcase:

b.onclick = testFunction;

calls into the onclick setter from the subframe, so the current global in that setter is the subframe global. The function passed in, on the other hand, is from the parent global (before bug 1523843 we got a CCW from the subframe global to the parent global instead).

Now we go to cycle-collect this nsJSEventListener. Its CanSkip implementation returns true if the CallbackObject's callable is not gray. Before bug 1523843 this was false: the callable was the CCW, and the only thing holding on to the CCW was the CallbackObject itself, so it was in fact gray.

But now the callable is testFunction, which is of course black, because it's a global function in the parent window, so is definitely live. So we skip CCing the JSEventHandler, miss the edge to the CallbackObject, and now have a cycle we don't break: the CallbackObject references its mCallbackGlobal, which is the subframe's window, which references the subframe document via mDoc, which via mFirstChild/mNextSibling chains reaches the button b, which of course holds a ref to the JSEventHandler.

Before bug 1523843 this specific chain could not arise, because even if the callable CCW were not gray it would then have mCallbackGlobal as its actual global, so would be keeping it alive anyway.

But a similar leak can be produced before bug 1523843 by having the callback be non-gray (and in fact not a CCW) but making its mIncumbentJSGlobal be a different global from the callback's global and entrain things as described above.

Assignee: nobody → bzbarsky

So really, JSEventHandler::IsBlackForCC is kinda buggy. It says:

  // We can claim to be black if all the things we reference are
  // effectively black already.

but then only tests the color of mTypedHandler.Ptr()->mCallback, but references all the other things that CallbackObject traces or traverses: mCallbackGlobal, mCreationStack, mIncumbentJSGlobal, and mIncumbentGlobal.

We can pretty easily add color-testing for the other three JS members (mCallbackGlobal, mCreationStack, mIncumbentJSGlobal). I guess mIncumbentJSGlobal, if black, keeps mIncumbentGlobal alive, so it's good enough to just check mIncumbentJSGlobal's color.

What I don't know is how to write a test for this, or whether it's even sanely possible... Andrew, Olli, any ideas?

Flags: needinfo?(continuation)
Flags: needinfo?(bugs)

I don't think we have too good ways to test some particular leaks, - well if one can make the leak a shutdown leak, then it should be easier.
We should or could add some CC log analysis thingie.

Flags: needinfo?(bugs)

Yeah, I can't really make this one a shutdown leak, because once the parent page is unloaded the callback is no longer black and the cycle gets collected.

One could perhaps utilize ghost window detection for the test case. mccr8 is more familiar with that.
But that is perhaps a tad heavy way.

(In reply to Boris Zbarsky [:bzbarsky, bz on IRC] from comment #13)

What I don't know is how to write a test for this, or whether it's even sanely possible... Andrew, Olli, any ideas?

I didn't read through the details in this bug very closely, but I've written some tests along these lines before. The basic idea is that you can hack up a weak reference to any (non-old DOM bindings) object by using it as a key in a weak map, then use SpecialPowers.exactGC() (with some additional forceCC and forceGC in the callback), and then do SpecialPowers.nondeterministicGetWeakMapKeys() to see if anything is still in the weak map. An example of this is js/xpconnect/tests/mochitest/test_weakmaps.html, but if you look for SpecialPowers.nondeterministicGetWeakMapKeys() you'll find a few more examples.

Flags: needinfo?(continuation)

Ah, that is a nice trick. Had totally forgotten it.

Nice, the suggestion in comment 17 works beatifully for a test!

Pushed by bzbarsky@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/955593e562a5
part 1.  Add an overload of JS::ObjectIsMarkedGray that works on TenuredHeap.  r=sfink
https://hg.mozilla.org/integration/autoland/rev/e4a2b09bc49c
part 2.  Fix cycle-collection skipping of JSEventHandler to work correctly.  r=smaug

Comment on attachment 9075880 [details]
Bug 1561887 part 2. Fix cycle-collection skipping of JSEventHandler to work correctly. r=smaug

Beta/Release Uplift Approval Request

  • User impact if declined: Memory leaks for the lifetime of the website on some sites.
  • Is this code covered by automated tests?: Yes
  • Has the fix been verified in Nightly?: No
  • Needs manual test from QE?: No
  • If yes, steps to reproduce:
  • List of other uplifts needed: None
  • Risk to taking this patch: Medium
  • Why is the change risky/not risky? (and alternatives if risky): It's changing some cycle collector logic, which is always a bit risky. That said, I think as cycle collection patches go this is pretty safe.
  • String changes made/needed:
Attachment #9075880 - Flags: approval-mozilla-beta?
Attachment #9075878 - Flags: approval-mozilla-beta?
Status: NEW → RESOLVED
Closed: 3 months ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla70

Comment on attachment 9075878 [details]
Bug 1561887 part 1. Add an overload of JS::ObjectIsMarkedGray that works on TenuredHeap. r=sfink

Cycle collector fix to avoid hanging on to memory for too long on some websites. Thanks for writing a patch too. Approved for 69.0b4.

Attachment #9075878 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
Attachment #9075880 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
Flags: in-testsuite+

Fwiw, esr60 is affected too, for the more convoluted testcase. That said, I agree there's not much point in uplifting to 60.

It's worth uplifting to 68.

Sure, let's revisit 68 after this has had some bake time on Beta.

Comment on attachment 9075880 [details]
Bug 1561887 part 2. Fix cycle-collection skipping of JSEventHandler to work correctly. r=smaug

ESR Uplift Approval Request

  • If this is not a sec:{high,crit} bug, please state case for ESR consideration: Fixing a site-lifetime memory leak that is somewhat easy to trigger.
  • User impact if declined: Unbounded memory usage growth on some websites.
  • Fix Landed on Version: 70
  • Risk to taking this patch: Medium
  • Why is the change risky/not risky? (and alternatives if risky): It's changing some cycle collector logic, which is always a bit risky. That said, I think as cycle collection patches go this is pretty safe.
  • String or UUID changes made by this patch: None
Attachment #9075880 - Flags: approval-mozilla-esr68?
Attachment #9075878 - Flags: approval-mozilla-esr68?
QA Whiteboard: [qa-triaged]

I was trying to reproduce the bug based on the steps from comment 2, but I really can't understand step 4 (I click on the "file:// content" link over on the right of about:memory.)

Could you please clarify it to me?
Thanks.

QA Whiteboard: [qa-triaged]
Flags: needinfo?(bzbarsky)

The idea is to look for the memory report for the process that hosts the testcase in about:memory. In about:memory, once you have done a measurement, over on the right there is a box titled "Process index" that lists the processes with links you can click to go to the report for the relevant process. If you load the testcase from file://, that process will be labeled "file:// Content (pid NNNNN)" for some number NNNNN.

Flags: needinfo?(bzbarsky)
QA Whiteboard: [qa-triaged]

Verified as fixed on the latest Firefox Nightly 70.0a1 (2019-07-15) and on Firefox 69.0b4 on Windows 10 x64, Mac OS X 10.14 and on Ubuntu 18.04 x64.

Status: RESOLVED → VERIFIED
QA Whiteboard: [qa-triaged]
Flags: qe-verify+

Comment on attachment 9075878 [details]
Bug 1561887 part 1. Add an overload of JS::ObjectIsMarkedGray that works on TenuredHeap. r=sfink

This has baked on Nightly and Beta for a couple weeks without any known issues and has been verified by QA. Approved for 68.1esr.

Attachment #9075878 - Flags: approval-mozilla-esr68? → approval-mozilla-esr68+
Attachment #9075880 - Flags: approval-mozilla-esr68? → approval-mozilla-esr68+
You need to log in before you can comment on or make changes to this bug.