Closed Bug 1447871 Opened 2 years ago Closed 2 years ago

service worker binding objects with event listeners can leak ghost windows

Categories

(Core :: DOM: Service Workers, defect, P1)

60 Branch
defect

Tracking

()

VERIFIED FIXED
mozilla61
Tracking Status
firefox-esr52 --- unaffected
firefox59 --- unaffected
firefox60 + verified
firefox61 --- verified

People

(Reporter: get_logan, Assigned: bkelly)

References

(Blocks 2 open bugs)

Details

(Keywords: perf, regression, Whiteboard: [MemShrink])

Attachments

(3 files, 16 obsolete files)

1.53 KB, patch
asuth
: review+
Details | Diff | Splinter Review
1.06 KB, patch
bkelly
: review+
Details | Diff | Splinter Review
5.08 KB, patch
bkelly
: review+
Details | Diff | Splinter Review
User Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:60.0) Gecko/20100101 Firefox/60.0
Build ID: 20180319175655

Steps to reproduce:

Load various youtube videos in the same tab. The more videos the worse the ram usage is and more severe the delays. May need to load around 5-15 videos before the issue is clear.

I recorded a video showing how to create the problem from scratch using new profile:
https://youtu.be/a_mzc5DDUz0

I also recorded a performance profile in that same video, here it is:
https://perfht.ml/2GNizma


Actual results:

Eventually you will notice any webpage within that tab will have intermittent moments of unresponsiveness. Actions such as hovering over the video seek bar, highlighting text, typing into text fields or trying to adjust the volume using the volume slider can be delayed for a few seconds. Scrolling the page seems smooth and unaffected. Firefox UI also does not seem affected so typing and highlighting text in the address bar is smooth. Also new tabs and windows are not affected and closing the problem tab seems to fix/reset the problem.


Expected results:

There should not have been any performance impact.
Also want to add that I cannot reproduce the issue on stable 59.0.1. But it does affect beta 60.0b5 and yesterdays nightly. So issue may have been introduced in beta 60.
Keywords: regression
Comment 1 seems to indicate 59 isn't affected.
Flags: needinfo?(alice0775)
(In reply to Andrew Overholt [:overholt] from comment #2)
> Comment 1 seems to indicate 59 isn't affected.

Also experiencing this issue in 60, it is not present in 59, have not tried nightly.
Flags: needinfo?(alice0775)
Btw, not sure if I've ever reproduced this issue by clicking on random youtube video links. I think the problem may only occur when clicking on bookmarked videos for some reason. Really not sure about this, but just in case anyone cannot reproduce, try creating bookmarks first and loading from them.
Hi get_logan,

I verified this issue on Windows 10 64bit with  FF Nightly 61.0a1 (2018-03-27), FF Beta 60.0b7 (2018-03-26) and I cannot reproduce it. I also tested all versions between FF 59 (2017-11-13) and FF 60 (2018-01-22) with Mozregression-gui. I was able to reproduce this issue once in FF 59.0a1 (2017-12-19) but I can't get a regression window. 
get_logan you're willing, we have a tool called mozregression that will automate the process of downloading various builds from during the Firefox 60 development cycle to narrow down the change that's causing problems. Information on the tool is available at http://mozilla.github.io/mozregression/.
I believe this could be a duplicate of https://bugzilla.mozilla.org/show_bug.cgi?id=1439222 as I am no longer experiencing issues since 60b7.
(In reply to Paul Winstanley from comment #6)
> I believe this could be a duplicate of
> https://bugzilla.mozilla.org/show_bug.cgi?id=1439222 as I am no longer
> experiencing issues since 60b7.

I just installed beta 60.0b7 and using a new profile I reproduced the issue easily. It is still a problem.
Still affects latest nightly 61.0a1 (2018-03-27).
(In reply to Daniel_A from comment #5)
> Hi get_logan,
> 
> I verified this issue on Windows 10 64bit with  FF Nightly 61.0a1
> (2018-03-27), FF Beta 60.0b7 (2018-03-26) and I cannot reproduce it. I also
> tested all versions between FF 59 (2017-11-13) and FF 60 (2018-01-22) with
> Mozregression-gui. I was able to reproduce this issue once in FF 59.0a1
> (2017-12-19) but I can't get a regression window. 
> get_logan you're willing, we have a tool called mozregression that will
> automate the process of downloading various builds from during the Firefox
> 60 development cycle to narrow down the change that's causing problems.
> Information on the tool is available at
> http://mozilla.github.io/mozregression/.

Hi Daniel_A,

I ran mozregression. Here is the final output:

2018-03-28T10:22:51: INFO : Narrowed inbound regression window from [5b2d608c, 42f6cbb0] (3 builds) to [1d01aa7f, 42f6cbb0] (2 builds) (~1 steps left)
2018-03-28T10:22:51: DEBUG : Starting merge handling...
2018-03-28T10:22:51: DEBUG : Using url: https://hg.mozilla.org/integration/mozilla-inbound/json-pushes?changeset=42f6cbb08529f5edbb1a7069b6e3ebb610e92d3a&full=1
2018-03-28T10:22:52: DEBUG : Found commit message:
Bug 1438211 P9 Remove nsPIDOMWindowInner::GetServiceWorkerRegistration() and InvalidateServiceWorkerRegistration(). r=asuth

2018-03-28T10:22:52: INFO : The bisection is done.
2018-03-28T10:22:52: INFO : Stopped
Blocks: 1438211
Has Regression Range: --- → yes
Has STR: --- → yes
Flags: needinfo?(bkelly)
Probably not the issue here, but it seems bug 1438211 neglected to have workers call nsIGlobalObject::DisconnectEventTargetObjects().
I think the right component of this issue is the same as for bug 1438211, but if you think other ways please move it. Thanks
Component: Untriaged → DOM: Service Workers
Product: Firefox → Core
Still trying to reproduce, but I'll take this for now.
Assignee: nobody → bkelly
Status: UNCONFIRMED → ASSIGNED
Ever confirmed: true
Flags: needinfo?(bkelly)
Priority: -- → P1
I've reproduced.  It does not seem to require bookmarks AFAICT.  I'm investigating now.
The issue is that the windows are leaking, btw.  This bug probably explains some ghost windows I've seen recently on nightly as well.
Blocks: GhostWindows
Summary: Firefox becomes janky with intermittent unresponsiveness when watching youtube videos → service worker enabled sites can leak ghost windows
Whiteboard: [MemShrink]
So it appears the issue is that the DETH DisconnectFromOwner() method is only called when nsGlobalWindowInner::CleanUp() is called.  This only happens explicitly for DetachFromDocshell().  It doesn't get called if removing from bfcache, etc.

Some comments suggest the inner CleanUp() is not really needed.  I have a patch that moves that code into FreeInnerObjects() and it seems to fix the memory leak.
Still some orange, but only on opt now.
Duplicate of this bug: 1449656
Kyle, this is a simple nullptr check.  If the window is closed the DETH's owner will be nulled out.
Attachment #8963368 - Flags: review?(kyle)
Andrew, this makes IDBDatabase implement DisconnectFromOwner().  I'm not sure how it was invalidating before, so not sure if there is something stale now that can be removed.  If you approve this I can file a follow-up to do cleanup.  I am hoping to uplift patches in this bug, though.
Attachment #8963369 - Flags: review?(bugmail)
There is definitely more orange to fix here, but lets see how close we are:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=b00c592849e2ef09f08fc869a650053e69fcdbf3
Attachment #8963368 - Flags: review?(kyle) → review+
Comment on attachment 8963369 [details] [diff] [review]
P3 Make IDBDatabase invalidate any transactions on DisconnectFromOwner(). r=asuth

Review of attachment 8963369 [details] [diff] [review]:
-----------------------------------------------------------------

Looks good.  Thank you!

For my notes:
- InvalidateInternal() is idempotent.
  - InvalidateMutableFiles() clears out the array on first call and is fine being called multiple times.
  - AbortTransactions() is transitively idempotent.  All transactions will move to the DONE state after the first abort.  They don't unregister with the database until their destructors, but the database's mTransaction processing then goes to no-op states.
  - CloseInternal bails if already closed, plus it handles removing the observer topic that might also want to call InvalidateInternal.
Attachment #8963369 - Flags: review?(bugmail) → review+
[Tracking Requested - why for this release]:
Bad memory leak on popular sites.  This should probably block shipping 60.
Looks like the main orange is due to:

1. XHR not checking for valid owner window in a particular dispatch path.
2. test_bug372964-2.html which seems to assume an XHR should continue to work when its owning window has been removed via document.open().  This seems like a bogus test to me, but maybe it falls in the "weird interop compat legacy vortex".
Andrea, do you have any opinions on the test_bug372964-2.html test mentioned in comment 26?

Also, do you think I could land my fix for checking window validity here before the major xhr refactor that is happening right now?  I need to uplift this bug to beta.  I can have the patch to you for review tomorrow.
Flags: needinfo?(amarchesini)
I don't see why that test is valid. Maybe smaug has an opinion?
Flags: needinfo?(amarchesini) → needinfo?(bugs)
(In reply to Ben Kelly [Part-time, not reviewing][:bkelly] from comment #13)
> I've reproduced.  It does not seem to require bookmarks AFAICT.  I'm
> investigating now.

Hi Ben,

You are right, triggering this issue does not require bookmarks. I tried copying and pasting Youtube video links into the address bar and was still able to reproduce the issue easily.

However, I cannot seem to reproduce this issue if I rely entirely on clicking random Youtube video links within the website. So there must be a difference in how a window is created if clicking on links vs bookmarks/address bar? Maybe it isn't important, but just thought I should bring it up just in case as I would not have expected a difference in behaviour.
If test_bug372964-2.html starts to fail, I'd say that reveals a bug in DETH. We should probably not clear the event listeners, but just rebind DETH to a new inner window.
Flags: needinfo?(bugs)
Andrea, this patch makes us avoid dispatching events when there is not a valid inner window for the XHR.  This fixes some assertions that were triggered with P1 calling DisconnectFromOwner() earlier than we used to.

https://treeherder.mozilla.org/#/jobs?repo=try&revision=390738e2d6fd622470a6d61d46c15b682504c408
Attachment #8963572 - Flags: review?(amarchesini)
Comment on attachment 8963572 [details] [diff] [review]
P4 Make XMLHttpRequestMainThread check for a valid inner window before dispatching events. r=baku

Review of attachment 8963572 [details] [diff] [review]:
-----------------------------------------------------------------

It looks good to me. But I want a double check.
Attachment #8963572 - Flags: review?(bugs)
Attachment #8963572 - Flags: review?(amarchesini)
Attachment #8963572 - Flags: review+
Comment on attachment 8963572 [details] [diff] [review]
P4 Make XMLHttpRequestMainThread check for a valid inner window before dispatching events. r=baku

I guess I could live with this, given that DispatchProgressEvent has similar stuff.

But I assume this won't be need if we do the rebind stuff.
Attachment #8963572 - Flags: review?(bugs) → review+
(In reply to Olli Pettay [:smaug] (only webcomponents and event handling reviews, please) from comment #34)
> Comment on attachment 8963572 [details] [diff] [review]
> P4 Make XMLHttpRequestMainThread check for a valid inner window before
> dispatching events. r=baku
> 
> I guess I could live with this, given that DispatchProgressEvent has similar
> stuff.
> 
> But I assume this won't be need if we do the rebind stuff.

It is needed for another case.  We end up triggering it with the P1 patch when we call FreeInnerObjects() via the stack below.  We are more aggressively disconnecting the old window, so the xhr abort is happening a bit earlier than before.

NI in case this changes your mind about this patch.

[task 2018-03-28T22:01:04.929Z] 22:01:04     INFO - GECKO(1907) | #01: mozilla::EventDispatcher::Dispatch [dom/events/EventDispatcher.cpp:757]
[task 2018-03-28T22:01:04.932Z] 22:01:04     INFO - 
[task 2018-03-28T22:01:04.933Z] 22:01:04     INFO - GECKO(1907) | #02: mozilla::EventDispatcher::DispatchDOMEvent [dom/events/EventDispatcher.cpp:1000]
[task 2018-03-28T22:01:04.935Z] 22:01:04     INFO - 
[task 2018-03-28T22:01:04.937Z] 22:01:04     INFO - GECKO(1907) | #03: mozilla::DOMEventTargetHelper::DispatchEvent [dom/events/DOMEventTargetHelper.cpp:271]
[task 2018-03-28T22:01:04.938Z] 22:01:04     INFO - 
[task 2018-03-28T22:01:04.941Z] 22:01:04     INFO - GECKO(1907) | #04: mozilla::dom::XMLHttpRequestMainThread::DispatchOrStoreEvent [dom/xhr/XMLHttpRequestMainThread.cpp:1336]
[task 2018-03-28T22:01:04.942Z] 22:01:04     INFO - 
[task 2018-03-28T22:01:04.944Z] 22:01:04     INFO - GECKO(1907) | #05: mozilla::dom::XMLHttpRequestMainThread::FireReadystatechangeEvent [dom/xhr/XMLHttpRequestMainThread.cpp:1252]
[task 2018-03-28T22:01:04.945Z] 22:01:04     INFO - 
[task 2018-03-28T22:01:04.948Z] 22:01:04     INFO - GECKO(1907) | #06: mozilla::dom::XMLHttpRequestMainThread::RequestErrorSteps [mfbt/RefPtr.h:311]
[task 2018-03-28T22:01:04.949Z] 22:01:04     INFO - 
[task 2018-03-28T22:01:04.950Z] 22:01:04     INFO - GECKO(1907) | #07: mozilla::dom::XMLHttpRequestMainThread::AbortInternal [dom/xhr/XMLHttpRequestMainThread.cpp:1033]
[task 2018-03-28T22:01:04.952Z] 22:01:04     INFO - 
[task 2018-03-28T22:01:04.954Z] 22:01:04     INFO - GECKO(1907) | #08: mozilla::dom::XMLHttpRequestMainThread::Abort [dom/xhr/XMLHttpRequestMainThread.h:349]
[task 2018-03-28T22:01:04.956Z] 22:01:04     INFO - 
[task 2018-03-28T22:01:04.958Z] 22:01:04     INFO - GECKO(1907) | #09: std::_Function_handler<void(mozilla::DOMEventTargetHelper*, bool*), nsIGlobalObject::DisconnectEventTargetObjects()::<lambda(mozilla::DOMEventTargetHelper*, bool*)> >::_M_invoke [xpcom/ds/nsTHashtable.h:135]
[task 2018-03-28T22:01:04.959Z] 22:01:04     INFO - 
[task 2018-03-28T22:01:04.960Z] 22:01:04     INFO - GECKO(1907) | #10: nsIGlobalObject::ForEachEventTargetObject [dom/base/nsIGlobalObject.cpp:161]
[task 2018-03-28T22:01:04.961Z] 22:01:04     INFO - 
[task 2018-03-28T22:01:04.962Z] 22:01:04     INFO - GECKO(1907) | #11: nsIGlobalObject::DisconnectEventTargetObjects [gcc/include/c++/6.4.0/functional:1834]
[task 2018-03-28T22:01:04.964Z] 22:01:04     INFO - 
[task 2018-03-28T22:01:04.966Z] 22:01:04     INFO - GECKO(1907) | #12: nsGlobalWindowInner::FreeInnerObjects [dom/base/nsGlobalWindowInner.cpp:1278]
[task 2018-03-28T22:01:04.967Z] 22:01:04     INFO - 
[task 2018-03-28T22:01:04.968Z] 22:01:04     INFO - GECKO(1907) | #13: nsGlobalWindowOuter::SetNewDocument [dom/base/nsGlobalWindowOuter.cpp:1858]
[task 2018-03-28T22:01:04.972Z] 22:01:04     INFO - 
[task 2018-03-28T22:01:04.973Z] 22:01:04     INFO - GECKO(1907) | #14: nsDocumentViewer::InitInternal [layout/base/nsDocumentViewer.cpp:933]
[task 2018-03-28T22:01:04.974Z] 22:01:04     INFO - 
[task 2018-03-28T22:01:04.975Z] 22:01:04     INFO - GECKO(1907) | #15: nsDocumentViewer::Init [layout/base/nsDocumentViewer.cpp:667]
[task 2018-03-28T22:01:04.976Z] 22:01:04     INFO - 
[task 2018-03-28T22:01:04.977Z] 22:01:04     INFO - GECKO(1907) | #16: nsDocShell::SetupNewViewer [docshell/base/nsDocShell.cpp:9111]
[task 2018-03-28T22:01:04.978Z] 22:01:04     INFO - 
[task 2018-03-28T22:01:04.979Z] 22:01:04     INFO - GECKO(1907) | #17: nsDocShell::Embed [docshell/base/nsDocShell.cpp:6919]
[task 2018-03-28T22:01:04.980Z] 22:01:04     INFO - 
[task 2018-03-28T22:01:04.981Z] 22:01:04     INFO - GECKO(1907) | #18: nsDocShell::CreateContentViewer [docshell/base/nsDocShell.cpp:8911]
[task 2018-03-28T22:01:04.982Z] 22:01:04     INFO - 
[task 2018-03-28T22:01:04.983Z] 22:01:04     INFO - GECKO(1907) | #19: nsDSURIContentListener::DoContent [docshell/base/nsDSURIContentListener.cpp:197]
[task 2018-03-28T22:01:04.984Z] 22:01:04     INFO - 
[task 2018-03-28T22:01:04.985Z] 22:01:04     INFO - GECKO(1907) | #20: nsDocumentOpenInfo::TryContentListener [uriloader/base/nsURILoader.cpp:769]
[task 2018-03-28T22:01:04.986Z] 22:01:04     INFO - 
[task 2018-03-28T22:01:04.987Z] 22:01:04     INFO - GECKO(1907) | #21: nsDocumentOpenInfo::DispatchContent [uriloader/base/nsURILoader.cpp:435]
[task 2018-03-28T22:01:04.988Z] 22:01:04     INFO - 
[task 2018-03-28T22:01:04.990Z] 22:01:04     INFO - GECKO(1907) | #22: nsDocumentOpenInfo::OnStartRequest [uriloader/base/nsURILoader.cpp:315]
[task 2018-03-28T22:01:04.991Z] 22:01:04     INFO - 
[task 2018-03-28T22:01:04.992Z] 22:01:04     INFO - GECKO(1907) | #23: mozilla::net::HttpChannelChild::DoOnStartRequest [netwerk/protocol/http/HttpChannelChild.cpp:760]
Flags: needinfo?(bugs)
The assertion its hitting in comment 35 is:

###!!! ASSERTION: This is unsafe! Fix the caller!: 'Error', file /builds/worker/workspace/build/src/dom/events/EventDispatcher.cpp, line 757
I see.
Flags: needinfo?(bugs)
Try has been looking pretty green.  Due to the nature of these changes, though, I want to do a full try build on all platforms:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=90b5e1c9dfc6f361ce719dfe467c474160e851fe
Note, I plan to come up with an alternate, simpler patch for beta.
Comment on attachment 8963367 [details] [diff] [review]
P1 Remove nsGlobalWindowInner::CleanUp() method in favor of FreeInnerObjects(). r=smaug

Olli, this patch removes nsGlobalWindowInner::CleanUp() and moves its code into FreeInnerObjects().  It also replaces the various getters related to mCleanedUp with InnerObjectsFreed().

There are comments suggesting that CleanUp was originally intended for outer windows and was not intended for inner windows:

https://searchfox.org/mozilla-central/rev/7e663b9fa578d425684ce2560e5fa2464f504b34/dom/base/nsGlobalWindowInner.cpp#1127-1130

In addition, we called CleanUp() and FreeInnerObjects() inconsistently.

CleanUp() is called from:
 - ~nsGlobalWindowInner()
 - nsGlobalWindowOuter::CleanUp() which in turn gets called from DetachFromDocshell() and ReallyCloseWindow().

FreeInnerObjects() is called from:
 - ~WindowStateHolder()
 - nsGlobalWindowOuter::SetNewDocument()
 - nsGlobalWindowOuter::DetachFromDocshell()

So the only case where we were consistently executing all cleanup code was in DetachFromDocshell().  Otherwise we mostly only ran FreeInnerObjects() code and then waited for cycle collection to delete the inner window so we could run CleanUp() from ~nsGlobalWindowInner().

I'm changing this for a few reasons:

1. I think its confusing to have two different cleanup methods.  Its clear developers don't understand it because there is identical code in both methods.
2. Waiting for cycle collector to run CleanUp() for windows coming out of bfcache probably is putting more load on CC than is necessary.  It would be nice if we could delete these windows faster without relying on CC to unlink things that CleanUp() is explicitly clearing.  Moving the explicit clearing code to FreeInnerObjects() might help reduce CC load.
3. In bug 1438211 I was under the impression that DETH objects got DisconnectFromOwner() called in FreeInnerObjects(), but it actually happens in CleanUp().  Our tests did not catch the mistake because of how we inconsistently call CleanUp()-vs-FreeInnerObjects() for different paths.  Most of our tests end up in DetachFromDocshell instead of free'ing from bfcache or document.open().

Now, I could solve (3) with a somewhat ugly patch specific to service workers.  I will do that beta.  I think (1) and (2), however, are compelling enough to try to move the CleanUp() code into FreeInnerObjects().  It should make window cleanup much more consistent.

Note, I did a simply move of the CleanUp() code to the end of FreeInnerObjects() here.  P6 will remove the code that was duplicated between the methods.
Attachment #8963367 - Flags: review?(bugs)
Olli, this patch does the DETH rebind to the new inner window for document.open().  This allows test_bug372964-2.html to keep passing.  I filed bug 1449992 to consider if we want to change document.open() behavior here.
Attachment #8963582 - Attachment is obsolete: true
Attachment #8963637 - Flags: review?(bugs)
Comment on attachment 8963630 [details] [diff] [review]
P6 Remove duplicate cleanup code from FreeInnerObjects(). r=smaug

Olli, this removes code that was duplicated between CleanUp() and FreeInnerObjects().
Attachment #8963630 - Flags: review?(bugs)
Olli, this is the minimal patch that only calls DisconnectFromOwner() on service worker related binding objects in FreeInnerObjects().

https://treeherder.mozilla.org/#/jobs?repo=try&revision=6f4bc97947a239027594f2152fa7a24f41ed495a
Attachment #8963706 - Flags: review?(bugs)
Attachment #8963706 - Attachment description: Call DisconnectFromOwner() on service worker binding objects in FreeInnerObjects(). r=smaug → BETA ONLY - Call DisconnectFromOwner() on service worker binding objects in FreeInnerObjects(). r=smaug
I'll write some bfcache and document.open() tests that would have caught this as well.
FWIW, this regression shows up in telemetry: https://mzl.la/2E4uVDt I'll try to think about what we can do to change ghost window telemetry so that big regressions like this will trigger alerts so somebody can look into it.
Comment on attachment 8963367 [details] [diff] [review]
P1 Remove nsGlobalWindowInner::CleanUp() method in favor of FreeInnerObjects(). r=smaug

I've re-thought this approach.  It might be worth unifying the cleanup paths in window, but I don't think its the root problem here.
Attachment #8963367 - Flags: review?(bugs)
Attachment #8963630 - Flags: review?(bugs)
Attachment #8963637 - Flags: review?(bugs)
Attachment #8963706 - Flags: review?(bugs)
This test trigger the leak condition.  I think it suggests I need to fix cycle collection traversal for the service worker binding objects.
Attached patch wip-cc.patch (obsolete) — Splinter Review
This allows the test to pass locally.  I think its a better fix.  I need to clean it up some.

https://treeherder.mozilla.org/#/jobs?repo=try&revision=410fb98fd1e3c8e7609a1f7dbfe6369f42d190eb
Comment on attachment 8963833 [details] [diff] [review]
wip-cc.patch

Olli, I need to clean this up a bit, but what do you think of this approach.  It lets us traverse the service worker objects through the DETH list.  We don't have a separate ref on the global to traverse these objects otherwise.

I made the DETH traversal optional because most of the older DETH objects are traversed in other ways right now.

I need to split the patch a bit and hook this into the WorkerGlobalScope CC as well before I flag for review.

Thanks.
Attachment #8963833 - Flags: feedback?(bugs)
Comment on attachment 8963833 [details] [diff] [review]
wip-cc.patch

Sadly this patch is triggering a crash in CC:

[task 2018-03-30T02:18:11.092Z] 02:18:11     INFO - Thread 0 (crashed)
[task 2018-03-30T02:18:11.093Z] 02:18:11     INFO -  0  libxul.so!PtrInfo::AnnotatedReleaseAssert [nsCycleCollector.cpp:410fb98fd1e3c8e7609a1f7dbfe6369f42d190eb : 675 + 0x18]
[task 2018-03-30T02:18:11.093Z] 02:18:11     INFO -     rax = 0x0000000000000000   rdx = 0x0000000000000000
[task 2018-03-30T02:18:11.094Z] 02:18:11     INFO -     rcx = 0x00007f75e9e292dd   rbx = 0x00007ffd099b9238
[task 2018-03-30T02:18:11.095Z] 02:18:11     INFO -     rsi = 0x00007f75ea0f8770   rdi = 0x00007f75ea0f7540
[task 2018-03-30T02:18:11.096Z] 02:18:11     INFO -     rbp = 0x00007ffd099b9270   rsp = 0x00007ffd099b9210
[task 2018-03-30T02:18:11.097Z] 02:18:11     INFO -      r8 = 0x00007f75ea0f8770    r9 = 0x00007f75eb3d9740
[task 2018-03-30T02:18:11.098Z] 02:18:11     INFO -     r10 = 0x000000000000004c   r11 = 0x0000000000000000
[task 2018-03-30T02:18:11.099Z] 02:18:11     INFO -     r12 = 0x00007ffd099b9288   r13 = 0x00007f75e9480700
[task 2018-03-30T02:18:11.100Z] 02:18:11     INFO -     r14 = 0x0000000000000000   r15 = 0x00007ffd099b9328
[task 2018-03-30T02:18:11.100Z] 02:18:11     INFO -     rip = 0x00007f75d87d631c
[task 2018-03-30T02:18:11.101Z] 02:18:11     INFO -     Found by: given as instruction pointer in context
[task 2018-03-30T02:18:11.102Z] 02:18:11     INFO -  1  libxul.so!nsCycleCollector::ScanWhiteNodes [nsCycleCollector.cpp:410fb98fd1e3c8e7609a1f7dbfe6369f42d190eb : 3231 + 0x8]
[task 2018-03-30T02:18:11.103Z] 02:18:11     INFO -     rbx = 0x00007f75e9480700   rbp = 0x00007ffd099b92d0
[task 2018-03-30T02:18:11.104Z] 02:18:11     INFO -     rsp = 0x00007ffd099b9280   r12 = 0x00007ffd099b9288
[task 2018-03-30T02:18:11.105Z] 02:18:11     INFO -     r13 = 0x00007f75e9480700   r14 = 0x0000000000000000
[task 2018-03-30T02:18:11.106Z] 02:18:11     INFO -     r15 = 0x00007ffd099b9328   rip = 0x00007f75d87d6850
[task 2018-03-30T02:18:11.107Z] 02:18:11     INFO -     Found by: call frame info
[task 2018-03-30T02:18:11.108Z] 02:18:11     INFO -  2  libxul.so!nsCycleCollector::ScanRoots [nsCycleCollector.cpp:410fb98fd1e3c8e7609a1f7dbfe6369f42d190eb : 3276 + 0xc]
[task 2018-03-30T02:18:11.109Z] 02:18:11     INFO -     rbx = 0x00007f75e9480700   rbp = 0x00007ffd099b9390
[task 2018-03-30T02:18:11.109Z] 02:18:11     INFO -     rsp = 0x00007ffd099b92e0   r12 = 0x0000000000000000
[task 2018-03-30T02:18:11.110Z] 02:18:11     INFO -     r13 = 0x00007ffd099b94f0   r14 = 0x0000000000000000
[task 2018-03-30T02:18:11.110Z] 02:18:11     INFO -     r15 = 0x00007ffd099b9328   rip = 0x00007f75d87e2372
[task 2018-03-30T02:18:11.111Z] 02:18:11     INFO -     Found by: call frame info
[task 2018-03-30T02:18:11.112Z] 02:18:11     INFO -  3  libxul.so!nsCycleCollector::Collect [nsCycleCollector.cpp:410fb98fd1e3c8e7609a1f7dbfe6369f42d190eb : 3767 + 0xb]
[task 2018-03-30T02:18:11.112Z] 02:18:11     INFO -     rbx = 0x00007f75e9480700   rbp = 0x00007ffd099b9420
[task 2018-03-30T02:18:11.113Z] 02:18:11     INFO -     rsp = 0x00007ffd099b93a0   r12 = 0x00007ffd099b9400
[task 2018-03-30T02:18:11.113Z] 02:18:11     INFO -     r13 = 0x00007ffd099b94f0   r14 = 0x0000000000000000
[task 2018-03-30T02:18:11.114Z] 02:18:11     INFO -     r15 = 0x0000000000000002   rip = 0x00007f75d87e4acc
[task 2018-03-30T02:18:11.114Z] 02:18:11     INFO -     Found by: call frame info
[task 2018-03-30T02:18:11.115Z] 02:18:11     INFO -  4  libxul.so!nsCycleCollector_collectSlice [nsCycleCollector.cpp:410fb98fd1e3c8e7609a1f7dbfe6369f42d190eb : 4330 + 0x1b]
[task 2018-03-30T02:18:11.116Z] 02:18:11     INFO -     rbx = 0x00007f75d302b870   rbp = 0x00007ffd099b9480
[task 2018-03-30T02:18:11.117Z] 02:18:11     INFO -     rsp = 0x00007ffd099b9430   r12 = 0x00007ffd099b9440
[task 2018-03-30T02:18:11.117Z] 02:18:11     INFO -     r13 = 0x00007ffd099b9438   r14 = 0x00007ffd099b94f0
[task 2018-03-30T02:18:11.118Z] 02:18:11     INFO -     r15 = 0x0000000000000000   rip = 0x00007f75d87e50d0
[task 2018-03-30T02:18:11.119Z] 02:18:11     INFO -     Found by: call frame info
[task 2018-03-30T02:18:11.119Z] 02:18:11     INFO -  5  libxul.so!nsJSContext::RunCycleCollectorSlice [nsJSEnvironment.cpp:410fb98fd1e3c8e7609a1f7dbfe6369f42d190eb : 1562 + 0x5]
[task 2018-03-30T02:18:11.120Z] 02:18:11     INFO -     rbx = 0x00007ffd099b9510   rbp = 0x00007ffd099b9560
[task 2018-03-30T02:18:11.121Z] 02:18:11     INFO -     rsp = 0x00007ffd099b9490   r12 = 0x0000000000000031
[task 2018-03-30T02:18:11.122Z] 02:18:11     INFO -     r13 = 0x00007ffd099b9600   r14 = 0x00007ffd099b94d8
[task 2018-03-30T02:18:11.122Z] 02:18:11     INFO -     r15 = 0x00000179eebc74d3   rip = 0x00007f75d9468131
[task 2018-03-30T02:18:11.123Z] 02:18:11     INFO -     Found by: call frame info
[task 2018-03-30T02:18:11.124Z] 02:18:11     INFO -  6  libxul.so!ICCRunnerFired [nsJSEnvironment.cpp:410fb98fd1e3c8e7609a1f7dbfe6369f42d190eb : 1618 + 0x5]
[task 2018-03-30T02:18:11.125Z] 02:18:11     INFO -     rbx = 0x00007f75c8f49000   rbp = 0x00007ffd099b9590
[task 2018-03-30T02:18:11.125Z] 02:18:11     INFO -     rsp = 0x00007ffd099b9570   r12 = 0x00007ffd099b9688
[task 2018-03-30T02:18:11.126Z] 02:18:11     INFO -     r13 = 0x00007ffd099b9650   r14 = 0x0000000000000000
[task 2018-03-30T02:18:11.127Z] 02:18:11     INFO -     r15 = 0x00007ffd099b9788   rip = 0x00007f75d9468188
[task 2018-03-30T02:18:11.128Z] 02:18:11     INFO -     Found by: call frame info
[task 2018-03-30T02:18:11.129Z] 02:18:11     INFO -  7  libxul.so!mozilla::IdleTaskRunner::Run [functional:410fb98fd1e3c8e7609a1f7dbfe6369f42d190eb : 2127 + 0xb]
[task 2018-03-30T02:18:11.130Z] 02:18:11     INFO -     rbx = 0x00007f75c8f49030   rbp = 0x00007ffd099b95c0
[task 2018-03-30T02:18:11.132Z] 02:18:11     INFO -     rsp = 0x00007ffd099b95a0   r12 = 0x00007ffd099b9688
[task 2018-03-30T02:18:11.133Z] 02:18:11     INFO -     r13 = 0x00007ffd099b9650   r14 = 0x0000000000000000
[task 2018-03-30T02:18:11.134Z] 02:18:11     INFO -     r15 = 0x00007ffd099b9788   rip = 0x00007f75d882f845
[task 2018-03-30T02:18:11.135Z] 02:18:11     INFO -     Found by: call frame info
[task 2018-03-30T02:18:11.136Z] 02:18:11     INFO -  8  libxul.so!nsThread::ProcessNextEvent [nsThread.cpp:410fb98fd1e3c8e7609a1f7dbfe6369f42d190eb : 1096 + 0x15]
[task 2018-03-30T02:18:11.137Z] 02:18:11     INFO -     rbx = 0x00007f75e942cae0   rbp = 0x00007ffd099b9ba0
[task 2018-03-30T02:18:11.138Z] 02:18:11     INFO -     rsp = 0x00007ffd099b95d0   r12 = 0x00007ffd099b9688
[task 2018-03-30T02:18:11.139Z] 02:18:11     INFO -     r13 = 0x00007ffd099b9650   r14 = 0x0000000000000000
[task 2018-03-30T02:18:11.140Z] 02:18:11     INFO -     r15 = 0x00007ffd099b9788   rip = 0x00007f75d8849e04
[task 2018-03-30T02:18:11.141Z] 02:18:11     INFO -     Found by: call frame info
Attachment #8963833 - Flags: feedback?(bugs)
Hrm.  Maybe we do need to fix window cleanup-vs-freeinnerobjects.  For example, there are other things relying on DisconnectFromOwner() to cut references like DETH keep alive stuff.
BroadcastChannel is working around this same problem using an nsIObserver on "inner-window-destroyed":

https://searchfox.org/mozilla-central/rev/57bbc1ac58816dc054df242948f3ecf75e12df5f/dom/broadcastchannel/BroadcastChannel.cpp#494-518

That seems less than ideal.
And MediaQueryList leaks if it has an onchange handler referencing the window.
So this is a bigger problem.  I do think we need to pursue calling DETH DisconnectFromOwner from FreeInnerObjects.

Since we need this bug to be fixed in beta, though, I will split things into separate bugs.  Something like:

1. Short term fix here to resolve the immediate service worker leak.
2. Follow-on bug to unify CleanUp()/FreeInnerObjects() so we can call DisconnectFromOwner() consistently.
3. Follow-on bug to remove nsIObserver work-around from BroadcastChannel.
4. Follow-on bug to fix MediaQueryList leaks.
Blocks: 1450266
Comment on attachment 8963367 [details] [diff] [review]
P1 Remove nsGlobalWindowInner::CleanUp() method in favor of FreeInnerObjects(). r=smaug

Moving long-term solution patches to bug 1450266.
Attachment #8963367 - Attachment is obsolete: true
Attachment #8963368 - Attachment is obsolete: true
Attachment #8963369 - Attachment is obsolete: true
Attachment #8963572 - Attachment is obsolete: true
Attachment #8963630 - Attachment is obsolete: true
Attachment #8963637 - Attachment is obsolete: true
Comment on attachment 8963833 [details] [diff] [review]
wip-cc.patch

This patch is wrong.
Attachment #8963833 - Attachment is obsolete: true
Attachment #8963828 - Attachment is obsolete: true
Summary: service worker enabled sites can leak ghost windows → service worker binding objects with event listeners can leak ghost windows
Blocks: 1450274
Comment on attachment 8963964 [details] [diff] [review]
P1 Call DisconnectFromOwner() on service worker binding objects in FreeInnerObjects(). r=asuth

Andrew, this is the short term fix for this leak.  As you may recall we have a ref-cycle in the SW binding objects that is cut when DisconnectFromOwner() is called.  What I did not realize before, though, is that we do not consistently call DisconnectFromOwner() for windows in bfcache or replaced via document.open().

I have a longer term fix in bug 1450266, but that won't be upliftable.  As a short term fix this patch just calls DisconnectFromOwner() on the SW objects in FreeInnerObjects.

Note, our SW ref-cycle stuff is very similar to DETH's KeepAliveIfEventListenersFor() mechanism.  Other binding objects like BroadcastChannel use that and run into the same issue we are here.  In BroadcastChannel's case, though, it uses an nsIObserver to listen for inner-window-destroyed which is dispatched from FreeInnerObjects().  I feel thats a bit more heavy-weight than what this patch proposes.
Attachment #8963964 - Flags: review?(bugmail)
Comment on attachment 8963965 [details] [diff] [review]
P2 Add a mochitest to test leaks due to service worker binding object usage. r=asuth

Andrew, this patch adds a test that checks for runtime leaks if an event listener is registered on a binding object.  Note that our normal leak checking does not catch this because we clean the binding objects up when SWM tears down.

I might rename the test slightly.  Maybe "test_event_listener_leaks.html" would be better than "test_leak_checks.html".
Attachment #8963965 - Flags: review?(bugmail)
Comment on attachment 8963976 [details] [diff] [review]
P3 Make nsIGlobalObject::ForEachEventTargetObject() protect against side effects that remove other DETH objects. r=asuth

I realized that ForEachEventTargetObject() was not quite safe if the callback had side effects that removed over DETH objects from the global.  This is more of a concern if we are disconnecting individual DETH objects from the list like P1 does.  This patch protects against this problem verifying the iteration target is still in the hash table before invoking the callback.

It would be nice if we could use nsTObserverArray, but I worry about making add/remove go from O(c) to O(n).
Attachment #8963976 - Flags: review?(bugmail)
Comment on attachment 8963965 [details] [diff] [review]
P2 Add a mochitest to test leaks due to service worker binding object usage. r=asuth

Review of attachment 8963965 [details] [diff] [review]:
-----------------------------------------------------------------

As always, thank you for the really great comments in the test and explaining the context on the bug.

Restating: This test reproduces the control flow paths that result in FreeInnerObjects() being called, but not CleanUp(), as covered in comment 40.  Namely, removed-from-bfcache and document.open.

::: dom/serviceworkers/test/test_leak_checks.html
@@ +81,5 @@
> +  await waitForState(reg.installing, "activated");
> +
> +  try {
> +    // Test if we leak in the case where we do nothing special to
> +    // te frame before removing it from the DOM.

typo nit: s/te/the/
Attachment #8963965 - Flags: review?(bugmail) → review+
Comment on attachment 8963964 [details] [diff] [review]
P1 Call DisconnectFromOwner() on service worker binding objects in FreeInnerObjects(). r=asuth

Review of attachment 8963964 [details] [diff] [review]:
-----------------------------------------------------------------

Restating: Per bug 1450266, there are some window cleanup issues; in many cases we depend on the cycle collector to save us, sometimes, like here, it does not.  DETH's DisconnectFromOwner calls break the relevant the cycles.  You can't call DisconnectFromOwner too many times because the DETH implementation removes itself from the nsIGlobalObject.  (And there are asserts in place that assure a single DETH is only added to the owner once.)  The only risk to the additional calls would be event listeners disappearing if FreeInnerObjects() was something called on non-doomed globals.  But it's not, it's just an alternate cleanup path that's broken.
Attachment #8963964 - Flags: review?(bugmail) → review+
Comment on attachment 8963976 [details] [diff] [review]
P3 Make nsIGlobalObject::ForEachEventTargetObject() protect against side effects that remove other DETH objects. r=asuth

Review of attachment 8963976 [details] [diff] [review]:
-----------------------------------------------------------------

Yes, if only everything were nsTObserverArray and we didn't have those pesky Talos tests! ;)
Attachment #8963976 - Flags: review?(bugmail) → review+
I decided it would be safer to just use a strong ref to the targets while iterating.  Its probably faster than an extra hash table lookup on each entry as well.
Attachment #8963976 - Attachment is obsolete: true
Attachment #8963980 - Flags: review?(bugmail)
Updated the test:

1. Fixed typo from review comment.
2. Renamed to test_event_listener_leaks.html
3. Added an extra GC to make --verify happy on try.

https://treeherder.mozilla.org/#/jobs?repo=try&revision=e9aeabf675f9d4d57303e7efe46a6dc475354977
Attachment #8963965 - Attachment is obsolete: true
Attachment #8963981 - Flags: review+
(In reply to Andrew Sutherland [:asuth] from comment #66)
> Comment on attachment 8963976 [details] [diff] [review]
> P3 Make nsIGlobalObject::ForEachEventTargetObject() protect against side
> effects that remove other DETH objects. r=asuth
> 
> Review of attachment 8963976 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> Yes, if only everything were nsTObserverArray and we didn't have those pesky
> Talos tests! ;)

Sorry, I chose to go a different direction with this patch.  So there is a new r? up now.  Sorry for the mid-air collision.
Comment on attachment 8963980 [details] [diff] [review]
P3 Make nsIGlobalObject::ForEachEventTargetObject() protect against side effects that might delete other DETH objects. r=asuth

Review of attachment 8963980 [details] [diff] [review]:
-----------------------------------------------------------------

I think you may still need the contains check.  Specifically, this code:

  void
  ServiceWorkerRegistration::DisconnectFromOwner()
  {
    mInner->ClearServiceWorkerRegistration(this);
    mInner = nullptr;
    DOMEventTargetHelper::DisconnectFromOwner();
  }

is not idempotent, but you're now potentially calling it twice and hoping it doesn't crash the second time, but it will.  So I'd add the contains check unless we want to audit the existing disconnect methods.
Ok, I'll just go back to the original P3 patch you r+'d.  Thanks!
Attachment #8963980 - Attachment is obsolete: true
Attachment #8963980 - Flags: review?(bugmail)
Attachment #8963976 - Attachment is obsolete: false
Minor changes to the test:

1. Test event listeners on ServiceWorkerContainer as well for completeness.
2. Pass the contentWindow by itself to the listener's closure instead of the frame object.  This makes it easier to leak what we are actually testing for at the end.
Attachment #8963981 - Attachment is obsolete: true
Attachment #8963996 - Flags: review+
Pushed by bkelly@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/1e96975ff12b
P1 Call DisconnectFromOwner() on service worker binding objects in FreeInnerObjects(). r=asuth
https://hg.mozilla.org/integration/mozilla-inbound/rev/53c7a68374a8
P2 Add a mochitest to test leaks due to service worker binding object usage. r=asuth
https://hg.mozilla.org/integration/mozilla-inbound/rev/96c6921fef5a
P3 Make nsIGlobalObject::ForEachEventTargetObject() protect against side effects that might delete other DETH objects. r=asuth
I just manually verified again that it fixes the youtube leak (as I was able to reproduce it).
Comment on attachment 8963964 [details] [diff] [review]
P1 Call DisconnectFromOwner() on service worker binding objects in FreeInnerObjects(). r=asuth

Approval Request Comment
[Feature/Bug causing the regression]: Bug 1438211
[User impact if declined]: Memory leaks on any site that uses service workers and sets event listeners on its DOM objects.  This includes sites like youtube, facebook, etc.  So kind of bad.
[Is this code covered by automated tests?]: P2 patch here adds a new test that catches the problem.  Our previous tests did not catch it because the memory is cleared at shutdown.  So our normal mochitest leak checking could not see it.  The new test checks for leaks during runtime.
[Has the fix been verified in Nightly?]: Its in inbound right now, but has not merged to central yet.

[Needs manual test from QE? If yes, steps to reproduce]: Not strictly necessary, but given the severity it might be worth having QE verify.  STR:
1. Open a youtube video page.  Doesn't matter which one.
2. Refresh to make sure you have the page controlled by a service worker.
3. After the page is fully loaded and the video is playing, navigate the window to example.com.
4. In a separate tab open about:memory, click minimize, and measure.
5. Verify the youtube window is gone from the output.  You should only see the example.com and maybe an about:newtab in the content process.

[List of other uplifts needed for the feature/fix]: None
[Is the change risky?]: Minimal
[Why is the change risky/not risky?]: This patchset is focused on the minimum number of changes necessary to fix the problem for service workers.  It should not effect any other features.
[String changes made/needed]: None
Attachment #8963964 - Flags: approval-mozilla-beta?
Comment on attachment 8963996 [details] [diff] [review]
P2 Add a mochitest to test leaks due to service worker binding object usage. r=asuth

See comment 76.
Attachment #8963996 - Flags: approval-mozilla-beta?
Comment on attachment 8963993 [details] [diff] [review]
P3 Make nsIGlobalObject::ForEachEventTargetObject() protect against side effects that remove other DETH objects. r=asuth

See comment 76.
Attachment #8963993 - Flags: approval-mozilla-beta?
I have verified that the leak in comment 0 no longer occurs in the latest nightly.
Comment on attachment 8963964 [details] [diff] [review]
P1 Call DisconnectFromOwner() on service worker binding objects in FreeInnerObjects(). r=asuth

Fix a new memory leak in Fx60 involving sites using Service Workers. Approved for 60.0b9. Thanks for adding new automated test coverage for this too.

That said, I'm also setting qe-verify+ because I do agree that we should get some extra testing around this from QE.
Attachment #8963964 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
Attachment #8963993 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
Attachment #8963996 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
Flags: qe-verify+
Flags: in-testsuite+
I successfully reproduced the issue on Firefox 60.0b7 under Windows 10 (x64) using STR from Comment 0.

The issue is no longer reproducible on Firefox 60.0b9 and latest Nightly 61.0a1 (2018-04-02). I also checked on Ubuntu 16.04 (x64) and macOS 10.12 and everything is working as expected.
Status: RESOLVED → VERIFIED
Flags: qe-verify+
Duplicate of this bug: 1450732
Duplicate of this bug: 1448278
You need to log in before you can comment on or make changes to this bug.