Closed Bug 1265795 Opened 8 years ago Closed 8 years ago

Service worker can continue to control document in bfcache

Categories

(Core :: DOM: Service Workers, defect)

48 Branch
defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla48
Tracking Status
firefox45 --- wontfix
firefox46 --- wontfix
firefox47 + fixed
firefox48 + fixed
firefox-esr45 --- disabled

People

(Reporter: bkelly, Assigned: bkelly)

References

(Blocks 2 open bugs)

Details

(Whiteboard: btpp-active)

Attachments

(4 files, 3 obsolete files)

+++ This bug was initially created as a clone of Bug #1265771 +++

While investigating the fix for bug 1265771 I discovered with Boris's help that nsDocument::RemovedFromDocShell() does not get called when a document goes into the bfcache.  This means we don't stop controlling documents on navigation within a window.

Instead we need to move the service worker logic to nsDocument::SetScriptGlobalObject() when nullptr is passed in for the new global.  In addition, we should only start controlling when a non-nullptr global is passed in.

I'm doing this as a separate bug because its important that we uplift this one.
Tracking since we may want to uplift this to 47 and Ben says it is important and may have a fix coming soon.
Boris, this patch is based on our IRC conversation today.  Our intention is to stop controlling a document when it becomes inactive.  Previously we were trying to do this with RemovedFromDocShell(), but that is not called if the document goes into bfcache.

I'm fairly sure this works as expected for the iframe.remove() case because we have many, many use cases of that throughout our mochitests and wpt tests.  They pass with this patch.

I'm going to work on a new test case that navigates a window to demonstrate the old document becomes uncontrolled.
Attachment #8742939 - Flags: review?(bzbarsky)
Comment on attachment 8742939 [details] [diff] [review]
P1 Uncontrolled service workers when global is removed from document. r=bz

Review of attachment 8742939 [details] [diff] [review]:
-----------------------------------------------------------------

::: dom/base/nsDocument.cpp
@@ +4680,5 @@
> +          loader->ClearCacheForControlledDocument(this);
> +        }
> +      }
> +      swm->MaybeStopControlling(this);
> +    }

Its difficult to tell in splinter, but this is in this conditional block:

  if (mScriptGlobalObject && !aScriptGlobalObject) {

So it only runs when we are clearing a global from the document.
Comment on attachment 8742939 [details] [diff] [review]
P1 Uncontrolled service workers when global is removed from document. r=bz

r=me
Attachment #8742939 - Flags: review?(bzbarsky) → review+
I wrote a test that navigates a window, but I can't actually reproduce this.  I think perhaps RemovedFromDocShell() is getting called through another convoluted path.  Something that calls nsDocument::Destroy().

I think we should still land this patch, though, since it makes it much more clear when the document becomes uncontrolled.  Its also nice to see the symmetry between becoming controlled when setting the global and becoming uncontrolled when clearing the global.  We can see all the logic in a single method.

With that said, I don't think this needs to be uplifted any more.

Boris, does that sound reasonable to you?  I'll get a stack for how RemovedFromDocShell() is getting called tomorrow to make sure we fully understand.
Flags: needinfo?(bzbarsky)
Well, let me do more testing.  I think I just saw a case in my browser where a service worker registration was kept alive after a series of navigate/back/forward/back/forward and then an unregister.  Let me see if that triggers any corner case in my test tomorrow.
Flags: needinfo?(bzbarsky)
> I wrote a test that navigates a window, but I can't actually reproduce this.

In your testcase, was the document you were testing against going into bfcache?  You can check via the pagehide event.
Ah, the test is passing because Clients.matchAll() is filtering out controlled documents that don't have a window.  This is still bad, though, because SWM thinks the documents are controlled and will keep a service worker registration alive because of this.  So we do need to fix and uplift this.

Unfortunately the test fails with the P1 patch applied.  I need to investigate.
Blocks: 1263670
Updated to clear some service worker specific state when the document becomes inactive.  This lets us actually re-control the document when it comes out of bfcache.
Attachment #8742939 - Attachment is obsolete: true
Attachment #8743391 - Flags: review+
Slightly better version of the patch that maintains the same clientID for the document after it comes out of bfcache.
Attachment #8743391 - Attachment is obsolete: true
Attachment #8743423 - Flags: review+
This test fails with the assertion I'm about to add in P3.  It then succeeds once P1 is applied.
Attachment #8743335 - Attachment is obsolete: true
Attachment #8743428 - Flags: review?(bzbarsky)
Assert that we don't have any controlled documents without an outer window.  This is exactly the sort of thing we want to avoid in this bug.  This change is necessary for the test to fail today since previously we filtered these window clients out.
Attachment #8743429 - Flags: review?(bzbarsky)
Comment on attachment 8743428 [details] [diff] [review]
P2 Add a web-platform-test for the window navigation case. r=bz

r=me
Attachment #8743428 - Flags: review?(bzbarsky) → review+
Comment on attachment 8743429 [details] [diff] [review]
P3 Assert that controlled documents have an outer window. r=bz

r=me
Attachment #8743429 - Flags: review?(bzbarsky) → review+
The try build shows we are leaking windows in browser_printpreview.js.  I believe nsDocumentViewer calls nsDocument::Destroy() for the print preview path without first calling SetScriptGlobalObject(nullptr).  Elsewhere it always pairs these two methods together.

Here is a try push that refactors things so nsDocument::Destroy() always calls SetScriptGlobalObject(nullptr) instead of requiring our external callers to do that:

  https://treeherder.mozilla.org/#/jobs?repo=try&revision=a9777c75e722
The try run is looking quite good with this patch.

To recap, the leak occurred because I moved ServiceWorkerManager::MaybeStopControllingDocument() from nsDocument::RemovedFromDocShell() to SetScriptGlobalObject(nullptr).  One of the things MaybeStopControllingDocument() does is remove a reference to the document.  Unfortunately, it seems certain paths through nsDocumentViewer do not call SetScriptGlobalObject(nullptr) when destroying a document.  So this caused the window to leak because SWM was still holding a reference to it.

It seems wrong to me that a document can be destroyed without explicitly having its global cleared and being marked inactive.  This path moves the call to SetScriptGlobalObject(nullptr) into nsDocument::Destroy().  This avoids external callers having to remember to call both these methods together.

To see that SetScriptGlobalObject(nullptr) can fail to be called currently, just run the browser_printpreview.js browser chrome mochitest.

Again, here is the try that looks pretty green with this patch:

  https://treeherder.mozilla.org/#/jobs?repo=try&revision=a9777c75e722
Attachment #8743956 - Flags: review?(bzbarsky)
Comment on attachment 8743956 [details] [diff] [review]
P4 Always call nsDocument::SetScriptGlobalObject(nullptr) from nsDocument::Destroy(). r=bz

OK.  So the issue is that in the print-preview case (and only in the case) nsDocumentViewer::OnDonePrinting doesn't null out mPrintEngine (on purpose).  Then when nsDocumentViewer::Close is called we think that non-null mPrintEngine means we're closing while still printing and we don't call mDocument->SetScriptGlobalObject(nullptr).  We also don't call mDocument->RemovedFromDocShell().

Then we land in nsDocumentViewer::Destroy and mPrintEngine->CheckBeforeDestroy() returns false.  So we press and calls mDocument->Destroy().  This will call RemovedFromDocShell for us, so that part gets handled, at least....

We could add some state tracking for "OnDonePrinting already called for this mPrintEngine" or something, but it's not clear to me that we can't get multiple OnDonePrinting calls for the same mPrintEngine in the print preview case.  This printing stuff is not as well documented or cleanly integrated as one would like...

Anyway, you're right that the other calls to nsIDocument::Destroy are all in this file, and that they're all preceded by SetScriptGlobalObject(nullptr), albeit indirectly in the ~nsDocumentViewer case.   So this looks like the safe thing to do (surprising, I know!) compared to other possible solutions to this print preview craziness.

r=me.
Attachment #8743956 - Flags: review?(bzbarsky) → review+
I'm going to wait to request uplift on this until it has baked a bit.  I'd still like to get it on FF47 if we can, but the change in P4 is a bit more risky than I thought we would need.
Comment on attachment 8743423 [details] [diff] [review]
P1 Uncontrolled service workers when global is removed from document. r=bz

Approval Request Comment
[Feature/regressing bug #]: service workers
[User impact if declined]: This bug creates a very confusing situation for web developers where they may not be able to update their service worker because a document in the history is holding it alive.  In addition, we need this fix in order to uplift bug 1265771 which exposes these history documents to content in Clients.matchAll().  This is a bad compat issue.
[Describe test coverage new/current, TreeHerder]: Automated tests included.
[Risks and why]: Moderate.  Fixing this bug required changing some code in nsDocument and nsDocumentViewer which are used much more widely than service workers.  We believe the changes are correct, pass tests, and have had some bake time on nightly/aurora, but there is some risk.
[String/UUID change made/needed]: None.
Attachment #8743423 - Flags: approval-mozilla-beta?
Comment on attachment 8743428 [details] [diff] [review]
P2 Add a web-platform-test for the window navigation case. r=bz

See comment 23.
Attachment #8743428 - Flags: approval-mozilla-beta?
Comment on attachment 8743429 [details] [diff] [review]
P3 Assert that controlled documents have an outer window. r=bz

See comment 23.
Attachment #8743429 - Flags: approval-mozilla-beta?
Comment on attachment 8743956 [details] [diff] [review]
P4 Always call nsDocument::SetScriptGlobalObject(nullptr) from nsDocument::Destroy(). r=bz

See comment 23.
Attachment #8743956 - Flags: approval-mozilla-beta?
Hello BZ, given the risk assessment, I would like a second opinion on the severity of this issue on end-users and whether this is worth fixing in 47 or let it ride the trains. What do you think? Appreciate your help here.
Flags: needinfo?(bzbarsky)
The user impact is likely not too great.  The impact on web developer goodwill is a lot more significant...

I do think the changes are fairly low risk, though; we vetted them pretty carefully.
Flags: needinfo?(bzbarsky)
(In reply to Boris Zbarsky [:bz] from comment #28)
> The user impact is likely not too great.  The impact on web developer
> goodwill is a lot more significant...
> 
> I do think the changes are fairly low risk, though; we vetted them pretty
> carefully.

Thanks BZ for a prompt reply! I am inclined to taking it in Beta47.
Comment on attachment 8743423 [details] [diff] [review]
P1 Uncontrolled service workers when global is removed from document. r=bz

While this doesn't have a clear end-user impact atm, I am still inclined to taking it since we are striving to make our code more robust and it is still early in the Beta cycle, Beta47+
Attachment #8743423 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
Attachment #8743428 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
Attachment #8743429 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
Attachment #8743956 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
This needs some re-basing.
You need to log in before you can comment on or make changes to this bug.