service worker install fails if install event is GC'd before waitUntil() promise is fulfilled

RESOLVED FIXED in Firefox 49

Status

()

defect
RESOLVED FIXED
3 years ago
3 years ago

People

(Reporter: bkelly, Assigned: bkelly)

Tracking

(Blocks 1 bug)

48 Branch
mozilla49
Points:
---
Dependency tree / graph

Firefox Tracking Flags

(firefox46 wontfix, firefox47 wontfix, firefox48 wontfix, firefox49+ fixed, firefox-esr45 disabled)

Details

(Whiteboard: btpp-active)

Attachments

(5 attachments, 7 obsolete attachments)

4.96 KB, patch
asuth
: review+
Details | Diff | Splinter Review
2.24 KB, patch
lina
: review+
Details | Diff | Splinter Review
7.35 KB, patch
asuth
: review+
Details | Diff | Splinter Review
3.73 KB, patch
bkelly
: review+
Details | Diff | Splinter Review
4.62 KB, patch
bkelly
: review+
Details | Diff | Splinter Review
The wpt test register-wait-forever-in-install-worker.https.html blocks an install event waitUntil() forever.  The old spec before the queue rewrite used to explicitly kill workers in this state when register() was called again.  The new spec does not do this because we only let one job run at a time.

Somehow this test is seeing the promise fail as expected, but only after close to 10 seconds.  I need to investigate whats happening.
Blocks: 911216
This is a bad bug!  We are failing the service worker install if the install event is GC'd before the waitUntil promises are fulfilled.  This is very racy and could lead to unexpected behavior in the wild.
Summary: register-wait-forever-in-install-worker.https.html doesn't seem to work quite right any more → service worker install fails if install event is GC'd before waitUntil() promise is fulfilled
[Tracking Requested - why for this release]:
This is a bad compat bug.  I would like to get it into FF47 if I can.  I hope to have a test and fix written today.
Tracking and marking this as blocking 47 based on Ben's assessment.
Ugh.  This affects respondWith() too!
Whiteboard: btpp-active
(Assignee)

Updated

3 years ago
See Also: → 1274100
This is trickier to test than I thought.  If I want to gracefully teardown the test I'd like to resolve the waitUntil() promise after I've verified it doesn't get GC'd.  But if I write any code that holds on to the resolve() function to call it later, then the promise is not GC'd.  This makes sense.

It also means this issue is a lot less critical.  All useful code is going to be resolving or rejecting their waitUntil() promise.

We still probably want to fix this, though, because otherwise it makes GC observable to script.

I'm going to have to investigate further tomorrow.
https://treeherder.mozilla.org/#/jobs?repo=try&revision=773709106b57

I'm not running try on windows since all the TC clipboard errors make it too noisy at the moment.
Comment on attachment 8755080 [details] [diff] [review]
P4 Fix bugs in dom/push/test_serviceworker_lifetime.html test. r=kitcambridge

Kit, this fixes a bug in the test_serviceworker_lifetime.html test where it was trying to call waitUntil() asynchronously.  It must be called synchronously in the event handler.
Attachment #8755080 - Flags: review?(kcambridge)
I don't think we should uplift this any more.  Any useful service worker code is actually going to handle this just find because the promise is held alive via other means.  Its pretty much only a problem for stuff like:

  evt.waitUntil(new Promise(function() { }));

Which isn't very useful in practice.

Since the impact is much less than I originally thought I don't want to uplift this anymore.
Comment on attachment 8755078 [details] [diff] [review]
P2 Fix register-wait-forever-in-install-worker.https.html to expect new unified service worker job queue behavior. r=asuth

This fixes the wpt test to expect the new spec behavior.  In the past the spec would attempt to kill an installing worker when a new register() call for the scope is made.

We now run register jobs sequentially.  The second register job should not run until the first job is timed out.

I modified the test to wait some reasonable time before gracefully ending the first install.  It seems the easiest way to demonstrate that the second register does not abort the first.

Relevant spec starts here:

https://slightlyoff.github.io/ServiceWorker/spec/service_worker/#navigator-service-worker-register
Attachment #8755078 - Flags: review?(bugmail)
Comment on attachment 8755079 [details] [diff] [review]
P3 Add mochitest that demonstrates we cancel sw install if install event is GC'd. r=asuth

This adds a test that shows we incorrectly abort the SW install if the promise passed to evt.waitUntil() is GC'd.

I would have preferred to perform an exact CC/GC, but we don't have the infrastructure to do this for a worker JS context.  I filed bug 1274100 about this.  Instead I had to use a rather lame 10 second timeout.

This does consistently trigger the problem locally for me without the P1 patch.
Attachment #8755079 - Flags: review?(bugmail)
Comment on attachment 8755077 [details] [diff] [review]
P1 Hold strong reference to service worker WaitUntil() promise until its fulfilled. r=asuth

This patch will hold a strong ref to the promise passed to evt.waitUntil() or evt.respondWith() until the promise is fulfilled.

It does this by having the KeepAliveHandler ref the promise and the ServiceWorkerPrivate ref the KeepAliveHandler.  We use the ServiceWorkerPrivate StoreISupports() mechanism for this second ref.  These refs are automatically dropped if the service worker is timed out and killed.

The patch has added complication because the Promise can only be AddRef()/Release()'d on the worker thread.  The ServiceWorkerPrivate StoreISupports(), however, only works from the main thread.

This is mostly straightforward if the promise resolves normally, but gets harder if the worker terminated and the KeepAliveHandler is destroyed on the main thread.  To handle this case we use a WorkerControlRunnable to release the Promise.  We also hold a feature alive to make sure we can dispatch the runnable.
Attachment #8755077 - Flags: review?(bugmail)
Comment on attachment 8755077 [details] [diff] [review]
P1 Hold strong reference to service worker WaitUntil() promise until its fulfilled. r=asuth

Review of attachment 8755077 [details] [diff] [review]:
-----------------------------------------------------------------

::: dom/workers/ServiceWorkerPrivate.cpp
@@ +388,5 @@
> +  void
> +  ReleaseOnMainThread()
> +  {
> +    AssertIsOnMainThread();
> +    mKeepAliveToken->GetServiceWorkerPrivate()->RemoveISupports(this);

It occurs to me we should probably explicitly drop the mKeepAliveToken here.  This could avoid an unnecessary main thread proxy runnable to release the token if something on the worker thread still has a ref to the KeepAliveHandler.
Comment on attachment 8755077 [details] [diff] [review]
P1 Hold strong reference to service worker WaitUntil() promise until its fulfilled. r=asuth

Actually, I think I can simplify this a lot.  It creates a cycle ref with the promise, so we shouldn't need the StoreISupports hassle.
Attachment #8755077 - Flags: review?(bugmail)
(In reply to Ben Kelly [:bkelly] from comment #18)
> Comment on attachment 8755077 [details] [diff] [review]
> P1 Hold strong reference to service worker WaitUntil() promise until its
> fulfilled. r=asuth

How would this work with JS Promise? Those aren't CC'd, and we don't keep the wrappers alive, so it seems like this mechanism wouldn't transfer directly.
Flags: needinfo?(bkelly)
Attachment #8755080 - Flags: review?(kcambridge) → review+
(In reply to Till Schneidereit [:till] from comment #21)
> (In reply to Ben Kelly [:bkelly] from comment #18)
> > Comment on attachment 8755077 [details] [diff] [review]
> > P1 Hold strong reference to service worker WaitUntil() promise until its
> > fulfilled. r=asuth
> 
> How would this work with JS Promise? Those aren't CC'd, and we don't keep
> the wrappers alive, so it seems like this mechanism wouldn't transfer
> directly.

You are saying if c++ code holds its dom::Promise reference alive the promise can still be GC'd?  And this will drop references to the native callbacks?

If the out js promise is GC'd before it's fulfilled, does it reject the inner dom Promise?
Flags: needinfo?(bkelly) → needinfo?(till)
The simplified patch works as well.  Its much easier to understand since we don't need to mess with main thread at all.

This try build (with separate interdiff patch before folding together) is green:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=0731ae47c13c

Till, I also ran my tests P3 test locally with spidermonkey promises.  The test failed without this P1 patch and passed with the P1 patch.  In the end it does not really matter if the outer JS promise is GC'd as long as the inner C++ Promise is kept alive.
Attachment #8755077 - Attachment is obsolete: true
Flags: needinfo?(till)
Attachment #8755161 - Flags: review?(bugmail)
Comment hidden (Intermittent Failures Robot)
It turns out bug 927740 gave us the gift of the "child-cc-request" observer notification that, when emitted:
- in the parent, dispatches an async message to children to re-emit it
- in the child, causes a CycleCollectorRunnable to be dispatched to all workers

Here's a patch that you can use or scavenge as appropriate.  It causes failures without the P1 fix and passes with the P1 fix.  There are comments in there about how it will break when the serviceworkers start getting spawned in a different process and a hint about how to fix it.  I'm assuming the ordering barrier postMessage will not get bounced off the parent process's main thread when that happens, so I think the attempt would be broken at this time.
Attachment #8755691 - Flags: feedback?(bkelly)
Attachment #8755161 - Flags: review?(bugmail) → review+
Comment on attachment 8755078 [details] [diff] [review]
P2 Fix register-wait-forever-in-install-worker.https.html to expect new unified service worker job queue behavior. r=asuth

Review of attachment 8755078 [details] [diff] [review]:
-----------------------------------------------------------------

r=asuth assuming we're just going with a comment and the dump() calls removed.

::: testing/web-platform/tests/service-workers/service-worker/register-wait-forever-in-install-worker.https.html
@@ +24,5 @@
> +            return;
> +          }
> +          registration.installing.postMessage('STOP_WAITING');
> +          resolve();
> +        }, 2000);

If I understand correctly, the motivation behind the timeout is that the Job Queue abstraction is not not exposed to content so it's hard to know whether the other register() call has:
- A: actually been processed and the determination made to queue and not abort the bad one, or
- B: is still making its way to that processing logic and will do the incorrect thing when it gets there.

The timeout is a hack/compromise.

A more appealing hack if it's possible might be to leverage that "A user agent must maintain a separate job queue for each service worker registration keyed by its scope url."  In other words, do:
- register(bad, '/maxscope')
- register(good, '/maxscope')
- register(cleverInference, '/maxscope/subscope').

The idea would be that the cleverInference worker is able to start installing itself in parallel to "bad" and that all registrations go through a single funnel point which is the processing stage of interest to us.  If cleverInference has begun installing (which we detect by it messaging us), then "good" must have been processed already and enqueued.

It's possible this idea is already forbidden by spec, is wrong about how service workers work, or makes unacceptable assumption about unspecified behavior.  But I think we all hate timeouts, so it's worth proposing.  I can do more research if it doesn't seem nuts but needs more investigation.

If a better hack isn't possible, I think it'd be great to have a comment here about the timeout really being the only option and how even my crazy proposal was worse/impossible.

@@ +41,3 @@
>          })
> +      .then(function() {
> +          dump(registration.installing);

Suspect these dump() calls were accidentally left in here.  If intentional, convert from non-standard dump() to something more idiomatic.
Attachment #8755078 - Flags: review?(bugmail) → review+
Attachment #8755079 - Flags: review?(bugmail) → review+
Good idea!  That's totally allowed per the spec since each scope gets its own job queue.
Attachment #8755078 - Attachment is obsolete: true
Attachment #8755984 - Flags: review+
Comment on attachment 8755691 [details] [diff] [review]
P3b cleanup test to use child-cc-request observer mechanism instead of 10sec timeout

Looks good to me!  I'll just land it as a separate P3b patch.
Attachment #8755691 - Flags: feedback?(bkelly) → review+
I had to add a child-gc-request before and after the CC request in order to trigger the failure for SPIDERMONKEY_PROMISE builds.

https://treeherder.mozilla.org/#/jobs?repo=try&revision=3bb1cae06858
Attachment #8755691 - Attachment is obsolete: true
Attachment #8755997 - Flags: review+
You need to log in before you can comment on or make changes to this bug.