Closed Bug 1290116 Opened 8 years ago Closed 8 years ago

service worker unregister() should wait for active worker to go idle before clearing the registration

Categories

(Core :: DOM: Service Workers, defect)

47 Branch
defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla51
Tracking Status
firefox48 --- wontfix
firefox49 --- wontfix
firefox-esr45 --- disabled
firefox50 --- wontfix
firefox51 --- fixed

People

(Reporter: rubys, Assigned: bkelly)

References

(Blocks 1 open bug)

Details

Attachments

(7 files)

User Agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:47.0) Gecko/20100101 Firefox/47.0
Build ID: 20160606113944

Steps to reproduce:

Attempted to use Basic Authentication with Service Workers.  I started with the offline-fallback demo in serviceworke.rs and added a .htaccess and htpasswd file.  Complete instructions here:

https://github.com/rubys/serviceworker-cookbook/tree/authenticated-offline-fallack/offline-fallback#reproduction-instructions-apache-httpd

Diff:

https://github.com/mozilla/serviceworker-cookbook/commit/554a5b86a378bc4783668388f3e3551eac112c3a


Actual results:

Got a Corrupted Content Error:

https://raw.githubusercontent.com/rubys/serviceworker-cookbook/authenticated-offline-fallack/offline-fallback/CorruptedContentError-firefox.png


Expected results:

Should have seen "Yay, you are online!" page.
Component: Untriaged → DOM: Service Workers
Product: Firefox → Core
Flags: needinfo?(bkelly)
I can reproduce.  I didn't use apache, though.  I used this SimpleHTTPServer python script:

https://gist.githubusercontent.com/fxsjy/5465353/raw/c5e5b66f5b5b35247560d19fd2ea5251f88e34b7/SimpleAuthServer.py

I don't think this is actually related to the basic auth.  The failure only occurs when pressing the "View Complete Demo Again" button.  This does an unregister() and reload() without waiting like this:

    navigator.serviceWorker.getRegistration().then(function(registration) {
      registration.unregister();
      window.location.reload();
    });

If I change this code to look like this:

    navigator.serviceWorker.getRegistration().then(function(registration) {
      if (registration) {
        return registration.unregister();
      }
    }).then(_ => {
      window.location.reload();
    });

Then I can no longer reproduce.  It seems we have some unsafe race when reloading before the unregister() promise completes.
Assignee: nobody → bkelly
Status: UNCONFIRMED → ASSIGNED
Ever confirmed: true
Flags: needinfo?(bkelly)
With the original code I see one of four results:

1) "Service worker change, registered the service worker" shown on the page and the web console shows "[oninstall] Cached offline page 'http://localhost:8080/offline.html'".
2) "An active service worker controller was found, no need to register" shown on the page and no logs in the web console.
3) Content corrupted page with this in the web console "Failed to load ‘http://localhost:8080/?1470016121244’. A ServiceWorker intercepted the request and encountered an unexpected error."
4) Only once I saw the same as (1), but also an error about the service worker script fetch failing.

With the code modifications in comment 1 I consistently get only result (1).
Sam, please be aware if you require basic auth on the offline.html page itself (or any pre-cached pages fetch'd in the install event handler) they will be cached as failed authorizations.  We don't have a window to associate the authorization with in the install event, so I think you will always get a failed auth for those fetches.

I think this is unrelated to the content corrupted page, though.
Flags: needinfo?(rubys)
It is not clear what information you need from me.

I will say that my current code base loads the cache entirely from the "client" process (i.e., the browser), and I can verify that the cache contains successful results, and I do NO fetches from the service worker (not precisely true: the code will try a fetch if the object is not in the cache, but I have verified that that path is not being executed). Even so, I still get either content corrupted or failed authorization errors (again, despite verifying that the cache only contains successful responses).

In the case of failed authorization errors, a shift-F5 reload from the server, and in the process throw up a modal dialog prompting for user and password, and after that point the service worker generally behaves, but exiting the browser completely and starting again can cause me to get authentication errors again.

I'll be glad to share my code, but I thought it best to start the conversation with the posted service worker examples.
Sorry.  I guess it was more just an fYI about a problem with the example code.  If that does not reflect your real situation, then that is good.  The behavior I was seeing in the example was a failed auth page when the server was offline.
Flags: needinfo?(rubys)
So, we are hitting this error on the serviceworke.rs site as well:

Failed to load ‘https://serviceworke.rs/offline-fallback/’. A ServiceWorker intercepted the request and encountered an unexpected error.service-worker.js:23:4

But its not showing the error because about:neterror is deemed non-secure and this is an iframe in a secure document:

Blocked loading mixed active content “about:neterror?e=corruptedContentErrorv2&u=https%3A//serviceworke.rs/offline-fallback/&c=UTF-8&f=regular&d=The%20site%20at%20https%3A//serviceworke.rs/offline-fallback/%20has%20experienced%20a%20network%20protocol%20violation%20that%20cannot%20be%20repaired.”[Learn More]

In localhost http testing I don't think we consider this to be mixed content.
Filed bug 1291448 about blocking about:neterror as mixed content.
See Also: → 1291448
For some reason the RespondWithHandler is being GC'd before the promise is fulfilled.  Currently we cancel the intercepted request in this case:

https://dxr.mozilla.org/mozilla-central/source/dom/workers/ServiceWorkerEvents.cpp#341

We could just reset the channel here, but I'd like to figure out why we are GC'ing the RespondWithHandler first.
I believe the problem here is that we unconditionally terminate workers and remove the registration when the last controlling document goes away:

https://dxr.mozilla.org/mozilla-central/source/dom/workers/ServiceWorkerUnregisterJob.cpp#142
https://dxr.mozilla.org/mozilla-central/source/dom/workers/ServiceWorkerManager.cpp#2005

We really need to wait to do this until the active worker is idle.  This is very similar to bug 1170543.

I'll make these entry points check the active worker idle state.  I'll also add this to the spec issue.
Depends on: 1170543
With these patches I can no longer trigger the failure locally.  I will write a wpt test that exposes the problem next.
This fails without the P1, P2, and P3 patches.  It succeeds with the patches.
Summary: Attempting to use Basic Auth with ServiceWorkers results in Corrupted Content Error → service worker unregister() should wait for active worker to go idle before clearing the registration
We should release this with bug 1170543 if we can.
Comment on attachment 8777375 [details] [diff] [review]
P1 Expose a ServiceWorkerRegistrationInfo::IsIdle() method. r=asuth

Step one is to expose the "registration is idle" state in a convenient way.  later patches will then use this method.
Attachment #8777375 - Flags: review?(bugmail)
Comment on attachment 8777377 [details] [diff] [review]
P2 Clean up ServiceWorkerManager::StopControllingADocument(). r=asuth

This patch simply cleans up the logic in ServiceWorkerManager::StopControllingADocument() to use our traditional "short circuit return" style.  It has no functional change.
Attachment #8777377 - Flags: review?(bugmail)
Comment on attachment 8777378 [details] [diff] [review]
P3 Delay removing registration until active worker is idle. r=asuth

Do the actual work of delaying registration teardown until after the registration's active worker has gone idle.
Attachment #8777378 - Flags: review?(bugmail)
Comment on attachment 8777396 [details] [diff] [review]
P4 Verify that unregister waits for active worker to go idle. r=asuth

Add a test case verifying we wait to teardown the registration until the active worker is idle.

I put this in activation.https.html because its nearly identical to the other tests there.  Its just a bit of a misnomer because its not about activation but instead about unregister.  It doesn't seem worth the hassle of renaming the test or creating a new test file with duplicate logic.
Attachment #8777396 - Flags: review?(bugmail)
Attachment #8777375 - Flags: review?(bugmail) → review+
Attachment #8777377 - Flags: review?(bugmail) → review+
Attachment #8777378 - Flags: review?(bugmail) → review+
Comment on attachment 8777396 [details] [diff] [review]
P4 Verify that unregister waits for active worker to go idle. r=asuth

Review of attachment 8777396 [details] [diff] [review]:
-----------------------------------------------------------------

Restating the test for my benefit:
- Step one retracts the controlled iframe document that "active" was controlling.  "active" is now being kept alive by the (post)MessageEvent's waitUntil().  Prior to the code in P3 being introduced that checks for the worker being idle, this would have either resulted in the registration being removed in one of:
  - ServiceWorkerUnregisterJob::Unregister if the document was already gone, or
  - ServiceWorkerManager::StopControllingADocument if the document removal raced the unregister.  For this specific test, it won't be this one because nsDocument::SetScriptGlobalObject should get synchronously called by the remove() and call through to this path prior to the unregister call in the test.
- Step two is gated on the unregister job completing (in step 5, after flagging uninstalling in 4, with step 6 skipping with us assuming that your proposed change of "using the service worker" at https://github.com/slightlyoff/ServiceWorker/issues/916#issuecomment-237272060 will go through).  We wait for the the "waiting" SW to be marked redundant, which will occur when ServiceWorkerRegistrationInfo::Clear() is invoked by SWM::RemoveRegistration().  That happens when we postMessage the "active" SW which resolves its MessageEvent waitUntil() which results in SWM::WorkerIsIdle being invoked.
- Step three is just checking that the Clear() totally happened.
Attachment #8777396 - Flags: review?(bugmail) → review+
Pushed by bkelly@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/db014dc92795
P1 Expose a ServiceWorkerRegistrationInfo::IsIdle() method. r=asuth
https://hg.mozilla.org/integration/mozilla-inbound/rev/c50283f07370
P2 Clean up ServiceWorkerManager::StopControllingADocument(). r=asuth
https://hg.mozilla.org/integration/mozilla-inbound/rev/d6a336a1fa97
P3 Delay removing registration until active worker is idle. r=asuth
https://hg.mozilla.org/integration/mozilla-inbound/rev/3e95eb4821b8
P4 Verify that unregister waits for active worker to go idle. r=asuth
(In reply to Ben Kelly [:bkelly] from comment #1)
> I can reproduce.  
> I don't think this is actually related to the basic auth.

I do think that what you are fixing is an important bug, but I'm increasingly convinced that the bug I wanted to report on is different than the one being fixed.

Once there is a stable version (preferably in a nightly build, but if necessary I can build it myself - it has been a while, but I used to be able to build firefox on my machine), I'll follow up (either in this bug or a new bug - I care not) with a reproduction scenario, this time using Python's SimpleHTTPServer.
(In reply to Sam Ruby from comment #22)
> (In reply to Ben Kelly [:bkelly] from comment #1)
> > I can reproduce.  
> > I don't think this is actually related to the basic auth.
> 
> I do think that what you are fixing is an important bug, but I'm
> increasingly convinced that the bug I wanted to report on is different than
> the one being fixed.
> 
> Once there is a stable version (preferably in a nightly build, but if
> necessary I can build it myself - it has been a while, but I used to be able
> to build firefox on my machine), I'll follow up (either in this bug or a new
> bug - I care not) with a reproduction scenario, this time using Python's
> SimpleHTTPServer.

Please do!  I was not aware you saw another error beyond the corrupted content after pressing "View Complete Demo Again".

If you run into more problems, please file them as a new bug and need-info me.  Thanks!
Will do once this lands in a build.

Meanwhile, the scenario I wanted to report on is one where the cache is loaded correctly, and the service worker correctly returns that data via event.respondWith many times.  All is well and good until the browser is completely closed and then the user restarts the browser and attempts to visit the page.  The result is a 401 response.  There is no opportunity to provide a user and password, just an unconditional authentication error.  In fact, no request is sent to the server in this scenario.

Because the cache is available to the browser (client) process, I can verify that the data in the cache is still correct.
I'll try one of those builds either late tonight or tomorrow (I'm East Coast US).
(In reply to Sam Ruby from comment #25)
> Will do once this lands in a build.
> 
> Meanwhile, the scenario I wanted to report on is one where the cache is
> loaded correctly, and the service worker correctly returns that data via
> event.respondWith many times.  All is well and good until the browser is
> completely closed and then the user restarts the browser and attempts to
> visit the page.  The result is a 401 response.  There is no opportunity to
> provide a user and password, just an unconditional authentication error.  In
> fact, no request is sent to the server in this scenario.

Yes, I can reproduce this.

> Because the cache is available to the browser (client) process, I can verify
> that the data in the cache is still correct.

At least for the offline-fallback demo its not pulling any data from the Cache.  The service worker just proxies the connection through.

I'll file a new bug.  Thanks.
Flags: needinfo?(rubys)
This does not reproduce locally, but my best guess is we are delaying the unregister because we happen to catch the worker in the activating state while its processing the no-op activate event.
(In reply to Ben Kelly [:bkelly] from comment #30)
> This does not reproduce locally, but my best guess is we are delaying the
> unregister because we happen to catch the worker in the activating state
> while its processing the no-op activate event.

Actually, that shouldn't matter.  Because the uninstalling flag should be set.

Except... the spec does not say to ignore registrations with the uninstalling flag set.  I think that is probably a spec bug.  I filed:

https://github.com/slightlyoff/ServiceWorker/issues/943

Let's see if that fixes the issue here.
Comment on attachment 8777975 [details] [diff] [review]
P5 Make navigator.serviceWorker.getRegistrations() ignore uninstalling registrations. r=asuth

The try build indicates this fixes the problem with the previous landing.

I also got preliminary spec approval from Jake on IRC.

Basically we getRegistrations() should be doing the same thing as getRegistration().  A registration that has a pending uninstall should not be returned.
Attachment #8777975 - Flags: review?(bugmail)
Comment on attachment 8778002 [details] [diff] [review]
P6 Add a wpt test case verifying getRegistrations() does not return registrations with a pending uninstall. r=asuth

The additional test case verifies that getRegistrations() hides registrations with pending uninstalls.  It does this by holding a controlled frame alive and calling unregister().  In this situation the registration gets the pending uninstall flag, but does not actually go away.

This test fails without the P5 patch.  It succeeds with the P5 patch.
Attachment #8778002 - Flags: review?(bugmail)
Comment on attachment 8777975 [details] [diff] [review]
P5 Make navigator.serviceWorker.getRegistrations() ignore uninstalling registrations. r=asuth

Review of attachment 8777975 [details] [diff] [review]:
-----------------------------------------------------------------

::: dom/workers/ServiceWorkerManager.cpp
@@ +696,5 @@
>      }
>  
>      for (uint32_t i = 0; i < data->mOrderedScopes.Length(); ++i) {
> +      RefPtr<ServiceWorkerRegistrationInfo> info =
> +        data->mInfos.GetWeak(data->mOrderedScopes[i]);

Is there an idiom for using GetWeak() when we know the result will already exist?

Or an idiom change to favor this style since the pointer gets directly returned (and AddRefed by the refptr) rather than requiring a separate declaration line and manually typing getter_AddRefs for the obscured outval?

(I ask because I don't see a lot of GetWeak in dom/workers/ and "Weak" always raises my suspicions.)
Attachment #8777975 - Flags: review?(bugmail) → review+
Attachment #8778002 - Flags: review?(bugmail) → review+
(In reply to Andrew Sutherland [:asuth] from comment #36)
> > +      RefPtr<ServiceWorkerRegistrationInfo> info =
> > +        data->mInfos.GetWeak(data->mOrderedScopes[i]);
> 
> Is there an idiom for using GetWeak() when we know the result will already
> exist?

Yea, I just wanted to assign to the RefPtr in a single statement.
Pushed by bkelly@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/90006dffbe5b
P1 Expose a ServiceWorkerRegistrationInfo::IsIdle() method. r=asuth
https://hg.mozilla.org/integration/mozilla-inbound/rev/74ceee4dc2f4
P2 Clean up ServiceWorkerManager::StopControllingADocument(). r=asuth
https://hg.mozilla.org/integration/mozilla-inbound/rev/ca998a8cbe48
P3 Delay removing registration until active worker is idle. r=asuth
https://hg.mozilla.org/integration/mozilla-inbound/rev/44b50089c079
P4 Verify that unregister waits for active worker to go idle. r=asuth
https://hg.mozilla.org/integration/mozilla-inbound/rev/c9922c014310
P5 Make navigator.serviceWorker.getRegistrations() ignore uninstalling registrations. r=asuth
https://hg.mozilla.org/integration/mozilla-inbound/rev/aa715b19902b
P6 Add a wpt test case verifying getRegistrations() does not return registrations with a pending uninstall. r=asuth
Given the number of backouts here I don't intend to uplift any more.  This can just ride the trains.  (Once I figure out this latest failure.)
I think the push tests are generally a bit racy in that very few of them wait for the service worker to become active.  Changing the timing could cause these races to finish in different ways.

Let's see if this fixes the problem:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=0c8bf9706520
Comment on attachment 8778958 [details] [diff] [review]
P7 Fixed activation races in push service worker tests. r=kitcambridge

Review of attachment 8778958 [details] [diff] [review]:
-----------------------------------------------------------------

Kit, this patch makes the dom/push/test mochitests consistently wait for service workers to be active.  We were only waiting in a few cases here before.  I think to avoid races we should generally wait for active workers all the time.  It did not seem like any of these tests were purposefully trying to test push against installing or waiting workers.

Even if this does not fix the failures in the last backout its still probably a good thing to land.
Attachment #8778958 - Flags: review?(kcambridge)
Comment on attachment 8778958 [details] [diff] [review]
P7 Fixed activation races in push service worker tests. r=kitcambridge

Review of attachment 8778958 [details] [diff] [review]:
-----------------------------------------------------------------

Ouch. Thanks for fixing these, Ben!

::: dom/push/test/test_utils.js
@@ +235,5 @@
> +      resolve(swr);
> +      return;
> +    }
> +    sw.addEventListener('statechange', function onStateChange(evt) {
> +      if (sw.state === 'activated') {

Just so I understand: this is different than `navigator.serviceWorker.ready` because that resolves if the worker is "activating" or "activated", right?
Attachment #8778958 - Flags: review?(kcambridge) → review+
(In reply to Kit Cambridge [:kitcambridge] from comment #43)
> Just so I understand: this is different than `navigator.serviceWorker.ready`
> because that resolves if the worker is "activating" or "activated", right?

No, I think .ready resolves on activated.  Its more to avoid accidentally using a previously active service worker.  If you already have an activated service worker .ready resolves immediately.  So after a .register() you usually want to wait for .installing to really test your new worker.
Also, .ready doesn't let you wait for a different scope activation which some of these tests needed.
Ugh... After 24 hours I finally got two winxp test results.  One of them is a failure.
Although it failed in a different way.  Now its timing out right at the start.  The previous test exercises our "long running SW, kill it with fire" code paths.  I suspect we are leaving a token count non zero there which make the SW seem like its non-idle forever.  This then prevents the previous worker from every being completely evicted.
In the winxp failures I get these additional error messages in test_serviceworker_lifetime.html:

 14:16:48     INFO -  [336] WARNING: '!JS::AddPromiseReactions(cx, promiseObj, resolveFunc, rejectFunc)', file c:/builds/moz2_slave/try-w32-d-00000000000000000000/build/src/dom/promise/Promise.cpp, line 824

14:16:48 INFO - [336] WARNING: A runnable was posted to a worker that is already shutting down!: file c:/builds/moz2_slave/try-w32-d-00000000000000000000/build/src/dom/workers/WorkerPrivate.cpp, line 2443
Ugh, I think this is a race in worker holder code.  I'm going to file a separate bug.
Depends on: 1293690
I am fairly confident bug 1293690 is the issue here.  I added some NS_ASSERTION(false) calls to log stacks and managed to trigger the same timing that the winxp automation is hitting.  My patch for the other bug seems to fix it.  Will push a new try build in a bit.
The try build has all green winxp M(3) results with bug 1293690 in the mix.  Just have to wait for baku to come back from PTO next week to review it.
And indeed, I see it hitting the expected error path:

20:02:57     INFO -  [644] WARNING: '!JS::AddPromiseReactions(cx, promiseObj, resolveFunc, rejectFunc)', file c:/builds/moz2_slave/try-w32-d-00000000000000000000/build/src/dom/promise/Promise.cpp, line 824
20:02:57     INFO -  [644] WARNING: '!ref->UseWorkerHolder()', file c:/builds/moz2_slave/try-w32-d-00000000000000000000/build/src/dom/workers/ServiceWorkerPrivate.cpp, line 342
20:02:57     INFO -  [644] WARNING: '!handler', file c:/builds/moz2_slave/try-w32-d-00000000000000000000/build/src/dom/workers/ServiceWorkerPrivate.cpp, line 379
20:02:57     INFO -  [644] WARNING: A runnable was posted to a worker that is already shutting down!: file c:/builds/moz2_slave/try-w32-d-00000000000000000000/build/src/dom/workers/WorkerPrivate.cpp, line 2443

The patches in bug 1293690 make the UseWorkerHolder() call fail here if the worker is already terminated.
Pushed by bkelly@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/6a4e8c92faae
P1 Expose a ServiceWorkerRegistrationInfo::IsIdle() method. r=asuth
https://hg.mozilla.org/integration/mozilla-inbound/rev/eb7da0dbfc72
P2 Clean up ServiceWorkerManager::StopControllingADocument(). r=asuth
https://hg.mozilla.org/integration/mozilla-inbound/rev/2d5ad5e9ed20
P3 Delay removing registration until active worker is idle. r=asuth
https://hg.mozilla.org/integration/mozilla-inbound/rev/8e9a9ba04ce5
P4 Verify that unregister waits for active worker to go idle. r=asuth
https://hg.mozilla.org/integration/mozilla-inbound/rev/e0b9c25b6125
P5 Make navigator.serviceWorker.getRegistrations() ignore uninstalling registrations. r=asuth
https://hg.mozilla.org/integration/mozilla-inbound/rev/a81b6ce5c25f
P6 Add a wpt test case verifying getRegistrations() does not return registrations with a pending uninstall. r=asuth
https://hg.mozilla.org/integration/mozilla-inbound/rev/bbc14d107fca
P7 Fixed activation races in push service worker tests. r=kitcambridge
Pushed by bkelly@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/c0c8cd41219c
P1 Expose a ServiceWorkerRegistrationInfo::IsIdle() method. r=asuth
https://hg.mozilla.org/integration/mozilla-inbound/rev/465601c1b933
P2 Clean up ServiceWorkerManager::StopControllingADocument(). r=asuth
https://hg.mozilla.org/integration/mozilla-inbound/rev/eb891c00bc15
P3 Delay removing registration until active worker is idle. r=asuth
https://hg.mozilla.org/integration/mozilla-inbound/rev/508299ab7a22
P4 Verify that unregister waits for active worker to go idle. r=asuth
https://hg.mozilla.org/integration/mozilla-inbound/rev/cf520760c0df
P5 Make navigator.serviceWorker.getRegistrations() ignore uninstalling registrations. r=asuth
https://hg.mozilla.org/integration/mozilla-inbound/rev/8207a64e1cf2
P6 Add a wpt test case verifying getRegistrations() does not return registrations with a pending uninstall. r=asuth
https://hg.mozilla.org/integration/mozilla-inbound/rev/c3b6a4e0898f
P7 Fixed activation races in push service worker tests. r=kitcambridge
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: