Service Worker life cycle code should not use mutable shared state on registration across jobs

RESOLVED FIXED in Firefox 45





(Reporter: bkelly, Assigned: bkelly)


(Blocks 1 bug)


Firefox Tracking Flags

(firefox45 fixed)



(8 attachments, 7 obsolete attachments)

See this spec issue comment:

Basically, the spec defines operations like this:

  1) Set the registration script URL to foo
  2) Queue an async task to do something with the registration (assuming script URL is foo)

So you can get:

  1) Do operation 1
    a) set registration script URL to foo
    b) queue operation 1 to run async
  2) Do operation 2
    a) set registration script URL to bar
    b) queue operation 2 to run async
  3) Operation 1 runs async, but oops, it's using bar instead of foo

I think this is a case where the description of what Chrome is doing went off the rails in translation to the spec.  Even if the spec says to do this, we should stop doing it in our code.  We should instead have each job just use its own mScriptURL.

Code that wants to read registration->mScriptURL should instead use newestWorker.scriptURL.  For example, the short circuit in the .register() call that does nothing when registering the same script.

Then we should remove registration->mScriptURL because it's a crazy race condition.

I've suggested spec changes to that effect in the linked spec issue.

What do other people think?
To further clarify, I think we should:

1) Make all ServiceWorkerRegisterJob code use its own mScriptURL
2) If the job is created with an empty mScriptURL, populate it during Start() from the newest worker's script URL.
3) Remove ServiceWorkerRegistrationInfo::mScriptURL.

I tried to describe this in spec language as well, here:
Blocks: 1226443
This is a compat issue, even if we think the spec is wrong here.  I doubt the spec accurately describes what Chrome does.
Matt described Chrome's behavior on github:

I think that's where the spec is going, so I'm going to implement the same here.
Assignee: nobody → bkelly
Eddy, I need to remove the nsIServiceWorkerRegistrationInfo.scriptSpec attribute.  Will this conflict with your plans for the debugger?

Instead of using the script spec for the registration, we should really be using the script specs for the installing/waiting/active workers.
Flags: needinfo?(ejpbruel)
Blocks: 1229795
Updated to not fire a notification that the registration changed before the installing worker is set.  Now that we're not overriding the registration script spec, nothing has changed at this point.
Attachment #8694434 - Attachment is obsolete: true
Depends on: 1230164
Attachment #8694432 - Attachment is obsolete: true
Attachment #8696057 - Flags: review?(ehsan)
For now these patches keep nsIServiceWorkerRegistrationInfo.scriptSpec.  I simply return newest()->mScriptSpec in this case.
Flags: needinfo?(ejpbruel)
Try is green.  So this should be good to go once reviewed.
Attachment #8696054 - Flags: review?(ehsan) → review+
Attachment #8696056 - Flags: review?(ehsan) → review+
Attachment #8696057 - Flags: review?(ehsan) → review+
Attachment #8696058 - Flags: review?(ehsan) → review+
Attachment #8696059 - Flags: review?(ehsan) → review+
Attachment #8696060 - Flags: review?(ehsan) → review+
It seems that this fixes the oranges introduced by bug 1225121.
Blocks: 1225121
Attachment #8696806 - Flags: review?(catalin.badea392)
Attachment #8696806 - Flags: review?(catalin.badea392) → review+
Backed out P7 for warning-as-error bustage:

Backed out the whole lot in

Good luck with landing this, because it's got bustification and makes the tests ungoodish. On the push where I was deciding to back you out, of the 10 linux64 debug w-p-t-e10s-8 runs, two were green, three were failures in fetch-frame-resource.https.html, which I would just star as the normal too-frequent intermittent except that I knew from seeing it all day that you were making that fail much more frequently and more severely (two or three failures per run rather than the usual one), four were other serviceworker failures, and one maybe wasn't even you.

Your most troublesome suites are Linux mochitest-e10s-4, and Linux and Mac web-platform-tests-e10s, -4 on opt and -8 on debug, and I think you're going to have to not only retrigger until you see green, but also retrigger until you can say that you aren't either adding new intermittents or making existing intermittents more frequent.
I believe the issue is that wpt test ServiceWorkerGlobalScope/update.https.html has this code at the top level of its service worker script:

  // update() during the script evaluation should be ignored.

We don't actually ignore update() calls at the top level of the script, so this starts an infinite async loop.  This loop then throws off the timing of all the other tests that follow it.  I think this only really triggers in e10s because of how the PropagateSoftUpdate() mechanism works there.

As far as I can tell this is not quite spec'd.  I'm testing a patch to implement ignoring these updates and I will write a spec issue.

The event source failures in m-e10s-4 still need investigation.
The ServiceWorkerGlobalScope/update.https.html test has multiple issues.  First, it triggers an infinite async loop with that sync update() call.  Even if I fix that, however, it still races.  Sometimes our navigation-triggered update fires before the service worker script's update() call can run.  This confuses the test.

We need to disable this test until we delay the navigation update to post load.
Attachment #8697009 - Flags: review?(ehsan)
Attachment #8697009 - Flags: review?(ehsan) → review+
I think this may just need to land in the same push with bug 1226443 in order to fix the update-on-navigate races.
Backed out in - it's not going to be that easy.

Definite, no question about it increase in the frequency of failures in fetch-frame-resource.https.html, plus the unknown and the unknown, and that's just in a very few runs, not in the 20-30 that I would have triggered on try as just a starter if I were you.