Bug 1568597 Comment 16 Edit History

Note: The actual edited comment in the bug view page will always show the original commenter’s name and original timestamp.

(In reply to Razvan Maries from comment #14)
> Backed out for perma failrues.
> 
> Push with failure: https://treeherder.mozilla.org/#/jobs?repo=autoland&resultStatus=pending%2Crunning%2Csuccess%2Ctestfailed%2Cbusted%2Cexception&searchStr=linux%2C18.04%2Cx64%2Cwebrender%2Copt%2Cweb%2Cplatform%2Ctests%2Cwith%2Cfission%2Cenabled%2Ctest-linux1804-64-qr%2Fopt-web-platform-tests-fis-e10s%2Cwpt9&revision=07523f6e83410bf3ae73fc69184ca9ebd28139d7&selectedTaskRun=ULTvOqNYQNGsL8iySz1_iA.0
> 
> Log: https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=313982937&repo=autoland&lineNumber=67643
> 
> Backout: https://hg.mozilla.org/integration/autoland/rev/2ee42ae444dcfa828cac0420c03a5bf2627c86da

Well... the issue behind this wpt test failure (triggered while running with fission enabled) was definitely an interesting one.

I did spent some time investigating that failure yesterday, I was already convinced that the underlying issue was very likely some kind of race
(as additional hint about that was the fact that locally I wasn't able to trigger it in a debug build, but I was able to trigger it consistently enough in an optimized build) but I wanted to see clearly why making the remote workers fully fission aware did start to trigger it, what difference did it make in practice.

The wpt test that did fail is testing the creation of 4 shared workers, the first 2 are expected to be allowed (and their worker principal is going to be secure `https://...` one) and the last 2 are expected to be disallowed (and their worker principal is going to be an insecure `http://...` one).

When fission is enabled (and the patches attached to this issue applied):

- for the first 2, we select the child process with remoteType `"webIsolated=https://..."`, which is the same where the tab from which we do register the SharedWorker and so it is going to be kept alive also by the tab and it is not going to shutdown until we also close the tab

- for the last 2, we select the child process with remoteType `"webIsolated=http://..."`, which does not exist yet and so:
  - the first of the last 2 shared worker is going to let us to pick one of the child process with remoteType `"prealloc"` and change it to remoteType `"webIsolated=http://..."` and spawn the first remote worker in this child process
  - for the second of the last 2 shared worker, if the shared worker is being registered fast enough, we will find that there is already a child process for the remoteType `"webIsolated=http://..."` and so `RemoteWorkerManager::SelectTargetActorForSharedWorker` will select that one, but that child process isn't kept alive also by a tab as the secure "https://..." one and only the remote worker spawned by the previous test case is keeping it alive and so there is a chance that we may be shutdown that process between the time when we selected it and the time when we are able to spawn the new remote worker into it

Turned out that we already had a similar issue even before fission for the service workers (which could already been spawned into a new child process without any tab keeping that child process alive) and it is described in a very nice detailed inline comment here:
- https://searchfox.org/mozilla-central/rev/27932d4e6ebd2f4b8519865dad864c72176e4e3b/dom/workers/remoteworkers/RemoteWorkerManager.cpp#411-423

The [new patch attached in comment 15](https://bugzilla.mozilla.org/attachment.cgi?id=9172387) is basically using the same strategy that we are already using in `RemoteWorkerManager::SelectTargetActorForServiceWorker` as part of the shared worker target actor selection implemented in `RemoteWorkerManager::SelectTargetActorForSharedWorker`.

With this fix applied the wpt test did finally pass consistently on a local opt build (also while running with --verify), I did push it on try to double-check that it is also going to pass on the build infrastructure:

- https://treeherder.mozilla.org/#/jobs?repo=try&revision=5f89f2e3eec2408aeb22588bad4b73af556abc42
(In reply to Razvan Maries from comment #14)
> Backed out for perma failrues.
> 
> Push with failure: https://treeherder.mozilla.org/#/jobs?repo=autoland&resultStatus=pending%2Crunning%2Csuccess%2Ctestfailed%2Cbusted%2Cexception&searchStr=linux%2C18.04%2Cx64%2Cwebrender%2Copt%2Cweb%2Cplatform%2Ctests%2Cwith%2Cfission%2Cenabled%2Ctest-linux1804-64-qr%2Fopt-web-platform-tests-fis-e10s%2Cwpt9&revision=07523f6e83410bf3ae73fc69184ca9ebd28139d7&selectedTaskRun=ULTvOqNYQNGsL8iySz1_iA.0
> 
> Log: https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=313982937&repo=autoland&lineNumber=67643
> 
> Backout: https://hg.mozilla.org/integration/autoland/rev/2ee42ae444dcfa828cac0420c03a5bf2627c86da

Well... the issue behind this wpt test failure (triggered while running with fission enabled) was definitely an interesting one.

I did spent some time investigating that failure yesterday, I was already convinced that the underlying issue was very likely some kind of race
(as additional hint about that was the fact that locally I wasn't able to trigger it in a debug build, but I was able to trigger it consistently enough in an optimized build) but I wanted to see clearly why making the remote workers fully fission aware did start to trigger it, what difference did it make in practice.

The wpt test that did fail is testing the creation of 4 shared workers, the first 2 are expected to be allowed (and their worker principal is going to be secure `https://...` one) and the last 2 are expected to be disallowed (and their worker principal is going to be an insecure `http://...` one).

When fission is enabled (and the patches attached to this issue applied):

- for the first 2, we select the child process with remoteType `"webIsolated=https://..."`, which is the same where the tab from which we do register the SharedWorker and so it is going to be kept alive also by the tab and it is not going to shutdown until we also close the tab

- for the last 2, we select the child process with remoteType `"webIsolated=http://..."`, which does not exist yet and so:
  - the first of the last 2 shared worker is going to let us to pick one of the child process with remoteType `"prealloc"`, change it to remoteType `"webIsolated=http://..."` and spawn the first remote worker in this child process
  - for the second of the last 2 shared worker, if the shared worker is being registered fast enough, we will find that we do have already available a RemoteWorkerService actor from a child process for the remoteType `"webIsolated=http://..."` and so `RemoteWorkerManager::SelectTargetActorForSharedWorker` will select that one, but that child process isn't kept alive also by a tab as the secure "https://..." one and only the remote worker spawned by the previous test case is keeping it alive and so there is a chance that we may be shutdown that process between the time when we selected it and the time when we are able to spawn the new remote worker into it

Turned out that we already had a similar issue even before fission for the service workers (which could already been spawned into a new child process without any tab keeping that child process alive) and it is described in a very nice detailed inline comment here:
- https://searchfox.org/mozilla-central/rev/27932d4e6ebd2f4b8519865dad864c72176e4e3b/dom/workers/remoteworkers/RemoteWorkerManager.cpp#411-423

The [new patch attached in comment 15](https://bugzilla.mozilla.org/attachment.cgi?id=9172387) is basically using the same strategy that we are already using in `RemoteWorkerManager::SelectTargetActorForServiceWorker` as part of the shared worker target actor selection implemented in `RemoteWorkerManager::SelectTargetActorForSharedWorker`.

With this fix applied the wpt test did finally pass consistently on a local opt build (also while running with --verify), I did push it on try to double-check that it is also going to pass on the build infrastructure:

- https://treeherder.mozilla.org/#/jobs?repo=try&revision=5f89f2e3eec2408aeb22588bad4b73af556abc42

Back to Bug 1568597 Comment 16