prevent service worker updates from running in two processes at the same time

RESOLVED FIXED in Firefox 54

Status

()

defect
RESOLVED FIXED
2 years ago
2 years ago

People

(Reporter: bkelly, Assigned: baku)

Tracking

unspecified
mozilla55
Points:
---
Dependency tree / graph
Bug Flags:
in-testsuite +

Firefox Tracking Flags

(firefox54 fixed, firefox55 fixed)

Details

Attachments

(3 attachments, 4 obsolete attachments)

Until we land our e10s refactor we have the risk of running service worker updates in 2 processes at once when multi-e10s is enabled.  We should mitigate this in some way.  Some ideas:

1) When a child process would create a ServiceWorkerUpdateJob today it instead sends a message to the parent.  The parent then picks a single child process and sends it the update message.  Then just that child does the update.
2) Child processes need to get a "lock" from the parent process before scheduling a ServiceWorkerUpdateJob.  If the lock is held by another process, then just ignore that update.  (Since another process is doing it.)
Option (1) is probably the least racy and we would get coverage of the path through our normal test.  Option (2) would worry me that we're just not testing the contention condition.
Assignee

Comment 2

2 years ago
Posted patch update.patch (obsolete) — Splinter Review
Flags: needinfo?(bkelly)
Andrea, I'll try to review by the end of the week.  Can you do a try run in the meantime, though?
Flags: needinfo?(bkelly) → needinfo?(amarchesini)
Flags: needinfo?(bkelly)
Assignee

Comment 4

2 years ago
Posted patch update.patch (obsolete) — Splinter Review
https://treeherder.mozilla.org/#/jobs?repo=try&revision=648fc7a771b47a567b6c77259b42539f0ce7658b

It was green except for 1 wpt test. This patch fixes the case where a process dispatches multiple update() operations.
Attachment #8847179 - Attachment is obsolete: true
Flags: needinfo?(amarchesini)
Comment on attachment 8848130 [details] [diff] [review]
update.patch

Review of attachment 8848130 [details] [diff] [review]:
-----------------------------------------------------------------

So, I don't think this actually blocks overlapping updates.  The Updater actor is deleted after starting the update even though the update process is actually asynchronous.  This means another Updater actor can get created shortly after and overlap with the first one.

Overall, this goes to my concern we discuss on IRC.  With a "check for contention and block" approach we have a testing problem.  Its really hard to verify we are doing the right thing when updates conflict.  Its a huge race condition to make it happen.

I still think a better approach would be to make an update operation actor which forwards to the parent and then on to a deterministic process.  We can do this for every update and verify it works.  We know that overlapping updates won't happen due to the design forcing all updates to run in the same process.

::: dom/workers/PServiceWorkerUpdater.ipdl
@@ +11,5 @@
> +{
> +  manager PServiceWorkerManager;
> +
> +parent:
> +  async __delete__();

I don't think this is safe.  You can't delete an actor child-to-parent if its possible a parent-to-child message might be sent at the same time.

::: dom/workers/ServiceWorkerUpdaterChild.cpp
@@ +26,5 @@
> +  } else if (mFailureRunnable) {
> +    mFailureRunnable->Run();
> +  }
> +
> +  Unused << Send__delete__(this);

AFAICT deletion of the Updater actor signals that this process is done updating.  I don't think we can do that here, though.  Yes, we called the runnable's Run() method, but updating is an asynchronous operation.  I don't think this actor should be deleted until the update job is complete.
Attachment #8848130 - Flags: review-
Flags: needinfo?(bkelly)
Assignee

Comment 6

2 years ago
Posted patch part 2 - test (obsolete) — Splinter Review
Assignee

Comment 8

2 years ago
I applied the comments, except for the deletion of the actor: that happens in the 3rd patch.
Attachment #8848130 - Attachment is obsolete: true
Flags: needinfo?(bkelly)
Assignee

Comment 9

2 years ago
Posted patch part 2 - testSplinter Review
Attachment #8849870 - Attachment is obsolete: true
Comment on attachment 8850060 [details] [diff] [review]
part 2 - test

Review of attachment 8850060 [details] [diff] [review]:
-----------------------------------------------------------------

Thanks for writing the test!

::: dom/workers/test/serviceworkers/browser_multie10s_update.js
@@ +69,5 @@
> +    });
> +  });
> +
> +  if (status == 0) {
> +    ok(false, "both succeeded. This is wrong.");

Have you verified this is what happens without your other patches?

@@ +73,5 @@
> +    ok(false, "both succeeded. This is wrong.");
> +  } else if (status == 1) {
> +    ok(true, "one succeded, one failed. This is good.");
> +  } else {
> +    ok(true, "both failed. This is definitely wrong.");

Seems this should be ok(false).
Attachment #8850060 - Flags: review+
Assignee

Comment 11

2 years ago
> Have you verified this is what happens without your other patches?

Yes. It always fails.
Comment on attachment 8849871 [details] [diff] [review]
part 3 - delete the actor when finished

Review of attachment 8849871 [details] [diff] [review]:
-----------------------------------------------------------------

::: dom/workers/ServiceWorkerManager.cpp
@@ +538,5 @@
>      if (mInternalMethod) {
> +      RefPtr<ServiceWorkerUpdaterChild::Callback> callback =
> +        new ServiceWorkerUpdaterChild::Callback(nullptr, mActor);
> +
> +      swm->SoftUpdateInternal(mAttrs, mScope, callback);

ServiceWorkerUpdaterChild::Callback() claims its safe to hold a raw pointer to the actor because the actor holds a ref to the Callback object.  Thats not the case here.

TBH, all these raw pointers in various runnables is making me nervous.  Could we instead do something like:

1) Create a MozPromise that resolves when the update is complete.
2) Pass the MozPromise to the ServiceWorkerUpdaterChild constructor or Init
3) ServiceWorkerUpdateChild() then does MozPromise->Then() to know when to call __delete__ on itself.
4) ServiceWorkerUpdaterChild can also safely handle an unexpected early ActorDestroy() by using a MozPromiseHolder and calling DetachIfPresent().
5) The SWM and update code then passes the ref-counted MozPromise around.  They resolve the promise when the update is complete.

Here is some sample code from my clients API patch:

class ClientHandleOpParent final : public PClientHandleOpParent
{
  MozPromiseRequestHolder<ClientOpPromise> mPromiseRequestHolder;

  ClientSourceParent*
  GetSource() const;

  AbstractThread*
  BackgroundThread() const;

  // PClientHandleOpParent interface
  void
  ActorDestroy(ActorDestroyReason aReason) override;

public:
  ClientHandleOpParent() = default;
  ~ClientHandleOpParent() = default;

  void
  Init(const ClientOpConstructorArgs& aArgs);
};

void
ClientHandleOpParent::ActorDestroy(ActorDestroyReason aReason)
{
  mPromiseRequestHolder.DisconnectIfExists();
}

void
ClientHandleOpParent::Init(const ClientOpConstructorArgs& aArgs)
{
  ClientSourceParent* source = GetSource();
  if (!source) {
    Unused << PClientHandleOpParent::Send__delete__(this, NS_ERROR_ABORT);
    return;
  }

  RefPtr<ClientOpPromise> p =
    source->StartOp(aArgs);

  // Capturing `this` is safe here because we disconnect the promise in
  // ActorDestroy() which ensures neither lambda is called if the actor
  // is destroyed before the source operation completes.
  p->Then(BackgroundThread(), __func__,
      [this] (const ClientOpResult& aResult) {
        mPromiseRequestHolder.Complete();
        Unused << PClientHandleOpParent::Send__delete__(this, aResult);
      }, [this] (const ClientOpResult& aResult) {
        mPromiseRequestHolder.Complete();
        Unused << PClientHandleOpParent::Send__delete__(this, aResult);
    })->Track(mPromiseRequestHolder);
}
Attachment #8849871 - Flags: review-
Flags: needinfo?(bkelly)
Oh, you can use the GenericPromise typedef for the MozPromise.  In my example I have a different return type so I defined my own ClientOpPromise.
Oh, and my example uses an internal method to get its promise to track.  You can easily pass it in instead.
Assignee

Comment 15

2 years ago
Attachment #8849871 - Attachment is obsolete: true
Flags: needinfo?(bkelly)
Comment on attachment 8850557 [details] [diff] [review]
part 3 - delete the actor when finished

Review of attachment 8850557 [details] [diff] [review]:
-----------------------------------------------------------------

Thanks for doing this.  I know the MozPromise stuff can look a bit weird at first, but it does end up pretty clean in the end. r=me with comments addressed.

::: dom/workers/ServiceWorkerManager.cpp
@@ +2909,5 @@
> +class UpdateJobCallback final : public ServiceWorkerJob::Callback
> +{
> +  RefPtr<ServiceWorkerUpdateFinishCallback> mCallback;
> +
> +  ~UpdateJobCallback()

= default

@@ +2932,5 @@
> +      mCallback->UpdateFailed(aStatus);
> +      return;
> +    }
> +
> +    MOZ_ASSERT(aJob->GetType() == ServiceWorkerJob::Type::Update);

MOZ_DIAGNOSTIC_ASSERT for cheap checks like this.

::: dom/workers/ServiceWorkerUpdaterChild.cpp
@@ +22,5 @@
> +
> +  aPromise->Then(AbstractThread::GetCurrent(), __func__,
> +                 [this]() {
> +                   Unused << Send__delete__(this);
> +                 });

So, you need to handle the case where the actor is destroyed before the promise completes.  Otherwise the `this` pointer can become garbage.

The way to do this with MozPromise is:

1) Replace your GenericPromise mPromise with a MozPromiseeRequestHolder<GenericPromise> mPromiseHolder member.

2) Change this code to "Track()" the promise like this:

aPromise->Then(AbstractThread::GetCurrent(), __func__,
  [this]() {
    mPromiseHolder.Complete();
    Unused << Send__delete__(this);
}).Track(mPromiseHolder);

Note I added the Track() at the end and mPromiseHolder.Complete() in your resolution callback.

3) In your destructor or ActorDestroy() call mPromiseHolder.DisconnectIfExists().

The DisconnectIfExists() is the one that makes sure your promise resolution callback is not invoked any more.  So this makes it safe to capture `this`.

@@ +35,5 @@
>      mSuccessRunnable->Run();
>    } else if (mFailureRunnable) {
>      mFailureRunnable->Run();
> +  } else {
> +    Unused << Send__delete__(this);

Could we assert that mFailureRunnable is not nullptr?  I think you could easily make the failure runnable resolve the promise with something like:

  nsCOMPtr<nsIRunnable> failureRunnable =
    NewRunnableMethod(promise, &GenericPromise::Resolve, true, __func__);

This would be nice because then there would only be on Send__delete__() to have to reason about.

::: dom/workers/ServiceWorkerUpdaterChild.h
@@ +25,5 @@
>    mozilla::ipc::IPCResult
>    RecvProceed(const bool& aAllowed) override;
>  
>  private:
> +  RefPtr<GenericPromise> mPromise;

As I mentioned in other comment, use MozPromiseRequestHolder<GenericPromise> here instead.
Attachment #8850557 - Flags: review+
Flags: needinfo?(bkelly)
Attachment #8849872 - Flags: review+

Comment 17

2 years ago
Pushed by amarchesini@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/7474a804fd1f
Bug 1346247 - Avoid race conditions when SW are updated - part 1 - PServiceWorkerUpdater actor, r=bkelly
https://hg.mozilla.org/integration/mozilla-inbound/rev/804f99e05804
Bug 1346247 - Avoid race conditions when SW are updated - part 2 - tests, r=bkelly
https://hg.mozilla.org/integration/mozilla-inbound/rev/77fc6f341fef
Bug 1346247 - Avoid race conditions when SW are updated - part 3 - MozPromise, r=bkelly
Backed out for asserting at MozPromise.h:911:

https://hg.mozilla.org/integration/mozilla-inbound/rev/ee590a4887b7c96568a7bd414551ccf2b02bf515
https://hg.mozilla.org/integration/mozilla-inbound/rev/d1167c61611315834bbf07d13423069a155d1ed0
https://hg.mozilla.org/integration/mozilla-inbound/rev/249037d28c7896fadafe4b58e2458b11a4866cf4

Push with failures: https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&revision=77fc6f341fef9ff125624a9544c683a3d3b3047d&filter-resultStatus=testfailed&filter-resultStatus=busted&filter-resultStatus=exception&filter-resultStatus=retry&filter-resultStatus=usercancel&filter-resultStatus=runnable
Failure log: https://treeherder.mozilla.org/logviewer.html#?job_id=86203074&repo=mozilla-inbound

[task 2017-03-24T13:34:22.979559Z] 13:34:22     INFO - GECKO(2831) | ###!!! [Child][MessageChannel] Error: (msgtype=0x900025,name=PNecko::Msg_RemoveRequestContext) Closed channel: cannot send/recv
[task 2017-03-24T13:34:22.982708Z] 13:34:22     INFO - GECKO(2831) | [Child 2885] WARNING: MsgDropped in ContentChild: file /home/worker/workspace/build/src/dom/ipc/ContentChild.cpp, line 2037
[task 2017-03-24T13:34:22.984189Z] 13:34:22     INFO - GECKO(2831) | ###!!! [Child][MessageChannel] Error: (msgtype=0x900025,name=PNecko::Msg_RemoveRequestContext) Closed channel: cannot send/recv
[task 2017-03-24T13:34:22.991468Z] 13:34:22     INFO - GECKO(2831) | [Child 2885] WARNING: MsgDropped in ContentChild: file /home/worker/workspace/build/src/dom/ipc/ContentChild.cpp, line 2037
[task 2017-03-24T13:34:23.112825Z] 13:34:23     INFO - GECKO(2831) | Assertion failure: !IsPending(), at /home/worker/workspace/build/src/obj-firefox/dist/include/mozilla/MozPromise.h:911
[task 2017-03-24T13:34:23.114762Z] 13:34:23     INFO - GECKO(2831) | #01: ???[/home/worker/workspace/build/application/firefox/libxul.so +0x1731df4]
[task 2017-03-24T13:34:23.117314Z] 13:34:23     INFO - GECKO(2831) | #02: ???[/home/worker/workspace/build/application/firefox/libxul.so +0xb4b8ae]
[task 2017-03-24T13:34:23.118656Z] 13:34:23     INFO - GECKO(2831) | #03: ???[/home/worker/workspace/build/application/firefox/libxul.so +0x234a1c8]
[task 2017-03-24T13:34:23.123008Z] 13:34:23     INFO - GECKO(2831) | #04: ???[/home/worker/workspace/build/application/firefox/libxul.so +0x234a1ec]
[task 2017-03-24T13:34:23.123406Z] 13:34:23     INFO - GECKO(2831) | #05: ???[/home/worker/workspace/build/application/firefox/libxul.so +0xb5a609]
[task 2017-03-24T13:34:23.125767Z] 13:34:23     INFO - GECKO(2831) | #06: ???[/home/worker/workspace/build/application/firefox/libxul.so +0x2376c64]
[task 2017-03-24T13:34:23.127551Z] 13:34:23     INFO - GECKO(2831) | #07: ???[/home/worker/workspace/build/application/firefox/libxul.so +0x2376caa]
[task 2017-03-24T13:34:23.130048Z] 13:34:23     INFO - GECKO(2831) | #08: ???[/home/worker/workspace/build/application/firefox/libxul.so +0x23424ef]
[task 2017-03-24T13:34:23.133093Z] 13:34:23     INFO - GECKO(2831) | #09: ???[/home/worker/workspace/build/application/firefox/libxul.so +0x100fae4]
[task 2017-03-24T13:34:23.138087Z] 13:34:23     INFO - GECKO(2831) | #10: ???[/home/worker/workspace/build/application/firefox/libxul.so +0x101a1f1]
[task 2017-03-24T13:34:23.142718Z] 13:34:23     INFO - GECKO(2831) | #11: ???[/home/worker/workspace/build/application/firefox/libxul.so +0xef4e99]
[task 2017-03-24T13:34:23.145373Z] 13:34:23     INFO - GECKO(2831) | #12: ???[/home/worker/workspace/build/application/firefox/libxul.so +0xefd2a9]
[task 2017-03-24T13:34:23.152815Z] 13:34:23     INFO - GECKO(2831) | #13: ???[/home/worker/workspace/build/application/firefox/libxul.so +0xeff4e4]
[task 2017-03-24T13:34:23.155678Z] 13:34:23     INFO - GECKO(2831) | #14: ???[/home/worker/workspace/build/application/firefox/libxul.so +0xeff657]
[task 2017-03-24T13:34:23.159746Z] 13:34:23     INFO - GECKO(2831) | #15: ???[/home/worker/workspace/build/application/firefox/libxul.so +0xb5c470]
[task 2017-03-24T13:34:23.164060Z] 13:34:23     INFO - GECKO(2831) | #16: ???[/home/worker/workspace/build/application/firefox/libxul.so +0xb5e4b9]
[task 2017-03-24T13:34:23.166161Z] 13:34:23     INFO - GECKO(2831) | #17: ???[/home/worker/workspace/build/application/firefox/libxul.so +0xb75ef9]
[task 2017-03-24T13:34:23.168021Z] 13:34:23     INFO - GECKO(2831) | #18: ???[/home/worker/workspace/build/application/firefox/libxul.so +0x32ea87a]
[task 2017-03-24T13:34:23.170000Z] 13:34:23     INFO - GECKO(2831) | #19: ???[/home/worker/workspace/build/application/firefox/libxul.so +0xef3015]
[task 2017-03-24T13:34:23.172025Z] 13:34:23     INFO - GECKO(2831) | #20: ???[/home/worker/workspace/build/application/firefox/libxul.so +0x32eb243]
[task 2017-03-24T13:34:23.178923Z] 13:34:23     INFO - GECKO(2831) | #21: ???[/home/worker/workspace/build/application/firefox/firefox +0x624d]
[task 2017-03-24T13:34:23.181059Z] 13:34:23     INFO - GECKO(2831) | #22: ???[/home/worker/workspace/build/application/firefox/firefox +0x5c8c]
[task 2017-03-24T13:34:23.185031Z] 13:34:23     INFO - GECKO(2831) | #23: __libc_start_main[/lib/x86_64-linux-gnu/libc.so.6 +0x217ed]
[task 2017-03-24T13:34:23.186953Z] 13:34:23     INFO - GECKO(2831) | #24: ???[/home/worker/workspace/build/application/firefox/firefox +0x5e81]
Flags: needinfo?(amarchesini)
Assertion failure: !IsPending(), at /home/worker/workspace/build/src/obj-firefox/dist/include/mozilla/MozPromise.h:911

I'm pretty sure this means the promise was destroyed without resolving or rejecting it.  You probably want to make ~PromiseResolverCallback() reject the promise with an abort error if mPromise is not nullptr.

Comment 20

2 years ago
Pushed by amarchesini@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/5a59303fbb6e
Avoid race conditions when SW are updated - part 1 - PServiceWorkerUpdater actor, r=bkelly
https://hg.mozilla.org/integration/mozilla-inbound/rev/01cbf41397bf
Avoid race conditions when SW are updated - part 2 - tests, r=bkelly
https://hg.mozilla.org/integration/mozilla-inbound/rev/441ae6e33457
Avoid race conditions when SW are updated - part 3 - MozPromise, r=bkelly
Assignee

Updated

2 years ago
Flags: needinfo?(amarchesini)

Comment 21

2 years ago
bugherder
https://hg.mozilla.org/mozilla-central/rev/5a59303fbb6e
https://hg.mozilla.org/mozilla-central/rev/01cbf41397bf
https://hg.mozilla.org/mozilla-central/rev/441ae6e33457
Status: ASSIGNED → RESOLVED
Last Resolved: 2 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla55
:Baku Are you feeling confident in this change at this point? If so, can you nom this for Aurora uplift? We are hoping to have this done by Friday, March 31st. Thank you.
Flags: needinfo?(amarchesini)
Assignee

Comment 23

2 years ago
Comment on attachment 8849872 [details] [diff] [review]
part 1 - the actor

Approval Request Comment
[Feature/Bug causing the regression]: multi-e10s requires this set of patches
[User impact if declined]: multi-e10s + sw could have racy code.
[Is this code covered by automated tests?]: yes
[Has the fix been verified in Nightly?]: not yet
[Needs manual test from QE? If yes, steps to reproduce]: none 
[List of other uplifts needed for the feature/fix]: just these 2 patches
[Is the change risky?]: low
[Why is the change risky/not risky?]: here we introduce a new protocol to manage a lock in order to avoid race condition in update() calls.
[String changes made/needed]: none
Flags: needinfo?(amarchesini)
Attachment #8849872 - Flags: approval-mozilla-aurora?
Assignee

Updated

2 years ago
Attachment #8850060 - Flags: approval-mozilla-aurora?
Assignee

Updated

2 years ago
Attachment #8850557 - Flags: approval-mozilla-aurora?
Comment on attachment 8849872 [details] [diff] [review]
part 1 - the actor

fix a race with serviceworkers and e10s-multi, aurora54+
Attachment #8849872 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Attachment #8850060 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Attachment #8850557 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
You need to log in before you can comment on or make changes to this bug.