Closed Bug 1013571 Opened 6 years ago Closed 6 years ago

support PBackground on workers

Categories

(Core :: DOM: Workers, defect)

30 Branch
defect
Not set

Tracking

()

RESOLVED FIXED
mozilla33

People

(Reporter: bkelly, Assigned: bkelly)

References

Details

Attachments

(1 file, 25 obsolete files)

23.60 KB, patch
bent.mozilla
: review+
Details | Diff | Splinter Review
In bug 940273 we'd like to use PBackground to implement the ServiceWorker cache interface between the child and parent processes.  Currently PBackground is not available on worker threads.

From talking with Ben Turner about this, it appears the main issue is that PBackground uses a thread-local to store some state.  Since workers reuse and share existing threads, this thread-local must be cleaned up to avoid having IPC messages routed to the wrong worker.
It seems like RuntimeService::WorkerThread::SetWorker() may be a good place to teardown the previous PBackground child and setup the new PBackground child:

http://dxr.mozilla.org/mozilla-central/source/dom/workers/RuntimeService.cpp#2377
Work-in-progress to allow the background child to be detached from the current thread.

Ben, is there anything I need to do other than null out the thread local and let the registered destructor run?
Attachment #8426549 - Flags: feedback?(bent.mozilla)
Work-in-progress to enable PBackground in workers.  I'm running into a fun link error on windows at the moment, though, so I'm not sure if this actually works yet or not.

It links on linux, but I don't have a convenient way to do adhoc testing on my headless linux VM.  Here is a try build to see if it completely blows up:

https://tbpl.mozilla.org/?tree=Try&rev=ccf1ecc1a7f1
Comment on attachment 8426549 [details] [diff] [review]
Part 1: Support detaching PBackground from current thread. (v0)

Review of attachment 8426549 [details] [diff] [review]:
-----------------------------------------------------------------

I believe that this is the basic approach that should work, yes.

However, it's unclear to me what happens when Close() is called on a channel when there are pending messages in the queue... This will need some testing because I expect that there will be problems.
Attachment #8426549 - Flags: feedback?(bent.mozilla) → feedback+
Comment on attachment 8426554 [details] [diff] [review]
Part 2: Enable PBackground in workers. (v0)

Review of attachment 8426554 [details] [diff] [review]:
-----------------------------------------------------------------

For workers I think that we shouldn't do a 'primer' pattern (like we do with the main thread of child processes) but instead explicitly wait for the actor to be connected before we run the worker's script. It should only be a slight startup delay but then the code that uses such actors can be much simpler by assuming that it always has already been connected.
Fix a worker shutdown crash by making ChildImpl use thread safe reference counting.

I still need to verify if this works when closing a channel with pending messages.
Attachment #8426549 - Attachment is obsolete: true
Delay worker startup until after PBackground actor has been created.  This was necessary not just for the reasons discussed in comment 5, but also to avoid assertions about running non-worker runnables on the worker thread after start.

Note, I currently have an extra bounce to the main thread here to make initialization work with the current assertions.  See the big XXX comment in the patch for details.  I hope to tweak the assertions around with an extra state variable to avoid this.
Attachment #8426554 - Attachment is obsolete: true
> Note, I currently have an extra bounce to the main thread here to make
> initialization work with the current assertions.  See the big XXX comment in
> the patch for details.  I hope to tweak the assertions around with an extra
> state variable to avoid this.

Fix this by treating |mWorkerPrivate == null && PR_CurrentThread() == mThread| as legal in WorkerThread::Dispatch().
Attachment #8428269 - Attachment is obsolete: true
Hrm, you shouldn't need to do any thread bouncing here I don't think...
So these are the current steps as I think I have them arranged:

1) RuntimeService::ScheduleWorker() schedules a WorkerThreadInitPBackgroundRunnable() to run on the selected WorkerThread.
2) WorkerThreadInitPBackgroundRunnable() calls BackgroundChild::GetOrCreateForCurrentThread() from the WorkerThread.  A WorkerBackgroundChildCallback() is registered with the WorkerThread as the target.
3) ChildImpl::GetOrCreateForCurrentThread() calls OpenProtocolOnMainThread() from the WorkerThread.
4) The actor protocol gets opened on the main thread.
5) WorkerBackgroundChildCallback is called on the WorkerThread.
6) WorkerBackgroundChildCallback dispatches a WorkerThreadPrimaryRunnable to the WorkerThread.
7) WorkerThreadPrimaryRunnable executes WorkerPrivate::DoRunLoop() for the WorkerThread.

The issue described in comment 7 was that I initially could not perform step (6) with hitting an assert.  This was due to WorkerThread::Dispatch() expecting mWorkerPrivate to be set before any runnable is dispatched from its own thread.  I worked around this with a main thread bounce.

In attachment 8429382 [details] [diff] [review] I loosened this restriction a bit so that the main thread bounce is not needed.

Does this all sound correct/ok?
Allow IPC DoWorkRunnable objects to be canceled.  This is a requirement for runnables executed on active worker threads.

Right now the Cancel() method is implemented in a somewhat lame manner.  I prevent the Run() method from executing if cancel was previously called, but I don't terminate nested execution within the MessageLoop.
When ENABLE_TESTS is compiled in and pbackground.testing is true, then send PBackgroundTest messages at the beginning and end of worker threads.

Right now this uses a fixed test string, but I plan to make it random or unique per-thread in order to help verify that message routing works correctly.

Note that the test messages at the end of the worker thread do trigger these warnings on the console:

###!!! [Parent][MessageChannel] Error: Channel closing: too late to send/recv, messages will be lost

On b2g these are generally deemed harmless.  Is that the case here as well?

I can drain the MessageChannel on worker close if necessary, but it would be nice to avoid the added complexity.
Fix the build. Seems perhaps ./mach build binaries was not enough to catch that for some reason.

https://tbpl.mozilla.org/?tree=Try&rev=fd772b7a06f8
Attachment #8429578 - Attachment is obsolete: true
When detaching the PBackground thread local, explicitly flush the pending queue before closing the actor.  I'm not sure this really buys us much, though, as there will still be a race between the flush and the Close().
Attachment #8428267 - Attachment is obsolete: true
Attachment #8429382 - Attachment is obsolete: true
Randomize PBackgroundTest constructor values to help catch any cross-posting of messages.
Attachment #8429582 - Attachment is obsolete: true
Ben, do you have any thoughts on how to further test PBackground on workers?

Currently I am re-using PBackgroundTest and simply sending a test message at worker startup and shutdown.

Anything else you think I am missing before flagging for review?
Flags: needinfo?(bent.mozilla)
Hmm, this try build suggests there might be a problem in Win7 debug M2 with the patches:

 https://tbpl.mozilla.org/?tree=Try&rev=67265addfa29

An assertion failure in:

 /tests/dom/datastore/tests/test_sync_worker.html
That try also shows OSX 10.6 M5 crashing in DeadlockDetector::CheckAquisition() which was called under MessageChannel::OnMaybeDequeueOne().
I'm having a hard time seeing how this error could be caused by PBackground changes since its in PContentParent's destructor:

  https://tbpl.mozilla.org/php/getParsedLog.php?id=40567673&tree=Try&full=1#error2

Should the IToplevelProtocol LinkedList be protected by a mutex?
(In reply to Ben Kelly [:bkelly] from comment #21)
> Should the IToplevelProtocol LinkedList be protected by a mutex?

This sounds a lot like bug 976479 ...
(In reply to Ben Kelly [:bkelly] from comment #20)
> That try also shows OSX 10.6 M5 crashing in
> DeadlockDetector::CheckAquisition() which was called under
> MessageChannel::OnMaybeDequeueOne().

This looks like bug 966972, so probably unrelated to these changes.
See Also: → 1017639
(In reply to Ben Kelly [:bkelly] from comment #21)
> I'm having a hard time seeing how this error could be caused by PBackground
> changes since its in PContentParent's destructor:
> 
>  
> https://tbpl.mozilla.org/php/getParsedLog.
> php?id=40567673&tree=Try&full=1#error2
> 
> Should the IToplevelProtocol LinkedList be protected by a mutex?

As far as I can tell this is a case of a PContentParent being cycle collected before all of its dependent, open actors have been released.  This seems unrelated to the changes here so I opened bug 1017639 to track that issue.  Its very low frequency (hasn't reproduced after 20 pushes) but it would be worth fixing if we can.
Since try looks relatively green I'm going to go ahead and ask for review here.
Flags: needinfo?(bent.mozilla)
Attachment #8430088 - Flags: review?(bent.mozilla)
Attachment #8430103 - Flags: review?(bent.mozilla)
Attachment #8429674 - Flags: review?(bent.mozilla)
Attachment #8430104 - Flags: review?(bent.mozilla)
Comment on attachment 8430088 [details] [diff] [review]
Part 1: Support detaching PBackground from current thread. (v2)

Review of attachment 8430088 [details] [diff] [review]:
-----------------------------------------------------------------

::: ipc/glue/BackgroundImpl.cpp
@@ +377,5 @@
>    {
>      AssertIsOnMainThread();
>    }
>  
> +  NS_INLINE_DECL_THREADSAFE_REFCOUNTING(ChildImpl)

This shouldn't be necessary... How is this ever being touched on a different thread?

@@ +1662,5 @@
> +  auto threadLocalInfo =
> +    static_cast<ThreadLocalInfo*>(PR_GetThreadPrivate(sThreadLocalIndex));
> +  if (threadLocalInfo) {
> +    if (threadLocalInfo->mActor) {
> +      threadLocalInfo->mActor->FlushPendingInterruptQueue();

Splitting patches like this is hard to follow...
Attachment #8430088 - Flags: review?(bent.mozilla) → review-
Comment on attachment 8430088 [details] [diff] [review]
Part 1: Support detaching PBackground from current thread. (v2)

Review of attachment 8430088 [details] [diff] [review]:
-----------------------------------------------------------------

::: ipc/glue/BackgroundChild.h
@@ +57,5 @@
>    GetOrCreateForCurrentThread(nsIIPCBackgroundChildCreateCallback* aCallback);
>  
> +  // See above.
> +  static void
> +  DetachFromCurrentThread();

Also let's call this something like "CloseForCurrentThread"... Detaching implies something else.
(In reply to ben turner [:bent] (use the needinfo? flag!) from comment #26)
> ::: ipc/glue/BackgroundImpl.cpp
> @@ +377,5 @@
> >    {
> >      AssertIsOnMainThread();
> >    }
> >  
> > +  NS_INLINE_DECL_THREADSAFE_REFCOUNTING(ChildImpl)
> 
> This shouldn't be necessary... How is this ever being touched on a different
> thread?

I got consistent crashes when clearing the thread local without this.  I believe it was dying on this line:

  http://dxr.mozilla.org/mozilla-central/source/xpcom/glue/nsISupportsImpl.h#429
Comment on attachment 8430103 [details] [diff] [review]
Part 2: Enable PBackground in workers. (v3)

Review of attachment 8430103 [details] [diff] [review]:
-----------------------------------------------------------------

::: dom/workers/RuntimeService.cpp
@@ +907,5 @@
>  private:
>    WorkerPrivate* mWorkerPrivate;
>  };
>  
> +class WorkerThreadInitPBackgroundRunnable MOZ_FINAL : public nsRunnable

Rather than splitting this process into two runnables I think it would be much simpler to have WorkerThreadPrimaryRunnable::Run() just use NS_ProcessNextEvent and spin a nested loop waiting for the actor to be created.

Something like:

  WorkerThreadPrimaryRunnable::Run()
  {
    // ...

    bool done;
    bool success;
    nsCOMPtr<nsIIPCBackgroundChildCreateCallback> callback =
      new WorkerBackgroundChildCallback(&done, &success);

    if (!BackgroundChild::GetOrCreateForCurrentThread(callback)) {
      // Handle failure.
    }

    while (!done) {
      if (NS_WARN_IF(!NS_ProcessNextEvent(mThread))) {
        // Handle failure.
      }
    }

    if (!success) {
      // Handle failure.
    }

    // ...
  }

Does that make sense?

@@ +2683,5 @@
>  
>    // It is no longer safe to touch mWorkerPrivate.
>    mWorkerPrivate = nullptr;
>  
> +  BackgroundChild::DetachFromCurrentThread();

Hrm, I think it might be safer to call this earlier... Too many messages may go to components that expect the runtime to exist. Maybe move this to right before we unregister the runtime from the sampler?
Attachment #8430103 - Flags: review?(bent.mozilla) → review-
(In reply to Ben Kelly (PTO, back Mon June 9) [:bkelly] from comment #28)
> I got consistent crashes when clearing the thread local without this.

Hrm, I guess we'll need stacks.
(In reply to ben turner [:bent] (use the needinfo? flag!) from comment #30)
> (In reply to Ben Kelly (PTO, back Mon June 9) [:bkelly] from comment #28)
> > I got consistent crashes when clearing the thread local without this.
> 
> Hrm, I guess we'll need stacks.

I got this without the threadsafe refcounting on my mac closing a tab:

Hit MOZ_CRASH(ChildImpl not thread-safe) at /Users/bkelly/devel/hg/mozilla-central/ipc/glue/BackgroundImpl.cpp:381
PR_SetThreadPrivate+0x000000A6 [/Users/bkelly/devel/hg/mozilla-central/obj-x86_64-apple-darwin13.2.0-debug/dist/NightlyDebug.app/Contents/MacOS/libnss3.dylib +0x002159A6]
Hit MOZ_CRASH(ChildImpl not thread-safe) at /Users/bkelly/devel/hg/mozilla-central/ipc/glue/BackgroundImpl.cpp:381
Hit MOZ_CRASH(ChildImpl not thread-safe) at /Users/bkelly/devel/hg/mozilla-central/ipc/glue/BackgroundImpl.cpp:381
mozilla::ipc::BackgroundChild::DetachFromCurrentThread()+0x00000035 [/Users/bkelly/devel/hg/mozilla-central/obj-x86_64-apple-darwin13.2.0-debug/dist/NightlyDebug.app/Contents/MacOS/XUL +0x003FE495]
PR_SetThreadPrivate+0x000000A6 [/Users/bkelly/devel/hg/mozilla-central/obj-x86_64-apple-darwin13.2.0-debug/dist/NightlyDebug.app/Contents/MacOS/libnss3.dylib +0x002159A6]
PR_SetThreadPrivate+0x000000A6 [/Users/bkelly/devel/hg/mozilla-central/obj-x86_64-apple-darwin13.2.0-debug/dist/NightlyDebug.app/Contents/MacOS/libnss3.dylib +0x002159A6]
(anonymous namespace)::WorkerThreadPrimaryRunnable::Run()+0x00000790 [/Users/bkelly/devel/hg/mozilla-central/obj-x86_64-apple-darwin13.2.0-debug/dist/NightlyDebug.app/Contents/MacOS/XUL +0x016930D0]
mozilla::ipc::BackgroundChild::DetachFromCurrentThread()+0x00000035 [/Users/bkelly/devel/hg/mozilla-central/obj-x86_64-apple-darwin13.2.0-debug/dist/NightlyDebug.app/Contents/MacOS/XUL +0x003FE495]
mozilla::ipc::BackgroundChild::DetachFromCurrentThread()+0x00000035 [/Users/bkelly/devel/hg/mozilla-central/obj-x86_64-apple-darwin13.2.0-debug/dist/NightlyDebug.app/Contents/MacOS/XUL +0x003FE495]
nsThread::ProcessNextEvent(bool, bool*)+0x0000050B [/Users/bkelly/devel/hg/mozilla-central/obj-x86_64-apple-darwin13.2.0-debug/dist/NightlyDebug.app/Contents/MacOS/XUL +0x000B819B]
(anonymous namespace)::WorkerThreadPrimaryRunnable::Run()+0x00000790 [/Users/bkelly/devel/hg/mozilla-central/obj-x86_64-apple-darwin13.2.0-debug/dist/NightlyDebug.app/Contents/MacOS/XUL +0x016930D0]
(anonymous namespace)::WorkerThreadPrimaryRunnable::Run()+0x00000790 [/Users/bkelly/devel/hg/mozilla-central/obj-x86_64-apple-darwin13.2.0-debug/dist/NightlyDebug.app/Contents/MacOS/XUL +0x016930D0]
NS_ProcessNextEvent(nsIThread*, bool)+0x00000033 [/Users/bkelly/devel/hg/mozilla-central/obj-x86_64-apple-darwin13.2.0-debug/dist/NightlyDebug.app/Contents/MacOS/XUL +0x00020293]
I think I understand why we encounter MOZ_CRASH() when using the non-threadsafe NS_INLINE_DECL_REFCOUNTING macro in ChildImpl.

1) GetOrCreateForCurrentThread() is called on the worker thread
2) CreateActorRunnable is dispatched to the main thread
3) ChildImpl is constructed on the main thread
4) ChildImpl's owning thread is set to the main thread
5) Any call to AddRef() or Release() from the worker thread will trigger MOZ_CRASH() via NS_ASSERT_OWNINGTHREAD()

Right now we can trigger a ChildImpl::AddRef() just by calling GetForCurrentThread().  It seems we can avoid this by using threadLocalInfo->mActor.get() instead of just returning mActor.

We then just have to handle the release side of things.  It seems we must do that by dispatching a runnable to the main thread to destroy the actor.

I'll see if I can get that working.  To be honest, though, I'm not sure what the advantage of this is over just using thread safe reference counting.
(In reply to Ben Kelly [:bkelly] from comment #32)
> Right now we can trigger a ChildImpl::AddRef() just by calling
> GetForCurrentThread().  It seems we can avoid this by using
> threadLocalInfo->mActor.get() instead of just returning mActor.

That isn't possible, we'd be leaking all over. Returning a nsRefPtr like that will convert it to the raw pointer type and will not add a reference.

> We then just have to handle the release side of things.  It seems we must do
> that by dispatching a runnable to the main thread to destroy the actor.

Yes, from my read this should be correct. In general it's always best to create and destroy things on the same thread.
(In reply to ben turner [:bent] (use the needinfo? flag!) from comment #33)
> (In reply to Ben Kelly [:bkelly] from comment #32)
> > Right now we can trigger a ChildImpl::AddRef() just by calling
> > GetForCurrentThread().  It seems we can avoid this by using
> > threadLocalInfo->mActor.get() instead of just returning mActor.
> 
> That isn't possible, we'd be leaking all over. Returning a nsRefPtr like
> that will convert it to the raw pointer type and will not add a reference.

It seems that the conditional operator causes a temporary nsRefPtr object to be created here.  If I change the code to use an if-statement I can avoid the temporary.  Alternatively, using .get() avoids the AddRef() call.
(In reply to Ben Kelly [:bkelly] from comment #34)
> It seems that the conditional operator causes a temporary nsRefPtr object to
> be created here.

Oh, ick. That's subtle and horrible.

> If I change the code to use an if-statement I can avoid
> the temporary.

Yes, this one please.
Updated part 1 that reverts back to using non-threadsafe refcounts for ChildImpl.  Ben suggested the code to run the ChildImpl::Release on the main thread.

I'll address the remaining review feedback and submit a combined patch for review tomorrow.
Attachment #8430088 - Attachment is obsolete: true
This addresses previous review feedback and combines the previous parts into a single patch.

https://tbpl.mozilla.org/?tree=Try&rev=fa1d4d2ff3ad
Attachment #8429674 - Attachment is obsolete: true
Attachment #8430103 - Attachment is obsolete: true
Attachment #8430104 - Attachment is obsolete: true
Attachment #8437232 - Attachment is obsolete: true
Attachment #8429674 - Flags: review?(bent.mozilla)
Attachment #8430104 - Flags: review?(bent.mozilla)
Attachment #8437955 - Flags: review?(bent.mozilla)
It occurs to me I could possibly address the XXX comment in MessagePump.cpp by setting |loop->SetNestableTasksAllowed(false)| in DoWorkRunnable::Cancel().  That's assuming I can get the right loop without running on the thread.
Updated with a few things I was thinking about last night:

1) Assert that ChildImpl::ActorDestroy() has been called in the thread local destructor in addition to ~ChildImpl.  This avoids racing with the main thread.

2) Don't MOZ_CRASH if we fail to create the PBackground actor.  Instead return NS_ERROR_FAILURE from the PrimaryWorkerRunnable::Run() similar to what is down if the context can't be created.
Attachment #8437955 - Attachment is obsolete: true
Attachment #8437955 - Flags: review?(bent.mozilla)
Attachment #8438678 - Flags: review?(bent.mozilla)
Comment on attachment 8438678 [details] [diff] [review]
Support PBackground on workers. (merged v1)

Review of attachment 8438678 [details] [diff] [review]:
-----------------------------------------------------------------

This looks really good! The big thing I think that still needs to be worked on is the lengthy comment in BackgroundImpl.cpp

::: dom/workers/RuntimeService.cpp
@@ +62,5 @@
>  #include "SharedWorker.h"
>  #include "WorkerPrivate.h"
>  #include "WorkerRunnable.h"
> +#include "nsIIPCBackgroundChildCreateCallback.h"
> +#include "BackgroundChild.h"

Nit: Please alphabetize these

@@ +917,5 @@
>  
> +class WorkerBackgroundChildCallback MOZ_FINAL :
> +  public nsIIPCBackgroundChildCreateCallback
> +{
> +  bool& mDone;

Nit: I prefer pointers to refs in cases like this because it makes it obvious at the call site that you're modifying the value (e.g. |new WorkerBackgroundChildCallback(&done)|)

@@ +1094,5 @@
> +  {
> +    using mozilla::ipc::BackgroundChild;
> +    using mozilla::ipc::PBackgroundChild;
> +    if (gTestPBackground) {
> +      // Randomize value to validate workers are not cross-posting messages.

Maybe also append the value of PR_GetCurrentThread()?

@@ +1096,5 @@
> +    using mozilla::ipc::PBackgroundChild;
> +    if (gTestPBackground) {
> +      // Randomize value to validate workers are not cross-posting messages.
> +      uint32_t testValue;
> +      PR_GetRandomNoise(&testValue, sizeof(testValue));

Might as well MOZ_RELEASE_ASSERT that PR_GetRandomNoise returns the correct size.

@@ -1537,5 @@
>      return false;
>    }
>  
> -#ifdef DEBUG
> -  thread->SetAcceptingNonWorkerRunnables(false);

I'd actually prefer that we leave this here, and then set true (and then false again) in WorkerThreadPrimaryRunnable::Run() before and after calling SynchronouslyCreatePBackground().

That will reduce the amount of time that something could slip in without triggering the assertions.

@@ +2564,5 @@
>    mWorkerPrivate->AssertIsOnWorkerThread();
> +
> +  // If the PBackground child is not created yet, then we must permit
> +  // blocking event processing to support SynchronouslyCreatePBackground().
> +  MOZ_ASSERT(!aMayWait || !BackgroundChild::GetForCurrentThread());

Hrm, if we're waiting on PBackground then we shouldn't call into mWorkerPrivate here since it's not yet set up. That's probably fine today but I don't want that to change in the future. How about we change this to:

  if (aMayWait) {
    MOZ_ASSERT(aRecursionDepth == 2);
    MOZ_ASSERT(!BackgroundChild::GetForCurrentThread());
    return NS_OK;
  }

@@ +2640,5 @@
> +  if (NS_WARN_IF(NS_FAILED(rv))) {
> +    // XXX need to fire an error at parent.
> +    return rv;
> +  }
> +#ifdef ENABLE_TESTS

Nit: Please add an extra line before #ifdef

@@ +2705,5 @@
>  
> +#ifdef ENABLE_TESTS
> +  mThread->TestPBackground();
> +#endif
> +  BackgroundChild::CloseForCurrentThread();

I think we should do this sooner... Can it be moved to right after the block with DoRunLoop() but before we unregister from the profiler?

@@ +2729,5 @@
> +  bool done = false;
> +  nsCOMPtr<nsIIPCBackgroundChildCreateCallback> callback =
> +    new WorkerBackgroundChildCallback(done);
> +
> +  if (!BackgroundChild::GetOrCreateForCurrentThread(callback)) {

NS_WARN_IF here please, and in the other spots below where you return early due to failures.

@@ +2734,5 @@
> +    return NS_ERROR_FAILURE;
> +  }
> +
> +  while (!done) {
> +    if (!NS_ProcessNextEvent(mThread)) {

Nit: Since this matters earlier it would be clearer if this was:

  NS_ProcessNextEvent(mThread, true /* aMayWait */);

::: ipc/glue/BackgroundImpl.cpp
@@ +1612,5 @@
>      static_cast<ThreadLocalInfo*>(PR_GetThreadPrivate(sThreadLocalIndex));
>  
> +  if (threadLocalInfo) {
> +    return threadLocalInfo->mActor;
> +  } else {

Nit: No need for else after a return.

@@ +1678,5 @@
>  
>    return true;
>  }
>  
> +/* static */

Uber-nit: |// static| to match the others.

@@ +1687,5 @@
> +             "BackgroundChild::Startup() was never called!");
> +  auto threadLocalInfo =
> +    static_cast<ThreadLocalInfo*>(PR_GetThreadPrivate(sThreadLocalIndex));
> +  if (threadLocalInfo) {
> +    if (threadLocalInfo->mActor) {

I'm a little concerned about this... If we have a |threadLocalInfo| then someone called |GetOrCreateForCurrentThread()|. If we have an actor already then everything is fine.

If we don't have an actor then the actor is being created somewhere and it just hasn't been assigned yet. In this case we're not going to do anything here which seems wrong. We're basically allowing a race condition to cause different behavior.

I think we could do two things here:

1. MOZ_CRASH if we get into this state.
2. Set some flags and add extra code so that we close the actor once it finally gets created.

What do you think? I would prefer not to allow a racy API like this if it's not too difficult...

::: ipc/glue/MessagePump.cpp
@@ +220,5 @@
>  
>  NS_IMETHODIMP
>  DoWorkRunnable::Run()
>  {
> +  if (mCanceled) {

I actually think we can't bail out here... There's a 1:1 relationship between DoWorkRunnable and tasks in the chromium queue so if we skip one we'll leave something else unprocessed.

For now just have Cancel() call Run() directly and we can revisit if something breaks?
Attachment #8438678 - Flags: review?(bent.mozilla) → review-
> @@ +1687,5 @@
> I think we could do two things here:
> 
> 1. MOZ_CRASH if we get into this state.
> 2. Set some flags and add extra code so that we close the actor once it
> finally gets created.

I chose to MOZ_CRASH() for a few reasons:

1) This could also be a double close.
2) Lets not add complexity unless we find we need it.
3) I don't want to add something I can't test and writing good, racy test cases can be hard without a real use case that triggers it.

New try:

https://tbpl.mozilla.org/?tree=Try&rev=fca7c291b8cf
Attachment #8438678 - Attachment is obsolete: true
Attachment #8442596 - Flags: review?(bent.mozilla)
Also, I assumed if the runnable Cancel() is called, then I can assume Run() won't be called.  Is that correct?  Or do I still need the flag to short-circuit Run() to avoid double execution?
(In reply to Ben Kelly [:bkelly] from comment #42)
> Also, I assumed if the runnable Cancel() is called, then I can assume Run()
> won't be called.  Is that correct?

This is true for the worker run loop but I can't say for other things.

> Or do I still need the flag to short-circuit Run() to avoid double execution?

I would do that to be safe. Maybe assert too.
Sorry, realized after posting the previous patch I needed to fix the CloseForCurrentThread() description in BackgroundChild.h.  It now indicates that the call may only be made once and only after the actor has been created.

Also fix another case of whitespace missing around an #ifdef block.
Attachment #8442596 - Attachment is obsolete: true
Attachment #8442596 - Flags: review?(bent.mozilla)
Attachment #8442601 - Flags: review?(bent.mozilla)
Sigh.  Uploaded wrong patch.
Attachment #8442601 - Attachment is obsolete: true
Attachment #8442601 - Flags: review?(bent.mozilla)
Attachment #8442602 - Flags: review?(bent.mozilla)
Hmm, try is unhappy.  It seems the modified code is allowing the callback from the main thread to occur while SetAcceptingNonWorkerRunnables(false) for some reason.  Investigating.
Depends on: 1027772
No longer depends on: 1027772
See Also: → 1027772
So it turns out at least some, if not all, the try errors were fallout from implementing this:

> >  
> > -#ifdef DEBUG
> > -  thread->SetAcceptingNonWorkerRunnables(false);
> 
> I'd actually prefer that we leave this here, and then set true (and then
> false again) in WorkerThreadPrimaryRunnable::Run() before and after calling
> SynchronouslyCreatePBackground().
> 
> That will reduce the amount of time that something could slip in without
> triggering the assertions.

The problem is that this creates a race between setting SetAcceptingNonWorkerRunnables(false) in ScheduleWorker() and setting it true on the worker thread.

Usually it works, but every now and then we hit this condition:

1) Main thread calls thread->Dispatch() from ScheduleWorker()
2) Worker thread wakes up
3) Worker thread runs PrimaryWorkerRunnable
4) Worker thread calls SetAcceptingNonWorkerRunnables(true)
5) Worker thread blocks waiting for actor callback
6) Main thread wakes up
7) Main thread calls SetAcceptingNonWorkerRunnables(false)
8) Main thread creates actor
9) Main thread calls thread->Dispatch() for callback

At that point we hit the assert because mAcceptingNonWorkerRunnables is false.

I think the correct thing here is to make the final dispatch from the main thread in ScheduleWorker atomic with setting mAcceptingNonWorkerRunnable false.

This patch does this by adding a DispatchAndStopAcceptingNonWorkerRunnables() method.  This performs the Dispatch() and sets the boolean with a lock.

Note, I had to switch to using a ReentrantMonitor because Dispatch() immediately tries to lock again by calling IsAcceptingNonWorkerRunnables().

I attempted to make lock recursion only allowed in IsAcceptingNonWorkerRunnables() by checking mMonitor.AssertNotCurrentThreadIn() in the other methods.  It looks like this is not actually implemented yet, though.  I wrote bug 1027772 to fix that.

Anyway, what do you think of this solution?
Attachment #8442602 - Attachment is obsolete: true
Attachment #8442602 - Flags: review?(bent.mozilla)
This fixed the errors locally for me.  Lets see how the new try does:

  https://tbpl.mozilla.org/?tree=Try&rev=a4180b4248a5
(In reply to Ben Kelly [:bkelly] from comment #47)
> Anyway, what do you think of this solution?

This sounds like a lot of work just to tighten that assertion... Sorry to send you down that rabbit hole, but I think we can just live with the old assertion behavior.
No problem.  That's what I was thinking too. :-)
I believe this addresses all the review feedback to-date.  Also passed on past try runs, I expect this to be green:

  https://tbpl.mozilla.org/?tree=Try&rev=1ac3e93dcf0e
Attachment #8443002 - Attachment is obsolete: true
Attachment #8443214 - Flags: review?(bent.mozilla)
See Also: 1027772
Using the debugger Ben confirmed that early delivery of worker messages is still safe with this patch because the WorkerPrivate is still in a pending state while spinning the event loop in SynchronouslyCreatePBackground().  This causes those early worker messages to be queued separately until its state moves to running.

This patch update adds asserts to SynchronouslyCreatePBackground() to verify that the worker private is indeed in its pending state.

https://tbpl.mozilla.org/?tree=Try&rev=f8affed3ca5d
Attachment #8443214 - Attachment is obsolete: true
Attachment #8443214 - Flags: review?(bent.mozilla)
Attachment #8447268 - Flags: review?(bent.mozilla)
Talking with Ben, it sounds like we can guarantee this safely without the assert.  So remove the assert and just add a comment instead.
Attachment #8447268 - Attachment is obsolete: true
Attachment #8447268 - Flags: review?(bent.mozilla)
Attachment #8447315 - Flags: review?(bent.mozilla)
Comment on attachment 8447315 [details] [diff] [review]
Support PBackground on workers. (merged v8)

Review of attachment 8447315 [details] [diff] [review]:
-----------------------------------------------------------------

This looks great, thanks!

::: dom/workers/RuntimeService.cpp
@@ +1095,5 @@
> +  void
> +  TestPBackground()
> +  {
> +    using mozilla::ipc::BackgroundChild;
> +    using mozilla::ipc::PBackgroundChild;

Nit: using namespace mozilla::ipc;

@@ +1103,5 @@
> +      PRSize randomSize = PR_GetRandomNoise(&testValue, sizeof(testValue));
> +      MOZ_RELEASE_ASSERT(randomSize == sizeof(testValue));
> +      nsCString testStr;
> +      testStr.AppendInt(testValue);
> +      testStr.AppendInt(reinterpret_cast<uint64_t>(PR_GetCurrentThread()));

Nit: intptr_t here instead of uint64_t

@@ +2735,5 @@
>    return NS_OK;
>  }
>  
> +nsresult
> +WorkerThreadPrimaryRunnable::SynchronouslyCreatePBackground()

Please assert here that BackgroundChild::GetForCurrentThread returns null.

::: ipc/glue/BackgroundImpl.cpp
@@ +2009,5 @@
>    AssertIsOnBoundThread();
>  
>    BackgroundChildImpl::ActorDestroy(aWhy);
> +
> +  mActorDestroyed = true;

Nit: Let's move this to before you call BackgroundChildImpl::ActorDestroy.

::: ipc/glue/MessagePump.cpp
@@ +264,5 @@
> +  // here.  If we don't process all of these then we will leave something
> +  // unprocessed in the chromium queue.  Therefore, eagerly complete our work
> +  // instead by immediately calling Run().
> +  mCanceled = true;
> +  mCallingRunWhileCanceled = true;

Nit: Please use AutoRestore for this.
Attachment #8447315 - Flags: review?(bent.mozilla) → review+
Fixed review feedback, except for the intptr_t comment.  Turns out we don't have an intptr_t compatible AppendInt() method.  Updated commit log.
Attachment #8447315 - Attachment is obsolete: true
Attachment #8447398 - Flags: review+
https://hg.mozilla.org/mozilla-central/rev/303027a0da95
Status: ASSIGNED → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla33
Depends on: 1032605, 1032783
I pushed a backout of mozilla-inbound since fixing the oranges has been much trickier than I thought:

  https://hg.mozilla.org/integration/mozilla-inbound/rev/306fc8865aae
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Target Milestone: mozilla33 → ---
After talking with Ben and much experimentation, we think the best approach is to make MessageChannel::Close() call ActorDestroy() synchronously in the case where the channel is still opening.

This patch does this along with a few other minor fixes we uncovered.  It works in the debugger, but needs a full try run to see if any of the oranges are still a problem.

https://tbpl.mozilla.org/?tree=Try&rev=d6dc74bd5e77
Attachment #8447398 - Attachment is obsolete: true
Comment on attachment 8451878 [details] [diff] [review]
Reland support for PBackground on workers. (v10)

Try looks good, so flagging for review.  The splinter interdiff to v9 is mostly correct, although you have to ignore the first two hunks.
Attachment #8451878 - Flags: review?(bent.mozilla)
And by try being good, I mean that there have been no failures in the previously orange tests.  B2G ICS Emulator Debug M12 was the worst before.  So far no failures after 20 pushes.
Comment on attachment 8451878 [details] [diff] [review]
Reland support for PBackground on workers. (v10)

Review of attachment 8451878 [details] [diff] [review]:
-----------------------------------------------------------------

Thanks!

::: dom/workers/RuntimeService.cpp
@@ +919,5 @@
> +public:
> +  WorkerBackgroundChildCallback(bool* aDone)
> +  : mDone(aDone)
> +  {
> +    MOZ_ASSERT(mDone);

Let's also MOZ_ASSERT(!NS_IsMainThread())

::: ipc/glue/BackgroundChild.h
@@ +38,5 @@
>  // time.
>  //
> +// CloseForCurrentThread() will close the current PBackground actor.  Subsequent
> +// calls to GetForCurrentThread will return null.  CloseForCurrentThread() may
> +// only be called exactly once per thread.  Currently it is illegal to call this

CloseForCurrentThread() may be called more than once, you just have to create another actor in between. This comment needs more specificity.

::: ipc/glue/BackgroundImpl.cpp
@@ +422,5 @@
>      if (threadLocalInfo) {
>        if (threadLocalInfo->mActor) {
>          threadLocalInfo->mActor->Close();
> +        threadLocalInfo->mActor->AssertActorDestroyed();
> +        // Since the actor is created on the main thread it must only

Nit: Please add a newline between the assert and the comment.

@@ +1697,5 @@
> +    MOZ_CRASH("Attempting to close a non-existent PBackground actor!");
> +  }
> +
> +  if (threadLocalInfo->mActor) {
> +    threadLocalInfo->mActor->FlushPendingInterruptQueue();

This really shouldn't be needed (it only does anything if we have rpc messages, which we shouldn't...) but I don't think it hurts either.

@@ +1699,5 @@
> +
> +  if (threadLocalInfo->mActor) {
> +    threadLocalInfo->mActor->FlushPendingInterruptQueue();
> +  }
> +  DebugOnly<PRStatus> status = PR_SetThreadPrivate(sThreadLocalIndex, nullptr);

Nit: Please add a newline before this one, and maybe a comment saying that clearing the threadlocal will close the actor synchronously.

@@ +2007,5 @@
>  ChildImpl::ActorDestroy(ActorDestroyReason aWhy)
>  {
>    AssertIsOnBoundThread();
>  
> +  mActorDestroyed = true;

Nit: Please also assert that mActorDestroyed is false before you set it here.

::: ipc/glue/MessageChannel.cpp
@@ +1662,5 @@
>          }
>  
>          if (ChannelOpening == mChannelState) {
> +            // SynchronouslyClose() waits for an ack from the other side, so
> +            // the opening process should complete before this returns.

Nit: s/process/sequence/ maybe? "process" is overloaded in this context.

::: ipc/glue/MessagePump.cpp
@@ +12,5 @@
>  #include "base/basictypes.h"
>  #include "base/logging.h"
>  #include "base/scoped_nsautorelease_pool.h"
>  #include "mozilla/Assertions.h"
> +#include "mozilla/AutoRestore.h"

Nit: This isn't needed any longer.
Attachment #8451878 - Flags: review?(bent.mozilla) → review+
Updated with review feedback.  Retested on debugger for good measure due to that extra assert we added in ChildImpl::ActorDestroy.
Attachment #8451878 - Attachment is obsolete: true
Attachment #8452511 - Flags: review+
Lets see if we can reproduce these android ThreadLink crashes in this try:

  https://tbpl.mozilla.org/?tree=Try&rev=02dff17f5139
I got it to reproduce once in that try.  Took about 40 pushes of Android 2.3 Opt M6.  That's about a 2% or 3% reproduction rate.
It appears this is easy to reproduce locally just by adding a printf in here:

  http://dxr.mozilla.org/mozilla-central/source/ipc/glue/MessageLink.cpp#189

This code looks very racy and could use more checking.

Also, whats the history of this comment:

  // :TODO: MonitorAutoLock lock(*mChan->mMonitor);

Ben, do you know?
Flags: needinfo?(bent.mozilla)
I think that TODO comment is about prevent messages from being sent on the channel, not about racing with the other ThreadLink destructor.  So probably not strictly relevant here.
Flags: needinfo?(bent.mozilla)
This patch adds a static mutex to protect the operations in ~ThreadLink().  This code needs to run automically with respect to the other side's ~ThreadLink().

In theory a static mutex is a bit broad as we could have separate channels tearing down, but this does not seem too common a situation and probably won't contend much.  Also, it seems the most straightforward as its not clear where else to stick a mutex that would be common to both ThreadLink objects.  Taking both channel monitors seems like a recipe for deadlock.

Here's a new try:

  https://tbpl.mozilla.org/?tree=Try&rev=3b44eaef7693

Based on local debugging I'm fairly confident this will solve the problem, so flagging for review now.
Attachment #8452511 - Attachment is obsolete: true
Attachment #8453057 - Flags: review?(bent.mozilla)
I suppose to avoid that mutex being abused elsewhere, I could declare it as a static within the ~ThreadLink() method scope itself.
On IRC Ben pointed out that mMonitor is shared between the two MessageChannel objects, so we can just lock that.

I also dug into why it was commented out with a TODO.  It appears that it was an unintended bug from merging channels together into on class in bug 901789.

This patch restores the mMonitor lock in ~ThreadLink(), adds a comment about the monitor being shared, and adds some MOZ_ASSERTs for good measure.

https://tbpl.mozilla.org/?tree=Try&rev=2cb59580ab0c
Attachment #8453057 - Attachment is obsolete: true
Attachment #8453057 - Flags: review?(bent.mozilla)
Attachment #8453181 - Flags: review?(bent.mozilla)
Attachment #8453181 - Flags: review?(bent.mozilla) → review+
Try looks good with over 40 triggers without a failure.  Back to mozilla-inbound:

https://hg.mozilla.org/integration/mozilla-inbound/rev/fc4d6b5ee8ce
https://hg.mozilla.org/mozilla-central/rev/fc4d6b5ee8ce
Status: REOPENED → RESOLVED
Closed: 6 years ago6 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla33
Depends on: 1036970
Depends on: 1037294
Depends on: 1040288
Depends on: 1049552
Depends on: 1049801
Depends on: 1034309
You need to log in before you can comment on or make changes to this bug.