Consider making the state machine thread a task queue

RESOLVED FIXED in Firefox 39

Status

()

RESOLVED FIXED
4 years ago
3 years ago

People

(Reporter: bholley, Assigned: bholley)

Tracking

unspecified
mozilla39
x86
Mac OS X
Points:
---
Dependency tree / graph

Firefox Tracking Flags

(firefox39 fixed)

Details

Attachments

(4 attachments)

Right now MediaDecoderStateMachineScheduler::GetStateMachineThread returns mEventTarget, which is an nsIThreadPool. Because it's a thread pool, tasks dispatched to it can run concurrently with other tasks (in particular, the state machine). So anybody trying to dispatch things "to the state machine thread" in the hopes of avoiding races is doing the wrong thing. And grepping through the code, we have at least a couple of instances of this.

I think we should switch the idea of a "state machine thread" to a "state machine task queue". This should be pretty straightforward to do, and I'm happy to do it as part of the other work that I'm doing. Any thoughts or objections?
Flags: needinfo?(cpearce)
Facepalm - I just realized that the shared thread pool sets a thread limit of 1, making it not much of a pool at all. So I don't think there are any races here.

That still leaves the question of whether it makes sense to run the MDSM task-queue-on-thread-pool to decrease interference between MDSMs and make the code more consistent. I've got patches that get us most of the way there already, though the timer is certainly still a sticky issue. We should probably discuss this in a larger discussion about our threading model.
Group: core-security
Flags: needinfo?(cpearce)
Summary: Usage of GetStateMachineThread potentially unsafe → Consider making the state machine thread a task queue
No longer blocks: 778617, 1135170
I'm hoping we can remove the timer in future. Once roc moves the video queue and A/V sync into the compositor, if we can figure out how to update HTMLMediaElement.currentTime for audio only streams we can probably remove the timer.

Removing the audio thread will likely help here too.
Per bug 1135785 comment 10, we should try to do this shortly after bug 1135424 lands.
Depends on: 1135785
Created attachment 8576345 [details] [diff] [review]
Part 1 - Allow MediaPromise dispatch to fail if the ThenValue has been disconnected. v1

The original idea behind the current model was that we wanted ironclad guarantees
that consumers would always get a callback on their promise. But we now have use
cases where the consumer wants to forget about a promise (using the new
Disconnect()) feature, and in some cases wants to shut down the task queue that
the response is going to be dispatched on. In the case of this bug, we want to
avoid waiting for the longest outstanding timer promise to be resolved before
shutting down the MDSM.

So this patch fixes up the pieces needed to make this work:
* Loosening our invariants to allow dispatch targets to be released on any thread,
  since MediaTaskQueue and nsIEventTarget both have thread-safe refcounting.
* Releasing mThisVal in Disconnect, so that we no longer depend on successful
  dispatch to release it on the correct (dispatch) thread.
* Fiddling with various assertions.

We also make some assertions fatal in nightly/aurora builds while we're at it.
Attachment #8576345 - Flags: review?(matt.woodrow)
Created attachment 8576346 [details] [diff] [review]
Part 2 - Implement MediaTimer. v1
Attachment #8576346 - Flags: review?(matt.woodrow)
Created attachment 8576347 [details] [diff] [review]
Part 3 - Switch the MDSM to a task queue. v1
Attachment #8576347 - Flags: review?(matt.woodrow)
Comment on attachment 8576345 [details] [diff] [review]
Part 1 - Allow MediaPromise dispatch to fail if the ThenValue has been disconnected. v1

Review of attachment 8576345 [details] [diff] [review]:
-----------------------------------------------------------------

::: dom/media/MediaPromise.h
@@ +245,5 @@
>        PROMISE_LOG("%s Then() call made from %s [Runnable=%p, Promise=%p, ThenValue=%p]",
>                    resolved ? "Resolving" : "Rejecting", ThenValueBase::mCallSite,
>                    runnable.get(), aPromise, this);
> +      nsresult rv = detail::DispatchMediaPromiseRunnable(mResponseTarget, runnable);
> +      (void) rv;

mozilla::unused?
Attachment #8576345 - Flags: review?(matt.woodrow) → review+
Comment on attachment 8576346 [details] [diff] [review]
Part 2 - Implement MediaTimer. v1

Review of attachment 8576346 [details] [diff] [review]:
-----------------------------------------------------------------

::: dom/media/MediaTimer.h
@@ +74,5 @@
> +    nsRefPtr<MediaTimerPromise::Private> mPromise;
> +
> +    explicit Entry(const TimeStamp& aTimeStamp)
> +      : mTimeStamp(aTimeStamp)
> +      , mPromise(new MediaTimerPromise::Private(__func__))

This __func__ is probably going to be fairly useless. We could make WaitUntil take a const char* parameter and have the callers of that use __func__ to get more useful values.
Attachment #8576346 - Flags: review?(matt.woodrow) → review+
Comment on attachment 8576347 [details] [diff] [review]
Part 3 - Switch the MDSM to a task queue. v1

Review of attachment 8576347 [details] [diff] [review]:
-----------------------------------------------------------------

::: dom/media/MediaDecoderStateMachine.h
@@ +322,5 @@
>  
> +  // Invokes ScheduleStateMachine to run at |aTimeStamp|, unless it's
> +  // already scheduled to run earlier, in which case the request is
> +  // discarded.
> +  void ScheduleStateMachineAt(const TimeStamp aTimeStamp);

Why did this change to an absolute time instead of a relative one (TimeDuration)?

All the callers have to use TimeStamp::Now() to convert relative to absolute, and I'm not really sure what passing a value < Now() would actually mean.
Attachment #8576347 - Flags: review?(matt.woodrow) → review+
Apparently you need mochitest-e10s-3 on your try push too, to persuade me of your innocence.

Backed out in https://hg.mozilla.org/integration/mozilla-inbound/rev/1e99f9913951 on suspicion of being the cause of 330 seconds without output hangs in test_playback.html in mochitest-e10s-3. I don't feel great about putting the blame on you, but retriggering I got three of them on you, and none on the pushes below you.
(In reply to Phil Ringnalda (:philor) from comment #12)
> Apparently you need mochitest-e10s-3 on your try push too, to persuade me of
> your innocence.

Will do!

> Backed out in
> https://hg.mozilla.org/integration/mozilla-inbound/rev/1e99f9913951 on
> suspicion of being the cause of 330 seconds without output hangs in
> test_playback.html in mochitest-e10s-3. I don't feel great about putting the
> blame on you, but retriggering I got three of them on you, and none on the
> pushes below you.

Well, there's also a legitimate assertion failure (which I've fixed). But yes, let's investigate this on try.

Control: https://treeherder.mozilla.org/#/jobs?repo=try&revision=9557ba6176a4
Logging: https://treeherder.mozilla.org/#/jobs?repo=try&revision=dcc31427f8cf
After several iterations of logging, it looks like we're hanging in the AwaitShutdownAndIdle call in nsDecoderDisposeEvent::Run. I'm not sure why exactly, but I would guess that one of the tasks in the task queue is blocking on the main thread, either with synchronous dispatch or because we're spinning the event loop and something higher up on the stack is holding a blocking resource (the event loop spinning stuff might explain why this is e10s-only).

To be honest, shutting down the task queue synchronously on the main thread was kind of cutting a corner to begin with. I'll write up a patch to do it properly with the MediaPromise that BeginShutdown already returns.
Created attachment 8577870 [details] [diff] [review]
Part 4 - Run MDSM disposal off the MediaPromise returned by initiating shutdown on the task queue. v1
Attachment #8577870 - Flags: review?(matt.woodrow)
(In reply to Bobby Holley (:bholley) from comment #16)
> https://treeherder.mozilla.org/#/jobs?repo=try&revision=0de522268b51

Looks green. \o/
Attachment #8577870 - Flags: review?(matt.woodrow) → review+
I get the following error doing a self build of current trunk, which I suspect was caused by the patches for this bug:

clang++ -o nsTypeAheadFind.o -c  -fvisibility=hidden -DSTATIC_EXPORTABLE_JS_API -DMOZILLA_INTERNAL_API -DIMPL_LIBXUL -DAB_CD=en-US -DNO_NSPR_10_SUPPORT -I/usr/local/src/Mozilla/
bugzilla1110911/blah/mozilla-central/toolkit/components/typeaheadfind -I.  -I../../../dist/include   -I/usr/local/src/Mozil
la/bugzilla1110911/blah/mozilla-central/obj-firefox-64bit/dist/include/nspr -I/usr/local/src/Mozilla/bugzilla1110911/blah/mozilla-central/obj-firefox-64bit/dist/include/nss       -fPIC  -Qunused-arguments  -DMOZILLA_CLIENT -include ../../../mozilla-config.h -MD -MP -MF .deps/nsTypeAheadFind.o.pp -Qunused-arguments  -Qunused-arguments -Wall -Wempty-body -Woverloaded-virtual -Wsign-compare -Wwrite-strings -Wno-invalid-offsetof -Wno-c++0x-extensions -Wno-extended-offsetof -Wno-unknown-warning-option -Wno-return-type-c-linkage -g -gfull -isysroot /Developer/SDKs/MacOSX10.7.sdk -fno-exceptions -fno-strict-aliasing -fno-exceptions -fno-math-errno -std=gnu++0x -pthread -DNO_X11 -pipe  -DNDEBUG -DTRIMMED -g -fno-omit-frame-pointer       /usr/local/src/Mozilla/bugzilla1110911/blah/mozilla-central/toolkit/components/typeaheadfind/nsTypeAheadFind.cpp
In file included from /usr/local/src/Mozilla/bugzilla1110911/blah/mozilla-central/obj-firefox-64bit/layout/generic/Unified_cpp_layout_generic3.cpp:38:
In file included from /usr/local/src/Mozilla/bugzilla1110911/blah/mozilla-central/layout/generic/nsVideoFrame.cpp:14:
In file included from ../../dist/include/mozilla/dom/HTMLVideoElement.h:11:
In file included from ../../dist/include/mozilla/dom/HTMLMediaElement.h:22:
In file included from ../../dist/include/MediaDecoder.h:192:
../../dist/include/MediaPromise.h:267:16: error: 'Disconnect' marked 'override' but does not override any member functions
  virtual void Disconnect() MOZ_OVERRIDE
               ^
(In reply to Steven Michaud from comment #20)
> I get the following error doing a self build of current trunk, which I
> suspect was caused by the patches for this bug:

You need to update your clang. See bug 1129499.
> You need to update your clang. See bug 1129499.

That's going to be difficult.  I use Apple's clang (which comes with XCode), but because I'm building on OS X 10.7.5 I can't use Apple's latest clang (or XCode).  I suppose I can hack something up, though (perhaps by building my own clang).

I expect this isn't the last you'll hear of this.  In other words, I expect others will also have run into this difficulty.

Thanks, though, for pointing out the bug that made the change (and triggered the problem).
(In reply to Steven Michaud from comment #22)
> That's going to be difficult.  I use Apple's clang (which comes with XCode),
> but because I'm building on OS X 10.7.5 I can't use Apple's latest clang (or
> XCode).  I suppose I can hack something up, though (perhaps by building my
> own clang).

The easiest thing would probably be to write a local patch to remove the MOZ_OVERRIDE annotation.

> I expect this isn't the last you'll hear of this.  In other words, I expect
> others will also have run into this difficulty.

I am open to suggestions on how to improve the situation.

> Thanks, though, for pointing out the bug that made the change (and triggered
> the problem).

To be clear, that was a different instance of the same problem.
>> I expect this isn't the last you'll hear of this.  In other words,
>> I expect others will also have run into this difficulty.
>
> I am open to suggestions on how to improve the situation.

I just found I can use a patched build of (non-Apple) clang r185949,
which I'd created to do Mac ASan builds on OS X 10.7.5
(http://people.mozilla.org/~stmichaud/bmo/firefox-asan-howto.txt).

So you don't need to worry about me.

But if enough other people show up with the same problem, who can't
easily upgrade their clang, I suggest you continue to accomodate old
clang versions (those with a bug in how they handle the "override"
keyword) -- here and in future patches.

Here you'd presumably need to get rid of the OVERRIDE keyword in the
following line:

https://hg.mozilla.org/mozilla-central/annotate/2e2222a40262/dom/media/MediaPromise.h#l267
(In reply to Steven Michaud from comment #24)
> Here you'd presumably need to get rid of the OVERRIDE keyword in the
> following line:
> 
> https://hg.mozilla.org/mozilla-central/annotate/2e2222a40262/dom/media/
> MediaPromise.h#l267

We can't do that, because then static analysis builds fail. We now require that all overrides are marked MOZ_OVERRIDE.
Oh well :-(

Let's hope there aren't too many others who still need to build on OS X 10.7.  Or if there are some, let's hope they can use my workaround from comment #24.  (As I remember it, the latest non-Apple clang doesn't build/work on OS X 10.7, so you can't just upgrade to the latest version.)
I compile in a 10.7 VM with clang from macport (so I can get the clang plugin working) and it works fine for me.
> clang from macport

It's good to know that also works.

Updated

3 years ago
Depends on: 1147555
You need to log in before you can comment on or make changes to this bug.