Open Bug 997844 Opened 7 years ago Updated 3 years ago

Race condition that causes timer callback not fired

Categories

(Core :: Audio/Video: Playback, defect)

defect
Not set
normal

Tracking

()

People

(Reporter: jwwang, Unassigned)

References

(Blocks 1 open bug)

Details

Attachments

(1 file)

User Agent: Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/32.0.1700.72 Safari/537.36

Steps to reproduce:

This bug happens in running the test case, test_seek.html.


Actual results:

[media decoder state machine thread]

A1. begin of runnng nsTimerEvent

A2. enter nsTimerImpl::Fire // nsTimerImpl::mRefCnt == 2

A3. enter MediaDecoderStateMachine::CallRunStateMachine

A4. enter MediaDecoderStateMachine::ScheduleStateMachine

A5. mTimer->Cancel(); // ref count not changed for the timer is already removed from timer thread

A6. mTimer->InitWithFuncCallback // nsTimerImpl::mRefCnt == 3

A7. exit MediaDecoderStateMachine::ScheduleStateMachine

A8. exit MediaDecoderStateMachine::CallRunStateMachine

A9. exit nsTimerImpl::Fire



=============== context switch ===================



[decoding thread]

B1. enter MediaDecoderStateMachine::ScheduleStateMachine

B2. mTimer->Cancel(); // nsTimerImpl::mRefCnt == 2, remove the timer added in step A6



=============== context switch ===================



[media decoder state machine thread]

A10. end of running nsTimerEvent

A11. destructor of nsTimerEvent

A12. enter nsTimerImpl::Release

A13. count = --mRefCnt; // count == 1, mRefCnt == 1



=============== context switch ===================



[decoding thread]

B3. enter nsTimerImpl::InitWithFuncCallback

B4. enter TimerThread::AddTimer

B5. enter TimerThread::AddTimerInternal

B6. mTimers.InsertElementSorted(aTimer, c)

B7. aTimer->mArmed = true;

B8. NS_ADDREF(aTimer); // nsTimerImpl::mRefCnt == 2



=============== context switch ===================



[media decoder state machine thread]

A14. if (count == 1 && mArmed) // true for A13 and B7

A15. gThread->RemoveTimer(this) // bang! timer added in B6 is removed and won't fire anymore
Blocks: 994877
OS: Windows 7 → All
Hardware: x86_64 → All
Nice catch!

Do I understand correctly that the _same_ nsITimer instance is being used from two different threads at once?  That's _so_ not supported.  The API documentation for nsITimer explicitly talks about this not being supported.
Blocks: 857424, 981153
Component: XPCOM → Video/Audio
That said, simply doing a mutex around the cancel/rearm combination in ScheduleStateMachine might be good enough here...
Er, but ScheduleStateMachine already has AssertCurrentThreadInMonitor, so it's clearly not; I didn't read the steps above carefully enough; the non-atomic bit here is timer firing, which is not under this mutex.

So back to the rules for using XPCOM timers: they are not supposed to be used from multiple threads.
(In reply to Boris Zbarsky [:bz] from comment #3)
> So back to the rules for using XPCOM timers: they are not supposed to be
> used from multiple threads.

So we should be dynamically asserting this somehow, then?
Sure would be nice.  I'll do a try push with such an assert.
Great find!  And yay to adding an assertion here, nay to relying on people to read the docs!
I filed bug 998006 on adding the assertion.  It's triggered by several places in our code.  :(
Blocks: 998006
Is the correct usage for ScheduleStateMachine to allocate a timer for each of state machine thread and decoding thread to fix this race?
(In reply to Boris Zbarsky [:bz] from comment #1)
> Nice catch!
> 
> Do I understand correctly that the _same_ nsITimer instance is being used
> from two different threads at once?  That's _so_ not supported.  The API
> documentation for nsITimer explicitly talks about this not being supported.

I can't find the API document about the thread-safe usage here: https://developer.mozilla.org/en-US/docs/Mozilla/Tech/XPCOM/Reference/Interface/nsITimer. Can you show me the way?
Flags: needinfo?(bzbarsky)
(In reply to JW Wang [:jwwang] from comment #8)
> Is the correct usage for ScheduleStateMachine to allocate a timer for each
> of state machine thread and decoding thread to fix this race?

This doesn't seem right. You can assign a thread pool as the event target of an nsITimer as is used in MediaDecoderStateMachine where timer callbacks will be dispatched to different threads if pool size > 1. I will open another bug to fix the incorrect usage of nsITimer in MediaDecoderStateMachine.
(In reply to JW Wang [:jwwang] from comment #9)
> (In reply to Boris Zbarsky [:bz] from comment #1)
> > Nice catch!
> > 
> > Do I understand correctly that the _same_ nsITimer instance is being used
> > from two different threads at once?  That's _so_ not supported.  The API
> > documentation for nsITimer explicitly talks about this not being supported.
> 
> I can't find the API document about the thread-safe usage here:
> https://developer.mozilla.org/en-US/docs/Mozilla/Tech/XPCOM/Reference/
> Interface/nsITimer. Can you show me the way?

Ok, I find it here http://lxr.mozilla.org/mozilla-central/source/xpcom/threads/nsITimer.idl#51. I wonder why the 2nd paragraph doesn't appear on the MDN. Btw, does the statement still hold when applying to an event target which is actually a thread pool?
> I wonder why the 2nd paragraph doesn't appear on the MDN.

Probably because the MDN documentation was created before bug 792920 was fixed, and then didn't get updated.

> Btw, does the statement still hold when applying to an event target which is actually a
> thread pool?

Yes, because you get the race you described.  My proposed assert in bug 998006 would not catch misuses here in that it would not fire if one threadpool thread messes with the timer that's firing on another threadpool thread at the same time... I'm not sure how to really check for that sanely.  :(
Flags: needinfo?(bzbarsky)
(In reply to Boris Zbarsky [:bz] from comment #12)
> Yes, because you get the race you described.  My proposed assert in bug
> 998006 would not catch misuses here in that it would not fire if one
> threadpool thread messes with the timer that's firing on another threadpool
> thread at the same time... I'm not sure how to really check for that sanely.
> :(

At least I can start to fix the timer misusage in MediaDecoderStateMachine where thread pool size is 1. Btw, It looks hard to get such thread contention right. Would it be easier to make nsTimerImpl thread-safe?
Maybe.  We would need to sit down figure out and figure out what's needed to actually make it safe, as opposed to just looking safe....
We have a discussion on this issue today. Update what we draw on the whiteboard. It might be useful to help people who want to understand what happen here.
Ignoring any possible existing TSAN issues here around mArmed (we touch it outside of the timerthread monitor, which is concerning...), would swapping the aTimer->mArmed = true; ADDREF(aTimer); in AddTimerInternal() resolve *this* instance of cross-thread usage breaking it?  The refcount here is atomic.  If this is fixed by that change, what else would still require not touching across threads, and could we in those cases grab the timerthread monitor?

How much does it hurt us perf-wise/mem-wise to have a per-timer lock to avoid cross-thread issues, which would have close to 0 contention?  I assume contention was one of the reasons to avoid grabbing the TimerThread mLock in all these cases - but what's the perf (and timer accuracy/latency) impact of that?
Switching the order of those calls would not change anything in this case, since in comment 0 there are no context switches between those calls.

The race comment 0 describes is that nsTimerImpl::Release as a method is not very atomic.  In particular, it will decrement the refcount and save the new value (atomically), then test for a particular combination of that new value and mArmed.  But in the meantime someone could have raced in to set mArmed and change the refcount.

> I assume contention was one of the reasons to avoid grabbing the TimerThread mLock in all
> these cases

It's not clear that there was a principled reason for it.
No longer blocks: 857424
For it is really hard to get thread contention right and it can never be thread-safe when the event target of an nsITimer is a thread pool, we should really think about making it thread-safe if perf measurement is acceptable, just my 2 cents...
If nsITimer can't be used with a thread pool, then its target attribute should be an nsIThread rather than an nsIEventTarget.

It's orthogonal to the issue here, but in this instance the state machine thread pool always has size 1, (as already stated here) and the only reason that it is a thread pool is beacause that's a convenient way to make it a shared singleton using SharedThreadPool. We could pretty easily change SharedThreadPool to be able to wrap nsIThreads as well as nsIThreadPools.
> If nsITimer can't be used with a thread pool,

It depends on what you mean by "used".

You can use a single-shot timer that you never cancel with a threadpool target just fine.

If you want to mess with the stat of the timer after it's first initialized, that's not safe given the current timer code.
No longer blocks: 981153
Component: Audio/Video → Audio/Video: Playback
You need to log in before you can comment on or make changes to this bug.