Closed Bug 792920 Opened 12 years ago Closed 11 years ago

Ensure that nsITimers can be called off main-thread

Categories

(Core :: XPCOM, defect)

defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla23

People

(Reporter: jduell.mcbugs, Assigned: jduell.mcbugs)

References

Details

Attachments

(2 files, 5 obsolete files)

It was recently discussed that nsITimers aren't safe to use off-main thread:

  https://groups.google.com/forum/?fromgroups=#!searchin/mozilla.dev.platform/nsITimer$20main/mozilla.dev.platform/OfF8f-UO8Z8/8qKzUWSlC0sJ

I landed bug 778296 to add a comment in nsITimer.idl to that effect.

It turns out the socket transport thread has been using off-main timers for a long time.  So far the world hasn't ended, but we should presumably make sure that this is either OK, or change nsITimer, or change necko.   Given how useful off-main thread timers are, I'd suggest we shoot for thread-safe nsITimer (if it's not already).  I presume this is just a matter of adding a lock somewhere?
You probably need to lock around delay line updates, at the very least.  Would need to audit the rest of the timer code to say what else would need it...
michal: if we fix this we obviously don't need to fix the bug where nsDeleteDir is setting a timer off-main thread (forgot bug #: mark this as blocking that one?)
Blocks: 832474
isn't there already a race when using nsITimer on the main thread in the case of repeating-precise timers? e.g.:

on timerthread nstimerimpl::posttimerevent- > setdelayinternal()

on main thread nstimerimpl::setdelay -> setdelayinternal()
Yes, you're right, for cases that change the delay of an existing timer.  :(
So I've got a couple different patches to offer here (I'll post once they are try verified).. Boris, do you have a thought on a preference?

patch 1] make nsITimer safe for use off the main thread - a] locks at the timer object level setdelayinternal() as per comment 3, and b] uses the TimerThread monitor to globally protect the delay line updates as per comment 1.. I couldn't identify other issues.

patch 2] make nsITimer genuinely thread safe.. given that Cancel() is already defined to be threadsafe we can make the whole interface work this way. It basically adds a few locks to the timer object from version 1 which should be 99% of the time uncontended and therefore cheap. Personally I like the consistency of this for the timer object and slightly favor this approach. Lowering barriers to better multithread processing is certainly in our interest of fighting existing bottlenecks.
I'm not sure I have a preference, honestly.  If we can make it threadsafe without any particular perf impact, that sounds great to me.
The try stuff looks fine here.

but my tests with nsitimer show a lot less precision than I was hoping for.. with no patches installed if I just uncomment the debug info in TimerThread I get output like this on debug linux-64

Timer thread woke up 0.043123ms from when it was supposed to
Timer thread woke up 0.106519ms from when it was supposed to
Timer thread woke up 38.869742ms from when it was supposed to
Timer thread woke up 81.693538ms from when it was supposed to
Timer thread woke up 0.113828ms from when it was supposed to
Timer thread woke up 9.955143ms from when it was supposed to
Timer thread woke up 22.991378ms from when it was supposed to
Timer thread woke up 80.233455ms from when it was supposed to
Timer thread woke up 15.945845ms from when it was supposed to
Timer thread woke up 28.859999ms from when it was supposed to
Timer thread woke up 13.920384ms from when it was supposed to
Timer thread woke up 25.989377ms from when it was supposed to
Timer thread woke up 18.994192ms from when it was supposed to
Timer thread woke up 30.981072ms from when it was supposed to
Timer thread woke up 19.969280ms from when it was supposed to
Timer thread woke up 22.982577ms from when it was supposed to
Timer thread woke up 31.951721ms from when it was supposed to
Timer thread woke up 20.980178ms from when it was supposed to
Timer thread woke up 31.989299ms from when it was supposed to
Timer thread woke up 26.976200ms from when it was supposed to
Timer thread woke up 20.895698ms from when it was supposed to
Timer thread woke up 24.885038ms from when it was supposed to
Timer thread woke up 18.869338ms from when it was supposed to
Timer thread woke up 13.888422ms from when it was supposed to
Timer thread woke up 8.273196ms from when it was supposed to
Timer thread woke up 8.799733ms from when it was supposed to
Timer thread woke up 2.843936ms from when it was supposed to
Timer thread woke up 1.522810ms from when it was supposed to
Timer thread woke up 1.475282ms from when it was supposed to
Timer thread woke up 0.861760ms from when it was supposed to

Is that expected? Many of these timers are firing much earlier than I anticipated. (these are timerthread reports, so the depth of events on the main thread where they will be delivered shouldn't be relevant, right?)
Attached patch off main thread v0 (obsolete) — Splinter Review
I'll just post the off-main-thread version (the simpler version) for fyi as an alternative. Neither of the approaches moves the needle on tpn at all
Attached patch thread safe v0 (obsolete) — Splinter Review
Attachment #705865 - Flags: review?(bzbarsky)
Comment on attachment 705865 [details] [diff] [review]
thread safe v0

I'd like to have Chris look at this too...

At the very least we need to document in TimerImpl what fields/methods have to be protected by the lock, right?
Attachment #705865 - Flags: review?(jones.chris.g)
Attachment #705865 - Flags: review?(bzbarsky)
Attachment #705865 - Flags: review+
Blocks: 819734
(In reply to Patrick McManus [:mcmanus] from comment #7)
> The try stuff looks fine here.
> 
> but my tests with nsitimer show a lot less precision than I was hoping for..
> Timer thread woke up 38.869742ms from when it was supposed to
> Timer thread woke up 81.693538ms from when it was supposed to
[..]
> Is that expected? Many of these timers are firing much earlier than I
> anticipated. (these are timerthread reports, so the depth of events on the
> main thread where they will be delivered shouldn't be relevant, right?)

I've tracked this down a bit. The issue is very large values in mTimeoutAdjustment (commonly 100+ms when there is a large innacuracy).

My understanding this is essentially a measured slack factor based on the recent delays in delivering timer events from the timer thread onto the main thread.. If anyone can confirm that would help.

I'm not really sure if that was meant to cover OS and xpcom overhead (threads, events, etc..) or the depth of the actual work queue on the callback execution thread. Its clear (by watching the variability of it in practice just by logging it) that it is the latter that is actually dominating it - and it seems to me that it might be changing too fast to be helpful. I'm not sure about that.

but I am sure that it provides accuracy problems for non-main thread uses where delivery times are hoped to be a lot better. And in general I think splitting things off the main thread where we can do so is something to be encouraged.

I'll ponder an enhancement if nobody counter indicates.
> My understanding this is essentially a measured slack factor based on the recent delays
> in delivering timer events from the timer thread onto the main thread..

It's based on recent delays between when timers "should" have fired on the main thread to when they really did.  It covers issues in the OS timing mechanism (and in fact was added largely because of the mess that is Windows timing resolution) and everything else under the sun...

Note that there is an existing bug to rip this thing out: bug 590422.
(In reply to Boris Zbarsky (:bz) from comment #12)
> > My understanding this is essentially a measured slack factor based on the recent delays
> > in delivering timer events from the timer thread onto the main thread..
> 
> It's based on recent delays between when timers "should" have fired on the
> main thread to when they really did.  It covers issues in the OS timing
> mechanism (and in fact was added largely because of the mess that is Windows
> timing resolution) and everything else under the sun...

Thanks!

the windows 15ms granularity issue is what I suspected this was aimed at - but its pretty obvious to me that what dominates the slack measurements is really the depth of the work queue on the main thread. And when delivering events to other less backed up threads it gets laughably inaccurate because of that data.

an additional timer thread without the adjustment might help (and callers could specify which they wanted). My use case would benefit from decent resolution, which I can get on many platforms, and I can just disable it or scale it back on xp to cases where resolution is less important. Main thread uses (or anyone that doesn't opt out explicitly) can continue hobbling along with what is there. Stubbing out the mTimeoutAdjustment code and checking delivery times to the socket thread on both win7 and linux gives reasonably accurate results (most of the time within 3ms).

does that seem reasonable?
Only touching mTimeoutAdjustment stuff on the main thread would be an excellent idea, imo.
(In reply to Boris Zbarsky (:bz) from comment #10)
> Comment on attachment 705865 [details] [diff] [review]
> thread safe v0
> 
> I'd like to have Chris look at this too...
> 
> At the very least we need to document in TimerImpl what fields/methods have
> to be protected by the lock, right?

Apologies for the stupid-long review lag, but I'm pretty underwater right now.  Clearing up soon but feel free to tag someone else for review in the meantime.
In chromium, when there is added a timer with delay smaller then 32ms (or so) they use timeBeginPeriod(1)/timeEndPeriod(1) (see [1]) that changes ("accurates") the minimal time you can sleep the timer (or any) thread.  This, however, affects the whole system thread scheduling and has bad performance impact.

Timer uses TimeStamp::Now() that has the best possible (in cost of performance loose) resolution and since we check for events time to fire w/o sleeping the thread under some circumstances, the best precision for very small delays can be achieved.  However, you cannot sleep the timer thread for less then 15.6 milliseconds, on some systems even a longer time unless you move time*Period.

The main question is: do we really need such good precision for our timer firing?


[1] https://bugzilla.mozilla.org/show_bug.cgi?id=676349#c1
comment 0 says that the socket thread has been accessing nsITimer off the main thread (true!), comment 2 says that the cache deletion thread has been doing the same thing (true!).

It turns out I've also seen the Cache I/O thread (separate from the deletion thread), DOM Worker thread, and somehow the timer thread itself executing callbacks (the latter vial DummyCallback associated with workers).
(In reply to Honza Bambas (:mayhemer) from comment #16)

> The main question is: do we really need such good precision for our timer
> firing?

The resolution talk is really not core to this bug. I'll move it to a separate bug when there is a patch, but it potentially influences the design so its worth talking about here.

I'm trying to make the nsITimer resolution match the system resolution a bit better. I'm interested in doing this for some networking rate pacing in a couple scenarios. I probably want numbers on the order of 10ms and right now I'm seeing 100+ms. Obviously decent data is needed here before commiting to anything - but this is part of the data gathering process.

Its certainly possible that 15ms will be enough.. or maybe we'll want to engage in the timeBeginPeriod/timeEndPeriod game for the duration of the timer (which at least in my case would be expected to be quite brief compared to overall uptime - just during moments of extreme network parallelism). Decent resolution certainly isn't an unreasonable expectation on modern hardware - these days 15ms can encompass a lot more stuff than it used to (both network and cpu)!

but the main impediment right now is not the OS granularity - it is the TimerThread::mTimeoutAdjustment that is accounting for the delivery time of events to the main thread (plus the os granularity) and that's just swamping everything else by an order of magnitude.

hth.
(In reply to Patrick McManus [:mcmanus] from comment #18)
> but the main impediment right now is not the OS granularity - it is the
> TimerThread::mTimeoutAdjustment that is accounting for the delivery time of
> events to the main thread (plus the os granularity) and that's just swamping
> everything else by an order of magnitude.

Yes, I've encoutered that code when I was trying to inject NowLoRes() where possible to the timer, since Now() is heavily used and may really has perf influence on affected platforms.

Patrick, please check bug 836883 and think of whether you could use something like this when changing the timer impl.  I can give you a draft impl of those two helper methods if you wish.
WebRTC has also been using timers off MainThread for a bunch of things (ICE for example). I'll note that nsITimer docs and .h file don't even mention thread issues other than that one comment in the idl (they don't even mention if Fire()s on the thread it's created on - you have to read the code to verify it works).  Having a timing mechanism that only works (correctly) on MainThread is/would-be... freaky.  :-)

Obviously there are competing needs here (high resolution vs low, overhead, etc), and MainThread timers in particular may (if they work) need some additional mechanisms.  Many OS's at least make a fine/coarse timer distinction, and we may want to consider having the MainThread compensation stuff only happen (if we think it works) if it's a MainThread timer (which is easy to know at timer creation time).
Comment on attachment 705865 [details] [diff] [review]
thread safe v0

Is the goal here to allow timers to be used /single-threaded/, where that thread may not be the main thread, or to be used from multiple threads concurrently?  I'm having a bit of trouble wrapping my head around the latter.
Attachment #705865 - Flags: review?(jones.chris.g)
this can be closed now that 590422 has removed the delay line stuff.

I've read the timer code and it now appears safe to use a timer in a single threaded way off the main thread. It still can't be used multi-threaded but nobody appears to be asking for that (including me).
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → WORKSFORME
reopening just to update IDL docs
Status: RESOLVED → REOPENED
Resolution: WORKSFORME → ---
I'm going to assert this doesn't need sr since we're not changing IDL, just updating comments.
Attachment #705864 - Attachment is obsolete: true
Attachment #705865 - Attachment is obsolete: true
Attachment #733142 - Flags: review?(mcmanus)
Attachment #733142 - Flags: review?(mcmanus) → review+
https://hg.mozilla.org/mozilla-central/rev/d4499dadb6e7
Assignee: nobody → jduell.mcbugs
Status: REOPENED → RESOLVED
Closed: 11 years ago11 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla23
(In reply to Patrick McManus [:mcmanus] from comment #22)
> ... it now appears safe to use a timer in a single
> threaded way off the main thread.

> It still can't be used multi-threaded...

Could you please clarify the difference between those? What cases should be avoided?
(In reply to Avi Halachmi (:avih) from comment #27)
> (In reply to Patrick McManus [:mcmanus] from comment #22)
> > ... it now appears safe to use a timer in a single
> > threaded way off the main thread.
> 
> > It still can't be used multi-threaded...
> 
> Could you please clarify the difference between those? What cases should be
> avoided?

avoid:

timer->setTarget(threadA)
thread a: timer->init
thread b: timer->cancel

ok to do them all on thread A. However thread A no longer has to be the main thread, which was true until recently.
> thread a: timer->init
> thread b: timer->cancel

Erm, but this exact scenario is mentioned as OK in the IDL changes you +r'd :) 

  "You can cancel() a timer on a thread other than its target, as long as
   its callback object has a thread-safe release() function."

IIRC hsivonen said he's been cancelling timers on different threads in the HTML parser for a long time.  Are we sure that's not OK?  We should change the IDL if it's not.
(In reply to Jason Duell (:jduell) from comment #29)
> > thread a: timer->init
> > thread b: timer->cancel
> 
> Erm, but this exact scenario is mentioned as OK in the IDL changes you +r'd
> :) 
> 
>   "You can cancel() a timer on a thread other than its target, as long as
>    its callback object has a thread-safe release() function."
> 

yes, that's wrong imo. I just read that as you moving the cancel() description to be colacted with the cancel function.. but it was more than that and I didn't notice.

> IIRC hsivonen said he's been cancelling timers on different threads in the
> HTML parser for a long time.  Are we sure that's not OK?  We should change
> the IDL if it's not.

what do you think? Its not going to crash or anything - but the semantics are broken.

http://mxr.mozilla.org/mozilla-central/source/xpcom/threads/nsTimerImpl.cpp#492

::Fire checks mCanceled and then proceeds to run without locks.. eventually invoking the callback pointers.. so if you're looking for the semantic of "after ->Cancel() returns the timer will not call me back" you're not going to get it if you mix the event target and cancel threads.

It also looks like you're going to have some troubles if your timer callback re-inits the timer and you are doing other-thread cancel()s. It will be ambiguous which timer is being canceled and you could end up just not getting any callback.
Patrick is right that cancel() by other thread presents the possibility of timers firing after they've been canceled, definitely in a wall-clock sense, and possibly also in a way that violates normal notions of program memory consistency, like

  thread A 
    timer.cancel(); 
    gFoo=4,

  thread B
    reads gFoo=4
    timer goes off on threadB

This patch changes the IDL comments to warn against such use.  We could also possibly add assertions that "currentThread == target" in cancel and all the member accessors (probably NS_ASSERTION, since we know the HTML parser at least is doing this).  Let me know if that seems worth it and I can add them (it will take some fiddling as PR_GetCurrentThread() returns a PRThread* while target is a nsIEventTarget).
Attachment #734802 - Flags: review?(mcmanus)
Henri,

Will it be easy enough to change the HMTL parser code to not cancel() timers on threads other than the timer target?  I'm not sure how much work it would be to support non-target-thread cancel() w/o any race.
Flags: needinfo?(hsivonen)
Attached patch part 2, v2: fix typo in comment (obsolete) — Splinter Review
Attachment #734802 - Attachment is obsolete: true
Attachment #734802 - Flags: review?(mcmanus)
Attachment #734808 - Flags: review?(mcmanus)
Attached patch part 2 v3: fix typo (obsolete) — Splinter Review
and... I forgot to qref before uploading the last patch.
Attachment #734808 - Attachment is obsolete: true
Attachment #734808 - Flags: review?(mcmanus)
managed to qref the wrong way.  One of those days...
Attachment #734853 - Attachment is obsolete: true
Attachment #734858 - Flags: review?(mcmanus)
Comment on attachment 734808 [details] [diff] [review]
part 2, v2: fix typo in comment

I think that's right. but lets let someone who has more ownership of this code than me do the review.
Attachment #734808 - Flags: review?(bzbarsky)
Attachment #734808 - Flags: feedback+
It seems to me that since bug 551344, the HTML5 parser only cancels timers from their target threads.
Flags: needinfo?(hsivonen)
Attachment #734808 - Flags: review?(bzbarsky)
Comment on attachment 734858 [details] [diff] [review]
part 2, v4: fix typo

bz: I read comment 36 as meaning patrick would like you to take a look at this.
Attachment #734858 - Flags: review?(mcmanus)
Attachment #734858 - Flags: review?(bzbarsky)
Attachment #734858 - Flags: feedback+
Comment on attachment 734858 [details] [diff] [review]
part 2, v4: fix typo

Ah, the patch that was on was marked obsolete....

This looks fine to me.
Attachment #734858 - Flags: review?(bzbarsky) → review+
pushed fix v2 to IDL comments: reopen just for tracking m-c landing.

 https://hg.mozilla.org/integration/mozilla-inbound/rev/9f10c5b3c418
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
https://hg.mozilla.org/mozilla-central/rev/9f10c5b3c418
Status: REOPENED → RESOLVED
Closed: 11 years ago11 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: