Closed Bug 138791 Opened 18 years ago Closed 18 years ago

nsTimer does not handle interval rollover

Categories

(Core :: XPCOM, defect, P1)

defect

Tracking

()

VERIFIED FIXED
mozilla1.0

People

(Reporter: emaijala+moz, Assigned: emaijala+moz)

References

Details

Attachments

(5 files, 4 obsolete files)

I was debugging biff to find out why it stops working after a while and found
out that if a timer is added when PR_IntervalTime()+delay is more than
0xFFFFFFFF, the timer will never fire. I'm in the process of making this work.
Blocks: 124051
*** Bug 138789 has been marked as a duplicate of this bug. ***
Summary: nsTimer doesn't handle interval rollover → nsTimer does not handle interval rollover
There seems to be another problem also. If an existing timer is given a new 
delay, it might be set to past. I will need to ask the author of this change 
why it is done the way it is (I think author is brendan@mozilla.org).
The reason NS_TYPE_REPEATING_PRECISE timers update their timeout based only on
the programmed delay is so that they don't accumulate error (the skew between
when nsTimerImpl::PostTimerEvent enqueues a plevent and when nsTimerImpl::Fire
calls SetDelayInternal for an NS_TYPE_REPEATING_SLACK timer.

Oops, a caller to SetDelay for an NS_TYPE_REPEATING_PRECISE timer should update
mTimeout = now first.  Patch coming up.

/be
Does this help?

I wonder if someone is using NS_TYPE_REPEATING_PRECISE when they ought to be
using NS_TYPE_REPEATING_SLACK.	PRECISE combined with event-loop hogging (say,
by Gecko) could easily lead to timers always firing late.

/be
I'll try it. There's also another case: anyone who calls SetDelay() for an
existing timer. There are many of these and I think it's correct from the caller
to assume that calling SetDelay() is sufficient.
Comment on attachment 80301 [details] [diff] [review]
proposed fix to repeating precise timer overflow

Note: this should fix the precise timer but not the original issue
Attachment #80301 - Attachment description: proposed fix → proposed fix to repeating precise timer overflow
Why is the precise timer reinitialized in PostTimerEvent() and also in Fire()?
There's also another problem in Fire(): mDelay should be converted to interval
before decrementing mTimeout, right? Could this cause the delay problem seen in
117061 as milliseconds are (in Win2K at least) a lot less than same time in
intervals?

My suggestion would be to always set mTimeout to now + delay in
SetDelayInternal() unless it is precise repeating timer and mTimeout != 0. How
does that sound?

Note: the changes to make timers fully handle interval rollovers are many. I'll
get a patch together soon when these issues are solved.
This is starting to have quite a few things that might affect 117061 also. 
Blocks: 117061
Blocks: 129953
I guess it's time for me to take this..
Assignee: dougt → ere
Brendan, 

I had to already get rid of 'if (mTimeout < now) mTimeout = 0xffffffff;' in
SetDelayInternal to be able to even try handling rollover. I will create a patch
today (which for me is Tuesday)
Sorry, the reinitialization part of my comment #7 is bogus.
This patch also incorporates Brendan's patch. I'd love to have input about it.
Attachment #80301 - Attachment is obsolete: true
The reason the 'if (mTimeout < now) mTimeout = 0xffffffff;' code was introduced
(by me, I'm afraid) was that some web pages were setting huge refresh timeouts
that overflowed the timer and led to immediate refresh (in an infinite loop). 
See bug 124103, which has a description of how to construct a testcase in case
the original page has changed.

Does this patch handle that situation gracefully?  If not, we should consider
moving that logic out of timers and into the docshell, where the page-refresh
timer gets set....
Yes, they are now handled differently as the timer thread can now handle timers
that won't fire before the interval has rolled over. The problem before was that
when the interval time was near the upper boundary, even a moderately short
timeout could overflow it. 
I've been using this now for one and a half days. Feels good.
Attachment #80553 - Attachment is obsolete: true
Comment on attachment 80766 [details] [diff] [review]
Same patch diffed against current cvs

Add a 1-line comment to AddTimerInternal() documenting the return values and
r=rjesup@wgate.com (I'm also going to mark this as All instead of Win2000).
Attachment #80766 - Flags: review+
Thanks, comment added.
I'm sorry I got behind on my bugmail, because I've been hacking a patch for this
that's a little bit simpler, although I didn't catch the lack of mDelay scaling
when decreasing the timeout local for the NS_TYPE_REPEATING_PRECISE case that
Ere fixed in nsTimerImpl::Fire; nor did I see the missing Notify calls that he
added (pavlov!!!).  I've incorporated Ere's fine fixes into my patch.

My patch clamps the maximum delay at just under half the PRIntervalTime domain,
so that no two timers (any two) can differ by 0x80000000 or more.  Thus, to tell
whether a timer t is "less than" a timer u (t's deadline comes before u's), we
can test (t < u || t - u > 0x80000000).  I think it's best to clamp delay at
some reasonable upper bound, rather than pretend that we implement arbitrarily
large delay correctly.  Without a larger PRLongIntervalTime from NSPR, or a
wraparound callback, we can't distinguish wrapping twice from once.

Patch coming up, comments and testing appreciated.

/be
If this works as well, I favor it (not out of ego, I'm very grateful to Ere for
quickly finding so many latent bugs, as well as addressing wraparound directly)
because it avoids adding a member to nsTimerImpl, and because the
TIMER_LESS_THAN condition is cheaper than the triple-disjunction test in Ere's
patch.

But I sure hope my line-number proof is correct!

/be
Pav did point out yesterday, that PRIntervalTime is higher resolution than ms by
1 to 2 decimal orders (per NSPR min and max), so we could raise the upper bound
on aDelay that my patch implements by storing mTimeout in ms, not in interval
ticks.  I didn't make that change, but I could if people think it's important
for 1.0.  The main thing for 1.0 is to fix testcases in dependent bugs, I say.

/be
Direct mozilla1.0 RC2 dependency, and mozilla1.0+ preapproval to fix.

/be
Blocks: 138000
Keywords: mozilla1.0+
Comment on attachment 80796 [details] [diff] [review]
alterna-patch incorporating Ere's non-wraparound fixes

sr=waterson. I'd make TIMER_LESS_THAN be a `static inline' fn (instead of
macro) to avoid double-evaluations.
Attachment #80796 - Flags: superreview+
Comment on attachment 80796 [details] [diff] [review]
alterna-patch incorporating Ere's non-wraparound fixes

r=rjesup@wgate.com if a comment documenting AddTimerInternal's return values is
added
I'll check this into the trunk later today if everyone's ok with it.  I still
crave Ere's review.

/be
Attachment #80796 - Attachment is obsolete: true
This is interesting. I wrote a comment before, but I've no idea where it went. A
A few things: 

I strongly suggest we also take my change from SetDelayInternal() where mTimeout
is always set unless it is a precise timer and mTimeout is already set. i.e.:
-  if (mTimeout == 0 || mType == NS_TYPE_REPEATING_SLACK)
-    mTimeout = now;
+  if (mTimeout == 0 || mType != NS_TYPE_REPEATING_PRECISE)
+    mTimeout = now;

While debugging the code (I've used a good amount of time in that :) I saw
underflows before this change. 

Another problem is that with this code it is still possible for a timer near the
0xFFFFFFFF boundary to not fire, right? If PR_IntervalNow() rolls over before
the thread checks for this, it's too late and the timer is left dangling. This
will kill biff and others (if there are such) that rely on a timer firing.

About resolution, the smallest interval resolution is 1000 ticks per second.
Isn't that exactly the same as milliseconds? 
Wow, I was not thinking clearly.

Regarding the 1ms minimum resolution of PRIntervalTime, Ere is also right -- I
was going by pav's whiteboard numbers, which were 10000 and 100000 -- should be
1000 and 100000 ticks/sec.  So we should store mTimeout in ms, always, but I'll
leave that for the 1.1alpha trunk.

Resoliciting r= and sr=.

/be
Attachment #80808 - Attachment is obsolete: true
Comment on attachment 80842 [details] [diff] [review]
Ere scores again: TIMER_LESS_THAN was wrong, plus the SetDelayInternal ONE_SHOT fix

sr=waterson
Attachment #80842 - Flags: superreview+
Comment on attachment 80842 [details] [diff] [review]
Ere scores again: TIMER_LESS_THAN was wrong, plus the SetDelayInternal ONE_SHOT fix

Add that comment documenting the return values from AddTimerInternal() for real
this time, and r=rjesup@wgate.com

Good catch by ere
Attachment #80842 - Flags: review+
Comment on attachment 80842 [details] [diff] [review]
Ere scores again: TIMER_LESS_THAN was wrong, plus the SetDelayInternal ONE_SHOT fix

a=blizzard on behalf of drivers for the 1.0 branch
Attachment #80842 - Flags: approval+
Just a comment change, nothing else to see.

/be
Checked into trunk and 1.0 branch.

Ere, can you mark VERIFIED once you've verified with biff?  Many thanks,

/be
Status: NEW → RESOLVED
Closed: 18 years ago
Resolution: --- → FIXED
I've run my own build for two days without problems. I'll run the trunk build
overnight and verify then.
adding branch resolution keyword. 
Keywords: fixed1.0.0
Marking verified. Of course this is difficult to verify 100% but it has been
working for several days now.
Status: RESOLVED → VERIFIED
Just a nitpick:

u - t < DELAY_INTERVAL_LIMIT (u <> t) is equivalent to
~(u - t) > ~DELAY_INTERVAL_LIMIT (u <> t) is equivalent to
(t - u - 1) > (DELAY_INTERVAL_LIMIT - 1) (u <> t) is equivalent to
t - u > DELAY_INTERVAL_LIMIT (u <> t)

Since u - t < DELAY_INTERVAL_LIMIT is never evaluated when u == t the condition
is unnecessary and t - u > DELAY_INTERVAL_LIMIT can be used in all cases.
I think that's not true. Say, if t=0xfffffff0 and u=1, TIMER_LESS_THAN must
return false (current interval rolled over before we managed to trigger the
timer at 0xfffffff0). Note that t can be > DELAY_INTERVAL_LIMIT and often is. 
*** Bug 127693 has been marked as a duplicate of this bug. ***
>     return (t < u)
>            ? u - t < DELAY_INTERVAL_LIMIT
>            : t - u > DELAY_INTERVAL_LIMIT;
Ere, when t=0xfffffff0 and u=1 we don't have t < u so this calculates t - u >
DELAY_INTERVAL_LIMIT anyway.
Crap, I had the variables mixed. New try:

Say, if t=1 and u=0xfffffff0, TIMER_LESS_THAN must
return false (current interval rolled over before we managed to trigger the
timer at 0xfffffff0). 
Ere, when t=1 and u=0xfffffff0 we have t < u, so you then calculate u - t =
0xffffffef which is not < DELAY_INTERVAL_LIMIT as you say. I say calculate t - u
= 0x11 which is not > DELAY_INTERVAL_LIMIT thus the answer is still correct.

This might be easier to understand if t and u were integers. Assume t = (int)t64
and u = (int)u64. We want abs(t64-u64) < DELAY_INTERVAL_LIMIT by definition of
wrap. Then if t64 < u64 then t64-u64 < 0 and (int)(t64-u64) < 0 and
(int)t64-(int)u64 < 0 thus t - u < 0!
Right you are, it seems. There was a problem with the original one, and the
latest version seemed to work in all cases, so I didn't even think that it might
have been even easier. Additionally, in comments #36 and #39 I had a bad thinko.. 

Should it be fixed? Brendan?
*** Bug 141131 has been marked as a duplicate of this bug. ***
neil: thanks, ere and I were cross-eyed from the case analysis, and missed the
neat fact that ~D == D - 1 if D is 0x80000000 and represented by an unsigned 32
bit type.  I'll fix fast with drivers' blessing.

/be
Looking to get this into 1.0 and the trunk.

/be
Not gonna reopen, because not broken -- just less efficient than the fix we want.

/be
OS: Windows 2000 → All
Priority: -- → P1
Hardware: PC → All
Target Milestone: --- → mozilla1.0
Comment on attachment 81911 [details] [diff] [review]
quick followup fix for efficiency, truth, and beauty

r=rjesup@wgate.com
Attachment #81911 - Flags: review+
Comment on attachment 81911 [details] [diff] [review]
quick followup fix for efficiency, truth, and beauty

sr=waterson
Attachment #81911 - Flags: superreview+
Comment on attachment 81911 [details] [diff] [review]
quick followup fix for efficiency, truth, and beauty

a=asa (on behalf of drivers) for checkin to the 1.0 branch
Attachment #81911 - Flags: approval+
Followup fix is in branch and trunk.  Thanks, neil@parkwaycc.co.uk.

/be
See bug 252324, which was left in the wake of this bug's patches.

/be
You need to log in before you can comment on or make changes to this bug.