Closed Bug 291386 Opened 19 years ago Closed 3 years ago

Timers fire way too quickly in some cases

Categories

(Core :: XPCOM, defect, P3)

x86
Windows XP
defect

Tracking

()

RESOLVED WORKSFORME

People

(Reporter: darin.moz, Unassigned)

References

Details

(Whiteboard: [FF2][no l10n impact])

Attachments

(3 files)

Timers fire way too quickly in some cases.

During heavy page load, our timer code gets horribly confused and starts firing
timers very quickly (a requested delay of 50ms becomes 1ms in some cases).  This
happens with all types of timers, including TYPE_ONE_SHOT!

From a quick glance at TimerThread.cpp, I suspect the problem has to do with
mTimeoutAdjustment growing wildly large.

My code runs a timer on the main thread, and records the interval between Notify
calls.  I exercised it while the browser was loading www.cnn.com and other large
pages.  On those pages, I expect large delays before my Notify method is called,
but I do not expect Notify calls to come in so much faster than the requested 50ms.
Attached file timeout values
This attachment contains a log of timeout values generated using a
TYPE_ONE_SHOT timer, where I created a new timer each time the previous one
fired.	Each timer was configured to fire after a 50ms delay.  Notice that the
observed intervals however frequently dip below 50ms.
darin: can you instrument the filter a bit to show mTimeoutAdjustment?  It's
trying to keep deadlines met, even if that means shrinking delay a bit.

Probably this IIR filter is just hopelessly wrong in light of the *hard to fix*
design flaw and unintended consequence of pavlov's timer rewrite of late 2001:
by using PR_WaitCondVar, timers now have up to 10ms of error.  Before, with
event- or message-loop native timers, it was more like 1ms.

Ideas on fixing the underlying problem?

If there's a bug in the filter implementation, let's fix it too.

/be
I'll try to instrument the code when I can.  Right now, I'm just testing Firefox
1.0.  I think the solution involves not firing ONE_SHOT and REPEATING_SLACK
timers faster than the interval requested.  1ms vs 50ms is a big deal.  The
current algorithm probably only makes sense for REPEATING_PRECISE.
Note that DOM timeouts and intervals _do_ want this adjustment, generally
speaking, and are ONE_SHOT timers (timeouts have to be; intervals are probably
done that way to share code with timeouts as much as possible?  Or because they
do some of their own adjustment for delays?  See
http://lxr.mozilla.org/seamonkey/source/dom/src/base/nsGlobalWindow.cpp#5252)
> Note that DOM timeouts and intervals _do_ want this adjustment, generally

Are you saying that it is okay for setTimeout(foo, 50) to call foo 1 ms later?
Ah, it looks like the DOM timer code deals with timers firing too early:
http://lxr.mozilla.org/seamonkey/source/dom/src/base/nsGlobalWindow.cpp#5105
> Are you saying that it is okay for setTimeout(foo, 50) to call foo 1 ms later?

No, but by far the #1 use of setTimeout is for the foo to set the same timeout
again.  In that case, we really do want to average out to the timeout length in
the long run, or at least try to.

But yes, it's a problem in general...
> No, but by far the #1 use of setTimeout is for the foo to set the same timeout
> again.  In that case, we really do want to average out to the timeout length in
> the long run, or at least try to.

I think the DOM code (setTimeout) does not have the problem described in this
bug.  It seems to have code to workaround the bug I'm talking about here.

I think that each ONE_SHOT timer should be independent of previous ONE_SHOT
timers.  It's absurd for there to be any shared state between different nsITimer
instances.  We need to ensure that ONE_SHOT timers fire no faster than the
requested timeout (within some delta at least).  For TYPE_REPEATING_PRECISE and
TYPE_REPEATING_SLACK there should be some minimal allowed timeout for when we
are in 'catch up' mode.  1 ms is a bad choice for that minimum ;-)
> I think that each ONE_SHOT timer should be independent of previous ONE_SHOT
> timers.

To a certain extent... But the problem this code is trying to address is the
following.  Say we are at time X.  We are asked to fire a timer at time X+N.  We
know that in general there are M units of systematic lag in the system so that
if we set the timer for X+N it'll fire at X+N+M.  So we should really set the
timer for X+N-M so that it will fire at the right time.  This is the general
idea.  The rest is implementation -- to measure the systematic lag we use past
timers.

The problem we have is that the lag is not quite systematic and is largely due
to the poor resolution of the PR_WaitCondvar code.  So we can end up with N
approximately equal to M and a call aligned on 10ms boundaries in such a way
that for it the lag is close to 0.  Then we get the effect you describe.

The solution should be to fix the resolution of the timer code and then
reevaluate how good a job the current algorithm does when it doesn't have to
contend with people wanting timers that fire every 10ms using a timing mechanism
that is only accurate to 10ms....
Did the system as a whole regress, without the timer code changing?  Darin,
would you be able to provide the patch you used to test, so helpers can try it
on older milestones?

/be
bz,

i appreciate that the code is trying to make up for delays that may be observed
with past timers, and that is a good thing.  but, we know when we are about to
issue a timer's callback, and we know when the consumer initialized the timer. 
we can therefore determine that it is not yet time for the timer to expire. 
(the DOM setTimeout code does exactly this!)  for example, we can tell that 1ms
is no where near to the 50ms requested.  we can make the timer wait a little
longer before issuing the timer's callback in an attempt to better hit the
desired timeout value.  this is what the DOM code does, and i think it is what
we should do in the nsITimer code.

i doubt this problem has anything to do with the resolution of PR_WaitCondVar. 
that may be a problem that matters for much shorter (than 50ms) timeouts.  i'm
pretty sure the main problem with timers is that we simply fail to process timer
PLEvents on the main thread when we are deep inside layout for some lengthy
period of time.  recall: the problem described in this bug occurs when the
system is heavily loaded (while trying to render cnn.com for example).

i think we should solve this bug by implementing logic that makes the timer code
go back and wait again if it has expired significantly too soon (based on the
predictive logic).

brendan: my test code is part of a larger extension.  but, it's easy enough to
make a component that can be dropped into mozilla to test this bug.  i'll see
about putting something together for that.
pav and I were thinking starvation of main thread events is a problem now on
more sites than before, or some such change.  Or maybe it's always been a
problem, but we thought this would have been noticed by DHTML gurus sooner.

Darin: agreed on relieving every nsITimer user from having to reinvent the DOM
timeout/interval code that avoids running too soon.  Any impulse response filter
is not going to cope well with starvation-induced delays of the order you saw. 
So we should fix this feedback mechanism and push the DOM logic down into XPCOM
at the same time, whether we fix the starvation problem any time soon, or not.

Pavlov is probably not the right owner.  I won't have time for this till later
in the week.  Guess we can live with it for 1.1a/1.8b2.

/be
Darin, I see what you're saying.  Yeah, your proposal makes sense.
Download this JS file, edit the LOG_PATH variable, and then drop it into your
components directory.  To reproduce excessively small timeouts, load
www.cnn.com.  Then click the link to the CNN money page.  Now, press back and
forward a bunch of times switching between those two pages.  Then take a peak
at the log file.

On my system, I get output that looks like this:

40
40
40
491
40
40
30
521
20
0
10
10
20
30
40
40

Normally, when the browser is just idling, the timeouts hit the 50 ms mark
pretty reliably.
reassigning to brendan ;-)
Assignee: pavlov → brendan
Good idea!

/be
Status: NEW → ASSIGNED
Flags: blocking1.8b3+
Priority: -- → P2
Target Milestone: --- → mozilla1.8beta3
Flags: blocking1.8b4+
Flags: blocking1.8b3-
Flags: blocking1.8b3+
Blocks: 21762, 203448
Whiteboard: [no l10n impact]
Sorry to come in here with an issue that might or might not be topical..

I am noticing that on ver 20050713 of Thunderbird ( 1.0+ Alpha 2 ), after
*something happens*, timers are triggering at nearly 8 seconds, no matter what
the setTimeout OR setInterval interval parameter is set to..

I can enter any of these in the JS console and it will trigger in ~ 8secs

setTimeout(alert,10000,'triggered');
setTimeout(alert,20000,'triggered');
setTimeout(alert,30000,'triggered');
setTimeout(alert,40000,'triggered');
setTimeout(alert,50000,'triggered');

If I close down and then restart TB, timers are not firing early ( until that
*something* happens again...

Is this the same issue or is there a bug filed for this issue
brendan, are we likely to get a not risky patch in time for b4? This seems like
it probably shouldn't block us. 
This can be done for 1.8b4, by imitating with the DOM setTimeout/setInterval
code does to deal with early timers (formerly from buggy OSes such as Win2k,
when we used native widget-kit timers instead of Pav's XPCOM portable timers).

/be
Whiteboard: [no l10n impact] → [no l10n impact] [ETA 8/22]
Target Milestone: mozilla1.8beta3 → mozilla1.8beta4
Blocks: 280922
Flags: blocking1.8b5+
Flags: blocking1.8b5+
Can't block at this point without a patch.  If someone has time to figure out a
solution here, please take this bug.

/be
Flags: blocking1.8b5+
*** Bug 312972 has been marked as a duplicate of this bug. ***
Blocks: 306834
This hit us (Sunbird/Calendar/Lightning) pretty hard in bug 306834, resulting in a pretty ugly workaround hack (see attachment 201791 [details] [diff] [review]). Therefore we would greatly appreciate a fix here.
This bug also causes ChatZilla to go completely numb in some cases.

Timers start firing basically instantly, which means that the automatic who check (every 2m) and automatic ISON check (every 5m) start spamming the send queue and we quickly end up with tens of thousands of items queued, which means anything you try to send will basically never get there (we send 1 item ever 1.5s).

We also don't ping out because ping responses are handled independantly from the sendqueue.

Needless to say, like the sunbird people, we would *really* appreciate a fix.
Blocks: 319219
Removing clearly bogus ETA.
Whiteboard: [no l10n impact] [ETA 8/22] → [no l10n impact]
changing sev to critical because, as comment 23 notes, chatzilla becomes useless when sending to server ceases (bug 319219). cpu loop also probable.

question - if sendqueue length goes very high what might happen?  For example 200,000 or a million?


(?milestone mozilla1.8beta4 is past?)
Severity: normal → critical
Note that setTimeout has a similar problem (bug 318419 and bug 312972), despite the code Darin mentioned in comment 6.
Blocks: 318419
How certain is it that load - previously described as "heavy page load" and "system is heavily loaded" - is a necessary condition to induce this problem?  

When chatzilla induced timer bug hits (bug 319219) the system is often unattended, idle.

Other symptoms when chatzilla tanks - in the browser:

* tab switch very slow (10-30 sec)
* URLbar contents lag when window finally displays after tab switch  (10-30 sec)
* scrolling page is choppy
* URLbar autocomplete stops working
* mouseover / hover does not display title / tooltip 
(In reply to comment #27)
> How certain is it that load - previously described as "heavy page load" and
> "system is heavily loaded" - is a necessary condition to induce this problem?  

This might be a different bug altogether, but in my experience, load has nothing to do with it. I have a page which runs a function, using setInterval(), twice a minute; the function in turn does an AJAX call. It turns out that whenever unrelated script.aculo.us effects (which steps through animation frames using setInterval()) are set in motion, the first function's scheduling becomes erratic and unpredictable, being called much too soon. For this reason I have been forced to add a guard which skips the interval logic if the time difference is too low. I notice that script.aculo.us effects in general are choppy (as opposed to smooth) on Firefox.
This bug affected me 3 days in a row with ChatZilla on XULRunner.
ssieb suggested that I comment out the first if block in NSPR's ntinrval.c <http://lxr.mozilla.org/seamonkey/source/nsprpub/pr/src/md/windows/ntinrval.c#59>, and after making that change I haven't seen the bug occur for 4 days.
Since the original comment mentions mTimeoutAdjustment, I think this is a (partially) relevant...

There I was, logging nsTimerImpl:5, and I noticed that all the timers that fire before smoothSlack (and thus mTimeoutAdjustment) are set up [1] were all only 0-2ms late - pretty damn good.

However, once smoothSlack got involved, the timers started varying wildly between 8ms *early* and 50ms late mostly (this was with nothing appreciable going on).

smoothSlack itself varied between -600 and -250 for most of it, which seems ok (equates to attempting to wake up 4-10ms early).

[0] FWIW, this is Windows XP 64bit, on an AMD Athlon 64 3200+ with:
  QueryPerformanceFrequency: 2009000000
            _nt_ticksPerSec: 61309
               _nt_bitShift: 15

[1] They wait until they have a certain number of samples first.
No longer blocks: 319219
Blocks: 317491
Anyone want to take this bug from me?  I'll advise.  I've been a bad owner and I don't expect to have the time to fix it, important though it is.  That makes me a useless pointy-hair, I guess.

/be
For the record, AFAIK the DOM code doesn't do anything to adjust for timers that fire early. What it *does* do though is to make sure it always fires JS timeouts in the order for which they were scheduled by the page, even if that means firing the early due to an OS timer for a JS timeout fired early (and before other OS timers fired).
OK -> me
Assignee: brendan → darin
Status: ASSIGNED → NEW
Target Milestone: mozilla1.8beta4 → mozilla1.9alpha
Status: NEW → ASSIGNED
Thanks, Darin!  If we end up with a well-baked patch for this, I would be supportive of it on 1.8.0.x as well as 1.8.1; it's a pretty sour bug.
Whiteboard: [no l10n impact] → [FF2][no l10n impact]
Priority: P2 → P1
Target Milestone: mozilla1.9alpha → mozilla1.8.1
So, I think I found the cause of this bug.  It's basically triggered by the fact that all timer objects are by default restricted to only firing when the main loop is idle.  As a result, timers simply get horribly delayed during page load, especially over a fast network connection.  This is mainly an issue on Windows because we only do idle timers on Windows.

I think we need to account for this explicit starvation when we update the TimerThread's filter.  I'm not sure what exactly though.
I noticed that places had to workaround this bug (see the patch in bug 318052) because suppressing timers during pageload (even chrome pageload) makes it very difficult to implement smooth drag-n-drop.
Well, timers don't work at all in windows during a drag, see bug 203573.
> Well, timers don't work at all in windows during a drag, see bug 203573.

I think that's because the timer events, which drive marquee, are being suppressed because they are only permitted to fire at idle time.  During a drag operation, the main event loop is very busy servicing the mouse movement enents!  Setting a timer to fire normally, and not just when the main event loop is idle, is the only way to workaround this problem.  I think it's really bad that the behavior of the timer system is so platform dependent.  We should try to at least make the timer behavior similar on all platforms, especially given the fact that the timer system was designed to be cross-platform ;-)
Depends on: 129953
Depends on: nsIThreadManager
This bug will be fixed by my patch for bug 326273.
Priority: P1 → P2
*** Bug 335968 has been marked as a duplicate of this bug. ***
Actually, my patch for bug 326273 does not completely resolve this bug.  It is still possible for the filter to force timers to fire too quickly.  I notice the problem on Linux mostly where we have difficulty maintaining 10ms timers on DHTML workloads.

See this testcase:
http://wd-testnet.world-direct.at/mozilla/dhtml/funo/timeoutTest.htm
Assignee: darin → nobody
Status: ASSIGNED → NEW
QA Contact: xpcom
Target Milestone: mozilla1.8.1 → ---
[Mozilla/5.0 (Windows; U; Win98; en-US; rv:1.8.1b1) Gecko/20060728 SeaMonkey/1.1a] (nightly) (W98SE)

(I tried Windows suspend mode again: "bug 335968" still there.)
[Mozilla/5.0 (Windows; U; Win98; en-US; rv:1.8.1) Gecko/20061002 SeaMonkey/1.1b] (nightly) (W98SE)

(I tried Windows suspend mode again: "bug 335968" still there.)
Setting multiple setInterval()'s on one page causes timers to behave erratically (oscillate between near instant firing, and very delayed firing).  I think this is related to this existing bug, so I'm attaching the test case here.  You can also run this test from:

http://pages.mckoss.com/setIntervalTest.htm
(I wonder if bug 376643 fix will help here ?)
Adding bug 329775, which has been fixed with a workaround. But it would be much better, if we could remove the workaround, when this is fixed here.
Blocks: 329775
Why can't this bug be fixed? It's open for YEARS now.
The first time I tried the testcase, it worked fine.  But repeatedly after that, I get something like:
f1: 998ms
f2: 999ms
f1: 1504ms-> Greater than 10% error!
f2: 1504ms-> Greater than 10% error!
f1: 489ms-> Greater than 10% error!
f2: 489ms-> Greater than 10% error!
f1: 1005ms
f2: 1005ms
f1: 1001ms
f2: 1002ms
f1: 1003ms
f2: 1003ms
f1: 997ms
f2: 1000ms
f1: 1001ms
f2: 999ms
f1: 1001ms
f2: 1001ms
f1: 1000ms
f2: 1000ms
f1: 1000ms
f2: 1000ms

Always close to 1500 on the second firing and the third firing will of course be very close to 500 to compensate.  The rest are all very close to 1000.  I'm using a somewhat recent trunk Seamonkey.
I noticed this problem for years and forgot about it until this week. I'm having this problem, to the point I can't use firefox for my project anymore.

I need it for the timeouts and requests for XmlHttpRequest. Which is working fine on all other browsers (MSIE, Opera, Safari) except for firefox 2.0.7.

If it worked like samuel above wrote, than I could live with it, it doesn't look good however having a div-tag blink like that. 

But having 3 or 4 timers, some just stop working after a few seconds... 
On latest trunk seamonkey on Linux, the worst I get is:
f1: 998ms
f2: 999ms
f1: 1024ms
f2: 1024ms
f1: 975ms
f2: 976ms
f1: 1001ms
f2: 1002ms
f1: 1000ms
f2: 1000ms
f1: 999ms
f2: 1000ms
f1: 1001ms
f2: 1001ms
f1: 1000ms
f2: 1001ms
f1: 1001ms
f2: 1000ms
f1: 999ms
f2: 1000ms
f1: 999ms
f2: 1000ms

I wonder why the second firing for both timers is off like that.
Hi Samuel, if I look at your list #50, I you would be very happy, what i'm getting with firefox 2:

f1: 1000ms
f2: 1000ms
f1: 0ms-> Greater than 10% error!
f2: 0ms-> Greater than 10% error!
f1: 2016ms-> Greater than 10% error!
f2: 2016ms-> Greater than 10% error!
f1: 0ms-> Greater than 10% error!
f2: 0ms-> Greater than 10% error!
f1: 1984ms-> Greater than 10% error!
f2: 1984ms-> Greater than 10% error!
f1: 0ms-> Greater than 10% error!
f2: 0ms-> Greater than 10% error!
f1: 2016ms-> Greater than 10% error!
f2: 2016ms-> Greater than 10% error!
f1: 0ms-> Greater than 10% error!
f2: 0ms-> Greater than 10% error!
f1: 2000ms-> Greater than 10% error!
f2: 2000ms-> Greater than 10% error!
f1: 0ms-> Greater than 10% error!
f2: 0ms-> Greater than 10% error!
f1: 2000ms-> Greater than 10% error!
f2: 2000ms-> Greater than 10% error!

Which is close, to totally useless...
I saw the text at the attachment already said the same thing. The other browsers in the field (IE, Opera, Safari) manage to get a perfect 1000 for most almost every events... 

Would be nice to have firefox show atleast 1000, not 999 or 998, that should be possible I think... Sometimes show a little bit more like 1030 could be excused because of some other thing happening at the same time in the computer.
any update on this bug? this is the first time i'm having to write workaround code for firefox and it is not pleasant!
well it works fine in firefox 3, ff2 is a complete catastroph on this point...
On ff3 (3.0.1) on windows XP I got this on one run:

f1: 993ms
f2: 993ms
f1: 993ms
f2: 994ms
f1: 1344ms-> Greater than 10% error!
f2: 1344ms-> Greater than 10% error!
f1: 667ms-> Greater than 10% error!
f2: 667ms-> Greater than 10% error!
f1: 973ms
f2: 973ms
f1: 1015ms
f2: 1015ms
f1: 1013ms
f2: 1014ms
f1: 1019ms
f2: 1019ms
f1: 976ms
f2: 976ms
f1: 996ms
f2: 996ms
f1: 1000ms
f2: 1000ms

Though I couldn't reproduce it.

Also, FF3 still fires timers too early repeatedly.  See this test page:

http://www-cs-students.stanford.edu/~eparker/files/other/test_settimeout.html
Moving to p3 because no activity for at least 1 year(s).
See https://github.com/mozilla/bug-handling/blob/master/policy/triage-bugzilla.md#how-do-you-triage for more information
Priority: P2 → P3

Asa can you help us with this bug? Can this be considered closed?

Flags: needinfo?(asa)

I ran the test case from comment 44 and the results look good to me:

f1: 1001ms
f2: 1001ms
f1: 1002ms
f2: 1002ms
f1: 1004ms
f2: 1004ms
f1: 1002ms
f2: 1002ms
f1: 1004ms
f2: 1004ms
f1: 1000ms
f2: 1000ms
f1: 1006ms
f2: 1006ms
f1: 1000ms
f2: 1001ms
f1: 1006ms
f2: 1005ms
f1: 1000ms
f2: 1000ms
f1: 1003ms
f2: 1003ms

Status: NEW → RESOLVED
Closed: 3 years ago
Flags: needinfo?(asa)
Resolution: --- → WORKSFORME
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: