Closed Bug 318419 Opened 19 years ago Closed 19 years ago

window.setTimeout() goes crazy periodically - functions are immediately called repeatedly

Categories

(Core :: DOM: Core & HTML, defect)

1.8 Branch
defect
Not set
major

Tracking

()

RESOLVED FIXED

People

(Reporter: tommut, Assigned: jst)

References

()

Details

(Keywords: fixed1.8.1, testcase, verified1.8.0.2, Whiteboard: [rft-dl])

Attachments

(2 files, 5 obsolete files)

User-Agent:       Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8) Gecko/20051111 Firefox/1.5
Build Identifier: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8) Gecko/20051111 Firefox/1.5

I have seen this every now and then since using the 1.5 betas and Release Candidates (and now the final build).  
I have an extension that sets a timeout to run a function every hour:
window.setTimeout(reminderFox_initializeReminderFoxHourly, 3600000);

Occasionally when that timeout is set, the target function gets called immediately several times.  Even more strange is that if I open a new tab, or change focus from the browser, the target function gets called again.  It seems something goes haywire with how the function is registered.  

I am not sure what causes this or when, but others have seen it when using my extension (reminderfox) as well.  And bug 316491 has some similarities, though in my case I do not have a really large timeout - the length does not seem to matter.  

The problem is pretty severe when it happens as it keeps calling the function many times and in my case, that function pops up an alert dialog so it's very noticable.


Reproducible: Sometimes
Version: unspecified → 1.5 Branch
Component: OS Integration → General
QA Contact: os.integration → general
Assignee: nobody → general
Component: General → DOM: Level 0
Product: Firefox → Core
QA Contact: general → ian
Version: 1.5 Branch → 1.8 Branch
This is a dupe of bug 312972 actually, but I'm not sure whether that bug is correctly duped to bug 291386, so I won't resolve this one yet.
I'm confirming this, because it is precicely this broken setTimeout/setInterval behaviour that causes problems in ChatZilla in bug 319219.

This is basically just bug 291386 (I don't think it's necessarily a dup though), even though bug 291386 comment 6 indicates it shouldn't happen to DOM timers.
Blocks: 319219
Status: UNCONFIRMED → NEW
Depends on: 291386
Ever confirmed: true
Alright, I think I got this one sussed. It's the DOM timer code in nsGlobalWindow failing to take into account the interval time (from NSPR) wrapping.

In the old days - and on some non-Windows platforms (Linux for example) - the value only wraps after 49.7 days (or thereabouts), but on a modern Windows system at least, the value NSPR uses can wrap in as little as 19 hours (not sure how much lower it can go - that is the wrap point for my system).

This is, it seems, totally unrelated to bug 291386 (which is about XPCOM timers), as they appear to be working fine throughout the wrapping (which messes up the DOM timers).

I will explain the details in a bit, after some more debugging...
does this point then to multiple issues affecting chatzilla?  because it cz sometimes has trouble within a few minutes of starting up.  Also, I'm shocked that I have 2 cz sessions up now for almost 2 weeks and not a hint trouble.
No, it's just one issue, but it's a rather odd one. Allow me to explain...

When a timeout (interval or normal) is added, or in the case of an interval re-scheduled, the DOM code stores when it is expected to fire in an unsigned-32bit int. This is fine normally.

If the NSPR interval time (used for measuring everything) is coming up to wrap (every 19 hours here, 49 days on some systems), the value of "now" gets close to 0xFFFFFFFF. This means that when DOM stores the expected trigger time, it will end up with a wrapped value (when now + delay > 0xFFFFFFFF), which is stored normally. This is where things go wrong...

The new timer is (obviously) not supposed to fire until after the 'now' value has wrapped, but it gets stored at the start of the timer queue with a very low expected fire time (because it wrapped). This means that the next time *any* timeout fires, the wrapped timeout will be considered "expired" (because it is earlier in the queue, and has a lower expected fire time), and will be fired. Ooops!

What this means for ChatZilla, is that if you are running /or start/ it within the 5 minutes before the timers wrap, it will go wrong, as there is a 5m timer (which then gets fired instantly until the 'now' time wraps). This applies to all timeouts and intervals, but ChatZilla is the most visible place to be affected.
I've managed to make it happen on Linux too.  It's not as easy or as obvious and it recovers much quicker because it uses a millisecond timer.
OS: Windows XP → All
Hardware: PC → All
this does sound like bug <a href="https://bugzilla.mozilla.org/show_bug.cgi?id=316491">316491</a>
Bug 316491 could be considered a duplicate of this one, but I would note that there is probably more to that bug than this one - this one is about nice, normal timers wrapping and causing problems (which can be fixed simply by observing the wrapping and handling it); that one is about timers which are too big to store in the first place (which would need much more drastic changes).

They both occur because the XPCOM and DOM timer code sucks, though. ;)
Ian, the problem is now well defined. Is there enough information to pass it to a developer?  Seems straight forward.

should the sunbird bug 306834 blocked by this rather than bug 291386, based on what we now know from James' report here about behavior with chatzilla and the "ugly patch" in bug 306834?
Keywords: testcase
there are a couple other places in the file where it needs to account for rollover, but this should fix the major bug
Comment on attachment 210086 [details] [diff] [review]
checks for rollover when inserting into the list

A much simpler method would be to use the TIMER_LESS_THAN macro from nsTimerImpl.h
Attachment #210086 - Attachment is obsolete: true
Flags: blocking1.8.0.2?
Samuel - is this something you could get an updated patch for in time for the next .0.x release - i.e. in the next couple o weeks?
Yes, I'm working on a patch for 64-bit timers.
Assignee: general → samuel
Status: NEW → ASSIGNED
Attached patch convert dom timers to 64-bit (obsolete) — Splinter Review
I couldn't convince the nspr people to let me add a PR_IntervalNow64 so I had to do it this way instead.  I've tested it at roll-over and everything seems to hold together.  The other side effect of this is that you can now create timeouts up to  the full range of a 32-bit signed integer which is around 24.5 days if you really want to. ;-)  (The previous limit could be as low as 6 hours on certain windows computers.)
Comment on attachment 210671 [details] [diff] [review]
convert dom timers to 64-bit

A patch from another bug conflicted with this one and I also didn't compile with DEBUG_TIMERS set.
Attachment #210671 - Attachment is obsolete: true
Attached patch updated patch (obsolete) — Splinter Review
Comment on attachment 210723 [details] [diff] [review]
updated patch

+PRUint64
+nsGlobalWindow::GetIntervalNow()
+{
+  static PRUint64 upper = 0;
+  static PRUint32 lastnow = 0;
+
+  PRUint32 now = PR_IntervalNow();
+  if (now < lastnow)
+    upper++;
+  lastnow = now;
+  return (upper << 32) + now;
+}

Seems like this patch relies on this function (and the one in the timer code as well) firing at least once between every time PRIntervalTimer's wrap around, otherwise timeouts scheduled more than one wrap out could fire late since the absolute time (upper + now) wouldn't match reality. Plus, "upper" could potentially remain unchanged over several PRIntervalTimer wraps even if this function is called if "now < lastnow" happens to not be true when it does get called (i.e. if the time since the last call to this method is greater than the time it takes for PRIntervalTime to wrap).

Also, the timer code in XPCOM can run on any number of threads (AFAIK), so updating static's w/o a lock isn't safe, and locking there is not necessarily something we'd want to do.

Seems like we need another approach here to solve this :(
As far as I can see, there are only two other solutions:
 A. Make the DOM code handle wrapping, a la XPCOM timers.
 B. Make NSPR return a 64bit time.

The shortest (theoretical) wrap period for NSPR currently is 11h55m. Although I'm not really keen on the idea of making the times 64bit by manually computing the high 32bits outside of NSPR, in practise I'm not sure it is really a problem. It's not my call, though, and I'd have gone for solution A above anyway.
Would need a trunk patch landed soon to get the required baking time, otherwise it misses the 1.8.0.2 train.
To the best of my knowledge there's two problems here.

The first one, which I'm arguing is in practical terms not a problem, is that our XPCOM timers can't fire farther out into the future than 2^32/PR_TicksPerSecond() seconds, which in the worst case is 11h plus change. I can't imagine this fact causing DOM timers to ever fire immediately etc.

The second problem is that the DOM code deals with PRIntervalTime, but it doesn't at all take into account that PRIntervalTime's can wrap fairly easily (depending on the time of day and the requested delay, essentially). The problem here is that when a DOM timeout is scheduled, we set mWhen in the DOM timeout to PR_IntervalNow() + delay, which can wrap and leave mWhen way in the past (relative to a current PR_IntervalNow()). When this happens and any DOM timeout fires, any of the DOM timeouts whose mWhen wrapped (or most, in the most likely case at least) will fire since we walk the whole list of DOM timeouts looking for timeout's whose mWhen is less than PR_IntervalNow(). This could easily lead to DOM timeouts to fire at unexpected times, and when they do, a ton of them could fire at once.

This second problem is fixable in the DOM code by making the DOM code use PRTime instead of PRIntervalTime (this is only one of many ways around this, but it's the one I chose to test out). That way a DOM timeout's mWhen wouldn't wrap (for years at least), and once a DOM timeout does fire we'd only actually execute the DOM timeout's that have been scheduled to fire before the current time. I'll attach a patch that implements these changes in the DOM code in a bit.
This does what I talked about in my earlier comment.

James, does this make sense to you? Let me know if you see any problems with this approach, and I'd really appreciate if you could have a look at this and maybe even run it and try to reproduce the problems you've seen in the past.
Attachment #211103 - Flags: superreview?(brendan)
Attachment #211103 - Flags: review?(silver)
JST, there really was no need to repeat the problem I described a month a go in this bug. We all all well aware of the problem here.

As for the patch, if it works fine. I would rather have just fixed the wrapping.

Also, I can't easily test this any more because NSPR has removed the code that wrapped frequently, so the original problem has basically gone away (the bug remains, of course).
I guess I'll just back out the rather pointless NSPR changes and test this patch. Easier than trying to explain why they shouldn't have changed NSPR. ;)
Comment on attachment 211103 [details] [diff] [review]
Change nsTimeout::mWhen to PRTime to avoid firing DOM timeouts early when PRIntervalTime wraps.

>       // Add a "secret" final argument that indicates timeout lateness
>       // in milliseconds
>-      PRIntervalTime lateness =
>-        PR_IntervalToMilliseconds(now - timeout->mWhen);
>-      timeout->mArgv[timeout->mArgc] = INT_TO_JSVAL((jsint) lateness);
>+      nsInt64 lateness = now - timeout->mWhen;
>+
>+      timeout->mArgv[timeout->mArgc] =
>+        INT_TO_JSVAL((jsint)(lateness / PR_USEC_PER_MSEC_64));

mozilla\dom\src\base\nsGlobalWindow.cpp(6208) : error C2440: 'type cast' : cannot convert from 'const nsTInt64' to 'jsint'
        with
        [
            T=PRInt64
        ]
        Ambiguous user-defined-conversion

Ideas?
Sorry for repeating what had mostly already been said, just wanted to make it clear what I was doing in the context of this patch in my own words as well.

As for the compiler error, try changing the offending line(s) to:

      timeout->mArgv[timeout->mArgc] =
        INT_TO_JSVAL((jsint)(PRInt32)(lateness / PR_USEC_PER_MSEC_64));

nsInt64 really sort of sucks, but so do all our other portable 64-bit helpers.
Attachment #210723 - Attachment is obsolete: true
Comment on attachment 211103 [details] [diff] [review]
Change nsTimeout::mWhen to PRTime to avoid firing DOM timeouts early when PRIntervalTime wraps.

r=silver with the one extra typecast to compile.

I'm not a huge fan of all those 32/64 typecasts, but I'm not a huge fan of C++ either. The patch does fix the problem here, in so far as it has stopped the timers spamming on the original wrap points, and I haven't seen it do it at any other time, either.
Attachment #211103 - Flags: review?(silver) → review+
Attached patch Same change w/o nsInt64 (obsolete) — Splinter Review
This is the same change as the previous (with the exception of the removal of a redundant delay < 0 check) w/o the mess that nsInt64 introduces. We no longer need nsInt64 to work with PRTime since all compilers we now care about have native 64-bit values that work right.
Attachment #211806 - Flags: superreview?(brendan)
Comment on attachment 211806 [details] [diff] [review]
Same change w/o nsInt64

>+  // Non-zero interval in milli seconds if repetitive timeout
>   PRInt32 mInterval;
> 
>-  // Nominal time to run this timeout
>-  PRIntervalTime mWhen;
>+  // Nominal time (in micro seconds since the epoch) to run this timeout

Nit: milliseconds and microseconds are each one word ;-).

>         // Reschedule the OS timer. Don't bother returning any error
>         // codes if this fails since nobody who cares about them is
>         // listening anyways.
>         nsresult rv =
>-          timeout->mTimer->InitWithFuncCallback(TimerCallback, timeout, delay,
>+          timeout->mTimer->InitWithFuncCallback(TimerCallback, timeout,
>+                                                delay / PR_USEC_PER_MSEC,
>                                                 nsITimer::TYPE_ONE_SHOT);

Here (above)...

>+    PRUint32 delay =
>+      PR_MAX(((PRUint32)(t->mWhen / PR_USEC_PER_MSEC)),
>+              DOM_MIN_TIMEOUT_VALUE);
>+
>+    // Set mWhen back to the time when the timer is supposed to
>+    // fire.
>     t->mWhen += now;
> 
>     t->mTimer = do_CreateInstance("@mozilla.org/timer;1");
>     NS_ENSURE_TRUE(t->mTimer, NS_ERROR_OUT_OF_MEMORY);
> 
>     rv = t->mTimer->InitWithFuncCallback(TimerCallback, t, delay,
>                                          nsITimer::TYPE_ONE_SHOT);
>     if (NS_FAILED(rv)) {
>       t->mTimer = nsnull;

and here, it seems possible that delay/t->mWhen could be > 2^32 * 1e6 (current PRTime plus 200 years is 53 bits, fits in a JS double, but overflows a PRUint32 by one bit when divided by 1000000).  Check my math, please.  This suggests having a max delay constant too.

I am sorry I didn't follow up on this problem in the code -- I noticed it when reviewing bryner's bfcache changes, which led to bug 306759 being filed.  Still on my list, I'm duping it against this one now.

/be
*** Bug 306759 has been marked as a duplicate of this bug. ***
I wrote:

> and here, it seems possible that delay/t->mWhen could be > 2^32 * 1e6 (current

The / between delay and t->mWhen meant "or", not "divided by".

/be
provisionally want, pending these patches being landed and tested elsewhere
Flags: blocking1.8.0.2? → blocking1.8.0.2+
I'm one of those routinely affected by bug 319219 using the 1.8 branch nightlies on OS/2. It just happened again minutes ago using yesterday's, so I'd like to see this checked in there to see if it's a fix that works.
This prevents the given interval from ever causing PRTime to wrap or for the difference between now and when the timer should fire from ever being set to a value greater than what our timer code can handle (which is system dependent due to the timer code's use of PRIntervalTime).
Attachment #212163 - Flags: superreview?(brendan)
Comment on attachment 212163 [details] [diff] [review]
Fix comments, and prevent the interval from wrapping as well.

sr=me, thanks.

/be
Attachment #212163 - Flags: superreview?(brendan) → superreview+
Taking bug.
Assignee: samuel → jst
Status: ASSIGNED → NEW
Fixed on the trunk.
Status: NEW → RESOLVED
Closed: 19 years ago
Resolution: --- → FIXED
Attachment #211103 - Attachment is obsolete: true
Attachment #211103 - Flags: superreview?(brendan)
Attachment #211806 - Attachment is obsolete: true
Attachment #211806 - Flags: superreview?(brendan)
Comment on attachment 212163 [details] [diff] [review]
Fix comments, and prevent the interval from wrapping as well.

Pushing silver's r= forward.
Attachment #212163 - Flags: review+
Attachment #212163 - Flags: approval1.8.0.2?
Attachment #212163 - Flags: approval-branch-1.8.1?
Comment on attachment 212163 [details] [diff] [review]
Fix comments, and prevent the interval from wrapping as well.

approved for 1.8.0 branch, a=dveditz
Attachment #212163 - Flags: approval1.8.0.2?
Attachment #212163 - Flags: approval1.8.0.2+
Attachment #212163 - Flags: approval-branch-1.8.1?
Attachment #212163 - Flags: approval-branch-1.8.1+
Fixed on the branches.
Marking [rft-dl] (ready for testing in Firefox 1.5.0.2 release candidates)
Whiteboard: [rft-dl]
v.fixed on 1.8.0 branch with Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.0.2) Gecko/20060308 Firefox/1.5.0.2, testcase executes as expected, I don't see any unexpected first setTimeouts fired.
*** Bug 316491 has been marked as a duplicate of this bug. ***
From what I understand this issue has been fixed by providing a maximum value for each platform. Does the ECMA define a maximum value? If it does not would this not in some way break specification compliance? Surely we need to try providing the web developers a dependable common solution across platforms. If I test against Firefox on Windows, for the most part I expect it to behave the same on Firefox for Mac, but if there are variations on the rules applied to Javascript for each platform this potentially could lead to varying behaviour.

(In reply to comment #44)
> From what I understand this issue has been fixed by providing a maximum value
> for each platform. Does the ECMA define a maximum value?

ECMA, presumably you mean ECMA-262 aka ECMAScript, does not specify anything but the core JS language.  It does not specify window.setTimeout, window.alert, or any such "DOM level 0" methods.

Neither does any w3c spec (blame MS and Netscape back in the day for not trying to standardize level 0).  However the w3c "web API" working group, a recently added w.g., and the WHAT w.g. (http://www.whatwg.org/), are both taking on the task of specifying parts of DOM level 0.  Cc'ing sicking, to help get this point addressed by any webapiwg-produced spec; Hixie for whatwg is already QA contact for this bug.

> If it does not would
> this not in some way break specification compliance? Surely we need to try
> providing the web developers a dependable common solution across platforms. If
> I test against Firefox on Windows, for the most part I expect it to behave the
> same on Firefox for Mac, but if there are variations on the rules applied to
> Javascript for each platform this potentially could lead to varying behaviour.

The upper bound on interval-time is generous enough that I doubt it will matter, and there is no cross-browser standard that pages count on.  I would hope that any spec not require all browsers to have the same upper bound, only that script should not count on a time >= SOME_BIG_LIMIT, where SOME_BIG_LIMIT is a constant for all platforms.

/be
The webapi wg has not yet started work on the timer functions.

I do believe that it is resonable to allow a limit, and one should be specified in the spec once a spec is written. However the limit should be fairly high, measured in days. I would say somewhere around 2-7 days is a resonable limit. It should be higher then a day though since it's resonable to consider a page that does something once a day.

And the implementation should definitely throw if it is given a time that is bigger then allowed.
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: