Closed Bug 376643 Opened 17 years ago Closed 17 years ago

setInterval fires repeatedly with incorrectly small delays if machine is suspend/resumed, hibernate/resumed or process is SIGSTOP/SIGCONT

Categories

(Core :: XPCOM, defect)

defect
Not set
major

Tracking

()

RESOLVED FIXED
mozilla1.9alpha5

People

(Reporter: whorfin, Assigned: samuel)

References

()

Details

(Keywords: fixed1.8.0.14, fixed1.8.1.8)

Attachments

(5 files, 3 obsolete files)

User-Agent:       Mozilla/5.0 (X11; U; Linux i686 (x86_64); en-US; rv:1.8.1.3) Gecko/20070309 Firefox/2.0.0.3
Build Identifier: Mozilla/5.0 (X11; U; Linux i686 (x86_64); en-US; rv:1.8.1.3) Gecko/20070309 Firefox/2.0.0.3

The high level summary is that setInterval appears to be implemented
based on absolute time, not actual elapsed intervals.  Reading:
http://developer.mozilla.org/en/docs/DOM:window.setInterval
clearly states that "func" or "code" are to be called "with a fixed time
delay between each call to that function".  Instead, the func/code is called
n times "as quickly as possible" based on total elapsed time.  If firefox
is suspend/resume'd for a day for instance, a one-second setInterval function
will be called 24*60*60 times in quick succession [ie delays way less than
one-second].  During this "call them all fast" condition, firefox consumes
100% cpu.  See Bug 371879 https://bugzilla.mozilla.org/show_bug.cgi?id=371879
Firefox may essentially never catch up if there were many setIntervals
active, as may happen with many open windows/tabs using javascript and/or
extensions [such as AdBlockPlus] which use setInterval.

I'm no firefox/mozilla expert, but would guess GlobalWindowImpl::RunTimeout
and/or GlobalWindowImpl::SetTimeoutOrInterval are to blame.  I suspect
this affects much more than just firefox.

Reproducible: Always

Steps to Reproduce:
1. load the URL above
2. suspend or hibernate machine on Windows, use "kill -STOP pid" on linux or OSX, or use http://www.microsoft.com/technet/sysinternals/utilities/ProcessExplorer.mspx
on Windows to suspend the firefox process.
3. Wait some amount of time.  Say one minute.
4. Wakeup machine, use "kill -CONT pid" on linux/OSX, or use process explorer
to resume the firefox process.
Actual Results:  
Observe that tick() is fired approximately every 1000 milliseconds until
process/machine is suspended.  After resume/wake, tick() is fired
n times [at least 60 when suspended for one minute], with an interval
of 9-25 (depending on machine speed and other pending setIntervals in
firefox session) milliseconds, before eventually settling back down to normal.

Note, too, that this behavior can be seen even without suspend/resume,
especially on a busy windows box, as the timing will occasionally become
quite erratic --- if an interval is late, extra firings are done to compensate.

Expected Results:  
tick() should be fired every 1000 milliseconds, as documented, within some
reasonable tolerance.

I believe this is the root cause of the "firefox consumes 100% cpu" bug 
referenced above in Bug 371879 [and many, many previous posts to the
user forums and other nebulous bug reports by a multitude of people over the years].  When in this state, firefox becomes essentially unusable, and may take arbitrarily long to recover based on the number and timing of setInterval calls.
While setInterval() is arguably not a "major feature", the result of this
bug has a major impact on laptop users who use sleep/hibernate, which is why
i requested Major severity.
Blocks: 371879
Keywords: topperf
whorfin - can you please check Bug 327050 and see if that covers what you're talking about as well?

cc'ing biesi as he understand the technical side involved in that bug and I don't.
(In reply to comment #2)
> whorfin - can you please check Bug 327050 and see if that covers what you're
> talking about as well?
> 
> cc'ing biesi as he understand the technical side involved in that bug and I
> don't.
> 

This reminds me of bug 197863, where the sleep- and wakeup-notifications were implemented, to be called before and after a sleep. They're trying to notify all timers that there was a delay. Note that the service is implemented for Mac and Windows (when the OS goes in sleep-mode), but not for Linux. Using kill -STOP can not be caught, so still has the problem ofcourse.

But this code is for timers of the TYPE_REPEATING_PRECISE type, which most timers aren't.
I think we have existing bugs on this, right Brendan?

That said, if we don't we should use this bug.  I do think we should fix this somehow... can the timer thread detect suspend/resume?
Status: UNCONFIRMED → NEW
Component: General → XPCOM
Ever confirmed: true
Product: Firefox → Core
QA Contact: general → xpcom
Whiteboard: DUPEME
Version: unspecified → Trunk
Flags: blocking1.9?
(In reply to comment #2)
> whorfin - can you please check Bug 327050 and see if that covers what you're
> talking about as well?

that bug wouldn't explain anything related to setInterval
(In reply to comment #2)
> whorfin - can you please check Bug 327050 and see if that covers what you're
> talking about as well?

As Christian says, that bug is quite different.  Bug 213637 seems to be about
more than one thing, one of which is 327050, the other which may well be this
setInterval issue at root.

I'm neither the architect of nor the designer of the setInterval() feature, but
for what it's worth, I would recommend the fix to be implementation as originally described in the doc, not to attempt to detect suspend/resume [note that SIGSTOP can't be detected] and do something special.

Think of setInterval() as being like setting "frames-per-second" in an
animation system.  If your player can't keep up, you want to drop frames,
not try to blat out all the frames at once when you get behind --- you'll
never catch up, and your playback will begin to fall further and further behind
in terms of elapsed time.

In other words, waking up from a long sleep shouldn't be materially different
from handling the case where the called function took much longer to execute than the defined interval.  In both cases, you just want to fire again at the next regularly scheduled tick interval, not effectively cue up "n" additional null-delay firings.  setInterval() is defining a regular interval of execution, and you want the next execution of the "tick" function to be as close as possible to the next regularly defined timing interval.  Again, thinking of this as something akin to an "event every m milliseconds" timeline, NOT "there will be n total invocations from when called to time t", IMHO.  Think of the
way a media player drops frames when needed to ensure synchronization.
> In both cases, you just want to fire again at the next regularly scheduled tick
> interval, not effectively cue up "n" additional null-delay firings

Sadly, I don't think that's compatible with web content for the "called function took much longer" case, so we have code to do precisely the queueing you see.  But I welcome evidence to the contrary (that is, evidence that indicates that dropping interval firings is what sites expect or other UAs do).
Contrary evidence would be: [i know i know, i'm sorry in advance...]
Internet Explorer, which behaves as i've described.  If i suspend it for
10 seconds, i see the next firing as having taken 10000 + "some amount"
millisecs.  Just in case it's handling suspend, I tried pausing it in a debugger, with the same results.  Likewise, whereas when firefox first opens the
page above and typically stumbles, causing a few very-short firings,
IE stumbles but doesn't try to catch up --- firings are always >= 1000 millisecs.

IE is also way more stable in timings --- it cruises at pretty much exactly
1000...

That was the latest [IE 7] on Windoze.  I also tried an old [IE 5.2] version
on OS X, where i was able to use SIGSTOP/SIGCONT [as under linux], which can't be intercepted.  Similar results to Windoze.

I also tried Safari on OSX.  Interestingly, this behaves per my recommendations above --- it seems to be trying to hit a specific time interval.  If one firing was late by 15 millisecs [1015 milliseconds reported], the next will try to be 15 early [985 or so reported].  But no bogus extra queueing - if i SIGSTOP it for 13 seconds, i get 13xxx milliseconds reported, where xxx was dependent on when it was stopped.

Finally, the latest Opera appears to act similarly to the latest IE --- timings were always >= 1000 for the test case.  Suspending for 10 seconds led to a >= 10000 milliseconds result.
Right, but what I was wondering is what other browsers do when an interval's function takes longer than the interval to run, not what happens if you suspend the app...  The latter was already described in this bug.
Sorry I misunderstood, though it's interesting to see the other behaviors.
I believe it's the same situation with slow funcs; try this:
http://users.cjb.net/macabretech/setintervalslow.html
[also attaching to this bug just in case]

Every tenth tick, the tick functions burns 2700 milliseconds via an evil
spin-loop.

On firefox, we see the expected "extra firings" queued and fired quickly, at much less than 1000 millis.
On IE, we get no extra firings --- after the "Burn" happens, the next is at around 3700 and then 1000 until the next burn.
Some thing in Opera, the only difference being opera's timing appears really tight and better than IE; IE tends to stumble for whatever reason and do a few in the neighborhood of 1500 rather than 1000, while opera looks like a clean mix of 1000 and 3700

I'm attaching a text file of some example outputs.
oops, remove add insertion done by cjb.net
Attachment #260978 - Attachment is obsolete: true
In the "function took longer to execute than interval" case, the Firefox handling seems more correct to me. But in the "system was suspended" case it leads to the unwanted bursts. Solving that would need detecting suspend in a cross-platform manner... How about periodically requesting a signal from the OS, and checking how late it is, and if very late then assuming the system was suspended and cancelling the interval queueing?

Of course it could lead to some false positives on overloaded systems...
If the called function was "too slow" once, it is likely to be "too slow"
again.  The current [and your suggested] behavior is a recipe for getting further and further behind, meaning more and more events in the queue.  For what it's
worth, Safari seems the most correct to me, it is trying to fire the function at
a regular temporal interval.
What is the use case for having really accurate intervals?  The only thing I can think of is for benchmarking. :-)  I would think anyone wanting such accuracy should be checking the time himself and using setTimeout.
I agree.  For animations [like scrolling] it's desireable, and I don't think the Safari behavior need cost anything, but in any event I will vote for reasonable CPU usage over accurate timers in a web browser.  If firefox handled timers like Opera or even IE I'd be totally happy, I just note the Safari behavior as something to be considered "more correct" than queueing additional events.
Whiteboard: DUPEME
This looks like a pretty easy fix.  As I understand it, our current behaviour is good, except if the next firing time is in the past, then the next firing time should be a full interval time from now.  Is that correct?

If no one else is looking into this, then I could take a shot at it.  I had a good look through the timer code a while back.
Samuel, go for it!
Attached patch fix JS intervals (obsolete) — Splinter Review
I finally found it.  JS uses one-shot timers to do intervals.  There may still be a problem with precise repeating timers if anyone uses them.  But this will fix the
Assignee: nobody → samuel
Status: NEW → ASSIGNED
Accidentally hit submit...  Last sentence fragment should not be there.
Should that not be :

       PRTime currentNow = PR_Now();
       PRTime delay;
       PRTime nextInterval = (aTimeout ? timeout->mWhen : now);
 
       // Check if the next interval timeout is overdue.  If so,
       // then restart the interval.
       do {
         nextInterval += (PRTime)timeout->mInterval * PR_USEC_PER_MSEC;
         delay = nextInterval - currentNow;
       } while (delay < 0);


- this also works when (timeout->mInterval * PR_USEC_PER_MSEC) is smaller then the delay, which is exactly what this bug was all about.

- this also works when aTimeout is 0, so that nextInterval should start form timeout->mWhen

- I've also taken the opportunity to cache PR_Now() in currentNow, which is a stupid name, I agree, but so is 'now' in this context :-)
This patch only call PR_Now() once.  I also included a check to make sure precise timers don't have a problem.
Attachment #261583 - Attachment is obsolete: true
Attachment #261910 - Flags: review?
Attachment #261910 - Flags: review? → review?(jst)
Flags: blocking1.9? → blocking1.9+
Comment on attachment 261910 [details] [diff] [review]
better patch
(ignore the xpcom changes)

The changes to the DOM code look good to me, but the changes to xpcom/threads/nsTimerImpl.cpp don't look right.

     mTimeout = now;
 
   mTimeout += delayInterval;
+  if (mTimeout < now)
+    mTimeout = now + delayInterval;
 
mTimeout and now are both of type PRIntervalTime, which is unsigned 32-bit int, and they wrap around reasonably frequently on some systems. IOW, it's not at all uncommon to have now be near 2^32 and mTimeout be just larger than 0, yet that means that mTimeout is in the future, not in the past.

r=jst for the DOM changes, but the last part of the patch needs some work, if it's needed at all.
Attachment #261910 - Flags: review?(jst) → review+
(In reply to comment #24)
> mTimeout and now are both of type PRIntervalTime, which is unsigned 32-bit int,
> and they wrap around reasonably frequently on some systems. IOW, it's not at
> all uncommon to have now be near 2^32 and mTimeout be just larger than 0, yet
> that means that mTimeout is in the future, not in the past.
> 
> r=jst for the DOM changes, but the last part of the patch needs some work, if
> it's needed at all.

See http://lxr.mozilla.org/mozilla/search?string=TIMER_LESS_THAN and use that if the two terms are in the same domain (the one that's restricted to be half the uint32 domain).

/be
Comment on attachment 261910 [details] [diff] [review]
better patch
(ignore the xpcom changes)

I did know about TIMER_LESS_THAN...  I'll drop the xpcom changes for now.  Who should I get to sr the dom changes?
Attachment #261910 - Attachment description: better patch → better patch ignore the xpcom changes
Blocks: 213637, 327050
Target Milestone: --- → mozilla1.9alpha5
Attachment #261910 - Attachment description: better patch ignore the xpcom changes → better patch (ignore the xpcom changes)
Attachment #261910 - Flags: superreview?(brendan)
Comment on attachment 261910 [details] [diff] [review]
better patch
(ignore the xpcom changes)

>+      // Also check if the next interval timeout is overdue.  If so,
>+      // then restart the interval from now.
>+      PRTime nextInterval = (PRTime)timeout->mInterval * PR_USEC_PER_MSEC;
>+      if (!aTimeout || (nextInterval + timeout->mWhen <= now))

Nit: to match the later hunk, use < rather than <=.

Also a nit: no need to parenthesize <= against ||.

>@@ -503,6 +503,8 @@
>     mTimeout = now;
> 
>   mTimeout += delayInterval;
>+  if (mTimeout < now)
>+    mTimeout = now + delayInterval;

Here's the < that I meant.

/be
I'll gladly sr+ a final patch ;-).

/be
Attached patch cleaner patchSplinter Review
I removed the xpcom changes.  Brendan, that's what you were comparing it with.  I also removed the extra parentheses.
Attachment #261910 - Attachment is obsolete: true
Attachment #265338 - Flags: superreview?(brendan)
Attachment #261910 - Flags: superreview?(brendan)
Comment on attachment 265338 [details] [diff] [review]
cleaner patch

Sorry, thought I already sr'ed this.

/be
Attachment #265338 - Flags: superreview?(brendan) → superreview+
Checking in dom/src/base/nsGlobalWindow.cpp;
/cvsroot/mozilla/dom/src/base/nsGlobalWindow.cpp,v  <--  nsGlobalWindow.cpp
new revision: 1.935; previous revision: 1.934
Status: ASSIGNED → RESOLVED
Closed: 17 years ago
Resolution: --- → FIXED
No longer blocks: 371879
Given that this bug has only been fixed within the past few days; do you have any good suggestions for workarounds within user space?

I'm actually writing an extension, and I had been relying on setInterval for a "fetch this HTTP resource and check it every hour" sort of thing.  Rapid-fire "intervals" to make up the time after a suspend are superfluous in this case.

Note: this same bug will also happen if you just set your clock forward.
The only suggestion I can make is to check how long it's been since the last one and either just ignore it if it hasn't been long enough or restart the interval.  
On Linux there is a "realtime" timer which is separate from the "clock" timer.  That might be a better fit as the clock source (won't be affected by changing the clock time).  I wonder if windows has a similar thing.
I'm confused by the latest changes to blocking and targets.
Do I understand correctly that this is destined only for Gran Paradiso/Firefox
3/Mozilla 1.9?  This is a serious bug, it would be awesome to target it
for a 2.x update.
Only stability/security issues get fixed in branch builds, in general.
It's just too risky to change code on branch builds, without a significant amount of baking time.
Thanks for the explanation.
I claim that consuming 100% CPU until "Exit" is selected, which further
leads to a crash [see Bug 282660 for talkback references], is a stability issue.
For stability issues (which I agree this could be in some cases) it becomes a tradeoff between the risk and the benefit.  This is a very risky change as far as I can tell, but I'd welcome comments otherwise from the reviewers and patch submitter if they feel otherwise.

Given the risk, this change would need a good bit of trunk testing before it could be considered for the branch... then a good bit of branch testing after it lands there.
(In reply to comment #91 of Bug 213637)
> Folks,
> 
> I had not searched for this bug until yesterday. I have been experiencing this
> high CPU usage after suspend for a very long time. I had high hopes when I read
> about 2.0.0.7pre, but coming out of suspend this morning Firefox was using  98%
> of the CPU time...well I mean Bon Echo 2.0.0.7pre. I have not removed all the
> AddOns, so maybe there is a contribution from one of them.

It turns out that the 100% CPU time issue I have be experiencing for a very
long time is really described by THIS bug 376643. I use a laptop nearly 95% of
time. I use it docked at work and bring it with me when I go home or travel. I
suspend and resume AT LEAST twice a day and almost every time I deal with the
100% CPU issue. Thank goodness for the Session Manager add-on! I often forget
to kill FF before suspending and deal with a turtle slow computer until I get
TaskManager up to kill FF. Damnit I should not have to do that...it is my
single big beef with FF and a daily headache for me.

CAN SOMEONE PLEASE PATCH FF 2!?!?!

Thank you.
Ok, I'll ask for blocking branch for this, since the patch is now on trunk for a few months, but it would be great if Samuel (or anyone else who could know) would give an indication of how risky the patch would be on branch. 
Flags: blocking1.8.1.8?
Flags: blocking1.8.0.14?
"wanted" on the 1.8 branch, too risky and pointless for the unsupported 1.8.0 branch, not yet sure if 1.8.1.8 is the release to take the risk on this change. Have there been any trunk regressions?
Flags: wanted1.8.1.x+
Flags: blocking1.8.0.14?
Flags: blocking1.8.0.14-
What are the criteria to determine when the risk is warranted? I need to kill Bon Echo/FF at least once a day when it spins as much CPU as it can grab. If the assumed risky fix causes an occasional failure then I am really no worse off. All I need to do is start FF rather than BE and I am back to only the daily abnormal termination. BE is a beta rev use at your own risk...right?
(In reply to comment #43)
> What are the criteria to determine when the risk is warranted? I need to kill
> Bon Echo/FF at least once a day when it spins as much CPU as it can grab. If
> the assumed risky fix causes an occasional failure then I am really no worse
> off. All I need to do is start FF rather than BE and I am back to only the
> daily abnormal termination. BE is a beta rev use at your own risk...right?

Uh oh -- are you really using a browser branded BonEcho? That name means a Firefox 2 pre-release. You want the latest Firefox 2 post-final patch release, 2.0.0.7, from http://getfirefox.com or http://www.mozilla.com.

Please confirm that this bug is still biting you once you are running something that calls itself Firefox 2.0.0.7 in the about: page or the Help menu.

/be
Right...I figured (in error???) that BE would be the first place to get this fix. Sure I can update my FF from 2.0.0.6 to 2.0.0.7 and give that a try...done. I will let you know what happens. So far every release I have tried fails in this way.
I can save you the trouble, yes, it's still a problem in 2.0.0.7.

Folks, all you have to do is follow the instructions at the top of this bug to verify whether or not this bug is fixed.  It takes but a few minutes.
Flags: blocking1.8.0.14- → blocking1.8.0.14?
The BonEcho and "beta" words in comment 43 made me fear a pre-release of Firefox 2 was still in use (we do not timebomb old betas as Netscape did).

whorfin, anyone: please confirm that this bug is fixed on a trunk nightly.

Assuming trunk nightlies show this is indeed fixed, then dveditz: I recommend that we take the fix for 1.8.1.8 (Firefox 2.0.0.8). I know of no regressions and it's a clean fix AFAICT.

/be
(In reply to comment #44)
> Uh oh -- are you really using a browser branded BonEcho? That name means a
> Firefox 2 pre-release.

1.8 branch nightlies are still branded "BonEcho" to discourage misuse by people who don't know what they're getting into. The official Firefox branding isn't turned on until we create the relbranch and start producing release candidates.

As long as the About dialog gives you a number higher than the most recent official release (currently it says "2.0.0.8pre") then you're getting the very latest. Thank you for helping us test the nightlies, I certainly don't want to discourage that.
The "just in case" testcase attachment still shows the problem in FF2.0.0.7 and is fixed on trunk.
Flags: blocking1.8.1.8?
Flags: blocking1.8.1.8+
Flags: blocking1.8.0.14?
Flags: blocking1.8.0.14+
1.8 branch patch, the difference is due to the fix for bug 52209 on the trunk.
Attachment #282475 - Flags: review?(jst)
Attachment #282475 - Flags: approval1.8.1.8?
Attachment #282475 - Flags: approval1.8.0.14?
Comment on attachment 282475 [details] [diff] [review]
1.8 branch version

approved for 1.8.1.8 and 1.8.0.14 pending r=jst, a=juanb for release-drivers
Attachment #282475 - Flags: approval1.8.1.8?
Attachment #282475 - Flags: approval1.8.1.8+
Attachment #282475 - Flags: approval1.8.0.14?
Attachment #282475 - Flags: approval1.8.0.14+
Brendan:
Hoo-rah!  3.0a9pre 26-sep nightly shows reasonable behavior on the testcases.
Please, let's get this in 2.0.0.?

Thanks
Attachment #282475 - Flags: superreview+
Attachment #282475 - Flags: review?(jst)
Attachment #282475 - Flags: review+
patch checked into the MOZILLA_1_8_BRANCH and MOZILLA_1_8_0_BRANCH
I'm on FF 2.0.0.8 and symptoms on resume from hibernate led me to this bug, so it seems it's still a problem.
Jean Jordaan: please run the "Test case" attachments to this bug and report results. Thanks,

/be
(In reply to comment #54)
> I'm on FF 2.0.0.8 [...]

That was released seven months ago, you should upgrade ASAP. Have you disabled automatic updates or are they just not working for you?

You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: