Last Comment Bug 697250 - An interval timer set for a longer delay can fire twice between firings of an interval timer set for a shorter delay
: An interval timer set for a longer delay can fire twice between firings of an...
Status: RESOLVED FIXED
:
Product: Core
Classification: Components
Component: DOM (show other bugs)
: Trunk
: All All
: P1 normal (vote)
: mozilla10
Assigned To: Boris Zbarsky [:bz]
:
Mentors:
http://tholman.com/series/flash-forwa...
Depends on:
Blocks:
  Show dependency treegraph
 
Reported: 2011-10-25 13:47 PDT by Jamax
Modified: 2011-11-03 12:26 PDT (History)
11 users (show)
bzbarsky: in‑testsuite?
See Also:
Crash Signature:
(edit)
QA Whiteboard:
Iteration: ---
Points: ---
Has Regression Range: ---
Has STR: ---


Attachments
Testcase that triggers the timeout issue pretty reliably (505 bytes, text/html)
2011-10-26 08:25 PDT, Boris Zbarsky [:bz]
no flags Details
This works, though not perfectly reliable (499 bytes, text/html)
2011-10-26 21:22 PDT, Boris Zbarsky [:bz]
no flags Details
When an interval is very late, just go ahead and schedule it for firing ASAP and update its target time to 'now' (the latter to prevent misbehavior when the interval is very late due to the computer being suspended, say) instead of trying to schedule it f (2.72 KB, patch)
2011-10-29 00:06 PDT, Boris Zbarsky [:bz]
jst: review+
cjones.bugs: feedback+
Details | Diff | Splinter Review

Description Jamax 2011-10-25 13:47:55 PDT
User Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:10.0a1) Gecko/20111025 Firefox/10.0a1
Build ID: 20111025031041

Steps to reproduce:

Please visit this site with a floating-point heavy javascript simulation using Nightly:

http://tholman.com/series/flash-forward/ports/multiphase-flow/02/

Hold the mouse button down in the rectangle to add points.  Keep the mouse button pressed for a while until ~1/4th full of points when they settle.


Actual results:

The points behave erratically as if there is some kind of floating point errors in the operations (they have too much 'energy' compared to other browsers).  


Expected results:

On other browsers tested the points move fluidly (IE9, Chrome 16.0.912.4 dev-m, Opera, Safari).

When I tried this last ~1-3 weeks ago using Nightly the points moved fluidly so I believe there is some recent change that broke math in some subtle way.  I notice on arewefast.com that TM+JM got a lot slower recently so I assume some something significant changed in the engine.

Sorry but I don't have the time or experience to narrow this down to a test case that fails.  I am not the author of this demo.
Comment 1 Jamax 2011-10-25 13:54:23 PDT
Also I forgot to note another difference, when holding the mouse down sometimes the 'streeam' of new particles becomes straight with no randomness at which point all the other particles no longer repel each other and fall to the bottom of the screen.  I didn't observe this on any other browser.

So maybe something to do with the random number generator?
Comment 2 Boris Zbarsky [:bz] 2011-10-25 14:03:59 PDT
I get the "no repelling" failure condition every time I try this, so I never get a chance to reproduce the movement aspect.

And I get that failure condition on builds going back to at least Firefox 4...
Comment 3 Jan de Mooij [:jandem] 2011-10-25 14:17:03 PDT
The console is also spammed with warnings about an undefined value. It looks like we end up with a NaN value somewhere. I can track this down tomorrow if nobody beats me to it..
Comment 4 Jan de Mooij [:jandem] 2011-10-26 01:58:22 PDT
OK, here's what causes the "no repelling" failure. The script uses 2 different intervals (functions called by setInterval):

1) Update: called every 20 ms. Adds particles, calculates various properties (neighbors, force, etc)

2) Draw: called every ~17 ms (1000 / 60). Moves and draws all particles.

This scheme fails if the first interval is called twice without an intermediate call to draw (the second interval). When this happens we will have two particles at the same position (the second interval is supposed to move the particles added by the first interval). The function setParticle does this:

    this.nx = p1.x - p2.x;
    this.ny = p1.y - p2.y;
    
    this.distance = Math.sqrt(this.nx * this.nx + this.ny * this.ny);

    // ... snip ...

    var invDistance = 1 / this.distance;

    this.nx *= invDistance;
    this.ny *= invDistance;

If particles p1 and p2 have the same position, this.distance is 0. 1/0 is Infinity so this.nx becomes 0 * Infinity = NaN.

I added some logging and sometimes we indeed have two "update" calls without a "draw" call. Also verified that changing the "draw" interval from 17 ms to 22 ms causes the same problem in Chrome.

I wouldn't be surprised if this is also responsible for the problem in comment 0. Note that the script does not use Math.random or Date; it depends on the mouse input and setInterval but for the rest it's deterministic.
Comment 5 Jan de Mooij [:jandem] 2011-10-26 03:20:18 PDT
Moving to DOM.
Comment 6 Boris Zbarsky [:bz] 2011-10-26 08:25:27 PDT
Created attachment 569688 [details]
Testcase that triggers the timeout issue pretty reliably

Even without the delay loop we get there pretty quickly, but the delay loop nearly guarantees it on the first iteration....
Comment 7 Boris Zbarsky [:bz] 2011-10-26 20:25:57 PDT
Comment on attachment 569688 [details]
Testcase that triggers the timeout issue pretty reliably

This isn't quite right...
Comment 8 Boris Zbarsky [:bz] 2011-10-26 21:22:57 PDT
Created attachment 569900 [details]
This works, though not perfectly reliable
Comment 9 Boris Zbarsky [:bz] 2011-10-26 22:14:54 PDT
So here's what I see when I reproduced on that last testcase.  For everything below, t is the time in milliseconds, with 0 being when those two setInterval calls in the testcase happen:

t=0: f scheduled for t=20
t=0: g scheduled for t=17
t=17: g called, finishes. 17 < 17+17, so g scheduled for t=34
t=20: f called
t=40: f finishes.  20 < 20+20, so f scheduled for t=40
t=41: g called, finishes. 41 < 34+17, g scheduled for t=51 
t=41: f called
t=61: f finishes.  41 < 40+20, so we try to schedule f for t=60, clamp to t=61
t=82: g called, finishes.  82 > 51+17, so we skip ahead; g scheduled for t=99.
t=82: f called
t=102: f finishes.  82 > 61+20, so we skip ahead; f scheduled for t=102.
t=111: g called, finishes.  111 < 99+17, g scheduled for t=116.
t=111: f called
t=131: f finishes.  111 < 102 + 20, so f scheduled for t=122.
t=135: g called, finishes.  135 > 116+17, so we skip ahead; g scheduled for t=152.
t=135: f called
t=155: f finishes.  135 < 122+20, so we try to schedule f for 142, clamp to 155.
t=157: f called

So the key part there is to hit a situation where a call to g ends up on the "target time + 17 < time when g started running" codepath, which resets g to run at "now + 17" while the immediately following call to f ends up on the "target time + 20 > time when f started running", which resets f to run at "target time + 20".   Since g runs right before f, it will get scheduled for "time when f started running + 17", so as long as f started running more than 3ms after its target time f will end up scheduled before g....
Comment 10 Boris Zbarsky [:bz] 2011-10-26 22:20:49 PDT
Looking at the spec at http://www.whatwg.org/specs/web-apps/current-work/multipage/timers.html#dom-windowtimers-setinterval the "skip ahead" behavior doesn't seem to be allowed, though step 8 sort of allows anything you want.

It looks like the "skip ahead" thing was added in bug 376643, since at the time this code was using PRTime for things like mWhen.  Now that we're using TimeStamp there, would it be safe to back out this patch?  Or can TimeStamp increase discontinuously over the things listed in the bug?  I guess at least over SIGSTOP/SIGCONT it can...

If we do want to preserve the skipping behavior, I wonder how we can do it without causing issues like the one in this testcase.
Comment 11 Jamax 2011-10-28 16:13:06 PDT
I'm probably missing something as I have zero knowledge of Mozilla code...

What if you just always schedule the next event to happen after "interval-(now%interval)" milliseconds ie sleep for however much is left of the interval period containing the current time?  This should handle stop/start and suspend/resume, and not fire twice in a row.

But what time to do the first one could be a problem (if the interval was '1 minute' you'd expect it to run after 1 minute not 'on the next minute').  I guess you could store some offset based on when the interval was first set ie "interval-((now+offset)%interval)".
Comment 12 Boris Zbarsky [:bz] 2011-10-28 18:43:59 PDT
> What if you just always schedule the next event to happen after
> "interval-(now%interval)" milliseconds

"now" is not a number.  It doesn't really have a concept of % attached to it...  We could fix that by measuring time from some fixed base (e.g. browser startup).  This is a good idea!  Let me think it through.

> I guess you could store some offset based on when the interval was first set ie
> "interval-((now+offset)%interval)"

Yeah, that's needed to correctly handle this testcase, which should ideally call f every 5ms:

  setInterval(f, 10);
  setTimeout(function() { setInterval(f, 10); }, 5);
Comment 13 Boris Zbarsky [:bz] 2011-10-28 23:27:14 PDT
OK, so with that setup there is the open question of whether to set "now" to the time the timer starts firing or the time the timer stops firing.  

1)  As long as a timer is not irretrievably late (i.e. finishes firing before its next
    scheduled firing) it will get scheduled for said next firing no matter what definition
    of "now" is used.
2)  If a timer starts firing in a situation where we would not skip right now (i.e timer
    firing start is not later than its next firing, then if "now" is treated as firing
    start the timer will be scheduled for "ASAP" just like it is now.  If "now" is treated
    as firing end, then it depends on how long the timer took to run.
3)  If a timer starts firing in a situation where it would skip right now, it's
    guaranteed to skip no matter which definition of "now" is used.  It'll obviously skip
    less with the "timer firing start" definition.

Also, when the timer skips it's guaranteed to skip by no more than it skips by right now.

An alternative approach is to rip out the "skip ahead" bit and do what precise-but-can-skip XPCOM timers do: when a timer finishes running after its next scheduled firing, schedule it for "ASAP" and, importantly, update its target time to "ASAP".  In nsTimeout terms, that means |mWhen = delay + TimeStamp::Now()| instead of |mWhen = firingTime|.  This might behave better in terms of trying to "catch up" if it falls a bit behind....
Comment 14 Boris Zbarsky [:bz] 2011-10-29 00:06:56 PDT
Created attachment 570464 [details] [diff] [review]
When an interval is very late, just go ahead and schedule it for firing ASAP and update its target time to 'now' (the latter to prevent misbehavior when the interval is very late due to the computer being suspended, say) instead of trying to schedule it f

This patch does that last suggestion: act like precise-but-can-skip XPCOM timers.  It fixes the problem with the fluid just collapsing for me.  Jamax, I started off a Windows optimized build with this patch on the try server; once it completes, would you be willing to try the resulting build and see whether it solves the 'too much energy' problem too?
Comment 15 Boris Zbarsky [:bz] 2011-10-29 00:18:41 PDT
I believe the attached patch is actually pretty much equivalent to the comment 11 proposal with "now" measured at firing start, except it won't try to "drift back" to the right timing once it gets off from it, except insofar as the delay line stuff in nsTimerImpl will pull it back.  So maybe the comment 11 proposal is better in that regard.

cjones, if you have any thoughts here, I'd love to hear them!
Comment 16 Mozilla RelEng Bot 2011-10-29 02:20:23 PDT
Try run for d934802342d4 is complete.
Detailed breakdown of the results available here:
    https://tbpl.mozilla.org/?tree=Try&rev=d934802342d4
Results (out of 1 total builds):
    success: 1
Builds available at http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/bzbarsky@mozilla.com-d934802342d4
Comment 17 Jamax 2011-10-31 09:05:57 PDT
Boris I tried out the patch and it does fix the double timer problem on my system (the demo doesn't ever 'fail to repel').  Your second testcase in comment 8 ran to a couple thousand before I stopped it.  Fyi I did not try suspend/resume or anything else that might be affected by the patch.

But the too-much-energy problem is still there.  Maybe it's just this demo relying on some undefined behavior, but it works 'correctly' in other browsers.  I'll try to get a movie posted later today so you can at least see what the effect looks like.
Comment 18 Boris Zbarsky [:bz] 2011-10-31 09:11:31 PDT
> But the too-much-energy problem is still there.

I'd love to see a demonstration of that.  The behavior looks pretty identical to me in Firefox and Chrome....
Comment 19 Boris Zbarsky [:bz] 2011-10-31 09:12:15 PDT
Comment on attachment 570464 [details] [diff] [review]
When an interval is very late, just go ahead and schedule it for firing ASAP and update its target time to 'now' (the latter to prevent misbehavior when the interval is very late due to the computer being suspended, say) instead of trying to schedule it f

Chris, I'd love your feedback here.  See comment 15 and the preceding discussion.
Comment 20 Jamax 2011-10-31 12:02:46 PDT
I cobbled together a screen capture at:

http://www.youtube.com/watch?v=T0ULuMOllig

BUT this is a problem in the demo not FF.  I got the other browsers to do the same kind of thing (jittering, small exploding groups) by adding far fewer particles.  I think the demo just has a logic problem relating to the refresh rate and I noticed it in Nightly because it's so much faster than the others.

Anyway I'm glad there was a real bug in there however small and I didn't completely waste you guys' time!
Comment 21 Boris Zbarsky [:bz] 2011-10-31 12:16:14 PDT
Jamax, thanks!

Going to retitle this bug to make it cover the issue we're ending up fixing, which I wouldn't describe as "small".
Comment 22 Chris Jones [:cjones] inactive; ni?/f?/r? if you need me 2011-11-01 00:23:23 PDT
Comment on attachment 570464 [details] [diff] [review]
When an interval is very late, just go ahead and schedule it for firing ASAP and update its target time to 'now' (the latter to prevent misbehavior when the interval is very late due to the computer being suspended, say) instead of trying to schedule it f

I think this approach makes sense.  The proposal of comment 11 sounds a bit more attractive to me for callbacks that persistently don't finish in less time than their interval, because it would tend to throttle back the callbacks by interval/2 on average.  But this patch provides more consistent behavior.  For non-degenerate callbacks, I think I slightly prefer the approach here of "rebasing" the interval after OS sleep etc, but I'm not sure how much it matters.

We really need to get folks on requestAnimationFrame() ... :)
Comment 23 Johnny Stenback (:jst, jst@mozilla.com) 2011-11-02 13:05:50 PDT
Comment on attachment 570464 [details] [diff] [review]
When an interval is very late, just go ahead and schedule it for firing ASAP and update its target time to 'now' (the latter to prevent misbehavior when the interval is very late due to the computer being suspended, say) instead of trying to schedule it f

This seems reasonable to me. Ultimately we'll need to get this out to users for a while to see how it performs though.

r=jst
Comment 25 Marco Bonardo [::mak] 2011-11-03 08:52:14 PDT
https://hg.mozilla.org/mozilla-central/rev/e2adadca0ee1

Note You need to log in before you can comment on or make changes to this bug.