Closed Bug 1059469 Opened 10 years ago Closed 9 years ago

With heavy use of setTimeout, certain timeouts are incorrect in workers

Categories

(Core :: DOM: Workers, defect)

defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla46
Tracking Status
firefox45 --- fixed
firefox46 --- fixed

People

(Reporter: jlong, Assigned: khuey)

Details

Attachments

(3 files)

This is an edge case bug that I have had an extremely difficult time isolating. I came up with a reproducible test case but unfortunately the JS required to get it is quite large.

The test case is here: http://jlongster.com/s/worker-setTimeout/. It uses a CSP channels library to do some cooperative tasking, which is here: https://github.com/jlongster/js-csp. This code is bundled up via webpack for the test case. The app.js code is pretty straight-forward: http://jlongster.com/s/worker-setTimeout/app.js.

Sorry I can't make a smaller test case. I have no idea what's going on and I tried to reproduce it outside of the CSP library but couldn't.

Two tasks are created, that each pass values back and forth, and sleep for a certain amount of time. This works great when running in the browser.

However, when running under a web worker, sometimes the `yield sleep()` calls don't actually sleep for the specific time (200 or 300 ms). They immediately get run 0 or 1ms. Internally, it uses a `setTimeout` with the duration passed, here: https://github.com/jlongster/js-csp/blob/master/src/impl/process.js#L129, and queue_delay just calls `setTimeout`: https://github.com/jlongster/js-csp/blob/master/src/impl/dispatch.js#L81.

`setTimeout` is also used to run things on the next tick in the dispatcher: https://github.com/jlongster/js-csp/blob/master/src/impl/dispatch.js#L71. We will use message channels when they are available in Firefox.

That `setTimeout` uses a duration of 0 to call it as fast as possible. This happens on every single `yield` as the scheduler is moving the system forward, to call into the next generator without putting it on the stack (and keeping it truly async).

All of this leads to odd behavior: when in the web worker, in the sleep calls which call `setTimeout` with something like 200ms, the function gets called immediately on the next tick (only 0 or 1ms has passed) sometimes. You can see this by pressing "Run in worker" and in the console it will log the sleeps, how long they should have slept for, and how long they actually slept for.

Now, we can actually fix this by changing the dispatcher's call to `setTimeout(func, 0)` (NOT the `setTimeout` that sleep uses). If we call `setTimeout(func, 1)` instead, the whole thing runs along fine. So somehow calling `setTimeout` within a worker with a duration of 0 is messing up the timers.
OSX 10.9.3, Firefox 34.0a1
Summary: With heavy use of setTimeout, certain timeouts are incorrect → With heavy use of setTimeout, certain timeouts are incorrect in workers
Any chance someone could at least verify this issue exists?
Quick check... can anyone who has worked with workers speculate why this happens?
(In reply to James Long (:jlongster) from comment #3)
> Quick check... can anyone who has worked with workers speculate why this
> happens?

In roughly decreasing order of probability:

- There's a bug in your test case.
- There's a bug in our setTimeout implementation on workers (which is indeed separate from the main thread one).
- There's a bug in our nsITimer implementation.
- There's a bug in whatever function you're using for timing (Date.now?)

The main thread does some weird thing with nested timeouts that workers don't do, afaik, so that might be related.  http://hg.mozilla.org/mozilla-central/annotate/0c66a9fd9085/dom/base/nsGlobalWindow.cpp#l12129
Hi there,

An issue exactly like this one manifests itself when one uses Socket.io (the popular websockets library) in a web worker in Firefox on MacOS.

Basically, Socket.io uses a ping timeout to detect dead sockets and attempt reconnection. The code is here:

https://github.com/socketio/socket.io-client/blob/d125da3c0fb05c9bc958a501b102f73d0e552a72/socket.io.js#L1992-L2005

When running this in a web worker in Firefox, the ping timeout sometimes runs immediately, which causes a disconnection of the socket since it looks like a ping timeout.

I'm using Firefox 42.0 on an iMac running OS X Yosemite (10.10.5). I can build a repo with the test case, but it would have to be a Node.js solution to run websockets, etc. Very little code is required besides Socket.io and node - a simple page that spawns a worker and connects to the server is enough. No need for complex timer manipulation as in the original report.

I'm a long time Firefox user and Mozilla supporter, so if I can help in anyway, please let me know.
FWIW I can't reproduce it with my original test case in Nightly. Whatever was my issue may have been fixed.
I just tried nightly and the issue affecting Socket.IO is still there. I opened four tabs, waited a couple of minutes, and it happened in one of them. There's no other ongoing websocket traffic, just the Socket.IO pings (they happen every 25 seconds). The issue is not deterministic, there may be some kind of race underneath, but it happens reliably.
I was able to reproduce this by using nested setTimeout calls, (I think that is the key) Check out this jsfiddle. If you let it run for a little while, you will see some 0 differences in time: http://jsfiddle.net/jgrund/o5p3tgp8/1/
Wonderful, that's awesome that you found a simple test case! Hopefully that will help get this fixed.
Joe posted a really simple test case which repros it consistently for me. It is indeed with nested timeouts. How hard do you think this fix is?
Flags: needinfo?(khuey)
fwiw, I too am experiencing this problem.
Good thing I don't run my pacemaker in a web worker.  :)
This is some sort of weird race with the timer thread.  If I clamp setTimeout(0) to 4 ms like we do on the main thread it goes away.
Flags: needinfo?(khuey)
Assignee: nobody → khuey
This was helpful for debugging.
Attachment #8704445 - Flags: review?(amarchesini)
Read the long commit message to understand the patch.
Attachment #8704447 - Flags: review?(amarchesini)
(In reply to Joe Grund from comment #8)
> I was able to reproduce this by using nested setTimeout calls, (I think that
> is the key) Check out this jsfiddle. If you let it run for a little while,
> you will see some 0 differences in time:
> http://jsfiddle.net/jgrund/o5p3tgp8/1/

This test case was instrumental in figuring this out, btw, so thanks for that.  If you send me email @mozilla.com I can find some Mozilla t-shirts or something to say thank you ;)
Flags: needinfo?(grundjoseph)
OS: Mac OS X → All
Hardware: x86 → All
Comment on attachment 8704447 [details] [diff] [review]
Part 2: Cancel the timer when rescheduling

Review of attachment 8704447 [details] [diff] [review]:
-----------------------------------------------------------------

This looks fine, jut make sure to add that extra check to avoid calling RescheduleTimeoutTimer when we're already running expired timeouts.

::: dom/workers/WorkerPrivate.cpp
@@ +1163,5 @@
>      return aWorkerPrivate->RunExpiredTimeouts(aCx);
>    }
> +
> +  NS_IMETHOD
> +  Notify(nsITimer* aTimer)

Nit: override
Attachment #8704447 - Flags: review?(amarchesini) → review+
Comment on attachment 8704445 [details] [diff] [review]
Part 1: Add a logging module for dump()

Review of attachment 8704445 [details] [diff] [review]:
-----------------------------------------------------------------

::: dom/base/nsContentUtils.cpp
@@ +7120,5 @@
>  
> +mozilla::LogModule*
> +nsContentUtils::DOMDumpLog()
> +{
> +  return sDOMDumpLog;

This is evil ;)
Attachment #8704445 - Flags: review?(amarchesini) → review+
https://hg.mozilla.org/mozilla-central/rev/aadea6a5781c
https://hg.mozilla.org/mozilla-central/rev/4a5dfc998bf4
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla46
Flags: needinfo?(grundjoseph)
Comment on attachment 8704447 [details] [diff] [review]
Part 2: Cancel the timer when rescheduling

I want to take this on beta and to get it into ESR.  Intermittently broken web features are bad, and this fix is pretty simple.

Approval Request Comment
[Feature/regressing bug #]: WebWorkers
[User impact if declined]: Sometimes timeouts will randomly break and remain broken.
[Describe test coverage new/current, TreeHerder]: We have good tests for this behavior.
[Risks and why]: It's possible we introduced some unknown regressions, but this code is fairly straightforward.
[String/UUID change made/needed]: None.
Attachment #8704447 - Flags: approval-mozilla-beta?
Comment on attachment 8704447 [details] [diff] [review]
Part 2: Cancel the timer when rescheduling

Improve the testsuite, taking it.

Should be in 45 beta 2.
Attachment #8704447 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
Ah, looks like we need part of bug 901907 too.
Flags: needinfo?(khuey)
I ran tests locally with part 0 and part 2 applied and we don't assert anymore.

FWIW, we don't actually need part 1 on beta (although landing it shouldn't hurt).  It's just debugging information.
Flags: needinfo?(cbook)
(In reply to Carsten Book [:Tomcat] from comment #27)
> thanks kyle, landed part 0 and 1 as
> https://hg.mozilla.org/releases/mozilla-beta/rev/aeee3bd93181
> https://hg.mozilla.org/releases/mozilla-beta/rev/1524bf5cebad

Since I had to go check, for posterity's sake, Tomcat did land parts 0 and 2 per comment 26.
Flags: needinfo?(khuey)
Hi, Just wanted to confirm that was in a shipped version of Firefox?
Yes, this was in Firefox 45, which shipped in early March.
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: