Closed
Bug 1059469
Opened 10 years ago
Closed 9 years ago
With heavy use of setTimeout, certain timeouts are incorrect in workers
Categories
(Core :: DOM: Workers, defect)
Core
DOM: Workers
Tracking
()
RESOLVED
FIXED
mozilla46
People
(Reporter: jlong, Assigned: khuey)
Details
Attachments
(3 files)
5.21 KB,
patch
|
bent.mozilla
:
review+
|
Details | Diff | Splinter Review |
14.42 KB,
patch
|
bent.mozilla
:
review+
Sylvestre
:
approval-mozilla-beta+
|
Details | Diff | Splinter Review |
2.44 KB,
patch
|
Details | Diff | Splinter Review |
This is an edge case bug that I have had an extremely difficult time isolating. I came up with a reproducible test case but unfortunately the JS required to get it is quite large. The test case is here: http://jlongster.com/s/worker-setTimeout/. It uses a CSP channels library to do some cooperative tasking, which is here: https://github.com/jlongster/js-csp. This code is bundled up via webpack for the test case. The app.js code is pretty straight-forward: http://jlongster.com/s/worker-setTimeout/app.js. Sorry I can't make a smaller test case. I have no idea what's going on and I tried to reproduce it outside of the CSP library but couldn't. Two tasks are created, that each pass values back and forth, and sleep for a certain amount of time. This works great when running in the browser. However, when running under a web worker, sometimes the `yield sleep()` calls don't actually sleep for the specific time (200 or 300 ms). They immediately get run 0 or 1ms. Internally, it uses a `setTimeout` with the duration passed, here: https://github.com/jlongster/js-csp/blob/master/src/impl/process.js#L129, and queue_delay just calls `setTimeout`: https://github.com/jlongster/js-csp/blob/master/src/impl/dispatch.js#L81. `setTimeout` is also used to run things on the next tick in the dispatcher: https://github.com/jlongster/js-csp/blob/master/src/impl/dispatch.js#L71. We will use message channels when they are available in Firefox. That `setTimeout` uses a duration of 0 to call it as fast as possible. This happens on every single `yield` as the scheduler is moving the system forward, to call into the next generator without putting it on the stack (and keeping it truly async). All of this leads to odd behavior: when in the web worker, in the sleep calls which call `setTimeout` with something like 200ms, the function gets called immediately on the next tick (only 0 or 1ms has passed) sometimes. You can see this by pressing "Run in worker" and in the console it will log the sleeps, how long they should have slept for, and how long they actually slept for. Now, we can actually fix this by changing the dispatcher's call to `setTimeout(func, 0)` (NOT the `setTimeout` that sleep uses). If we call `setTimeout(func, 1)` instead, the whole thing runs along fine. So somehow calling `setTimeout` within a worker with a duration of 0 is messing up the timers.
Reporter | ||
Comment 1•10 years ago
|
||
OSX 10.9.3, Firefox 34.0a1
Reporter | ||
Updated•10 years ago
|
Summary: With heavy use of setTimeout, certain timeouts are incorrect → With heavy use of setTimeout, certain timeouts are incorrect in workers
Reporter | ||
Comment 2•10 years ago
|
||
Any chance someone could at least verify this issue exists?
Reporter | ||
Comment 3•10 years ago
|
||
Quick check... can anyone who has worked with workers speculate why this happens?
Assignee | ||
Comment 4•10 years ago
|
||
(In reply to James Long (:jlongster) from comment #3) > Quick check... can anyone who has worked with workers speculate why this > happens? In roughly decreasing order of probability: - There's a bug in your test case. - There's a bug in our setTimeout implementation on workers (which is indeed separate from the main thread one). - There's a bug in our nsITimer implementation. - There's a bug in whatever function you're using for timing (Date.now?) The main thread does some weird thing with nested timeouts that workers don't do, afaik, so that might be related. http://hg.mozilla.org/mozilla-central/annotate/0c66a9fd9085/dom/base/nsGlobalWindow.cpp#l12129
Comment 5•9 years ago
|
||
Hi there, An issue exactly like this one manifests itself when one uses Socket.io (the popular websockets library) in a web worker in Firefox on MacOS. Basically, Socket.io uses a ping timeout to detect dead sockets and attempt reconnection. The code is here: https://github.com/socketio/socket.io-client/blob/d125da3c0fb05c9bc958a501b102f73d0e552a72/socket.io.js#L1992-L2005 When running this in a web worker in Firefox, the ping timeout sometimes runs immediately, which causes a disconnection of the socket since it looks like a ping timeout. I'm using Firefox 42.0 on an iMac running OS X Yosemite (10.10.5). I can build a repo with the test case, but it would have to be a Node.js solution to run websockets, etc. Very little code is required besides Socket.io and node - a simple page that spawns a worker and connects to the server is enough. No need for complex timer manipulation as in the original report. I'm a long time Firefox user and Mozilla supporter, so if I can help in anyway, please let me know.
Reporter | ||
Comment 6•9 years ago
|
||
FWIW I can't reproduce it with my original test case in Nightly. Whatever was my issue may have been fixed.
Comment 7•9 years ago
|
||
I just tried nightly and the issue affecting Socket.IO is still there. I opened four tabs, waited a couple of minutes, and it happened in one of them. There's no other ongoing websocket traffic, just the Socket.IO pings (they happen every 25 seconds). The issue is not deterministic, there may be some kind of race underneath, but it happens reliably.
I was able to reproduce this by using nested setTimeout calls, (I think that is the key) Check out this jsfiddle. If you let it run for a little while, you will see some 0 differences in time: http://jsfiddle.net/jgrund/o5p3tgp8/1/
Reporter | ||
Comment 9•9 years ago
|
||
Wonderful, that's awesome that you found a simple test case! Hopefully that will help get this fixed.
Reporter | ||
Comment 10•9 years ago
|
||
Joe posted a really simple test case which repros it consistently for me. It is indeed with nested timeouts. How hard do you think this fix is?
Flags: needinfo?(khuey)
Comment 11•9 years ago
|
||
fwiw, I too am experiencing this problem. Good thing I don't run my pacemaker in a web worker. :)
Assignee | ||
Comment 12•9 years ago
|
||
This is some sort of weird race with the timer thread. If I clamp setTimeout(0) to 4 ms like we do on the main thread it goes away.
Flags: needinfo?(khuey)
Assignee | ||
Updated•9 years ago
|
Assignee: nobody → khuey
Assignee | ||
Comment 13•9 years ago
|
||
This was helpful for debugging.
Attachment #8704445 -
Flags: review?(amarchesini)
Assignee | ||
Comment 14•9 years ago
|
||
Read the long commit message to understand the patch.
Attachment #8704447 -
Flags: review?(amarchesini)
Assignee | ||
Comment 15•9 years ago
|
||
(In reply to Joe Grund from comment #8) > I was able to reproduce this by using nested setTimeout calls, (I think that > is the key) Check out this jsfiddle. If you let it run for a little while, > you will see some 0 differences in time: > http://jsfiddle.net/jgrund/o5p3tgp8/1/ This test case was instrumental in figuring this out, btw, so thanks for that. If you send me email @mozilla.com I can find some Mozilla t-shirts or something to say thank you ;)
Flags: needinfo?(grundjoseph)
Assignee | ||
Updated•9 years ago
|
OS: Mac OS X → All
Hardware: x86 → All
Comment on attachment 8704447 [details] [diff] [review] Part 2: Cancel the timer when rescheduling Review of attachment 8704447 [details] [diff] [review]: ----------------------------------------------------------------- This looks fine, jut make sure to add that extra check to avoid calling RescheduleTimeoutTimer when we're already running expired timeouts. ::: dom/workers/WorkerPrivate.cpp @@ +1163,5 @@ > return aWorkerPrivate->RunExpiredTimeouts(aCx); > } > + > + NS_IMETHOD > + Notify(nsITimer* aTimer) Nit: override
Attachment #8704447 -
Flags: review?(amarchesini) → review+
Comment on attachment 8704445 [details] [diff] [review] Part 1: Add a logging module for dump() Review of attachment 8704445 [details] [diff] [review]: ----------------------------------------------------------------- ::: dom/base/nsContentUtils.cpp @@ +7120,5 @@ > > +mozilla::LogModule* > +nsContentUtils::DOMDumpLog() > +{ > + return sDOMDumpLog; This is evil ;)
Attachment #8704445 -
Flags: review?(amarchesini) → review+
Comment 18•9 years ago
|
||
https://hg.mozilla.org/integration/mozilla-inbound/rev/aadea6a5781c https://hg.mozilla.org/integration/mozilla-inbound/rev/4a5dfc998bf4
Comment 19•9 years ago
|
||
bugherder |
https://hg.mozilla.org/mozilla-central/rev/aadea6a5781c https://hg.mozilla.org/mozilla-central/rev/4a5dfc998bf4
Status: NEW → RESOLVED
Closed: 9 years ago
status-firefox46:
--- → fixed
Resolution: --- → FIXED
Target Milestone: --- → mozilla46
Assignee | ||
Updated•9 years ago
|
Flags: needinfo?(grundjoseph)
Assignee | ||
Comment 20•9 years ago
|
||
Comment on attachment 8704447 [details] [diff] [review] Part 2: Cancel the timer when rescheduling I want to take this on beta and to get it into ESR. Intermittently broken web features are bad, and this fix is pretty simple. Approval Request Comment [Feature/regressing bug #]: WebWorkers [User impact if declined]: Sometimes timeouts will randomly break and remain broken. [Describe test coverage new/current, TreeHerder]: We have good tests for this behavior. [Risks and why]: It's possible we introduced some unknown regressions, but this code is fairly straightforward. [String/UUID change made/needed]: None.
Attachment #8704447 -
Flags: approval-mozilla-beta?
Updated•9 years ago
|
status-firefox45:
--- → affected
Comment 21•9 years ago
|
||
Comment on attachment 8704447 [details] [diff] [review] Part 2: Cancel the timer when rescheduling Improve the testsuite, taking it. Should be in 45 beta 2.
Attachment #8704447 -
Flags: approval-mozilla-beta? → approval-mozilla-beta+
Comment 22•9 years ago
|
||
bugherder uplift |
https://hg.mozilla.org/releases/mozilla-beta/rev/508f5c34d798 https://hg.mozilla.org/releases/mozilla-beta/rev/b639a993c955
Comment 23•9 years ago
|
||
seems this caused crashes like https://treeherder.mozilla.org/logviewer.html#?job_id=754623&repo=mozilla-beta
Flags: needinfo?(khuey)
Assignee | ||
Comment 24•9 years ago
|
||
Ah, looks like we need part of bug 901907 too.
Flags: needinfo?(khuey)
Assignee | ||
Comment 25•9 years ago
|
||
Assignee | ||
Comment 26•9 years ago
|
||
I ran tests locally with part 0 and part 2 applied and we don't assert anymore. FWIW, we don't actually need part 1 on beta (although landing it shouldn't hurt). It's just debugging information.
Flags: needinfo?(cbook)
Comment 27•9 years ago
|
||
thanks kyle, landed part 0 and 1 as https://hg.mozilla.org/releases/mozilla-beta/rev/aeee3bd93181 https://hg.mozilla.org/releases/mozilla-beta/rev/1524bf5cebad
Flags: needinfo?(cbook)
Assignee | ||
Comment 28•9 years ago
|
||
(In reply to Carsten Book [:Tomcat] from comment #27) > thanks kyle, landed part 0 and 1 as > https://hg.mozilla.org/releases/mozilla-beta/rev/aeee3bd93181 > https://hg.mozilla.org/releases/mozilla-beta/rev/1524bf5cebad Since I had to go check, for posterity's sake, Tomcat did land parts 0 and 2 per comment 26.
Backed out in https://hg.mozilla.org/releases/mozilla-beta/rev/0daf679f5dd1 for osx build bustage: https://treeherder.mozilla.org/logviewer.html#?job_id=764035&repo=mozilla-beta
Flags: needinfo?(khuey)
Seems like a wrong version of the part 2 commit was uplifted. Here's what it should be: remote: https://hg.mozilla.org/releases/mozilla-beta/rev/285297ca4cc9 remote: https://hg.mozilla.org/releases/mozilla-beta/rev/e29130ca7a64
Assignee | ||
Updated•9 years ago
|
Flags: needinfo?(khuey)
Comment 31•9 years ago
|
||
Hi, Just wanted to confirm that was in a shipped version of Firefox?
Assignee | ||
Comment 32•9 years ago
|
||
Yes, this was in Firefox 45, which shipped in early March.
You need to log in
before you can comment on or make changes to this bug.
Description
•