Closed Bug 1059469 Opened 10 years ago Closed 9 years ago

With heavy use of setTimeout, certain timeouts are incorrect in workers

Tracking

()

Status:

RESOLVED FIXED

Milestone:

mozilla46

Tracking Flags:

Tracking

Status

firefox45

---

fixed

firefox46

---

fixed

People

(Reporter: jlong, Assigned: khuey)

Details

Attachments

(3 files)

Part 1: Add a logging module for dump() 9 years ago Kyle Huey (Exited; not receiving bugmail, old account, do not use) 5.21 KB, patch	bent.mozilla : review+	Details \| Diff \| Splinter Review
Part 2: Cancel the timer when rescheduling 9 years ago Kyle Huey (Exited; not receiving bugmail, old account, do not use) 14.42 KB, patch	bent.mozilla : review+ Sylvestre : approval-mozilla-beta+	Details \| Diff \| Splinter Review
Part 0: for beta 9 years ago Kyle Huey (Exited; not receiving bugmail, old account, do not use) 2.44 KB, patch		Details \| Diff \| Splinter Review

James Long (:jlongster)

Reporter

Description

•

10 years ago

This is an edge case bug that I have had an extremely difficult time isolating. I came up with a reproducible test case but unfortunately the JS required to get it is quite large.

The test case is here: http://jlongster.com/s/worker-setTimeout/. It uses a CSP channels library to do some cooperative tasking, which is here: https://github.com/jlongster/js-csp. This code is bundled up via webpack for the test case. The app.js code is pretty straight-forward: http://jlongster.com/s/worker-setTimeout/app.js.

Sorry I can't make a smaller test case. I have no idea what's going on and I tried to reproduce it outside of the CSP library but couldn't.

Two tasks are created, that each pass values back and forth, and sleep for a certain amount of time. This works great when running in the browser.

However, when running under a web worker, sometimes the `yield sleep()` calls don't actually sleep for the specific time (200 or 300 ms). They immediately get run 0 or 1ms. Internally, it uses a `setTimeout` with the duration passed, here: https://github.com/jlongster/js-csp/blob/master/src/impl/process.js#L129, and queue_delay just calls `setTimeout`: https://github.com/jlongster/js-csp/blob/master/src/impl/dispatch.js#L81.

`setTimeout` is also used to run things on the next tick in the dispatcher: https://github.com/jlongster/js-csp/blob/master/src/impl/dispatch.js#L71. We will use message channels when they are available in Firefox.

That `setTimeout` uses a duration of 0 to call it as fast as possible. This happens on every single `yield` as the scheduler is moving the system forward, to call into the next generator without putting it on the stack (and keeping it truly async).

All of this leads to odd behavior: when in the web worker, in the sleep calls which call `setTimeout` with something like 200ms, the function gets called immediately on the next tick (only 0 or 1ms has passed) sometimes. You can see this by pressing "Run in worker" and in the console it will log the sleeps, how long they should have slept for, and how long they actually slept for.

Now, we can actually fix this by changing the dispatcher's call to `setTimeout(func, 0)` (NOT the `setTimeout` that sleep uses). If we call `setTimeout(func, 1)` instead, the whole thing runs along fine. So somehow calling `setTimeout` within a worker with a duration of 0 is messing up the timers.

James Long (:jlongster)

Reporter

Comment 1

•

10 years ago

OSX 10.9.3, Firefox 34.0a1

James Long (:jlongster)

Reporter

Updated

•

10 years ago

Summary: With heavy use of setTimeout, certain timeouts are incorrect → With heavy use of setTimeout, certain timeouts are incorrect in workers

James Long (:jlongster)

Reporter

Comment 2

•

10 years ago

Any chance someone could at least verify this issue exists?

James Long (:jlongster)

Reporter

Comment 3

•

10 years ago

Quick check... can anyone who has worked with workers speculate why this happens?

Kyle Huey (Exited; not receiving bugmail, old account, do not use)

Assignee

Comment 4

•

10 years ago

(In reply to James Long (:jlongster) from comment #3)
> Quick check... can anyone who has worked with workers speculate why this
> happens?

In roughly decreasing order of probability:

- There's a bug in your test case.
- There's a bug in our setTimeout implementation on workers (which is indeed separate from the main thread one).
- There's a bug in our nsITimer implementation.
- There's a bug in whatever function you're using for timing (Date.now?)

The main thread does some weird thing with nested timeouts that workers don't do, afaik, so that might be related.  http://hg.mozilla.org/mozilla-central/annotate/0c66a9fd9085/dom/base/nsGlobalWindow.cpp#l12129

Gustavo Duarte

Comment 5

•

9 years ago

Hi there,

An issue exactly like this one manifests itself when one uses Socket.io (the popular websockets library) in a web worker in Firefox on MacOS.

Basically, Socket.io uses a ping timeout to detect dead sockets and attempt reconnection. The code is here:

https://github.com/socketio/socket.io-client/blob/d125da3c0fb05c9bc958a501b102f73d0e552a72/socket.io.js#L1992-L2005

When running this in a web worker in Firefox, the ping timeout sometimes runs immediately, which causes a disconnection of the socket since it looks like a ping timeout.

I'm using Firefox 42.0 on an iMac running OS X Yosemite (10.10.5). I can build a repo with the test case, but it would have to be a Node.js solution to run websockets, etc. Very little code is required besides Socket.io and node - a simple page that spawns a worker and connects to the server is enough. No need for complex timer manipulation as in the original report.

I'm a long time Firefox user and Mozilla supporter, so if I can help in anyway, please let me know.

James Long (:jlongster)

Reporter

Comment 6

•

9 years ago

FWIW I can't reproduce it with my original test case in Nightly. Whatever was my issue may have been fixed.

Gustavo Duarte

Comment 7

•

9 years ago

I just tried nightly and the issue affecting Socket.IO is still there. I opened four tabs, waited a couple of minutes, and it happened in one of them. There's no other ongoing websocket traffic, just the Socket.IO pings (they happen every 25 seconds). The issue is not deterministic, there may be some kind of race underneath, but it happens reliably.

Joe Grund

Comment 8

•

9 years ago

I was able to reproduce this by using nested setTimeout calls, (I think that is the key) Check out this jsfiddle. If you let it run for a little while, you will see some 0 differences in time: http://jsfiddle.net/jgrund/o5p3tgp8/1/

James Long (:jlongster)

Reporter

Comment 9

•

9 years ago

Wonderful, that's awesome that you found a simple test case! Hopefully that will help get this fixed.

James Long (:jlongster)

Reporter

Comment 10

•

9 years ago

Joe posted a really simple test case which repros it consistently for me. It is indeed with nested timeouts. How hard do you think this fix is?

Flags: needinfo?(khuey)

wayneseymour

Comment 11

•

9 years ago

fwiw, I too am experiencing this problem.
Good thing I don't run my pacemaker in a web worker.  :)

Kyle Huey (Exited; not receiving bugmail, old account, do not use)

Assignee

Comment 12

•

9 years ago

This is some sort of weird race with the timer thread.  If I clamp setTimeout(0) to 4 ms like we do on the main thread it goes away.

Flags: needinfo?(khuey)

Kyle Huey (Exited; not receiving bugmail, old account, do not use)

Assignee

Updated

•

9 years ago

Assignee: nobody → khuey

Kyle Huey (Exited; not receiving bugmail, old account, do not use)

Assignee

Comment 13

•

9 years ago

Attached patch Part 1: Add a logging module for dump() — Details — Splinter Review

This was helpful for debugging.

Attachment #8704445 - Flags: review?(amarchesini)

Kyle Huey (Exited; not receiving bugmail, old account, do not use)

Assignee

Comment 14

•

9 years ago

Attached patch Part 2: Cancel the timer when rescheduling — Details — Splinter Review

Read the long commit message to understand the patch.

Attachment #8704447 - Flags: review?(amarchesini)

Kyle Huey (Exited; not receiving bugmail, old account, do not use)

Assignee

Comment 15

•

9 years ago

(In reply to Joe Grund from comment #8)
> I was able to reproduce this by using nested setTimeout calls, (I think that
> is the key) Check out this jsfiddle. If you let it run for a little while,
> you will see some 0 differences in time:
> http://jsfiddle.net/jgrund/o5p3tgp8/1/

This test case was instrumental in figuring this out, btw, so thanks for that.  If you send me email @mozilla.com I can find some Mozilla t-shirts or something to say thank you ;)

Flags: needinfo?(grundjoseph)

Kyle Huey (Exited; not receiving bugmail, old account, do not use)

Assignee

Updated

•

9 years ago

OS: Mac OS X → All

Hardware: x86 → All

Ben Turner (not reading bugmail, use the needinfo flag!)

Comment 16

•

9 years ago

Comment on attachment 8704447 [details] [diff] [review]
Part 2: Cancel the timer when rescheduling

Review of attachment 8704447 [details] [diff] [review]:
-----------------------------------------------------------------

This looks fine, jut make sure to add that extra check to avoid calling RescheduleTimeoutTimer when we're already running expired timeouts.

::: dom/workers/WorkerPrivate.cpp
@@ +1163,5 @@
>      return aWorkerPrivate->RunExpiredTimeouts(aCx);
>    }
> +
> +  NS_IMETHOD
> +  Notify(nsITimer* aTimer)

Nit: override

Attachment #8704447 - Flags: review?(amarchesini) → review+

Ben Turner (not reading bugmail, use the needinfo flag!)

Comment 17

•

9 years ago

Comment on attachment 8704445 [details] [diff] [review]
Part 1: Add a logging module for dump()

Review of attachment 8704445 [details] [diff] [review]:
-----------------------------------------------------------------

::: dom/base/nsContentUtils.cpp
@@ +7120,5 @@
>  
> +mozilla::LogModule*
> +nsContentUtils::DOMDumpLog()
> +{
> +  return sDOMDumpLog;

This is evil ;)

Attachment #8704445 - Flags: review?(amarchesini) → review+

Pulsebot

Comment 18

•

9 years ago

https://hg.mozilla.org/integration/mozilla-inbound/rev/aadea6a5781c
https://hg.mozilla.org/integration/mozilla-inbound/rev/4a5dfc998bf4

Carsten Book [:Tomcat]

Comment 19

•

9 years ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/aadea6a5781c
https://hg.mozilla.org/mozilla-central/rev/4a5dfc998bf4

Status: NEW → RESOLVED

Closed: 9 years ago

status-firefox46: --- → fixed

Resolution: --- → FIXED

Target Milestone: --- → mozilla46

Kyle Huey (Exited; not receiving bugmail, old account, do not use)

Assignee

Updated

•

9 years ago

Flags: needinfo?(grundjoseph)

Kyle Huey (Exited; not receiving bugmail, old account, do not use)

Assignee

Comment 20

•

9 years ago

Comment on attachment 8704447 [details] [diff] [review]
Part 2: Cancel the timer when rescheduling

I want to take this on beta and to get it into ESR.  Intermittently broken web features are bad, and this fix is pretty simple.

Approval Request Comment
[Feature/regressing bug #]: WebWorkers
[User impact if declined]: Sometimes timeouts will randomly break and remain broken.
[Describe test coverage new/current, TreeHerder]: We have good tests for this behavior.
[Risks and why]: It's possible we introduced some unknown regressions, but this code is fairly straightforward.
[String/UUID change made/needed]: None.

Attachment #8704447 - Flags: approval-mozilla-beta?

Sylvestre Ledru [:Sylvestre]

Updated

•

9 years ago

status-firefox45: --- → affected

Sylvestre Ledru [:Sylvestre]

Comment 21

•

9 years ago

Comment on attachment 8704447 [details] [diff] [review]
Part 2: Cancel the timer when rescheduling

Improve the testsuite, taking it.

Should be in 45 beta 2.

Attachment #8704447 - Flags: approval-mozilla-beta? → approval-mozilla-beta+

Carsten Book [:Tomcat]

Comment 22

•

9 years ago

bugherder uplift

https://hg.mozilla.org/releases/mozilla-beta/rev/508f5c34d798
https://hg.mozilla.org/releases/mozilla-beta/rev/b639a993c955

status-firefox45: affected → fixed

Carsten Book [:Tomcat]

Comment 23

•

9 years ago

seems this caused crashes like https://treeherder.mozilla.org/logviewer.html#?job_id=754623&repo=mozilla-beta

Flags: needinfo?(khuey)

Kyle Huey (Exited; not receiving bugmail, old account, do not use)

Assignee

Comment 24

•

9 years ago

Ah, looks like we need part of bug 901907 too.

Flags: needinfo?(khuey)

Kyle Huey (Exited; not receiving bugmail, old account, do not use)

Assignee

Comment 25

•

9 years ago

Attached patch Part 0: for beta — Details — Splinter Review

Kyle Huey (Exited; not receiving bugmail, old account, do not use)

Assignee

Comment 26

•

9 years ago

I ran tests locally with part 0 and part 2 applied and we don't assert anymore.

FWIW, we don't actually need part 1 on beta (although landing it shouldn't hurt).  It's just debugging information.

Flags: needinfo?(cbook)

Carsten Book [:Tomcat]

Comment 27

•

9 years ago

thanks kyle, landed part 0 and 1 as
https://hg.mozilla.org/releases/mozilla-beta/rev/aeee3bd93181
https://hg.mozilla.org/releases/mozilla-beta/rev/1524bf5cebad

Flags: needinfo?(cbook)

Kyle Huey (Exited; not receiving bugmail, old account, do not use)

Assignee

Comment 28

•

9 years ago

(In reply to Carsten Book [:Tomcat] from comment #27)
> thanks kyle, landed part 0 and 1 as
> https://hg.mozilla.org/releases/mozilla-beta/rev/aeee3bd93181
> https://hg.mozilla.org/releases/mozilla-beta/rev/1524bf5cebad

Since I had to go check, for posterity's sake, Tomcat did land parts 0 and 2 per comment 26.

Wes Kocher (:KWierso) (Not reading bugmail; email directly if needed)

Comment 29

•

9 years ago

Backed out in https://hg.mozilla.org/releases/mozilla-beta/rev/0daf679f5dd1 for osx build bustage: https://treeherder.mozilla.org/logviewer.html#?job_id=764035&repo=mozilla-beta

status-firefox45: fixed → affected

Flags: needinfo?(khuey)

Wes Kocher (:KWierso) (Not reading bugmail; email directly if needed)

Comment 30

•

9 years ago

Seems like a wrong version of the part 2 commit was uplifted.

Here's what it should be:
remote:   https://hg.mozilla.org/releases/mozilla-beta/rev/285297ca4cc9
remote:   https://hg.mozilla.org/releases/mozilla-beta/rev/e29130ca7a64

status-firefox45: affected → fixed

Kyle Huey (Exited; not receiving bugmail, old account, do not use)

Assignee

Updated

•

9 years ago

Flags: needinfo?(khuey)

Joe Grund

Comment 31

•

9 years ago

Hi, Just wanted to confirm that was in a shipped version of Firefox?

Kyle Huey (Exited; not receiving bugmail, old account, do not use)

Assignee

Comment 32

•

9 years ago

Yes, this was in Firefox 45, which shipped in early March.

You need to log in before you can comment on or make changes to this bug.