1052398 - Web workers past the first twenty will silently fail to launch.

Reporter

Description

•

11 years ago

Attached file worker_pool_lock.tar.gz — Details

In several occasions, it is necessary to create a pool of web workers for later use. An application that does that however, will not get any guarantee whether the workers it creates are actually functional, because after the first twenty workers, the subsequent workers will silently appear to have been successfully created, but they will not execute. This makes it very difficult to reason about resource constraints and a maximum pool size, when the application is not notified about resource limitations in any way. STR: 1. Download the attached test case, unzip. 2. Launch worker_pool_lock.html Observed: The page prints messages 'Main thread working..' and 'worker i working..' for i in [0, 19], even though the page tried to launch 40 workers. There are no error messages printed. Expected: Creating new workers beyond the first 20 should fail with an exception, or there should be some other synchronous mechanism so that the code can know that the newly allocated Worker is not actually functioning. As a workaround, to detect a maximum size for a concurrently-executable worker pool, the code has to keep creating new Workers, send a 'initialization' message to them, have them message back, and then wait for an indeterminate unspecified period to hear back to guess whether the Worker is actually running.

Lars T Hansen [:lth]

Comment 1

•

11 years ago

Ben Turner explained to me a while back that this was to prevent resource exhaustion, esp on B2G. The compiled-in limit (dom/workers/RuntimeService.cpp, look for MAX_WORKERS_PER_DOMAIN) is actually 10; I think 20 is just the default pref setting on some platforms. I have been thinking that the right way to fix this is to allow a second parameter to the Worker constructor that asks for an error return if the worker cannot be assigned a physical thread. This could be folded into a general policy object, something like this: // w must have a thread, and will require access to shared memroy facilities w = new Worker(url, {requiresThread: true, sharedMem: true}) On the deeper level I think the fixed 10-per-domain setting is probably wrong; a system-wide allowance would make more sense, since a lot of pages won't use workers, or at least not many.

Ben Kelly [:bkelly, not reviewing]

Comment 2

•

11 years ago

(In reply to Lars T Hansen [:lth] from comment #1) > On the deeper level I think the fixed 10-per-domain setting is probably > wrong; a system-wide allowance would make more sense, since a lot of pages > won't use workers, or at least not many. Right, but a system-wide limit would allow one domain to effectively DoS other domains from accessing workers. This would probably be worth bringing up on the dev-webapi mailing list.

Ben Kelly [:bkelly, not reviewing]

Comment 3

•

11 years ago

This thread, while not exactly what you are asking for, may be interesting: https://groups.google.com/forum/#!topic/mozilla.dev.platform/QnhfUVw9jCI In particular the code to estimate useful amounts of parallelism here might be useful: http://wg.oftn.org/projects/core-estimator/demo/

Jukka Jylänki

Reporter

Comment 4

•

11 years ago

I'm aware of core estimator, but I am not looking for how to detect how many cores a system has. I'm asking for a way to ensure that the web page does not hang when it tries to schedule work on a Worker that will never start. I want a way to know when the page is no longer allowed to create a new Worker that would run simultaneously with other Workers, so that I can have the page cooperatively not try to exceed the limitations set forth by the browser.

Lars T Hansen [:lth]

Comment 5

•

11 years ago

FWIW, the resource drain I referred to in comment #1 had nothing to do with cores or available parallelism (at all) but was attempting to deal with the fact that by creating a lot of workers, each of which gets its own runtime with all that entails (threads, GC, heap, stack) one could make the entire computer system unusable or at least unstable.

Ben Kelly [:bkelly, not reviewing]

Comment 6

•

11 years ago

Sorry for confusing the issue. All I was trying to say was you could possibly workaround this problem with something like the core estimator.

Jukka Jylänki

Reporter

Comment 7

•

11 years ago

Ah now I understand. The underlying issue is not actually to match the number of hardware threads, but to work around the fact that creating new workers is an asynchronous operation (https://bugzilla.mozilla.org/show_bug.cgi?id=1049091) and a slow one compared to native pthreads.

Alon Zakai (:azakai)

Comment 8

•

11 years ago

On a side note, the silent failure is really confusing, it led to weeks of delay in the Massive benchmark project for example (bug 1021274). At minimum a warning in the console would useful, I think.

(no longer active)

Comment 9

•

11 years ago

Why can't we throw from the Worker ctor?

Flags: needinfo?(bent.mozilla)

Ben Turner (not reading bugmail, use the needinfo flag!)

Comment 10

•

11 years ago

We try to run as many as possible up to a per-origin limit, and thereafter the workers get queued. As workers finish executing we start the queued workers. So they do run eventually provided that the origin doesn't hold all its workers alive forever. Realistically the best we can do in the short term is bug 1037725.

Flags: needinfo?(bent.mozilla)

Jukka Jylänki

Reporter

Comment 11

•

11 years ago

Just printing a message to console does not help in any way for this bug, because the JS program cannot catch that message and act on it.

Jukka Jylänki

Reporter

Comment 12

•

10 years ago

I was testing core estimator today on a beefy 8 hw / 16 logical cores cpu, and I find that core estimator itself hangs in detection because of the domain limit, see https://github.com/oftn/core-estimator/issues/8. It would be nice to somehow have a way to create a worker with an up front "please throw at me" if the page's quota for new threads is exhausted.

Lars T Hansen [:lth]

Comment 13

•

10 years ago

BTW, this issue as well as the must-return-to-main-loop-before-worker-is-created issue are arguably bugs in Firefox vis-a-vis the WHATWG worker spec. I have filed a bug against that spec for clarifications (and other issues): https://www.w3.org/Bugs/Public/show_bug.cgi?id=29039. There is more about why I think these are FF bugs in that bug, but briefly, the wording of the WHATWG spec (in my reading) pretty much requires the browser to immediately fork off a truly concurrent thread to run the worker. I just think that needs to be stated even more explicitly.

Ben Kelly [:bkelly, not reviewing]

Comment 14

•

10 years ago

The artificial worker limit has been painful for service workers as well. Kyle, what do you think about relaxing this restriction? It seems at the very least 20 is a quite low limit given hardware these days.

Flags: needinfo?(khuey)

Kyle Huey (Exited; not receiving bugmail, old account, do not use)

Comment 15

•

10 years ago

I'm not that opposed to raising the limit a bit, but I still think we need a limit, which means we still need to solve comment 8/9.

Flags: needinfo?(khuey)

Kyle Huey (Exited; not receiving bugmail, old account, do not use)

Comment 16

•

10 years ago

(In reply to Lars T Hansen [:lth] from comment #13) > There is more about why I think these are FF bugs in that bug, but briefly, > the wording of the WHATWG spec (in my reading) pretty much requires the > browser to immediately fork off a truly concurrent thread to run the worker. > I just think that needs to be stated even more explicitly. Even if we do change the spec it would take years of engineering work to make that feasible in Gecko.

Luke Wagner [:luke]

Comment 17

•

10 years ago

(In reply to Kyle Huey [:khuey] (khuey@mozilla.com) from comment #16) Are you specifically referring to letting a worker run before its parent returns to its event queue? If so, what, at a high-level, is the reason this is hard?

Kyle Huey (Exited; not receiving bugmail, old account, do not use)

Comment 18

•

10 years ago

(In reply to Luke Wagner [:luke] from comment #17) > (In reply to Kyle Huey [:khuey] (khuey@mozilla.com) from comment #16) > Are you specifically referring to letting a worker run before its parent > returns to its event queue? If so, what, at a high-level, is the reason > this is hard? In Gecko, doing a network load (and all loads, even data URIs, etc go through that code) requires the main thread to be processing events. We can't load the script to run if the main thread isn't processing events, so the worker can't run. Fixing that requires rearchitecting the networking stack, or selectively bypassing it for certain types of loads which is its own can of worms.

Luke Wagner [:luke]

Comment 19

•

10 years ago

(In reply to Kyle Huey [:khuey] (khuey@mozilla.com) from comment #18) Ah, makes sense; that is hard. By any lucky chance do network loads only run on the parent-process main thread and so e10s would help here?

Kyle Huey (Exited; not receiving bugmail, old account, do not use)

Comment 20

•

10 years ago

(In reply to Luke Wagner [:luke] from comment #19) > (In reply to Kyle Huey [:khuey] (khuey@mozilla.com) from comment #18) > Ah, makes sense; that is hard. By any lucky chance do network loads only > run on the parent-process main thread and so e10s would help here? No. e10s Necko proxies from the main thread of the content process to the main thread of the parent process (as do a lot of other things). On a related note, if you lock up the main thread, the browser UI stops responding, we stop repainting, etc. So you shouldn't be doing that anyways. I *believe* that if you launch a subworker from a worker the subworker will begin to run without the worker returning to the event loop (assuming that the main thread is processing event loops). If that is not the case it is probably fixable in a much shorter amount of time.

Luke Wagner [:luke]

Comment 21

•

10 years ago

(In reply to Kyle Huey [:khuey] (khuey@mozilla.com) from comment #20) > On a related note, if you lock up the main thread, the browser UI stops > responding, we stop repainting, etc. So you shouldn't be doing that anyways. That makes sense. The use case here is fine-grained locking which, when it doesn't deadlock, wouldn't lock up the main thread for long. > I *believe* that if you launch a subworker from a worker the subworker will > begin to run without the worker returning to the event loop (assuming that > the main thread is processing event loops). If that is not the case it is > probably fixable in a much shorter amount of time. Ahhh, good point. It is a goal to be able to run these games' main threads in a worker (hence OffScreenCanvas) so that might be good enough.

Eli Grey (:sephr)

Comment 22

•

10 years ago

This issue breaks my Core Estimator library on any machine with 16 or more threads running Firefox. Please fail loudly with a thrown error, not silently and break my code. You should also consider dropping the 20 thread limit entirely, as there are already single-socket desktops with 32 threads being throttled by Firefox. Why an arbitrary limit of 20? It's not saving any resources on common computers with less than 20 cores, and it's only throttling performance on computers with more than 20 cores. If the goal was to limit resource use, then the limit should be relative to actual system threads (e.g. set the limit to 80% of actual threads). No other browser has an arbitrary limit on worker threads. In the meantime, I'm going to have to hard code in *gasp* UA sniffing as it's my only option left to solve this issue. I'll also be adding a passive aggressive console.warn telling developers that their browser is throttling access to system resources, and to subscribe to issue 1052398 for updates.

Luke Wagner [:luke]

Comment 24

•

10 years ago

So are there any objections to throwing an exception when the per-origin limit is hit? I haven't seen any specific disagreement on that point.

Lars T Hansen [:lth]

Comment 25

•

10 years ago

Looping in Anne, since he's going to have a hand in cleaning up the Workers spec (IIUC). In case he has opinions. I think throwing an error when we can't create a worker is a must-have; the alternatives are all worse than that (namely: returning a Worker that appears to be running but isn't, like now; or returing a Worker but allowing introspection to determine that it is not in fact running). I think the per-origin limit is probably not the best we could hope for but that's orthogonal to throwing when creation fails.

Anne (:annevk)

Comment 26

•

10 years ago

Do other browsers throw when creation fails? I wonder if throwing would be compatible with deployed code. Otherwise we could make the throwing behavior optin as suggested earlier: w = new Worker(..., { throwOnCreation: true })

Simon Pieters [:zcorpan]

Comment 27

•

10 years ago

Presto throws QUOTA_EXCEEDED_ERR when exceeding its limit, which is configurable in opera:config. The default in Opera 12 is 256 per window, 1024 per session. It was increased at some point from 16 per window, I think. ("window" is probably top-level browsing context.)

Jukka Jylänki

Reporter

Comment 28

•

10 years ago

I would definitely favor the opt-in "throwOnCreation: true" approach, that would be a perfect fit for all use cases that I have heard which suffer from the issue of not knowing if a new worker can be created or not. Keeping the default behavior to silently queue the worker to spawn when resources are available sounds ok to me. There may be some applications already which schedule several workers for temporary jobs, and they might sporadically go above the limit, and not be prepared to check for a failure. But we should definitely document the default silent queueing behavior in MDN, along with a newly added flag.

Lars T Hansen [:lth]

Comment 29

•

10 years ago

Anne opened a bug on the WhatWG tracker: https://github.com/whatwg/html/issues/319. There are links there to previous correspondence about this issue.

Lars T Hansen [:lth]

Comment 30

•

10 years ago

To summarize that correspondence, the main reason not to throw an exception when failing to create a worker is that an exception can only be thrown for some errors (quota exceeded) but not for others (script failed to load, for example). There are some other cases where exceptions are not a great fit. This message about Chrome's behavior is probably the best summary: https://lists.w3.org/Archives/Public/public-whatwg-archive/2010Sep/0414.html. A flag like "throwOnCreation" (a better name is needed), or an error callback / event, or any such mechanism, is nice because it asks for particular behavior in an important use case without requiring us to solve all cases. (I'm not sure why the error mechanism of the Worker interface can't be reused for this, perhaps it feels weird to have a Worker object that isn't backed by a thread.)

Duan Yao

Comment 31

•

10 years ago

(This comment was posted here: https://github.com/whatwg/html/issues/319#issuecomment-156008810 ) I think the root problem is the fact that running workers can't be swapped in/out -- by "swap out" I mean kill (or detach it from) its native thread, or even release its memory. If they could, a UA could select one to swap out and let the newly created one have a chance to run, if the quota is exceeded. However, saving and restoring a worker's state (JS context) seems hard to implement for UA. So I suggest that UA just throw away a worker's state during swapout, and re-run its entry script during swapin. Of cause this may break existing codes, so I suggest to add an opt-in parameter swappable: var swappableWorker = new Worker(scriptUrl, { swappable: true }); UA would only swap out swappable workers, so authors who want to create large amount of workers should make them swappable. UA could have two separate limits: "max non-swappable workers per-domain" and "max concurrent swappable workers per-domain", so that non-swappable workers can't block swappable ones. There is no hard limit on how many swappable workers can be alive at a same time. It may be also useful to fire a savestate event to a worker that is about to be swapped out, so that it can save/restore state by itself. UA should schedule swappable workers fairly, and also reduce interuption. For example, it is better to swap out an inactive worker than a worker that is doing computation or has pending XHRs. If messages are posted to a swapped-out worker, UA should queue up them, and try to swap the worker in, and then let the worker process them.

Lars T Hansen [:lth]

Comment 32

•

10 years ago

(In reply to Duan Yao from comment #31) > (This comment was posted here: > https://github.com/whatwg/html/issues/319#issuecomment-156008810 ) > > I think the root problem is the fact that running workers can't be swapped > in/out -- by "swap out" I mean kill (or detach it from) its native thread, > or even release its memory. If they could, a UA could select one to swap out > and let the newly created one have a chance to run, if the quota is exceeded. That might be useful for some cases, but for the use case at hand - high performance computing where a worker acts as a thread, and where a worker can go to sleep synchronously waiting for a wakeup from another thread - it will not be appropriate. Here, it is much more important that the program have knowledge of the numebr of threads available to it, so that it can do its own resource scheduling with that information.

Jukka Jylänki

Reporter

Comment 33

•

10 years ago

In Emscripten applications such swappable workers would not be useful, since Emscripten-ported applications require that if a thread execution context is paused/context-switched out, it should later always resume execution in the same state. Application developers may create workers which run their own infinite main loop, which would not allow throwing away worker's state during swapout, and restarting the worker afterwards. Also, with SharedArrayBuffer, a worker might have been holding a mutex, which would cause a deadlock. In either case, that is not the point of this bug, but just the fact that currently JS code cannot reliably measure if a "new Worker" call will ever actually start up, since it involves a network request, so one needs to make a dirty guess about what an upper time limit for such network request+JS parsing+a potential asm.js compilation to complete would be. The main thread application logic might have to pause for the duration until it knows whether the worker actually did spawn. If one guesses a too small timeout, then it's a chance that the application fails to run even when the network was just acting a bit slow, or parsing or asm.js compilation took a long time (mobile). If one guesses a too big timeout, then it creates an annoying long wait when an application hits the quota limit. In my mind, the proposal in comment 26 would solve the issue perfectly. It is extremely desirable here to receive a synchronous failure (throw immediately), rather than an asynchronous failure.

Duan Yao

Comment 34

•

10 years ago

Well, for non-swappable workers, I aggree that UA should expose the number of available non-swappable thread and throw on failure of creating a thread immeditely. However, without swap, this may severely limit the number of workers that can be created. Current spec allow UA to kill any worker at any time, this is even harsher than swapping, and for shared memory case may cause deadlock/data corruption too. Do we want to forbid this behavior (https://www.w3.org/Bugs/Public/show_bug.cgi?id=29039)? I suggest take workers that share memory as a whole (like a process), and UA should kill/swap out all or none of them at a time. For a worker that runs a tight loop, I think the UA may set a flag `saveStateFlag` of the worker to true when the UA want to swap it out; the worker should check the flag in the loop and save state, flip the flag, and continue. Is this OK?

Lars T Hansen [:lth]

Comment 35

•

10 years ago

(In reply to Duan Yao from comment #34) > Well, for non-swappable workers, I aggree that UA should expose the number > of available non-swappable thread and throw on failure of creating a thread > immeditely. However, without swap, this may severely limit the number of > workers that can be created. Definitely. But for the use case at hand that is considered an acceptable cost. > Current spec allow UA to kill any worker at any time, this is even harsher > than swapping, and for shared memory case may cause deadlock/data corruption > too. Do we want to forbid this behavior > (https://www.w3.org/Bugs/Public/show_bug.cgi?id=29039)? As I've argued on that bug, I think that the Workers spec needs some kind of wording to allow workers to be killed, but it needs to define clearly the circumstances under which that may happen. The current wording is useless even for existing use cases, as there's no way to trust that a worker will ever do what it's supposed to do, even under the most benign conditions. That might make sense if the worker were a remote entity but not when it's running on the local computer, and has a means of reporting execution errors (as it does) but not to report anything when it abruptly terminates because of action by the UA. > I suggest take > workers that share memory as a whole (like a process), and UA should > kill/swap out all or none of them at a time. It's possible that that could work, though not if the workers would have to cooperate in the swapping, as I think you're suggesting. A single uncooperating worker would prevent the group from being swapped out. > For a worker that runs a tight loop, I think the UA may set a flag > `saveStateFlag` of the worker to true when the UA want to swap it out; the > worker should check the flag in the loop and save state, flip the flag, and > continue. Is this OK? Probably not, if it's not possible to opt out of swapping. As Jukka and I both pointed out, "restoring state" is sometimes extremely complicated or in practice impossible. Swapping would have to be something to opt in to.

Duan Yao

Comment 36

•

10 years ago

> Swapping would have to be something to opt in to. This was exactly what I proposed :). > As Jukka and I both pointed out, "restoring state" is sometimes extremely complicated or in practice impossible. Sometimes, yes. But in practice, all native mobile apps may be asked to save sate by OS and then killed on low memory condition, and later may be resarted and restored if user switch to it again. Even if an (web) app don't want to bother with "restoring state", it still have to deal with forced close/ crash conditions, and make its state consistent if that happend. For some use cases we want unlimited number of workers. For example, zip.js (https://github.com/gildas-lormeau/zip.js) backs each opened zip file with a worker for deflating/inflating. It is frustrating that the number of simultaneously opened zip file is so limited, and affected by other usage of worker in a same domain. Making worker swappable is not trival but definitely doable in zip.js. Additionally, we can even back each opened zip file with multiple workers to take advantage of multi-core CPUs, without worrying about exhausting worker quota quickly. Sure, a group of workers sharing memory must cooperate to be swappable. You have to make sure there is no "uncooperating worker" in this group. But you already have to make them cooperate to achieve memory safety, right? Yes, I think UA should inform a worker and the holder of the worker that the worker will be killed (like SIGTERM or something?).

Olli Pettay [:smaug][bugs@pettay.fi]

Updated

•

10 years ago

Depends on: 1241485

Marion Daly [:mdaly]

Updated

•

8 years ago

Priority: -- → P3

Ben Kelly [:bkelly, not reviewing]

Comment 38

•

8 years ago

We currently have a limit of 512. This is probably as close to "unlimited" as we are willing to go. We can't completely remove the limit due to security concerns from spamming stacks into the address space. I'm going to make this as WORKSFORME.

Status: NEW → RESOLVED

Closed: 8 years ago

Resolution: --- → WORKSFORME