Worker object cycle collection is broken

RESOLVED FIXED in Firefox 28

Status

Product: Core
Component: DOM: Workers
Status: RESOLVED FIXED
Opened: 5 years ago
Last updated: 4 years ago
People

(Reporter: sole, Assigned: khuey)

Tracking

Keywords: regression
Version: unspecified
Target Milestone: mozilla29
Platform: x86 Mac OS X
Points: ---
Bug Flags: in-testsuite +

Firefox Tracking Flags

(firefox27 unaffected, firefox28 fixed, firefox29 fixed)

Details

Attachments

(2 attachments)

We've found an issue where repeatedly launching web workers on a page results in random delays before the worker actually starts running.

It happens more or less every 20th web worker, and the first frame that "freezes" is also somewhat random. On Firefox release it seems to be consistently the 11th, but on Nightly it is sometimes the first, other times the 2nd, the 11th... quite unpredictable.

I thought our Animated_GIF.js library was doing something wrong, but it also happens with gif.js. Here are two demos, where up to 200 GIFs will be generated. The text in each square is the time elapsed between generating that one and the previous one; if it takes more than a second it's painted RED because it's slow. You'll see there's a certain pattern to this:

* Using Animated_GIF - we're launching 2 workers per GIF so it freezes every 10th GIF:
http://people.mozilla.org/~spenades/test_cases/workers_every_10th/tests_stress.html

* Using gif.js - since it launches only 1 worker per GIF it freezes every 20th GIF:
http://people.mozilla.org/~spenades/test_cases/workers_every_20th/tests_stress.html

We are also seeing this behaviour in chat.meatspac.es where we've already filed a bug:
https://github.com/meatspaces/meatspace-chat/issues/103

This behaviour is not present in Chrome (it crashes in my stress test, but after many GIFs have been generated without lag/timeouts between them).

I have been looking at the network graph and the profiler but I can't really figure out what it is that is happening :-/
There is a hardcoded limit of 20 workers per origin to prevent DOSing the browser.

http://mxr.mozilla.org/mozilla-central/source/modules/libpref/src/init/all.js#100
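The queuing behavior this limit produces can be sketched with a simple per-origin slot counter. This is a hypothetical model for illustration, not Gecko's actual implementation (the real cap is a pref in the file linked above, `dom.workers.maxPerDomain` at the time, defaulting to 20): once the cap is reached, the next worker request waits until a running worker shuts down and frees a slot.

```javascript
// Hypothetical model of a per-origin worker cap. Once `max` workers are
// running, further requests queue until release() frees a slot.
class WorkerSlots {
  constructor(maxPerOrigin = 20) {
    this.max = maxPerOrigin;
    this.running = 0;
    this.queue = [];
  }

  // Resolves when a slot is available; the caller would start its worker then.
  acquire() {
    if (this.running < this.max) {
      this.running++;
      return Promise.resolve();
    }
    return new Promise((resolve) => this.queue.push(resolve));
  }

  // Called when a worker shuts down (or its object is GC'd): the freed slot
  // is handed directly to the oldest queued request, if any.
  release() {
    const next = this.queue.shift();
    if (next) {
      next();
    } else {
      this.running--;
    }
  }
}
```

If a page never terminates its workers and relies solely on GC to reclaim them, `release()` is effectively never called, which matches the symptom of roughly every 20th worker stalling.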
Status: NEW → RESOLVED
Last Resolved: 5 years ago
Resolution: --- → INVALID
(Reporter)

Comment 2

5 years ago
So if the workers are used and then go away, why are they not being GC'ed earlier? Because it looks to me that what's happening is it takes too long for them to be "restored into the pool". Sometimes it takes 7 seconds. Sometimes it takes A MINUTE. Sometimes IT NEVER HAPPENS.

The other browsers are fine. So something is wrong here. Reopening ;)
Status: RESOLVED → REOPENED
Resolution: INVALID → ---
Kyle, this problem doesn't happen in Chrome, and it affects sites like https://chat.meatspac.es which is heavily used.
Ok, I think there is actually a Gecko bug here related to GCing cycles involving workers.
Status: REOPENED → NEW
Summary: Random timeouts when repeatedly launching Web Workers → Worker object cycle collection is broken
This library never actually terminates the workers (unless abort is called), so the only way to shut them down is for the GIF object to be GC'd. If a library consumer causes a GIF object to be entrained, it will leak a running worker. That's not the issue here, but it is poor design.
(Reporter)

Comment 6

5 years ago
Hey Kyle

I added an explicit method to the library so you can ask it to release the workers once you don't want to use the parent object anymore, by calling terminate() on them. This makes the test pass without crashing, so maybe this confirms the GC issue?

Here's the new test: http://sole.github.io/Animated_GIF/tests_stress.html
Web workers shut down in one of three ways: when close() is called on the worker thread, when terminate() is called on the parent thread, or when the Worker object is no longer reachable on the parent thread. There's a bug in the implementation of the latter (having a grey JS wrapper should count as unreachable for these purposes). But the page should do one of the first two anyway, and not rely on the third, which is just there as a last resort.
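The two explicit shutdown paths above can be illustrated as follows. Since the real Worker constructor only exists in a browser, this sketch uses a hypothetical stub class so the lifecycle logic is visible outside one:

```javascript
// Stub standing in for the browser's Worker, to illustrate the two
// explicit shutdown paths (self.close() and worker.terminate()).
class StubWorker {
  constructor() {
    this.alive = true;
  }
  // Path 1: the worker script itself calls self.close().
  simulateSelfClose() {
    this.alive = false;
  }
  // Path 2: the parent thread calls worker.terminate().
  terminate() {
    this.alive = false;
  }
}

// A page wanting deterministic cleanup should terminate workers explicitly
// rather than relying on path 3 (GC of an unreachable Worker object).
function withWorker(fn) {
  const w = new StubWorker();
  try {
    return fn(w);
  } finally {
    w.terminate(); // always release the per-origin slot promptly
  }
}
```

With a real Worker, the `withWorker` pattern guarantees the per-origin slot is freed as soon as the work is done, regardless of when (or whether) the Worker object is collected.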
Assignee: nobody → khuey
status-firefox27: --- → unaffected
status-firefox28: --- → affected
status-firefox29: --- → affected
Keywords: regression
Created attachment 8368810 [details] [diff] [review]
Patch
Attachment #8368810 - Flags: review?(continuation)
Attachment #8368810 - Flags: review?(bent.mozilla)
Comment on attachment 8368810 [details] [diff] [review]
Patch

Review of attachment 8368810 [details] [diff] [review]:
-----------------------------------------------------------------

Hooray for code removal!

The cycle collector and binding generator changes look good to me.
Attachment #8368810 - Flags: review?(continuation) → review+
Comment on attachment 8368810 [details] [diff] [review]
Patch

Review of attachment 8368810 [details] [diff] [review]:
-----------------------------------------------------------------

::: dom/workers/WorkerPrivate.cpp
@@ +3463,5 @@
> +  // down the thread.
> +
> +  if (!tmp->mBusyCount && !tmp->mMainThreadObjectsForgotten) {
> +    NS_CYCLE_COLLECTION_NOTE_EDGE_NAME(cb, "self");
> +    cb.NoteXPCOMChild(tmp);

I'd much prefer this to traverse an actual nsRefPtr<WorkerPrivateParent> called mSelfRef. Then we could set it to something in Constructor() and null it in the WorkerFinishedRunnable. It would at least be more easily searchable than looking for a stray AddRef/Release in here.
Attachment #8368810 - Flags: review?(bent.mozilla) → review+
https://hg.mozilla.org/mozilla-central/rev/3f052ca2795f
Status: NEW → RESOLVED
Last Resolved: 4 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla29
status-firefox29: affected → fixed
Comment on attachment 8368810 [details] [diff] [review]
Patch

[Approval Request Comment]
Bug caused by (feature/regressing bug #): Bug 928312
User impact if declined: Leaks - worker objects may never be GC'd properly; pages that depend on workers being reclaimed (e.g. sole's demos) may not work
Testing completed (on m-c, etc.): On m-c, has automated test.
Risk to taking this patch (and alternatives if risky): Low risk, well tested.  Backing out 928312 would be far more invasive
String or IDL/UUID changes made by this patch: None
Attachment #8368810 - Flags: approval-mozilla-beta?
Comment on attachment 8368810 [details] [diff] [review]
Patch

Switching the approval to aurora as Fx27 is unaffected here
Attachment #8368810 - Flags: approval-mozilla-beta? → approval-mozilla-aurora?

Updated

4 years ago
Attachment #8368810 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
This has some non-trivial conflicts with Aurora that I'm not confident trying to resolve.
Flags: needinfo?(khuey)
Keywords: branch-patch-needed
https://hg.mozilla.org/releases/mozilla-aurora/rev/afad6b72ed8d
status-firefox28: affected → fixed
Keywords: branch-patch-needed