Closed Bug 437152 (workerthreads) Opened 16 years ago Closed 16 years ago

implement worker threads

Categories

(Core :: DOM: Core & HTML, defect, P1)

x86
All
defect

Tracking

()

RESOLVED FIXED

People

(Reporter: vlad, Assigned: bent.mozilla)

References

(Depends on 1 open bug, )

Details

Attachments

(7 files, 4 obsolete files)

807 bytes, text/html
Details
143.83 KB, patch
jst
: review-
Details | Diff | Splinter Review
9.84 KB, patch
jst
: review+
Details | Diff | Splinter Review
4.01 KB, patch
jst
: review+
Details | Diff | Splinter Review
1.04 KB, patch
jst
: review+
Details | Diff | Splinter Review
3.20 KB, patch
jst
: review+
Details | Diff | Splinter Review
5.72 KB, patch
smaug
: review-
Details | Diff | Splinter Review
We need to implement a JS worker threads API accessible to content.  More discussion and links coming soon.
wanted1.9.1+.  If this should be blocking, please mark, vlad.
Flags: wanted1.9.1+
Priority: -- → P1
Assignee: nobody → bent.mozilla
Status: NEW → ASSIGNED
Attached patch WIP (obsolete) — Splinter Review
Here's my latest progress. Most of the basics work great, like making multiple threads and sending messages all over the place. Workers are canceled when navigating away from the page. Errors in worker scripts are properly reported.

I did make some destabilizing changes yesterday (one context per worker on any thread changed to one context per thread for any worker) but i think I've fixed that all up now.

I'm not currently being bitten by JSEng/XPConnect bugs but I do have a bunch of patches included here that I need to separate out and file bugs on (hangs and crashes alike).

Things missing:
1. Suspend and Resume on page navigation.
2. Loading script files instead of just strings.
3. Fun stuff like XHR and Timers.
Here's an example of how you use the threads currently. The API exposed to content JS is absolutely not frozen or even slushy because we haven't had the necessary discussions with others (Gears, HTML5 folks) yet. This is just an idea of what it could look like.
Thanks for the update ben.  Good stuff!
Attached patch WIP (obsolete) — Splinter Review
Now has suspend/resume/cancel on navigation and cache drop.
Attachment #325946 - Attachment is obsolete: true
Attachment #326046 - Attachment is patch: true
Attachment #326046 - Attachment mime type: application/octet-stream → text/plain
http://wiki.mozilla.org/DOMWorkerThreads is the beginning of the API proposal. It's nowhere near done yet, I'll post to mozilla.dev.platform when it's ready to be debated by a larger audience.
Is there a "make XHR threadsafe" bug somewhere?
Hmm, threadsafe XHR...
Currently XHR is pretty heavily bound to DOM.
Does cycle collector work well with threads?
Sorry, the plan is not to make a "threadsafe XHR". We're going to implement a subset of XHR similar to this: http://code.google.com/apis/gears/api_httprequest.html
Attached patch Patch, v1 (obsolete) — Splinter Review
I think this is ready for review. String-only message passing, pause/resume/cancel, timeouts, and intervals all should work. Basic tests included.

Note that I've stripped out all the miscellaneous JS/XPConnect/XPCOM changes that I needed into blocking bugs so this patch by itself won't work!

I've also got some more elaborate tests on the way that aren't quite ready for review yet.
Attachment #326046 - Attachment is obsolete: true
Attachment #328303 - Flags: review?(benjamin)
Comment on attachment 328303 [details] [diff] [review]
Patch, v1

My review plan was to get bsmedberg to review the threading arch and to get jst to review the js-y stuff. Sound adequate? Anyone else I need to loop in here?
Attachment #328303 - Flags: review?(jst)
Comment on attachment 328303 [details] [diff] [review]
Patch, v1

Vlad, do you want to take a look too?
Attachment #328303 - Flags: review?(vladimir)
(In reply to comment #10)

Is the API still being discussed at all? I noticed a couple, little, strange bits that would be good to discuss.

Also - are there more tests coming along? I only saw the one Mochitest - wasn't sure if there was supposed to be more, or not.
Comment on attachment 328303 [details] [diff] [review]
Patch, v1

I've found one additional thread safety issue with timeouts and will post a new patch shortly.
Attachment #328303 - Attachment is obsolete: true
Attachment #328303 - Flags: review?(vladimir)
Attachment #328303 - Flags: review?(jst)
Attachment #328303 - Flags: review?(benjamin)
Attached patch Patch, v1.1Splinter Review
Fixes the timeout threadsafety issue (the linked list of timeouts in nsDOMWorkerThread wasn't protected). Also moved from a linked list of runnables in the thread service to a deque to simplify the code.

I'm still seeing a bizarre slowdown of previously-suspended timeouts (intervals keep coming back later and later even though cpu usage remains flat)... Maybe a reviewer can spot the problem? It's a (relatively) small bug so I wouldn't necessarily say it should block the alpha.
Attachment #329381 - Flags: review?(jst)
(In reply to comment #13)
> Is the API still being discussed at all? I noticed a couple, little, strange
> bits that would be good to discuss.

The API is still being discussed, yes. Whatever we include in the alpha will almost certainly not be final.

> Also - are there more tests coming along?

I've included another mochitest (for error propagation) with the latest patch and I'm still working on others.

Comment on attachment 329381 [details] [diff] [review]
Patch, v1.1

- In nsNavigator::NewWorkerPool():

+  nsCOMPtr<nsIDOMThreadService> threadService =
+    do_GetService("@mozilla.org/domthreads/service;1", &rv);

This service is in the same library, right? No need to go through the service manager then. And would this even need to be a registered service?

+  rv = threadService->CreatePool(getter_AddRefs(newPool));
+  NS_ENSURE_SUCCESS(rv, rv);
+
+  newPool.forget(_retval);

If rv is an error code, we'll return w/o setting *_retval to null here. You could just pass _retval to CreatePool() here and not use the nsCOMPtr at all.

+/**
+ * Simple class to automatically destroy a JSContext to make error handling
+ * easier.
+ */
+class JSAutoContext

Since this only automatically destroys a JSContext, maybe call it JSAutoContextDestroyer?

+class nsDOMWorkerScriptError : public nsIClassInfo
...
+  // Lame, nsIScriptError and nsIClassInfo both have 'readonly attribute
+  // unsigned long flags' so we have to use an inner class to do this the
+  // right way...
+  class InnerScriptError : public nsIScriptError

Since nsDOMWorkerScriptError is primarily a nsIScriptError, and nsIClassInfo really is secondary and an implementation detail here, maybe you should flip the inner/outer inheritance here, make nsDOMWorkerScriptError inherit nsIScriptError, and deal with the classinfo fu in the inner class (or a separate one, whatever is easier).

- In nsDOMWorkerBase::PostMessageInternal():

+    if (GetScriptContextFromJSContext(cx)) {
+      // Must be a normal DOM context, use the pool as the source.
+      source = Pool();
+    }
+    else {
+      // Must be a worker context, get the worker from the context private.
+      nsRefPtr<nsDOMWorkerThread> worker =
+        (nsDOMWorkerThread*)JS_GetContextPrivate(cx);

The if case is true, but the else case is not. You can't assume that if it's not a normal DOM context it'll be one of yours, you'll need to check by checking the context private flags and QI:ing to nsIDOMWorkerThread or whatever. And even in the first case, even if the context isn't a DOM context it's still possible that the running JS code that got you here is from the webpage. Maybe the check in that case should be whether we're on the main thread or not, rather than looking at what type of context got us here?

- In nsDOMWorkerPool::SetErrorListener():

+    // This must exist in a DOM context, not one of our worker contexts.
+    nsIXPConnect* xpc = nsContentUtils::XPConnect();
+  
+    nsAXPCNativeCallContext* ncc;
+    nsresult rv = xpc->GetCurrentNativeCallContext(&ncc);
+    NS_ENSURE_SUCCESS(rv, rv);
+  
+    JSContext* cx;
+    rv = ncc->GetJSContext(&cx);
+    NS_ENSURE_SUCCESS(rv, rv);
+
+    nsIScriptContext* scriptObj = GetScriptContextFromJSContext(cx);
+    NS_ENSURE_TRUE(scriptObj, NS_ERROR_INVALID_ARG);

What exactly is this guarding against? Seems wrong to rely on the context being a cetrain type of context...

- In nsDOMWorkerFunctions::Dump():

Don't we want to pref whether this actually prints to the console (mmm, threadsafe prefs?), like we do for window.dump()? If not, we'll end up in the same situation where we were with window.dump() where we'd just spew crap on the console all the time... And should we include the address of the thread and/or worker in the output to make it more useful?

- In nsDOMWorkerThread::CompileGlobalObject():

+  JSObject* global = JS_NewObject(aCx, nsnull, nsnull, nsnull);
+  NS_ENSURE_TRUE(global, PR_FALSE);
+
+  PRBool success = JS_InitStandardClasses(aCx, global);
+  NS_ENSURE_TRUE(success, PR_FALSE);
...
+  mGlobal = global;
+  success = JS_AddNamedRootRT(rt, &mGlobal, "nsDOMWorkerThread Global Object");

You'll need to root global way before here, it needs to be rooted right after it's been created as there's nothing here holding it alive.

- In nsDOMWorkerTimeout::Notify():

+  if (type == nsITimer::TYPE_ONE_SHOT) {
[...]
+        rv = aTimer->InitWithCallback(this, mInterval,
+                                      nsITimer::TYPE_REPEATING_SLACK);

aTimer is the timer that's in the middle of firing when we do this. As this timer is re-initialized as a repeating slack timer, it'll get added to the timer thread, which is all nice n' good. But once we're done here and return from Notify() the timer code checks if mType is repeating slack, and if it is it calls SetDelayInternal() to adjust when to fire the timeout, and then adds it to the timer thread again. So what you'll end up with, unless I missed something obvious, is the same timer added to the timer thread twice, and that's not good I would think. I think in this case you'll probably be better off creating a new timer rather than re-initializing the one that we're in the middle of firing here.

Other than that this looks really good. It's a big patch, lots to read, but I don't see anything that looks really wrong.

r- until you've gone through the above. Make the next patch be one that applies on top of this one for easier reviewing :)
Attachment #329381 - Flags: review?(jst) → review-
(In reply to comment #17)
> - In nsDOMWorkerThread::CompileGlobalObject():
> 
> +  JSObject* global = JS_NewObject(aCx, nsnull, nsnull, nsnull);
> +  NS_ENSURE_TRUE(global, PR_FALSE);
> +
> +  PRBool success = JS_InitStandardClasses(aCx, global);
> +  NS_ENSURE_TRUE(success, PR_FALSE);

Note that if aCx's globalObject member is null, then the call to JS_InitStandardClasses will set aCx->globalObject, which is a root.
(In reply to comment #17)
> This service is in the same library, right? 

I'll fix that.

> If rv is an error code, we'll return w/o setting *_retval to null here.

Doesn't bsmedberg's static analysis enforce the XPCOM rule that you shouldn't set *_retval if you return failure?

> Since this only automatically destroys a JSContext, maybe call it
> JSAutoContextDestroyer?

Sure.

> Since nsDOMWorkerScriptError is primarily a nsIScriptError, and nsIClassInfo
> really is secondary and an implementation detail here, maybe you should flip
> the inner/outer inheritance here, make nsDOMWorkerScriptError inherit
> nsIScriptError, and deal with the classinfo fu in the inner class (or a
> separate one, whatever is easier).

I tried that first but iirc the nsIClassInfo macros didn't work properly with the inner class... I could do that and not use the macros, I guess, but this seemed easier. Given that choice do you think it is worth it to make those changes?
> You can't assume that if it's not a normal DOM context it'll be one of yours

Hm, I'm confused on this point. How is it possible (on the main thread) to be running in a context that is not a DOM context? I agree that it might be simpler to simply check to see if we're running on the main thread.

And then, on the worker threads, the only context they can ever see is one that I create (and set/clear explicitly), so I don't see why I should do extra verification there... 

> What exactly is this guarding against? Seems wrong to rely on the context being
> a cetrain type of context...

Yeah, that's more debug-ish code... I'll remove it since I no longer expose any way for another thread (or context) to set the error listener on the pool.

> Don't we want to pref whether this actually prints to the console (mmm,
> threadsafe prefs?), like we do for window.dump()?

Yeah, I had planned on it, just hadn't gotten to it (also have to worry about JS options - version, GC level, strict, etc.). I was going to do this by 1) having all these settings per-worker and set to the current state on creation, and then 2) adding a pref callback on the thread service that posted 'pref changed' runnables to worker threads. Soudn ok? Do we care about this for alpha, or is this followup material?

> And should we include the address of the thread and/or worker in the output
> to make it more useful?

Hm, I would lean against this and make the behavior identical to the existing window.dump. If consumers want to print that info they can, it's exposed in the 'thisThread' object.

> You'll need to root global way before here, it needs to be rooted right after
> it's been created as there's nothing here holding it alive.

Given what mrbkap said is it worth changing this? Or should I just add an explanatory comment?


> aTimer is the timer that's in the middle of firing when we do this. As this
> timer is re-initialized as a repeating slack timer, it'll get added to the
> timer thread, which is all nice n' good. But once we're done here and return
> from Notify() the timer code checks if mType is repeating slack, and if it is
> it calls SetDelayInternal() to adjust when to fire the timeout, and then adds
> it to the timer thread again. So what you'll end up with, unless I missed
> something obvious, is the same timer added to the timer thread twice, and
> that's not good I would think. I think in this case you'll probably be better
> off creating a new timer rather than re-initializing the one that we're in the
> middle of firing here.

Hm... Yeah. This is a bug in the timer code, don't you think? Looks like the timer code should remove the re-init'd timer if mArmed is true before adding that second time (as it does in InitCommon). I'll test this today to see if that's what's causing my mysterious slowdowns.

> Make the next patch be one that applies on top of this one for easier
> reviewing :)

You got it.
(In reply to comment #18)
> Note that if aCx's globalObject member is null, then the call to
> JS_InitStandardClasses will set aCx->globalObject, which is a root.

Ah, good point, I didn't realize aCx->globalObject was a root (that it is makes a lot of sense thoug) since in the DOM we explicitly have to root the globals. Inner/outer windows are so much fun I can hardly believe it! :)
(In reply to comment #19)
> Doesn't bsmedberg's static analysis enforce the XPCOM rule that you shouldn't
> set *_retval if you return failure?

Um, if anything it should enforce the opposite I would imagine. bsmedberg?

> I tried that first but iirc the nsIClassInfo macros didn't work properly with
> the inner class... I could do that and not use the macros, I guess, but this
> seemed easier. Given that choice do you think it is worth it to make those
> changes?

Probably not if it ends up being trickier than the alternative. 

> > You can't assume that if it's not a normal DOM context it'll be one of yours
> 
> Hm, I'm confused on this point. How is it possible (on the main thread) to be
> running in a context that is not a DOM context? I agree that it might be
> simpler to simply check to see if we're running on the main thread.

evalInContext() is one example that creates its own contexts in XPConnect and content code will run on the main thread on those contexts, and they're not DOM contexts. I believe there's other examples of the same as well, but can't think of them right now.

> And then, on the worker threads, the only context they can ever see is one that
> I create (and set/clear explicitly), so I don't see why I should do extra
> verification there... 

As long as the if() case guarantees that we're *not* running on a DOM worker, then you don't need any more checking in the else case. There's no way for other threads (non-main and non-dom-worker threads) to ever call in here, right? If it would be, then you'd need more than an is-main-thread check in the if() case as well.

> > Don't we want to pref whether this actually prints to the console (mmm,
> > threadsafe prefs?), like we do for window.dump()?
> 
> Yeah, I had planned on it, just hadn't gotten to it (also have to worry about
> JS options - version, GC level, strict, etc.). I was going to do this by 1)
> having all these settings per-worker and set to the current state on creation,
> and then 2) adding a pref callback on the thread service that posted 'pref
> changed' runnables to worker threads. Soudn ok? Do we care about this for
> alpha, or is this followup material?

I don't think this is required for alpha.

> > And should we include the address of the thread and/or worker in the output
> > to make it more useful?
> 
> Hm, I would lean against this and make the behavior identical to the existing
> window.dump. If consumers want to print that info they can, it's exposed in the
> 'thisThread' object.

But the thisThread property is an nsIDOMWorkerThread, which exposes nothing unique about itself. IOW, if you toString() it you'll only get [object DOMWorkerThread] or something, same thing on all threads (unless you're running this in a debug build which will include the address of the object in the string, but most people won't be running in debug builds). I don't know if this is critical to actually worry about, I guess scripts can always make up their own unique names and include that in the dump output if it's important... Unless of course it's important to track whether the JS executes on the same thread repeatedly or is bounced from thread to thread. I guess that only really matters to us, and we can figure that out in other ways if there's ever a need.

> > You'll need to root global way before here, it needs to be rooted right after
> > it's been created as there's nothing here holding it alive.
> 
> Given what mrbkap said is it worth changing this? Or should I just add an
> explanatory comment?

Nope, no change needed. Comment would be good.

[...]
> Hm... Yeah. This is a bug in the timer code, don't you think? Looks like the

Could be, not sure if this was ever intended use of timers, changing one from a one-shot timer to a repeating timer while it's firing. I could go either way there.

> timer code should remove the re-init'd timer if mArmed is true before adding
> that second time (as it does in InitCommon). I'll test this today to see if
> that's what's causing my mysterious slowdowns.

It could, but wouldn't that penalize us every time we fire a repeating slack timeout? I.e. we'd always need to attempt to remove the timeout before adding it, or something, as we won't easily know ahead of time whether the timer has already been added etc?
s/timeout/timer/g :)
The "rules" of MS-COM state that when a method returns a failure code, it shouldn't touch the outparam at all. In many cases we set it to null, which is currently considered a minor warning in the outparams analysis. If you set it to a non-null param and return a failure code, that's a major warning and will become an error once the analysis and annotations are better.
Attached patch Patch, v1.2Splinter Review
This patch applies on top of attachment 329381 [details] [diff] [review] although interdiff gave me a little trouble... Should address those comments above.
Attachment #330617 - Flags: review?(jst)
Oh, yeah, I filed bug 445828 on the timer re-init problem and... get this... it completely solves the slowdown problem. Yay!
Comment on attachment 330617 [details] [diff] [review]
Patch, v1.2

Looks good!
Attachment #330617 - Flags: superreview+
Attachment #330617 - Flags: review?(jst)
Attachment #330617 - Flags: review+
Attached patch Patch, v1.3Splinter Review
This applies on top of attachment 330617 [details] [diff] [review] and fixes one hazard where a worker could see its last release on a background thread (which isn't supposed to happen).
Attachment #333289 - Flags: review?(jst)
Attachment #333289 - Attachment is patch: true
Attachment #333289 - Attachment mime type: application/octet-stream → text/plain
Attachment #333289 - Flags: superreview+
Attachment #333289 - Flags: review?(jst)
Attachment #333289 - Flags: review+
Attachment #330981 - Flags: superreview+
Attachment #330981 - Flags: review?(jst)
Attachment #330981 - Flags: review+
Attachment #333291 - Flags: superreview+
Attachment #333291 - Flags: review?(jst)
Attachment #333291 - Flags: review+
Note the new tracker for worker threads features, bug 450448.
Alias: workerthreads
Alias: workerthreads
Pushed changeset 1ebacbd09ad9 to mozilla-central.

Also had to fix an uninitialized nsresult bug that caused failing test boxes, changeset c1e9838bb4f4.

FIXED.
Status: ASSIGNED → RESOLVED
Closed: 16 years ago
Resolution: --- → FIXED
bent: could you reformat the output of test_threadErrors.html slightly so that the normal output doesn't get picked up by the tinderbox's error parser?

http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1219140344.1219143241.13982.gz

The regex it uses is here:
http://mxr.mozilla.org/mozilla/source/webtools/tinderbox/ep_unittest.pl#31
I notice that this implementation utilizes the concept of pools, which aren't defined in the WHATWG Web Worker spec. Is there a follow-up bug that's working to match the API? or is the spec out of date?
When I run mochitest locally, workerthread tests take occasionally lots of time 
(10+ mins or so). I tried to attach gdb to see what is happening but so far I 
couldn't get any reasonable stack. The problem seems to happen only when I run 
all the mochitests, not only workerthread tests. If I run only workerthread tests,
they all pass in less than 2 minutes.
This is on 32 bit Linux.
I think the problem has something to do with gczeal.
(In reply to comment #32)
> bent: could you reformat the output of test_threadErrors.html

I'm looking into it, but I'm not manually outputting those messages (that's the JS engine throwing the errors to the console), so I'm not sure how quick the fix will be.
(In reply to comment #33)
Yeah, we're still working on getting some spec stabilized, but until then we're just going to go with what we have. I'll file bugs just as soon as we know what changes we need to make.
(In reply to comment #34)
> When I run mochitest locally, workerthread tests take occasionally lots of time 

(In reply to comment #35)
> I think the problem has something to do with gczeal.
> 

Indeed. I'm a little paranoid about GC causing random problems since I'm passing weak refs across threads so I have gczeal turned up for all the workerthreads tests. Think this is a problem?

My guess is that using gczeal here will make things run very slowly, especially for threaded tests, which I'm guessing will probably end up running effectively serially as a result of using the feature.  That may or may not be a bad thing, but you might instead want to run much shorter tests with gczeal tuned up high, than you would otherwise.
Tested without gczeal==2, and now I can run mochitest in reasonable time.
WorkerThread tests don't take too much time.
So would be great to fix the tests somehow. Maybe shorter tests with gczeal==2, like crowder suggested.
Attachment #334540 - Flags: review? → review?(Olli.Pettay)
Now with comments
Attachment #334540 - Attachment is obsolete: true
Attachment #334541 - Flags: review?(Olli.Pettay)
Attachment #334540 - Flags: review?(Olli.Pettay)
Comment on attachment 334541 [details] [diff] [review]
Don't set gczeal in tests on debug builds

Just add a #ifdef DEBUG pref, rather than relying on a pref which is also #ifdef DEBUGged 
http://mxr.mozilla.org/mozilla-central/source/browser/app/profile/firefox.js#357
Attachment #334541 - Flags: review?(Olli.Pettay) → review-
Depends on: 459503
Depends on: 468045
Depends on: 468065
Depends on: 483633
Component: DOM: Mozilla Extensions → DOM
Component: DOM → DOM: Core & HTML
You need to log in before you can comment on or make changes to this bug.