Closed Bug 919638 Opened 11 years ago Closed 10 years ago

PJS: New WorkStealing ThreadPool

Categories

(Core :: JavaScript Engine, defect)
Severity: normal

Tracking

Status: RESOLVED FIXED
Target Milestone: mozilla29

People

(Reporter: dbonetta, Assigned: shu)

Attachments

(6 files, 14 obsolete files)

68.29 KB, patch (dbonetta: review+)
10.58 KB, patch
81.61 KB, patch
5.28 KB, patch (nmatsakis: review+)
95.51 KB, patch (nmatsakis: review+)
7.31 KB, patch (pnkfelix: review+)

New ThreadPool for PJS using workstealing.
If you are writing a new work stealing implementation, this paper describes an interesting approach:

"Non-blocking steal-half work queues"
http://www.cs.bgu.ac.il/~hendlerd/papers/p280-hendler.pdf

This paper presents "StealHalf", a new generalization of work stealing algorithms that allows workers, instead of stealing one, to steal up to half of the items in a given work queue at a time. Workers don't need to steal as often and they are more likely to find a non-empty queue when they randomly pick one because items are distributed among more queues.
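
For illustration, a minimal sketch of the steal-half operation (generic C++, not the PJS code; the paper's queues are non-blocking, a mutex stands in here to keep the example short):

    #include <deque>
    #include <mutex>
    #include <vector>

    struct WorkQueue {
        std::mutex lock;
        std::deque<int> items;              // `int` stands in for a real task type

        // A thief takes up to half of the victim's items in one operation,
        // so work stays spread across queues and steals happen less often.
        std::vector<int> stealHalf() {
            std::lock_guard<std::mutex> guard(lock);
            size_t half = (items.size() + 1) / 2;
            std::vector<int> stolen(items.begin(), items.begin() + half);
            items.erase(items.begin(), items.begin() + half);
            return stolen;
        }
    };
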
(In reply to Chris Peterson (:cpeterson) from comment #1)
> If you are writing a new work stealing implementation, this paper describes
> an interesting approach:
> 
> "Non-blocking steal-half work queues"
> http://www.cs.bgu.ac.il/~hendlerd/papers/p280-hendler.pdf
> 
> This paper presents "StealHalf", a new generalization of work stealing
> algorithms that allows workers, instead of stealing one, to steal up to half
> of the items in a given work queue at a time. Workers don't need to steal as
> often and they are more likely to find a non-empty queue when they randomly
> pick one because items are distributed among more queues.

Thank you, Chris. Yes, I am working on a workstealing thread pool for PJS. The current prototype I developed does not implement steal-half, but that's definitely an interesting optimization.
Attached patch bug-919638-fix.patch (obsolete) — Splinter Review
Lock-based workstealing algorithm.
Attachment #810500 - Flags: review?(shu)
Comment on attachment 810500 [details] [diff] [review]
bug-919638-fix.patch

Review of attachment 810500 [details] [diff] [review]:
-----------------------------------------------------------------

Looks good overall! I'd like to see the following addressed:

- An overview comment near the beginning of ThreadPool.h explaining the distinction between workers, slices, jobs, etc, and make sure the uses of variable names like taskId, workerId, etc, are consistent
- Use Atomic<T> instead of the ATOMIC_ macros
- Is it possible to not duplicate a bunch of logic for the main thread? Why can't we, say, give the main thread id 0 and have it be the first thing in workers_?

Plus, the style nits that need to be fixed. I didn't list all of them, just the first few instances of each kind that I saw.

::: js/src/jit-test/tests/parallel/Array-scanPar-sum.js
@@ +1,4 @@
>  load(libdir + "parallelarray-helpers.js");
>  function sum(a, b) { return a+b; }
>  if (getBuildConfiguration().parallelJS)
> +  testArrayScanPar(range(1, 8 * 1024), sum);

Where did 8 come from?

::: js/src/vm/ForkJoin.cpp
@@ +368,5 @@
>      // Invoked from parallel worker threads:
> +    void executeFromWorker(uint32_t sliceId, uintptr_t stackLimit);
> +
> +    // Invoked only from the main thread:
> +    void executeFromMain(uint32_t sliceId);

Why change the name from |executionFromMainThread|?

@@ +1373,3 @@
>      {
>          AutoUnlockMonitor unlock(*this);
> +        // push parallel tasks and wait until they're all done

Nit: Capitalize

::: js/src/vm/ThreadPool.cpp
@@ +31,5 @@
> +    uint32_t id_;
> +    ParallelJob *parallelJob_;
> +
> +  public:
> +    void Init(int id, ParallelJob *job) {

Nit: don't capitalize Init

@@ +62,5 @@
> +    ThreadPoolBaseWorker(size_t workerId, ThreadPool *pool) 
> +      : workerId_(workerId), 
> +        pool_(pool), 
> +        worklist_() 
> +    { }

Nit: trailing whitespace in the above 4 lines

@@ +100,5 @@
>      void run();
>  
> +    // To steal work from other workers (or from the main thread)
> +    uint32_t steal(uintptr_t stackLimit);
> +    void stealFromMain(uintptr_t stackLimit);

Maybe Main -> MainThread?

@@ +140,5 @@
> +    virtual bool submitJobs(ParallelTask *jobs, size_t from, size_t to);
> +};
> +
> +
> +ThreadPoolWorker::ThreadPoolWorker(size_t workerId, ThreadPool *pool) 

Trailing whitespace

@@ +188,5 @@
> +    ParallelTask *task = NULL;
> +    {
> +        // Lock so that nobody can add/remove tasks
> +        AutoLockMonitor lock(*(pool_->main_worker_));
> +        if(!pool_->main_worker_->worklist_.empty()) {

Nit: space after if, and ditto for the other cases in the patch.

@@ +196,5 @@
> +
> +    if(task != NULL) {
> +        task->execute(workerId_, stackLimit);
> +        JS_ATOMIC_DECREMENT(&(pool_->pending_jobs_));
> +#ifdef DEBUG        

Trailing whitespace

@@ +231,5 @@
> +
> +    if(pool_->numWorkers_ == 0)
> +        return 0;
> +
> +    int32_t victimId = rand() % (pool_->numWorkers_) -1 ;

Nit: avoid superfluous parentheses.

@@ +232,5 @@
> +    if(pool_->numWorkers_ == 0)
> +        return 0;
> +
> +    int32_t victimId = rand() % (pool_->numWorkers_) -1 ;
> +    while(victimId == this->workerId_) {

Nit: no space after while, and no { } for one-line bodies.

@@ +272,5 @@
>          if (state_ == TERMINATING)
>              break;
>  
> +        if(pool_->workStealing_) {
> +            // Allow others to steal from us while running

This comment doesn't seem right. Aren't the following lines stealing from other threads?

@@ +293,4 @@
>  {
>      AutoLockMonitor lock(*this);
>      JS_ASSERT(state_ == ACTIVE);
> +    for(size_t i=from; i<to; i++) {

Nit: for (size_t i = from; i < to; i++) and similar cases below.

@@ +353,5 @@
> +        }
> +    }
> +
> +    if(pool_->workStealing_) {
> +        // Allow others to steal from us while running

Ditto on the seemingly inaccurate comment.

@@ +408,2 @@
>  {
> +    main_worker_ = js_new<ThreadPoolMainWorker>(this);

Shouldn't be using js_new here for malloc accounting reasons. Use rt->new_ instead.

@@ +430,2 @@
>  # ifdef DEBUG
>      if (char *jsthreads = getenv("JS_THREADPOOL_SIZE"))

Change this to JS_THREADPOOL_WORKERS for clarity.

@@ +433,5 @@
> +    if (char *jsthreads = getenv("JS_THREADPOOL_SLICES"))
> +        numSlices_ = strtol(jsthreads, NULL, 10);
> +    if (char *jsthreads = getenv("JS_THREADPOOL_STEAL")) {
> +        size_t steal = strtol(jsthreads, NULL, 10);
> +        workStealing_ = (steal==1);

Nit: spaces around ==

@@ +468,5 @@
>      // Note that numWorkers_ is the number of *desired* workers,
>      // but workers_.length() is the number of *successfully
>      // initialized* workers.
>      for (size_t workerId = 0; workerId < numWorkers(); workerId++) {
> +        ThreadPoolWorker *worker = js_new<ThreadPoolWorker>(workerId, this);

Use cx->new_ here.

@@ +525,5 @@
> +    size_t jid = jobsPerThread;
> +    size_t tid = 0;
> +
> +    while(jid < numSlices_) {
> +        size_t boundary = (jid+jobsPerThread > numSlices_)? numSlices_ :jid+jobsPerThread ;

Nit: Odd style here, please use consistent spaces around operators.

@@ +559,5 @@
> +
> +    pending_jobs_ = 0;
> +    stolen_jobs_ = 0;
> +
> +    if ( ! submitJobsMain(cx, jobs, jobsPerThread))

Nit: if (!submitJobsMain(cx, jobs, jobsPerThread))

::: js/src/vm/ThreadPool.h
@@ +73,5 @@
>      JSRuntime *const runtime_;
>  #endif
>      js::Vector<ThreadPoolWorker*, 8, SystemAllocPolicy> workers_;
>  
> +    ThreadPoolMainWorker *main_worker_;

Nit: camelCase

@@ +79,4 @@
>      // Number of workers we will start, when we actually start them
>      size_t numWorkers_;
>  
> +    // Total number of slices in the current job. See ForkJoin.h

I don't see any comments about slices in ForkJoin.h. Could you summarize the distinction and give a glossary of workers, slices, jobs, etc, either near the top of this file or the top of ForkJoin.h?

@@ +85,5 @@
> +    // Flag to enable workStealing support
> +    bool workStealing_;
> +
> +    uint32_t stolen_jobs_;          // Number of stolen tasks in the last parallel job
> +    uint32_t pending_jobs_;         // Number of pending jobs on the current job

Instead of using the ATOMIC_ macros, we should use the template Atomic<T> found in mfbt/Atomics.h

Also, camelCase
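
For illustration, a sketch of what the suggested mfbt counters could look like (field and method names here are assumptions, not the patch):

    #include "mozilla/Atomics.h"

    struct SliceCounters {
        // Atomic<T> from mfbt/Atomics.h replaces the JS_ATOMIC_* macros;
        // the overloaded operators compile to atomic read-modify-write ops.
        mozilla::Atomic<uint32_t> pendingSlices;   // slices not yet executed
        mozilla::Atomic<uint32_t> stolenSlices;    // slices stolen in the last job

        void onSubmit(uint32_t n) { pendingSlices += n; }
        void onSliceDone() { pendingSlices--; }    // was JS_ATOMIC_DECREMENT
        void onSteal() { stolenSlices++; }
    };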

@@ +105,5 @@
>  
> +    // Return number of slices in the current parallel job (if present).
> +    size_t numSlices() { return numSlices_; }
> +
> +    bool usingWorkstealing() { return workStealing_; }

WorkStealing
Attachment #810500 - Flags: review?(shu)
Attached patch bug-919638-fix.patch (obsolete) — Splinter Review
Attachment #810500 - Attachment is obsolete: true
Attachment #812300 - Flags: review?(shu)
Comment on attachment 812300 [details] [diff] [review]
bug-919638-fix.patch

Review of attachment 812300 [details] [diff] [review]:
-----------------------------------------------------------------

Looks much better! r=me with the questions below addressed.

You've already run the tests locally, so the next step is to push to the try server. Do you have any level of commit access? If not, I could push it to try for you after the fixes.

::: js/src/vm/ForkJoin.cpp
@@ +457,2 @@
>      ForkJoinMode mode = ForkJoinModeNormal;
>      if (args.length() > 1) {

If you're asserting above that |args.length() == 3|, no need to check args.length() > 1 and > 2. Should move all the argument asserts up to the top of the function.

@@ +1275,5 @@
>      }
>  
>      bool invoke(PerThreadData *perThread) {
>          RootedValue result(perThread);
> +        enter_(jitcode_, argc_ + 1, argv_ + 1, nullptr, calleeToken_, nullptr, 0, result.address());

Style nit: the max width is 100 characters.

@@ +1377,4 @@
>      {
>          AutoUnlockMonitor unlock(*this);
> +        // Push parallel tasks and wait until they're all done.
> +        jobResult = threadPool_->executeJobWithMain(cx_, this, numSlices_);

Why do we not need to check |jobResult != TP_SUCCESS| here anymore?

@@ +1795,5 @@
> +        uint32_t slices = workers * 8;
> +        if (chunks < slices)
> +            return workers;
> +        else
> +            return slices;

Style nit: no else after return. Do

if (c)
    return x;
return y;

::: js/src/vm/ForkJoin.h
@@ +32,5 @@
>  // running in parallel.  Each copy will then do 1/Nth of the total
> +// work. When |N| is big enough the pool will perform load balancing
> +// through workstealing. When |N| is not specified, |N| is the number
> +// of workers in the threadpool---by default, this is the number of
> +// cores on the computer).

This comment is contrary to the intrinsic function definition, where you assert the number of arguments == 3. Which one is intended?

::: js/src/vm/ThreadPool.cpp
@@ +64,5 @@
> +      : workerId_(workerId),
> +        pool_(pool),
> +        worklist_()
> +    { }
> +    virtual ~ThreadPoolBaseWorker() { }

Is this used here or in any of the subclasses?

@@ +73,5 @@
> +
> +    // Submit work to be executed. If this returns true, you are guaranteed
> +    // that the slices will execute before the thread-pool terminates (barring
> +    // an infinite loop in some prior slices).
> +    virtual bool submitSlices(ParallelSliceTask *slices, size_t from, size_t to) = 0;

Does this need to be virtual? Given that |submitSlicesMain| and |submitSlicesWorkers| are already separate paths and that |mainWorker_| is already of the derived type.

@@ +377,5 @@
> +        // Lock so that nobody can add/remove slices
> +        AutoLockMonitor lock(*victim);
> +        if (!victim->worklist_.empty()) {
> +            slice = victim->worklist_.popCopy();
> +        }

Style nit: no braces around single-line bodies.

@@ +523,5 @@
>      if (!lazyStartWorkers(cx))
>          return false;
>  
> +    size_t jid = initialOffset;
> +    size_t tid = 0;

Could you comment what these ids track?
Attachment #812300 - Flags: review?(shu) → review+
Attached patch bug-919638-fix.patch (obsolete) — Splinter Review
Attachment #812300 - Attachment is obsolete: true
Attachment #814641 - Flags: review?(shu)
Comment on attachment 814641 [details] [diff] [review]
bug-919638-fix.patch

Review of attachment 814641 [details] [diff] [review]:
-----------------------------------------------------------------

r=me with nits fixed. No need to request re-review.

I'd like to know why we don't need to check for jobResult != TP_SUCCESS in comment 2 below though. You could just explain on IRC.

::: js/src/vm/ForkJoin.cpp
@@ +454,3 @@
>      JS_ASSERT(args[0].toObject().is<JSFunction>());
> +    JS_ASSERT(args[1].isInt32());
> +    JS_ASSERT(args[1].toInt32() < NumForkJoinModes);    

Nit: trailing whitespace

@@ +1372,4 @@
>      {
>          AutoUnlockMonitor unlock(*this);
> +        // Push parallel tasks and wait until they're all done.
> +        jobResult = threadPool_->executeJobWithMain(cx_, this, numSlices_);

I still don't understand why we don't need to check |jobResult != TP_SUCCESS| here.

::: js/src/vm/ForkJoin.h
@@ +32,4 @@
>  // running in parallel.  Each copy will then do 1/Nth of the total
> +// work. When |N| is big enough the pool will perform load balancing
> +// through workstealing. Otherwise, |N| is the number of threads in
> +// the computer.

This seems to say that if you give an N < # of threads, N = # of threads. Is that actually the case?

::: js/src/vm/ThreadPool.cpp
@@ +109,5 @@
>      // termination completes.
>      void terminate();
> +
> +    // Submit work to be executed. If this returns true, you are guaranteed
> +    // that the slices will execute before the thread-pool terminates by 

Nit: trailing whitespace

@@ +522,5 @@
>          return false;
>  
> +    // Assign each worker thread a slice, round-robin.
> +    size_t sliceId = initialOffset;
> +    size_t workerIid = 0;

Typo for workerId?
Attachment #814641 - Flags: review?(shu) → review+
Attached patch bug-919638-fix.patch (obsolete) — Splinter Review
Attachment #814641 - Attachment is obsolete: true
Blocks: 925828
Attached patch bug-919638-fix.patch (obsolete) — Splinter Review
Attachment #815965 - Attachment is obsolete: true
Attachment #816005 - Attachment is obsolete: true
Attachment #816038 - Flags: review+
Partial rewrite
Attachment #8349224 - Flags: review?(nmatsakis)
Assignee: general → shu
Status: NEW → ASSIGNED
Make bailouts less stupid in the workstealing scheduler
Attachment #8349225 - Flags: review?(nmatsakis)
Some more fixes, remove debug printfs
Attachment #8349857 - Flags: review?(nmatsakis)
Attachment #8349224 - Attachment is obsolete: true
Attachment #8349224 - Flags: review?(nmatsakis)
Remove debug printfs
Attachment #8349858 - Flags: review?(nmatsakis)
Attachment #8349225 - Attachment is obsolete: true
Attachment #8349225 - Flags: review?(nmatsakis)
Comment on attachment 8349857 [details] [diff] [review]
Part 1: Implement a work stealing thread pool for PJS. (

Review of attachment 8349857 [details] [diff] [review]:
-----------------------------------------------------------------

This looks good.

::: js/src/vm/ThreadPool.cpp
@@ +59,5 @@
> +    Vector<ThreadPoolSlice, 8, SystemAllocPolicy> worklist_;
> +
> +    bool stealFrom(ThreadPoolBaseWorker *victim, ThreadPoolSlice *slice);
> +
> +    void popSlice(ThreadPoolSlice *slice) {

Assert that lock is held on entry

@@ +275,5 @@
>      AutoLockMonitor lock(*this);
>      JS_ASSERT(state_ == ACTIVE);
> +
> +    for (unsigned i = from; i < to; i++) {
> +        if (!worklist_.append(ThreadPoolSlice(job, i)))

In my rewrite of threadpool, I did this differently. I had the threadpool itself be "assigned" to one job at a time, and then rather than storing a queue of ThreadPoolSlices (which takes O(n) memory and must be allocated etc), each worker thread just stored a minSlice and maxSlice. Whenever the two were unequal, there was work to do. To do a local job, the worker thread could increment minSlice. To steal, it would decrement maxSlice. This can be easily done with locks or by compare-and-swapping a single 64-bit value that encodes both min/max slice together. What do you think of this? It seems better to me, maybe we can open up a follow-up bug to implement this scheme.

(In that design, the thread pool was also a singleton that lived apart from a runtime, so part of "locking" it for a job was assigning it to a runtime. That made it possible to share the threadpool between workers.)

@@ +487,5 @@
> +    // Assign each worker thread a slice, round-robin.
> +    unsigned sliceId = slicesPerWorker;
> +    unsigned workerId = 0;
> +
> +    while (sliceId < numSlices) {

Since we do work-stealing, it probably doesn't matter, but I don't think that this is the most equal distribution. In particular, if the number of slices doesn't divide up evenly, the final worker winds up with more. If we wanted to start out with an equal distribution, the better algorithm is to compute EXTRA = NUM_SLICES % NUM_WORKERS and then distribute slicesPerWorker+1 into the first EXTRA workers, and slicesPerWorker into the remainder.

@@ +513,5 @@
> +#ifdef DEBUG
> +    stolenSlices_ = 0;
> +#endif
> +
> +    unsigned slicesPerThread = (unsigned) (numSlices / (numWorkers() + 1));

Nit: use name slicesPerWorker as above (or change the code above, I don't care)

::: js/src/vm/ThreadPool.h
@@ +37,4 @@
>  };
>  
>  // ThreadPool used for parallel JavaScript execution as well as
>  // parallel compilation.  Unless you are building a new kind of

We should update this comment. We don't use the thread pool for compilation -- never got around to it -- and I've come to think it's a bad idea. Eventually I'd like the par workers to be able to kick off compilation as they go and have it run asynchronously.

@@ +61,5 @@
> +// used by the pool as a worker thread.
> +//
> +// The ThreadPool supports work stealing. Every time a worker
> +// completes all the slices in its local queue, it tries to acquire
> +// some work form other workers (including the main thread).

s/form/from

@@ +73,5 @@
> +// bunch of slices to be done at the very beginning of the job, and
> +// have to wait until all the threads have joined back. During phase
> +// (1) there is no synchronization overhead between workers
> +// introduced by the stealing algorithm, and therefore the execution
> +// overhead introduced is almost nullptr with balanced workloads. The way

s/nullptr/zero/

@@ +121,5 @@
> +    unsigned stolenSlices() { return stolenSlices_; }
> +#endif
> +
> +    // Submit slices for parallel execution in the pool.
> +    bool submitSlices(JSContext *cx, ParallelJob *job, unsigned slicesPerWorker, unsigned numSlices);

This API is only used internally. Let's make it private I think (or perhaps just remove it and fold it into executeJob())
Attachment #8349857 - Flags: review?(nmatsakis) → review+
(In reply to Niko Matsakis [:nmatsakis] from comment #16)
> Comment on attachment 8349857 [details] [diff] [review]
> Part 1: Implement a work stealing thread pool for PJS. (
> 
> Review of attachment 8349857 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> This looks good.
> 
> ::: js/src/vm/ThreadPool.cpp
> @@ +59,5 @@
> > +    Vector<ThreadPoolSlice, 8, SystemAllocPolicy> worklist_;
> > +
> > +    bool stealFrom(ThreadPoolBaseWorker *victim, ThreadPoolSlice *slice);
> > +
> > +    void popSlice(ThreadPoolSlice *slice) {
> 
> Assert that lock is held on entry
> 

How do I actually do this? A |#ifdef DEBUG|-only bool field seems clunky.

> @@ +275,5 @@
> >      AutoLockMonitor lock(*this);
> >      JS_ASSERT(state_ == ACTIVE);
> > +
> > +    for (unsigned i = from; i < to; i++) {
> > +        if (!worklist_.append(ThreadPoolSlice(job, i)))
> 
> In my rewrite of threadpool, I did this differently. I had the threadpool
> itself be "assigned" to one job at a time, and then rather than storing a
> queue of ThreadPoolSlices (which takes O(n) memory and must be allocated
> etc), each worker thread just stored a minSlice and maxSlice. Whenever the
> two were unequal, there was work to do. To do a local job, the worker thread
> could increment minSlice. To steal, it would decrement maxSlice. This can be
> easily done with locks or by compare-and-swapping a single 64-bit value that
> encodes both min/max slice together. What do you think of this? It seems
> better to me, maybe we can open up a follow-up bug to implement this scheme.
> 
> (In that design, the thread pool was also a singleton that lived apart from
> a runtime, so part of "locking" it for a job was assigning it to a runtime.
> That made it possible to share the threadpool between workers.)
> 

Daniele's way is more flexible, if we ever want to reuse this for task parallelism with work stealing. But I'll implement this optimization in this fashion: each worker gets assigned one job and some # of slices; stealing another worker's slice executes the victim worker's job.

> @@ +487,5 @@
> > +    // Assign each worker thread a slice, round-robin.
> > +    unsigned sliceId = slicesPerWorker;
> > +    unsigned workerId = 0;
> > +
> > +    while (sliceId < numSlices) {
> 
> Since we do work-stealing, it probably doesn't matter, but I don't think
> that this is the most equal distribution. In particular, if the number of
> slices doesn't divide up evenly, the final worker winds up with more. If we
> wanted to start out with an equal distribution, the better algorithm is to
> compute EXTRA = NUM_SLICES % NUM_WORKERS and then distribute
> slicesPerWorker+1 into the first EXTRA workers, and slicesPerWorker into the
> remainder.
> 

Good idea, I'll change this.
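
For reference, a sketch of that split (illustrative names, not the patch): the first |extra| workers get one additional slice, so no worker ends up with more than one slice over its share.

    unsigned slicesPerWorker = numSlices / numWorkers;
    unsigned extra = numSlices % numWorkers;
    unsigned sliceFrom = 0;
    for (unsigned workerId = 0; workerId < numWorkers; workerId++) {
        unsigned count = slicesPerWorker + (workerId < extra ? 1 : 0);
        workers[workerId]->submitSlices(sliceFrom, sliceFrom + count);
        sliceFrom += count;
    }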

> @@ +513,5 @@
> > +#ifdef DEBUG
> > +    stolenSlices_ = 0;
> > +#endif
> > +
> > +    unsigned slicesPerThread = (unsigned) (numSlices / (numWorkers() + 1));
> 
> Nit: use name slicesPerWorker as above (or change the code above, I don't
> care)
> 
> ::: js/src/vm/ThreadPool.h
> @@ +37,4 @@
> >  };
> >  
> >  // ThreadPool used for parallel JavaScript execution as well as
> >  // parallel compilation.  Unless you are building a new kind of
> 
> We should update this comment. We don't use the thread pool for compilation
> -- never got around to it -- and I've come to think it's a bad idea.
> Eventually I'd like the par workers to be able to kick off compilation as
> they go and have it run asynchronously.
> 
> @@ +61,5 @@
> > +// used by the pool as a worker thread.
> > +//
> > +// The ThreadPool supports work stealing. Every time a worker
> > +// completes all the slices in its local queue, it tries to acquire
> > +// some work form other workers (including the main thread).
> 
> s/form/from
> 
> @@ +73,5 @@
> > +// bunch of slices to be done at the very beginning of the job, and
> > +// have to wait until all the threads have joined back. During phase
> > +// (1) there is no synchronization overhead between workers
> > +// introduced by the stealing algorithm, and therefore the execution
> > +// overhead introduced is almost nullptr with balanced workloads. The way
> 
> s/nullptr/zero/
> 
> @@ +121,5 @@
> > +    unsigned stolenSlices() { return stolenSlices_; }
> > +#endif
> > +
> > +    // Submit slices for parallel execution in the pool.
> > +    bool submitSlices(JSContext *cx, ParallelJob *job, unsigned slicesPerWorker, unsigned numSlices);
> 
> This API is only used internally. Let's make it private I think (or perhaps
> just remove it and fold it into executeJob())

I'll just hand inline it, yeah.
Applied comments.
Attachment #8349857 - Attachment is obsolete: true
Rebased against v2.
Attachment #8350441 - Flags: review?(nmatsakis)
Attachment #8349858 - Attachment is obsolete: true
Attachment #8349858 - Flags: review?(nmatsakis)
Comment on attachment 8350440 [details] [diff] [review]
Part 1: Implement a work stealing thread pool for PJS. v2

Review of attachment 8350440 [details] [diff] [review]:
-----------------------------------------------------------------

Carrying r+ from nmatsakis
Attachment #8350440 - Flags: review+
(In reply to Shu-yu Guo [:shu] from comment #17)
> How do I actually do this? A |#ifdef DEBUG|-only bool field seems clunky.

PR_ASSERT_LOCK_HELD_BY_CURRENT_THREAD or something like that

> Daniele's way is more flexible, if we ever want to reuse this for task
> parallelism with work stealing. But I'll implement this optimization in this
> fashion: each worker gets assigned one job and some # of slices, stealing
> another worker's slice executes the victim worker's job.

Ok. This sounds like a good compromise. I considered the loss of flexibility; I'm just not concerned about specializing for the API we have today and letting tomorrow work itself out. The precise shape of a task API is kind of unclear to me just now. But with the approach you suggested, it seems that it would scale to "nested parallelism" in a relatively straightforward way: you could "push" a new job (with a new set of slice indices) onto the current worker.
Was fairly easy to make the workstealing lock free.
Attachment #8350846 - Flags: review?(nmatsakis)
Rebased against lock free part 1.
Attachment #8350847 - Flags: review?(nmatsakis)
Attachment #8350441 - Attachment is obsolete: true
Attachment #8350441 - Flags: review?(nmatsakis)
Niko and Felix,

About the CAS implementation: we don't have 64bit CAS on all our architectures (WinXP 32bit for one, I don't know about ARM). We could shrink slice IDs to 16bit. 65535 slices per thread seems plenty to me for load balancing purposes. Thoughts?
It's not 65535 slices per thread, right? Just overall? Still, probably fine.
(In reply to Niko Matsakis [:nmatsakis] from comment #25)
> It's not 65535 slices per thread, right? Just overall? Still, probably fine.

Oops, yes you're right.
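
For illustration, a sketch of the packed-bounds scheme being discussed (16-bit slice indices packed into one 32-bit atomic; names are assumptions, not necessarily what the rolled-up patch does). Taking local work bumps the low half and stealing drops the high half, each with a single compareExchange:

    #include <stdint.h>
    #include "mozilla/Atomics.h"

    static uint32_t ComposeBounds(uint16_t from, uint16_t to) {
        return (uint32_t(to) << 16) | from;
    }
    static void DecomposeBounds(uint32_t bounds, uint16_t *from, uint16_t *to) {
        *from = uint16_t(bounds & 0xffff);
        *to = uint16_t(bounds >> 16);
    }

    struct SliceBounds {
        mozilla::Atomic<uint32_t> bounds_;

        // Owner thread: take the next local slice, if any.
        bool popFront(uint16_t *sliceId) {
            for (;;) {
                uint32_t old = bounds_;
                uint16_t from, to;
                DecomposeBounds(old, &from, &to);
                if (from == to)
                    return false;                   // no local work left
                if (bounds_.compareExchange(old, ComposeBounds(from + 1, to))) {
                    *sliceId = from;
                    return true;
                }
            }
        }

        // Thief thread: take the last slice from the victim, if any.
        bool popBack(uint16_t *sliceId) {
            for (;;) {
                uint32_t old = bounds_;
                uint16_t from, to;
                DecomposeBounds(old, &from, &to);
                if (from == to)
                    return false;
                if (bounds_.compareExchange(old, ComposeBounds(from, to - 1))) {
                    *sliceId = to - 1;
                    return true;
                }
            }
        }
    };
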
Rolled up lock free impl.
Attachment #8350440 - Attachment is obsolete: true
Attachment #8350846 - Attachment is obsolete: true
Attachment #8350846 - Flags: review?(nmatsakis)
Attachment #8355292 - Flags: review?(nmatsakis)
Attached patch Roll up v2 (obsolete) — Splinter Review
Rolled up parts 1 and 2 since after removing rendezvous code and fixing aborts, I could no longer (with moderate effort) cleanly separate the 2 patches.

Some notes on this version:

 - Rendezvous code removed

 - The fork-join synchronization mechanism is now moved to ThreadPool.

   Since this is moved to ThreadPool, submitting jobs to workers no longer requires locking, as distributing the work at the beginning of a fork-join means all the workers are idle. I moved the pendingSlices_ arithmetic back into the Worker classes, largely because of symmetry with aborts, see below.

 - Aborting is somewhat tricky. Aborting is really a per-job thing: once a job fails, we need to clear that job from all workers. Only the workers know how many slices are still left pending, so the pendingSlices_ arithmetic can only be done in abortJob(). For symmetry, submitJob() also does the pendingSlices_ arithmetic.

   Moreover, aborting a job doesn't really mean we should pause the worker, since the thread pool might contain other jobs in the current workload that are still unfinished (thinking ahead to actual nested parallelism here).

Not obsoleting the old patches in case you want to refer back to them.

Ping me on IRC if you have any questions.
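
For illustration, a rough sketch of the submit/abort symmetry described above, reusing the ComposeBounds/DecomposeBounds helpers from the earlier sketch (names are assumptions, not the patch):

    // The worker that was handed a range of slices is the one that adjusts
    // the pool-wide pending count, both on submission and on abort.
    void ThreadPoolWorker::submitJob(ParallelJob *job, uint16_t from, uint16_t to) {
        job_ = job;
        bounds_ = ComposeBounds(from, to);
        pool_->pendingSlices_ += to - from;       // add this worker's share
    }

    void ThreadPoolWorker::abortJob() {
        // Swapping out bounds_ atomically tells us exactly how many of this
        // worker's slices were still unclaimed, so the subtraction happens here.
        uint32_t old = bounds_.exchange(ComposeBounds(0, 0));
        uint16_t from, to;
        DecomposeBounds(old, &from, &to);
        pool_->pendingSlices_ -= to - from;       // remove exactly the leftovers
    }
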
Attachment #8355431 - Flags: review?(nmatsakis)
Seems to pass jit-tests.
Attachment #8355694 - Flags: review?(nmatsakis)
Comment on attachment 8355431 [details] [diff] [review]
Roll up v2

Review of attachment 8355431 [details] [diff] [review]:
-----------------------------------------------------------------

I am concerned about the synchronization. I made a comment below as to the degenerate case I'm worried about. I'll post an alternate suggestion that I worked out.

::: js/src/vm/ThreadPool.cpp
@@ +560,5 @@
> +    mainWorker_->submitJob(job, sliceFrom, sliceFrom + slicesPerWorker);
> +
> +    // Notify the worker threads that there's work now.
> +    for (uint32_t workerId = 0; workerId < numWorkers(); workerId++)
> +        workers_[workerId]->notifyOfWork();

This handoff doesn't feel quite right. In particular, we start the workers and submit jobs above, so it's possible that one of them has seen those jobs and started to execute them and thus are not waiting. In that case, they'd be holding the lock, and we'd be stuck in this loop waiting for them to acquire the lock, no?
Attachment #8355431 - Flags: review?(nmatsakis) → review-
(In reply to Niko Matsakis [:nmatsakis] from comment #30)
> This handoff doesn't feel quite right. In particular, we start the workers
> and submit jobs above, so it's possible that one of them has seen those jobs
> and started to execute them and thus are not waiting. In that case, they'd
> be holding the lock, and we'd be stuck in this loop waiting for them to
> acquire the lock, no?

Let me rephrase this more clearly. I am concerned that the main thread will basically block while some worker is executing its local jobs, since the main thread needs to acquire the worker's lock, but the worker only releases the lock when it is waiting.
So I think that we could synchronize using the atomic pendingSlices_ variable. I was thinking of a scheme where you have just one lock, on the pool, and the pendingSlices_ variable is always kept up to date with the total amount of work remaining. The worker threads would basically spin, waiting until pendingSlices becomes non-zero, and they would continue trying to take local work and steal so long as pendingSlices is non-zero. The pseudo-code is below.

(The other question I guess is how much we want to worry about the nested case. I still think it's nice to keep the code as simple as possible and focus on performing great for the cases we do handle, but it's true that I'm not sure how to extend this strategy to the nested case yet. Sorry, I know we had this discussion already, and I also know that I'm the one who advocated the current approach.)

    class Pool : public Monitor {
      private:
        friend class Worker;
        
        Atomic<uint32_t> pendingSlices_;
        uint32_t activeWorkers_;
        enum { ACTIVE, TERMINATED } state_;
        
      public:
        ...
        
    };
    
    class Worker {
        Pool *const pool_;
        ParallelJob *job_;
        Atomic<uint32_t> bounds_;
    };
    
    void
    Worker::submitJob(ParallelJob *job, uint32_t bounds) {
        ASSERT(!hasWork()); // no support for nested jobs
        job_ = job;
        bounds_ = bounds;
    }
    
    void
    Worker::abortJob() {
        uint32_t bounds = bounds_.swapWith(0);
        uint16_t start, end;
        DecomposeBounds(bounds, &start, &end);
        uint16_t jobs = end - start;
        pool_->pendingSlices_ -= jobs;
    }
    
    void
    Worker::main() {
        while (true) {
            // Wait for work to arrive or for us to
            // be terminating.
            {
                AutoLockMonitor lock(pool_);
                while (pool_->state_ == ACTIVE && pool_->pendingSlices_ == 0) {
                    lock.wait();
                }
                if (pool_->state_ != ACTIVE) {
                    pool_->activeWorkers_--;
                    return;
                }
            }
            
            // Iterate as long as there is work out there
            // to be done...
            while (tryLocalJob()) ;
            while (pool_->pendingSlices_ > 0) {
                while (tryStolenJob()) ;
            }
        }
    }
    
    void
    Pool::executeJob() {
        assert(pendingSlices_ == 0); // no nested work supported
        
        for (each Worker *w) {
            w->submitJob(...);
        }
        
        // signal workers there is work to be done
        pendingSlices_ += totalSlices;
        {
            AutoLockMonitor lock(*this);
            lock.notifyAll();
        }
        
        executeMainJob();

        while (pendingSlices_ != 0)
            stealFromWorkers();
    }

To abort, just invoke abortJob on each worker:

    void
    Pool::abort() {
        for (each Worker *w) {
            w->abortJob();
        }
    }
(In reply to Niko Matsakis [:nmatsakis] from comment #30)
> Comment on attachment 8355431 [details] [diff] [review]
> Roll up v2
> 
> Review of attachment 8355431 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> I am concerned about the synchronization. I made a comment below as to the
> degenerate case I'm worried about. I'll post an alternate suggestion that I
> worked out.
> 
> ::: js/src/vm/ThreadPool.cpp
> @@ +560,5 @@
> > +    mainWorker_->submitJob(job, sliceFrom, sliceFrom + slicesPerWorker);
> > +
> > +    // Notify the worker threads that there's work now.
> > +    for (uint32_t workerId = 0; workerId < numWorkers(); workerId++)
> > +        workers_[workerId]->notifyOfWork();
> 
> This handoff doesn't feel quite right. In particular, we start the workers
> and submit jobs above, so it's possible that one of them has seen those jobs
> and started to execute them and thus are not waiting. In that case, they'd
> be holding the lock, and we'd be stuck in this loop waiting for them to
> acquire the lock, no?

There will always be progress. Since submitting work and getting work are both lock free, workers that have already started will just keep progressing. When the workload finishes, the main thread will fire a spurious "there is work" notification even though all the work is finished. This also doesn't really result in any imbalance, since the workers always prefer locally assigned work over stealing, each time through the worker loop.

Does that make sense? Is this subtlety worth giving something that seems easily extensible to nesting?
Giving up, I meant to say.
I'm not saying it won't always make progress, just that the main thread will block unnecessarily. Regarding extensibility to nesting: I don't think that the current code is particularly extensible to nesting? It seems like using a pair of integers makes it hard to support nesting, but maybe I'm missing something. (Though I still like using a pair of integers -- and I would like to find a way to support nesting without requiring that we allocate vectors of tasks, at least in the non-nested case.)
Comment on attachment 8355694 [details] [diff] [review]
Part 2: Change warmup semantics to be sequential in slices, low to high.

Review of attachment 8355694 [details] [diff] [review]:
-----------------------------------------------------------------

I wonder if it's worth removing the "complete" flag and instead having warmup just run the entire slice all the time, vs just the first chunk. Now that there are more slices this doesn't seem like a bad thing (it's also unclear to me if the very notion of chunks continues to make a lot of sense, but I guess that's separate).
Attachment #8355694 - Flags: review?(nmatsakis) → review+
Attachment #8350847 - Flags: review?(nmatsakis)
Attachment #8355292 - Flags: review?(nmatsakis)
Attached patch Roll up v3Splinter Review
Roughly follows Niko's suggestion from a previous comment.

The bailout-executed test has intermittent failures, depending on how many slices got executed by warmup since the warmup behavior changed. Don't know how to fix it exactly, though.
Attachment #8355431 - Attachment is obsolete: true
Attachment #8357637 - Flags: review?(nmatsakis)
Attachment #8357637 - Flags: review?(nmatsakis) → review+
Blocks: 958370
Part 1 breaks non-NSPR shell builds:
js/src/vm/Monitor.h:38:9: error: use of undeclared identifier 'PR_ASSERT_CURRENT_THREAD_OWNS_LOCK'

Also, you forgot to paste the inbound links.
Flags: needinfo?(shu)
https://hg.mozilla.org/mozilla-central/rev/464e261cbcbe
https://hg.mozilla.org/mozilla-central/rev/976a1fe9f080
Status: ASSIGNED → RESOLVED
Closed: 10 years ago
Flags: in-testsuite+
Resolution: --- → FIXED
Target Milestone: --- → mozilla29
Manually track owner thread in Monitor.
Attachment #8358654 - Flags: review?(pnkfelix)
Flags: needinfo?(shu)
Comment on attachment 8358654 [details] [diff] [review]
Followup: unbreak compiling without NSPR. (r=?)

Review of attachment 8358654 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/src/vm/Monitor.h
@@ +86,5 @@
>  #endif
>  
>      ~AutoLockMonitor() {
>  #ifdef JS_THREADSAFE
>          PR_Unlock(monitor.lock_);

Should you set monitor.owner_ back to null before this Unlock? (To make the ownership invariant bidirectional?)
(If you do choose to enforce the stronger bidirectional invariant, there are probably other places where you would need to assign owner_ to null, and you would probably want to check that it is null (after acquiring the lock) before assigning a non-null value to it.)
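
For illustration, a sketch of the bidirectional invariant being asked for (field names assumed; the real Monitor.h differs): owner_ is claimed right after PR_Lock and cleared right before PR_Unlock, so both acquire and release can assert it.

    #include "mozilla/Assertions.h"
    #include "prlock.h"
    #include "prthread.h"

    class Monitor {
        friend class AutoLockMonitor;
        PRLock *lock_;
    #ifdef DEBUG
        PRThread *owner_;   // thread currently holding lock_, or nullptr
    #endif
    };

    class AutoLockMonitor {
        Monitor &monitor;

      public:
        explicit AutoLockMonitor(Monitor &m) : monitor(m) {
            PR_Lock(monitor.lock_);
    #ifdef DEBUG
            MOZ_ASSERT(!monitor.owner_);              // nobody may already own it
            monitor.owner_ = PR_GetCurrentThread();
    #endif
        }

        ~AutoLockMonitor() {
    #ifdef DEBUG
            MOZ_ASSERT(monitor.owner_ == PR_GetCurrentThread());
            monitor.owner_ = nullptr;                 // clear before releasing
    #endif
            PR_Unlock(monitor.lock_);
        }
    };
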
Attachment #8358702 - Flags: review?(pnkfelix)
Attachment #8358654 - Attachment is obsolete: true
Attachment #8358654 - Flags: review?(pnkfelix)
Attachment #8358702 - Flags: review?(pnkfelix) → review+
Depends on: 958803
This still breaks non-threadsafe builds:

js/src/vm/Monitor.h: In member function ‘bool js::AutoLockMonitor::isFor(js::Monitor&) const’:
js/src/vm/Monitor.h:79:16: error: ‘monitor’ was not declared in this scope
js/src/vm/Monitor.h: In member function ‘bool js::AutoUnlockMonitor::isFor(js::Monitor&) const’:
js/src/vm/Monitor.h:148:16: error: ‘monitor’ was not declared in this scope

Can you please test that such patches actually build with --disable-threadsafe?

Something like this fixes it for me (for both places where it's defined), but I don't know if this is the right approach:

>     bool isFor(Monitor &other) const {
>+#ifdef JS_THREADSAFE
>         return monitor.lock_ == other.lock_;
>+#else
>+        return true;
>+#endif
>     }
Status: RESOLVED → REOPENED
Flags: needinfo?(shu)
Resolution: FIXED → ---
(In reply to Christian Holler (:decoder) from comment #49)
> This still breaks non-threadsafe builds:
> 
> js/src/vm/Monitor.h: In member function ‘bool
> js::AutoLockMonitor::isFor(js::Monitor&) const’:
> js/src/vm/Monitor.h:79:16: error: ‘monitor’ was not declared in this scope
> js/src/vm/Monitor.h: In member function ‘bool
> js::AutoUnlockMonitor::isFor(js::Monitor&) const’:
> js/src/vm/Monitor.h:148:16: error: ‘monitor’ was not declared in this scope
> 
> Can you please test that such patches actually build with
> --disable-threadsafe?
> 
> Something like this fixes it for me (for both places where it's defined),
> but I don't know if this is the right approach:
> 
> >     bool isFor(Monitor &other) const {
> >+#ifdef JS_THREADSAFE
> >         return monitor.lock_ == other.lock_;
> >+#else
> >+        return true;
> >+#endif
> >     }

ugh, sorry. r=me
Flags: needinfo?(shu)
https://hg.mozilla.org/mozilla-central/rev/8838fe37b98d
Status: REOPENED → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED