Closed Bug 919638 Opened 11 years ago Closed 10 years ago

PJS: New WorkStealing ThreadPool

Categories

(Core :: JavaScript Engine, defect)
Severity: normal

Tracking

Status: RESOLVED FIXED
Target Milestone: mozilla29

People

(Reporter: dbonetta, Assigned: shu)

Attachments

(6 files, 14 obsolete files)

68.29 KB, patch (dbonetta: review+)
10.58 KB, patch
81.61 KB, patch
5.28 KB, patch (nmatsakis: review+)
95.51 KB, patch (nmatsakis: review+)
7.31 KB, patch (pnkfelix: review+)

New ThreadPool for PJS using workstealing.
If you are writing a new work stealing implementation, this paper describes an interesting approach:

"Non-blocking steal-half work queues"
http://www.cs.bgu.ac.il/~hendlerd/papers/p280-hendler.pdf

This paper presents "StealHalf", a new generalization of work stealing algorithms that allows workers, instead of stealing one, to steal up to half of the items in a given work queue at a time. Workers don't need to steal as often and they are more likely to find a non-empty queue when they randomly pick one because items are distributed among more queues.
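
For illustration, a minimal sketch of the steal-half operation (generic C++, not the PJS code; the paper's queues are non-blocking, a mutex stands in here to keep the example short):

    #include <deque>
    #include <mutex>
    #include <vector>

    struct WorkQueue {
        std::mutex lock;
        std::deque<int> items;              // `int` stands in for a real task type

        // A thief takes up to half of the victim's items in one operation,
        // so work stays spread across queues and steals happen less often.
        std::vector<int> stealHalf() {
            std::lock_guard<std::mutex> guard(lock);
            size_t half = (items.size() + 1) / 2;
            std::vector<int> stolen(items.begin(), items.begin() + half);
            items.erase(items.begin(), items.begin() + half);
            return stolen;
        }
    };
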
(In reply to Chris Peterson (:cpeterson) from comment #1)
> If you are writing a new work stealing implementation, this paper describes
> an interesting approach:
> 
> "Non-blocking steal-half work queues"
> http://www.cs.bgu.ac.il/~hendlerd/papers/p280-hendler.pdf
> 
> This paper presents "StealHalf", a new generalization of work stealing
> algorithms that allows workers, instead of stealing one, to steal up to half
> of the items in a given work queue at a time. Workers don't need to steal as
> often and they are more likely to find a non-empty queue when they randomly
> pick one because items are distributed among more queues.

Thank you, Chris. Yes, I am working on a workstealing thread pool for PJS. The current prototype I developed does not implement steal-half, but that's definitely an interesting optimization.
Attached patch bug-919638-fix.patch (obsolete) — Splinter Review
Lock-based workstealing algorithm.
Attachment #810500 - Flags: review?(shu)
Comment on attachment 810500 [details] [diff] [review]
bug-919638-fix.patch

Review of attachment 810500 [details] [diff] [review]:
-----------------------------------------------------------------

Looks good overall! I'd like to see the following addressed:

- An overview comment near the beginning of ThreadPool.h explaining the distinction between workers, slices, jobs, etc, and make sure the uses of variable names like taskId, workerId, etc, are consistent
- Use Atomic<T> instead of the ATOMIC_ macros
- Is it possible to not duplicate a bunch of logic for the main thread? Why can't we, say, give the main thread id 0 and have it be the first thing in workers_?

Plus, the style nits that need to be fixed. I didn't list all of them, just the first few instances of each kind that I saw.

::: js/src/jit-test/tests/parallel/Array-scanPar-sum.js
@@ +1,4 @@
>  load(libdir + "parallelarray-helpers.js");
>  function sum(a, b) { return a+b; }
>  if (getBuildConfiguration().parallelJS)
> +  testArrayScanPar(range(1, 8 * 1024), sum);

Where did 8 come from?

::: js/src/vm/ForkJoin.cpp
@@ +368,5 @@
>      // Invoked from parallel worker threads:
> +    void executeFromWorker(uint32_t sliceId, uintptr_t stackLimit);
> +
> +    // Invoked only from the main thread:
> +    void executeFromMain(uint32_t sliceId);

Why change the name from |executionFromMainThread|?

@@ +1373,3 @@
>      {
>          AutoUnlockMonitor unlock(*this);
> +        // push parallel tasks and wait until they're all done

Nit: Capitalize

::: js/src/vm/ThreadPool.cpp
@@ +31,5 @@
> +    uint32_t id_;
> +    ParallelJob *parallelJob_;
> +
> +  public:
> +    void Init(int id, ParallelJob *job) {

Nit: don't capitalize Init

@@ +62,5 @@
> +    ThreadPoolBaseWorker(size_t workerId, ThreadPool *pool) 
> +      : workerId_(workerId), 
> +        pool_(pool), 
> +        worklist_() 
> +    { }

Nit: trailing whitespace in the above 4 lines

@@ +100,5 @@
>      void run();
>  
> +    // To steal work from other workers (or from the main thread)
> +    uint32_t steal(uintptr_t stackLimit);
> +    void stealFromMain(uintptr_t stackLimit);

Maybe Main -> MainThread?

@@ +140,5 @@
> +    virtual bool submitJobs(ParallelTask *jobs, size_t from, size_t to);
> +};
> +
> +
> +ThreadPoolWorker::ThreadPoolWorker(size_t workerId, ThreadPool *pool) 

Trailing whitespace

@@ +188,5 @@
> +    ParallelTask *task = NULL;
> +    {
> +        // Lock so that nobody can add/remove tasks
> +        AutoLockMonitor lock(*(pool_->main_worker_));
> +        if(!pool_->main_worker_->worklist_.empty()) {

Nit: space after if, and ditto for the other cases in the patch.

@@ +196,5 @@
> +
> +    if(task != NULL) {
> +        task->execute(workerId_, stackLimit);
> +        JS_ATOMIC_DECREMENT(&(pool_->pending_jobs_));
> +#ifdef DEBUG        

Trailing whitespace

@@ +231,5 @@
> +
> +    if(pool_->numWorkers_ == 0)
> +        return 0;
> +
> +    int32_t victimId = rand() % (pool_->numWorkers_) -1 ;

Nit: avoid superfluous parentheses.

@@ +232,5 @@
> +    if(pool_->numWorkers_ == 0)
> +        return 0;
> +
> +    int32_t victimId = rand() % (pool_->numWorkers_) -1 ;
> +    while(victimId == this->workerId_) {

Nit: no space after while, and no { } for one-line bodies.

@@ +272,5 @@
>          if (state_ == TERMINATING)
>              break;
>  
> +        if(pool_->workStealing_) {
> +            // Allow others to steal from us while running

This comment doesn't seem right. Aren't the following lines stealing from other threads?

@@ +293,4 @@
>  {
>      AutoLockMonitor lock(*this);
>      JS_ASSERT(state_ == ACTIVE);
> +    for(size_t i=from; i<to; i++) {

Nit: for (size_t i = from; i < to; i++) and similar cases below.

@@ +353,5 @@
> +        }
> +    }
> +
> +    if(pool_->workStealing_) {
> +        // Allow others to steal from us while running

Ditto on the seemingly inaccurate comment.

@@ +408,2 @@
>  {
> +    main_worker_ = js_new<ThreadPoolMainWorker>(this);

Shouldn't be using js_new here for malloc accounting reasons. Use rt->new_ instead.

@@ +430,2 @@
>  # ifdef DEBUG
>      if (char *jsthreads = getenv("JS_THREADPOOL_SIZE"))

Change this to JS_THREADPOOL_WORKERS for clarity.

@@ +433,5 @@
> +    if (char *jsthreads = getenv("JS_THREADPOOL_SLICES"))
> +        numSlices_ = strtol(jsthreads, NULL, 10);
> +    if (char *jsthreads = getenv("JS_THREADPOOL_STEAL")) {
> +        size_t steal = strtol(jsthreads, NULL, 10);
> +        workStealing_ = (steal==1);

Nit: spaces around ==

@@ +468,5 @@
>      // Note that numWorkers_ is the number of *desired* workers,
>      // but workers_.length() is the number of *successfully
>      // initialized* workers.
>      for (size_t workerId = 0; workerId < numWorkers(); workerId++) {
> +        ThreadPoolWorker *worker = js_new<ThreadPoolWorker>(workerId, this);

Use cx->new_ here.

@@ +525,5 @@
> +    size_t jid = jobsPerThread;
> +    size_t tid = 0;
> +
> +    while(jid < numSlices_) {
> +        size_t boundary = (jid+jobsPerThread > numSlices_)? numSlices_ :jid+jobsPerThread ;

Nit: Odd style here, please use consistent spaces around operators.

@@ +559,5 @@
> +
> +    pending_jobs_ = 0;
> +    stolen_jobs_ = 0;
> +
> +    if ( ! submitJobsMain(cx, jobs, jobsPerThread))

Nit: if (!submitJobsMain(cx, jobs, jobsPerThread))

::: js/src/vm/ThreadPool.h
@@ +73,5 @@
>      JSRuntime *const runtime_;
>  #endif
>      js::Vector<ThreadPoolWorker*, 8, SystemAllocPolicy> workers_;
>  
> +    ThreadPoolMainWorker *main_worker_;

Nit: camelCase

@@ +79,4 @@
>      // Number of workers we will start, when we actually start them
>      size_t numWorkers_;
>  
> +    // Total number of slices in the current job. See ForkJoin.h

I don't see any comments about slices in ForkJoin.h. Could you summarize the distinction and give a glossary of workers, slices, jobs, etc, either near the top of this file or the top of ForkJoin.h?

@@ +85,5 @@
> +    // Flag to enable workStealing support
> +    bool workStealing_;
> +
> +    uint32_t stolen_jobs_;          // Number of stolen tasks in the last parallel job
> +    uint32_t pending_jobs_;         // Number of pending jobs on the current job

Instead of using the ATOMIC_ macros, we should use the template Atomic<T> found in mfbt/Atomics.h

Also, camelCase
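
For illustration, a sketch of what the suggested mfbt counters could look like (field and method names here are assumptions, not the patch):

    #include "mozilla/Atomics.h"

    struct SliceCounters {
        // Atomic<T> from mfbt/Atomics.h replaces the JS_ATOMIC_* macros;
        // the overloaded operators compile to atomic read-modify-write ops.
        mozilla::Atomic<uint32_t> pendingSlices;   // slices not yet executed
        mozilla::Atomic<uint32_t> stolenSlices;    // slices stolen in the last job

        void onSubmit(uint32_t n) { pendingSlices += n; }
        void onSliceDone() { pendingSlices--; }    // was JS_ATOMIC_DECREMENT
        void onSteal() { stolenSlices++; }
    };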

@@ +105,5 @@
>  
> +    // Return number of slices in the current parallel job (if present).
> +    size_t numSlices() { return numSlices_; }
> +
> +    bool usingWorkstealing() { return workStealing_; }

WorkStealing
Attachment #810500 - Flags: review?(shu)
Attached patch bug-919638-fix.patch (obsolete) — Splinter Review
Attachment #810500 - Attachment is obsolete: true
Attachment #812300 - Flags: review?(shu)
Comment on attachment 812300 [details] [diff] [review]
bug-919638-fix.patch

Review of attachment 812300 [details] [diff] [review]:
-----------------------------------------------------------------

Looks much better! r=me with the questions below addressed.

You've already run the tests locally, so the next step is to push to the try server. Do you have any level of commit access? If not, I could push it to try for you after the fixes.

::: js/src/vm/ForkJoin.cpp
@@ +457,2 @@
>      ForkJoinMode mode = ForkJoinModeNormal;
>      if (args.length() > 1) {

If you're asserting above that |args.length() == 3|, no need to check args.length() > 1 and > 2. Should move all the argument asserts up to the top of the function.

@@ +1275,5 @@
>      }
>  
>      bool invoke(PerThreadData *perThread) {
>          RootedValue result(perThread);
> +        enter_(jitcode_, argc_ + 1, argv_ + 1, nullptr, calleeToken_, nullptr, 0, result.address());

Style nit: the max width is 100 characters.

@@ +1377,4 @@
>      {
>          AutoUnlockMonitor unlock(*this);
> +        // Push parallel tasks and wait until they're all done.
> +        jobResult = threadPool_->executeJobWithMain(cx_, this, numSlices_);

Why do we not need to check |jobResult != TP_SUCCESS| here anymore?

@@ +1795,5 @@
> +        uint32_t slices = workers * 8;
> +        if (chunks < slices)
> +            return workers;
> +        else
> +            return slices;

Style nit: no else after return. Do

if (c)
    return x;
return y;

::: js/src/vm/ForkJoin.h
@@ +32,5 @@
>  // running in parallel.  Each copy will then do 1/Nth of the total
> +// work. When |N| is big enough the pool will perform load balancing
> +// through workstealing. When |N| is not specified, |N| is the number
> +// of workers in the threadpool---by default, this is the number of
> +// cores on the computer).

This comment is contrary to the intrinsic function definition, where you assert the number of arguments == 3. Which one is intended?

::: js/src/vm/ThreadPool.cpp
@@ +64,5 @@
> +      : workerId_(workerId),
> +        pool_(pool),
> +        worklist_()
> +    { }
> +    virtual ~ThreadPoolBaseWorker() { }

Is this used here or in any of the subclasses?

@@ +73,5 @@
> +
> +    // Submit work to be executed. If this returns true, you are guaranteed
> +    // that the slices will execute before the thread-pool terminates (barring
> +    // an infinite loop in some prior slices).
> +    virtual bool submitSlices(ParallelSliceTask *slices, size_t from, size_t to) = 0;

Does this need to be virtual? Given that |submitSlicesMain| and |submitSlicesWorkers| are already separate paths and that |mainWorker_| is already of the derived type.

@@ +377,5 @@
> +        // Lock so that nobody can add/remove slices
> +        AutoLockMonitor lock(*victim);
> +        if (!victim->worklist_.empty()) {
> +            slice = victim->worklist_.popCopy();
> +        }

Style nit: no braces around single-line bodies.

@@ +523,5 @@
>      if (!lazyStartWorkers(cx))
>          return false;
>  
> +    size_t jid = initialOffset;
> +    size_t tid = 0;

Could you comment what these ids track?
Attachment #812300 - Flags: review?(shu) → review+
Attached patch bug-919638-fix.patch (obsolete) — Splinter Review
Attachment #812300 - Attachment is obsolete: true
Attachment #814641 - Flags: review?(shu)
Comment on attachment 814641 [details] [diff] [review]
bug-919638-fix.patch

Review of attachment 814641 [details] [diff] [review]:
-----------------------------------------------------------------

r=me with nits fixed. No need to request re-review.

I'd like to know why we don't need to check for jobResult != TP_SUCCESS in comment 2 below though. You could just explain on IRC.

::: js/src/vm/ForkJoin.cpp
@@ +454,3 @@
>      JS_ASSERT(args[0].toObject().is<JSFunction>());
> +    JS_ASSERT(args[1].isInt32());
> +    JS_ASSERT(args[1].toInt32() < NumForkJoinModes);    

Nit: trailing whitespace

@@ +1372,4 @@
>      {
>          AutoUnlockMonitor unlock(*this);
> +        // Push parallel tasks and wait until they're all done.
> +        jobResult = threadPool_->executeJobWithMain(cx_, this, numSlices_);

I still don't understand why we don't need to check |jobResult != TP_SUCCESS| here.

::: js/src/vm/ForkJoin.h
@@ +32,4 @@
>  // running in parallel.  Each copy will then do 1/Nth of the total
> +// work. When |N| is big enough the pool will perform load balancing
> +// through workstealing. Otherwise, |N| is the number of threads in
> +// the computer.

This seems to say that if you give an N < # of threads, N = # of threads. Is that actually the case?

::: js/src/vm/ThreadPool.cpp
@@ +109,5 @@
>      // termination completes.
>      void terminate();
> +
> +    // Submit work to be executed. If this returns true, you are guaranteed
> +    // that the slices will execute before the thread-pool terminates by 

Nit: trailing whitespace

@@ +522,5 @@
>          return false;
>  
> +    // Assign each worker thread a slice, round-robin.
> +    size_t sliceId = initialOffset;
> +    size_t workerIid = 0;

Typo for workerId?
Attachment #814641 - Flags: review?(shu) → review+
Attached patch bug-919638-fix.patch (obsolete) — Splinter Review
Attachment #814641 - Attachment is obsolete: true
Blocks: 925828
Attached patch bug-919638-fix.patch (obsolete) — Splinter Review
Attachment #815965 - Attachment is obsolete: true
Attachment #816005 - Attachment is obsolete: true
Attachment #816038 - Flags: review+
Partial rewrite
Attachment #8349224 - Flags: review?(nmatsakis)
Assignee: general → shu
Status: NEW → ASSIGNED
Make bailouts less stupid in the workstealing scheduler
Attachment #8349225 - Flags: review?(nmatsakis)
Some more fixes, remove debug printfs
Attachment #8349857 - Flags: review?(nmatsakis)
Attachment #8349224 - Attachment is obsolete: true
Attachment #8349224 - Flags: review?(nmatsakis)
Remove debug printfs
Attachment #8349858 - Flags: review?(nmatsakis)
Attachment #8349225 - Attachment is obsolete: true
Attachment #8349225 - Flags: review?(nmatsakis)
Comment on attachment 8349857 [details] [diff] [review]
Part 1: Implement a work stealing thread pool for PJS. (

Review of attachment 8349857 [details] [diff] [review]:
-----------------------------------------------------------------

This looks good.

::: js/src/vm/ThreadPool.cpp
@@ +59,5 @@
> +    Vector<ThreadPoolSlice, 8, SystemAllocPolicy> worklist_;
> +
> +    bool stealFrom(ThreadPoolBaseWorker *victim, ThreadPoolSlice *slice);
> +
> +    void popSlice(ThreadPoolSlice *slice) {

Assert that lock is held on entry

@@ +275,5 @@
>      AutoLockMonitor lock(*this);
>      JS_ASSERT(state_ == ACTIVE);
> +
> +    for (unsigned i = from; i < to; i++) {
> +        if (!worklist_.append(ThreadPoolSlice(job, i)))

In my rewrite of threadpool, I did this differently. I had the threadpool itself be "assigned" to one job at a time, and then rather than storing a queue of ThreadPoolSlices (which takes O(n) memory and must be allocated etc), each worker thread just stored a minSlice and maxSlice. Whenever the two were unequal, there was work to do. To do a local job, the worker thread could increment minSlice. To steal, it would decrement maxSlice. This can be easily done with locks or by compare-and-swapping a single 64-bit value that encodes both min/max slice together. What do you think of this? It seems better to me, maybe we can open up a follow-up bug to implement this scheme.

(In that design, the thread pool was also a singleton that lived apart from a runtime, so part of "locking" it for a job was assigning it to a runtime. That made it possible to share the threadpool between workers.)

@@ +487,5 @@
> +    // Assign each worker thread a slice, round-robin.
> +    unsigned sliceId = slicesPerWorker;
> +    unsigned workerId = 0;
> +
> +    while (sliceId < numSlices) {

Since we do work-stealing, it probably doesn't matter, but I don't think that this is the most equal distribution. In particular, if the number of slices doesn't divide up evenly, the final worker winds up with more. If we wanted to start out with an equal distribution, the better algorithm is to compute EXTRA = NUM_SLICES % NUM_WORKERS and then distribute slicesPerWorker+1 into the first EXTRA workers, and slicesPerWorker into the remainder.

@@ +513,5 @@
> +#ifdef DEBUG
> +    stolenSlices_ = 0;
> +#endif
> +
> +    unsigned slicesPerThread = (unsigned) (numSlices / (numWorkers() + 1));

Nit: use name slicesPerWorker as above (or change the code above, I don't care)

::: js/src/vm/ThreadPool.h
@@ +37,4 @@
>  };
>  
>  // ThreadPool used for parallel JavaScript execution as well as
>  // parallel compilation.  Unless you are building a new kind of

We should update this comment. We don't use the thread pool for compilation -- never got around to it -- and I've come to think it's a bad idea. Eventually I'd like the par workers to be able to kick off compilation as they go and have it run asynchronously.

@@ +61,5 @@
> +// used by the pool as a worker thread.
> +//
> +// The ThreadPool supports work stealing. Every time a worker
> +// completes all the slices in its local queue, it tries to acquire
> +// some work form other workers (including the main thread).

s/form/from

@@ +73,5 @@
> +// bunch of slices to be done at the very beginning of the job, and
> +// have to wait until all the threads have joined back. During phase
> +// (1) there is no synchronization overhead between workers
> +// introduced by the stealing algorithm, and therefore the execution
> +// overhead introduced is almost nullptr with balanced workloads. The way

s/nullptr/zero/

@@ +121,5 @@
> +    unsigned stolenSlices() { return stolenSlices_; }
> +#endif
> +
> +    // Submit slices for parallel execution in the pool.
> +    bool submitSlices(JSContext *cx, ParallelJob *job, unsigned slicesPerWorker, unsigned numSlices);

This API is only used internally. Let's make it private I think (or perhaps just remove it and fold it into executeJob())
Attachment #8349857 - Flags: review?(nmatsakis) → review+
(In reply to Niko Matsakis [:nmatsakis] from comment #16)
> Comment on attachment 8349857 [details] [diff] [review]
> Part 1: Implement a work stealing thread pool for PJS. (
> 
> Review of attachment 8349857 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> This looks good.
> 
> ::: js/src/vm/ThreadPool.cpp
> @@ +59,5 @@
> > +    Vector<ThreadPoolSlice, 8, SystemAllocPolicy> worklist_;
> > +
> > +    bool stealFrom(ThreadPoolBaseWorker *victim, ThreadPoolSlice *slice);
> > +
> > +    void popSlice(ThreadPoolSlice *slice) {
> 
> Assert that lock is held on entry
> 

How do I actually do this? A |#ifdef DEBUG|-only bool field seems clunky.

> @@ +275,5 @@
> >      AutoLockMonitor lock(*this);
> >      JS_ASSERT(state_ == ACTIVE);
> > +
> > +    for (unsigned i = from; i < to; i++) {
> > +        if (!worklist_.append(ThreadPoolSlice(job, i)))
> 
> In my rewrite of threadpool, I did this differently. I had the threadpool
> itself be "assigned" to one job at a time, and then rather than storing a
> queue of ThreadPoolSlices (which takes O(n) memory and must be allocated
> etc), each worker thread just stored a minSlice and maxSlice. Whenever the
> two were unequal, there was work to do. To do a local job, the worker thread
> could increment minSlice. To steal, it would decrement maxSlice. This can be
> easily done with locks or by compare-and-swapping a single 64-bit value that
> encodes both min/max slice together. What do you think of this? It seems
> better to me, maybe we can open up a follow-up bug to implement this scheme.
> 
> (In that design, the thread pool was also a singleton that lived apart from
> a runtime, so part of "locking" it for a job was assigning it to a runtime.
> That made it possible to share the threadpool between workers.)
> 

Daniele's way is more flexible, if we ever want to reuse this for task parallelism with work stealing. But I'll implement this optimization in this fashion: each worker gets assigned one job and some # of slices; stealing another worker's slice executes the victim worker's job.

> @@ +487,5 @@
> > +    // Assign each worker thread a slice, round-robin.
> > +    unsigned sliceId = slicesPerWorker;
> > +    unsigned workerId = 0;
> > +
> > +    while (sliceId < numSlices) {
> 
> Since we do work-stealing, it probably doesn't matter, but I don't think
> that this is the most equal distribution. In particular, if the number of
> slices doesn't divide up evenly, the final worker winds up with more. If we
> wanted to start out with an equal distribution, the better algorithm is to
> compute EXTRA = NUM_SLICES % NUM_WORKERS and then distribute
> slicesPerWorker+1 into the first EXTRA workers, and slicesPerWorker into the
> remainder.
> 

Good idea, I'll change this.
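
For reference, a sketch of that split (illustrative names, not the patch): the first |extra| workers get one additional slice, so no worker ends up with more than one slice over its share.

    unsigned slicesPerWorker = numSlices / numWorkers;
    unsigned extra = numSlices % numWorkers;
    unsigned sliceFrom = 0;
    for (unsigned workerId = 0; workerId < numWorkers; workerId++) {
        unsigned count = slicesPerWorker + (workerId < extra ? 1 : 0);
        workers[workerId]->submitSlices(sliceFrom, sliceFrom + count);
        sliceFrom += count;
    }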

> @@ +513,5 @@
> > +#ifdef DEBUG
> > +    stolenSlices_ = 0;
> > +#endif
> > +
> > +    unsigned slicesPerThread = (unsigned) (numSlices / (numWorkers() + 1));
> 
> Nit: use name slicesPerWorker as above (or change the code above, I don't
> care)
> 
> ::: js/src/vm/ThreadPool.h
> @@ +37,4 @@
> >  };
> >  
> >  // ThreadPool used for parallel JavaScript execution as well as
> >  // parallel compilation.  Unless you are building a new kind of
> 
> We should update this comment. We don't use the thread pool for compilation
> -- never got around to it -- and I've come to think it's a bad idea.
> Eventually I'd like the par workers to be able to kick off compilation as
> they go and have it run asynchronously.
> 
> @@ +61,5 @@
> > +// used by the pool as a worker thread.
> > +//
> > +// The ThreadPool supports work stealing. Every time a worker
> > +// completes all the slices in its local queue, it tries to acquire
> > +// some work form other workers (including the main thread).
> 
> s/form/from
> 
> @@ +73,5 @@
> > +// bunch of slices to be done at the very beginning of the job, and
> > +// have to wait until all the threads have joined back. During phase
> > +// (1) there is no synchronization overhead between workers
> > +// introduced by the stealing algorithm, and therefore the execution
> > +// overhead introduced is almost nullptr with balanced workloads. The way
> 
> s/nullptr/zero/
> 
> @@ +121,5 @@
> > +    unsigned stolenSlices() { return stolenSlices_; }
> > +#endif
> > +
> > +    // Submit slices for parallel execution in the pool.
> > +    bool submitSlices(JSContext *cx, ParallelJob *job, unsigned slicesPerWorker, unsigned numSlices);
> 
> This API is only used internally. Let's make it private I think (or perhaps
> just remove it and fold it into executeJob())

I'll just hand inline it, yeah.
Applied comments.
Attachment #8349857 - Attachment is obsolete: true
Rebased against v2.
Attachment #8350441 - Flags: review?(nmatsakis)
Attachment #8349858 - Attachment is obsolete: true
Attachment #8349858 - Flags: review?(nmatsakis)
Comment on attachment 8350440 [details] [diff] [review]
Part 1: Implement a work stealing thread pool for PJS. v2

Review of attachment 8350440 [details] [diff] [review]:
-----------------------------------------------------------------

Carrying r+ from nmatsakis
Attachment #8350440 - Flags: review+
(In reply to Shu-yu Guo [:shu] from comment #17)
> How do I actually do this? A |#ifdef DEBUG|-only bool field seems clunky.

PR_ASSERT_LOCK_HELD_BY_CURRENT_THREAD or something like that

> Daniele's way is more flexible, if we ever want to reuse this for task
> parallelism with work stealing. But I'll implement this optimization in this
> fashion: each worker gets assigned one job and some # of slices, stealing
> another worker's slice executes the victim worker's job.

Ok. This sounds like a good compromise. I considered the loss of flexibility; I'm just not concerned about specializing for the API we have today and letting tomorrow work itself out. The precise shape of a task API is kind of unclear to me just now. But with the approach you suggested, it seems that it would scale to "nested parallelism" in a relatively straightforward way: you could "push" a new job (with a new set of slice indices) onto the current worker.
Was fairly easy to make the workstealing lock free.
Attachment #8350846 - Flags: review?(nmatsakis)
Rebased against lock free part 1.
Attachment #8350847 - Flags: review?(nmatsakis)
Attachment #8350441 - Attachment is obsolete: true
Attachment #8350441 - Flags: review?(nmatsakis)
Niko and Felix,

About the CAS implementation: we don't have 64bit CAS on all our architectures (WinXP 32bit for one, I don't know about ARM). We could shrink slice IDs to 16bit. 65535 slices per thread seems plenty to me for load balancing purposes. Thoughts?
It's not 65535 slices per thread, right? Just overall? Still, probably fine.
(In reply to Niko Matsakis [:nmatsakis] from comment #25)
> It's not 65535 slices per thread, right? Just overall? Still, probably fine.

Oops, yes you're right.
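
For illustration, a sketch of the packed-bounds scheme being discussed (16-bit slice indices packed into one 32-bit atomic; names are assumptions, not necessarily what the rolled-up patch does). Taking local work bumps the low half and stealing drops the high half, each with a single compareExchange:

    #include <stdint.h>
    #include "mozilla/Atomics.h"

    static uint32_t ComposeBounds(uint16_t from, uint16_t to) {
        return (uint32_t(to) << 16) | from;
    }
    static void DecomposeBounds(uint32_t bounds, uint16_t *from, uint16_t *to) {
        *from = uint16_t(bounds & 0xffff);
        *to = uint16_t(bounds >> 16);
    }

    struct SliceBounds {
        mozilla::Atomic<uint32_t> bounds_;

        // Owner thread: take the next local slice, if any.
        bool popFront(uint16_t *sliceId) {
            for (;;) {
                uint32_t old = bounds_;
                uint16_t from, to;
                DecomposeBounds(old, &from, &to);
                if (from == to)
                    return false;                   // no local work left
                if (bounds_.compareExchange(old, ComposeBounds(from + 1, to))) {
                    *sliceId = from;
                    return true;
                }
            }
        }

        // Thief thread: take the last slice from the victim, if any.
        bool popBack(uint16_t *sliceId) {
            for (;;) {
                uint32_t old = bounds_;
                uint16_t from, to;
                DecomposeBounds(old, &from, &to);
                if (from == to)
                    return false;
                if (bounds_.compareExchange(old, ComposeBounds(from, to - 1))) {
                    *sliceId = to - 1;
                    return true;
                }
            }
        }
    };
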
Rolled up lock free impl.
Attachment #8350440 - Attachment is obsolete: true
Attachment #8350846 - Attachment is obsolete: true
Attachment #8350846 - Flags: review?(nmatsakis)
Attachment #8355292 - Flags: review?(nmatsakis)
Attached patch Roll up v2 (obsolete) — Splinter Review
Rolled up parts 1 and 2 since after removing rendezvous code and fixing aborts, I could no longer (with moderate effort) cleanly separate the 2 patches.

Some notes on this version:

 - Rendezvous code removed

 - The fork-join synchronization mechanism is now moved to ThreadPool.

   Since this is moved to ThreadPool, submitting jobs to workers no longer requires locking, as distributing the work at the beginning of a fork-join means all the workers are idle. I moved the pendingSlices_ arithmetic back into the Worker classes, largely because of symmetry with aborts, see below.

 - Aborting is somewhat tricky. Aborting is really a per-job thing: once a job fails, we need to clear that job from all workers. Only the workers know how many slices are still left pending, so the pendingSlices_ arithmetic can only be done in abortJob(). For symmetry, submitJob() also does the pendingSlices_ arithmetic.

   Moreover, aborting a job doesn't really mean we should pause the worker, since the thread pool might contain other jobs in the current workload that are still unfinished (thinking ahead to actual nested parallelism here).

Not obsoleting the old patches in case you want to refer back to them.

Ping me on IRC if you have any questions.
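
For illustration, a rough sketch of the submit/abort symmetry described above, reusing the ComposeBounds/DecomposeBounds helpers from the earlier sketch (names are assumptions, not the patch):

    // The worker that was handed a range of slices is the one that adjusts
    // the pool-wide pending count, both on submission and on abort.
    void ThreadPoolWorker::submitJob(ParallelJob *job, uint16_t from, uint16_t to) {
        job_ = job;
        bounds_ = ComposeBounds(from, to);
        pool_->pendingSlices_ += to - from;       // add this worker's share
    }

    void ThreadPoolWorker::abortJob() {
        // Swapping out bounds_ atomically tells us exactly how many of this
        // worker's slices were still unclaimed, so the subtraction happens here.
        uint32_t old = bounds_.exchange(ComposeBounds(0, 0));
        uint16_t from, to;
        DecomposeBounds(old, &from, &to);
        pool_->pendingSlices_ -= to - from;       // remove exactly the leftovers
    }
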
Attachment #8355431 - Flags: review?(nmatsakis)
Seems to pass jit-tests.
Attachment #8355694 - Flags: review?(nmatsakis)
Comment on attachment 8355431 [details] [diff] [review]
Roll up v2

Review of attachment 8355431 [details] [diff] [review]:
-----------------------------------------------------------------

I am concerned about the synchronization. I made a comment below as to the degenerate case I'm worried about. I'll post an alternate suggestion that I worked out.

::: js/src/vm/ThreadPool.cpp
@@ +560,5 @@
> +    mainWorker_->submitJob(job, sliceFrom, sliceFrom + slicesPerWorker);
> +
> +    // Notify the worker threads that there's work now.
> +    for (uint32_t workerId = 0; workerId < numWorkers(); workerId++)
> +        workers_[workerId]->notifyOfWork();

This handoff doesn't feel quite right. In particular, we start the workers and submit jobs above, so it's possible that one of them has seen those jobs and started to execute them and thus are not waiting. In that case, they'd be holding the lock, and we'd be stuck in this loop waiting for them to acquire the lock, no?
Attachment #8355431 - Flags: review?(nmatsakis) → review-
(In reply to Niko Matsakis [:nmatsakis] from comment #30)
> This handoff doesn't feel quite right. In particular, we start the workers
> and submit jobs above, so it's possible that one of them has seen those jobs
> and started to execute them and thus are not waiting. In that case, they'd
> be holding the lock, and we'd be stuck in this loop waiting for them to
> acquire the lock, no?

Let me rephrase this more clearly. I am concerned that the main thread will basically block while some worker is executing its local jobs, since the main thread needs to acquire the worker's lock, but the worker only releases the lock when it is waiting.
So I think that we could synchronize using the atomic pendingSlices_ variable. I was thinking of a scheme where you have just one lock, on the pool, and the pendingSlices_ variable is always kept up to date with the total amount of work remaining. The worker threads would basically spin, waiting until pendingSlices becomes non-zero, and they would continue trying to take local work and steal so long as pendingSlices is non-zero. The pseudo-code is below.

(The other question I guess is how much we want to worry about the nested case. I still think it's nice to keep the code as simple as possible and focus on performing great for the cases we do handle, but it's true that I'm not sure how to extend this strategy to the nested case yet. Sorry, I know we had this discussion already, and I also know that I'm the one who advocated the current approach.)

    class Pool : public Monitor {
      private:
        friend class Worker;
        
        Atomic<uint32_t> pendingSlices_;
        uint32_t activeWorkers_;
        enum { ACTIVE, TERMINATED } state_;
        
      public:
        ...
        
    };
    
    class Worker {
        Pool *const pool_;
        ParallelJob *job_;
        Atomic<uint32_t> bounds_;
    };
    
    void
    Worker::submitJob(ParallelJob *job, uint32_t bounds) {
        ASSERT(!hasWork()); // no support for nested jobs
        job_ = job;
        bounds_ = bounds;
    }
    
    void
    Worker::abortJob() {
        uint32_t bounds = bounds_.swapWith(0);
        uint16_t start, end;
        DecomposeBounds(bounds, &start, &end);
        uint16_t jobs = end - start;
        pool_->pendingSlices_ -= jobs;
    }
    
    void
    Worker::main() {
        while (true) {
            // Wait for work to arrive or for us to
            // be terminating.
            {
                AutoLockMonitor lock(pool_);
                while (pool_->state_ == ACTIVE && pool_->pendingSlices_ == 0) {
                    lock.wait();
                }
                if (pool_->state_ != ACTIVE) {
                    pool_->activeWorkers_--;
                    return;
                }
            }
            
            // Iterate as long as there is work out there
            // to be done...
            while (tryLocalJob()) ;
            while (pool_->pendingSlices_ > 0) {
                while (tryStolenJob()) ;
            }
        }
    }
    
    void
    Pool::executeJob() {
        assert(pendingSlices_ == 0); // no nested work supported
        
        for (each Worker *w) {
            w->submitJob(...);
        }
        
        // signal workers there is work to be done
        pendingSlices_ += totalSlices;
        {
            AutoLockMonitor lock(*this);
            lock.notifyAll();
        }
        
        executeMainJob();

        while (pendingSlices_ != 0)
            stealFromWorkers();
    }

To abort, just invoke abortJob on each worker:

    void
    Pool::abort() {
        for (each Worker *w) {
            w->abortJob();
        }
    }
(In reply to Niko Matsakis [:nmatsakis] from comment #30)
> Comment on attachment 8355431 [details] [diff] [review]
> Roll up v2
> 
> Review of attachment 8355431 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> I am concerned about the synchronization. I made a comment below as to the
> degenerate case I'm worried about. I'll post an alternate suggestion that I
> worked out.
> 
> ::: js/src/vm/ThreadPool.cpp
> @@ +560,5 @@
> > +    mainWorker_->submitJob(job, sliceFrom, sliceFrom + slicesPerWorker);
> > +
> > +    // Notify the worker threads that there's work now.
> > +    for (uint32_t workerId = 0; workerId < numWorkers(); workerId++)
> > +        workers_[workerId]->notifyOfWork();
> 
> This handoff doesn't feel quite right. In particular, we start the workers
> and submit jobs above, so it's possible that one of them has seen those jobs
> and started to execute them and thus are not waiting. In that case, they'd
> be holding the lock, and we'd be stuck in this loop waiting for them to
> acquire the lock, no?

There will always be progress. Since submitting work and getting work are both lock free, workers that have already started will just keep progressing. When the workload finishes, the main thread will fire a spurious "there is work" notification even though all the work is finished. This also doesn't really result in any imbalance, since the workers always prefer locally assigned work over stealing, each time through the worker loop.

Does that make sense? Is this subtlety worth giving something that seems easily extensible to nesting?
Giving up, I meant to say.
I'm not saying it won't always make progress, just that the main thread will block unnecessarily. Regarding extensibility to nesting: I don't think that the current code is particularly extensible to nesting? It seems like using a pair of integers makes it hard to support nesting, but maybe I'm missing something. (Though I still like using a pair of integers -- and I would like to find a way to support nesting without requiring that we allocate vectors of tasks, at least in the non-nested case.)
Comment on attachment 8355694 [details] [diff] [review]
Part 2: Change warmup semantics to be sequential in slices, low to high.

Review of attachment 8355694 [details] [diff] [review]:
-----------------------------------------------------------------

I wonder if it's worth removing the "complete" flag and instead having warmup just run the entire slice all the time, vs just the first chunk. Now that there are more slices this doesn't seem like a bad thing (it's also unclear to me if the very notion of chunks continues to make a lot of sense, but I guess that's separate).
Attachment #8355694 - Flags: review?(nmatsakis) → review+
Attachment #8350847 - Flags: review?(nmatsakis)
Attachment #8355292 - Flags: review?(nmatsakis)
Attached patch Roll up v3Splinter Review
Roughly follows Niko's suggestion from a previous comment.

The bailout-executed test has intermittent failures, depending on how many slices got executed by warmup since the warmup behavior changed. Don't know how to fix it exactly, though.
Attachment #8355431 - Attachment is obsolete: true
Attachment #8357637 - Flags: review?(nmatsakis)
Attachment #8357637 - Flags: review?(nmatsakis) → review+
Blocks: 958370
Part 1 breaks non-NSPR shell builds:
js/src/vm/Monitor.h:38:9: error: use of undeclared identifier 'PR_ASSERT_CURRENT_THREAD_OWNS_LOCK'

Also, you forgot to paste the inbound links.
Flags: needinfo?(shu)
https://hg.mozilla.org/mozilla-central/rev/464e261cbcbe
https://hg.mozilla.org/mozilla-central/rev/976a1fe9f080
Status: ASSIGNED → RESOLVED
Closed: 10 years ago
Flags: in-testsuite+
Resolution: --- → FIXED
Target Milestone: --- → mozilla29
Manually track owner thread in Monitor.
Attachment #8358654 - Flags: review?(pnkfelix)
Flags: needinfo?(shu)
Comment on attachment 8358654 [details] [diff] [review]
Followup: unbreak compiling without NSPR. (r=?)

Review of attachment 8358654 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/src/vm/Monitor.h
@@ +86,5 @@
>  #endif
>  
>      ~AutoLockMonitor() {
>  #ifdef JS_THREADSAFE
>          PR_Unlock(monitor.lock_);

Should you set monitor.owner_ back to null before this Unlock? (To make the ownership invariant bidirectional?)
(If you do choose to enforce the stronger bidirectional invariant, there are probably other places where you would need to assign owner_ to null, and you would probably want to check that it is null (after acquiring the lock) before assigning a non-null value to it.)
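
For illustration, a sketch of the bidirectional invariant being asked for (field names assumed; the real Monitor.h differs): owner_ is claimed right after PR_Lock and cleared right before PR_Unlock, so both acquire and release can assert it.

    #include "mozilla/Assertions.h"
    #include "prlock.h"
    #include "prthread.h"

    class Monitor {
        friend class AutoLockMonitor;
        PRLock *lock_;
    #ifdef DEBUG
        PRThread *owner_;   // thread currently holding lock_, or nullptr
    #endif
    };

    class AutoLockMonitor {
        Monitor &monitor;

      public:
        explicit AutoLockMonitor(Monitor &m) : monitor(m) {
            PR_Lock(monitor.lock_);
    #ifdef DEBUG
            MOZ_ASSERT(!monitor.owner_);              // nobody may already own it
            monitor.owner_ = PR_GetCurrentThread();
    #endif
        }

        ~AutoLockMonitor() {
    #ifdef DEBUG
            MOZ_ASSERT(monitor.owner_ == PR_GetCurrentThread());
            monitor.owner_ = nullptr;                 // clear before releasing
    #endif
            PR_Unlock(monitor.lock_);
        }
    };
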
Attachment #8358702 - Flags: review?(pnkfelix)
Attachment #8358654 - Attachment is obsolete: true
Attachment #8358654 - Flags: review?(pnkfelix)
Attachment #8358702 - Flags: review?(pnkfelix) → review+
Depends on: 958803
This still breaks non-threadsafe builds:

js/src/vm/Monitor.h: In member function ‘bool js::AutoLockMonitor::isFor(js::Monitor&) const’:
js/src/vm/Monitor.h:79:16: error: ‘monitor’ was not declared in this scope
js/src/vm/Monitor.h: In member function ‘bool js::AutoUnlockMonitor::isFor(js::Monitor&) const’:
js/src/vm/Monitor.h:148:16: error: ‘monitor’ was not declared in this scope

Can you please test that such patches actually build with --disable-threadsafe?

Something like this fixes it for me (for both places where it's defined), but I don't know if this is the right approach:

>     bool isFor(Monitor &other) const {
>+#ifdef JS_THREADSAFE
>         return monitor.lock_ == other.lock_;
>+#else
>+        return true;
>+#endif
>     }
Status: RESOLVED → REOPENED
Flags: needinfo?(shu)
Resolution: FIXED → ---
(In reply to Christian Holler (:decoder) from comment #49)
> This still breaks non-threadsafe builds:
> 
> js/src/vm/Monitor.h: In member function ‘bool
> js::AutoLockMonitor::isFor(js::Monitor&) const’:
> js/src/vm/Monitor.h:79:16: error: ‘monitor’ was not declared in this scope
> js/src/vm/Monitor.h: In member function ‘bool
> js::AutoUnlockMonitor::isFor(js::Monitor&) const’:
> js/src/vm/Monitor.h:148:16: error: ‘monitor’ was not declared in this scope
> 
> Can you please test that such patches actually build with
> --disable-threadsafe?
> 
> Something like this fixes it for me (for both places where it's defined),
> but I don't know if this is the right approach:
> 
> >     bool isFor(Monitor &other) const {
> >+#ifdef JS_THREADSAFE
> >         return monitor.lock_ == other.lock_;
> >+#else
> >+        return true;
> >+#endif
> >     }

ugh, sorry. r=me
Flags: needinfo?(shu)
https://hg.mozilla.org/mozilla-central/rev/8838fe37b98d
Status: REOPENED → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED