Closed Bug 969722 Opened 6 years ago Closed 6 years ago

PJS: Inline the self-hosted parts of the scheduler

Categories

(Core :: JavaScript Engine: JIT, defect)

defect
Not set

Tracking

()

RESOLVED FIXED
mozilla30

People

(Reporter: shu, Assigned: shu)

References

Details

Attachments

(3 files, 5 obsolete files)

With bug 968524 landed the self-hosted parts can now be inlined.
Depends on: 968524
Assignee: nobody → shu
Status: NEW → ASSIGNED
Attached patch Part 2: Inline ForkJoinGetSlice. (obsolete) — Splinter Review
WIP.
Comment on attachment 8375322 [details] [diff] [review]
Part 1: Refactor how ThreadPoolWorkers are accessed for ease of inlining.

Review of attachment 8375322 [details] [diff] [review]:
-----------------------------------------------------------------

r+ -- the comments are just questions, I figure you know where you're going with this :)

::: js/src/vm/ThreadPool.h
@@ +78,5 @@
> +    uint32_t id() const {
> +        return workerId_;
> +    }
> +
> +    void submitSlices(uint16_t sliceFrom, uint16_t sliceTo) {

Can we make this non-public somehow? Or do we expect self-hosted code to be calling it?

@@ +87,5 @@
> +    // Get the next slice; work stealing happens here if work stealing is
> +    // on. Returns false if there are no more slices to hand out.
> +    //
> +    // Implemented in ThreadPoolWorker and ThreadPoolMainWorker.
> +    virtual bool getSlice(uint16_t *sliceId) = 0;

Do we really want to pay the price of virtual dispatch here?
Attachment #8375322 - Flags: review?(nmatsakis) → review+
Unified the worker classes.
Attachment #8376563 - Flags: review?(nmatsakis)
Attachment #8375322 - Attachment is obsolete: true
Attached patch Part 2: Inline ForkJoinGetSlice. (obsolete) — Splinter Review
Lots of files changed, but the meat is in CodeGenerator-x86-shared.cpp's
JitRuntime::generateForkJoinGetSliceStub.
Attachment #8376644 - Flags: review?(nmatsakis)
Attachment #8375323 - Attachment is obsolete: true
I'm seeing a slight speedup on mandelbrot:

MANDELBROT PARALLEL AVERAGE  : 92.66666666666667
vs
MANDELBROT PARALLEL AVERAGE  : 104

We can now accomodate more slices though, so we have more freedom to tweak the
computation for # of slices.
Comment on attachment 8376563 [details] [diff] [review]
Part 1: Refactor how ThreadPoolWorkers are accessed for ease of inlining. v2

Review of attachment 8376563 [details] [diff] [review]:
-----------------------------------------------------------------

Unifying the two worker types seems like a win.

::: js/src/vm/ThreadPool.cpp
@@ +169,5 @@
>      do {
>          if (!pool_->hasWork())
>              return false;
>  
> +        // Add one to add the main thread into the mix.

This comment seems out of date.

@@ +174,1 @@
>          victim = pool_->workers_[rand() % pool_->numWorkers()];

From the man page:

The function rand() is not reentrant or thread-safe, since it uses hidden state that is modified on each call. This might just be the seed value to be used by the next call, or it might be something more elaborate. In order to get reproducible behavior in a threaded application, this state must be made explicit; this can be done using the reentrant function rand_r(). 

I suggest we call something else. Surely mozilla (or NSPR?) has a random number facility.

::: js/src/vm/ThreadPool.h
@@ +186,5 @@
>      // Number of pending slices in the current job.
>      mozilla::Atomic<uint32_t, mozilla::ReleaseAcquire> pendingSlices_;
>  
> +    // Whether the main thread is currently processing slices.
> +    bool isMainThreadActive_;

likely this should be Atomic or volatile, I imagine it's accessed from multiple threads; in Java volatile would be right, but I... am not entirely sure about C++. In any case, reads on an x86 don't generate any extra barriers or anything, so atomic should be fine :)
Attachment #8376563 - Flags: review?(nmatsakis) → review+
Comment on attachment 8376644 [details] [diff] [review]
Part 2: Inline ForkJoinGetSlice.

Review of attachment 8376644 [details] [diff] [review]:
-----------------------------------------------------------------

This looks good to me!

::: js/src/assembler/assembler/X86Assembler.h
@@ +1409,5 @@
> +    {
> +        spew("cmpw       %s, %s",
> +             nameIReg(2, src), nameIReg(2, dst));
> +        m_formatter.prefix(PRE_OPERAND_SIZE);
> +        m_formatter.oneByteOp(OP_CMP_EvGv, src, dst);

Just gonna trust you here ;)

::: js/src/jit/MCallOptimize.cpp
@@ +1373,5 @@
> +{
> +    if (info().executionMode() != ParallelExecution)
> +        return InliningStatus_NotInlined;
> +
> +    if (callInfo.argc() != 1 || callInfo.constructing())

Maybe we should just make these checks asserts, since using ForkJoinGetSlice() in any other way is just bogus.

::: js/src/jit/shared/CodeGenerator-x86-shared.cpp
@@ +1875,5 @@
> +        // If we don't have any slices left ourselves, move on to stealing.
> +        masm.branch16(Assembler::Equal, output, bounds, &stealWork);
> +
> +        // If we still have work, try to CAS [ from+1, to ].
> +        masm.move32(output, eax);

If we put the from bound in the LOWER 16 bits, rather than the upper, you could just inc the bounds integer as a whole and it'd work out fine. You wouldn't have to do all this bitshifting, masking and or'ing. That said, this...seems fine.

@@ +1878,5 @@
> +        // If we still have work, try to CAS [ from+1, to ].
> +        masm.move32(output, eax);
> +        masm.add32(Imm32(1), eax);
> +        masm.shll(Imm32(16), eax);
> +        masm.movzwl(bounds, edx);

I had to lookup movzwl, maybe a comment? "// Copies low 16 bits only"? Or maybe people should just look it up themselves :)

@@ +1891,5 @@
> +
> +    // Try to steal work.
> +    masm.bind(&stealWork);
> +
> +    // XXXshu: It's not technically correct to test whether work-stealing is

Can you remove "XXXshu" but leave the rest of this comment? Or at least remove "XXX"? So unsightly. :)

@@ +1896,5 @@
> +    // turned on only during stub-generation time, but it's a DEBUG only thing
> +    // so I'm not going to worry about it.
> +    uint32_t framePushedAfterSave = 0;
> +    if (cx->runtime()->threadPool.workStealing()) {
> +        // Save volatile regs as we might call rand() below.

Given that we are calling rand() anyway (which is not threadsafe, as previously noted, and might be overkill anyway) -- maybe we should just move the whole stealing path into a C helper? (As an aside, we should experiment with stealing a range of items, rather than just one at a time.)

::: js/src/jit/x64/Assembler-x64.h
@@ +133,5 @@
> +// or edx, which are needed for cmpxchg and div, respectively.
> +static MOZ_CONSTEXPR_VAR Register ForkJoinGetSliceReg_cx = rdi;
> +static MOZ_CONSTEXPR_VAR Register ForkJoinGetSliceReg_temp0 = rbx;
> +static MOZ_CONSTEXPR_VAR Register ForkJoinGetSliceReg_temp1 = rcx;
> +static MOZ_CONSTEXPR_VAR Register ForkJoinGetSliceReg_output = rsi;

Does this edx etc restriction apply to x64? Should it say rdx?
Attachment #8376644 - Flags: review?(nmatsakis) → review+
shu -- just nits except for the call to rand() and the need to use an atomic boolean (or something that permits cross-thread access) for isMainThreadActive
Shu -- just realized we can replace all the grungy shifting and adds to adjust `from` by just adding `0x10000` (and then no need to change format)
Something occurred to me while walking home. Don't we want to set "isMainThreadActive" to true *before* we signal the workers to start? I am concerned about a race in which the workers see the isMainThreadActive flag as false before the main thread has *even started*.
(In reply to Niko Matsakis [:nmatsakis] from comment #11)
> Something occurred to me while walking home. Don't we want to set
> "isMainThreadActive" to true *before* we signal the workers to start? I am
> concerned about a race in which the workers see the isMainThreadActive flag
> as false before the main thread has *even started*.

As discussed on IRC the isMainThreadActive flag doesn't seem to serve a purpose anymore after we ripped out the rendezvous code, and so should also be removed.
Depends on: 976848
v2 with rand() fixes and removal of isMainThreadActive
Attachment #8376563 - Attachment is obsolete: true
Attached patch Part 2: Inline ForkJoinGetSlice. (obsolete) — Splinter Review
v2 with micro opts and rand() fixes
Attachment #8376644 - Attachment is obsolete: true
Comment on attachment 8381855 [details] [diff] [review]
Part 1: Refactor how ThreadPoolWorkers are accessed for ease of inlining.

Review of attachment 8381855 [details] [diff] [review]:
-----------------------------------------------------------------

carrying r=nmatsakis
Attachment #8381855 - Flags: review+
Comment on attachment 8381856 [details] [diff] [review]
Part 2: Inline ForkJoinGetSlice.

Review of attachment 8381856 [details] [diff] [review]:
-----------------------------------------------------------------

carrying r=nmatsakis
Attachment #8381856 - Flags: review+
Comment on attachment 8381856 [details] [diff] [review]
Part 2: Inline ForkJoinGetSlice.

Review of attachment 8381856 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/src/jit/shared/CodeGenerator-x86-shared.cpp
@@ +1844,5 @@
> +        // If we still have work, try to CAS [ from+1, to ].
> +        masm.move32(bounds, edx);
> +        masm.add32(Imm32(0x10000), edx);
> +        masm.move32(bounds, eax);
> +        masm.atomic_cmpxchg32(edx, workerSliceBounds, eax);

One question: why do we use eax/edx specifically here? Is this required for some reason?

@@ +1882,5 @@
> +            masm.loadPtr(Address(StackPointer, framePushedAfterSave), cxReg);
> +
> +            masm.setupUnalignedABICall(1, temp);
> +            masm.passABIArg(cxReg);
> +            masm.callWithABI(JS_FUNC_TO_DATA_PTR(void *, ::math_random_uint32_no_outparam));

Given the concerns about leaking randomization information between compartments, perhaps we should use our own distinct randomization seed and function here, rather than using math_random()? In particular, I imagine we could use [xorshift][1] which is awfully fast. But then again, the one we're using today looks pretty dang simple too, and it's very nice to have user-accessible randomization be threadsafe!

(Of course, we could translate (in parallel code) calls to Math.random() into our own randomizer. But eeew.)

[1]: http://en.wikipedia.org/wiki/Xorshift
v3 with xorshift inline.
Attachment #8381856 - Attachment is obsolete: true
Comment on attachment 8382591 [details] [diff] [review]
Part 2: Inline ForkJoinGetSlice.

Review of attachment 8382591 [details] [diff] [review]:
-----------------------------------------------------------------

carrying r=nmatsakis
Attachment #8382591 - Flags: review+
https://hg.mozilla.org/mozilla-central/rev/023aed557989
https://hg.mozilla.org/mozilla-central/rev/b39e4dce0e09
Status: ASSIGNED → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla30
You need to log in before you can comment on or make changes to this bug.