Closed Bug 966567 Opened 9 years ago Closed 8 years ago

Bailout of parallel execution due to unsupported operation, script invalidation

Categories

(Core :: JavaScript Engine, defect)

Version: 29 Branch
Platform: x86_64 Windows 7
Priority: Not set
Severity: normal

Tracking


RESOLVED WONTFIX

People

(Reporter: jaswanth.sreeram, Assigned: nmatsakis)

References

Details

Attachments

(6 files)

Attached file SimPJS-test.html
User Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:29.0) Gecko/20100101 Firefox/29.0 (Beta/Release)
Build ID: 20140131095418

Steps to reproduce:

Run the attached JS script in the SpiderMonkey shell (29.0a1) or load up the attached HTML file in Nightly 29.0a1 (2014-01-31).


Actual results:

In Nightly:
Time taken for 5 iterations :: [Parallel took 685 ms], [Sequential took 618 ms]

In addition, the console reports that an unsupported operation was encountered at line 238. There doesn't seem to be one on that line (Math.abs is marked as SAFE_OP, AFAICT).

When run in the shell, I see several interesting things that might be useful to look at. The log of the shell execution is attached.

1) Several "invalidation bailouts" in both the sequential and parallel versions. I can't pinpoint why. Curiously, I don't see the "unsupported operation" bailout in the shell run.
2) Several slices that are warmed up more than once consecutively.
3) I see a message suggesting that the warmup execution finished all the work - this is with 2880 vertices in the original array.
4) From the output it appears that some scripts are compiled very frequently. I don't know whether this is actually true or an artifact of the logger/spewer.

Expected results:

Parallel execution should run faster than sequential on my 4-core machine.
Attached file SimPJS-test.js
Output from shell run with PAFLAGS=full and IONFLAGS=bailouts,scripts
(In reply to Jaswanth Sreeram from comment #0)
> load up the attached HTML file in Nightly 29.0a1 (2014-01-31).
> Expected results:
> Parallel execution should run faster than sequential on my 4-core machine.
Confirmed in 29.0a1 (2014-02-02), win 7 x64
Status: UNCONFIRMED → NEW
Component: General → JavaScript Engine
Ever confirmed: true
This test case is taken from this program: http://powerbuilder.sh.intel.com/public/share/deform/
1. It took me longer to debug this than it should have, because the bailout debug info for some reason points the user at this assignment in the for loop (line 234):

  signal = Math.abs( (1.0 - 2.0 * PJS_gradient_noise3d(PJS_mul4(position, frequency))) );

but the actual problem is some lines down (which I'll explain in a moment).

2. Running in the shell with either PAFLAGS=compile or, if you prefer more output, PAFLAGS=full, one sees this in the output:

  [Parallel:M]   Pow: Unsafe (/Users/fklock/Dev/Bugz/bugz966567/SimPJS-test.js:238)

This tells us that a |Pow| operation on line 238 failed to compile since that operation is marked as Unsafe in ParallelSafetyAnalysis.  (And indeed, |Pow| is marked as an Unsafe op there.)

So that explains *why* we're bailing out here.  I'll look into whether we can make a parallel-safe version of Math.pow to use here.
Jaswanth: Am I crazy, or isn't this code always invoking |signal * Math.pow(x, y)| with |y == 0|, which seems easily replaced with just |signal| ?
(Note: even if one removes the call to Math.pow, there are still other causes of bailouts that need to be investigated after that.)
Yes, I figured that the Pow() was the unsupported op, I changed it to SAFE_OP in ParallelSafety.cpp and that bailout went away. I don't know if it actually is safe.

And yes, it appears that |fi| is always 0 even in the original program, so it's not a mistake I made while extracting the kernel. Replacing it with |signal| is therefore likely correct.
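For reference, the simplification is sound by the language semantics: Math.pow(x, 0) is 1 for every x (even NaN, per the ES spec), so |signal * Math.pow(x, 0)| is just |signal|. A minimal sketch with made-up values:

```javascript
// Math.pow(x, 0) returns 1 for every x, NaN included (per the ES spec),
// so multiplying by it is an identity operation on signal.
function withPow(signal, x) { return signal * Math.pow(x, 0); }
function simplified(signal) { return signal; }

console.log(withPow(3.5, 123.4) === simplified(3.5)); // true
console.log(withPow(-0.25, NaN) === simplified(-0.25)); // true
```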
Depends on: 968390
I added a new bug for Math.pow specifically
Some new info of interest: After marking Math.pow as safe, I then changed PJS_div4 to this:

function PJS_div4(v, s)
{
	var a = v[0]/s;
	var b = v[1]/s;
	var c = v[2]/s;
	var d = Number(v[3]/s);
	return [ a, b, c, d ];
}

and it seems to resolve all (!) the bailouts.  Even the GC bailouts go away, which struck me as a bit odd (perhaps it has something to do with how the arrays are represented when they have a mix of integers and doubles within? I do not know.)

% ../objdir-opt-js/js/src/shell/js ../bug966567.js
Time taken for 5 iterations :: [Parallel took 1059 ms],  [Sequential took 1201 ms]

(Without the |Number| invocation, you get a bunch of warnings of the form:
../bug966567.js:323:5 warning: Bailed out of parallel operation: unsupported at ../bug966567.js:76
../bug966567.js:323:5 warning: Bailed out of parallel operation: unsupported at ../bug966567.js:74
and so on, jumping between :74 and :76 in an unpredictable fashion...)


And a further observation: it actually does not matter which of the four calculations is wrapped with the call to |Number(..)|; it can be any of a,b,c, or d.
This could be related to the code that attempts to optimize arrays of mixed ints/doubles into just doubles.
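To illustrate the mixed-representation point (a sketch with made-up inputs; |div4| paraphrases PJS_div4): the same division expression sometimes yields an int-valued result and sometimes a fractional one, so the returned array mixes both kinds of values.

```javascript
// The same division can produce int-valued (4/2 -> 2) and fractional
// (5/2 -> 2.5) results, so [a, b, c, d] mixes both kinds of values;
// the engine may then try to convert the array to an all-double form.
function div4(v, s) { return [v[0] / s, v[1] / s, v[2] / s, v[3] / s]; }
var r = div4([4, 5, 6, 8], 2);
console.log(r); // [2, 2.5, 3, 4]
```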
Update: While adding the call to |Number(..)| that I noted in comment 10, I was searching the debug output from PAFLAGS=full and looking for bailouts.

I saw no bailouts and said "that's interesting" (which it is), but I neglected to check whether the remainder of the compilation was completing successfully and running in parallel.

On further investigation, it looks like the call to |Number| is tagged as an unsafe operation in the ParallelSafetyAnalysis, which means the code is then not parallelized.

(I will see if there is some other way to get a number out in the short term besides invoking |Number|... maybe adding 0.0 will work.)
(of course the fact that we fail to compile the code to parallel form because of |Number| being unsafe would mask any bailouts from the parallel machine code.)
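For what it's worth, the |Number(..)| wrapper and the proposed |+ 0.0| alternative are both value-preserving on numbers; any effect they have here is purely on which internal operations the compiler sees. A minimal sketch:

```javascript
// Neither form changes the numeric value; Number(..) is a call while
// + 0.0 is a plain double addition, which is all the JIT sees differently.
var q = 4 / 2; // an int-valued division result
console.log(Number(q) === q); // true
console.log((q + 0.0) === q); // true
```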
Assignee: nobody → nmatsakis
Depends on: 977853
Shu and I identified the problem with divs. I created bug 977853 to describe the problem and its solution.
Note: with that fix, I still observe Zone GC bailouts. Have to investigate those next.
Re GC zone bailouts:

If the MOZ_GCTIMER log is to be believed, the program allocates at least 90MB*20 = 1.8GB, so GC is to be expected.

Part of that volume will be array data for the quads.  Based on instrumentation, the program allocates 18.5M quads or so (a few of these are length-2 but most are length-4).  Figure 40 bytes in the very best case for each; that's 740MB.  But in fact, a simple program that allocates 18.5M length-4 arrays allocates 1.7GB, so it's reasonable to assume that array allocation accounts for virtually all allocation in this test.
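The arithmetic behind the best-case estimate, as a quick sanity check:

```javascript
// Best-case estimate: ~18.5M quad arrays at 40 bytes apiece.
var quads = 18.5e6;
var bytesPerQuad = 40;
var totalMB = quads * bytesPerQuad / 1e6;
console.log(totalMB + " MB"); // "740 MB"
```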
I suspect that getting good parallel speedup on a program like this will require a combination of reduced allocation volume and parallel-friendly generational collection.

Allocation volume can in principle be reduced by specializing length-k Arrays for low k (with the justification that they are important to graphics); by making TypedObject objects more efficient than Arrays and suggesting that users move in that direction; or by providing good SIMD data types (for this special case) and suggesting that users move in that direction.  None of these are appealing, but the latter two are at least credible and might open up opportunities for the JIT.
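As an illustration of the reduced-allocation direction (a sketch only, not the TypedObject or SIMD APIs mentioned above): storing all quads in one flat typed array replaces millions of small Array allocations with a single buffer.

```javascript
// Store n quads contiguously in one Float64Array instead of n small Arrays.
function makeQuads(n) {
  var buf = new Float64Array(n * 4); // one allocation covers all n quads
  return {
    set: function (i, x, y, z, w) {
      var o = i * 4;
      buf[o] = x; buf[o + 1] = y; buf[o + 2] = z; buf[o + 3] = w;
    },
    get: function (i, j) { return buf[i * 4 + j]; }
  };
}

var quadStore = makeQuads(2);
quadStore.set(1, 1.5, 2.5, 3.5, 4.5);
console.log(quadStore.get(1, 2)); // 3.5
```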
Depends on: 980386
Attached file TypedObjectsPort.js
A port of SimPJS-test.js to use a typed object result, which should allow us to use more efficient garbage collection since result is known to be scalar.
Update: we are finding that GC overhead is a real blocker for performance. I have a preliminary patch that clears ALL OBJECTS from the parallel arenas, without doing any marking or scanning. This can safely be run in between parallel iterations, so long as we know that none of those objects can have escaped -- i.e., so long as the result type is known to be scalar.

On my machine, with this patch, we get a 2x perf improvement on 2 cores. The patch is not 100% polished but I will attach it anyhow. (To get that measurement, I hacked the benchmark to return null instead of the actual result, since otherwise the optimization would be invalid.)

One hitch is that, in the benchmark as currently written, it generates a non-scalar result. Hence I did a (very minimal) rewrite of the benchmark to generate a TypedObject result of known type. This works, but unfortunately it does not currently run in parallel because we have to make a par version of fromPar. That shouldn't be too hard, though.
Preliminary patch to finalize and remove all objects from arena lists during parallel execution. The idea is that we would only invoke this function on an arena if we can guarantee all GCThings contained within are dead.

Putting f? for Terrence -- Terrence, please ping me on IRC to discuss general background and if you need convincing that a patch like this makes sense. I'm probably missing something in this patch, it was my first stab at the problem, hence I'm only putting f? not r?. :) (Note that in the patch as is, we ALWAYS invoke the clear function, which is clearly unsafe, but that's easy to fix.)
Attachment #8390533 - Flags: feedback?(terrence)
Comment on attachment 8390533 [details] [diff] [review]
Bug966567-WIP.diff

Review of attachment 8390533 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/src/jsgc.cpp
@@ +562,5 @@
> +            if (releaseArenas) {
> +                aheader->chunk()->releaseArena(aheader);
> +            } else {
> +                aheader->getArena()->setAsFullyUnused(thingKind);
> +                dest.insert(aheader);

Instead of open-coding this, could you add a method |static void Chunk::recycleArena(ArenaHeader *aheader, ArenaList &dest, AllocKind thingKind)| right below releaseArena that does this.

@@ +569,2 @@
>              dest.insert(aheader);
> +        }

This is pretty hard to read as is. Could we perhaps rewrite it as:

if (!allClear)
    dest.insert(aheader);
else if (InParallelSection())
    Chunk::recycleArena(aheader, dest, thingKind);
else
    aheader->chunk()->releaseArena(aheader);

@@ +1509,5 @@
>      return allocateFromArenaInline(zone, thingKind);
>  }
>  
>  void
> +ArenaLists::wipeDuringParallelExecution(JSRuntime *rt)

It's a shame that this has to take a runtime. I guess it's just for the FreeOp?

@@ +1527,5 @@
> +    for (unsigned i = 0; i < FINALIZE_LAST; i++) {
> +        AllocKind thingKind = AllocKind(i);
> +        if (!IsBackgroundFinalized(thingKind) && arenaLists[thingKind].head)
> +            return;
> +    }

Escape analysis would be another option here. Shame that isn't ready yet.

::: js/src/jsgc.h
@@ +514,5 @@
>       */
>      void purge() {
>          for (size_t i = 0; i != FINALIZE_LIMIT; ++i) {
> +            purge(AllocKind(i));
> +        }

No {} around single line.
Attachment #8390533 - Flags: feedback?(terrence) → feedback+
Blocks: 983477
No longer blocks: 983477
Depends on: 983477
Depends on: 983486
Depends on: 983686
Depends on: 983977
Depends on: 983986
Depends on: 983987
I just added two more dependencies. On my laptop, with those bugs patched in a somewhat hacky way, I see the following performance: 

Mr-Bennet. dist/bin/js --ion-parallel-compile=off ~/tmp/bug966567.js
Time taken for 50 iterations :: [Parallel took 2833 ms],  [Sequential took 8427 ms]

I also see full CPU utilization during the parallel section. My laptop has 2 hyperthreaded cores.
Depends on: 986015
Depends on: 987912
See Also: → 996842
Blocks: PJS
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → WONTFIX