Bug 981547 - Work required to remove overhead from mapPar
Opened 9 years ago
Closed 8 years ago
Categories: Core :: JavaScript Engine, defect
Status: RESOLVED WONTFIX
People: Reporter: lth; Unassigned
Attachments: 1 file (patch, 10.67 KB)
Description

(Follow-on work to bug 972581; possibly this gets split up into several work items by and by.)

At present, the inner loop of TypedObjectArray.mapPar has significant overhead: it contains run-time tests for various constant conditions (e.g., simple type vs complex type), it may compute arguments that are not needed by the kernel function (input index, output offset, output cursor), and it checks conditions whose outcome may be apparent from the kernel itself (the kernel's result type). It will be necessary to get rid of this overhead if parallel computation is going to show a worthwhile speedup over serial computation on low-core-count systems.

Niko's summary on IRC, edited: mapPar is (modulo bugs) callsite cloned; this basically means we make a copy of the function per call site so that the type information is not polluted, and constant propagation improves. For example, simple vs complex type is totally predictable based on TI, and I think most of the choices are. We'll have to inline the accessor, though. So what we have to change to make that happen is: make TypeDescrIsSimpleType an intrinsic and inline it, or maybe just constant fold the DESCR_KIND() access, or both. Then I think we will probably have to move the computation of inGrainTypeIsComplex into the kernel so that Ion can see it better (sometimes Ion is conservative around upvars). That gets rid of most of the ifs, except the one on the return value; for that I was hoping that we'd inline the call to the kernel. Actually, TI should help here too, because (I think) if the value is always undefined (or never undefined) we'll predict that and eliminate the `if (r !== undefined)` check. (End of summary.)
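Roughly what that overhead looks like, as a toy sketch in plain JS rather than the actual self-hosted mapPar code (the helper shapes, argument names, and the kernel protocol here are invented for illustration):

// Toy illustration only -- not the real self-hosted code.  The generic loop
// re-tests a loop-invariant condition and builds arguments the kernel may
// never use; the specialized loop is what TI/Ion should effectively produce
// for the simple-type case once the checks constant-fold away.
function mapGeneric(input, grainIsComplex, kernel) {
  var out = new Array(input.length);
  for (var i = 0; i < input.length; i++) {
    // Checked on every iteration even though it never changes:
    var element = grainIsComplex ? { value: input[i] } : input[i];
    var outCursor = { array: out, offset: i };   // may be unused by the kernel
    var r = kernel(element, i, outCursor);
    if (r !== undefined)                         // predictable from TI if the kernel is inlined
      out[i] = r;
  }
  return out;
}

function mapSimpleSpecialized(input, kernel) {
  var out = new Array(input.length);
  for (var i = 0; i < input.length; i++)
    out[i] = kernel(input[i], i);
  return out;
}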
Comment 1•9 years ago
I don't think I set the flags properly to get the callsite cloning we need. In particular, helper fns generally need to be tagged as callsite cloned, unless they are declared inside another callsite cloned fn.
Reporter
Comment 2•9 years ago
Some early investigations:

- My benchmark is once again the cat convolver, mapping an array of indices to implement the inner loop of a matrix algorithm where the matrix is stored in a 1D array. There are four cases:
  - Sequential map, index array is Array: 750ms
  - Sequential map, index array is TypedObject: 3194ms
  - Parallel map, index array is Array: 3360ms
  - Parallel map, index array is TypedObject: 3714ms
  One run, quad-core MacBook Pro, 16GB RAM; I'll attach sources etc. later. The slowdown for the sequential map has been filed as bug 983577. The rest of these notes pertain to the difference between the TypedObject cases, but they should be seen as preliminary, since there appear to be other effects - the sequential slowdown - that are really the most significant. (A rough sketch of the benchmark's shape follows this comment.)

- Specializing the mapThread function within the parallel map implementation as mapThreadSimple(), for the case when both types are simple, does not make much of a difference even though the body of the latter is very straightforward. This could be because the time goes elsewhere after all, or because the slice manipulation is more costly than anticipated or has some sort of nasty side effect on performance. (Or because the change is not observable given the slowness of typed arrays.)

- Specializing for the number of arguments of the mapped function also results in a small speedup.
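A hedged reconstruction of the benchmark's shape (not the actual cat-convolve source; the dimensions and the 3x3 box filter are assumptions, with the 640-element rows taken from comment 8):

// Hypothetical sketch of the cat-convolver's structure: the matrix is stored
// in a flat 1D array, and map/mapPar runs over an array of row indices so
// each kernel call produces one output row.
var width = 640, height = 480;
var src = new Float64Array(width * height);       // the real benchmark also has TypedObject variants

function convolveRow(y) {
  var row = new Float64Array(width);
  for (var x = 1; x < width - 1; x++) {
    var sum = 0;
    for (var dy = -1; dy <= 1; dy++)
      for (var dx = -1; dx <= 1; dx++)
        sum += src[(y + dy) * width + (x + dx)];  // 1D indexing into the matrix
    row[x] = sum / 9;
  }
  return row;
}

var indices = [];
for (var y = 1; y < height - 1; y++) indices.push(y);

// Sequential case; the parallel case is indices.mapPar(convolveRow) on a
// build with parallel arrays enabled.
var rows = indices.map(convolveRow);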
Reporter
Comment 3•9 years ago
Re comment #2: The slowness of the TypedObject sequential map could largely (though not entirely) be dealt with by specializing the inner loop for the simple-typed case. The corresponding fixes for the parallel case had, as observed earlier, little effect. Nor did some hacks to avoid initializing the variables for the complex case, when we're going to take the simple branch, have much effect. The performance loss is elsewhere.
Reporter
Comment 4•9 years ago
This is a quad-core Intel-based machine with 2-way hyperthreading. Setting --thread-count=3 at the shell reduces the running time of the benchmark by half, which suggests a problem having to do with worker coordination. The default thread count is supposed to be 7, though --thread-count=7 yields a faster run than whatever the default truly is. Three seems to be the sweet spot.
Reporter
Comment 5•9 years ago
I tried hacking on a few aspects of the work scheduling algorithm - the slice size on the self-hosted side, and the work-stealing in the thread pool (stealing half the available work from the victim instead of just one slice) - but none of this moved the needle in any meaningful way.
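A conceptual sketch of the "steal half" idea, written with today's JS shared-memory Atomics rather than the engine's C++ ThreadPool code (the packed-bounds representation and all names here are assumptions):

// Conceptual only: each worker's remaining slice range [lo, hi) is packed
// into one 32-bit cell of a shared Int32Array so a single compareExchange
// can atomically move the boundary.  A thief takes the upper half of the
// victim's remaining range (or the last slice, when only one remains).
function stealHalf(bounds, victim) {
  for (;;) {
    var old = Atomics.load(bounds, victim);
    var lo = old >>> 16, hi = old & 0xffff;
    if (lo >= hi)
      return null;                                // victim has no work left
    var mid = lo + ((hi - lo) >> 1);              // thief takes [mid, hi)
    if (Atomics.compareExchange(bounds, victim, old, (lo << 16) | mid) === old)
      return { from: mid, to: hi };
  }
}

// bounds = new Int32Array(new SharedArrayBuffer(4 * workerCount));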
Reporter
Comment 6•9 years ago
The other half of the CAS overhead is that there's a CAS for every call to getSlice() from a worker. For a balanced algorithm this overhead is mostly wasted; for an algorithm with a relatively fine grain it can be substantial. Some research I did last winter informally suggested costs for *uncontended* CAS in the 200-500 clock range. It's not hard to imagine realistic mapped functions whose cost falls below that, and certainly ones that fall below 10x that. It seems to me that each worker could be a little more aggressive about biting off work for itself in an effort to keep the CAS cost under control. Of course, if that leaves workers idle because it reduces the efficacy of work stealing then that's bad, but it may be a worthwhile experiment anyway.
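The "bite off more work per CAS" idea, sketched conceptually with today's JS Atomics (again, not the engine's C++; the counter layout and names are assumptions):

// Conceptual only: a countdown of remaining slices lives in pending[0].
// Instead of one CAS per slice, a worker claims up to `batch` slices with a
// single compareExchange, then processes the batch with no further atomics.
function grabSlices(pending, batch) {
  for (;;) {
    var remaining = Atomics.load(pending, 0);
    if (remaining === 0)
      return null;                                // no work left
    var n = Math.min(batch, remaining);
    if (Atomics.compareExchange(pending, 0, remaining, remaining - n) === remaining)
      return { first: remaining - n, count: n };  // slice ids [first, first + n)
  }
}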
Reporter
Comment 7•9 years ago
Re comment #6, an experiment that grabs slices aggressively and caches them locally in the worker to reduce CAS cost shows that the needle moved perhaps a tiny bit, but not more than that.
Reporter
Comment 8•9 years ago
"Instruments" suggests that a lot of time is spent waiting for condition variables, which is somewhat credible. The program contains 60,000 parallel sections, all fairly short (640-element arrays). So I tried a little busy-waiting before blocking, but to no avail. (It's probably time to go after this in earnest with a profiler and an optimized build with symbols.)
Comment 9•9 years ago
I believe that the culprit here is more likely condvars than CAS contention. For your CAS experiment though, did you change just the C++ function? We inline that, so what you want to change is the asm here: http://dxr.mozilla.org/mozilla-central/source/js/src/jit/shared/CodeGenerator-x86-shared.cpp#1876
Reporter
Comment 10•9 years ago
(In reply to Shu-yu Guo [:shu] from comment #9) I missed that. Thanks. I also think the condition variables are more likely the problem, given how short the parallel sections are and how many there are of them. I think my next approach (beyond trying to get some numbers with a proper profiler) will be to mangle the benchmark to increase the grain of concurrency to see what that does.
Reporter
Comment 11•9 years ago
So with a larger work increment size (20-40x what I had before) and a carefully selected number of workers, the parallel implementation is running as fast as the sequential implementation, on a quad-core Intel, dropping from 1500ms to about 900ms. All that really proves - taking into account that there isn't anything interesting going on between parallel sections - is that there is quite a bit of per-operation overhead in the parallel engine. I'm seeing some periodic bailouts due to bitwise operations and other arithmetic.
Reporter
Comment 12•9 years ago
For now, benchmarks are here: https://github.com/lars-t-hansen/moz-sandbox (look in "convolve"; the "larger work increment" program is cat-convolve-mapPar-wide.js).
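A hedged reconstruction of the "larger work increment" idea (not the actual contents of cat-convolve-mapPar-wide.js): each map invocation processes a whole strip of rows instead of one, so the per-invocation work dwarfs the parallel engine's per-slice and start/stop overhead.

// Builds on the earlier convolveRow sketch; rowsPerStrip is an assumed value.
var rowsPerStrip = 32;

function convolveStrip(y0) {
  var out = [];
  var limit = Math.min(y0 + rowsPerStrip, height - 1);
  for (var y = Math.max(1, y0); y < limit; y++)
    out.push(convolveRow(y));                     // one full 640-element row per iteration
  return out;
}

var strips = [];
for (var y0 = 0; y0 < height; y0 += rowsPerStrip) strips.push(y0);

// strips.mapPar(convolveStrip) on a parallel build; strips.map(convolveStrip) otherwise.
var wideResult = strips.map(convolveStrip);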
Reporter
Comment 13•9 years ago
So far we've found a few things; this goes beyond mapPar, really:

- On several hyperthreaded machines, on several benchmarks, we've found that setting --thread-count to the number of cores minus one significantly boosts performance over the default, which is the number of hardware threads minus one. On a 4x2 MacBook Pro, setting --thread-count=3 doubled the speed of the cat convolver. On an 8x2 Xeon, setting --thread-count=7 (I think - Niko did this) improved the speed of the deform benchmark sixfold. (In the latter case it also improved the serial performance.)

- There is some slightly hazy evidence that the main thread does a disproportionate amount of the work, suggesting that there's quite a spin-up lag after the condition variable is notified. (The notifyAll will basically require all threads to wake up and then contend for the same lock, thus serializing all the workers at the start of the run.)

- The pendingSlices_ counter in the ThreadPool is hotly contended with atomic ops; everyone decrements it when grabbing a slice. The performance impact is unclear.

- The sliceBounds_ variables in the ThreadPoolWorkers are hit heavily with atomic ops, and they may become contended when there's significant work stealing going on. The performance impact is unclear.
Reporter
Comment 14•9 years ago
This patch represents a simple experiment: in-place 'mapTo' and 'mapParTo' methods that allow us to factor out allocation and GC overhead. Mapping across a single row at a time (640 elements), this reduces both sequential and parallel running times by 250-300ms. For the sequential code that's a big win (over 30%) but for the parallel code it is not (10-15%). The parallel code is now barely better than 1/3 the performance of the equivalent sequential code. The interesting result is that factoring out allocation/GC does not create a dramatic difference for the parallel code. Time is lost elsewhere. (4x2 MacBook Pro, --thread-count=4, benchmarks are cat-convolve-map{,Par}To.js at github/lars-t-hansen/moz-sandbox/convolve)
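The idea behind the patch, in a minimal sketch (the real mapTo/mapParTo signatures are in the attached patch; this just illustrates mapping into a preallocated destination so the per-call allocation and its GC pressure disappear):

// Minimal sketch, not the patch itself: write results into a caller-supplied
// buffer instead of allocating a fresh output array on every call.
function mapInto(input, output, kernel) {
  for (var i = 0; i < input.length; i++)
    output[i] = kernel(input[i], i);
  return output;
}

// One 640-element buffer is reused across the many short parallel sections
// instead of being reallocated each time.
var scratch = new Float64Array(640);
var row = new Float64Array(640);
mapInto(row, scratch, function (x, i) { return x + i; });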
Reporter
Comment 15•9 years ago
Re comment #14, the busy-wait work in bug 985887 has visible impact here, since the parallel sections are short and there is a lot of start/stop overhead. Apple's Instruments shows about 40% of the time on worker threads being spent waiting for condition variables without busy-waiting; this could be misleading because the entire run was probably taken into account (though it was a long run). (Instruments also suggests a lot of time is spent in _pthread_mutex_check_init; this could be misleading, or it could be a symptom of creating locks at a high rate or creating locks that have error checking enabled.)
Reporter
Comment 16•9 years ago
(In reply to Lars T Hansen [:lth] from comment #15)
> (Instruments also suggests a lot of time is spent in _pthread_mutex_check_init; this could be misleading or it could be a symptom of creating locks at a high rate or creating locks that have error checking enabled.)

Locks are created with DEFAULT attributes, which is probably OK (PosixNSPR.cpp). One lock and one condition variable are created per Monitor instance, and there's one Monitor per parallel job. Lock and condvar creation and destruction don't seem very expensive (looking at Apple's source, at least).
Reporter
Comment 17•9 years ago
See bug 983864 for another benchmark, where the problem is that mapPar is not implemented for Sized TypedObject arrays.
Reporter
Updated•9 years ago
Reporter
Updated•8 years ago
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → WONTFIX