Closed Bug 1138876 Opened 10 years ago Closed 24 days ago

Backtracking allocator: 30% performance degradation on parallel Mandelbrot benchmark

Categories

(Core :: JavaScript Engine: JIT, defect, P5)


RESOLVED INCOMPLETE

People

(Reporter: lth, Unassigned)

Details

Attachments

(3 files)

This needs further investigation, but it appears that the frame rate for the sab+atomics Mandelbrot demos is about 30% lower on current m-i tip than it was about two weeks ago, on two different systems (AMD FX4100 quad on Ubuntu 14.10 and a 4x2 Core i7 on Mac OS X 10.10). Of course this could be anything. On the sab+atomics side, the main thing that has happened recently is that the futex code was rewritten to use JS interrupts rather than special hooks in the DOM.
Confirmed with this setup:

- MBP "late 2013" 4x2 2.6GHz Core i7, 16GB RAM
- parlib-simple/demo/mandelbrot-animation2 (github.com/lars-t-hansen)
- parlib-simple on the "issue26" branch, which adds some assertions and a bug fix
- e10s disabled
- developer console not shown
- build with --enable-debug-symbols only

mozilla-inbound from 3 March:
  4 workers  27.3 fps
  6 workers  34.2 fps
  8 workers  36.1 fps

mozilla-inbound from 20 February (d9a929677d0a) with patch applied for blocking-on-workers-only:
  4 workers  27.2 fps
  6 workers  37.4 fps
  8 workers  45.6 fps

That changeset is the one immediately before the current futex code was landed. The patch I applied to that changeset will be attached to this bug.
I will attach two more patches, which respectively remove the new futex implementation from current m-i and apply the old futex implementation. With the old futex implementation on the new code, the slowdown is still there:

mozilla-inbound from 4 March, with the old futex implementation:
  4 workers  27.1 fps
  6 workers  34.1 fps
  8 workers  36.0 fps

Ergo the futex implementation is not to blame; something else has caused this slowdown.

* It could be a code generation issue: the slowdown only appears at higher utilization levels of this 4x2 system.
* It could be some sort of throttling / nicing of worker threads when there are more workers than cores (for example).
* It could be something to do with graphics, since this program copies a lot of data into a byte array for display.

(I will attempt to bisect.)
Initial regression window (m-c to m-i merge points):

Known bad:  230668:900075e013be  Feb 25
Known good: 230479:be9b4a3b01ab  Feb 24
Bisection implicates the backtracking register allocator:

changeset:   230540:acc238be19a5
user:        Brian Hackett <bhackett1024@gmail.com>
date:        Tue Feb 24 15:59:37 2015 -0600
summary:     Bug 826741 - Use the backtracking register allocator by default, r=jandem.
Re-enabling LSRA on tip brings the performance back up to 45 fps with 8 workers, so this is definitely an effect of the backtracking allocator. Whether it's spilling, awkward pressure on functional units, or something else I don't know (yet). One factoid about the kernel of this demo: translating it to asm.js did not improve its performance, because Ion already did a stellar job generating code for it; it's monotyped and does no allocation. So it could just be very sensitive to minor perturbations.
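For reference, the kernel is essentially a classic escape-time loop. A minimal sketch of that shape (names, the iteration budget, and the output format here are illustrative, not copied from the demo):

// Sketch of a monotyped, allocation-free Mandelbrot kernel of the kind
// described above; details are assumptions, not taken from parlib-simple.
function mandelSlice(mem, width, height, ybase, ylimit, xmin, xmax, ymin, ymax) {
  var MAXIT = 200;                                   // illustrative budget
  for (var Py = ybase; Py < ylimit; Py++) {
    var y0 = ymin + (Py / height) * (ymax - ymin);
    for (var Px = 0; Px < width; Px++) {
      var x0 = xmin + (Px / width) * (xmax - xmin);
      var x = 0, y = 0, it = 0;
      while (x*x + y*y < 4 && it < MAXIT) {
        var xt = x*x - y*y + x0;
        y = 2*x*y + y0;
        x = xt;
        it++;
      }
      // Everything above is int32 or double and nothing is allocated;
      // the iteration count goes straight into a (shared) Int32Array.
      mem[Py * width + Px] = it;
    }
  }
}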
Component: JavaScript Engine → JavaScript Engine: JIT
Summary: Apparent 30% performance degradation in parallel performance relative to ca mid-February → Backtracking allocator: 30% performance degradation on parallel Mandelbrot benchmark
On my 4-core AMD system, enabling or disabling the backtracking allocator has no effect on performance, just like the backtracking allocator had no negative effects on performance when running with only 4 workers on the 4-core (hyperthreaded) i7.
(I'll try to port the program to the JS shell, with luck it'll repro.)
Another observation: the simpler mandelbrot-animation program is not similarly sensitive to the register allocator (though it runs at the speed of the slow case for mandelbrot-animation2, 35 fps). The difference between the two programs is chiefly in how work is partitioned. The simpler program computes strips of the output (one strip per worker, assigned deterministically) before displaying it. The more complicated program computes subgrids of the output (quite a lot of them, 4 times the number of workers along each dimension, pulled from a queue as work is completed) and also overlaps the computation of the next image with the display of the previous one. The two programs probably have very different memory access and ownership patterns. (Not able to reproduce the phenomenon in the shell so far.)
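A rough sketch of the two partitioning strategies, to make the difference concrete (illustrative only; the real demos live in parlib-simple and the second one uses a shared-memory work queue rather than a plain array):

// mandelbrot-animation: static strips, one per worker, assigned up front.
function stripFor(workerId, numWorkers, height) {
  var rows = Math.ceil(height / numWorkers);
  return { ybase: workerId * rows,
           ylimit: Math.min((workerId + 1) * rows, height) };
}

// mandelbrot-animation2: many small subgrids (4 * numWorkers per dimension)
// pushed onto a queue; workers pull items as they finish, so the mapping of
// subgrid to worker is nondeterministic.
function makeSubgridQueue(numWorkers, width, height) {
  var divisions = 4 * numWorkers;
  var tileW = Math.ceil(width / divisions);
  var tileH = Math.ceil(height / divisions);
  var queue = [];
  for (var ty = 0; ty < divisions; ty++)
    for (var tx = 0; tx < divisions; tx++)
      queue.push({ xbase: tx * tileW, xlimit: Math.min((tx + 1) * tileW, width),
                   ybase: ty * tileH, ylimit: Math.min((ty + 1) * tileH, height) });
  return queue;
}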
FWIW, still running at the slower rate on current mozilla-inbound.
I just ran across this again in a similar program and found an interesting correlation: the number of parameters to the function may be leading to some poor register allocation choices. I have two variables, centerY and centerX, which were initially global. Their values are constant, and they are used in the function but are not hot (they are only read before the nested loops). When I add these variables to the function's parameter list and pass them in the call, performance drops by 30%. In this case the number of parameters increased from four to six. AMD64, Linux, FF 46.0a2. (I have not had the chance to dig into this further.)
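A sketch of the shape of the change (the kernel bodies are hypothetical; only the signature change and the variable names come from the observation above):

// Before: centerX/centerY are constant globals, read once before the loops.
var centerX = -0.5, centerY = 0.0;   // hypothetical values
function renderBefore(mem, width, height, magnification) {
  var xmin = centerX - 1.5 / magnification;
  var ymin = centerY - 1.0 / magnification;
  // ... nested loops over the pixels, not touching centerX/centerY ...
}

// After: the same values arrive as parameters 5 and 6; performance dropped
// about 30% even though they are still only read before the nested loops.
function renderAfter(mem, width, height, magnification, centerX, centerY) {
  var xmin = centerX - 1.5 / magnification;
  var ymin = centerY - 1.0 / magnification;
  // ... identical nested loops ...
}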
I should mention that in the latter case, there's no shared memory or atomics - it's standard JS.
Needs benchmark update and retest.
Priority: -- → P5
The working benchmark code is https://github.com/lars-t-hansen/parlib-simple, in demo/mandelbrot-animation2/mandelbrot.html. Pass the number of workers as a URL parameter, ?workers=n; the default is 4. I had hoped the fix for bug 1205073 might have fixed this too, but it has not. If anything, this program is slower than before (locally built Nightly, JSGC_DISABLE_POISONING=1); I get 34 fps with 8 workers.
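For reference, a minimal sketch of how a page can pick up that parameter (the demo's actual parsing code may differ):

// Read ?workers=n from the page URL, defaulting to 4 workers.
var params = new URLSearchParams(window.location.search);
var numWorkers = parseInt(params.get("workers"), 10) || 4;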
Assignee: lhansen → nobody
Severity: normal → S3
Status: NEW → RESOLVED
Closed: 24 days ago
Resolution: --- → INCOMPLETE
