Closed
Bug 1138876
Opened 10 years ago
Closed 24 days ago
Backtracking allocator: 30% performance degradation on parallel Mandelbrot benchmark
Categories: Core :: JavaScript Engine: JIT, defect, P5
Status: RESOLVED INCOMPLETE
People: Reporter: lth, Unassigned
Attachments (3 files):
- patch, 33.09 KB
- patch, 32.53 KB
- patch, 33.40 KB
This needs further investigation, but it appears that the frame rate for the sab+atomics (SharedArrayBuffer plus Atomics) mandelbrot demos is about 30% lower on current mozilla-inbound (m-i) tip than it was about two weeks ago, on two different systems (AMD FX4100 quad on Ubuntu 14.10 and Core i7 4x2 on Mac OS X 10.10).
Of course this could be anything. On the sab+atomics side, the main thing that has happened is that the futex code was rewritten to use JS interrupts rather than special hooks in the DOM.
Comment 1 • 10 years ago (Reporter)
Confirmed with this setup:
- MBP "late 2013" 4x2 2.6GHz Core i7, 16GB RAM
- parlib-simple/demo/mandelbrot-animation2 (github.com/lars-t-hansen)
- parlib-simple on the "issue26" branch, which adds some assertions and a bug fix
- e10s disabled
- developer console not shown
- build with --enable-debug-symbols only
mozilla-inbound from 3 March:
4 workers 27.3fps
6 workers 34.2fps
8 workers 36.1fps
mozilla-inbound from 20 February (d9a929677d0a) with patch applied for blocking-on-workers-only:
4 workers 27.2fps
6 workers 37.4fps
8 workers 45.6fps
That changeset is the one immediately before the current futex code landed. The patch I applied to that changeset will be attached to this bug.
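As a small worked check of the numbers above, the relative change in frame rate between the two builds can be computed per worker count. This is only an arithmetic sketch: the frame rates are copied from this comment, and the function name `relativeChange` is made up for illustration.

```javascript
// Frame rates (fps) copied from the measurements in this comment.
const feb20 = { 4: 27.2, 6: 37.4, 8: 45.6 }; // mozilla-inbound, 20 February (d9a929677d0a)
const mar3 = { 4: 27.3, 6: 34.2, 8: 36.1 };  // mozilla-inbound, 3 March

// Fractional change of the newer frame rate relative to the older one.
// Negative values mean the newer build is slower.
function relativeChange(oldFps, newFps) {
  return (newFps - oldFps) / oldFps;
}

for (const workers of [4, 6, 8]) {
  const change = relativeChange(feb20[workers], mar3[workers]);
  console.log(`${workers} workers: ${(change * 100).toFixed(1)}%`);
}
```

With these figures the 4-worker case is essentially unchanged, while the 8-worker case drops by roughly 21% relative to the older build.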
Comment 2 • 10 years ago (Reporter)
Comment 3 • 10 years ago (Reporter)
I will attach two more patches, which respectively remove the new futex implementation from current m-i and apply the old futex implementation.
With the old futex implementation on the new code, the slowdown is still there:
mozilla-inbound from 4 March, with old futex implementation:
4 workers 27.1fps
6 workers 34.1fps
8 workers 36.0fps
Ergo the futex implementation is not to blame; something else has caused this slowdown.
* It could be a code generation issue: the slowdown only appears at higher utilization levels of this 4x2 system.
* It could be some sort of throttling / nicing of worker threads when there are more workers than cores (for example).
* It could be something to do with graphics, since this program copies a lot of data into a byte array for display.
(I will attempt to bisect.)
Comment 4 • 10 years ago (Reporter)
Comment 5 • 10 years ago (Reporter)
Comment 6 • 10 years ago (Reporter)
Initial regression window (m-c to m-i merge points):
Known bad: 230668:900075e013be Feb 25
Known good: 230479:be9b4a3b01ab Feb 24
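Narrowing a window like this is a binary search over the revisions between the known-good and known-bad points. The sketch below illustrates the idea only. It assumes a linear range of revision numbers, which real Mercurial history is not, and `isBad` is a hypothetical stand-in for "build this revision and run the benchmark". The thread does not say which tool was used; in practice something like `hg bisect` would handle the real history.

```javascript
// Hypothetical helper: return the first bad revision in the range (good, bad].
// Precondition: `good` is known good and `bad` is known bad.
function bisect(good, bad, isBad) {
  while (bad - good > 1) {
    const mid = good + Math.floor((bad - good) / 2);
    if (isBad(mid)) {
      bad = mid;   // the regression is at or before mid
    } else {
      good = mid;  // the regression is after mid
    }
  }
  return bad;
}

// Simulated predicate for illustration: pretend every revision from
// 230540 onward shows the slowdown.
const firstBad = bisect(230479, 230668, (rev) => rev >= 230540);
console.log(firstBad); // logs 230540
```

The window above spans 189 revision numbers, so a search of this shape needs about eight benchmark runs.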
Comment 7 • 10 years ago (Reporter)
Bisection implicates the backtracking register allocator:
changeset: 230540:acc238be19a5
user: Brian Hackett <bhackett1024@gmail.com>
date: Tue Feb 24 15:59:37 2015 -0600
summary: Bug 826741 - Use the backtracking register allocator by default, r=jandem.
Comment 8 • 10 years ago (Reporter)
Re-enabling the LSRA on tip brings the performance back up to 45fps with 8 workers, so this is definitely some effect of the backtracking allocator.
Whether it's spilling or some awkward pressure on functional units or something else I don't know (yet).
One factoid about the kernel of this demo is that translating it to asm.js did not improve its performance: Ion already did a stellar job generating code for it, as the kernel is monotyped and does no allocation. So it could just be very sensitive to minor perturbations.
Component: JavaScript Engine → JavaScript Engine: JIT
Summary: Apparent 30% performance degradation in parallel performance relative to ca mid-February → Backtracking allocator: 30% performance degradation on parallel Mandelbrot benchmark
Comment 9 • 10 years ago (Reporter)
On my 4-core AMD system, enabling or disabling the backtracking allocator has no effect on performance, just as the backtracking allocator had no negative effect when running with only 4 workers on the 4-core (hyperthreaded) i7.
Comment 10 • 10 years ago (Reporter)
(I'll try to port the program to the JS shell; with luck it'll repro.)
Comment 11 • 10 years ago (Reporter)
Another observation: the simpler mandelbrot-animation program is not similarly sensitive to the register allocator (but it runs at the speed of the slow case for mandelbrot-animation2, 35fps).
The difference between the two programs is chiefly in how work is divided. The simpler program computes strips of the output (one strip per worker in a deterministic assignment) before displaying it. The more complicated program computes subgrids of the output (quite a lot of subgrids, 4 times the number of workers along each dimension, chosen from a queue as work is completed) and also overlaps the computation of the next image with the display of the previous image.
The two programs probably have very different memory access and ownership patterns.
(Not able to repro the phenomenon in the shell, thus far.)
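The two work-distribution schemes described in this comment can be sketched as follows. This is an illustration of the partitioning only, not the parlib-simple code: the function names and the shape of the work items are assumptions, and the real demo's workers pull items from a shared queue, which is not modeled here.

```javascript
// Static scheme: one horizontal strip per worker, assigned deterministically.
function makeStrips(height, numWorkers) {
  const rowsPerStrip = Math.ceil(height / numWorkers);
  const strips = [];
  for (let w = 0; w < numWorkers; w++) {
    const yStart = Math.min(w * rowsPerStrip, height);
    const yEnd = Math.min(yStart + rowsPerStrip, height);
    strips.push({ worker: w, yStart, yEnd });
  }
  return strips;
}

// Dynamic scheme: a queue of subgrids, 4 * numWorkers along each dimension.
// Workers would take the next subgrid whenever they finish one.
// Subgrids at the right or bottom edge may be smaller (or empty) when the
// image size is not a multiple of the subgrid size.
function makeSubgridQueue(width, height, numWorkers) {
  const perDim = 4 * numWorkers;
  const tileW = Math.ceil(width / perDim);
  const tileH = Math.ceil(height / perDim);
  const queue = [];
  for (let ty = 0; ty < perDim; ty++) {
    for (let tx = 0; tx < perDim; tx++) {
      const xStart = Math.min(tx * tileW, width);
      const yStart = Math.min(ty * tileH, height);
      queue.push({
        xStart,
        yStart,
        xEnd: Math.min(xStart + tileW, width),
        yEnd: Math.min(yStart + tileH, height),
      });
    }
  }
  return queue;
}

console.log(makeStrips(480, 8).length);            // 8 work items
console.log(makeSubgridQueue(640, 480, 8).length); // 32 * 32 = 1024 work items
```

With 8 workers the strip scheme yields 8 large items with fixed owners, while the subgrid scheme yields 1024 small items whose owner depends on timing, which is consistent with the suggestion that the two programs have different memory access and ownership patterns.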
Comment 12 • 10 years ago (Reporter)
FWIW, still running at the slower rate on current mozilla-inbound.
Comment 13 • 9 years ago (Reporter)
I just ran across this again in a similar program and found an interesting correlation. I think it may have to do with the number of parameters to the function leading to some poor register allocation choices.
I have two variables, centerY and centerX, which were initially global. Their values are constant. They are used in the function but are not hot (only used before the nested loops). When I add these variables to the parameter list of the function and pass them in the call, performance drops by 30%.
In this case, the number of parameters increased from four to six. AMD64, Linux, FF 46.0a2.
(Have not had the chance to dig into this further.)
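The shape of the change described in this comment can be sketched as two variants of a kernel that differ only in how `centerX` and `centerY` reach the function. The kernel body, the other four parameter names, and the constant values are assumptions for illustration; the actual program is not shown in this bug. In both variants the center values are read only before the nested loops.

```javascript
// Variant A: centerX and centerY are globals; the function takes four parameters.
const centerX = -0.5;
const centerY = 0.0;

function mandelGlobals(width, height, scale, maxIter) {
  const xMin = centerX - (width / 2) * scale; // centers used only here,
  const yMin = centerY - (height / 2) * scale; // outside the hot loops
  let total = 0;
  for (let py = 0; py < height; py++) {
    for (let px = 0; px < width; px++) {
      const cx = xMin + px * scale;
      const cy = yMin + py * scale;
      let x = 0, y = 0, i = 0;
      while (i < maxIter && x * x + y * y <= 4) {
        const t = x * x - y * y + cx;
        y = 2 * x * y + cy;
        x = t;
        i++;
      }
      total += i;
    }
  }
  return total;
}

// Variant B: the same kernel, with the centers passed in (six parameters).
function mandelParams(width, height, scale, maxIter, centerX, centerY) {
  const xMin = centerX - (width / 2) * scale;
  const yMin = centerY - (height / 2) * scale;
  let total = 0;
  for (let py = 0; py < height; py++) {
    for (let px = 0; px < width; px++) {
      const cx = xMin + px * scale;
      const cy = yMin + py * scale;
      let x = 0, y = 0, i = 0;
      while (i < maxIter && x * x + y * y <= 4) {
        const t = x * x - y * y + cx;
        y = 2 * x * y + cy;
        x = t;
        i++;
      }
      total += i;
    }
  }
  return total;
}
```

Both variants compute the same result, so timing each in a loop with identical arguments would be one way to check whether the extra parameters alone account for the difference.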
Comment 14 • 9 years ago (Reporter)
I should mention that in the latter case, there's no shared memory or atomics - it's standard JS.
Comment 16 • 8 years ago (Reporter)
The working benchmark code is https://github.com/lars-t-hansen/parlib-simple, in demo/mandelbrot-animation2/mandelbrot.html. Pass the number of workers as a URL parameter, ?workers=n. The default is 4.
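For readers unfamiliar with this kind of URL parameter, a minimal sketch of reading `?workers=n` with a default of 4 follows. It is not taken from the demo; the function name `workerCountFromQuery` and the fallback behavior for invalid values are assumptions.

```javascript
// Hypothetical helper: read the worker count from a query string such as
// "?workers=8", falling back to the default when it is missing or invalid.
function workerCountFromQuery(search, defaultCount = 4) {
  const raw = new URLSearchParams(search).get("workers");
  const n = Number.parseInt(raw, 10);
  return Number.isInteger(n) && n > 0 ? n : defaultCount;
}

console.log(workerCountFromQuery("?workers=8")); // 8
console.log(workerCountFromQuery(""));           // 4
```

In a browser page this would be called as `workerCountFromQuery(window.location.search)`.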
I had hoped the fix to bug 1205073 might have fixed this too, but it has not. If anything, this program is slower (locally built Nightly, JSGC_DISABLE_POISONING=1) than before: I get 34fps with 8 workers.
Assignee: lhansen → nobody
Updated • 2 years ago
Severity: normal → S3
Updated • 24 days ago
Status: NEW → RESOLVED
Closed: 24 days ago
Resolution: --- → INCOMPLETE