Closed
Bug 505849
Opened 16 years ago
Closed 12 years ago
JS multithreading slowdown
Categories
(Core :: JavaScript Engine, defect, P2)
RESOLVED
WORKSFORME
mozilla1.9.2
People
(Reporter: shengnan.cong, Unassigned)
Attachments
(2 files: patch, 1.96 KB; patch, 2.00 KB)
The JS engine shows a ~10% slowdown when a new thread is introduced, even if the thread does nothing (patch attached to the bug).
I created a dummy thread in the main function of js.cpp. The thread does nothing: it is created with pthread_create at the beginning of main and joined with pthread_join at the end of main. Even so, just adding this dummy thread introduces a slowdown of about 10%. The time measured is the elapsed time between js_InitJIT and js_FinishJIT.
The detailed SunSpider results with the tracer enabled (-j) and disabled are listed below. The slowdown is larger with -j, and the -j numbers are consistent with what I observed for parallel JIT on Mac (bug 488202). No such slowdown is observed on Windows.
The slowdown seems to be related to PR_Lock/PR_Unlock in NSPR, which call pthread_mutex_lock/pthread_mutex_unlock. Is anyone aware of slowdowns in PR_Lock/PR_Unlock on Mac?
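For readers without access to the attachment, the experiment can be sketched roughly as follows. This is not the attached patch: run_with_dummy_thread and sample_workload are hypothetical names, and sample_workload merely stands in for whatever the shell's main function does between thread creation and join.

```c
#include <pthread.h>
#include <stddef.h>

/* Thread body that does no work and returns immediately. */
static void *dummy_thread_main(void *arg) {
    (void)arg;
    return NULL;
}

/* Hypothetical stand-in for the real work done in main. */
static int sample_workload(void) {
    return 42;
}

/*
 * Creates the dummy thread before the workload runs and joins it
 * afterwards, mirroring "pthread_create at the beginning of main,
 * pthread_join at the end of main".
 * Returns the workload's result, or -1 if a pthread call fails.
 */
static int run_with_dummy_thread(int (*workload)(void)) {
    pthread_t dummy;
    if (pthread_create(&dummy, NULL, dummy_thread_main, NULL) != 0)
        return -1;

    int result = workload();

    if (pthread_join(dummy, NULL) != 0)
        return -1;
    return result;
}
```

Calling run_with_dummy_thread(sample_workload) from main, and comparing its timing against a direct call to sample_workload(), gives the same kind of with/without comparison as described above.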
benchmark  with -j  without -j
t/3d-cube.js -8.6% -1.8%
t/3d-morph.js -6.1% -2.1%
t/3d-raytrace.js -6.9% -1.2%
t/access-binary-trees.js -8.3% -3.0%
t/access-fannkuch.js -2.0% 0.0%
t/access-nbody.js -13.6% 2.5%
t/access-nsieve.js -8.8% -1.7%
t/bitops-3bit-bits-in-byte.js -2.2% -4.3%
t/bitops-bits-in-byte.js -5.9% 1.6%
t/bitops-bitwise-and.js -21.7% 1.0%
t/bitops-nsieve-bits.js -1.6% -0.1%
t/controlflow-recursive.js -1.0% -2.2%
t/crypto-aes.js -9.9% -6.9%
t/crypto-md5.js -12.4% -0.2%
t/crypto-sha1.js -14.0% -0.7%
t/date-format-tofte.js -6.2% -3.2%
t/date-format-xparb.js -6.3% -4.7%
t/math-cordic.js -6.1% -3.1%
t/math-partial-sums.js -1.9% -0.4%
t/math-spectral-norm.js -0.7% -2.0%
t/regexp-dna.js -18.5% -3.7%
t/string-base64.js -26.6% -9.6%
t/string-fasta.js -4.8% -5.7%
t/string-tagcloud.js -17.6% -11.5%
t/string-unpack-code.js -16.1% -11.5%
t/string-validate-input.js -15.5% -6.2%
average -9.4% -3.1%
Reporter
Comment 1•16 years ago
Comment 2•16 years ago
Hi Shengnan,
Yeah, Mac's locking implementation is beyond horrible. But even if you remove those locks, the MT build is still significantly slower than the single-threaded one. We are very interested in more performance analysis of why that is the case. I have been working on reducing it, so make sure you use the latest TraceMonkey version if you want to dig into this. Also, the slowdown is very visible on Windows as well, in case your tool support is better there.
Priority: -- → P2
Target Milestone: --- → mozilla1.9.2
Reporter
Comment 3•16 years ago
Hi, Andreas,
I ran the experiment above yesterday with a freshly downloaded copy of mozilla-central. I didn't observe a noticeable performance problem on my Windows machine. Here are the SunSpider results on Windows with the tracer enabled.
benchmark  with -j
t/3d-cube.js -2.6%
t/3d-morph.js -0.1%
t/3d-raytrace.js -3.2%
t/access-binary-trees.js -1.6%
t/access-fannkuch.js -3.1%
t/access-nbody.js -0.5%
t/access-nsieve.js 0.5%
t/bitops-3bit-bits-in-byte.js 3.8%
t/bitops-bits-in-byte.js 0.4%
t/bitops-bitwise-and.js 0.4%
t/bitops-nsieve-bits.js -3.7%
t/controlflow-recursive.js 6.6%
t/crypto-aes.js -2.4%
t/crypto-md5.js -2.1%
t/crypto-sha1.js 0.1%
t/date-format-tofte.js 2.0%
t/date-format-xparb.js -0.6%
t/math-cordic.js -2.1%
t/math-partial-sums.js -3.6%
t/math-spectral-norm.js 1.3%
t/regexp-dna.js 0.1%
t/string-base64.js -0.5%
t/string-fasta.js -0.8%
t/string-tagcloud.js 3.2%
t/string-unpack-code.js 2.7%
t/string-validate-input.js 0.5%
average -0.2%
Reporter
Comment 4•16 years ago
I didn't see a performance slowdown on Windows. Please let me know if you observe different results.
Comment 5•16 years ago
Could you also try an in-browser test run? The largest difference happens between the ST shell and the browser.
Reporter
Comment 6•16 years ago
There is no visible performance difference on Mac with an in-browser test run of SunSpider. Here are the results.
TEST COMPARISON FROM TO DETAILS
=============================================================================
** TOTAL **: - 914.8ms +/- 0.4% 913.2ms +/- 1.6%
=============================================================================
3d: - 129.8ms +/- 1.8% 129.6ms +/- 2.5%
cube: - 40.2ms +/- 1.4% 40.2ms +/- 4.0%
morph: - 26.4ms +/- 2.6% 26.0ms +/- 0.0%
raytrace: ?? 63.2ms +/- 3.2% 63.4ms +/- 3.3% not conclusive: might be *1.00x as slow*
access: ?? 124.8ms +/- 2.6% 127.0ms +/- 3.7% not conclusive: might be *1.02x as slow*
binary-trees: ?? 38.8ms +/- 2.7% 40.0ms +/- 7.3% not conclusive: might be *1.03x as slow*
fannkuch: - 49.0ms +/- 7.6% 48.8ms +/- 5.5%
nbody: ?? 24.8ms +/- 7.4% 26.6ms +/- 6.3% not conclusive: might be *1.07x as slow*
nsieve: - 12.2ms +/- 4.6% 11.6ms +/- 5.9%
bitops: 1.04x as fast 33.6ms +/- 3.3% 32.4ms +/- 2.1% significant
3bit-bits-in-byte: - 1.2ms +/- 46.3% 1.2ms +/- 46.3%
bits-in-byte: - 7.6ms +/- 9.0% 7.2ms +/- 7.7%
bitwise-and: - 2.0ms +/- 0.0% 2.0ms +/- 0.0%
nsieve-bits: - 22.8ms +/- 7.1% 22.0ms +/- 0.0%
controlflow: *1.01x as slow* 30.0ms +/- 0.0% 30.4ms +/- 3.7% significant
recursive: *1.01x as slow* 30.0ms +/- 0.0% 30.4ms +/- 3.7% significant
crypto: - 50.4ms +/- 7.9% 49.4ms +/- 9.7%
aes: - 29.2ms +/- 11.0% 28.4ms +/- 5.0%
md5: - 13.8ms +/- 36.3% 13.6ms +/- 32.7%
sha1: - 7.4ms +/- 9.2% 7.4ms +/- 9.2%
date: - 142.6ms +/- 3.3% 138.6ms +/- 0.5%
format-tofte: - 74.6ms +/- 3.0% 72.6ms +/- 0.9%
format-xparb: - 68.0ms +/- 3.7% 66.0ms +/- 0.0%
math: - 28.4ms +/- 2.4% 28.4ms +/- 3.9%
cordic: - 9.2ms +/- 6.0% 9.0ms +/- 0.0%
partial-sums: 1.03x as fast 13.0ms +/- 0.0% 12.6ms +/- 5.4% significant
spectral-norm: ?? 6.2ms +/- 9.0% 6.8ms +/- 8.2% not conclusive: might be *1.10x as slow*
regexp: - 59.0ms +/- 7.5% 59.0ms +/- 2.6%
dna: - 59.0ms +/- 7.5% 59.0ms +/- 2.6%
string: ?? 316.2ms +/- 0.5% 318.4ms +/- 3.2% not conclusive: might be *1.01x as slow*
base64: - 19.0ms +/- 6.5% 18.6ms +/- 3.7%
fasta: - 73.6ms +/- 1.5% 73.2ms +/- 1.4%
tagcloud: ?? 90.8ms +/- 1.1% 92.0ms +/- 6.1% not conclusive: might be *1.01x as slow*
unpack-code: *1.01x as slow* 100.0ms +/- 0.0% 100.6ms +/- 2.4% significant
validate-input: *1.04x as slow* 32.8ms +/- 1.7% 34.0ms +/- 8.2% significant
Comment 7•16 years ago
Let me run that here locally. I will post my results.
Comment 8•16 years ago
This is a very important bug. I'd suggest picking a benchmark that gets a big slowdown and has fairly simple code (string-validate-input?) and understanding that in detail.
Reporter
Comment 9•16 years ago
I have checked Apple’s pthread source code at: http://www.opensource.apple.com/source/Libc/Libc-498/pthreads/
It seems that the lock is implemented so that, when the application is threaded (controlled by the global variable __is_threaded), it spins first (spin_lock) and then yields if the lock is not available (pthread_spinlock.h, pthread.c). This could be the reason for the slowdown, since the Shark profile of string-validate-input shows that __spin_lock accounts for more than half of the extra samples. I will run more experiments and try some fixes to verify this.
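To illustrate the general "spin first, then yield" pattern described above, here is a minimal sketch using C11 atomics and sched_yield. It is not Apple's Libc code: the type spin_yield_lock, its functions, and the SPIN_TRIES bound are all hypothetical and chosen only for illustration.

```c
#include <sched.h>
#include <stdatomic.h>

/* Hypothetical bound on busy-wait attempts before yielding. */
#define SPIN_TRIES 1000

typedef struct {
    atomic_flag held; /* set while some thread owns the lock */
} spin_yield_lock;

static void spin_yield_lock_init(spin_yield_lock *l) {
    atomic_flag_clear(&l->held);
}

static void spin_yield_lock_acquire(spin_yield_lock *l) {
    for (;;) {
        /* Phase 1: busy-wait a bounded number of times. */
        for (int i = 0; i < SPIN_TRIES; i++) {
            if (!atomic_flag_test_and_set_explicit(&l->held,
                                                   memory_order_acquire))
                return; /* flag was clear, so we now own the lock */
        }
        /* Phase 2: still contended, give up the CPU and retry. */
        sched_yield();
    }
}

static void spin_yield_lock_release(spin_yield_lock *l) {
    atomic_flag_clear_explicit(&l->held, memory_order_release);
}
```

With a scheme like this, every acquire costs at least one atomic read-modify-write even when there is no contention, which is one plausible way frequent locking could show up as extra samples in a profile.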
Reporter
Comment 10•16 years ago
I have verified that the (spin)lock implementation in the Mac kernel is the root cause of the slowdown reported above. For string-validate-input, the 15% slowdown I observed can be broken down into two parts (based on Shark profiles).
1) Increased pthread_mutex_(un)lock overhead (contributes 8.5% of the 15% slowdown). This overhead might be reduced by eliminating unnecessary locks.
2) Overhead in malloc, free, size, and other system calls (contributes the other 6.5% of the 15% slowdown). This overhead might be reduced if we could reduce or offload those calls.
Overall, the overhead seems to come solely from the (spin)lock implementation of the Mac kernel. A simple hack of resetting "__is_threaded" to 0 removes all the multithreading overhead.
In the latest TraceMonkey, multithreading seems to be used to offload memory deallocation to a separate thread (bug 505612). I observed a 2.5% slowdown on SunSpider on my Mac with the new build. It seems that the multithreading overhead offsets the performance benefit of parallel GC. Also, as you would expect, adding a dummy thread no longer introduces any overhead in the new build (good news for parallel JIT :-)
Comment 11•12 years ago
JSRuntime is now single-threaded(-ish).
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → WORKSFORME