505849 - JS multithreading slowdown

Reporter

Description

•

16 years ago

The JS engine slows down by roughly 10% on Mac when a new thread is introduced, even if the thread does nothing (patch attached to the bug). I created a dummy thread in the main function of js.cpp. The dummy thread does nothing: it is created by pthread_create at the beginning of main and joined by pthread_join at the end of main. Even so, adding this dummy thread alone slows SunSpider down by about 10% on average with the tracer enabled. The timing covers the interval between js_InitJIT and js_FinishJIT.

The detailed SunSpider results with the tracer enabled (-j) and disabled are listed below. The slowdown is larger with -j, and the -j numbers are consistent with what I have observed for parallel JIT on Mac (bug 488202). No such slowdown is observed on Windows.

It seems that the slowdown is related to PR_Lock/PR_Unlock in NSPR, which call pthread_mutex_lock/unlock. Is anyone aware of slowdowns of PR_Lock/PR_Unlock on Mac?

                                  with -j   without -j
t/3d-cube.js                       -8.6%      -1.8%
t/3d-morph.js                      -6.1%      -2.1%
t/3d-raytrace.js                   -6.9%      -1.2%
t/access-binary-trees.js           -8.3%      -3.0%
t/access-fannkuch.js               -2.0%       0.0%
t/access-nbody.js                 -13.6%       2.5%
t/access-nsieve.js                 -8.8%      -1.7%
t/bitops-3bit-bits-in-byte.js      -2.2%      -4.3%
t/bitops-bits-in-byte.js           -5.9%       1.6%
t/bitops-bitwise-and.js           -21.7%       1.0%
t/bitops-nsieve-bits.js            -1.6%      -0.1%
t/controlflow-recursive.js         -1.0%      -2.2%
t/crypto-aes.js                    -9.9%      -6.9%
t/crypto-md5.js                   -12.4%      -0.2%
t/crypto-sha1.js                  -14.0%      -0.7%
t/date-format-tofte.js             -6.2%      -3.2%
t/date-format-xparb.js             -6.3%      -4.7%
t/math-cordic.js                   -6.1%      -3.1%
t/math-partial-sums.js             -1.9%      -0.4%
t/math-spectral-norm.js            -0.7%      -2.0%
t/regexp-dna.js                   -18.5%      -3.7%
t/string-base64.js                -26.6%      -9.6%
t/string-fasta.js                  -4.8%      -5.7%
t/string-tagcloud.js              -17.6%     -11.5%
t/string-unpack-code.js           -16.1%     -11.5%
t/string-validate-input.js        -15.5%      -6.2%
average                            -9.4%      -3.1%
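The dummy-thread setup described above can be sketched roughly as follows. This is not the attached patch: `run_with_dummy_thread`, `dummy_thread_main`, and `sample_workload` are hypothetical names, and the workload is a stand-in for whatever the shell runs between thread creation and join.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Thread body that returns immediately: the thread is created but does no work. */
static void *dummy_thread_main(void *arg)
{
    (void)arg;
    return NULL;
}

/* Hypothetical stand-in for the real workload (e.g. running one benchmark). */
static void sample_workload(void *ctx)
{
    unsigned long *sum = ctx;
    for (unsigned long i = 1; i <= 1000; i++)
        *sum += i;
}

/*
 * Create a do-nothing thread, run the workload on the calling thread,
 * then join the dummy thread. Returns 0 on success or a pthread error code.
 */
static int run_with_dummy_thread(void (*workload)(void *), void *ctx)
{
    pthread_t tid;
    int rc;

    assert(workload != NULL);

    rc = pthread_create(&tid, NULL, dummy_thread_main, NULL);
    if (rc != 0)
        return rc;

    workload(ctx);

    return pthread_join(tid, NULL);
}
```

Timing the same workload with and without the wrapper is one way to isolate the cost of the process merely having created a thread.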

Shengnan Cong

Reporter

Comment 1

•

16 years ago

Attached patch: Patch to reproduce the slowdowns

Andreas Gal :gal

Comment 2

•

16 years ago

Hi Shengnan, yeah, Mac's locking implementation is beyond horrible. But even if you remove those locks, the MT build is still significantly slower than the single-threaded one. We are very interested in more performance analysis of why that's the case. I have been doing some work on trying to reduce it, so make sure you use the latest TraceMonkey version if you want to dig into this. Also, the slowdown is very visible on Windows as well, in case your tool support is better there.

Priority: -- → P2

Target Milestone: --- → mozilla1.9.2

Shengnan Cong

Reporter

Comment 3

•

16 years ago

Hi Andreas, I did the experiment above yesterday with a freshly downloaded copy of mozilla-central. I didn't observe a noticeable performance problem on my Windows machine. Here are the SunSpider results on Windows with the tracer enabled.

                                  with -j
t/3d-cube.js                       -2.6%
t/3d-morph.js                      -0.1%
t/3d-raytrace.js                   -3.2%
t/access-binary-trees.js           -1.6%
t/access-fannkuch.js               -3.1%
t/access-nbody.js                  -0.5%
t/access-nsieve.js                  0.5%
t/bitops-3bit-bits-in-byte.js       3.8%
t/bitops-bits-in-byte.js            0.4%
t/bitops-bitwise-and.js             0.4%
t/bitops-nsieve-bits.js            -3.7%
t/controlflow-recursive.js          6.6%
t/crypto-aes.js                    -2.4%
t/crypto-md5.js                    -2.1%
t/crypto-sha1.js                    0.1%
t/date-format-tofte.js              2.0%
t/date-format-xparb.js             -0.6%
t/math-cordic.js                   -2.1%
t/math-partial-sums.js             -3.6%
t/math-spectral-norm.js             1.3%
t/regexp-dna.js                     0.1%
t/string-base64.js                 -0.5%
t/string-fasta.js                  -0.8%
t/string-tagcloud.js                3.2%
t/string-unpack-code.js             2.7%
t/string-validate-input.js          0.5%
average                            -0.2%

Shengnan Cong

Reporter

Comment 4

•

16 years ago

Attached patch: patch for windows

I didn't see a performance slowdown on Windows. Please let me know if you observe different results.

Andreas Gal :gal

Comment 5

•

16 years ago

Could you also try an in-browser test run? The largest differences happen between the ST shell and the browser.

Shengnan Cong

Reporter

Comment 6

•

16 years ago

There is no visible performance difference on Mac in an in-browser test run of SunSpider. Here are the results.

TEST                   COMPARISON            FROM                 TO              DETAILS

=============================================================================

** TOTAL **:           -                 914.8ms +/- 0.4%    913.2ms +/- 1.6%

=============================================================================

  3d:                  -                 129.8ms +/- 1.8%    129.6ms +/- 2.5%
    cube:              -                  40.2ms +/- 1.4%     40.2ms +/- 4.0%
    morph:             -                  26.4ms +/- 2.6%     26.0ms +/- 0.0%
    raytrace:          ??                 63.2ms +/- 3.2%     63.4ms +/- 3.3%     not conclusive: might be *1.00x as slow*

  access:              ??                124.8ms +/- 2.6%    127.0ms +/- 3.7%     not conclusive: might be *1.02x as slow*
    binary-trees:      ??                 38.8ms +/- 2.7%     40.0ms +/- 7.3%     not conclusive: might be *1.03x as slow*
    fannkuch:          -                  49.0ms +/- 7.6%     48.8ms +/- 5.5%
    nbody:             ??                 24.8ms +/- 7.4%     26.6ms +/- 6.3%     not conclusive: might be *1.07x as slow*
    nsieve:            -                  12.2ms +/- 4.6%     11.6ms +/- 5.9%

  bitops:              1.04x as fast      33.6ms +/- 3.3%     32.4ms +/- 2.1%     significant
    3bit-bits-in-byte: -                   1.2ms +/- 46.3%     1.2ms +/- 46.3%
    bits-in-byte:      -                   7.6ms +/- 9.0%      7.2ms +/- 7.7%
    bitwise-and:       -                   2.0ms +/- 0.0%      2.0ms +/- 0.0%
    nsieve-bits:       -                  22.8ms +/- 7.1%     22.0ms +/- 0.0%

  controlflow:         *1.01x as slow*    30.0ms +/- 0.0%     30.4ms +/- 3.7%     significant
    recursive:         *1.01x as slow*    30.0ms +/- 0.0%     30.4ms +/- 3.7%     significant

  crypto:              -                  50.4ms +/- 7.9%     49.4ms +/- 9.7%
    aes:               -                  29.2ms +/- 11.0%    28.4ms +/- 5.0%
    md5:               -                  13.8ms +/- 36.3%    13.6ms +/- 32.7%
    sha1:              -                   7.4ms +/- 9.2%      7.4ms +/- 9.2%

  date:                -                 142.6ms +/- 3.3%    138.6ms +/- 0.5%
    format-tofte:      -                  74.6ms +/- 3.0%     72.6ms +/- 0.9%
    format-xparb:      -                  68.0ms +/- 3.7%     66.0ms +/- 0.0%

  math:                -                  28.4ms +/- 2.4%     28.4ms +/- 3.9%
    cordic:            -                   9.2ms +/- 6.0%      9.0ms +/- 0.0%
    partial-sums:      1.03x as fast      13.0ms +/- 0.0%     12.6ms +/- 5.4%     significant
    spectral-norm:     ??                  6.2ms +/- 9.0%      6.8ms +/- 8.2%     not conclusive: might be *1.10x as slow*

  regexp:              -                  59.0ms +/- 7.5%     59.0ms +/- 2.6%
    dna:               -                  59.0ms +/- 7.5%     59.0ms +/- 2.6%

  string:              ??                316.2ms +/- 0.5%    318.4ms +/- 3.2%     not conclusive: might be *1.01x as slow*
    base64:            -                  19.0ms +/- 6.5%     18.6ms +/- 3.7%
    fasta:             -                  73.6ms +/- 1.5%     73.2ms +/- 1.4%
    tagcloud:          ??                 90.8ms +/- 1.1%     92.0ms +/- 6.1%     not conclusive: might be *1.01x as slow*
    unpack-code:       *1.01x as slow*   100.0ms +/- 0.0%    100.6ms +/- 2.4%     significant
    validate-input:    *1.04x as slow*    32.8ms +/- 1.7%     34.0ms +/- 8.2%     significant

Andreas Gal :gal

Comment 7

•

16 years ago

Let me run that here locally. I will post my results.

David Mandelin [:dmandelin]

Comment 8

•

16 years ago

This is a very important bug. I'd suggest picking a benchmark that gets a big slowdown and has fairly simple code (string-validate-input?) and understanding that in detail.

Shengnan Cong

Reporter

Comment 9

•

16 years ago

I have checked Apple’s pthread source code at: http://www.opensource.apple.com/source/Libc/Libc-498/pthreads/ It seems that the lock is implemented in such a way that, when the application is threaded (controlled by the global variable __is_threaded), it spins (spin_lock) first before yielding if the lock is not available (pthread_spinlock.h, pthread.c). This could be the reason for the slowdowns, since the Shark profile of string-validate-input shows that __spin_lock takes more than half of the extra samples. I will do more experiments and try some fixes to verify this further.
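For readers unfamiliar with the pattern, a generic spin-then-yield lock looks roughly like the sketch below. This is an illustration written with C11 atomics, not Apple's Libc code: the type and function names are hypothetical, and the spin budget `SPIN_TRIES` is an arbitrary assumed value.

```c
#include <assert.h>
#include <sched.h>
#include <stdatomic.h>
#include <stddef.h>

#define SPIN_TRIES 1000 /* assumed spin budget, not taken from Apple's Libc */

/* Hypothetical lock type: a single flag that is set while the lock is held. */
typedef struct {
    atomic_flag held;
} spin_yield_lock;

#define SPIN_YIELD_LOCK_INIT { ATOMIC_FLAG_INIT }

/* Try once to take the lock; returns 1 if acquired, 0 if it is already held. */
static int spin_yield_try_acquire(spin_yield_lock *l)
{
    assert(l != NULL);
    return !atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire);
}

/* Spin for a bounded number of attempts, then yield the CPU and start over. */
static void spin_yield_acquire(spin_yield_lock *l)
{
    for (;;) {
        for (int i = 0; i < SPIN_TRIES; i++) {
            if (spin_yield_try_acquire(l))
                return;
        }
        sched_yield(); /* lock still unavailable: let another thread run */
    }
}

/* Release the lock so that a spinning or yielding waiter can take it. */
static void spin_yield_release(spin_yield_lock *l)
{
    assert(l != NULL);
    atomic_flag_clear_explicit(&l->held, memory_order_release);
}
```

In a scheme like this, every acquire pays for at least one atomic read-modify-write even when nobody else holds the lock, which is the kind of cost that would show up as extra samples in a spin-lock routine.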

Shengnan Cong

Reporter

Comment 10

•

16 years ago

I have verified that the (spin)lock implementation in the Mac kernel is the root cause of the slowdown reported above. For string-validate-input, the 15% slowdown I observed can be broken down into two parts (based on Shark profiles).

1) Increased pthread_mutex_(un)lock overhead (contributes 8.5% of the 15% slowdown). This overhead might be reduced by eliminating unnecessary locks.

2) Overhead in malloc, free, size, and other system calls (contributes the other 6.5% of the 15% slowdown). This overhead might be reduced if we could reduce or offload those calls.

Overall, the overhead seems to come solely from the (spin)lock implementation of the Mac kernel. A simple hack of resetting "__is_threaded" to 0 removes all the overhead of multithreading.

In the latest TraceMonkey, multithreading seems to be used to offload memory deallocation to a separate thread (bug 505612). I have observed a 2.5% slowdown on SunSpider on my Mac with the new build. It seems that the overhead of multithreading offsets the performance benefit from parallel GC. Also, as you can expect, adding a dummy thread no longer introduces any overhead in the new build (good news for parallel JIT :-)
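One way to check the lock-overhead component independently of the JS engine is to time uncontended lock/unlock pairs. The sketch below is an assumption-laden micro-benchmark, not part of the attached patches: `time_lock_pairs` is a hypothetical helper, and it relies on `clock_gettime` with `CLOCK_MONOTONIC` being available on the target system.

```c
#define _POSIX_C_SOURCE 200809L

#include <assert.h>
#include <pthread.h>
#include <time.h>

/*
 * Time n uncontended lock/unlock pairs on a private mutex.
 * Returns the elapsed time in seconds, or -1.0 if setup fails.
 */
static double time_lock_pairs(long n)
{
    pthread_mutex_t m;
    struct timespec t0, t1;

    assert(n >= 0);

    if (pthread_mutex_init(&m, NULL) != 0)
        return -1.0;

    if (clock_gettime(CLOCK_MONOTONIC, &t0) != 0) {
        pthread_mutex_destroy(&m);
        return -1.0;
    }

    for (long i = 0; i < n; i++) {
        pthread_mutex_lock(&m);
        pthread_mutex_unlock(&m);
    }

    if (clock_gettime(CLOCK_MONOTONIC, &t1) != 0) {
        pthread_mutex_destroy(&m);
        return -1.0;
    }

    pthread_mutex_destroy(&m);

    return (double)(t1.tv_sec - t0.tv_sec) +
           (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
}
```

Calling this once before the process creates its first thread and once afterwards would show whether the per-lock cost changes once the process is threaded; the numbers will vary by OS version and hardware.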

Luke Wagner [:luke]

Comment 11

•

12 years ago

JSRuntime is now single-threaded(-ish).

Status: NEW → RESOLVED

Closed: 12 years ago

Resolution: --- → WORKSFORME

Patch to reproduce the slowdowns, 16 years ago, Shengnan Cong, 1.96 KB, patch
patch for windows, 16 years ago, Shengnan Cong, 2.00 KB, patch

Categories

(Core :: JavaScript Engine, defect, P2)

People

(Reporter: shengnan.cong, Unassigned)

Security

(public)
