Parallel JITing for TraceMonkey




10 years ago
7 years ago


(Reporter: shengnan.cong, Unassigned)



Firefox Tracking Flags

(Not tracked)



(3 attachments, 5 obsolete attachments)



10 years ago
I have been working on off-loading the JITing to another thread so that the interpreter does not need to pause for compilation. The current status is as follows:

On Windows, the gains are reasonably good. On Mac, there are significant slowdowns, which I am still working on.

Results from an Intel Core-2 Duo CPU T7300 @ 2.00GHz, 1.96 GB RAM running
Windows XP:

Sunspider Speedup 
sunspider/3d-raytrace.js              6.70% 
sunspider/3d-cube.js                  6.40% 
sunspider/access-nbody.js             4.70% 
sunspider/crypto-aes.js               4.00% 
sunspider/access-fannkuch.js          3.80% 
sunspider/crypto-sha1.js              2.20% 
sunspider/string-unpack-code.js       1.90% 
sunspider/access-binary-trees.js      1.70% 
sunspider/date-format-xparb.js        1.70% 
sunspider/3d-morph.js                 1.50% 
sunspider/date-format-tofte.js        1.50% 
sunspider/string-fasta.js             1.40% 
sunspider/math-spectral-norm.js       1.30% 
sunspider/bitops-nsieve-bits.js       0.70% 
sunspider/access-nsieve.js            0.50% 
sunspider/math-partial-sums.js        0.50% 
sunspider/controlflow-recursive.js    0.20% 
sunspider/string-validate-input.js    0.20% 
sunspider/string-tagcloud.js          0.00% 
sunspider/bitops-3bit-bits-in-byte.js -0.10% 
sunspider/crypto-md5.js               -0.20% 
sunspider/math-cordic.js              -0.20% 
sunspider/bitops-bits-in-byte.js      -0.50% 
sunspider/string-base64.js            -2.00% 
sunspider/bitops-bitwise-and.js       -3.80% 
sunspider/regexp-dna.js               -12.60% 

Results from a Core 2 Duo 2.0GHz (2 cores) running Mac OS X Leopard:
t/3d-raytrace.js                         13.00% 
t/access-nbody.js                        4.36% 
t/access-fannkuch.js                      3.60% 
t/crypto-aes.js                           1.35% 
t/controlflow-recursive.js                0.98% 
t/math-partial-sums.js                    0.79% 
t/3d-cube.js                              0.24% 
t/bitops-nsieve-bits.js                    -0.40% 
t/math-spectral-norm.js                    -1.06% 
t/3d-morph.js                              -1.10% 
t/crypto-md5.js                            -1.29% 
t/access-nsieve.js                         -1.32% 
t/bitops-bits-in-byte.js                   -1.46% 
t/crypto-sha1.js                           -1.63% 
t/math-cordic.js                           -2.62% 
t/date-format-tofte.js                     -3.30% 
t/access-binary-trees.js                   -4.03% 
t/date-format-xparb.js                     -4.63% 
t/string-fasta.js                          -6.16% 
t/bitops-3bit-bits-in-byte.js              -6.42% 
t/bitops-bitwise-and.js                    -7.42% 
t/string-validate-input.js                 -10.17% 
t/string-unpack-code.js                    -10.93% 
t/string-tagcloud.js                       -14.64% 
t/string-base64.js                         -20.60% 
t/regexp-dna.js                            -70.99% 

Notes: 1. The code is based on a snapshot of Mozilla-Central from Mar 18.
       2. The timings were obtained by running the benchmarks in the js shell.
       3. raytrace occasionally crashes (one out of 30 runs).

Patch to come next.

Comment 1

10 years ago
Created attachment 372494 [details] [diff] [review]

- Parallel JITting is enabled when PARALLEL_COMPILER is defined in avmplus.h.
- MEASURE_PAUSE is defined for timing.
- On Mac, DARWIN needs to be defined in avmplus.h and CompilerThread.h.

Comment 2

10 years ago
Created attachment 372495 [details]

Additional file: CompilerThread.h to be placed in js/src/nanojit

Comment 3

10 years ago
Created attachment 372496 [details]

Additional file: CompilerThread.cpp to be placed in js/src/nanojit

Comment 4

First let me summarize the design to see if I understand:

The basic design is to have a thread-safe worklist of things to compile or patch. Where the old code compiled, the new code adds something to the worklist. A compiler worker thread compiles code as it enters the worklist. The worklist is implemented with condition variables. 
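
To make the summary concrete, here is a minimal sketch of such a worklist. It is not the code from the attached patch: the class name `CompileWorklist`, its methods, and the use of C++11 `std::thread` primitives are all assumptions chosen for illustration.

```cpp
#include <cassert>
#include <condition_variable>
#include <deque>
#include <functional>
#include <mutex>
#include <thread>
#include <utility>

// Hypothetical names throughout; this is not the code from the patch.
class CompileWorklist {
public:
    using Job = std::function<void()>;

    // Called by the interpreter thread where the old code compiled inline.
    void enqueue(Job job) {
        assert(job);  // an empty job would throw when the worker runs it
        {
            std::lock_guard<std::mutex> lock(mutex_);
            jobs_.push_back(std::move(job));
        }
        ready_.notify_one();
    }

    // Ask the worker to exit once the queue has drained.
    void shutdown() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            done_ = true;
        }
        ready_.notify_all();
    }

    // Body of the compiler thread: sleep until work arrives, then run it.
    void run() {
        for (;;) {
            Job job;
            {
                std::unique_lock<std::mutex> lock(mutex_);
                ready_.wait(lock, [this] { return done_ || !jobs_.empty(); });
                if (jobs_.empty())
                    return;  // done_ is set and nothing is left to compile
                job = std::move(jobs_.front());
                jobs_.pop_front();
            }
            job();  // compile or patch outside the lock
        }
    }

private:
    std::mutex mutex_;
    std::condition_variable ready_;
    std::deque<Job> jobs_;
    bool done_ = false;
};
```

Running the job outside the lock means the interpreter can keep enqueueing work while a compilation is in progress.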

The main other change is that some things that used to be attached to the lirbuf must be attached to the fragment, because the lirbuf data may be overwritten by the time the compiler thread gets to it. (By the way, I think this is a good development and maybe we should store all relevant data in fragment-specific storage instead of relying on the lirbuf.)

What jumps out of the numbers to me is the slowdown on regexp-dna. Maybe you should set it up so that parallel compilation can be controlled independently for the tracer and the regexp compiler.

One thing to note is that in the current code, regexps are compiled on demand, i.e., just before the first time they are used in a match operation. In a parallel compilation setup, it probably makes more sense to queue them for compilation immediately after they are created.

Comment 5

10 years ago
Created attachment 372730 [details] [diff] [review]
Patch for jsregexp.cpp

Modified jsregexp.cpp to queue the compilation of regexps earlier (before the match operation).

Comment 6

10 years ago
David, thanks for the comments. I am not sure how to make the parallel compilation independent of the tracer. It seems to me that the type specialization done by the tracer is tied to the compilation, and it could be hard to separate the two.

I agree with you that it would make more sense to queue the regexps for compilation earlier. I modified the code as shown in the patch, but it does not seem to make a big difference to the performance.

Comment 7

10 years ago
David, are you working on the patch? Please let me know if you need anything from me. Thanks.

Comment 8

Shengnan, I am not currently working on that patch. For now, I read it and liked it. If there is anything in particular you'd *like* me to help with, let me know.

Comment 9

10 years ago
Created attachment 377772 [details] [diff] [review]
Fix for regexp-dna slowdown

I have found the reason for the regexp slowdowns. Basically, the regexp interpreter is very slow: interpreting even one iteration takes much longer than waiting for the native code to be ready and then running the JITted code. Although I put the regexps in the compilation queue right after they are created, that may still not be early enough, so the slow interpreter can still be triggered.

So with the patch, the interpreter waits for the compilation when the native code is not ready, rather than interpreting the regexp. The performance of regexp-dna improves with the patch as follows:
on Windows: from -12.60% to -0.30%
on Mac:     from -70.99% to -16.7%  
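
The waiting behaviour described above could look roughly like the following sketch. The `RegExpCode` type, its members, and the `int (*)(int)` signature are hypothetical stand-ins, not the types used in jsregexp.cpp or in the patch.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>

// Hypothetical stand-in for a compiled regexp; not a real SpiderMonkey type.
struct RegExpCode {
    using NativeFn = int (*)(int);

    // Compiler thread: publish the native code and wake any waiter.
    void publish(NativeFn fn) {
        assert(fn != nullptr);
        {
            std::lock_guard<std::mutex> lock(mutex);
            native = fn;
        }
        ready.notify_all();
    }

    // Interpreter thread: block until native code exists instead of
    // falling back to the slow regexp interpreter.
    NativeFn waitForNative() {
        std::unique_lock<std::mutex> lock(mutex);
        ready.wait(lock, [this] { return native != nullptr; });
        return native;
    }

    std::mutex mutex;
    std::condition_variable ready;
    NativeFn native = nullptr;
};
```

The trade-off is that the main thread pauses again, which is only worthwhile when the fallback interpreter is slower than the wait, as reported here for regexps.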

I am still working on optimizations. I will be on vacation next week and will resume the work the week after.

Attachment #372730 - Attachment is obsolete: true
Attachment #372495 - Attachment is patch: false
Attachment #372496 - Attachment is patch: false

Comment 10

9 years ago
Created attachment 398276 [details] [diff] [review]
New patch

I just merged my changes for parallel JITing with the latest TraceMonkey. Now, on both Mac and Windows, we get reasonably good speedups from running the JIT in parallel with the interpreter. Sunspider numbers follow. The speedups show the gain of the parallelized TM over the existing sequential version of TM on Core-2 Duo systems. On Mac, the speedups range from about -2% to 18%, while on Windows they range from about -3% to 15%. Is there any larger workload that I can test with?

Sunspider Test                    Mac      Windows
t/3d-raytrace.js                  17.7%    14.9%
t/crypto-sha1.js                  11.5%    10.3%
t/date-format-xparb.js            6.6%     4.2%
t/access-nbody.js                 6.2%     8.6%
t/access-fannkuch.js              4.3%     2.0%
t/math-spectral-norm.js           3.7%     2.4%
t/bitops-nsieve-bits.js           3.3%     -0.5%
t/bitops-bitwise-and.js           3.2%     -2.6%
t/string-unpack-code.js           2.4%     -0.1%
t/crypto-aes.js                   2.4%     3.4%
t/bitops-bits-in-byte.js          2.2%     -0.6%
t/crypto-md5.js                   2.2%     3.5%
t/date-format-tofte.js            1.9%     -0.6%
t/3d-cube.js                      1.4%     3.0%
t/string-validate-input.js        1.4%     3.4%
t/string-tagcloud.js              0.9%     3.2%
t/math-cordic.js                  0.9%     0.6%
t/string-base64.js                0.3%     0.9%
t/regexp-dna.js                   0.1%     -0.3%
t/math-partial-sums.js            0.0%     0.1%
t/3d-morph.js                     -0.2%    0.3%
t/controlflow-recursive.js        -0.3%    -0.9%
t/string-fasta.js                 -0.3%    -0.8%
t/access-nsieve.js                -0.4%    1.1%
t/access-binary-trees.js          -1.7%    1.5%
t/bitops-3bit-bits-in-byte.js     -2.1%    -0.5%

Attachment #372494 - Attachment is obsolete: true
Attachment #377772 - Attachment is obsolete: true

Comment 11

9 years ago
Created attachment 398277 [details] [diff] [review]
CompilerThread.h (to be added in /js/nanojit)
Attachment #372495 - Attachment is obsolete: true

Comment 12

9 years ago
Created attachment 398278 [details] [diff] [review]
CompilerThread.cpp (to be added in /js/nanojit)
Attachment #372496 - Attachment is obsolete: true

Comment 13

Coool. For now I think besides SS we mainly have the v8 benchmarks (in our tree at js/src/v8), Dromaeo, and Peacekeeper.

From a combination of looking at your data, talking to Brendan, and making stuff up ;-) I wonder if it is better to parallelize longer traces than shorter ones. It would be interesting to create a version that compiles in parallel only if the length of the trace is greater than K (in who knows what units--LIR instructions?) and tune that K.
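
A minimal sketch of that threshold idea follows. The names `ParallelPolicy`, `CompileMode`, and `minParallelLength` are hypothetical, and measuring K in LIR instructions is only one possible unit.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical policy: K is counted in LIR instructions here, but both the
// unit and the value are tuning assumptions, not taken from the patch.
enum class CompileMode { Inline, Parallel };

struct ParallelPolicy {
    std::size_t minParallelLength;  // the tunable K

    // Short traces compile on the main thread; only traces longer than K
    // are handed to the compiler thread.
    CompileMode choose(std::size_t traceLength) const {
        return traceLength > minParallelLength ? CompileMode::Parallel
                                               : CompileMode::Inline;
    }
};
```

Tuning would then amount to sweeping `minParallelLength` over the benchmarks and comparing timings.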

Comment 14

9 years ago
Thanks for the pointers. I will try them and post the results.

Good suggestion. Since the interpreter checks whether the compiled code is ready when it reaches a back edge, the compilation of a short trace may finish before the interpreter hits the back edge again. Parallel JIT does not show any benefit in such cases. In fact, in the new patch I disabled the parallel JIT of regexps for the same reason. I will create a version as you suggested and post an update.
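
For readers following along, the back-edge check can be sketched as below. `Fragment` here is a hypothetical one-field stand-in for the real class, and the use of `std::atomic` for publication is an assumption rather than what the patch does.

```cpp
#include <atomic>
#include <cassert>

using NativeFn = int (*)(int);

// Hypothetical fragment record; the real Fragment class has far more state.
struct Fragment {
    std::atomic<NativeFn> native{nullptr};
};

// Compiler thread: publish finished code.
inline void publishNative(Fragment& f, NativeFn fn) {
    f.native.store(fn, std::memory_order_release);
}

// Interpreter at a loop back edge: run native code if it is ready,
// otherwise interpret one more iteration without blocking.
inline int onBackEdge(Fragment& f, int state, int (*interpretOnce)(int)) {
    NativeFn fn = f.native.load(std::memory_order_acquire);
    return fn ? fn(state) : interpretOnce(state);
}
```

If compilation finishes before the next back edge is reached, the interpreter never does useful work in parallel with the compiler, which is why short traces see no gain.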

Comment 15

Obsolete with the removal of tracejit.
Last Resolved: 7 years ago
Resolution: --- → WONTFIX