Closed Bug 418436 Opened 16 years ago Closed 12 years ago

JavaScript engine is non-trivially slower with JS_THREADSAFE than without

Categories

(Core :: JavaScript Engine, defect)

Version: Other Branch
Type: defect
Priority: Not set
Severity: normal

Tracking

Status: RESOLVED WORKSFORME

People

(Reporter: jorendorff, Unassigned)

Details

(Keywords: perf)

Attachments

(3 files)

Spun off from bug 412985 comment 17.
See bug 412985 comment #22 for sayrer's suggestion that the thing to test is the js shell with and without JS_THREADSAFE defined.

See bug 54743 comment 94 for test results from when the request-amortized "fly-weight" locking went in; they showed a negligible effect of JS_THREADSAFE on performance, for that one benchmark.

Would like an investigation for 1.9, meaning soon.

/be
Flags: wanted1.9+
Flags: blocking1.9?
Summary: JavaScript engine is slower with JS_THREADSAFE than without → JavaScript engine is non-trivially slower with JS_THREADSAFE than without
Keywords: perf
jorendorff says he'll take a look
Assignee: general → jorendorff
The bug is confirmed at least.  On Mac, anyway.
This screenshot shows a profile created with Shark's "File->Compare..." menu option.  Unfortunately this compares the *proportion* of samples in each function rather than the *total* number of samples.  The latter would be more telling.  I found no way to generate those numbers via Shark.app, though, nor any way to access the raw sample data.  Suggestions?

For example, js_Interpret shows a difference of -1.2%. This doesn't mean that js_Interpret runs faster with JS_THREADSAFE--it actually gets slower. I think the -1.2% number means that js_Interpret itself was at the top of the stack in 46.9% of all non-threadsafe samples and 44.5% of all threadsafe samples, and the column shows the difference of the two proportions (though 44.5% - 46.9% = -2.4%, which doesn't quite match the -1.2% shown... rounding error?).

The comparison shows a grand total of about 1.4% of the total running time of the JS_THREADSAFE test being spent in synchronization-related functions (pthread_mutex_lock, _PR_Darwin_x86_AtomicIncrement, etc.) that are only used in JS_THREADSAFE builds. It shows a +0.8% change in js_NewGCThing, which also makes sense.
To get rid of js_NewGCThing hitting rt->gcLock, please try raising

#define MAX_THREAD_LOCAL_THINGS 8

A lot (64?).

Please find the hot pthread_mutex_lock stacks once you've tuned the gcLock out of the profile. If anything's left, we need to see why it isn't thinlocking or -able.
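
As a minimal sketch of why raising the limit helps (hypothetical names; the real js_NewGCThing in jsgc.c is more involved): the gcLock is only taken on the slow path that refills the per-thread free list, so a list of N things amortizes one lock/unlock over N allocations.

#include <stddef.h>

#define MAX_THREAD_LOCAL_THINGS 8   /* the knob to tune: try 16, 32, 64 */

typedef struct GCThing { struct GCThing *next; } GCThing;
typedef struct ThreadData { GCThing *freeList; } ThreadData;

/* Stand-ins for the real gcLock and arena code (hypothetical). */
extern void LockGC(void);
extern void UnlockGC(void);
extern GCThing *CarveBatch(size_t nthings);  /* returns list head */

static GCThing *
NewGCThing(ThreadData *td)
{
    GCThing *thing = td->freeList;
    if (!thing) {
        /* Slow path, hit once per MAX_THREAD_LOCAL_THINGS allocations:
           take the shared lock and carve a fresh batch off an arena. */
        LockGC();
        thing = CarveBatch(MAX_THREAD_LOCAL_THINGS);
        UnlockGC();
        if (!thing)
            return NULL;    /* the real engine would run the GC here */
    }
    td->freeList = thing->next;
    return thing;
}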

/be
(In reply to comment #5)
> To get rid of js_NewGCThing hitting rt->gcLock, please try raising
> 
> #define MAX_THREAD_LOCAL_THINGS 8
> 
> A lot (64?).

I tried this recently in the scope of bug 400902. With 32 doubles per free list, things were slower; I think this was due to bad interaction with cache prefetch. Now consider JSObject. With MAX_THREAD_LOCAL_THINGS == 8, populating the free list from a freshly allocated arena touches 256 object bytes + 8 flag bytes, 5 cache lines in total on the latest CPUs. With the value set to 64 that becomes 33 cache lines, and most of those lines will be evicted from L1 before the objects are actually allocated.
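
For concreteness, the cache-line arithmetic (assuming 32-byte objects, which the 256/8 figure implies, and 64-byte cache lines, which matches the 5-line figure):

   8 things * 32 bytes =  256 object bytes; +  8 flag bytes =  264 bytes
  ceil(264 / 64)  =  5 cache lines

  64 things * 32 bytes = 2048 object bytes; + 64 flag bytes = 2112 bytes
  2112 / 64       = 33 cache lines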


(In reply to comment #6)
> I tried this recently in the scope of bug 400902. With 32 doubles per
> free list, things were slower; I think this was due to bad interaction
> with cache prefetch. Now consider JSObject. With MAX_THREAD_LOCAL_THINGS
> == 8, populating the free list from a freshly allocated arena touches
> 256 object bytes + 8 flag bytes, 5 cache lines in total on the latest
> CPUs. With the value set to 64 that becomes 33 cache lines, and most of
> those lines will be evicted from L1 before the objects are actually
> allocated.

Given this, I will even suggest trying to decrease MAX_THREAD_LOCAL_THINGS, or even disabling it, to see the difference. Since the thread-local lists live in JSThread, which lives in a different page than the rest of the data, accessing JSThread costs a TLB entry that other things could use. I.e., the problem may not be the lock, but rather the lack of an efficient implementation of thread locals.
It would be good to get line-level profiling for sure, but if we are thrashing the TLB on only 1 out of every 64 js_NewGCThing calls, we may still win.
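
Back-of-the-envelope, with illustrative cycle costs (both numbers are assumptions, not measurements): if an uncontended lock/unlock pair costs ~100 cycles and a TLB miss ~30 cycles, then

  lock per allocation:        ~100    cycles/call
  TLB miss per 64 calls:   30 / 64 ~=  0.5 cycles/call

so even a pessimistic TLB-miss cost leaves the thread-local list well ahead.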

To avoid thrashing, your idea of clustering free things on lists would help -- then all the things could be in one page. I had a patch a long time ago that did this (it grabbed a whole page at a time and carved it up); a sketch follows.
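
Something along these lines (a sketch with hypothetical names, assuming posix_memalign and a 4K page; the old patch itself isn't at hand):

#include <stdlib.h>
#include <stddef.h>

#define PAGE_SIZE   4096
#define THING_SIZE  32              /* e.g. a 32-byte GC thing */

typedef struct GCThing { struct GCThing *next; } GCThing;

/* Carve one page-aligned page into a free list; every thing on the
   list then shares a single page, hence a single TLB entry. */
static GCThing *
CarvePage(void)
{
    char *page;
    size_t i, nthings = PAGE_SIZE / THING_SIZE;
    GCThing *head = NULL;

    if (posix_memalign((void **)&page, PAGE_SIZE, PAGE_SIZE) != 0)
        return NULL;
    for (i = 0; i < nthings; i++) {
        GCThing *t = (GCThing *)(page + i * THING_SIZE);
        t->next = head;
        head = t;
    }
    return head;
}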

If JSThread is one too many layers of indirection, we could move its freelist to each JSContext (there are many contexts in Firefox, but only a few are "busy" at a time).

Lots to try, little time. Suggest first tuning the existing param over a wide range (sure, try smaller too) and benchmarking if noise is low enough.

/be
With bug 400902 landed, it would be interesting to check this again. The new code for doubles has the same structure with or without JS_THREADSAFE; the only difference is an extra lock/unlock. So now the benchmarks for doubles should show the pure overhead of JS_LOCK without any extra cache pressure.
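
In outline, the shape is something like this (TryFreeList and RefillDoubleFreeList are hypothetical stand-ins for the real jsgc.c code):

#include "jscntxt.h"    /* JSContext, jsdouble */
#include "jslock.h"     /* JS_LOCK_GC, JS_UNLOCK_GC */

extern jsdouble *TryFreeList(JSContext *cx);
extern jsdouble *RefillDoubleFreeList(JSContext *cx);

static jsdouble *
NewDouble(JSContext *cx)
{
    /* Fast path: identical code with or without JS_THREADSAFE. */
    jsdouble *dp = TryFreeList(cx);
    if (!dp) {
#ifdef JS_THREADSAFE
        JS_LOCK_GC(cx->runtime);        /* the only extra work */
#endif
        dp = RefillDoubleFreeList(cx);
#ifdef JS_THREADSAFE
        JS_UNLOCK_GC(cx->runtime);
#endif
    }
    return dp;
}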
(In reply to comment #5)
> To get rid of js_NewGCThing hitting rt->gcLock, please try raising
> 
> #define MAX_THREAD_LOCAL_THINGS 8
> 
> A lot (64?).

I tried this with values of 64, 32, 16, and 8. No significant difference on any test. However, I will check again when bug 400902 lands.
(In reply to comment #3)
> Created an attachment (id=305015) [details]
> Sunspider comparison showing JS_THREADSAFE 9.6% slower
> 
> The bug is confirmed at least.  On Mac, anyway.
> 

The most interesting line here is

3bit-bits-in-byte: 1.146x as fast      69.8ms +/- 0.4%     60.9ms +/- 0.4% 

This benchmark never leaves the interpreter during its execution and never takes any locks. So here is that cache spoiler again. The problem is the scale of the speedup: if alignment changes alone can affect the benchmark to such an extent, then they could undermine the conclusion that JS_THREADSAFE introduces a non-trivial slowdown.

On the other hand, my recent experience suggests that the bytecode that 3bit-bits-in-byte executes on the current trunk happens to have optimal alignment. Change it just a little, and the benchmark slows down by up to 20%.
Flags: tracking1.9? → blocking1.9?
Not going to block on this, but, Jason, please continue investigation.
Flags: blocking1.9? → blocking1.9-
I profiled Firefox 3.0 beta 5 pre on a Nokia N810.
FF without JS_THREADSAFE is about 15% faster than with it.

The result shows that the gap between builds with and without JS_THREADSAFE is a little bigger on an embedded device than on the desktop.

SUT (System Under Test)
  - Hardware : Nokia N810
     - Kernel Version : Linux Nokia-N810-50-2 2.6.21-omap1
     - CPU : Texas Instruments OMAP 2420 (400MHz)
     - RAM : 256MB RAM, 128Mb Flash
  - Test Date : 2008-03-19
  - Build Identifier : Mozilla/5.0 (X11; U; Linux armv6l; en-US; rv:1.9b5pre)
                       Gecko/2008031812 Minefield/3.0b5pre
(In reply to comment #13)
> 
> The result shows that the gap between builds with and without
> JS_THREADSAFE is a little bigger on an embedded device than on the
> desktop.

We need to add ARMv6 atomic operations to NSPR and jslock.c.

This will probably narrow the gap.
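
For reference, an ARMv6 atomic increment built on the ldrex/strex exclusive-access instructions looks roughly like this (a sketch of the technique, not the actual NSPR patch):

/* Atomically increment *ptr and return the new value. */
static int
ARMv6_AtomicIncrement(volatile int *ptr)
{
    int newval, failed;
    __asm__ __volatile__(
        "1: ldrex   %0, [%2]    \n"  /* load-exclusive current value        */
        "   add     %0, %0, #1  \n"  /* bump it                             */
        "   strex   %1, %0, [%2]\n"  /* store-exclusive; %1 == 0 on success */
        "   cmp     %1, #0      \n"
        "   bne     1b          \n"  /* lost the reservation: retry         */
        : "=&r" (newval), "=&r" (failed)
        : "r" (ptr)
        : "cc", "memory");
    return newval;
}

The N810's OMAP 2420 is single-core, so no extra barriers are needed there; multi-core ARMv6 parts would also want a memory barrier (the CP15 c7 operation on ARMv6) around the operation.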
I'm not working on this.

I agree it shouldn't block 1.9 and agree with comment 14 (separate bug, though, and Cc the mobile team).

Cc-ing Jason Evans.  jemalloc locks an arena on every alloc and every free.  It has some #ifdef MALLOC_BALANCE code to minimize lock contention.  (But Mozilla isn't even using it; there's not enough contention in Mozilla to make it pay.)
Assignee: jorendorff → general
All this locking stuff has been removed.  Threadsafe shells are now a good bit faster (background sweeping, background free, compilation, etc.).
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → WORKSFORME