Closed Bug 418436 Opened 16 years ago Closed 12 years ago

JavaScript engine is non-trivially slower with JS_THREADSAFE than without

Categories

(Core :: JavaScript Engine, defect)

Version: Other Branch
Type: defect
Priority: Not set
Severity: normal

Tracking

Status: RESOLVED WORKSFORME

People

(Reporter: jorendorff, Unassigned)

Details

(Keywords: perf)

Attachments

(3 files)

Spun off from bug 412985 comment 17.
See bug 412985 comment #22 for sayrer's suggestion that the thing to test is the js shell with and without JS_THREADSAFE defined.

See bug 54743 comment 94 for test results from when the request-amortized "fly-weight" locking went in; they showed a negligible effect of JS_THREADSAFE on performance, for that one benchmark.

Would like an investigation for 1.9, meaning soon.

/be
Flags: wanted1.9+
Flags: blocking1.9?
Summary: JavaScript engine is slower with JS_THREADSAFE than without → JavaScript engine is non-trivially slower with JS_THREADSAFE than without
Keywords: perf
jorendorff says he'll take a look
Assignee: general → jorendorff
The bug is confirmed at least.  On Mac, anyway.
This screenshot shows a profile created with Shark's "File->Compare..." menu option.  Unfortunately this compares the *proportion* of samples in each function rather than the *total* number of samples.  The latter would be more telling.  I found no way to generate those numbers via Shark.app, though, nor any way to access the raw sample data.  Suggestions?

For example, js_Interpret shows a difference of -1.2%. This doesn't mean that js_Interpret runs faster with JS_THREADSAFE--it actually gets slower. I think the -1.2% number means that js_Interpret itself was at the top of the stack in 46.9% of all non-threadsafe samples and 44.5% of all threadsafe samples, and the column shows the difference of the two proportions (though 44.5% - 46.9% = -2.4%, which doesn't quite match the -1.2% shown... rounding error?).

The comparison shows a grand total of about 1.4% of the total running time of the JS_THREADSAFE test being spent in synchronization-related functions (pthread_mutex_lock, _PR_Darwin_x86_AtomicIncrement, etc.) that are only used in JS_THREADSAFE builds. It shows a +0.8% change in js_NewGCThing, which also makes sense.
To get rid of js_NewGCThing hitting rt->gcLock, please try raising

#define MAX_THREAD_LOCAL_THINGS 8

A lot (64?).

Please find the hot pthread_mutex_lock stacks once you've tuned the gcLock out of the profile. If anything's left, we need to see why it isn't thinlocking or -able.
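
As a minimal sketch of why raising the limit helps (hypothetical names; the real js_NewGCThing in jsgc.c is more involved): the gcLock is only taken on the slow path that refills the per-thread free list, so a list of N things amortizes one lock/unlock over N allocations.

#include <stddef.h>

#define MAX_THREAD_LOCAL_THINGS 8   /* the knob to tune: try 16, 32, 64 */

typedef struct GCThing { struct GCThing *next; } GCThing;
typedef struct ThreadData { GCThing *freeList; } ThreadData;

/* Stand-ins for the real gcLock and arena code (hypothetical). */
extern void LockGC(void);
extern void UnlockGC(void);
extern GCThing *CarveBatch(size_t nthings);  /* returns list head */

static GCThing *
NewGCThing(ThreadData *td)
{
    GCThing *thing = td->freeList;
    if (!thing) {
        /* Slow path, hit once per MAX_THREAD_LOCAL_THINGS allocations:
           take the shared lock and carve a fresh batch off an arena. */
        LockGC();
        thing = CarveBatch(MAX_THREAD_LOCAL_THINGS);
        UnlockGC();
        if (!thing)
            return NULL;    /* the real engine would run the GC here */
    }
    td->freeList = thing->next;
    return thing;
}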

/be
(In reply to comment #5)
> To get rid of js_NewGCThing hitting rt->gcLock, please try raising
> 
> #define MAX_THREAD_LOCAL_THINGS 8
> 
> A lot (64?).

I tried this recently in the scope of bug 400902. With 32 doubles per free list, things were slower; I think this was due to bad interaction with cache prefetch. Now consider JSObject. With MAX_THREAD_LOCAL_THINGS == 8, populating the free list from a freshly allocated arena touches 256 object bytes + 8 flag bytes, 5 cache lines in total on the latest CPUs. With the value set to 64 that becomes 33 cache lines, and most of those lines will be evicted from L1 before the objects are actually allocated.
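
For concreteness, the cache-line arithmetic (assuming 32-byte objects, which the 256/8 figure implies, and 64-byte cache lines, which matches the 5-line figure):

   8 things * 32 bytes =  256 object bytes; +  8 flag bytes =  264 bytes
  ceil(264 / 64)  =  5 cache lines

  64 things * 32 bytes = 2048 object bytes; + 64 flag bytes = 2112 bytes
  2112 / 64       = 33 cache lines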


(In reply to comment #6)
> I tried this recently in the scope of bug 400902. With 32 doubles per
> free list, things were slower; I think this was due to bad interaction
> with cache prefetch. Now consider JSObject. With MAX_THREAD_LOCAL_THINGS
> == 8, populating the free list from a freshly allocated arena touches
> 256 object bytes + 8 flag bytes, 5 cache lines in total on the latest
> CPUs. With the value set to 64 that becomes 33 cache lines, and most of
> those lines will be evicted from L1 before the objects are actually
> allocated.

Given this, I will even suggest trying to decrease MAX_THREAD_LOCAL_THINGS, or even disabling it, to see the difference. Since the thread-local lists live in JSThread, which lives in a different page than the rest of the data, accessing JSThread costs a TLB entry that other things could use. I.e., the problem may not be the lock, but rather the lack of an efficient implementation of thread locals.
It would be good to get line-level profiling for sure, but if we are thrashing the TLB on only 1 out of every 64 js_NewGCThing calls, we may still win.
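
Back-of-the-envelope, with illustrative cycle costs (both numbers are assumptions, not measurements): if an uncontended lock/unlock pair costs ~100 cycles and a TLB miss ~30 cycles, then

  lock per allocation:        ~100    cycles/call
  TLB miss per 64 calls:   30 / 64 ~=  0.5 cycles/call

so even a pessimistic TLB-miss cost leaves the thread-local list well ahead.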

To avoid thrashing, your idea of clustering free things on lists would help -- then all the things could be in one page. I had a patch a long time ago that did this (it grabbed a whole page at a time and carved it up); a sketch follows.
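
Something along these lines (a sketch with hypothetical names, assuming posix_memalign and a 4K page; the old patch itself isn't at hand):

#include <stdlib.h>
#include <stddef.h>

#define PAGE_SIZE   4096
#define THING_SIZE  32              /* e.g. a 32-byte GC thing */

typedef struct GCThing { struct GCThing *next; } GCThing;

/* Carve one page-aligned page into a free list; every thing on the
   list then shares a single page, hence a single TLB entry. */
static GCThing *
CarvePage(void)
{
    char *page;
    size_t i, nthings = PAGE_SIZE / THING_SIZE;
    GCThing *head = NULL;

    if (posix_memalign((void **)&page, PAGE_SIZE, PAGE_SIZE) != 0)
        return NULL;
    for (i = 0; i < nthings; i++) {
        GCThing *t = (GCThing *)(page + i * THING_SIZE);
        t->next = head;
        head = t;
    }
    return head;
}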

If JSThread is one too many layers of indirection, we could move its freelist to each JSContext (there are many contexts in Firefox, but only a few are "busy" at a time).

Lots to try, little time. Suggest first tuning the existing param over a wide range (sure, try smaller too) and benchmarking if noise is low enough.

/be
With bug 400902 landed, it would be interesting to check this again. The new code for doubles has the same structure with or without JS_THREADSAFE; the only difference is an extra lock/unlock. So now the benchmarks for doubles should show the pure overhead of JS_LOCK without any extra cache pressure.
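
In outline, the shape is something like this (TryFreeList and RefillDoubleFreeList are hypothetical stand-ins for the real jsgc.c code):

#include "jscntxt.h"    /* JSContext, jsdouble */
#include "jslock.h"     /* JS_LOCK_GC, JS_UNLOCK_GC */

extern jsdouble *TryFreeList(JSContext *cx);
extern jsdouble *RefillDoubleFreeList(JSContext *cx);

static jsdouble *
NewDouble(JSContext *cx)
{
    /* Fast path: identical code with or without JS_THREADSAFE. */
    jsdouble *dp = TryFreeList(cx);
    if (!dp) {
#ifdef JS_THREADSAFE
        JS_LOCK_GC(cx->runtime);        /* the only extra work */
#endif
        dp = RefillDoubleFreeList(cx);
#ifdef JS_THREADSAFE
        JS_UNLOCK_GC(cx->runtime);
#endif
    }
    return dp;
}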
(In reply to comment #5)
> To get rid of js_NewGCThing hitting rt->gcLock, please try raising
> 
> #define MAX_THREAD_LOCAL_THINGS 8
> 
> A lot (64?).

I tried this with values of 64, 32, 16, and 8. No significant difference on any test. However, I will check again when bug 400902 lands.
(In reply to comment #3)
> Created an attachment (id=305015) [details]
> Sunspider comparison showing JS_THREADSAFE 9.6% slower
> 
> The bug is confirmed at least.  On Mac, anyway.
> 

The most interesting line here is

3bit-bits-in-byte: 1.146x as fast      69.8ms +/- 0.4%     60.9ms +/- 0.4% 

This benchmark never leaves the interpreter during its execution and never takes any locks. So here is that cache spoiler again. The problem is the scale of the speedup: if alignment changes alone can affect the benchmark to such an extent, then they could undermine the conclusion that JS_THREADSAFE introduces a non-trivial slowdown.

On the other hand, my recent experience suggests that the bytecode that 3bit-bits-in-byte executes on the current trunk happens to have optimal alignment. Change it just a little, and the benchmark slows down by up to 20%.
Flags: tracking1.9? → blocking1.9?
Not going to block on this, but, Jason, please continue investigation.
Flags: blocking1.9? → blocking1.9-
I profiled Firefox 3.0 beta 5 pre on a Nokia N810.
FF without JS_THREADSAFE is about 15% faster than with it.

The result shows that the gap between builds with and without JS_THREADSAFE is a little bigger on an embedded device than on the desktop.

SUT (System Under Test)
  - Hardware : Nokia N810
     - Kernel Version : Linux Nokia-N810-50-2 2.6.21-omap1
     - CPU : Texas Instruments OMAP 2420 (400MHz)
     - RAM : 256MB RAM, 128Mb Flash
  - Test Date : 2008-03-19
  - Build Identifier : Mozilla/5.0 (X11; U; Linux armv6l; en-US; rv:1.9b5pre)
                       Gecko/2008031812 Minefield/3.0b5pre
(In reply to comment #13)
> 
> The result shows that the gap between builds with and without
> JS_THREADSAFE is a little bigger on an embedded device than on the
> desktop.

We need to add ARMv6 atomic operations to NSPR and jslock.c.

This will probably narrow the gap.
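
For reference, an ARMv6 atomic increment built on the ldrex/strex exclusive-access instructions looks roughly like this (a sketch of the technique, not the actual NSPR patch):

/* Atomically increment *ptr and return the new value. */
static int
ARMv6_AtomicIncrement(volatile int *ptr)
{
    int newval, failed;
    __asm__ __volatile__(
        "1: ldrex   %0, [%2]    \n"  /* load-exclusive current value        */
        "   add     %0, %0, #1  \n"  /* bump it                             */
        "   strex   %1, %0, [%2]\n"  /* store-exclusive; %1 == 0 on success */
        "   cmp     %1, #0      \n"
        "   bne     1b          \n"  /* lost the reservation: retry         */
        : "=&r" (newval), "=&r" (failed)
        : "r" (ptr)
        : "cc", "memory");
    return newval;
}

The N810's OMAP 2420 is single-core, so no extra barriers are needed there; multi-core ARMv6 parts would also want a memory barrier (the CP15 c7 operation on ARMv6) around the operation.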
I'm not working on this.

I agree it shouldn't block 1.9 and agree with comment 14 (separate bug, though, and Cc the mobile team).

Cc-ing Jason Evans.  jemalloc locks an arena on every alloc and every free.  It has some #ifdef MALLOC_BALANCE code to minimize lock contention.  (But Mozilla isn't even using it; there's not enough contention in Mozilla to make it pay.)
Assignee: jorendorff → general
All this locking stuff has been removed.  Threadsafe shells are now a good bit faster (background sweeping, background free, compilation, etc.).
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → WORKSFORME