Closed Bug 312238 Opened 19 years ago Closed 19 years ago

Use per-thread freelists to speed up JS_XDRString

Categories

(Core :: JavaScript Engine, defect, P1)

defect

Tracking

()

RESOLVED FIXED
mozilla1.9alpha1

People

(Reporter: bryner, Assigned: feng.qian.moz)

References

Details

(Keywords: perf)

Attachments

(9 files, 11 obsolete files)

134.04 KB, patch
Details | Diff | Splinter Review
86.77 KB, text/plain
Details
50.17 KB, patch
Details | Diff | Splinter Review
4.21 KB, patch
Details | Diff | Splinter Review
40.82 KB, patch
Details | Diff | Splinter Review
39.16 KB, patch
Details | Diff | Splinter Review
15.08 KB, patch
Details | Diff | Splinter Review
13.07 KB, patch
brendan
: review+
feng.qian.moz
: review+
Details | Diff | Splinter Review
7.70 KB, patch
feng.qian.moz
: review+
Details | Diff | Splinter Review
Spun off from bug 279839.  Script deserialization spends a lot of time acquiring
the gc lock for JSString allocation.
Flags: testcase-
This will still be an issue with the fix for bug 321985, so I'm going to fix this soon.  Will coordinate with Igor on related jsgc.c patches.

/be
Status: NEW → ASSIGNED
Priority: -- → P1
Target Milestone: --- → mozilla1.9alpha
It won't apply cleanly -- best to hand-patch in a bottom up way, starting with the jscntxt.[ch] changes, then jsapi.c, then jsgc.c.

/be
Feng has kindly offered to take this one.

/be
Assignee: brendan → feng.qian
Status: ASSIGNED → NEW
first patch for discussion:
1. create thread private freelists
2. fill in thread private freelists when empty
3. no lock when accessing thread private freelists
Attachment #213358 - Flags: review?(brendan)
Comment on attachment 213358 [details]
create thread-local freelists to reduce the number of gc locks

I'll review this or a newer patch tomorrow -- thanks for taking this on, again.  First some quick general nits:

- http://wiki.mozilla.org/JavaScript:SpiderMonkey:Coding_Style talk about how major comment style is

01234567890123456789012345678901234567890123456789012345678901234567890123456789
    /*
     * More like this than what's seen in the patch, which often starts on the
     * very first line (the "/*\n" line two lines above).
     */

- One declaration had a side-by-side comment that overflowed to the next line, and
the preferred local style is

    int foozle;     /* write a comment that would wrap past the 80th column
                       like this, as one comment, not two */

- Also in the style guide: prevailing style does not brace then clauses of if statements where neither the if condition nor the then clause takes more than one line by itself.

More substantive comment: can we avoid duplicating so much code from js_NewGCThing in js_LocalNewGCThing?  It should be possible to inline the latter in the former, then unify common exit paths.

/be
Igor, I'm delinquent yet again with a review for your patch for bug 324278, with which any patch to this bug will conflict.  I'd appreciate your comments on this bug's patch, and on best order of landing.  You are certainly entitled to go first and let Feng pick up the pieces.  Again, my apologies for the tardy review I owe you.  If you want to go first, I'll do it tomorrow.

/be
Brendan wrote in comment #6 :
> Igor, I'm delinquent yet again with a review for your patch for bug 324278,
> with which any patch to this bug will conflict.

I would appreciate to get bug 324278 resolved first since the patches there are almost one year old and it takes efforts to sync with CVS.

>  I'd appreciate your comments on this bug's patch,

I think benchmarks that show patch's benefits are required before any landing. I spent some time implementing to use dependable strings for xdr script to share single malloced array for characters to address bug 321985 just to find out when benchmarking that the problem was caused by something else.


 and on best order of landing.  You are certainly entitled
> to go first and let Feng pick up the pieces.  Again, my apologies for the tardy
> review I owe you.  If you want to go first, I'll do it tomorrow.
> 
> /be
> 

The benchmark is a modified tablesort.js from:
http://webfx.eae.net/dhtml/tablesort/demo.html

On my Linux box: 
thread-local freelist:
sorting took 972ms

no thread-local freelist:
sorting took 1235ms

about 20% improvement.
Ok, Igor goes first (reviewing now).  Sorry, this will mean conflicts for Feng, which is all my fault.  Sorry about that, I have no good excuse.

This bug will need an updated patch after Igor lands the final patch for bug 324278.

/be
(In reply to comment #5)
> (From update of attachment 213358 [details] [edit])
> I'll review this or a newer patch tomorrow -- thanks for taking this on, again.
>  First some quick general nits:
> 
> - http://wiki.mozilla.org/JavaScript:SpiderMonkey:Coding_Style talk about how
> major comment style is
> 
> 01234567890123456789012345678901234567890123456789012345678901234567890123456789
>     /*
>      * More like this than what's seen in the patch, which often starts on the
>      * very first line (the "/*\n" line two lines above).
>      */
> 
> - One declaration had a side-by-side comment that overflowed to the next line,
> and
> the preferred local style is
> 
>     int foozle;     /* write a comment that would wrap past the 80th column
>                        like this, as one comment, not two */
> 
> - Also in the style guide: prevailing style does not brace then clauses of if
> statements where neither the if condition nor the then clause takes more than
> one line by itself.
> 
> More substantive comment: can we avoid duplicating so much code from
> js_NewGCThing in js_LocalNewGCThing?  It should be possible to inline the
> latter in the former, then unify common exit paths.

Certainly we can, but from readability viewpoint, I think a separate function makes it easier to read than a lengthy one with #ifdef and gotos. 

I made a new patch that fixes some style violations, and does a little bit code refactoring of jsgc.c. 

> 
> /be
> 
Attached patch a revised patch (obsolete) — Splinter Review
Attachment #213470 - Attachment is obsolete: true
Attachment #213491 - Flags: review?(igor.bukanov)
Attachment #213470 - Attachment is obsolete: false
Attachment #213470 - Attachment is patch: true
sorry, guys, you may got tons of spams as I am playing with the bugzilla.
Attachment #213470 - Attachment is patch: false
Attachment #213358 - Attachment is obsolete: true
Attachment #213358 - Attachment is patch: false
Attachment #213358 - Flags: review?(brendan)
(In reply to comment #12)
> sorry, guys, you may got tons of spams as I am playing with the bugzilla.

No problem.

You'll notice from the rest of js*.c that (almost always) downward goto is not considered harmful.  Code duplication (which exists elsewhere, no excuses there -- should be fixed) is harmful, for bloat and bug habitat duplication reasons.

/be

Since the patch for the bug 324278 is not ready and I will be traveling until 2006-03-08, I remove my objections against landing changes for this bug before that.
(In reply to comment #13)
> (In reply to comment #12)
> > sorry, guys, you may got tons of spams as I am playing with the bugzilla.
> 
> No problem.
> 
> You'll notice from the rest of js*.c that (almost always) downward goto is not
> considered harmful.  Code duplication (which exists elsewhere, no excuses there
> -- should be fixed) is harmful, for bloat and bug habitat duplication reasons.

How about what I did in the current patch? 

Common code in js_LocalNewGCThing and js_NewGCThing is refactored into js_PostAllocation. For performance reason, I'd like it to be inlined into js_LocalNewGCThing. Same as js_LocalNewGCThing, it is inlined into js_NewGCThing by the compiler. It can still increase code sizes because of inlining, but it is good from software engineering point of view.

One difficulty of fully inlining js_LocalNewGCThing into js_NewGCThing is that, there are two exit paths after common code, path from js_LocalNewGCThing needs not do anything, but path from failing js_LocalNewGCThing needs to unlock the GC lock:
js_NewGCThing(...)
{
#ifdef JS_THREADSAFE
   thing = js_LocalNewGCThing(...);
   if (thing)
      goto succ;
#endif

   /* allocation from global freelist or the last arena ... */
   for (...) {
      ...
      // try gc
      ...   
   }

#ifdef JS_THREADSAFE
succ:
#endif
   lsr = ...

   JS_UNLOCK_GC(rt);            // how to avoid this statement if the control 
                                // flow came from 'goto succ'?
   return thing;
fail:
   ...
}   

> 
> /be
> 

It looks like JS_INLINE is defined usefully only for Windows.  But the patch doesn't compile on Linux for me at all:

jsgc.c: In function `js_LocalNewGCThing':
jsgc.c:675: warning: ISO C90 forbids mixed declarations and code
jsgc.c:724: warning: implicit declaration of function `js_PostAllocation'
jsgc.c:724: warning: assignment makes pointer from integer without a cast
jsgc.c: At top level:
jsgc.c:744: error: conflicting types for 'js_PostAllocation'
jsgc.c:724: error: previous implicit declaration of 'js_PostAllocation' was heremake: *** [jsgc.o] Error 1

The old patch (attachment 212176 [details] [diff] [review]) used a flag to condition gc-unlocking in the common exit path from the allocator.  I continue to believe (dinosaur that I am) that the "if" statement can be as good software engineering, and better code size and performance, often enough. :-/

I'd like to keep reviewing the patch. Igor is welcome to as well. Style nits are minor but non-zero, don't mind them -- they will get picked by someone before checkin.  My main concerns are (a) does it work on all platforms; (b) does it avoid gratuitous code size increases.  These are legitimate concerns.

/be
Attached patch reviewed patch again (obsolete) — Splinter Review
eliminated js_LocalNewThing and js_PostAllocation, code are embedded into js_NewGCThing using #ifdef and gotos. 

js binary produced by gcc -O2 increased about 4K on the base of 595K. Performance improvement of the benchmark I attached is about 15% to 20%.

The error you saw is probably due to the declaration of a local variable in a block. It should be gone.
Attachment #213491 - Attachment is obsolete: true
Attachment #213934 - Flags: review?(brendan)
Attachment #213491 - Flags: review?(igor.bukanov)
Comment on attachment 213934 [details] [diff] [review]
reviewed patch again

This is good, thanks!  Rather than go over it in minute detail I will post a patch based on it, with a few cleanups and minor fixes.

/be
Attachment #213934 - Flags: review?(brendan) → review+
Attached patch revised version of last patch (obsolete) — Splinter Review
I haven't had the chance even to test-compile this JS_THREADSAFE, but I did want to fix a few things (a C++ style declaration of lastThing in js_NewGCThing -- note also that I replaced that with a double-indirect lastptr to eliminate the need for basis case if-tests; I also eliminated dependent cx->thread->id loads in jslock.c and jslock.h where before only cx->thread was loaded).  Comments and testing are most welcome!

I hope to have time tonight to get my build together and test this.  Sorry for any blunders, all credit to you for patching this bug.

/be
Attachment #215928 - Flags: review?(feng.qian)
One more question: the heuristic of using GC_NBYTES_MAX / nbytes as the maximum number of things to grab off the global freelist for nbytes is quite small.  Feng, what do you think of 10x this ratio?  That would be between 100 and 10 things, ranging over 8 to 80 nbytes values.

/be
Status: NEW → RESOLVED
Closed: 19 years ago
Resolution: --- → FIXED
(In reply to comment #20)
> One more question: the heuristic of using GC_NBYTES_MAX / nbytes as the maximum
> number of things to grab off the global freelist for nbytes is quite small. 
> Feng, what do you think of 10x this ratio?  That would be between 100 and 10
> things, ranging over 8 to 80 nbytes values.

You are right, GC_NBYTES_MAX / nbytes is a mistake, I think what I wanted is GC_PAGE_SIZE/nbytes, but suggestion sounds reasonable to me.

> 
> /be
> 

Brendan, you've marked this bug as RESOLVED FIXED, did you really mean to do that?
Somehow, typeahead find (I enable it and usually swear by it) can lead to resolving bugs without meaning to.  Sorry, reopening.  I still need to get VS8 or at least 7.1 installed on my lifeboard WinXP partition (Linux one is pretty badly damaged) and get the patch tested and landed.

/be
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Typeahead find + my fat fingers, I mean.  Still like the feature, wish it could be brendan-proofed.

/be
Status: REOPENED → ASSIGNED
Latest version doubly-links JSContexts associated with a given JSThread, to help js_GC deal with context-happy embeddings such as Mozilla more efficiently.

Also unroll the current (GC'ing) thread's iteration from the loop to clear all thread-local gcFreeLists, to reduce redundant memset-to-zero cycles, again for the many-contexts-on-main-thread Gecko apps.

/be
Attachment #213934 - Attachment is obsolete: true
Attachment #215928 - Attachment is obsolete: true
Attachment #216190 - Flags: review?(feng.qian)
Attachment #215928 - Flags: review?(feng.qian)
Comment on attachment 216190 [details] [diff] [review]
latest, still need vs8 and JS_THREADSAFE build to test

> void *
> js_NewGCThing(JSContext *cx, uintN flags, size_t nbytes)
> {
...
>+#ifdef JS_THREADSAFE
>+    gcLocked = JS_FALSE;
>+    JS_ASSERT(cx->thread);
>+    flbase = cx->thread->gcFreeLists;
>+    JS_ASSERT(flbase);
>+    thing = flbase[flindex];
>+    if (thing) {
>+        flagp = thing->flagp;
>+        flbase[flindex] = thing->next;
>+        METER(rt->gcStats.localalloc++);  /* this is not thread-safe */
>+        goto success;
>+    }

....

>@@ -1773,17 +1875,16 @@ js_GC(JSContext *cx, uintN gcflags)
...
>     /*
>      * Free phase.
>      * Free any unused arenas and rebuild the JSGCThing freelist.
>      */
>+#ifdef JS_THREADSAFE
>+    /*
>+     * Set all thread local freelists to NULL. We may visit a thread's
>+     * freelist more than once. To avoid redundant clearing we unroll the
>+     * current thread's step.
>+     */
>+    memset(cx->thread->gcFreeLists, 0, sizeof cx->thread->gcFreeLists);
>+    iter = NULL;
>+    while ((acx = js_ContextIterator(rt, JS_FALSE, &iter)) != NULL) {
>+        if (acx->thread == cx->thread)
>+            continue;
>+        memset(acx->thread->gcFreeLists, 0, sizeof acx->thread->gcFreeLists);
>+    }
>+#endif

AFAICS there is a race here between 
    flbase[flindex] = thing->next 
and
    memset(acx->thread->gcFreeLists, 0, sizeof acx->thread->gcFreeLists);

Or did I miss something?
(In reply to own comment #25)
> 
> Or did I miss something?
> 

Please ignore this: when gc runns, no other thread threads must wait for it. So the code is OK.

(In reply to own comment #25)
> 
> Or did I miss something?
> 

Please ignore this: when gc runns, no other thread can be active inside BeginRequest. So the code is OK.
Comment on attachment 216190 [details] [diff] [review]
latest, still need vs8 and JS_THREADSAFE build to test

> void *
> js_NewGCThing(JSContext *cx, uintN flags, size_t nbytes)
> {
...
>+    if (thing) {
>+        flagp = thing->flagp;
>+        flbase[flindex] = thing->next;
>+        METER(rt->gcStats.localalloc++);  /* this is not thread-safe */
>+        goto success;
>+    }
...
> #ifdef TOO_MUCH_GC
> #ifdef WAY_TOO_MUCH_GC
>     rt->gcPoke = JS_TRUE;
> #endif
>     js_GC(cx, GC_KEEP_ATOMS | GC_ALREADY_LOCKED);
>-    tried_gc = JS_TRUE;
>+    triedGC = JS_TRUE;
> #else
>-    tried_gc = JS_FALSE;
>+    triedGC = JS_FALSE;
> #endif

This removes almost all usability of TOO_MUCH_GC with JS_THREADSAFE. Should the per-thread lists be used only when TOO_MUCH_GC/WAY_TOO_MUCH_GC are not defined?
Comment on attachment 216190 [details] [diff] [review]
latest, still need vs8 and JS_THREADSAFE build to test

Here is another problem with the patch. When thing comes from the thread list, rt->gcPrivateBytes/gcBytes would be modified outside the lock.
This is a quick and dirty patch to test performance of using compare-and-swap instructions to grab things from the free lock.
Slightly less quicker version with correct order of arguments to CompareAndSwap.
Attachment #216227 - Attachment is obsolete: true
Comment on attachment 216229 [details] [diff] [review]
Using compare-and-swap for freelist with fixes

No need for JS_CompareAndSwap, that would be a public API by naming convention (JS_ not js_), for which there is no call.  We already have js_CompareAndSwap.

We can't use js_CompareAndSwap if the target architecture does not support it (see jslock.c), so more #if testing than defined(JS_THREADSAFE) is needed (not a big deal for top-tier target platforms).

More substantively, the original patch I hacked long ago actually allocated an arena per thread and filled the thread-local freelist from it.  The patch that feng revived and cleaned up nicely with NSPR thread-private storage, and this patch, rely on the global freelist being non-empty often enough to amortize locking costs.  Is that likely to be the case?

/be
(In reply to comment #33)
> ... rely on the global freelist being non-empty often enough to amortize
> locking costs.  Is that likely to be the case?

See bug 331770 which makes JS_GCMETER to include stats about the average free list density.
(In reply to comment #33)
> (From update of attachment 216229 [details] [diff] [review] [edit])
> No need for JS_CompareAndSwap, that would be a public API by naming convention
> (JS_ not js_), for which there is no call.  We already have js_CompareAndSwap.
> 
> We can't use js_CompareAndSwap if the target architecture does not support it
> (see jslock.c), so more #if testing than defined(JS_THREADSAFE) is needed (not
> a big deal for top-tier target platforms).
> 
> More substantively, the original patch I hacked long ago actually allocated an
> arena per thread and filled the thread-local freelist from it.  The patch that
> feng revived and cleaned up nicely with NSPR thread-private storage, and this
> patch, rely on the global freelist being non-empty often enough to amortize
> locking costs.  Is that likely to be the case?

Not solely rely on the global freelist. If the global list has free things, fill the local-free lists from it. Otherwise, take the free space from the last arena. 



> 
> /be
> 

(In reply to comment #30)
> (From update of attachment 216190 [details] [diff] [review] [edit])
> Here is another problem with the patch. When thing comes from the thread list,
> rt->gcPrivateBytes/gcBytes would be modified outside the lock.
> 

Yes, it is a problem. How about update gcPrivateBytes/gcBytes when filling local-free list? so we count bytes in local-free list as allocated bytes. 
(In reply to comment #36)
> (In reply to comment #30)
> > (From update of attachment 216190 [details] [diff] [review] [edit] [edit])
> > Here is another problem with the patch. When thing comes from the thread list,
> > rt->gcPrivateBytes/gcBytes would be modified outside the lock.
> > 
> 
> Yes, it is a problem. How about update gcPrivateBytes/gcBytes when filling
> local-free list? so we count bytes in local-free list as allocated bytes. 
> 

Another possibility is to count gcBytes by arena allocations under assumption that if a thing is on the free list, then still its memory is allocated. This does not change the semantic of gcMaxBytes checks in js_NewGCThing but the behavior of JS_MaybeGC would be affected. Given the stats from bug 331770 I think changing the condition in JS_MaybeGC 

    if ((bytes > 8192 && bytes > lastBytes + lastBytes / 2) ||
        rt->gcMallocBytes > rt->gcMaxMallocBytes) {

to just

    if ((bytes > 8192 && bytes > lastBytes) ||
        rt->gcMallocBytes > rt->gcMaxMallocBytes) {

would restore the balance.
Brendan, what's the update on this bug? are you waiting for my review? or should I take your patch and polish it?
Sorry, I was going to check in, then Igor provocatively suggested just using compare-and-swap on the runtime's freelist for the desired size-class.  That led to the question of "freelist density", although even with the measurements from bug 331770, it was not clear what approach would win in runtime.

What we could really use is some Firefox performance testing of attachment 216190 [details] [diff] [review] vs. attachment 216229 [details] [diff] [review].  Feng, Igor: I'm traveling next week and still recovering my disk, so it is up to you guys.  Thanks,

/be
(In reply to comment #39)
> Sorry, I was going to check in, then Igor provocatively suggested just using
> compare-and-swap on the runtime's freelist for the desired size-class.  That
> led to the question of "freelist density", although even with the measurements
> from bug 331770, it was not clear what approach would win in runtime.
> 
> What we could really use is some Firefox performance testing of attachment
> 216190 [edit] vs. attachment 216229 [details] [diff] [review] [edit].  Feng, Igor: I'm traveling next week and still
> recovering my disk, so it is up to you guys.  Thanks,

I remember I did such a comparison study a while ago using the benchmark I attached. Thread-local freelist was the winner, although two attachments were not conflicting. We could verify it again. 

> 
> /be
> 

(In reply to comment #39)
> Sorry, I was going to check in, then Igor provocatively suggested just using
> compare-and-swap on the runtime's freelist for the desired size-class.  That
> led to the question of "freelist density", although even with the measurements
> from bug 331770, it was not clear what approach would win in runtime.
> 
> What we could really use is some Firefox performance testing of attachment
> 216190 [edit] vs. attachment 216229 [details] [diff] [review] [edit].  Feng, Igor: I'm traveling next week and still
> recovering my disk, so it is up to you guys.  Thanks,
> 

To test the performance speedup of using think lock, I simply replace all JS_LOCK_GC and JS_UNLOCK_GC by js_CompareAndSwap. Use the benchmark I attached above, but repeated 10 times, and remove one iteration that causes GC. Here are some numbers:

PR_Lock:
  avg 1197 ms
CompareAndSwap
  avg 1062 ms  (11.3%)
Local freelist
  avg 958 ms   (20.0%)





> /be
> 

(In reply to comment #39)
> Sorry, I was going to check in, then Igor provocatively suggested just using
> compare-and-swap on the runtime's freelist for the desired size-class.  That
> led to the question of "freelist density", although even with the measurements
> from bug 331770, it was not clear what approach would win in runtime.
> 
> What we could really use is some Firefox performance testing of attachment
> 216190 [edit] vs. attachment 216229 [details] [diff] [review] [edit].  Feng, Igor: I'm traveling next week and still
> recovering my disk, so it is up to you guys.  Thanks,
> 
> /be
> 

(In reply to comment #39)
> Sorry, I was going to check in, then Igor provocatively suggested just using
> compare-and-swap on the runtime's freelist for the desired size-class.  That
> led to the question of "freelist density", although even with the measurements
> from bug 331770, it was not clear what approach would win in runtime.
> 
> What we could really use is some Firefox performance testing of attachment
> 216190 [edit] vs. attachment 216229 [details] [diff] [review] [edit].  Feng, Igor: I'm traveling next week and still
> recovering my disk, so it is up to you guys.  Thanks,
> 
> /be
> 

(In reply to comment #41)
> PR_Lock:
>   avg 1197 ms
> CompareAndSwap
>   avg 1062 ms  (11.3%)
> Local freelist
>   avg 958 ms   (20.0%)
> 

Retrospectively this is what one would expect since PR_Lock also uses cmp-and-swap when the lock is uncontested, but it calls the function twice. Thus  speedup from mostly lock-free solution should be slightly less then 2 times compared with cmp-and-swap solution.

Still it would be nice to split the patch into 2 parts, infrastructure and implementation, since it would be possible to test yet another free list alternative.

Currently GC assembles the free list in a separated scan-for-free-things loop after it freed things in the sweep loop. But what about moving this scanning into js_NewGCThing? I.e. the sweep loop would need just to remove arenas that becomes empty from arena list and then js_GC instead of the scanning loop would just destroy empty arenas.

Then js_NewGCThing has to do scanning when it allocates things. The scanning can be done mostly lock-free since each thread can reserve a range to scan for itself and get things from there without using any locks. Moreover, since such reservation can be done using compare-and-swap, one can get lock-free allocation of all free things in the arena list.
I filed bug 333236 about not-building-free-list optimization.
Blocks: 203448
Igor, feng, how is this for "just the infrastructure"?  It can go into the trunk at any time if you both vouch for it working.  I will check in if I can, but I am traveling this week (ECMA TC39 meeting in Geneva).

/be
Attachment #217941 - Flags: review?(feng.qian.moz)
(In reply to comment #44)
> Created an attachment (id=217941) [edit]
> just the infrastructure, with #if 0 JSThread.gcFreeLists decl
> 
> Igor, feng, how is this for "just the infrastructure"?  It can go into the
> trunk at any time if you both vouch for it working.  I will check in if I can,
> but I am traveling this week (ECMA TC39 meeting in Geneva).

This is nice as now I can try to put lockless allocation on top of bug 333236. Too bad I have to resolve a hardware issue first: screen on my 6 years old notebook died and I am in a processess of odering a new one...  
The previous patch did not compile against Makefile.ref, so here is something that does.

Plus I have a question: previously all nspr thread-related functionality was hidden behind macros/functions from jslock.c. This patch breaks that tradition and calls nspr functions directly from jsapi.c and jscntx.c. Is this intentional?
(In reply to comment #46)
> Created an attachment (id=218200) [edit]
> just the infrastructure with fixes
> 
> The previous patch did not compile against Makefile.ref, so here is something
> that does.
> 
> Plus I have a question: previously all nspr thread-related functionality was
> hidden behind macros/functions from jslock.c. This patch breaks that tradition
> and calls nspr functions directly from jsapi.c and jscntx.c. Is this
> intentional? 

Not intentional, at least from my part. I also didn't realiaze we hide nspr thread-related functionality. 

> 

(In reply to comment #47)
> Not intentional, at least from my part. I also didn't realiaze we hide nspr
> thread-related functionality. 

We have, but mainly to avoid #ifdef JS_THREADSAFE proliferation and minimize code that is conditionally compiled for JS_THREADSAFE.  E.g., via the JS_ACQUIRE_LOCK and JS_RELEASE_LOCK macros, code can be written thus:

    JS_ACQUIRE_LOCK(rt->deflatedStringCacheLock);
    hep = JS_HashTableRawLookup(rt->deflatedStringCache, hash, str);
    he = *hep;
    if (he) {
#ifdef DEBUG
        rt->deflatedStringCacheBytes -= JSSTRING_LENGTH(str);
#endif
        free(he->value);
        JS_HashTableRawRemove(rt->deflatedStringCache, hep, he);
    }
    JS_RELEASE_LOCK(rt->deflatedStringCacheLock);

(from jsstr.c).

With PR_NewThreadPrivateIndex, PR_GetThreadPrivate, and PR_SetThreadPrivate, there is no way to "macro-stub" such calls, and they need to occur within #ifdef JS_THREADSAFE blocks anyway.  So for these NSPR APIs, a macro layer in jslock.h does not pay for itself.

People do try to use SpiderMonkey with JS_THREADSAFE defined but without NSPR, but they perforce end up defining their own PR_* workalike subset of NSPR (sometimes called "miniNSPR", and possibly something that will be contributed as an optional subdirectory of js/src).

I checked in the last attachment, crediting all hands with Feng first -- thanks, Igor.  I'll attach a patch for the rest of the work Feng did here, the per-thread gcFreeList changes.

/be
Attachment #218757 - Flags: review?(feng.qian.moz)
(In reply to comment #49)
> Created an attachment (id=218757) [edit]
> thread-local lock-free GC freelist followup patch
> 

This still modifies rt->gcBytes outside the lock. 
Here is a patch that is based on comment #37. It makes rt->gc(Private)?Bytes to reflect the total memory allocated for GC arenas so the counters has to be modified only on arena allocation/release. Then the patches for lockless allocation do not need to wory about updating rt->gc(Private)?Bytes.
Comment on attachment 218834 [details] [diff] [review]
A solution to rt->gsBytes modifications 

This conflicts big-time with the previous patch, and I don't have time to merge them.  To avoid races to update gcBytes, this patch should go in first or be folded into the local gcFreeLists patch.  Anyway, detailed comments follow.

/be

>Index: src/jsapi.c
>===================================================================
>--- src.orig/jsapi.c	2006-04-18 09:10:23.000000000 +0200
>+++ src/jsapi.c	2006-04-18 17:31:19.000000000 +0200
>@@ -1850,20 +1850,20 @@
>     JS_GC(cx);
> #else
>     JSRuntime *rt;
>     uint32 bytes, lastBytes;
> 
>     rt = cx->runtime;
>     bytes = rt->gcBytes;
>     lastBytes = rt->gcLastBytes;
>-    if ((bytes > 8192 && bytes > lastBytes + lastBytes / 2) ||
>+    if ((bytes > 8192 && bytes >= lastBytes) ||

This is a change of behavior (also, lastBytes is single use now, not needed if we want this change).  Any reason for it?  All things equal, I would rather we keep compatibility.

>+            bytesptr = (alindex != 0 ? &rt->gcPrivateBytes : &rt->gcBytes);

Prevailing style parenthesizes the condition only, not whole ?: r.h.s. of = operator.

>         /*
>+         * Assert that at this point the last arena shoud have space for more
>+         * things.
>+         */
>+        JS_ASSERT(offset + nbytes <= GC_THINGS_SIZE);

Comment should not say "Assert ..." since the next line does that, but just note "At this point, the last arena should have space for more things." (typo: shoud).

>+                bytesptr = (i != 0 ? &rt->gcPrivateBytes : &rt->gcBytes);

Ditto.

/be
(In reply to comment #52)
> (From update of attachment 218834 [details] [diff] [review] [edit])
> This conflicts big-time with the previous patch, and I don't have time to merge
> them.  To avoid races to update gcBytes, this patch should go in first or be
> folded into the local gcFreeLists patch.

I would like to base an implementation of the lock-less version of bug 333236 on this patch. There if one wants to move scanning for free GC cells outside the lock, then rt->gcBytes can not be updated as soon as a thread would find a free cell. So either some extra efforts has to be taken to update rt->gcBytes next time thread wants to take lock or rt->gcBytes is changed to reflect the total allocation, not the total allocation - size of free cells. 

The second option simplifies the code and makes patch for this bug thread safe.

> >-    if ((bytes > 8192 && bytes > lastBytes + lastBytes / 2) ||
> >+    if ((bytes > 8192 && bytes >= lastBytes) ||
> 
> This is a change of behavior (also, lastBytes is single use now, not needed if
> we want this change).  Any reason for it?  All things equal, I would rather we
> keep compatibility.

This is done to *preserve* compatibility. With the patch rt->gcBytes means the total size of arenas or size of live things + the size of available free cells, not simply the size of live things. So before the patch the condition read:

current size of allocated gc things > 3/2 size of live things after GC.

Now given the stats for free list density from bug 331770 which is about 25-30% for things that contribute to rt->gcBytes, that "3/2 size of live things after GC" is close to "the total size of GC arenas after GC". Thus as an approximation to the current condition we can state:

current size of allocated gc things > the total size of GC arenas after GC

Now again due to the same stats about the free list densisty when the original condition passes, then free list would be empty, hence we can restate the original condition as:

the total size of GC arenas > the total size of GC arenas after GC

Now with the patch this would be translated to:

rt.gcBytes > rt.gcLastBytes or bytes > lastBytes

Now I wrote bytes >= lastBytes in the patch, but that is a bad typo since the condition would always pass. It should really be bytes > lastBytes which effectively states "it make sence to call GC if we were forced to allocated new arenas after exhausting the free list."

Does it sounds reasonable?

(In reply to comment #53)
> The second option simplifies the code and makes patch for this bug thread safe.

Yeah, I'm hip already. ;-)

> Now given the stats for free list density from bug 331770 which is about 25-30%
> for things that contribute to rt->gcBytes, that "3/2 size of live things after
> GC" is close to "the total size of GC arenas after GC".

But gcBytes and gcLastBytes have the same dimensions with your patch -- they both measure arena-quantized counts of bytes.  It's true that from the density measure you made, maybe 7/10ths to 3/4ths of these counts are free, but left and right operands of >= are in the same units, so I still wonder how you can change the 50% nominal growth heuristic here.  Would not the prorating of 3/2 by [7/10,3/4] need to apply to both left and right operands?

/be
(In reply to comment #54)
> (In reply to comment #53)
> > The second option simplifies the code and makes patch for this bug thread safe.
> 
> Yeah, I'm hip already. ;-)
> 
> > Now given the stats for free list density from bug 331770 which is about 25-30%
> > for things that contribute to rt->gcBytes, that "3/2 size of live things after
> > GC" is close to "the total size of GC arenas after GC".
> 
> But gcBytes and gcLastBytes have the same dimensions with your patch -- they
> both measure arena-quantized counts of bytes.  It's true that from the density
> measure you made, maybe 7/10ths to 3/4ths of these counts are free, but left
> and right operands of >= are in the same units, so I still wonder how you can
> change the 50% nominal growth heuristic here.  Would not the prorating of 3/2
> by [7/10,3/4] need to apply to both left and right operands?

The stats about the free list density tells about the state right after GC, not about the density at an arbitraty moment. The current statement:

  current size of allocated gc things > 3/2 size of live things after GC 

is equivalent to:

  S*(1-F) > 3/2 Sl*(1-Fl) 

where S and F are the total arena size and free list density currently and Sl and Fl are the size and density right after GC. The stats implies that 1/3 is a good approximation for Fl so we have:

  S*(1-F) > 3/2 Sl*(1-1/3) or S*(1 - F) > Sl

Now if we have not allocatet any new arenas yet and use only free list, then S == Sl and in this case we should do GC if

  S*(1-F) > S or F < 0 

or in other words never. On the other hand when we were forced to allocate new arenas, then F == 0 and S > Sl and the condition holds trivially.

So we can rewrite the current condition as:

  Do GC if we allocated arenas after making the free list empty. 

With the patch applied this is just rt.gcBytes > rt.gcLastBytes.

Btw, if one consider the case that Fl == 1/4, then we would have 
S*(1-F) > 3/2 Sl*(1-1/4) or S*(1-F) > 9/8 Sl which is by the same reasoning translates into 

  Do GC if we allocated extra 1/8 memory for arenas after making the free list empty. 

or 

  rt.gcBytes > 9/8 rt.gcLastBytes.

Perhaps it is better to use a condition like this then just rt.gcBytes > rt.gcLastBytes as my stats for Fl closer to 1/4 rather then 1/3 but I like the simplicity of "Do GC if we allocated arenas after making the free list empty."
Besides addressing the nits, the patch uses:

  bytes > lastBytes + lastBytes / 16
or
  rt->gcBytes > 17/16 rt->gcLastBytes

to ensure exponential growth of the GC heap compared to the number of calls to js_GC via JS_MaybeGC.
Attachment #218834 - Attachment is obsolete: true
Attachment #218954 - Flags: review?(brendan)
Comment on attachment 218954 [details] [diff] [review]
A solution to rt->gsBytes modifications v2

I buy it, but could you cite this bug in the comment, or else recap more of the analysis (with algebra), and the measured density.  Best to do this in a comment before the outer if statement in JS_MaybeGC, I think -- more room for the comment to breathe.

r=me in any case, thanks again.

/be
Attachment #218954 - Flags: review?(brendan) → review+
Comment on attachment 218200 [details] [diff] [review]
just the infrastructure with fixes

>Index: src/js.c
...
>+#ifdef JS_THREADSAFE
>+    JS_BeginRequest(cx);
>+#endif
>+
>     glob = JS_NewObject(cx, &global_class, NULL, NULL);
>     if (!glob)
>         return 1;
> #ifdef LAZY_STANDARD_CLASSES
>     JS_SetGlobalObject(cx, glob);
> #else
>     if (!JS_InitStandardClasses(cx, glob))
>         return 1;

I was messing with the request functions in bug 176182 and I noticed this change... I may be missing something, but don't you have to add a JS_EndRequest at every possible return?
(In reply to comment #58)
> I was messing with the request functions in bug 176182 and I noticed this
> change... I may be missing something, but don't you have to add a JS_EndRequest
> at every possible return?

No, see the function name shown by diff -p -- this is in main, the returns are to cause process exit with the given status.  There's no point in doing more than the minimum required to test a JS_THREADSAFE-compiled js shell.

/be
This is a patch version to commit with heavy comments in JS_MaybeGC.
Attachment #218954 - Attachment is obsolete: true
I committed rt->gcBytes accounting changes.
Attached patch Small cleanup for NewGCThing (obsolete) — Splinter Review
This is a cleanup to make js_NewGCThing easier to follow. I would like to update the lock-less patch from here and from bug 333263 based on this.
Attachment #219156 - Flags: review?(brendan)
Comment on attachment 219156 [details] [diff] [review]
Small cleanup for NewGCThing

>+    bytesptr = (arenaList == &rt->gcArenaList[0])
>+               ? &rt->gcBytes
>+               : &rt->gcPrivateBytes;
>+    *bytesptr += GC_ARENA_SIZE;
>+

This commoning in NewGCArena is cool.

Below is a skeleton of proposed control flow in js_NewGCThing:

>+  retry:
>+    thing = arenaList->freeList;
>+    if (thing) {
          [1]
>+        goto success;
>+    }
>+    if ([BigCondition]) {
          [2]
>+        goto success;
>     }
> 
      [3]
>+    goto retry;
>+
>+  success:

How about instead:

  retry:
    thing = arenaList->freeList;
    if (thing) {
        [1]
    } else {
        if (!BigCondition) {
            [3]
            goto retry;
        }
        [2]
    }
  success:

And then success label and goto successes are not needed?

/be
I reorganised the control flow in js_NewGCThing as suggested above and moved the checks for rt->gcBytes to NewGCArena so it is easier to read the condition. For symmetry rt->gcBytes is decreased in DestroyGCArena and not in js_GC.
Attachment #219293 - Flags: review?(brendan)
Attachment #219156 - Attachment is obsolete: true
Attachment #219156 - Flags: review?(brendan)
My mailbox has more bug mail than I can read. Sorry for such late reply.

Brendan, which patches do you need me to review? I saw there are three now, and Igor has one waiting for you to review.
Brendan, should I reassign the review request for the attchment from comment 64 to somebody else?
Comment on attachment 219293 [details] [diff] [review]
Small cleanup for NewGCThing v2

I would be grateful if feng reviewed.  Sorry I missed this review request -- my thinkpad died. I'm back up on a new macbook but not recovered fully, data-wise.  I should just query bugzilla.

/be
Attachment #219293 - Flags: review?(feng.qian.moz)
Attachment #219293 - Flags: review?(brendan)
Attachment #219293 - Flags: review+
Note I hope to have initial version of lockless allocation on top of the patch from bug 333236 tomorrow.
The patch looks great! 

I cannot change the flag of the attachment (no autherization). What are right steps to review and approve a patch?

(In reply to comment #67)
> (From update of attachment 219293 [details] [diff] [review] [edit])
> I would be grateful if feng reviewed.  Sorry I missed this review request -- my
> thinkpad died. I'm back up on a new macbook but not recovered fully, data-wise.
>  I should just query bugzilla.
> 
> /be
> 

Feng, I gave you canconfirm and editbug capabilities at bugzilla.mozilla.org.

/be
Attachment #219293 - Flags: review?(feng.qian.moz) → review+
I committed the patch from comment 64.
This is a version of the patch from comment 49 with the following changes besides the sync with CVS head:

1. The patch uses the same limit on the maximum size of local pool, 10 * GC_NBYTES_MAX / (nbytes), when the pool is populated both from the global free list and from the tail of the last allocated arena. I do not see a reason for such difference. 

2. In js_GC cx->thread->gcFreeLists are cleared before the marking phase to avoid clearing them again after GC restart.
Attachment #218757 - Attachment is obsolete: true
Attachment #221917 - Flags: superreview?(brendan)
Attachment #221917 - Flags: review?(feng.qian.moz)
Attachment #218757 - Flags: review?(feng.qian.moz)
Here is stats for the following test case for the js shell:

for (var loop = 0; loop != 10; ++loop) {
        var bag = [];
        for (var i = 0; i != 10*1000*(loop + 1); ++i) {
                var a = [1,2,3];
                if (i % 3 == 0)
                        bag.push(i);
        }
        gc();
}

On my  Pentium-M 1.1 Ghz laptom with Fedora Core 5 with GCC 4.1 I have for the shell build with BUILD_OPT=1 JS_THREADSAFE=1 (best real time for 10 runs):

Without the patch:

         -O           -Os          -O3 -fomit-frame-pointer -march=pentium-m
real  0m5.309s     real  0m5.250s           real    0m4.501s
user  0m5.280s     user  0m5.216s           user    0m4.476s
sys   0m0.020s     sys   0m0.016s           sys     0m0.020s


With the patch:

    -O                -Os          -O3 -fomit-frame-pointer -march=pentium-m
real  0m5.067s     real  0m4.942s           real  0m4.196s
user  0m5.028s     user  0m4.912s           user  0m4.172s
sys   0m0.032s     sys   0m0.024s           sys   0m0.020s

The speedup:

    -O                -Os          -O3 -fomit-frame-pointer -march=pentium-m
    4.6%              5.9%                      6.8%                 

Time to consider at least -Os optimization for JS in browser builds?
Great!  Can someone get Windows MSVC numbers?

We've had a bug asking for -Os in the past, but IIRC it was considered not the best option.  Times have changed for gcc, so we should reconsider.  It's fine with me to start by testing more inputs on SpiderMonkey, and if -Os wins, switching just js/src/Makefile.in over to it.

I'll look at the patch as soon as I can.  Thanks,

/be
Here are the results I got, with Igor's test case run in the JS Shell built using MSVC 7.1 on a 3.0 GHz Pentium D (also BUILD_OPT=1 JS_THREADSAFE=1). Unfortunately I'm not able to test VC8 at the moment. I did two sets of five runs with each combination, these are the averages.

With the patch:
-O:   5.5312 s    5.4656 s
-Os:  5.8564 s    5.8282 s

Without the patch:
-O:   5.6064 s    5.6564 s
-Os:  6.0438 s    6.0711 s
In the prevevious version of the patch I forgot to remove an assert that cheked that js_NewGCThing converted exactly the whole just allocated arena into a thread-local pool. Since my changes to the initial patch always limit the number of things in the pool by 10 * GC_NBYTES_MAX / (nbytes), the assert is no longe applicable.

I hit the assert immediately when started to use a debug build of the browser.
Attachment #221917 - Attachment is obsolete: true
Attachment #221950 - Flags: superreview?(brendan)
Attachment #221950 - Flags: review?(feng.qian.moz)
Attachment #221917 - Flags: superreview?(brendan)
Attachment #221917 - Flags: review?(feng.qian.moz)
Igor, 

how about making MAX_THREAD_LOCAL_THINGS(nbytes) to be GC_PAGE_SIZE/nbytes. Orignal GC_NBYTES_MAX is my typo. Acquiring a page at once seems a better heuristics.

Indentation is not right after the line:
  arenaList = &rt->gcArenaList[flindex];
Is it tab vs. whitespace problem?

Otherwise, it looks good to me.

Thanks,
Feng

(In reply to comment #76)
> Created an attachment (id=221950) [edit]
> Updated thread-local allocation patch v2
> 
> In the prevevious version of the patch I forgot to remove an assert that cheked
> that js_NewGCThing converted exactly the whole just allocated arena into a
> thread-local pool. Since my changes to the initial patch always limit the
> number of things in the pool by 10 * GC_NBYTES_MAX / (nbytes), the assert is no
> longe applicable.
> 
> I hit the assert immediately when started to use a debug build of the browser.
> 

This version of the patch puts MAX_THREAD_LOCAL_THINGS just to 8 without any dependency on the size of allocated things. Here is the reason for this:

I use the following modification of the test from comment 73:

for (var loop = 0; loop != 10; ++loop) {
	var bag = [];
	for (var i = 0; i != 10*1000*(loop + 1); ++i) {
		var a = {};
		if (i % 3 != 0)
			bag.push(i);
	}
	gc();
}

Effectively the test is dominated by the allocation and garbage collection of 2- and 6-words things, that is JSObject and its 6 initial slots. The "i % 3" test should emulate 33% free list density when GC finishes.

I run the test against a JS_THREADSAFE optimized build of jsshell compiled with
   -march=pentium-m -O3 -fomit-frame-pointer
flags to optimizes gains from non-locking code for my Pentium 1.1 Ghz laptop with Fedora Core 5 and GCC 4.1. 

Here is the best user-time in seconds among 10 invocations of js shell with various MAX_THREAD_LOCAL_THINGS values:

  128  1.784
  64   1.784
  32   1.780
  16   1.836
  10   1.808
  8    1.792
  4    1.828
  2    1.872
  1    1.912
  0    1.980  (without the patch)

Note that a slight bump after 8 is real: I tried to run the test case for 100 or so times with MAX_THREAD_LOCAL_THINGS and 1.808 is best that I got. It can be attributes to the fact that GCC at -O3 uses loop unrolling only when number is 8 or less. In any case I do not see any practical improvements after 8 so that number went to the patch. 

Of cause for a heavily contest lock the results would be different, but since in the browser embeddings JS is single threaded most of the time, the number 8 should be a good approximation there as well.
Attachment #221950 - Attachment is obsolete: true
Attachment #222158 - Flags: superreview?(brendan)
Attachment #222158 - Flags: review?(feng.qian.moz)
Attachment #221950 - Flags: superreview?(brendan)
Attachment #221950 - Flags: review?(feng.qian.moz)
We can do a simple math here too, we need one lock for every MAX_THREAD_LOCAL_THINGS allocation approximately. In this case, it is about 7/8 locks were optimized away. The difference is kind of small between 7/8 and up.

(In reply to comment #78)
> Created an attachment (id=222158) [edit]
> Updated thread-local allocation patch v3
> 
> This version of the patch puts MAX_THREAD_LOCAL_THINGS just to 8 without any
> dependency on the size of allocated things. Here is the reason for this:
> 
> I use the following modification of the test from comment 73:
> 
> for (var loop = 0; loop != 10; ++loop) {
>         var bag = [];
>         for (var i = 0; i != 10*1000*(loop + 1); ++i) {
>                 var a = {};
>                 if (i % 3 != 0)
>                         bag.push(i);
>         }
>         gc();
> }
> 
> Effectively the test is dominated by the allocation and garbage collection of
> 2- and 6-words things, that is JSObject and its 6 initial slots. The "i % 3"
> test should emulate 33% free list density when GC finishes.
> 
> I run the test against a JS_THREADSAFE optimized build of jsshell compiled with
>    -march=pentium-m -O3 -fomit-frame-pointer
> flags to optimizes gains from non-locking code for my Pentium 1.1 Ghz laptop
> with Fedora Core 5 and GCC 4.1. 
> 
> Here is the best user-time in seconds among 10 invocations of js shell with
> various MAX_THREAD_LOCAL_THINGS values:
> 
>   128  1.784
>   64   1.784
>   32   1.780
>   16   1.836
>   10   1.808
>   8    1.792
>   4    1.828
>   2    1.872
>   1    1.912
>   0    1.980  (without the patch)
> 
> Note that a slight bump after 8 is real: I tried to run the test case for 100
> or so times with MAX_THREAD_LOCAL_THINGS and 1.808 is best that I got. It can
> be attributes to the fact that GCC at -O3 uses loop unrolling only when number
> is 8 or less. In any case I do not see any practical improvements after 8 so
> that number went to the patch. 
> 
> Of cause for a heavily contest lock the results would be different, but since
> in the browser embeddings JS is single threaded most of the time, the number 8
> should be a good approximation there as well.   
> 

Comment on attachment 222158 [details] [diff] [review]
Updated thread-local allocation patch v3

I didn't do compilation and run regression tests. Should reviewers do it?
Attachment #222158 - Flags: review?(feng.qian.moz) → review+
I committed the patch from comment 76. But my hope to get a better startup time did not realized: http://tinderbox.mozilla.org//showbuilds.cgi?tree=Firefox shows that startup increased from 593->594 ms. Perhaps other commits affected that?
Startup went down on luna and planetoid.  Most of the Ts boxes are way too noisy to notice a change in fewer than 10-20 cycles, imo, and the 594 is back down to 593 on pacifica in the very next cycle.  Some boxes that are reasonably low-noise show no effect (e.g. pacifica).  Sounds like this works in at least some cases, and doesn't hurt in others.
(In reply to comment #82)
>  Sounds like this works in at least some cases, and doesn't hurt in others.
> 

I mark the bug as fixed then :)

Status: ASSIGNED → RESOLVED
Closed: 19 years ago19 years ago
Resolution: --- → FIXED
Did this hurt Tp on pacifica?  It went up from ~116 to ~118 around the same time luna's Ts went down.
Attachment #222158 - Flags: superreview?(brendan)
The js_Unlock change to remove compare and swap caused bug 353962. See that bug for patch and fix credit.

/be
Depends on: 353962
Can the two pending review requests here be canceled? 
Attachment #217941 - Flags: review?(feng.qian.moz)
Attachment #216190 - Flags: review?(feng.qian.moz)
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: