Avoid perf hit from js_PurgeDeflatedStringCache

RESOLVED FIXED

Status

()

Core
JavaScript Engine
RESOLVED FIXED
9 years ago
7 years ago

People

(Reporter: dmandelin, Assigned: dmandelin)

Tracking

({fixed1.9.1})

Trunk
fixed1.9.1
Points:
---
Dependency tree / graph

Firefox Tracking Flags

(Not tracked)

Details

Attachments

(3 attachments, 1 obsolete attachment)

(Assignee)

Description

9 years ago
In Dromaeo's version of fasta, fastRepeat is called repeatedly, 200+ times per second. In this workload, we allocate 32 MB via malloc after about 0.4 seconds. We allocate 122 bytes for the new string in the second part of the loop. This happens 2229 times per call of fastaRepeat, for 271938 bytes allocated via JS_malloc per call, which works out to a GC every 123 calls.

Each GC takes about .346 seconds, meaning GC is sucking up about 40% of run time.

Most of the GC run time is spent in js_FinalizeStringRT, the standard finalizer for strings. And most of that time is in js_PurgeDeflatedStringCache, which deletes any deflated string cache entries for that string. This is pretty unfortunate given that this workload creates almost no deflated strings.

I would suggest that deleting cached deflated strings exactly (i.e., deleting each cache entry when and exactly when its owning string is freed) is probably a bad thing. It's pretty expensive because it involves mucking with a hash table. Also, the current code has a lock/unlock pair on the cache for every string that gets deleted, which accounts for most of the time taken.

I think a definitive fix will require more thought, but I have at least a couple of simple ideas:

Option 1. Move the lock/unlock pair to surround the arena sweep loop, so it happens once per GC instead of once per string. I tried this and got a GC time reduction from .346s to .145s. 

Option 2. Add a flag to JSRuntime that records if a deflated string has been created since the last GC. Run js_PurgeDeflatedStringCache only if that flag has been set. I tried that and got a reduction from .346s to .094s most of the time, although a few GCs still took .346s.

Using both options would result in GC times of .094s or .145s every time. This is about a 1.5x improvement in throughput, which presumably would apply to most long-running string-heavy workloads.
(Assignee)

Comment 1

9 years ago
We talked about this a bit the other day and decided that it should be done with a bit per string (taken from the length field). Moving the lock/unlock pair is risky because it will deadlock if any finalizer tries to deflate a string. The idea is to set the bit when a deflated string cache entry is created, clear the cache entry only if the bit is set, and clear the bit when the cache entry is cleared.

The tricky part is to make the bit-fiddling work with concurrency. It's fine for bit modifications to be guarded by the deflated string cache lock. But we don't want to have to lock to check the bit, because avoiding those locks when there is no deflated string is 75% of the benefit. Also, we have to worry about other threads messing with other bits at the same time.

I need to study up on how we do this for existing string bits. In the meantime, if anyone wants to steal this bug, feel free.
(In reply to comment #1)
> We talked about this a bit the other day and decided that it should be done
> with a bit per string (taken from the length field). Moving the lock/unlock
> pair is risky because it will deadlock if any finalizer tries to deflate a
> string.

How much effort is it worth to avoid just saying "don't do that from finalizers"?  We already tell people not to allocate from finalizers, so not getting the bytes from a string for the first time seems like a reasonable thing to bar as well.
(Assignee)

Comment 3

9 years ago
(In reply to comment #2)
> (In reply to comment #1)
> > We talked about this a bit the other day and decided that it should be done
> > with a bit per string (taken from the length field). Moving the lock/unlock
> > pair is risky because it will deadlock if any finalizer tries to deflate a
> > string.
> 
> How much effort is it worth to avoid just saying "don't do that from
> finalizers"?  We already tell people not to allocate from finalizers, so not
> getting the bytes from a string for the first time seems like a reasonable
> thing to bar as well.

I don't know. Another problem with my previous solution is that it still does all the useless hashtable lookups if at least 1 string has been deflated since the last GC.

I've thought about the concurrency issues before, and I think it may not be too bad. I think it's not necessary to clear the bit, because it is checked as the string is being finalized, so presumably that string will never be seen again, anyway. Now, the stuff I'm less sure about:

- If a string is being finalized, then it should be unreachable from all threads. So only the finalizer can be running on it, and there is no problem with checking the bit in the finalizer.

- When setting the bit, there is a hazard. Another thread could call JSFLATSTR_SET_ATOMIZED to set the atomized bit in the length field. The bit-set codes are non-atomic, so, e.g., if we are setting ATOMIZED and DEFLATED bits simultaneously, one of the sets could end up not reflected in the final state.

There should be little contention, so I think OCC (using compare-and-swap) is the way to go.

Comment 4

9 years ago
(In reply to comment #1)
> The tricky part is to make the bit-fiddling work with concurrency. 

It should work since clearing of atomized or deflated bits due to concurrent bit update would mean taking an extra lock, doing a lookup in the hash and restoring the bit. Granted, a bad flip-flop state can occur due to two threads constantly restoring different bits, but that is unlikely and would just mean a performance loss.

More problematic issue is what to do with dependent strings. Taking a bit from them may hurt performance due the need to undepend them earlier when growing.

Comment 5

9 years ago
(In reply to comment #4)
> but that is unlikely and would just mean a
> performance loss.

The lost deflated bit may result in the GC not removing the deflated cache entry. Again, it could be tolerated, but that suggestion to use compare-and-swap sounds much safer.

Comment 6

9 years ago
(In reply to comment #1)
> Moving the lock/unlock
> pair is risky because it will deadlock if any finalizer tries to deflate a
> string.

Object finalizers are run before the final sweeping that frees objects and strings. So it should not be a problem to iterate over deflated cache entry before the final sweep and remove dead ones.
(In reply to comment #6)
> (In reply to comment #1)
> > Moving the lock/unlock
> > pair is risky because it will deadlock if any finalizer tries to deflate a
> > string.
> 
> Object finalizers are run before the final sweeping that frees objects and
> strings. So it should not be a problem to iterate over deflated cache entry
> before the final sweep and remove dead ones.

How do you decide what's dead? If you mark deflated string cache entries (why not? M&S works for strings and could be used to manage the deflated strings too), then you have to hash by JSString pointer for every live string, which could hurt as badly as hashing from js_FinalizeStringRT.

Locking overhead can be reduced to ~0, yeah. Avoiding the useless hashing is the real trick.

/be

Comment 8

9 years ago
(In reply to comment #7)
> (In reply to comment #6)
> > Object finalizers are run before the final sweeping that frees objects and
> > strings. So it should not be a problem to iterate over deflated cache entry
> > before the final sweep and remove dead ones.
> 
> How do you decide what's dead?

If the string is not GC-marked, then it is dead and the corresponding deflated string entry can be removed.
(In reply to comment #8)
> (In reply to comment #7)
> If the string is not GC-marked, then it is dead and the corresponding deflated
> string entry can be removed.

Sure, but most strings do not have deflated bytes cached, and the hash lookup costs enough for us to want to avoid it, if we can. We hash in the finalizer now, we could hash in a mark (gc-trace) phase instead, to mark the deflated string cache entry. But either way requires a hash lookup, unless we have that flag bit.

/be

Comment 10

9 years ago
(In reply to comment #9)
> (In reply to comment #8)
> > (In reply to comment #7)
> > If the string is not GC-marked, then it is dead and the corresponding deflated
> > string entry can be removed.
> 
> Sure, but most strings do not have deflated bytes cached, and the hash lookup
> costs enough for us to want to avoid it, if we can.

There is no need to do hash lookup, enumeration over deflated hash entries is sufficient to find which ones are free. 

Now, the current callback-based hash enumeration API are slow. That means that if there would be a lot of live deflated strings, it could be worse to enumerate over them rather than check if dead strings are in the cache. I guess one just need try and see.
(Assignee)

Comment 11

9 years ago
Created attachment 372945 [details] [diff] [review]
Patch for per-string deflate flag
Attachment #372945 - Flags: review?(jwalden+bmo)
(Assignee)

Updated

9 years ago
Attachment #372945 - Flags: review?(jwalden+bmo) → review?(brendan)
(Assignee)

Updated

9 years ago
Attachment #372945 - Flags: review?(igor)
Attachment #372945 - Flags: review?(brendan) → review+
Comment on attachment 372945 [details] [diff] [review]
Patch for per-string deflate flag

>+#define JSSTRING_SET_DEFLATED(str)  js_AtomicSetMask((jsword*) &(str)->length,\

Uber-nit: I'd crunch the )& together and put the space before the \ but it's ok either way (cast deserves to lose the space first, agreed).

>@@ -182,8 +190,10 @@ struct JSString {
>  * with the atomized bit set.
>  */
> #define JSFLATSTR_SET_ATOMIZED(str)                                           \
>-    ((void)(JS_ASSERT(JSSTRING_IS_FLAT(str) && !JSSTRING_IS_MUTABLE(str)),    \
>-            (str)->length |= JSSTRFLAG_ATOMIZED))
>+    do {                                                                      \
>+        JS_ASSERT(JSSTRING_IS_FLAT(str) && !JSSTRING_IS_MUTABLE(str));        \
>+        js_AtomicSetMask((jsword*) &(str)->length, JSSTRFLAG_ATOMIZED);       \
>+    } while (0);

Use JS_BEGIN_MACRO and JS_END_MACRO for these bigger-bodied macros -- no need for naked do { and } while (0) -- also lose that final ;, it would break any unbraced consequent use of this macro as if it were a function-call expression statement.

/be
(Assignee)

Comment 13

9 years ago
Pushed to TM with fixes as 9e0f6160ef35. Any fixes/backouts from Igor's review will go in when ready.

Comment 14

9 years ago
Comment on attachment 372945 [details] [diff] [review]
Patch for per-string deflate flag

>diff -r 2a8287df3669 js/src/jslock.h
>--- a/js/src/jslock.h	Sat Apr 11 16:03:13 2009 -0700
>+++ b/js/src/jslock.h	Wed Apr 15 11:48:18 2009 -0700
>@@ -311,6 +311,17 @@ js_CompareAndSwap(jsword *w, jsword ov, 
> 
> #endif
> 
>+static inline void
>+js_AtomicSetMask(jsword *w, jsword mask)
>+{

Use JS_INLINE here, not inline - this is a header that LiveConnect includes. Also, I am not sure that this should be inlined - setting of the flag happens after a heavy operation, so lets not add unnecessary code bloat. 

> 
>@@ -3444,6 +3444,7 @@ js_SetStringBytes(JSContext *cx, JSStrin
>     if (ok)
>         rt->deflatedStringCacheBytes += length;
> #endif
>+    JSSTRING_SET_DEFLATED(str);

The deflated flag should only be set when ok is true. 


>@@ -3498,6 +3499,7 @@ js_GetStringBytes(JSContext *cx, JSStrin
> #ifdef DEBUG
>                 rt->deflatedStringCacheBytes += JSSTRING_LENGTH(str);
> #endif
>+                JSSTRING_SET_DEFLATED(str);

Could you attach at least -U8 patch to have more context? 


>+++ b/js/src/jsstr.h	Wed Apr 15 11:48:18 2009 -0700
>@@ -108,8 +108,9 @@ struct JSString {
> #define JSSTRFLAG_PREFIX            JSSTRING_BIT(JS_BITS_PER_WORD - 2)
> #define JSSTRFLAG_MUTABLE           JSSTRFLAG_PREFIX
> #define JSSTRFLAG_ATOMIZED          JSSTRING_BIT(JS_BITS_PER_WORD - 3)
>+#define JSSTRFLAG_DEFLATED          JSSTRING_BIT(JS_BITS_PER_WORD - 4)

Document JSSTRFLAG_DEFLATED at the big comment block at the beging of the file.

> #define JSFLATSTR_SET_ATOMIZED(str)                                           \
>-    ((void)(JS_ASSERT(JSSTRING_IS_FLAT(str) && !JSSTRING_IS_MUTABLE(str)),    \
>-            (str)->length |= JSSTRFLAG_ATOMIZED))
>+    do {                                                                      \

Use JS_BEGIN_MACRO/JS_END_MACRO to avoid over warnings with MSVC.
Attachment #372945 - Flags: review?(igor) → review-
(Assignee)

Comment 15

9 years ago
Created attachment 373215 [details] [diff] [review]
Revised per Igor's review

Revised patch. I had to install a new version of Mercurial to get the 8-line diff context.

I went ahead and made the atomic-mask function non-inline since it's probably dominated by the time of the compare-and-swap anyway. I'm happy to let the compiler decide what to do with it.
Assignee: general → dmandelin
Attachment #372945 - Attachment is obsolete: true
Attachment #373215 - Flags: review?(igor)

Comment 16

9 years ago
Comment on attachment 373215 [details] [diff] [review]
Revised per Igor's review

>  *
>+ * Any string with JSSTRFLAG_DEFLATED set means that the string has an entry
>+ * in the deflated string cache. The reason for this flag is so that the
>+ * string finalizer can try to remove the deflated string cache only if it
>+ * exists. The removal is an expensive step with hash lookups and locking.

Nit: replace the comments starting from "The reason" with :

The GC uses this flag to optimize the string finalization and avoid an expensive cache lookup for strings that were never deflated.


(Note: I am not sure about article usage in the above sentence especially in "the string finalization".)
Attachment #373215 - Flags: review?(igor) → review+

Comment 17

9 years ago
(In reply to comment #15)
> Created an attachment (id=373215) [details]
> Revised per Igor's review
> 
> Revised patch. I had to install a new version of Mercurial to get the 8-line
> diff context.

Welcome to the club - I had to do the same on a server running Ubuntu 8.04.
(Assignee)

Comment 18

9 years ago
Diffs between first and second patch pushed to TM as ba70256c0155.
(Assignee)

Comment 19

9 years ago
Created attachment 373236 [details] [diff] [review]
Fix shell bustage
Attachment #373236 - Flags: review?(brendan)
Attachment #373236 - Flags: review?(brendan) → review+
Comment on attachment 373236 [details] [diff] [review]
Fix shell bustage

r=me with the #ifdef fusion already suggested and made in person.

/be
(Assignee)

Comment 21

9 years ago
Shell bustage fix pushed as to TM as c7ce33adf7e2.

Comment 22

9 years ago
The landed code contains:

JS_ATOMIC_SET_MASK(w, mask) *(w) |= mask

This is a bad style - use parenthesis around mask and the whole expression.

At this point I think it is better to backout all the pieces and re-land them as one changeset.
Sorry, I was looking over dmandelin's shoulder and did not catch that. I should have re-reviewed but was in a hurry out the door.

Re: landing as one changeset, it's really up to sayrer. So far he has tracked multiple landings per bug. One per bug is best, but if not, is it really worth the extra churn?

/be
(Assignee)

Comment 24

9 years ago
Ick. I'm amazed I put the parens around the 'w' but then forgot the rest. Well, I still blame CPP. ;-)
(Assignee)

Comment 25

9 years ago
Btw, I pushed the parens to TM as 5f3c359ce64e.

Backing out doesn't seem useful to me because it doesn't get rid of the bad revs, just adds more revs. But I haven't thought about it deeply. I guess the backouts give one a clue that the bad revs are bad, if one looks for backouts. Either way, it should be fairly easy to stack up these changesets if it's decided to backout and reland as one patch.

Comment 26

9 years ago
Created attachment 373286 [details] [diff] [review]
fixing js1_5/extensions/regress-342960.js

js1_5/extensions/regress-342960.js constructs a string with 256 MB of chars. With the patch the maximum limit becomes 128 MB causing unanticipated test case failure. This fixes the test.

I wish this would be addressed prior landing of patches to avoid wasting time to figure out what caused a test case regression.

Updated

9 years ago
Attachment #373286 - Flags: review?(bclary)

Updated

9 years ago
Attachment #373286 - Flags: review?(bclary) → review+

Comment 27

9 years ago
updated test - http://hg.mozilla.org/tracemonkey/rev/6e609b154f7c

Updated

9 years ago
Depends on: 488808

Comment 28

9 years ago
http://hg.mozilla.org/mozilla-central/rev/9e0f6160ef35
Status: NEW → RESOLVED
Last Resolved: 9 years ago
Resolution: --- → FIXED

Comment 29

9 years ago
http://hg.mozilla.org/releases/mozilla-1.9.1/rev/470ad6bcb075
Keywords: fixed1.9.1

Comment 30

9 years ago
http://hg.mozilla.org/releases/mozilla-1.9.1/rev/bdcc9ac7a8d9

Comment 31

9 years ago
/cvsroot/mozilla/js/tests/js1_5/extensions/regress-342960.js,v  <--  regress-342960.js
new revision: 1.4; previous revision: 1.3
Depends on: 583615
You need to log in before you can comment on or make changes to this bug.