eliminating the local root check on the fast path of the GC allocator

RESOLVED FIXED

Status

()

Core
JavaScript Engine
--
enhancement
RESOLVED FIXED
8 years ago
8 years ago

People

(Reporter: Igor Bukanov, Assigned: Igor Bukanov)

Tracking

Trunk
Points:
---
Dependency tree / graph

Firefox Tracking Flags

(blocking2.0 -)

Details

(Whiteboard: fixed-in-tracemonkey)

Attachments

(1 attachment, 3 obsolete attachments)

(Assignee)

Description

8 years ago
Changes from the bug #521390 makes the fast path of the GC thing allocator to do only two checks. The first checks for a non-empty free list and the second checks if the allocated thing should be stored in the local roots, not as a newborn weak root. Since the local roots is only used for E4X, this second check in almost all cases just penalizes the allocation with an extra checks. 

To remove this check I suggest to store the free lists not in JSThreadData::gcFreeLists but in some other location when the local roots are active. This way the fast path of the allocator can only check for non-empty free lists, the check for local roots will be done on a slow path.
(Assignee)

Comment 1

8 years ago
Created attachment 407718 [details] [diff] [review]
v1

With the patch the local root stack contains its own copy of GC free lists. This copy is queried only when the main lists are empty eliminating the local root check on the fast path of the GC allocator. 

The patch moves the local root stack to JSThreadData from JSContext to simplify the free list management stored in JSThreadData.

The patch makes SS 1.6% faster on 64-bit Linux where the speedup comes from the GC-allocator sensitive benchmarks:

  string:              1.032x as fast    243.3ms +/- 0.4%   235.6ms +/- 0.3%     significant
    base64:            -                  10.8ms +/- 1.2%    10.8ms +/- 1.5% 
    fasta:             1.012x as fast     48.4ms +/- 0.5%    47.8ms +/- 0.4%     significant
    tagcloud:          1.018x as fast     78.2ms +/- 0.2%    76.9ms +/- 0.2%     significant
    unpack-code:       1.069x as fast     88.0ms +/- 1.1%    82.3ms +/- 0.6%     significant
Attachment #407718 - Flags: review?(brendan)
(Assignee)

Updated

8 years ago
Blocks: 523792
Comment on attachment 407718 [details] [diff] [review]
v1

Andreas said local root scopes in jsxml.cpp are not needed with conservative scanning. If true then can we save merge work by not landing this patch? I'm ok either way, we need to settle the issue of whether jsxml.cpp needs local root scopes.

>+struct JSLocalRootStack;
>+
>+struct JSLocalRootChunk;
>+
>+#define JSLRS_CHUNK_SHIFT       8
>+#define JSLRS_CHUNK_SIZE        JS_BIT(JSLRS_CHUNK_SHIFT)
>+#define JSLRS_CHUNK_MASK        JS_BITMASK(JSLRS_CHUNK_SHIFT)
>+
>+struct JSLocalRootChunk {
>+    jsval               roots[JSLRS_CHUNK_SIZE];
>+    JSLocalRootChunk    *down;
>+};
>+
>+struct JSLocalRootStack {

No need to forward-declare struct JSLocalRootStack.

>-            JSFreePointerListTask* task = JS_THREAD_DATA(this)->deallocatorTask;
>+            JSFreePointerListTask* task = thread->deallocatorTask;

This is all good cleanup -- could separate into another bug to get in ASAP.

>+CheckGCFreeListLink(JSGCThing *thing)
>+{
>+    /*
>+     * The GC things on the free lists come from one arena and the things on
>+     * the free list are strictly in the ascending order.

"the things on the free list are linked in ascending address order."

I like the SS perf win. The allocator path changes, which duplicate the newborn and lrs rooting, will conflict with the patches in bug 519947 and bug 505898 (Andreas, those two bugs overlap in their summaries; please fix). So the big-picture issue to settle here is whether we need local root scopes given the impending conservative stack scanning.

/be
Attachment #407718 - Flags: review?(gal)
(In reply to comment #2)
> and lrs rooting, will conflict with the patches in bug 519947 and bug 505898

Sorry, going off saved bugmail -- I should have noticed that bug 505898 was WONTFIXed. What's the bug for eliminating local root scopes that Andreas clued me into yesterday?

/be
(Assignee)

Comment 4

8 years ago
(In reply to comment #2)
> (From update of attachment 407718 [details] [diff] [review])
> Andreas said local root scopes in jsxml.cpp are not needed with conservative
> scanning.

A quick look over jsxml.cpp shows that, for example, XMLToXMLString needs local roots even with the stack scanning. Granted, such use can be converted into temporary roots but this is an extra work. So the point of this bug is that we can keep local roots for now without penalizing the allocator. 



 If true then can we save merge work by not landing this patch? I'm ok
> either way, we need to settle the issue of whether jsxml.cpp needs local root
> scopes.
> 
> >+struct JSLocalRootStack;
> >+
> >+struct JSLocalRootChunk;
> >+
> >+#define JSLRS_CHUNK_SHIFT       8
> >+#define JSLRS_CHUNK_SIZE        JS_BIT(JSLRS_CHUNK_SHIFT)
> >+#define JSLRS_CHUNK_MASK        JS_BITMASK(JSLRS_CHUNK_SHIFT)
> >+
> >+struct JSLocalRootChunk {
> >+    jsval               roots[JSLRS_CHUNK_SIZE];
> >+    JSLocalRootChunk    *down;
> >+};
> >+
> >+struct JSLocalRootStack {
> 
> No need to forward-declare struct JSLocalRootStack.
> 
> >-            JSFreePointerListTask* task = JS_THREAD_DATA(this)->deallocatorTask;
> >+            JSFreePointerListTask* task = thread->deallocatorTask;
> 
> This is all good cleanup -- could separate into another bug to get in ASAP.
> 
> >+CheckGCFreeListLink(JSGCThing *thing)
> >+{
> >+    /*
> >+     * The GC things on the free lists come from one arena and the things on
> >+     * the free list are strictly in the ascending order.
> 
> "the things on the free list are linked in ascending address order."
> 
> I like the SS perf win. The allocator path changes, which duplicate the newborn
> and lrs rooting, will conflict with the patches in bug 519947 and bug 505898
> (Andreas, those two bugs overlap in their summaries; please fix). So the
> big-picture issue to settle here is whether we need local root scopes given the
> impending conservative stack scanning.
> 
> /be

Comment 5

8 years ago
This is the conservative stack scanning patch. I think its ready to land:

https://bugzilla.mozilla.org/show_bug.cgi?id=516832

This is the bug for removing LocalRootScopes. I looked at the use in jsxml. We can remove pretty much all uses as far as I can tell. I have seen 2 or 3 cases where I wasn't sure, but those we can easily turn into TempValueRooters. I don't think we need 3+ different rooting schemes. If we have a chance to eliminate LRS, we should. XML is the only user of them, and the entire concept is really difficult to use and dangerous (see bug).

https://bugzilla.mozilla.org/show_bug.cgi?id=519949
(Assignee)

Comment 6

8 years ago
(In reply to comment #5)
> This is the bug for removing LocalRootScopes. I looked at the use in jsxml. We
> can remove pretty much all uses as far as I can tell. I have seen 2 or 3 cases
> where I wasn't sure, but those we can easily turn into TempValueRooters. I
> don't think we need 3+ different rooting schemes. If we have a chance to
> eliminate LRS, we should. XML is the only user of them, and the entire concept
> is really difficult to use and dangerous (see bug).

The local roots were advertised as a better rooting API so there is a chance that people use it. And what if some people do in a way that is similar to jsxml.cpp? So I suggest at least report this in the newsgroup and wait for some time. In the mean time this bug shows that local roots does not harm the performance and results just in source and maintenance. Which is bad, no doubt about this, but at least that can be tolerated for few more weeks.

Comment 7

8 years ago
I think in our GC futures debate we agreed that we want to do sweeping changes to improve our GC performance. We are still massively behind in this area. I am not convinced that we have proved that LRS do not harm performance. I think the proof for that is to remove them and measure the performance impact. If there is none, I would still argue to remove them since its a pretty dangerous API and it adds a fair amount of complexity, but it wouldn't be as urgent any more. If we want a pointer-bumping mostly copying GC, we are going to have to let go of some of our 14 year old APIs. Its inevitable. Embedders can chose to come along for the ride, or stick with 1.8 which works well and is stable.
(Assignee)

Comment 8

8 years ago
(In reply to comment #7)
> I think in our GC futures debate we agreed that we want to do sweeping changes
> to improve our GC performance. We are still massively behind in this area.

I suspect that an inlined GC allocator, see bug 523792, typed finalization and your idea of fixed-size mark bitmap things will make things better in a short term.

> I am
> not convinced that we have proved that LRS do not harm performance. I think the
> proof for that is to remove them and measure the performance impact.

I have not observed any differences in SS and v8 benchmarks with this patch versus a patch that just removes any traces of local roots in the code.

> If there
> is none, I would still argue to remove them since its a pretty dangerous API
> and it adds a fair amount of complexity, but it wouldn't be as urgent any more.
> If we want a pointer-bumping mostly copying GC, we are going to have to let go
> of some of our 14 year old APIs. Its inevitable. 

I fully agree about this. But IMO we should try to optimize and simplify to the max the performance of the current code base so we would have clear data that wins from copying GC comes from better GC architecture and not from code that was written with focus on the performance.

Comment 9

8 years ago
var d = new Date();
for (var i = 0; i < 10000000; ++i)
    ({})
print (new Date() - d);

unmodified:

whale:src gal$ ./Darwin_OPT.OBJ/js -j x.js
2024

with LRS check in allocator path removed:

whale:src gal$ ./Darwin_OPT.OBJ/js -j x.js
1965

This is on top of the SS 10ms speedup I get for removing newborn roots.

Having 3 rooting APIs, all of which we have to keep up to date and consult in various paths is not free. Conservative stack walking makes almost all engine-internal uses of these APIs obsolete. I really don't know why we want to hang on to code that doesn't fulfill any purpose any more. Anything on the stack will be marked. Explicitly rooting stack items is redundant. It doesn't make anything safer. At best it makes everything slower.

I tried to hack up a testcase that loses a root by allocating and then sticking the value into malloc-ed memory but not keeping it on the stack. Thats actually really hard to achieve since we also scan all machine registers. You have to go through a lot of gymnastics to even trigger this case (i.e. call a lot of code after a NewObject and all this time not root the newly allocated object and not pass it as stack argument). In other words writing such buggy code is really hard even if you try to do it intentionally. I am not looking to get this into 3.6, but 3.7 is 9 month out or so. By then we should really be able to let go of APIs that don't do anything except spend cycles on rooting things that get scanned anyway.
(Assignee)

Comment 10

8 years ago
Created attachment 407961 [details] [diff] [review]
v2

The new version addresses the nits from the comment 522867.
Attachment #407718 - Attachment is obsolete: true
Attachment #407961 - Flags: review?(gal)
Attachment #407718 - Flags: review?(gal)
Attachment #407718 - Flags: review?(brendan)

Updated

8 years ago
Attachment #407961 - Attachment is patch: true
Attachment #407961 - Attachment mime type: application/octet-stream → text/plain
(Assignee)

Comment 11

8 years ago
(In reply to comment #9)
> var d = new Date();
> for (var i = 0; i < 10000000; ++i)
>     ({})
> print (new Date() - d);

With either the patch in this bug or simply removing LRS from the code I see no difference in this test. In both cases the speedup is about 12%. This is for 64 bit Linux.

> Having 3 rooting APIs, all of which we have to keep up to date and consult in
> various paths is not free. Conservative stack walking makes almost all
> engine-internal uses of these APIs obsolete. I really don't know why we want to
> hang on to code that doesn't fulfill any purpose any more.

I am not arguing against that. My point is that the local roots can be changed so they do not harm performance and do not block other performance-related work in the GC. As such we can remove them, say, in few weeks after landing the conservative scaner, announcing that on the mail list and ensuring that jsxml.cpp does not rely on them in a way that defeats conservative stack scanner.
We can't introduce even temporary GC safety bugs, of course. So conservative scanner first, then LRS removal. Seems everyone agrees.

I'm in favor of ditching LRS (JNI inspired, they seem to work the same unless I have not kept up with JNI evolution). A newsgroup post would be the right thing. Andreas, can you do it? Thanks,

/be

Comment 13

8 years ago
Will do.
(Assignee)

Comment 14

8 years ago
So what about this bug before we eliminate the local roots? I need this for bug 523792.

Comment 15

8 years ago
Comment on attachment 407961 [details] [diff] [review]
v2


> static void
>+MarkThreadData(JSTracer *trc, JSThreadData *td)
>+{
>+#ifdef JS_TRACER
>+    td->traceMonitor.mark(trc);
>+#endif
>+    if (td->localRootStack)
>+        MarkLocalRoots(trc, td->localRootStack);
>+}

Why make this a static method instead of a member?

>+
>+static void
> PurgeThreadData(JSContext *cx, JSThreadData *data)
> {
>-    data->gcFreeLists.purge();
>+    data->purgeGCFreeLists();

Why this change?

>-    thread->data.mark((JSTracer *)arg);
>+    MarkThreadData((JSTracer *) arg, &thread->data);

See above.


>     if (!lrs) {
>-        lrs = (JSLocalRootStack *) cx->malloc(sizeof *lrs);
>-        if (!lrs)
>-            return JS_FALSE;
>+        lrs = (JSLocalRootStack *) js_malloc(sizeof *lrs);
>+        if (!lrs) {
>+            js_ReportOutOfMemory(cx);
>+            return false;
>+        }

Why no longer cx->malloc?

>-        cx->localRootStack = lrs;
>+        td->gcFreeLists.moveTo(&lrs->gcFreeLists);
>+        td->localRootStack = lrs;

I don't mind applying the patch if its tested and ready to go. I don't think this will make removing LRS harder.
(Assignee)

Comment 16

8 years ago
(In reply to comment #15)
> (From update of attachment 407961 [details] [diff] [review])
> 
> > static void
> >+MarkThreadData(JSTracer *trc, JSThreadData *td)
> >+{
> >+#ifdef JS_TRACER
> >+    td->traceMonitor.mark(trc);
> >+#endif
> >+    if (td->localRootStack)
> >+        MarkLocalRoots(trc, td->localRootStack);
> >+}
> 
> Why make this a static method instead of a member?

That is a left over from an initial version of the patch. I have kept it not to pollute a structure in a header file with a declaration of the method that is only called by other static methods. I can change this back to member, but since this a matter of style, I would like to have a second opinion on that.

> >+
> >+static void
> > PurgeThreadData(JSContext *cx, JSThreadData *data)
> > {
> >-    data->gcFreeLists.purge();
> >+    data->purgeGCFreeLists();
> 
> Why this change?

purgeGCFreeLists purges either data->gcFreeLists or, when the local roots are active, the free list copy in the local root stack while checking that data->gcFreeLists all points to null.


> Why no longer cx->malloc?

That fixes preexisting bug. cx->malloc should be used only for memory that is released during GC finalization. Local roots, on the other hand, are destroyed  explicitly when the code deactivates them.
(In reply to comment #16)
> > Why make this a static method instead of a member?
> 
> That is a left over from an initial version of the patch. I have kept it not
> to pollute a structure in a header file with a declaration of the method that
> is only called by other static methods. I can change this back to member, but
> since this a matter of style, I would like to have a second opinion on that.

Second opinion: member functions make for tighter code with less "area" to read to understand it.  Use a member function.  :-)


> > Why no longer cx->malloc?
> 
> That fixes preexisting bug. cx->malloc should be used only for memory that is
> released during GC finalization. Local roots, on the other hand, are destroyed 
> explicitly when the code deactivates them.

I keep reading this and thinking I understand the stated problem, but the kernel of it eludes me.  Can you explain your second sentence further?  It sounds like there should be an assertion in JSContext::free about the identity of the current thread, whether we're in the middle of a GC, and the identity of the potential GCing thread.
(Assignee)

Comment 18

8 years ago
Created attachment 408635 [details] [diff] [review]
v3

The new version of the patch turns all JSThreadData-related functions into member methods of JSThreadData.
Attachment #407961 - Attachment is obsolete: true
Attachment #408635 - Flags: review?(gal)
Attachment #407961 - Flags: review?(gal)
(Assignee)

Comment 19

8 years ago
(In reply to comment #17)
> I keep reading this and thinking I understand the stated problem, but the
> kernel of it eludes me.  Can you explain your second sentence further? 

Local roots follows the same LIFO as the native stack. In particular, all the memory backing them is released when js_LeaveLocalRottsWithResult is called independent from any GC activity. As such that memory does not contribute to GC memory pressure.

> It
> sounds like there should be an assertion in JSContext::free about the identity
> of the current thread, whether we're in the middle of a GC, and the identity of
> the potential GCing thread.

It is hard to avoid false positives with such assert. For example, xpconnect calls JSContext::free from GC_END callback when all the GC activities is finished.
(Assignee)

Comment 20

8 years ago
review ping

Comment 21

8 years ago
Comment on attachment 408635 [details] [diff] [review]
v3

Looks good. I recommend fuzzing this a bit before committing it. Jesse and Gary are always happy to help.

>-        lrs = (JSLocalRootStack *) cx->malloc(sizeof *lrs);
>-        if (!lrs)
>-            return JS_FALSE;
>+        lrs = (JSLocalRootStack *) js_malloc(sizeof *lrs);
>+        if (!lrs) {
>+            js_ReportOutOfMemory(cx);
>+            return false;
>+        }

Why this change? Thats basically exactly what cx->malloc does, plus it accounts for memory pressure.


>-        lrc = (JSLocalRootChunk *) cx->malloc(sizeof *lrc);
>-        if (!lrc)
>+        lrc = (JSLocalRootChunk *) js_malloc(sizeof *lrc);
>+        if (!lrc) {
>+            js_ReportOutOfMemory(cx);
>             return -1;
>+        }

dito.

>+struct JSLocalRootChunk;
>+
>+#define JSLRS_CHUNK_SHIFT       8
>+#define JSLRS_CHUNK_SIZE        JS_BIT(JSLRS_CHUNK_SHIFT)
>+#define JSLRS_CHUNK_MASK        JS_BITMASK(JSLRS_CHUNK_SHIFT)
>+
>+struct JSLocalRootChunk {
>+    jsval               roots[JSLRS_CHUNK_SIZE];
>+    JSLocalRootChunk    *down;
>+};
>+
>+struct JSLocalRootStack {
>+    uint32              scopeMark;
>+    uint32              rootCount;
>+    JSLocalRootChunk    *topChunk;
>+    JSLocalRootChunk    firstChunk;
>+
>+    /* See comments in js_NewFinalizableGCThing. */
>+    JSGCFreeLists       gcFreeLists;
>+};
>+
>+const uint32 JSLRS_NULL_MARK = uint32(-1);
>+

Could we use js::vector for this? I just remove a similiar manual data structure from jsgc.

>         if (!canGC) {
>             METER(rt->gcStats.doubleArenaStats.fail++);
>             JS_UNLOCK_GC(rt);
>-
>-            if (!JS_ON_TRACE(cx)) {
>-                /* Trace code handle this on its own. */
>-                js_ReportOutOfMemory(cx);
>-            }

Maybe assert here?
Attachment #408635 - Flags: review?(gal) → review+
(Assignee)

Comment 22

8 years ago
(In reply to comment #21)
> >+        lrs = (JSLocalRootStack *) js_malloc(sizeof *lrs);
> >+        if (!lrs) {
> >+            js_ReportOutOfMemory(cx);
> >+            return false;
> >+        }
> 
> Why this change? Thats basically exactly what cx->malloc does, plus it accounts for memory pressure.

The local roots follow normal LIFO order and running the GC cannot free any memory holding them. So as such the should not use cx->malloc and the above fixes an old bug. Note that I agree 100% that it would be nice if we could call just the system malloc like in bug 506125, but until that is fixed it is better to follow the existing model.

> Could we use js::vector for this? I just remove a similiar manual data
> structure from jsgc.

That could be done in a separated bug. On the other hand currently local roots use a chunked list of memory blocks, not a vector. As such they follow more closely the JS arena model and IMO it would be better to adopt that here.
(Assignee)

Comment 23

8 years ago
Created attachment 411658 [details] [diff] [review]
v4

The new version syncs the patch with the tip.
Attachment #408635 - Attachment is obsolete: true
Attachment #411658 - Flags: review+
(Assignee)

Comment 24

8 years ago
Gary, could you fuzz the patch from the comment 23? It is made against the revision 73efaee1b2ba in tracemonkey.

Note that without at least some E4X/XML and the GC calls the code paths that patch alters most will not be exercised.
(In reply to comment #24)
> Gary, could you fuzz the patch from the comment 23? It is made against the
> revision 73efaee1b2ba in tracemonkey.
> 
> Note that without at least some E4X/XML and the GC calls the code paths that
> patch alters most will not be exercised.

Sure!
(In reply to comment #23)
> Created an attachment (id=411658) [details]
> v4
> 
> The new version syncs the patch with the tip.

This seems to hold up just fine after 30-60mins of fuzzing. I'm not yet sure if it'll turn up anything in the future though.
(Assignee)

Comment 27

8 years ago
(In reply to comment #26)
> This seems to hold up just fine after 30-60mins of fuzzing.

Thanks!
Whiteboard: fixed-in-tracemonkey
(Assignee)

Updated

8 years ago
blocking2.0: --- → ?

Comment 29

8 years ago
http://hg.mozilla.org/mozilla-central/rev/b8d0c2166e3d
Status: NEW → RESOLVED
Last Resolved: 8 years ago
Resolution: --- → FIXED

Updated

8 years ago
blocking2.0: ? → -
You need to log in before you can comment on or make changes to this bug.