Bug 688641 - Allocate GC chunks using helper thread.

Status: RESOLVED FIXED
Product: Core
Classification: Components
Component: JavaScript Engine
Version: unspecified
Platform: All
Importance: -- normal
Target Milestone: mozilla10
Assigned To: Igor Bukanov
Depends on: 694883
Blocks: 671702

Reported: 2011-09-22 15:52 PDT by Igor Bukanov
Modified: 2011-12-07 08:05 PST


Attachments:
v1 (39.45 KB, patch), 2011-09-23 12:18 PDT, Igor Bukanov, no flags
v2 (41.75 KB, patch), 2011-09-24 11:36 PDT, Igor Bukanov, no flags
v3 (41.35 KB, patch), 2011-09-24 12:21 PDT, Igor Bukanov, no flags
v4 (45.40 KB, patch), 2011-09-25 14:23 PDT, Igor Bukanov, no flags
v5 (44.30 KB, patch), 2011-09-27 06:47 PDT, Igor Bukanov, no flags
v6 (47.33 KB, patch), 2011-09-28 04:23 PDT, Igor Bukanov, wmccloskey: review+

Description Igor Bukanov 2011-09-22 15:52:38 PDT
Using smaller GC chunks, as suggested in bug 671702 as a way to improve memory utilization, requires a faster way to allocate GC chunks. Otherwise, at least on OSX, we take too big a performance hit, as mmap is rather expensive on that platform.

One idea is to use a background thread for chunk allocation, the same thread that currently releases GC chunks back to the OS. This way the chunks will be allocated and released by the same kernel thread, improving cache utilization on multi-core CPUs. We can also release empty chunks earlier and keep their number small, minimizing JS heap memory usage.

The question is whether this improves the overall allocation speed or whether the extra locking/signaling would offset any wins.
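The scheme described above can be sketched roughly as follows. This is a hypothetical C++11 simplification using std::thread and std::condition_variable; the actual patch uses SpiderMonkey's GC helper thread and lock primitives, and Chunk here is just a malloc'd stand-in for a real 1 MiB GC chunk:

```cpp
#include <condition_variable>
#include <cstddef>
#include <cstdlib>
#include <deque>
#include <mutex>
#include <thread>

// Hypothetical stand-in for a 1 MiB GC chunk.
struct Chunk {
    static Chunk *allocate() { return static_cast<Chunk *>(std::malloc(1 << 20)); }
    static void release(Chunk *c) { std::free(c); }
};

class BackgroundChunkPool {
  public:
    BackgroundChunkPool(std::size_t lowWater, std::size_t highWater)
      : lowWater_(lowWater), highWater_(highWater),
        shutdown_(false), thread_([this] { run(); }) {}

    ~BackgroundChunkPool() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            shutdown_ = true;
        }
        wake_.notify_one();
        thread_.join();
        for (Chunk *c : pool_)
            Chunk::release(c);
    }

    // Never blocks: take a pooled chunk if one is ready; otherwise
    // allocate directly on the calling thread.
    Chunk *getChunk() {
        Chunk *c = nullptr;
        bool replenish;
        {
            std::lock_guard<std::mutex> lock(mutex_);
            if (!pool_.empty()) {
                c = pool_.back();
                pool_.pop_back();
            }
            replenish = pool_.size() <= lowWater_;
        }
        if (replenish)
            wake_.notify_one();            // ask the helper to top up the pool
        return c ? c : Chunk::allocate();  // fall back, never wait
    }

  private:
    void run() {
        std::unique_lock<std::mutex> lock(mutex_);
        for (;;) {
            // Sleep until the pool drains below the low-water mark.
            wake_.wait(lock, [this] {
                return shutdown_ || pool_.size() <= lowWater_;
            });
            if (shutdown_)
                return;
            while (pool_.size() < highWater_ && !shutdown_) {
                lock.unlock();             // allocate outside the lock
                Chunk *c = Chunk::allocate();
                lock.lock();
                if (!c)
                    break;
                pool_.push_back(c);
            }
        }
    }

    const std::size_t lowWater_;
    const std::size_t highWater_;
    std::deque<Chunk *> pool_;
    std::mutex mutex_;
    std::condition_variable wake_;
    bool shutdown_;
    std::thread thread_;
};
```

Note the non-blocking getChunk(): the allocating thread never waits for the helper, which matches the design debated in the comments below.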
Comment 1 Igor Bukanov 2011-09-23 12:18:01 PDT
Created attachment 562127 [details] [diff] [review]
v1

The patch uses a small 4-chunk pool to allocate chunks in the background. When the number of chunks in the pool is 2 or less, the background thread starts and replenishes the pool. The pool effectively replaces the previous empty chunk list and the corresponding chunk aging code. On a fast 4-core i5-2400 3.1 GHz under Linux I get the following for http://v8.googlecode.com/svn/data/benchmarks/v6/run.html :

Before the patch            After the patch

Score: 7810                 Score: 7941
Richards: 9825              Richards: 9962
DeltaBlue: 12258            DeltaBlue: 12562
Crypto: 16898               Crypto: 16730
RayTrace: 4387              RayTrace: 4427
EarleyBoyer: 9327           EarleyBoyer: 9305
RegExp: 1907                RegExp: 1917
Splay: 11164                Splay: 12044

This shows a 7% improvement in splay.

I will test this more.
Comment 2 Igor Bukanov 2011-09-23 12:31:35 PDT
When I test the patch in the js shell using js/src/v8/run.js, I observe the speedup rather sporadically. That is, 3 of 4 runs show no change from the baseline in the total score or in splay, but 1 in 4 runs shows a 6-7% speedup.
Comment 3 Igor Bukanov 2011-09-23 13:03:58 PDT
On a dual-core AMD Athlon(tm) II Neo K325 1.3 GHz CPU (AMD's answer to the Intel Atom) under 32-bit Linux I see over a 10% speedup in splay when run in the js shell:

Splay: 3928   ->    Splay: 4572

The rest of the benchmarks are unaffected.
Comment 4 Igor Bukanov 2011-09-23 13:34:56 PDT
To test how the patch behaves on a single-core CPU I disabled one of the two cores on that Neo K325 1.3 GHz CPU. With the patch I see no change in the splay benchmark; the score stays around 3400. However, DeltaBlue and RayTrace regress badly:

DeltaBlue: 3914 -> DeltaBlue: 3362, an 18% slowdown
RayTrace: 1498  -> RayTrace: 1367, a 10% slowdown
Comment 5 Gregor Wagner [:gwagner] 2011-09-23 13:53:27 PDT
(In reply to Igor Bukanov from comment #4)
> To test how the patch behaves on one core CPU I disabled one of two cores on
> that Neo K325 1.3 GHz CPU. With the patch I see no change in splay
> benchmark, the score stays around 3400. However, DeltaBlue and RayTrace
> regress badly:
> 
> DeltaBlue: 3914 -> DeltaBlue: 3362 , 18% slowdown
> RayTrace: 1498  -> RayTrace: 1367 , 10 % slowdown

So you only do allocation on the background thread, and if there is no chunk left we wait for the background thread to run?
Comment 6 Igor Bukanov 2011-09-23 14:05:36 PDT
(In reply to Gregor Wagner from comment #5)
> So you do only allocation on the background thread and if there is no chunk
> left we wait for the background thread to run?

No - the patch never waits for the background thread. If the background thread has not managed to get a chunk before the allocation thread empties the pool, the allocation thread always allocates a chunk on its own.

But perhaps this is wrong on a single-core CPU due to potentially bad contention when 2 threads try to mmap 1MB chunks simultaneously. So I will update the patch so the allocation thread waits if it sees that the background thread is trying to allocate a chunk.

If that does not help with the regression, I will add a runtime switch to the patch to revert to the old behavior. The switch can be activated on single-core CPUs.
Comment 7 Igor Bukanov 2011-09-24 11:36:51 PDT
Created attachment 562248 [details] [diff] [review]
v2

I was not able to fix the v1 regression on a single-core CPU via changes in the scheduling policy. mmap has an inherent cost, and with one core the only way to hide it is to avoid mmap.

So in v2 the GC initialization code queries the number of CPUs in the system and switches background allocation on only on multi-core systems. On single-core CPUs the patch continues to use the current chunk aging code.
Comment 8 Igor Bukanov 2011-09-24 12:21:53 PDT
Created attachment 562253 [details] [diff] [review]
v3

According to http://developer.apple.com/library/mac/#documentation/Darwin/Reference/ManPages/man3/sysconf.3.html , sysconf(_SC_NPROCESSORS_ONLN) is available on OSX starting from 10.4, so there is no need for a Mac-specific version of GetCPUCount() that uses sysctl/CTL_HW/HW_AVAILCPU/HW_NCPU.
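For reference, a GetCPUCount() along those lines could be as simple as the following (a sketch; the actual patch may differ in its fallback and non-POSIX handling):

```cpp
#include <unistd.h>

// Number of online CPUs, falling back to 1 if the query fails.
// _SC_NPROCESSORS_ONLN works on Linux and, per the man page cited
// above, on OS X from 10.4 on.
unsigned GetCPUCount() {
    long n = sysconf(_SC_NPROCESSORS_ONLN);
    return n < 1 ? 1u : static_cast<unsigned>(n);
}
```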
Comment 9 Igor Bukanov 2011-09-25 14:23:32 PDT
Created attachment 562329 [details] [diff] [review]
v4

The patch implements background chunk allocation when the number of CPUs is at least 2. In that case, if the number of available empty chunks drops below ChunkPool::BACKGROND_CHUNK_MIN, the helper thread is asked to get more chunks. It does that until there are at least ChunkPool::BACKGROND_CHUNK_MAX chunks.

In the patch I have chosen 3 and 6 as the values for min/max after testing on that dual-core 1.3 GHz Neo netbook. On a 4-core desktop even 0 and 1 would work just as well. I suppose on that desktop there is always a core to spare to run the helper thread, so the thread always manages to allocate one more chunk before the last one is exhausted. But on the netbook a small queue is necessary to mitigate sporadic busy states of the system.

The allocation thread never waits for background allocation to finish. If there are no empty chunks, it allocates a chunk on its own even if the helper thread is trying to do the same. This not only simplifies the code but also avoids introducing a bottleneck. On a busy system with many cores and many threads allocating, the helper thread may not be able to keep up with chunk demand. In that case any wait would slow down the system.

The GC does wait for the allocation to finish. In principle there is no need to do that before finalization starts; only at that stage is the chunk pool accessed by the GC thread without taking the GC lock. But waiting for the allocation together with waiting for the sweeping leads to simpler code.

Another place where the code waits for background allocation to finish is when retrying after an OOM error. There I also added a call to empty the chunk pool. It should also help when background allocation is disabled.

The drawback of the patch is that on my netbook the V8 scores for Splay and, to a lesser extent, DeltaBlue became more volatile, even though Splay is improved by 15% on that system.
Comment 10 Igor Bukanov 2011-09-26 07:17:48 PDT
Comment on attachment 562329 [details] [diff] [review]
v4

This version of the patch allocates 6 chunks in the background even if the runtime only ever needs one, like a web workers runtime or a shell running a small script. I will fix this in a new patch.
Comment 11 Igor Bukanov 2011-09-27 06:47:00 PDT
Created attachment 562746 [details] [diff] [review]
v5

Besides the problem with wasting too much memory, I observed regressions with v4 in benchmarks other than splay. The Linux perf tools showed that this came from a significant increase in cache misses. The problem is that the patch aggressively released chunks during GC. I assumed that background allocation would allow chunks to be reacquired quickly, so there was no point in keeping them around. However, chunk release and allocation invalidate caches, including the caches for the most recently freed chunk. That was enough to slow things down.

It is possible to fix this by releasing the least recently used chunk, but for now I decided to take a simpler approach and keep the chunk aging code as is. The new patch also starts background allocation only when there are no empty chunks. It turned out that even my 1.3 GHz dual-core Neo netbook can get a new chunk in time for a new allocation, even on a relatively busy system, as long as it does not race in the kernel with another thread doing mmap (another problem in v4).

The new patch preserves that 15% win in V8 splay on the netbook, and a small but noticeable win in DeltaBlue. On a desktop with a 3.1 GHz 4-core Intel i5-2400 CPU the win is within the noise. The perf tool also shows that on the netbook the CPU utilization with the patch increased from 107% to 130%, while on the desktop it changed from 101% to 101.5%. Apparently on the desktop mmap is fast enough not to worry about its performance implications.
Comment 12 Igor Bukanov 2011-09-28 04:23:00 PDT
Created attachment 563031 [details] [diff] [review]
v6

The new version adds reporting on the ratio of chunks allocated on the background thread versus the allocation thread.

Bill, could you review this, as Gregor does not have time?
Comment 13 Bill McCloskey (:billm) 2011-10-05 17:27:37 PDT
Comment on attachment 563031 [details] [diff] [review]
v6

Review of attachment 563031 [details] [diff] [review]:
-----------------------------------------------------------------

Thanks, Igor. I saw a speedup on splay under Linux with this patch.

::: js/src/jscntxt.cpp
@@ +1489,5 @@
>  JS_FRIEND_API(void *)
>  JSRuntime::onOutOfMemory(void *p, size_t nbytes, JSContext *cx)
>  {
> +    /*
> +     * Retry when we done with the background sweeping and stopped all the

"when we done" -> "when we are done"
"and stopped" -> "and have stopped"

::: js/src/jsgc.cpp
@@ +428,5 @@
> +           rt->gcChunkSet.count() >= 4;
> +}
> +#endif
> +
> +/* Must be called with the GC lock taken. */

Can we assert this? (And in other places?)

@@ +455,5 @@
> +#ifdef JS_THREADSAFE
> +    /*
> +     * We schedule the background allocation after a potential allocate call
> +     * above to avoid getting chunks from two threads in parallel and races in
> +     * the OS.

Are you saying here that you don't want to start the background allocation earlier because then two threads may try to allocate a chunk simultaneously, which would increase lock contention in the OS?

@@ +2084,5 @@
>  
> +inline JSRuntime *
> +GCHelperThread::runtime()
> +{
> +    return reinterpret_cast<JSRuntime *>(reinterpret_cast<uintptr_t>(this) - offsetof(JSRuntime, gcHelperThread));

Uh, is this really necessary? If you want to eliminate the argument, why not just add a field to store the runtime?

@@ +2160,5 @@
> +                if (rt->gcChunkPool.wantBackgroundAllocation(rt))
> +                    break;
> +            }
> +            if (state == ALLOCATING)
> +                state = IDLE;

Could you try to simplify the control flow here? Something like this seems better to me:
    case ALLOCATING:
        do {
            Chunk *chunk;
            {
                AutoUnlockGC unlock(rt);
                chunk = Chunk::allocate();
            }
            if (!chunk)
                break;
            rt->gcChunkPool.put(rt, chunk);
        } while (state == ALLOCATING && rt->gcChunkPool.wantBackgroundAllocation(rt));
        if (state == ALLOCATING)
            state = IDLE;

@@ +2213,5 @@
> +}
> +
> +/* Must be called with the GC lock taken. */
> +inline void
> +GCHelperThread::startBackgroundAllocationIfIdle() {

Brace should be on new line.

@@ +2250,5 @@
> +     * Expire the chunks released during the GC so they will be available to
> +     * the rest of the system immediately.
> +     */
> +    JSRuntime *rt = runtime();
> +    rt->gcChunkPool.expire(rt, shouldShrink());

I understand why you want to release early. However, if this is a shrinking GC, isn't it possible that more chunks will be freed below and we will fail to free them immediately? Could you put a second expiration after the backgroundFinalize calls?

Also, is it important that this expiration happens before the unlock? Does it reduce contention or something?

@@ -2723,5 @@
>  
>          {
> -#ifdef JS_THREADSAFE
> -            rt->gcHelperThread.waitBackgroundSweepEnd(rt);
> -#endif

This call never made any sense to me. Do you know why it was there in the first place?

::: js/src/jslock.h
@@ +197,5 @@
>  
>  #define JS_ATOMIC_SET_MASK(w, mask) js_AtomicSetMask(w, mask)
>  #define JS_ATOMIC_CLEAR_MASK(w, mask) js_AtomicClearMask(w, mask)
>  
> +extern  unsigned

Two spaces here.
Comment 14 Igor Bukanov 2011-10-06 02:01:30 PDT
(In reply to Bill McCloskey (:billm) from comment #13)
> > +/* Must be called with the GC lock taken. */
> 
> Can we assert this? (And in other places?)

I will file a separate bug about that.

> > +inline JSRuntime *
> > +GCHelperThread::runtime()
> > +{
> > +    return reinterpret_cast<JSRuntime *>(reinterpret_cast<uintptr_t>(this) - offsetof(JSRuntime, gcHelperThread));
> 
> Uh, is this really necessary? If you want to eliminate the argument, why not
> just add a field to store the runtime?

Then it would be necessary to add #ifdef JS_THREADSAFE to the JSRuntime constructor around gcHelperThread(this). That is even uglier IMO, so I went with the offsetof hack, which also eliminates an extra load.
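The offsetof trick being discussed can be illustrated in isolation (a standalone sketch with hypothetical Runtime/HelperThread types standing in for JSRuntime/GCHelperThread):

```cpp
#include <cstddef>
#include <cstdint>

struct Runtime;

// The member object recovers a pointer to its containing object from
// its own address, the way GCHelperThread::runtime() did in the patch.
struct HelperThread {
    Runtime *runtime();
};

struct Runtime {
    int someField;
    HelperThread helperThread;
};

// Well-defined for offsetof here because Runtime is standard-layout.
inline Runtime *HelperThread::runtime() {
    return reinterpret_cast<Runtime *>(
        reinterpret_cast<std::uintptr_t>(this) - offsetof(Runtime, helperThread));
}
```

The trade-off Bill raises is real: a plain JSRuntime * field is clearer, at the cost of one extra word and an extra load, and that is what eventually landed (comment 15).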

> > +    rt->gcChunkPool.expire(rt, shouldShrink());
> 
> I understand why you want to release early. However, if this is a shrinking
> GC, isn't it possible that more chunks will be freed below and we will fail
> to free them immediately? Could you put a second expiration after the
> backgroundFinalize calls?

During the background sweep the empty chunks are released immediately, so there is no need for the second expire call.

> Also, is it important that this expiration happens before the unlock? Does
> it reduce contention or something?

I should file another bug about this. To reduce the risk of regressions I have not changed the present logic of releasing chunks under the lock.

> 
> @@ -2723,5 @@
> >  
> >          {
> > -#ifdef JS_THREADSAFE
> > -            rt->gcHelperThread.waitBackgroundSweepEnd(rt);
> > -#endif
> 
> This call never made any sense to me. Do you know why it was there in the
> first place?

I do not know.
Comment 15 Igor Bukanov 2011-10-06 03:17:12 PDT
http://hg.mozilla.org/integration/mozilla-inbound/rev/f8fa25d9551c - landed using a JSRuntime * as an extra field to avoid the offsetof hack above. With bug 647390 fixed, adding the gcHelperThread(rt) initializer to ::JSRuntime is no longer ugly.
Comment 16 Bill McCloskey (:billm) 2011-10-06 10:34:07 PDT
(In reply to Igor Bukanov from comment #15)
> http://hg.mozilla.org/integration/mozilla-inbound/rev/f8fa25d9551c - landed
> with using JSRuntime * as an extra field to avoid the offsetof hack above.
> With the bug 647390 fixed adding the gcHelperThread(rt) initializer to
> ::JSRunime is no longer ugly.

It looks like GCHelperThread::runtime() is still there.
Comment 17 Igor Bukanov 2011-10-06 10:52:52 PDT
(In reply to Bill McCloskey (:billm) from comment #16)
> (In reply to Igor Bukanov from comment #15)
> > http://hg.mozilla.org/integration/mozilla-inbound/rev/f8fa25d9551c - landed
> > with using JSRuntime * as an extra field to avoid the offsetof hack above.
> > With the bug 647390 fixed adding the gcHelperThread(rt) initializer to
> > ::JSRunime is no longer ugly.
> 
> It looks like GCHelperThread::runtime() is still there.

Sorry, I forgot to run hg refresh before exporting the patch. I will land the extra changes separately.
Comment 18 Igor Bukanov 2011-10-06 10:59:32 PDT
http://hg.mozilla.org/integration/mozilla-inbound/rev/5e51a2a6ef4f - the missing part of the previous commit.
