Open Bug 638660 Opened 9 years ago Updated Last year

Use background thread for parallel marking.

Categories

(Core :: JavaScript Engine, defect)

x86
macOS
defect
Not set

Tracking

()

blocking-kilimanjaro -

People

(Reporter: gwagner, Unassigned)

References

(Depends on 1 open bug, Blocks 2 open bugs)

Details

(Whiteboard: [Snappy:P3])

Attachments

(1 file, 10 obsolete files)

I implemented a first draft version of parallel marking.

The first design includes 2 GCMarkers. One for the main thread that performs the GC and one for the background thread. Each GCMarker has a deque that acts as a marking stack. A GCMarker pushes and pops from one end and another GCMarker can steal objects from the other end.
The main thread executes MarkRuntime and only pushes roots on the deque. The background thread starts stealing objects from the other side. After all initial roots are pushed, the main thread also starts marking.
Assignee: general → anygregor
Attached patch very first draft (obsolete) — Splinter Review
This is an early draft with lots of debug code in there.

I tried the realtime raytracer with the new approach and the marking time goes down from 200ms to 85 - 90ms!

A typical GC looks like (in ms):

       Total, Marking, sweeping
Trunk: 300.3,  205.0,   94.5,
Patch: 180.5,   85.1,   94.6,

Very promising results!
Depends on: 616666
Attached patch patch (obsolete) — Splinter Review
Updated version based on the new marking stack.
The marking time reduction is around 20-40% all over the place.
Since we don't spend much time sweeping any more the marking time reduction is equal to the whole GC pause time reduction.
Attachment #517021 - Attachment is obsolete: true
Attached patch patch (obsolete) — Splinter Review
Tryserver is green.

The gcHelperThread can steal work from the main mark stack.

Large objects are split into half sized largeMarkStackItems and pushed on the largeMarkingStack. The gcHelperThread can steal now the second half. This is definitely an important for the realtime raytracer since it only has one large object with about 1 mill kids.

I also changed OSAtomicCompareAndSwapPtrBarrier on mac to __sync_bool_compare_and_swap because because it is about 10% faster.
Attachment #527426 - Attachment is obsolete: true
Attached patch patch (obsolete) — Splinter Review
update
Attachment #528326 - Attachment is obsolete: true
Attached patch patch (obsolete) — Splinter Review
Attachment #528697 - Attachment is obsolete: true
Attachment #528857 - Flags: review?(gal)
Attached patch patch (obsolete) — Splinter Review
Don't use CAS for single-threaded marking.
Attachment #528857 - Attachment is obsolete: true
Attachment #528857 - Flags: review?(gal)
Attachment #528986 - Flags: review?(gal)
Attached patch patch (obsolete) — Splinter Review
merge
Attachment #528986 - Attachment is obsolete: true
Attachment #528986 - Flags: review?(gal)
Attachment #529235 - Flags: review?(gal)
It looks like the usage of rt->marking has a problem similar to the bug 652416. That is, it  assumes that !rt->marking in GCHelperThread::doMark implies that it can query the state of variables written on the GC thread. This is unsound even on x86.
Other problems:

delayMarkingChildren and drainMarkStack() are not modified for safe parallel marking. I suppose one safest solution would be to abandon the parallel marking if we need delayed marking. 

GCMarker::helpOtherMarker asserts that tmp->isMarked(). I am not sure that we can do that. I assume that as a part of the queue implementation some memory barriers would be issued, but I do not see that those would be enough to guarantee the read ordering here. I guess we can just remove the assert with comments here.
The queue algorithm assumes that the maximum queue size on 32-bit CPU is 2^21 so the rest 11 bits can be used to implement unique tagging. I am not sure that this is enough to ensure that the tag never wraps in practice. So what about changing the bit split to use just 16 bits for queue size as we do not use bigger queues?
The patch adds the second signal to the background thread intended to be used to signal the start of marking phase. As implemented that could lead to a deadlock in GCHelperThread::finish as that thread sends the wakeup signal. If the background thread waits for the the marking signal at that memonet, then it may never wake up. This could be fixed via keeping just one signal and using an enum to describe the reason for the wakeup like MARK, SWEEP, SHUTDOWN.
(In reply to comment #8)
> It looks like the usage of rt->marking has a problem similar to the bug 652416.
> That is, it  assumes that !rt->marking in GCHelperThread::doMark implies that
> it can query the state of variables written on the GC thread. This is unsound
> even on x86.

Thx for looking over it!

rt->marking should be surrounded by Locks (start/end BackgroundMark). What control-flow do you see where it doesn't work?

For delayMarkingChildren and drainMarkStack():
I see what you mean and we can definitely use a lock there and don't need to make it more complicated.

Reordering in the CPU can lead to a situation where both GCMarkers can get the same object and therefore call ScanObject on it as you mentioned in an email but I don't see a problem there.
(In reply to comment #12)
> rt->marking should be surrounded by Locks (start/end BackgroundMark). What
> control-flow do you see where it doesn't work?

Hm, I do not see how rt->marking is protected by the locks. In all cases AFAICS it is read/modified outside the lock. Fortunately the ordering here does not matter as the usage of it in doMark does not depend on it.

But it is rather unclear what is the ourpose of that field. Is it to ensure that the background marking would not stop prematurely? That is, the code prefer for that thread to busy-loop if the main thread has not managed to generate enough work, right? In any case, the patch needs comments explaining this.

> 
> For delayMarkingChildren and drainMarkStack():
> I see what you mean and we can definitely use a lock there and don't need to
> make it more complicated.

Yes, the lock seems to be the simplest solution.

> 
> Reordering in the CPU can lead to a situation where both GCMarkers can get the
> same object and therefore call ScanObject on it as you mentioned in an email
> but I don't see a problem there.

The problem is with any code that uses IS_GC_MARKING_TRACER to modify data structures under the real marking. Currently these do not tolerate the parallel marking. So we either fix the queue implementation to ensure that two threads cannot pop the same object or fix the users of IS_GC_MARKING_TRACER.

BTW, regarding the implementation, do you know if the usage of dequeue in V8 in fact tolerates popping of the same object? Also it would be interesting to know if at least on X86 it is impossible for a thread A to read an older value of some variable that was updated by compare-and-swap on thread B. http://www.rdrop.com/users/paulmck/scalability/paper/whymb.2009.04.05a.pdf (table on page 16) mentioned that on X86 atomic operation and load/store are never reordered, but I do not see if this applies across CPUs or means that an atomic operation on X86 acts as a read-write barrier for the CPU that issues them and has no influence on other CPUs.
For references, according to Intel manual , http://download.intel.com/design/processor/manuals/253668.pdf , page 329:

Reads may be reordered with older writes to different locations but not with older writes to the same location.

The example on the page 334 illustrates that explicitly.

Hence the Deque implementation from the patch is not safe even on x86. In popBottom() the assignment "bot = currBot;" can happen after the the code read oldTaggedTop.getTop() in "currBot > oldTaggedTop.getTop()". If that condition is true and after this read but before the re-ordered assignment "bot = currBot;" another CPU will be in popTop(), then for it the condition "currBot <= oldTaggedTop.getTop()" will be false so it ptoceed with compare-and-swap to update the top. That will succeed as popBottom() does not change oldTaggedTop in the fast pop case. This results in popBottom() and popTop() returning the same element.
(In reply to comment #14)
> For references, according to Intel manual ,
> http://download.intel.com/design/processor/manuals/253668.pdf , page 329:
> 
> Reads may be reordered with older writes to different locations but not with
> older writes to the same location.
> 
> The example on the page 334 illustrates that explicitly.
> 
> Hence the Deque implementation from the patch is not safe even on x86. In
> popBottom() the assignment "bot = currBot;" can happen after the the code read
> oldTaggedTop.getTop() in "currBot > oldTaggedTop.getTop()". If that condition
> is true and after this read but before the re-ordered assignment "bot =
> currBot;" another CPU will be in popTop(), then for it the condition "currBot
> <= oldTaggedTop.getTop()" will be false so it ptoceed with compare-and-swap to
> update the top. That will succeed as popBottom() does not change oldTaggedTop
> in the fast pop case. This results in popBottom() and popTop() returning the
> same element.

We don't ensure exclusive pushing either so we can always end up with a reference to the same object in both GCMarkers.
We might want to add this for large objects but I don't see this as a real problem now.
(In reply to comment #15)
> We don't ensure exclusive pushing either so we can always end up with a
> reference to the same object in both GCMarkers.
> We might want to add this for large objects but I don't see this as a real
> problem now.

See comment 13 above why scanning the same object from two threads can be bad. Now, AFAICS among JS_IsGCMarkingTracer/IS_GC_MARKING_TRACER users only 2 places are not safe, js_TraceXML and WeakMap::mark. The rest just set a flag.
(In reply to comment #16)
> (In reply to comment #15)
> > We don't ensure exclusive pushing either so we can always end up with a
> > reference to the same object in both GCMarkers.
> > We might want to add this for large objects but I don't see this as a real
> > problem now.
> 
> See comment 13 above why scanning the same object from two threads can be bad.
> Now, AFAICS among JS_IsGCMarkingTracer/IS_GC_MARKING_TRACER users only 2 places
> are not safe, js_TraceXML and WeakMap::mark. The rest just set a flag.

I think we also need to worry about shape-regenerating GCs. The code for these happens in ScanShape/ScanObject, where we already know that IS_GC_MARKING_TRACER holds.
(In reply to comment #17)
> (In reply to comment #16)
> > (In reply to comment #15)
> > > We don't ensure exclusive pushing either so we can always end up with a
> > > reference to the same object in both GCMarkers.
> > > We might want to add this for large objects but I don't see this as a real
> > > problem now.
> > 
> > See comment 13 above why scanning the same object from two threads can be bad.
> > Now, AFAICS among JS_IsGCMarkingTracer/IS_GC_MARKING_TRACER users only 2 places
> > are not safe, js_TraceXML and WeakMap::mark. The rest just set a flag.
> 
> I think we also need to worry about shape-regenerating GCs. The code for these
> happens in ScanShape/ScanObject, where we already know that
> IS_GC_MARKING_TRACER holds.

Right now I don't do parallel marking for shape-regenerating GCs.
Do we have any data regarding patch impact on single core CPU without or with hyperthreading like recent Atoms? Also IIRC on ARM CompareAndSwap is relatively expensive and using it for the marking bitmap may offsets the gains from going to 2-core CPU.

I think it is important to collect those data now as we may need to have API that not only disable the background thread but also switch the marking bitmap to use the marking code without CompareAndSwap.
(In reply to comment #19)
> Do we have any data regarding patch impact on single core CPU without or with
> hyperthreading like recent Atoms? Also IIRC on ARM CompareAndSwap is relatively
> expensive and using it for the marking bitmap may offsets the gains from going
> to 2-core CPU.
> 
> I think it is important to collect those data now as we may need to have API
> that not only disable the background thread but also switch the marking bitmap
> to use the marking code without CompareAndSwap.


I would love to test it on different settings but I don't have any equipment for that. The CAS implementation/library is also an important performance factor.
Is mobile currently using the background thread?
(In reply to comment #20)
> I would love to test it on different settings but I don't have any equipment
> for that.

OK, I will test this on single core Atom and on dual-core Neo K345 where I can disable the second core. I suppose it would be sufficient to measure the marking phase timing with some synthetic benchmarks.

> Is mobile currently using the background thread?

The background finalization benefits even single-core CPU as it reduces the GC pause even there for the price of slower performance later and I saw the benefits on Atom CPU clearly last year. This is very different from the parallel marking which would definitely make things slower on single-core CPU. It may even harm hyperthreading as the latter effectively reduces per-thread cache by 50%.

For this reasons I assume we would need API that selectively disables the parallel marking without affecting the background finalize.
(In reply to comment #21)
> (In reply to comment #20)
> > I would love to test it on different settings but I don't have any equipment
> > for that.
> 
> OK, I will test this on single core Atom and on dual-core Neo K345 where I can
> disable the second core. I suppose it would be sufficient to measure the
> marking phase timing with some synthetic benchmarks.
> 
> > Is mobile currently using the background thread?
> 
> The background finalization benefits even single-core CPU as it reduces the GC
> pause even there for the price of slower performance later and I saw the
> benefits on Atom CPU clearly last year. This is very different from the
> parallel marking which would definitely make things slower on single-core CPU.
> It may even harm hyperthreading as the latter effectively reduces per-thread
> cache by 50%.
> 

I think we should focus on the 20-40% improvement for the more common architectures and don't spend too much time measuring synthetic benchmarks on rare or outdated environments. Let's fix the problems we have here and then we can do the performance tests on real web workload and other benchmarks.

Avoiding regressions for special environments shouldn't be a problem once we are aware of them.
(In reply to comment #22)
> I think we should focus on the 20-40% improvement for the more common
> architectures and don't spend too much time measuring synthetic benchmarks on
> rare or outdated environments.

Netbook shipments are at least 10% of all PCs for the last quoter. Typically they do not have dual-core CPU.
Dual core Atom netbooks have been on sale for 3 quarters, and typically have a small price premium (~$50 last year, maybe less now).  I don't think we should be designing for single-core anywhere, especially since mobile is headed that way too.  We should spend the time on other things, like helping with the generational GC work, rather than this or sub-percent gains on benchmarks.
I tested with my new Netbook. (Single atom core with hyperthreading)
I don't see much difference for the v8 benchmark in the browser. I usually get better scores with this patch but the GCTimer results are noisy.
(In reply to comment #25)
> I tested with my new Netbook. (Single atom core with hyperthreading)
> I don't see much difference for the v8 benchmark in the browser. I usually
> get better scores with this patch but the GCTimer results are noisy.

Could you disable hyperthreading in BIOS and try with that?
(In reply to comment #26)
> (In reply to comment #25)
> > I tested with my new Netbook. (Single atom core with hyperthreading)
> > I don't see much difference for the v8 benchmark in the browser. I usually
> > get better scores with this patch but the GCTimer results are noisy.
> 
> Could you disable hyperthreading in BIOS and try with that?

There is no option to disable it. 
Wiki says that all atom cores come with HT now.
Attached patch patch (obsolete) — Splinter Review
Update with most of the comments addressed.
I added locks to the delayed marking and XML marking, removed the second cond var and added an enum that describes the state.

rt->marking is used by the main marker to signal the background thread that no more marking work has to be done. Afterwards the main marker waits for the background marker to finish.

It's a 10% performance increase for the web-based v8 on my netbook with single core.
Attachment #529235 - Attachment is obsolete: true
Attachment #529235 - Flags: review?(gal)
Comment on attachment 532109 [details] [diff] [review]
patch

Review of attachment 532109 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/src/jsgc.h
@@ +380,2 @@
>              *word |= mask;
> +#endif

The non-black color is used only when marking the extra xpcom roots. The patch does not mark them parallel. So we do not need  compare-and-swap after color != BLACK check. Also, why the code uses while + goto rather than if + goto?
(In reply to comment #29)
> Comment on attachment 532109 [details] [diff] [review] [review]
> patch
> 
> Review of attachment 532109 [details] [diff] [review] [review]:
> -----------------------------------------------------------------
> 
> ::: js/src/jsgc.h
> @@ +380,2 @@
> >              *word |= mask;
> > +#endif
> 
> The non-black color is used only when marking the extra xpcom roots. The
> patch does not mark them parallel. So we do not need  compare-and-swap after
> color != BLACK check. Also, why the code uses while + goto rather than if +
> goto?

Thats true. I will update.

Another problem is in JSString::mark(JSTracer *trc) for rope strings. 
Is it safe to mark them in parallel? It's also possible that we mark the same strings in both threads. 
I guess I also need a CAS for the TAG and UNTAG right? Or is it enough to just remove the Asserts in there? Any Rope-String experts here?
Hey, for when this lands, what's the plan to be able to pref it off? Ideally there would be at least a compile-time switch controlling whether we do parallel marking.
(In reply to comment #31)
> Hey, for when this lands, what's the plan to be able to pref it off? Ideally
> there would be at least a compile-time switch controlling whether we do
> parallel marking.

There is a patch in bug 653488 that enables/disables the GCHelperThread but maybe we want something more detailed.
I was talking to Luke about the concurrent marking of rope strings today. We can't use the current implementation because of race conditions. 

There are 2 solutions:
1) change it to a stack based marking
2) add a mark-stack for ropes that is only processed by one thread but both threads can insert.
Depends on: 658041
Attached patch patch (obsolete) — Splinter Review
Update based on bug 658041.
Attachment #532109 - Attachment is obsolete: true
Attachment #535269 - Flags: review?(igor)
(In reply to comment #34)
> Created attachment 535269 [details] [diff] [review] [review]
> patch

Deque from the patch is thread-unsafe, but the chance that we will see the bug is low so we would not have a good coverage for parallel scanning of the same objects. So we need either to fix this or to remove from the implementation any compare-and-swap and related complexity and just say that any object should allow parallel scanning and any marking code must not use any state.

As for the fix the queue implementation can try to use extra compare-and-swap to ensure consistency. Alternatively we can consider implementing thread-safe queue. For example, the queue can be implemented as a vector of blocks of, say, 64 elements. Another thread can only get full block of elements. This way the thread that owns the queue would need to take the lock only once per each 64 elements.
(In reply to comment #35)
> (In reply to comment #34)
> > Created attachment 535269 [details] [diff] [review] [review] [review]
> > patch
> 
> Deque from the patch is thread-unsafe, but the chance that we will see the
> bug is low so we would not have a good coverage for parallel scanning of the
> same objects. So we need either to fix this or to remove from the
> implementation any compare-and-swap and related complexity and just say that
> any object should allow parallel scanning and any marking code must not use
> any state.
> 
> As for the fix the queue implementation can try to use extra
> compare-and-swap to ensure consistency. Alternatively we can consider
> implementing thread-safe queue. For example, the queue can be implemented as
> a vector of blocks of, say, 64 elements. Another thread can only get full
> block of elements. This way the thread that owns the queue would need to
> take the lock only once per each 64 elements.


So you mean we need test-coverage for scanning objects in parallel?

I don't want to make the common case more complex. Somebody could write a testcase where locking in marking XML objects would hurt but I haven't seen any regression on real web-workload or benchmarks so far.
Locking in state-changing code looks fine to me but we can try to get rid of it if it's really a problem.
Hm, the implementation of the JSClass::trace hooks in XPC definitely use state. For example, XPC_WN_Helper_Trace,  http://mxr.mozilla.org/mozilla-central/source/js/src/xpconnect/src/xpcwrappednativejsops.cpp#1099 , calls GetWrappedNativeOfJSObject, uses reference counting and may evebe allocate obejcts. 

How are we going to ensure that its is safe to do the parallel marking of XPConnext things? Perhaps it would be simpler to make sure that they are marked only from the main thread?
(In reply to comment #36)
> So you mean we need test-coverage for scanning objects in parallel?

I mean that we either:

   Ensure that we never scan the same object in parallel from two threads.

or

   Explicitly state that we are going to do so and make Dequeue implementation as fast as possible via removal of compapre-and-swap instructions there. This way at least we get better coverage from testing for any potential threading bugs and would not pay performance price for pseudo-safety.
(In reply to comment #37)
> Hm, the implementation of the JSClass::trace hooks in XPC definitely use
> state. For example, XPC_WN_Helper_Trace, 
> http://mxr.mozilla.org/mozilla-central/source/js/src/xpconnect/src/
> xpcwrappednativejsops.cpp#1099 , calls GetWrappedNativeOfJSObject, uses
> reference counting and may evebe allocate obejcts. 
> 
> How are we going to ensure that its is safe to do the parallel marking of
> XPConnext things? Perhaps it would be simpler to make sure that they are
> marked only from the main thread?

Ok that looks pretty scary and it doesn't look like we can fix xpconnect easily.
Maybe we can use another color to make sure that we leave the JS engine only once for an object?
(In reply to comment #40)
> Maybe we can use another color to make sure that we leave the JS engine only
> once for an object?

That require one more compare-and-swap operation when taking the object from the queue. But then one can use compare-and-swap to set to NULL the object taken from the queue. IMO it would be simpler.

In any case, that would not solve XPConnect problem. The code there expects that it would be called from the main thread and simply fail if not in XPCCallContext::Init. So we must fix that before we can proceed.
Ok let me see if I understand this piece of code correctly.
We create a XPCCallContext in nsXPConnect::Collect() and set some thread private data that we rely on during clasp->trace() callbacks?
This results in a situation where we have to call the tracing callback from the main thread?
(In reply to comment #42)
> Ok let me see if I understand this piece of code correctly.
> We create a XPCCallContext in nsXPConnect::Collect() and set some thread
> private data that we rely on during clasp->trace() callbacks?

XPCCallContext::Init requires that the current thread in its private data should have a valid context associated with it. We should really rewrite this to avoid any XPCCallContext instances, but I have no idea what it would involve. mrbkap should know better.

Blake, could you comment why nsXPConnect::GetJSObjectOfWrapper needs XPCCallContext? That is not used in the function.
Andreas said it's ok to call the callback on another thread as long as the main thread is either stopped or also marking.
(In reply to comment #44)
> Andreas said it's ok to call the callback

Which callback?
(In reply to comment #45)
> (In reply to comment #44)
> > Andreas said it's ok to call the callback
> 
> Which callback?

Oh I mean the mark functions in XPConnect. No callbacks...
I don't know how useful this information is but the try-server is green with this patch. We definitely call the mark functions in XPConnect also from the background thread. So it doesn't fail right away.
Igor is right, we can't safely make XPCCallContexts on random threads, but as long we just call the mark function and the main thread is stalled (or marking as well) while we do so, we should be good. Igor, do you agree?
(In reply to comment #48)
> we can't safely make XPCCallContexts on random threads, but
> as long we just call the mark function and the main thread is stalled (or
> marking as well) while we do so, we should be good.

We need extra work - see comment 43.
(In reply to comment #47)
We definitely call the mark functions in XPConnect also from the
> background thread. So it doesn't fail right away.

Effectively the code in xpconnect early return AFAICS without marking anything. As typically the stuff that should be marked under those functions is reached via multiple ways we would need very specific test cases to trigger the problem. In any case the code there is a mess. We should not calling all those functions just to call JS_TRACE on GC things stored in those wrappers. So lets fix that in a different bug and then land this one with the knowledge that parallel tracing is really possible in all cases. 

On a different subject dom/src/threads contain asserts that insists on the main thread during marking. Those should be fixed. Searching on mxr.mozilla.org for JSTracer shows those places.
GetJSObjectOfWrapper should not take a ccx. Creating a ccx during GC is kinda bad anyway since we might be in an OOM squeeze. Igor, want to take a look (or I can do it Tuesday).
Depends on: 660430
(In reply to comment #51)
> GetJSObjectOfWrapper should not take a ccx. Creating a ccx during GC is
> kinda bad anyway since we might be in an OOM squeeze. Igor, want to take a
> look (or I can do it Tuesday).

I filed bug 660430 about this. I try to look at it over weekend.
Depends on: 660783
Seems like most xpconnect issues are resolved. Thanks for the fast fix Igor!
Is there anything left on the xpconnect side?
(In reply to comment #53)
> Is there anything left on the xpconnect side?

I hope not but it is easy to miss stuff there.
Blocks: GC
Attached patch patch (obsolete) — Splinter Review
Merging again.
There was the problem that popping the same LargeMarkItem from one Deque caused a double free. I introduced a popBottomLocked and popTopLocked that uses a GCLock only for the largeMarkStack.
Attachment #535269 - Attachment is obsolete: true
Attachment #535269 - Flags: review?(igor)
Comment on attachment 538418 [details] [diff] [review]
patch

Still the same performance wins.
Between 20% and 45% overall GC pause time reduction.
Attachment #538418 - Flags: review?(igor)
Depends on: 653488
I will review this on Tuesday-Wednesday when I will be back from some traveling.
Comment on attachment 538418 [details] [diff] [review]
patch

Review of attachment 538418 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/src/jsgc.h
@@ +1267,5 @@
> + * Multiprocessors" by Arora et al.
> + * The implementation was inspired by Vyacheslav Egorov from the Chromium 
> + * project.
> + */
> +#ifdef JS_THREADSAFE

As I wrote this is NOT a thread-safe dequee even on x86. So update the comments accordingly stating that we tolerate races and simplify the implementation by at least removing the TaggedTop class and encoding/decoding logic.

::: js/src/jsgcmark.cpp
@@ +691,1 @@
>      }

pushBottom does not use synchronization. As such it is not guaranteed that a write to item->start will propagate to another core when another thread calls popTopLocked the item from the stack. On x86 this does not happen as there it is guaranteed that a write is not reordered with any lock operation and popTopLocked does use locks. But on ARM this is possible and when this happens we would get a double free.

Suggestion: as we use locks in any case, implement separated class with an interface:

LargeItemWork {

    bool add(const LargeItem &item);
    bool get(LargeItem &item);

    /* Any better name ?. */
    bool getFromNonOwnerThread(LargeItem &item);
}

Non-owner thread can only call getFromNonOwnerThread. Initially all methods take a lock and "get" can be implemented via calling getFromNonOwnerThread. But we can optimize "add" and even "get" for x86 to take advantage of stronger coherency guarantees.

The "get" method internally under the lock checks for start + LARGE_OBJECT_CHUNK_SIZE <= end and either pops the item from the stack or update the stack top. The method returns the range to scan so the caller just scans [start, end).
Attachment #538418 - Flags: review?(igor) → review-
Attached patch patchSplinter Review
I fixed the comment and the largeMarkStack issue. It uses always a lock now. 
So this should be a working version now with the same 20% - 40% performance wins.

I don't have time for the other MarkStack changes right now because I am not in MV over the summer. We can either take this and optimize it later, wait until fall and I will finish it or somebody else can steal it.
Attachment #538418 - Attachment is obsolete: true
(In reply to comment #59)
> Created attachment 540193 [details] [diff] [review] [review]
> patch

On Monday during the day European time I will try to test the patch impact on a laptop where I can disable hyperthreading and the second core. If the patch would not regress noticeably such configuration, then I will do a review for nits etc, so we can land this.

If the regression would be substantial we would need a workaround to enable the parallel marking via a preference and/or only in debug builds. This way we can still land this and test for bugs while implementing later code that detects single-core non-hyperthreading CPU.
Thanks Igor!

There happened a strange crash during a try run:
http://tinderbox.mozilla.org/showlog.cgi?log=Try/1308371835.1308373055.21282.gz#err0

The crashing thread is not involved with the GC but it happened during the GC.
That try run may not be a crash, but a hang. According to bug 563012 comment 5, that's what crashinjectdll.dll is for: it kills a hung test. It's odd that thread 0 was doing a PR_Unlock call, though. Could there be some kind of livelock?
We don't use work-stealing for XML objects so maybe we can avoid some locking.
Igor are JSXMLArrays like xml_namespaces shared between XML objects?
(In reply to comment #63)
> We don't use work-stealing for XML objects so maybe we can avoid some
> locking.

True!

> Igor are JSXMLArrays like xml_namespaces shared between XML objects?

The arrays that are shrunk during the GC are not shared.
Comment on attachment 540193 [details] [diff] [review]
patch

Review of attachment 540193 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/src/jsweakmap.cpp
@@ +246,5 @@
>          if (IS_GC_MARKING_TRACER(trc) && map->empty()) {
>              trc->context->delete_(map);
>              obj->setPrivate(NULL);
>          } else {
>              map->trace(trc);

This will hold the lock when marking/tracing non-empty map leading to a deadlock.

Gal: could you comment why it is necessary to delete an empty map during the marking phase of the GC and not when the code removes all elements from it? It would be nice if we can replace the body of the method with:

if (ObjectValueMap *map = GetObjectMap(obj)) 
    map->trace(trc);
That was meant to be just a shortcut. If its empty, delete it. We can delete it later, should be fine as long you are sure that code gets called when the map is empty.
Comment on attachment 540193 [details] [diff] [review]
patch

Review of attachment 540193 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/src/jsgc.cpp
@@ +1869,5 @@
>  
>      if (IS_GC_MARKING_TRACER(trc)) {
>          GCMarker *gcmarker = static_cast<GCMarker *>(trc);
>          MarkWeakReferences(gcmarker);
>      }

Here is another bug: we should mark the weak references after stopping the parallel marking or we can miss some weak maps.

Also we need an extra lock at http://hg.mozilla.org/tracemonkey/file/0428dbdf3d58/js/src/jsweakmap.h#l108 that recently was changed to use IS_MARKING_TRACER.
Depends on: 666549
What happened to this bug? The blocking bugs all seem to be fixed, and the last one claims to be landed after this.
(In reply to comment #68)
> What happened to this bug? The blocking bugs all seem to be fixed, and the
> last one claims to be landed after this.

The patch should be 97% ready but I don't have the time to finish it now. I already tested it on try-server, various benchmarks and real workload on different platforms but this was about 2 months ago. The results were pretty good so far!

Another thing is that currently we have some GC instrumentation patches in the engine that make it impossible to perform the last tunings here. I also don't want to make big changes during crash investigation. Finding GC crashes is _very_ important and Bill should take his time there.

Maybe I will rebase it for FF9 but otherwise I will wait until October to finish it.
Another paper describing parallel marking:
Parallel garbage collection for shared memory multiprocessors
http://www.math.tau.ac.il/~shanir/publications/dfsz2001.pdf
(In reply to Gregor Wagner from comment #69)

> Maybe I will rebase it for FF9 but otherwise I will wait until October to
> finish it.

so is this finished now?
(In reply to Cyko_01 from comment #71)
> (In reply to Gregor Wagner from comment #69)
> 
> > Maybe I will rebase it for FF9 but otherwise I will wait until October to
> > finish it.
> 
> so is this finished now?

nope. working on some other stuff now :(
Whiteboard: [Snappy]
Snappy:P3 because incremental probably will solve most snappiness problems related to GC, on sites without a really heavy GC load.
Whiteboard: [Snappy] → [Snappy:P3]
blocking-kilimanjaro: --- → ?
David, why do you think this should block K9O, particularly given comment 73?
(In reply to Justin Lebar [:jlebar] from comment #74)
> David, why do you think this should block K9O, particularly given comment 73?

I don't positively think this should block k9o. I just went out and nominated every major piece of GC-related work I could remember or easily find, and this came up.

But, as you may have guessed, I didn't review comment 73 :-). It's a good observation and I'm not sure what the answer is yet. IGC will probably get us most of the way there, but I think that if we were left with 16ms pauses or so, cutting them in half could still be of value in reducing frame skips.
I don't completely agree with comment 73 :) 
On my low end netbook I see a fast and slow mode with IGC enabled. So for example without IGC I see a 100msec pause and with IGC I see a 500msec period where animations are very slow. So I don't think we can say everything is solved there.
Right now I also see a huge pause at the end where we probably reset the GC because we run OOM but I hope that's just a tuning issue.
Also on my regular laptop I see a huge memory regression when IGC is enabled. Cutting the time in half (or even more with more threads) we spend in IGC might help in the memory regression as well.
What sites are you talking about, Gregor?
(In reply to Bill McCloskey (:billm) from comment #77)
> What sites are you talking about, Gregor?

Is there a dedicated bug for tracking IGC performance?
So for the realtime raytracer I see a memory consumption of 770MB when disabled vs. 1.3GB when IGC enabled on my MBP. The animation is definitely smoother with IGC enabled and there is no slow part noticeable.

For my low-end netbook the animation during IGC becomes very slow. The same for the spinning balls from v8. There I see a few seconds high frame rate and a second or so slow frame rate.
blocking-kilimanjaro: ? → -
taking as per irc discussion with billm.
Assignee: anygregor → tschneidereit+bmo
I might or might not work on this in the short to medium term. For now, I'm putting it back into the pond so as not to deter anyone else from working on it.
Assignee: tschneidereit+bmo → general
Is this bug still valid?
Assignee: general → nobody
Blocks: 1262321
You need to log in before you can comment on or make changes to this bug.