Bug 652416 - race when returning GC arenas from the background finalizer to GC arena allocator
Status: RESOLVED FIXED
Whiteboard: fixed-in-tracemonkey
Product: Core
Classification: Components
Component: JavaScript Engine
Version: unspecified
Platform: x86 Mac OS X
Importance: -- normal
Target Milestone: ---
Assigned To: Igor Bukanov
Duplicates: 589771
Blocks: 627200
Reported: 2011-04-24 05:31 PDT by Igor Bukanov
Modified: 2011-06-15 08:12 PDT
CC: 16 users


Attachments
v1 (42.15 KB, patch): 2011-04-24 15:44 PDT, Igor Bukanov, no flags
v2 (51.87 KB, patch): 2011-04-25 16:05 PDT, Igor Bukanov, no flags
v3 (55.86 KB, patch): 2011-04-29 02:41 PDT, Igor Bukanov, no flags
v4 (61.83 KB, patch): 2011-04-30 16:01 PDT, Igor Bukanov, no flags
v5 (58.26 KB, patch): 2011-05-04 08:09 PDT, Igor Bukanov, no flags
v5 for real (61.28 KB, patch): 2011-05-04 08:11 PDT, Igor Bukanov, anygregor: review+

Description Igor Bukanov 2011-04-24 05:31:52 PDT
When the background finalization introduced in bug 627200 returns arenas to the GC arena allocator, it uses the volatile ArenaList::hasToBeFinalized field to signal that the arena list is ready to be scanned for arenas with free things.

But such a volatile variable does not ensure that other writes are consistent. In particular, the CPU that allocates things may not see the writes that precede the code setting ArenaList::hasToBeFinalized to false on the background thread.
Comment 1 Igor Bukanov 2011-04-24 05:48:19 PDT
I plan to fix this by appending the finalized arenas to the ArenaList under the GC lock. It increases the number of times RefillTypedFreeList takes this lock by one per arena list from which we allocate things between GC runs, but I suppose this would not be noticeable.
Comment 2 Igor Bukanov 2011-04-24 09:03:05 PDT
There is another race in the implementation. Currently the GC waits for the background finalization to end only before it starts the sweeping. So the marking and finalization may run in parallel after the GC has cleared the mark bits.
Comment 3 Igor Bukanov 2011-04-24 15:44:28 PDT
Created attachment 528028 [details] [diff] [review]
v1

Work in progress that passes shell tests
Comment 4 Gregor Wagner [:gwagner] 2011-04-24 17:06:20 PDT
I am trying to find out what the actual bug is that you want to fix.
So there is the hasToBeFinalized field that indicates whether we have to allocate a new arena or can try to allocate from the already existing arena list. Arenas that are freed on the background thread can be allocated right away; they are not "protected" by this variable.
My understanding is that the change of this variable might not be visible right away, but it will be visible once the thread flushes the cache, and this will eventually happen.
So do you think the bug is 1) that the state change is not visible right away, or 2) that because the state change is not visible right away we can get a race condition, or 3) that the other threads will never see the state change?

For comment 2)
Where do you see that? We usually wait in js_GC for the sweeping to be done. And for gcPoke we wait again right before we clear the marking bits. I was very conservative with the synchronization, and I guess we could get rid of some of it because I don't do delayed sweeping during a LAST_CONTEXT GC, where we actually pay attention to gcPoke.
Comment 5 Bill McCloskey (:billm) 2011-04-24 18:01:08 PDT
(In reply to comment #4)
> I am trying to find out what the actual bug is you want to fix.
> So there is the hasToBeFinalized field that indicates whether we have to
> allocate a new arena or we can try to allocate from the already existing
> arenalist. Arenas that are freed on the background thread can be allocated
> right away and they are not "protected" by this variable.
> My understanding is that the change of this variable might not be visible right
> away but it will be visible once the thread flushes the cache and this will
> eventually happen.
> So you think the bug is 1)that the state change is not visible right away or
> 2)due to the fact that the state change is not visible right away we can get a
> race condition or 3)the other threads will never see the state change?

It sounds like Igor is worried about certain architectures where writes made by CPU A can be seen out of order by CPU B. The Alpha was somewhat notorious for this. In this case, the main thread might see hasToBeFinalized get set to zero before it actually sees the writes made by the finalization process (such as setting up the freelist). This would be bad. I don't think this is an issue on x86, but we should probably fix it if we can, I guess.
Comment 6 Andreas Gal :gal 2011-04-24 18:03:23 PDT
We have a mips port, so yeah, we should take a patch for this if anyone cares, but I wouldn't actually go and spend time on this without indications that this is a problem on platforms we actively support.
Comment 7 Boris Zbarsky [:bz] 2011-04-24 18:52:13 PDT
Do we have an ia64 port?  ;)
Comment 8 Bill McCloskey (:billm) 2011-04-24 19:20:45 PDT
Actually, I found this article while Googling. It seems to suggest that ARM has this problem. So we should definitely fix it.
  http://wanderingcoder.net/2011/04/01/arm-memory-ordering/
Comment 9 Andreas Gal :gal 2011-04-24 19:31:22 PDT
I think we went through this with the operation callback. We use

JS_ATOMIC_SET(&interruptFlags, 1)

which should enforce memory ordering.
Comment 10 Bill McCloskey (:billm) 2011-04-24 23:40:48 PDT
(In reply to comment #9)
> I think we went through this with the operation callback. We use
> 
> JS_ATOMIC_SET(&interruptFlags, 1)
> 
> which should enforce memory ordering.

I'm not sure this is correct. In gcc, JS_ATOMIC_SET gets converted to __sync_lock_test_and_set. The gcc docs say that this is not a full barrier operation--only an acquire barrier, which is insufficient for our purposes.
  http://gcc.gnu.org/onlinedocs/gcc-4.1.2/gcc/Atomic-Builtins.html

And really we need two barriers: one where the flag is written (to ensure that no writes get moved past it) and one where the flag is read (to make sure that no other reads move before it). GCC only seems to support __sync_synchronize(), which is a full barrier. We could use it in both places.

However, it sounds like Igor's solution just uses a lock, so then we wouldn't have to worry about this. But we might want to look at the operation callback again. Is there a bug where this was discussed?
Comment 11 Igor Bukanov 2011-04-25 02:36:32 PDT
(In reply to comment #4)
> For comment 2)
> Where do you see that? We usually wait in js_GC for the sweeping do be done.
> And for gcPoke we wait again right before we clear the marking bits. I was very
> conservative with the synchronization and I guess we could get rid of some
> synchronization because I don't do delayed sweeping during LAST_CONTEXT GC
> where we actually pay attention to gcPoke.

You are right, there is no bug in that.
Comment 12 Igor Bukanov 2011-04-25 02:58:11 PDT
(In reply to comment #10)
> But we might want to look at the operation callback
> again. Is there a bug where this was discussed?

There is no problem with the operation callback, as that code does not depend on any variables written besides the callback flag itself. This is very different from the background finalization, where we have to ensure that all writes to the GC structures are finished before we can read them.
Comment 13 Igor Bukanov 2011-04-25 05:01:22 PDT
(In reply to comment #8)
> Actually, I found this article while Googling. It seems to suggest that ARM has
> this problem. So we should definitely fix it.
>   http://wanderingcoder.net/2011/04/01/arm-memory-ordering/

We have the problem even on x86. The current implementation depends on all reads from the allocation thread happening after the background thread finishes all the writes that happen before it clears the ArenaList::hasToBeFinalized flag. But according to that article:

---
Speaking of which, on the other hand x86 guarantees that writes won’t be reordered, that architecture is referred to as strongly ordered. This is not to say it doesn’t do any reordering, for instance reads are allowed to happen ahead of writes that come “before” them; this breaks a few algorithms like Peterson’s algorithm. 
---
Comment 14 Gregor Wagner [:gwagner] 2011-04-25 08:13:16 PDT
The GCC manual mentions "asm volatile ("" : : : "memory");" as a solution.
Comment 15 Igor Bukanov 2011-04-25 10:08:05 PDT
(In reply to comment #14)
> The GCC manual mentions "asm volatile ("" : : : "memory");" as a solution.

From http://en.wikipedia.org/wiki/Memory_ordering#Compiler_memory_barrier :

---
These barriers prevent a compiler from reordering instructions, they do not prevent reordering by CPU.
---
Comment 16 Igor Bukanov 2011-04-25 16:05:25 PDT
Created attachment 528209 [details] [diff] [review]
v2

This passes the try server.

Changes:

1. The background thread puts the finalized list into the separate ArenaList::finalizedInBackground field under the GC lock. The allocator checks the field and, when it is set, appends the list to the arena list under the lock. In principle the lock can be replaced by a write barrier in the background finalization and a read barrier during the allocation, but that is for another bug.

2. ArenaList::cursor becomes a pointer to an arena pointer, instead of just an arena pointer. This way the finalized list can always be appended to the main list and there is no need to keep at least one arena in the list.

3. The patch copies parts of the patch from bug 589771 (which was never landed, as it rotted too much before it could get landing approval) to simplify handling of the background finalization during the GC and the stats collection.

The patch is based on the patch from bug 605029 for simplicity.
Comment 17 Igor Bukanov 2011-04-29 02:28:38 PDT
To Gregor: 

With the patch, compared with TM tip, I see a 0.8% regression in V8, with a slight win in splay and losses in earley-boyer (1.5%) and regexp (0.7%). The sunspider results are not affected.

Now, since the patch takes the GC lock when returning finalized arenas to the ArenaList, the chunk lock can be removed. So if I use the GC lock when releasing the chunks from the background finalization, then V8 shows gains in all the tests compared with the patch that still uses the chunk locks. The overall regression in V8 becomes insignificant, with the win in splay becoming 1.5%. Sunspider shows a 0.4% win compared to the current TM tip.

In bug 627200 comment 1 you mentioned that the patch has not affected sunspider, but have you seen regressions in V8 due to the added locking?
Comment 18 Igor Bukanov 2011-04-29 02:41:06 PDT
Created attachment 529066 [details] [diff] [review]
v3

The new version removes the per-chunk locks to offset benchmark losses during allocation due to the need to take the GC lock. With this change v8 no longer regresses and sunspider shows a 0.4% win.
Comment 19 Gregor Wagner [:gwagner] 2011-04-29 09:10:30 PDT
(In reply to comment #17)
> To Gregor: 
> 
> In the bug 627200 comment 1 you mentioned that the patch has not affected
> sunspider, but have you seen regressions in V8 due to the added locking?

I also saw small "unstable" regressions that were present one day but gone the next. My guess is that the benchmark suite has fixed GC points between benchmarks. With background finalization this means that right when a benchmark starts we take the path with the locking and maybe new arena allocation. So the "regression" depends on the thread scheduling.


BTW: Please stop combining many different fixes into one patch. The time for reviewing increases exponentially.
Comment 20 Gregor Wagner [:gwagner] 2011-04-29 14:35:55 PDT
Puh, is there really no other way than splitting the arena lists? What was wrong with the lock idea you had?

I think the code becomes very error-prone because we always have to maintain 2 separate arena lists now, and there is no easy way to just iterate over the arena list because we always have to check whether the second one is around.
Comment 21 Igor Bukanov 2011-04-29 15:09:26 PDT
(In reply to comment #20)
> Puh is there really no other way than splitting the arenalists?

Yes, that was not a good solution. I will update the patch without the split.

The observation is that the arena list can be in 3 states: waiting for finalization, just-finalized, and ready. That can be encoded as a volatile enum. When the allocation thread sees either the waiting or the just-finalized state, it takes the GC lock. If under the lock the state is still waiting, it allocates a new arena as currently.

If the state is just-finalized, it means that the background thread has finished finalizing the arenas and updated the tail of the list with the finalized arenas under the GC lock at some point before. In that case the allocation thread transitions the state into ready. In the ready state the allocation thread proceeds to check the arenas under ArenaList::cursor for free things without any locks.
Comment 22 Gregor Wagner [:gwagner] 2011-04-29 16:03:54 PDT
(In reply to comment #21)
> (In reply to comment #20)
> > Puh is there really no other way than splitting the arenalists?
> 
> Yes, that was not the good solution. I will update the patch without the split. 
> 
> The observation is that the arena list can be in 3 states. Waiting for
> finalization, just-finalized and ready. That can be encoded as a volatile enum.
> When the allocation thread see either waiting or just-finalized state, it takes
> the GC lock. If under the lock the state is still waiting, it would allocate a
> new arena as currently.
> 

Yeah it took me some time to figure out your implicit state-machine. An enum sounds much better than the 
reinterpret_cast<ArenaHeader *>(sizeof(ArenaHeader));
Comment 23 Igor Bukanov 2011-04-30 16:01:22 PDT
Created attachment 529319 [details] [diff] [review]
v4

The new version brings more comments about the lock policy regarding the arena list access.

The patch moves all the logic related to choosing an arena with a non-empty free list into ArenaList::getNextWithFreeList. Initially I tried to avoid goto there, but that resulted in rather unclear nested loops. I guess management of a 3-state variable accessed outside and inside the lock leads to a mess in any case.

With the patch the GC lock protects access to both the head and the cursor of the arena list, not just the cursor. Even with using ArenaHeader **, not ArenaHeader *, for the cursor, inserting a newly allocated arena into the empty list requires modifying both the head and the cursor. As a result ArenaList::getNextWithFreeList must keep the GC lock during the insertion to prevent the background thread from manipulating the cursor while the main thread puts the first arena into the list. But that just adds a few CPU instructions to the duration of the lock.
Comment 24 Gregor Wagner [:gwagner] 2011-05-03 13:07:42 PDT
I walked through the code a first time and it looks good but I need more
time for a detailed review.
I have a few nits, questions so far:


> #ifdef JS_GCMETER
>-# define METER(x)               ((void) (x))
>+# define METER(x)               x

Why this change?



> 
>-ArenaList *
>-GetFinalizableArenaList(JSCompartment *c, unsigned thingKind) {
>-    JS_ASSERT(thingKind < FINALIZE_LIMIT);
>-    return &c->arenas[thingKind];
>+template <typename T>
>+inline ArenaHeader *
>+ArenaList::getNextWithFreeList(JSContext *cx, unsigned thingKind)
>+{
>+    /*
>+     * Search the list for arenas with free things unless the
>+     * background finalization still runs.
>+     */
>+#ifdef JS_THREADSAFE
>+    if (backgroundFinalizeState == BFS_DONE) {
>+      check_arena_list:
>+#endif
>+        while (ArenaHeader *aheader = *cursor) {
>+            cursor = &aheader->next;
>+            if (aheader->freeList)
>+                return aheader;
>+        }
>+#ifdef JS_THREADSAFE
>+    }
>+
>+    AutoLockGC lock(cx->runtime);
>+
>+  check_state:
>+    if (backgroundFinalizeState == BFS_JUST_FINISHED) {
>+        /*
>+         * The finalization has finished before we took the lock, check
>+         * the arena list again.
>+         */
>+        backgroundFinalizeState = BFS_DONE;
>+        goto check_arena_list;
>+    }
>+#endif

Couldn't you end up in a situation where you get the GC lock,
goto check_arena_list, and no arena has free things so you end up 
getting the GC lock again?

>+
>+    if (ArenaHeader *aheader = AllocateArena<T>(cx, thingKind)) {
>+        /*
>+         * Add arenas to the head of the list before the cursor so they are
>+         * not checked for the free things again.
>+         */
>+        aheader->next = head;
>+        if (cursor == &head)
>+            cursor = &aheader->next;
>+        head = aheader;
>+        return aheader;
>+    }
>+
>+#ifdef JS_THREADSAFE
>+    /* We don't release the lock when allocating the arenas. */

Remove "the" before arenas.


>+    }
>+
>+    /*
>+     * In stats we should not reflect newly the arenas allocated after the GC
>+     * has finished. So we do not add to livearenas the arenas from al->head.
>+     */

In stats we should not reflect newly allocated arenas



> 
> /* The arenas in a list have uniform kind. */
>-struct ArenaList {
>-    ArenaHeader           *head;          /* list start */
>-    ArenaHeader           *cursor;        /* arena with free things */

arena with possible free things
Comment 25 Igor Bukanov 2011-05-03 13:17:36 PDT
(In reply to comment #24)
> > #ifdef JS_GCMETER
> >-# define METER(x)               ((void) (x))
> >+# define METER(x)               x
> 
> Why this change?

In an initial version of the patch I used that to write METER(uint32 x = ); But this is not necessary so I will remove the change.

> Couldn't you end up in a situation where you get the GC lock,
> goto check_arena_list, and no arena has free things so you end up 
> getting the GC lock again?

Yes, this is possible. But arena lists that contain just full arenas are rare. So the preference here is to release the lock and walk the list outside it, to minimize the amount of time the lock is held.
Comment 26 Gregor Wagner [:gwagner] 2011-05-03 13:25:00 PDT
(In reply to comment #25)
> 
> > Couldn't you end up in a situation where you get the GC lock,
> > goto check_arena_list, and no arena has free things so you end up 
> > getting the GC lock again?
> 
> Yes, this is possible. But free lists that contain just full arenas are rare.
> So the preference here to release the lock and walk the list outside it to
> minimize the amount of time the lock is held.

Is this allowed?
From MDC:
PR_Lock is not reentrant. Calling it twice on the same thread results in undefined behavior.
Comment 27 Igor Bukanov 2011-05-03 14:02:50 PDT
(In reply to comment #26)
> Is this allowed?

Hm, but in C++ a goto out of a variable's scope calls the destructor. For example, consider the following C++ program:

#include <stdio.h>

struct A {
    A() { printf("Construct\n"); }
    ~A() { printf("Destruct\n"); } 
};


int main()
{
    int i = 0;
  label:
    A a;
    if (!i) {
        ++i;
        goto label;
    }
    return 0;
}

With GCC it prints:

Construct
Destruct
Construct
Destruct

Or are there bugs in some compilers that we use that do not call the destructor?
Comment 28 Gregor Wagner [:gwagner] 2011-05-03 14:35:01 PDT
(In reply to comment #27)
> (In reply to comment #26)
> > Is this allowed?
> 
> Or are there bugs in some compilers that we use that do not call the
> destructor?

Yes, you are right. I had never used such code and it looked strange.
Such a bug would be almost impossible to find. Can we add an assert?
All the ifdefs and gotos are not very helpful either in this particular function.
I know that you tried to avoid them, but maybe you can look over it again to see if we can make it more readable :)
Comment 29 Igor Bukanov 2011-05-04 08:09:39 PDT
Created attachment 530012 [details] [diff] [review]
v5

In the new version I split getNextWithFreeList into getArenaWithFreeList and searchForFreeArena. The code no longer tries to share the implementation between the thread-safe and single-threaded versions. That allowed keeping just a single goto, replacing the rest of the control flow with a single loop. In addition, the new version avoids the TriggerGC call when it retries the allocation after the finalization is done.
Comment 30 Igor Bukanov 2011-05-04 08:11:27 PDT
Created attachment 530017 [details] [diff] [review]
v5 for real

Prev patch was wrong
Comment 31 Gregor Wagner [:gwagner] 2011-05-05 15:01:13 PDT
Comment on attachment 530017 [details] [diff] [review]
v5 for real

> 
> static inline bool
> CanBeFinalizedInBackground(gc::FinalizeKind kind, Class *clasp)
> {
>+#ifdef JS_THREADSAFE
>     JS_ASSERT(kind <= gc::FINALIZE_OBJECT_LAST);
>     /* If the class has no finalizer or a finalizer that is safe to call on
>      * a different thread, we change the finalize kind. For example,
>      * FINALIZE_OBJECT0 calls the finalizer on the main thread,
>      * FINALIZE_OBJECT0_BACKGROUND calls the finalizer on the gcHelperThread.
>      * kind % 2 prevents from recursivly incrementing the finalize kind because
>      * we can call NewObject with a background finalize kind.
>      */
>     if (kind % 2 == 0 && (!clasp->finalize || clasp->flags & JSCLASS_CONCURRENT_FINALIZER))
>         return true;
>+#endif
>     return false;
> }

nice!

Thx!
Comment 32 Chris Leary [:cdleary] (not checking bugmail) 2011-05-10 15:10:55 PDT
cdleary-bot mozilla-central merge info:
http://hg.mozilla.org/mozilla-central/rev/b892a04bedf7
Note: not marking as fixed because fixed-in-tracemonkey is not present on the whiteboard.
Comment 33 Igor Bukanov 2011-05-10 15:31:41 PDT
I forgot to mark the bug as fixed-in-tracemonkey:

http://hg.mozilla.org/tracemonkey/rev/b892a04bedf7
Comment 34 Igor Bukanov 2011-06-15 08:12:08 PDT
*** Bug 589771 has been marked as a duplicate of this bug. ***
