Closed Bug 1097864 Opened 10 years ago Closed 10 years ago

Intermittent assertion failure: al.isCursorAtEnd()

Categories

(Core :: JavaScript: GC, defect)

x86_64
All
defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla36

People

(Reporter: shu, Assigned: jonco)

Details

Attachments

(2 files)

Running jit-tests with a enable-debug enable-optimize shell on linux64, on m-c as of 688f821edcd4, with JS_GC_ZEAL=14, very intermittently has failures with the assertion:

Assertion failure: al.isCursorAtEnd(), at /home/shu/moz/central-hg/js/src/jsgc.cpp:1983
I see this also on Windows, but oddly when I catch it in a debugger it actually breaks on [1], so I think the --enable-optimize build is doing some weird assertion coalescing. However, I just got this in a --disable-debug build as well, on the expected line ([2]).

The stack shows that we're getting there from GCRuntime::refillFreeListFromMainThread. I can think of two things that might be happening here:

(1) Compacting GC is modifying the ArenaList outside the GC lock somehow.

(2) Compacting GC holds the GC lock and is busy modifying the ArenaList on a background thread, but !cx->runtime()->isHeapBusy() and GCRuntime::refillFreeListFromMainThread doesn't do anything to wait for the helper threads. ArenaLists::allocateFromArena tries to get the next non-full Arena, finds there are none, and takes the GC lock. The helper thread finishes what it's doing and relinquishes the GC lock, and allocateFromArena finds that there are suddenly non-full Arenas after the cursor.

Tests I saw failing in an --enable-debug --enable-optimize build:
  js/src/jit-test/tests/jaeger/bug658579.js
  js/src/jit-test/tests/debug/Debugger-onNewGlobalObject-09.js

In an --enable-debug --disable-optimize build I saw the following test fail:
  js/src/jit-test/tests/basic/bug842940.js

This can probably happen anywhere, and the timing might be pretty hardware-specific, but maybe that will help with reproducing.

[1] http://hg.mozilla.org/mozilla-central/annotate/688f821edcd4/js/src/jsgc.cpp#l1963
[2] http://hg.mozilla.org/mozilla-central/annotate/688f821edcd4/js/src/jsgc.cpp#l1983
Assignee: nobody → jcoppeard
Well compacting GC does modify the arena lists without taking the lock so let's change that.  I can't exactly see the mechanism by which these assertions could fail but let's see what effect this patch has.

It's worth pointing out that the normal GC does this too when queuing objects for sweeping.  This happens on the main thread when there's nothing accessing these on the helper threads, which is also the case for compacting.
Attachment #8522920 - Flags: review?(terrence)
Comment on attachment 8522920 [details] [diff] [review]
lock-while-manipulating-arena-lists

Review of attachment 8522920 [details] [diff] [review]:
-----------------------------------------------------------------

Makes sense.

How confident are we that the locking in refillFreeListsOffMainThread is actually doing anything useful?
Attachment #8522920 - Flags: review?(terrence) → review+
Disclaimer: I don't know how CGC currently works, so this might not be helpful.

So for sweeping, we take all the Arenas in each ArenaLists and put them in arenaListsToSweep, setting arenaLists to an empty list so that we don't try to allocate out of them during sweeping. Then, we do one of two things:

(1) Finalize them in the foreground, creating a reference to already-finalized arenas in incrementalSweptArenas if our budget runs out. Once finalization for an AllocKind completes, we merge the runtime-wide incrementalSweepList back in with any Arenas that were allocated during finalization. This should all happen on the same thread, so no locking is required (although allocateFromArena will probably be forced to take the lock anyway since there will never be any non-full Arenas available during finalization).

(2) Finalize them in the background, all in one go, without taking the lock. Once that completes, we take the lock and merge the swept Arenas back in with any Arenas that were allocated during finalization. While BFState == BFS_RUN, allocateFromArena will always take the lock. There wouldn't be any non-full Arenas for it to allocate from anyway, so this is just for safety around the ArenaList merging.

So for any part of CGC that doesn't have to allocate new Arenas from existing chunks, I think you could just move the current arenaLists out to a arenaListsToCompact, perform the compaction on a helper thread without taking the lock, then take the lock only to merge them back together with any Arenas that were allocated in the meantime.

If you do need to mess with chunks, i.e. moving Arenas from almost-empty chunks into almost-full ones, you would have to take the GC lock while allocating new arenas from existing chunks for your use. This shouldn't race with allocateFromArena, since that will always take the lock before allocating a new Arena from a chunk.
Keywords: leave-open
I found I can reproduce this by running the jit tests on a build without compacting GC when using zeal mode 14 (shrinking GC).  Approx 50 tests fail like this.

Annoyingly the tests pass when run individually though.
Ah, good point - I didn't enable compacting GC in my build either. Talking about this with Shu on IRC I suggested running this with rr to see if you can backtrack to where the change happened. I don't know how easy or hard rr is to set up though.
I ran the entire suite with --tbpl under rr with --enable-debug --disable-optimize, not one test crashed. :(
The problem is that although we take the GC lock in ArenaLists::allocateFromArena() it can be released again by GCRuntime::maybeAllocTriggerZoneGC() when it triggers a GC.

The fix for this is simple.  Since the latter method doesn't actually do very much when it triggers a GC, we can continue to hold the lock there.

To make things like this less likely to happen in future AutoUnlockGC now must be constructed with a non-const reference to an AutoLockGC.  The AutoLockGC keeps track of whether it has ever been unlocked so that you can assert that it hasn't.

Also since AutoUnlockGC needs a non-const reference then any methods that take a const reference to an AutoLockGC can't unlock it.
Attachment #8525340 - Flags: review?(terrence)
Great find! I'm glad my concerns turned out to be unfounded.
Nice find.
Comment on attachment 8525340 [details] [diff] [review]
bug1097864-locking

Review of attachment 8525340 [details] [diff] [review]:
-----------------------------------------------------------------

Very nice! I strongly considered doing basically this last week, but figured I was just being overly paranoid! I guess I should listen to those urges. One thing I didn't think of was to use non-const as "unlockable"; that's a great use.

::: js/src/vm/Runtime.h
@@ +1498,1 @@
>  class AutoLockGC

I think we should probably make this MOZ_STACK_CLASS too.
Attachment #8525340 - Flags: review?(terrence) → review+
https://hg.mozilla.org/mozilla-central/rev/70ffdede9e2b
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla36
You need to log in before you can comment on or make changes to this bug.