Closed Bug 928050 Opened 11 years ago Closed 11 years ago

Remove worker pausing mechanism

Categories

(Core :: JavaScript Engine, defect)

x86
macOS
defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla28

People

(Reporter: bhackett1024, Assigned: bhackett1024)

References

Details

(Whiteboard: [qa-])

Attachments

(5 files, 3 obsolete files)

During GCs, we pause JS worker threads running concurrently (doing parsing, compression etc.) to avoid potential races with them.  This should be unnecessary though --- such threads either don't access VM state or have exclusive access to their zone, which won't be collected by the GC.  If the GC takes the exclusive access lock when using the shared runtime bits which are accessible to worker threads, then pausing workers during GC should be unnecessary.
Attached patch patch (obsolete) — Splinter Review
Assignee: nobody → bhackett1024
Attachment #818764 - Flags: review?(wmccloskey)
To avoid reentrant locking we can't do anything that might GC while holding the exclusive access lock.  A big source of such operations is during atomization.  This patch removes the possibility of GC'ing during atomization and does a fair amount of cleanup.  This could be changed so that during atomization we release the lock before doing something that might trigger a GC, but the cleanup here is kind of nice I guess.
Attachment #818766 - Flags: review?(wmccloskey)
Per the above, we hold the exclusive access lock while filling in the IonRuntime's stubs (they are allocated in the atoms zone) so shouldn't GC in this space.  This templates IonCode::New so that we only avoid GCs when allocating IonCode objects in this short span.
Attachment #818767 - Flags: review?(jdemooij)
Attached patch patch (obsolete) — Splinter Review
Remove some methods from ThreadSafeContext that can no longer be called race free if a GC is occurring concurrently, and add synchronization code so that threads do not try to get new arenas while a GC is in progress.
Attachment #818764 - Attachment is obsolete: true
Attachment #818764 - Flags: review?(wmccloskey)
Attachment #820465 - Flags: review?(wmccloskey)
This should be really helpful for <script async>-compiled asm.js: it's not uncommon to have a GC while the async compilation is in progress which currently blocks the main thread.
Comment on attachment 820465 [details] [diff] [review]
patch

Review of attachment 820465 [details] [diff] [review]:
-----------------------------------------------------------------

I think we could do the GC number stuff a little bit nicer. Basically, I'm thinking that we could add a gcNumber() method to the zone (or maybe the compartment if that's easier). The idea would be that we'd only increment the gcNumber for zones that are being collected. I don't want to ask you to change the incrementing though, since it looks like there are 3 or 4 places where we do it. For this bug, I think we should just have Zone::gcNumber() return the runtime's gcNumber if it's not an exclusive compartment, and otherwise it can return 0. Then we can change the places in this patch to use Zone::gcNumber for the zone in question. And please put a comment on Zone::gcNumber saying that it's a number that gets incremented whenever we collect the given zone (and maybe at other times too).

I'm also worried about lock ordering (for deadlock prevention). It seems like we have three locks in play: the GC lock, the exclusive access lock, and the worker thread state lock. Can we guarantee that they'll be acquired in some order? If so, we need to document the order somewhere and assert that we always acquire in the right order.

This looks good, but I want to know what you have to say about the ordering issue and you're not on IRC now, so I'm just setting feedback+.

::: js/src/gc/RootMarking.cpp
@@ +672,3 @@
>  #ifdef JS_ION
> +            jit::IonRuntime::Mark(trc);
> +        }

Please move the brace outside the #ifdef.

::: js/src/jsgc.cpp
@@ +2834,5 @@
>       * zones that are not being collected, we are not allowed to collect
>       * atoms. Otherwise, the non-collected zones could contain pointers
>       * to atoms that we would miss.
>       */
> +    if (rt->gcIsFull && !rt->keepAtoms()) {

The safety argument is a little tricky here. Can you please comment that:
1. rt->keepAtoms() can't change when we're in GC code because we can only start new threads from the main thread, which is doing GC work.
2. If rt->keepAtoms() changes between GC slices, then we'll cancel the incremental GC.

@@ +3372,5 @@
>  
>  static void
>  FindZoneGroups(JSRuntime *rt)
>  {
> +    AutoLockForExclusiveAccess lock(rt);

What's this for?

@@ +4133,5 @@
>  
>      runtime->gcNumber++;
>  
> +    // Threads other than the main thread can't trigger GC, and operate on
> +    // zones which will not be collected from while they are running.

Can you rephrase this to be of the form "It's okay if other threads have suppressGC set because..."?

::: js/src/vm/Runtime.h
@@ +1803,4 @@
>      }
>      ~AutoKeepAtoms() {
> +        if (pt->isMainThread())
> +            pt->runtimeFromMainThread()->keepAtoms_--;

Please assert it's > 0 before decrementing.
Attachment #820465 - Flags: review?(wmccloskey) → feedback+
Actually, this patch is necessary x2: having the main thread block on a worker thread creates a potential deadlock for bug 929236 where a worker thread waits on a runnable dispatched to the main thread (where various gunk is forced to run).

Brian: do you know if, after these patches, there are any other cases where the main thread blocks on worker threads?
Blocks: 929236
I was just experimenting with this patch applied and a sleep(10) inside the worker thread and I get an assertion when trying to merge the compilation results:
  https://pastebin.mozilla.org/3328204
I was able to reproduce this twice with a couple tries.  Just use any script async and put a sleep(10) somewhere in the parse path.  The main thread is still responsive so you can open/close a few tabs to trigger a gc.
Comment on attachment 818766 [details] [diff] [review]
don't gc during atomization

Review of attachment 818766 [details] [diff] [review]:
-----------------------------------------------------------------

Sorry, forgot about this one.
Attachment #818766 - Flags: review?(wmccloskey) → review+
(In reply to Luke Wagner [:luke] from comment #7)
> Actually, this patch is necessary x2: having the main thread block on a
> worker thread creates a potential deadlock for bug 929236 where a worker
> thread waits on a runnable dispatched to the main thread (where various gunk
> is forced to run).
> 
> Brian: do you know if, after these patches, there are any other cases where
> the main thread blocks on worker threads?

There are several places where the main thread may be forced to wait for a worker to finish --- canceling off thread Ion compilations, waiting for off thread compression to finish, and (I think) waiting for asm.js compilation tasks to finish.  I think that in places where worker threads are waiting on the main thread we should always make sure there is at least one unrestrained worker thread available to process jobs.
(In reply to Brian Hackett (:bhackett) from comment #10)
> There are several places where the main thread may be forced to wait for a
> worker to finish --- canceling off thread Ion compilations, waiting for off
> thread compression to finish, and (I think) waiting for asm.js compilation
> tasks to finish.  I think that in places where worker threads are waiting on
> the main thread we should always make sure there is at least one
> unrestrained worker thread available to process jobs.

Those should all be fine for the reason you mentioned and the fact that helperThreadCount() now either returns 0 or >=2.  What I was asking about is if there are any more places where the main thread indiscriminately waits on all worker threads to halt like the case this bug is removing.
(In reply to Luke Wagner [:luke] from comment #11)
> Those should all be fine for the reason you mentioned and the fact that
> helperThreadCount() now either returns 0 or >=2.  What I was asking about is
> if there are any more places where the main thread indiscriminately waits on
> all worker threads to halt like the case this bug is removing.

No, this is the only place where the main thread waits on all worker threads.
Attachment #818767 - Flags: review?(jdemooij) → review+
Attached patch updatedSplinter Review
This looks pretty good on try.  The problem with the isSameNonEmptySpan assertions was that AutoPrepareForTracing would copy freelist stuff for all zones, including those being manipulated off thread.  Fixing this also requires some ZoneIters to avoid inspecting zones being manipulated off thread.
Attachment #820465 - Attachment is obsolete: true
Attachment #824410 - Flags: review?(wmccloskey)
Sorry for the delay here. The canLock change is really nice. While I'm working my way through, can you answer a two questions?

1. Why is the operation callback lock acquired before the GC lock? It seems like the other way around would make more sense. I can't think of why we would take the GC lock while holding the operation callback lock, but maybe there's a case?

2. Why is the worker thread state lock used to control access to rt->heapState? The GC lock would make more sense. Were you trying to avoid reentrant locking issues with allocateFromArenaInline?
(In reply to Bill McCloskey (:billm) from comment #14)
> 1. Why is the operation callback lock acquired before the GC lock? It seems
> like the other way around would make more sense. I can't think of why we
> would take the GC lock while holding the operation callback lock, but maybe
> there's a case?

Ion needs to take the operation callback lock in CodeGenerator::link because it is allocating code in the Ion allocator, which is protected by that lock as it can be mprotected by other threads triggering a callback in Ion code.  This same call also allocates a GC thing, which could end up taking the GC lock to refill the IonCode freelist.

> 2. Why is the worker thread state lock used to control access to
> rt->heapState? The GC lock would make more sense. Were you trying to avoid
> reentrant locking issues with allocateFromArenaInline?

If the heap is busy when another thread tries to refill a freelist, it needs to block until the heap is no longer busy.  We don't have wait/notify plumbing available for the GC lock, but do for the worker thread state lock.  Since the only contention on heapState will involve worker threads this lock also seemed ok to use.
Attached patch atoms-zone-iter (obsolete) — Splinter Review
This patch makes it explicit whether we want to iterate over the atoms zone or not. I also added several calls to acquire the exclusive access lock.

- The ones for memory reporting we already discussed.
- One in TraceRuntime and another in StartVerifyPreBarriers. These functions call MarkRuntime with a JSTracer that's not a GC marker. MarkRuntime calls jit::JitRuntime::Mark(trc), which iterates over all IonCode cells in the atoms zone.
- On for JS_IterateCompartments because it seems better to include the atoms compartment in the iteration. We could probably fix this one if it's an issue, but it seems better to include it for an API function.

I haven't tested with your patch, Brian. I'm a little worried that the extra locks will break the lock ordering assertions, but hopefully they'll be okay.
Attachment #826139 - Flags: review?(bhackett1024)
Oh, I just noticed that you already took the exclusive access lock in MarkRuntime. I guess the ordering shouldn't be an issue then.
Comment on attachment 824410 [details] [diff] [review]
updated

Review of attachment 824410 [details] [diff] [review]:
-----------------------------------------------------------------

This looks good. Please remove all the canCollect stuff.

::: js/src/gc/RootMarking.cpp
@@ +667,5 @@
>              MarkScriptRoot(trc, &vec[i].script, "scriptAndCountsVector");
>      }
>  
> +    if (!rt->isBeingDestroyed() && !trc->runtime->isHeapMinorCollecting()) {
> +        AutoLockForExclusiveAccess lock(rt);

I think we should require callers of MarkRuntime to take the the exclusive access lock since callers need to call PrepareForTracing(WithAtoms), and that needs the lock anyway.

::: js/src/jsworkers.cpp
@@ +334,2 @@
>  
> +    threads = (WorkerThread*) runtime->calloc_(sizeof(WorkerThread) * numThreads);

While you're here, an you switch this to js_pod_calloc?

::: js/src/jsworkers.h
@@ +74,5 @@
>      /* Worklist for source compression worker threads. */
>      Vector<SourceCompressionTask *, 0, SystemAllocPolicy> compressionWorklist;
>  
> +    WorkerThreadState(JSRuntime *rt)
> +    {

Brace should be up one line.

::: js/src/vm/Runtime.cpp
@@ +331,5 @@
>      operationCallbackLock = PR_NewLock();
>      if (!operationCallbackLock)
>          return false;
> +
> +    gcLock = PR_NewLock();

I'm assuming that some code runs between here and js_InitGC that needs the GC lock. What is it?

@@ +841,5 @@
> +
> +#ifdef DEBUG
> +
> +bool
> +JSRuntime::canLock(RuntimeLock which)

Can you rename this to assertCanLock and have it return void?

@@ +844,5 @@
> +bool
> +JSRuntime::canLock(RuntimeLock which)
> +{
> +#ifdef JS_THREADSAFE
> +    switch (which) {

Maybe this is too clever, but you could implement this with less code by having the cases fall through. That would also be nice because they'd be required to be listed in the order of the lock order relation.

@@ +847,5 @@
> +#ifdef JS_THREADSAFE
> +    switch (which) {
> +      case OperationCallbackLock:
> +        // ExclusiveAccessLock -> OperationCallbackLock
> +        // WorkerThreadStateLock -> OperationCallbackLock

These comments are a little random. I think it would be better to have a larger comment in the header file about how you have to acquire them in a certain order, and then a smaller comment here saying that this function implements the rules in the header. It might also be nice to explain here why the order is the way it is. Your comment 15 was helpful.

::: js/src/vm/Runtime.h
@@ +720,5 @@
> +        ExclusiveAccessLock,
> +        GCLock,
> +        WorkerThreadStateLock
> +    };
> +    bool canLock(RuntimeLock which);

Can you write a comment here giving the ordering rules?
Attachment #824410 - Flags: review?(wmccloskey) → review+
Comment on attachment 826139 [details] [diff] [review]
atoms-zone-iter

Review of attachment 826139 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/src/gc/Zone.h
@@ +307,5 @@
>  namespace js {
>  
> +/*
> + * Using the atoms zone without holding the exclusive access lock is dangerous
> + * because worker threads may be using it simulataneously. Therefore, it's

typo
Attachment #826139 - Flags: review?(bhackett1024) → review+
I fixed a few problems with this patch and pushed it to try along with some other things. Some winxp jit-test is failing, so now I'm trying to figure out if it's caused by this patch or the other one I pushed with it.
Attachment #826139 - Attachment is obsolete: true
Attachment #827157 - Flags: review+
Gary, Christian, can you guys try fuzzing the patch in comment 22 on Linux? There's a WinXP failure that I can't reproduce, and I'm hoping fuzzing might expose something that I can reproduce on Linux.
Flags: needinfo?(gary)
Flags: needinfo?(choller)
Depends on: 935149
As a quick result, I haven't seen anything terrible happen in the few minutes of fuzzing with the comment 22 patch, but I will leave it running overnight and will report back if there is anything that pops up.
Flags: needinfo?(gary)
Comment on attachment 827157 [details] [diff] [review]
atoms-zone-iter v2

I didn't really find anything particularly bad here.
Attachment #827157 - Flags: feedback+
Arghh. I think I figured out the problem. I added a couple extra AutoLockForExclusiveAccess uses, but I didn't push on top of Brian's patches to avoid GC during atomization. The Windows failure (which I finally managed to reproduce) was due to some recursive locking.

I did another try push on top of Brian's patches to avoid GC during atomization and Ion stuff.
https://tbpl.mozilla.org/?tree=Try&rev=418d3162b610

I also filed bug 936681. If the assertion that failed had been printed in the tinderbox log, this would have been a lot easier to track down.
Blocks: 927204
I assume patch-fuzzing isn't required anymore as it's landing already? I didn't get earlier to this because I was still sorting out results for another request, but since gkw already tested this, it should be enough :)
Flags: needinfo?(choller)
Depends on: 937605
Maybe this has since been fixed, but I get a deadlock (about 50% of the time) on unrealengine.com/html5 with only the combined patch (based on 99c73ffc83d4) and Bill's atom-zone-iter2 patch applied.  Looks like the main thread is waiting on a lock in AutoLockForExclusiveAccess and a JS worker thread is calling AtomizeChars which, even with NoGC, ends up in refillFreeList which waits on the WorkerThreadState::PRODUCER condvar in the new "If we're off the main thread" block.

Full stacks:
https://pastebin.mozilla.org/3598152
To wit: rt->atomsCompartment()->zone()->isCollecting() is false when the above deadlock occurs.  I don't know if this is necessarily the case when the worker thread is Atomizing, but for kicks I moved the AutoLockForExclusiveAccess to inside the then-branch (right before MarkAtoms)... then I hit a deadlock in the other AutoLockForExclusiveAccess in FindZoneGroups.  If I put *that* in a Maybe and only construct() if rt->atomsCompartment()->zone()->isCollecting(), then I can't repro a deadlock after 10 tries.  But again, I don't know what the invariants are here.
Hmm, thanks for the report.  I don't think this will be a problem in the patch I'm trying to get through try.  Bill's patch moved some AutoLockForExclusiveAccess around AutoPrepareForTracing which can do full GC slices, so to avoid recursively taking that lock we pretty much need to already have exclusive access throughout any GC session.  The worker is waiting for the GC session to finish, and if the worker holds the exclusive access lock there will necessarily not be any GC session in progress.  We don't allocate GC things while holding the other per-runtime locks (I'll add asserts for this).
Great!  If you throw up a combined patch, I'll apply it and do some testing.
This try patch is looking pretty good:

https://tbpl.mozilla.org/?tree=Try&rev=96df11692a61
Great!  Seems to be fixed.
Depends on: 939464
Depends on: 939504
Depends on: 939741
\o/

Can we remove [leave open]?
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Whiteboard: [leave open]
Target Milestone: --- → mozilla28
Whiteboard: [qa-]
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: