Closed Bug 928050 Opened 11 years ago Closed 11 years ago

Remove worker pausing mechanism

Tracking

()

Status:

RESOLVED FIXED

Milestone:

mozilla28

People

(Reporter: bhackett1024, Assigned: bhackett1024)

References

Details

(Whiteboard: [qa-])

Attachments

(5 files, 3 obsolete files)

patch 11 years ago Brian Hackett [Laid off!] 37.08 KB, patch		Details \| Diff \| Splinter Review
don't gc during atomization 11 years ago Brian Hackett [Laid off!] 53.25 KB, patch	billm : review+	Details \| Diff \| Splinter Review
don't gc when initializing IonRuntime 11 years ago Brian Hackett [Laid off!] 23.98 KB, patch	jandem : review+	Details \| Diff \| Splinter Review
patch 11 years ago Brian Hackett [Laid off!] 46.62 KB, patch	billm : feedback+	Details \| Diff \| Splinter Review
updated 11 years ago Brian Hackett [Laid off!] 72.98 KB, patch	billm : review+	Details \| Diff \| Splinter Review
rolled up patch (99c73ffc83d4) 11 years ago Brian Hackett [Laid off!] 153.54 KB, patch		Details \| Diff \| Splinter Review
atoms-zone-iter 11 years ago Bill McCloskey [inactive unless it's an emergency] (:billm) 47.28 KB, patch	bhackett1024 : review+	Details \| Diff \| Splinter Review
atoms-zone-iter v2 11 years ago Bill McCloskey [inactive unless it's an emergency] (:billm) 49.10 KB, patch	billm : review+ gkw : feedback+	Details \| Diff \| Splinter Review

Brian Hackett [Laid off!]

Assignee

Description

•

11 years ago

During GCs, we pause JS worker threads running concurrently (doing parsing, compression etc.) to avoid potential races with them.  This should be unnecessary though --- such threads either don't access VM state or have exclusive access to their zone, which won't be collected by the GC.  If the GC takes the exclusive access lock when using the shared runtime bits which are accessible to worker threads, then pausing workers during GC should be unnecessary.

Brian Hackett [Laid off!]

Assignee

Comment 1

•

11 years ago

Attached patch patch (obsolete) — Details — Splinter Review

Assignee: nobody → bhackett1024

Attachment #818764 - Flags: review?(wmccloskey)

Brian Hackett [Laid off!]

Assignee

Comment 2

•

11 years ago

Attached patch don't gc during atomization — Details — Splinter Review

To avoid reentrant locking we can't do anything that might GC while holding the exclusive access lock.  A big source of such operations is during atomization.  This patch removes the possibility of GC'ing during atomization and does a fair amount of cleanup.  This could be changed so that during atomization we release the lock before doing something that might trigger a GC, but the cleanup here is kind of nice I guess.

Attachment #818766 - Flags: review?(wmccloskey)

Brian Hackett [Laid off!]

Assignee

Comment 3

•

11 years ago

Attached patch don't gc when initializing IonRuntime — Details — Splinter Review

Per the above, we hold the exclusive access lock while filling in the IonRuntime's stubs (they are allocated in the atoms zone) so shouldn't GC in this space.  This templates IonCode::New so that we only avoid GCs when allocating IonCode objects in this short span.

Attachment #818767 - Flags: review?(jdemooij)

Brian Hackett [Laid off!]

Assignee

Comment 4

•

11 years ago

Attached patch patch (obsolete) — Details — Splinter Review

Remove some methods from ThreadSafeContext that can no longer be called race free if a GC is occurring concurrently, and add synchronization code so that threads do not try to get new arenas while a GC is in progress.

Attachment #818764 - Attachment is obsolete: true

Attachment #818764 - Flags: review?(wmccloskey)

Attachment #820465 - Flags: review?(wmccloskey)

Luke Wagner [:luke]

Comment 5

•

11 years ago

This should be really helpful for <script async>-compiled asm.js: it's not uncommon to have a GC while the async compilation is in progress which currently blocks the main thread.

Bill McCloskey [inactive unless it's an emergency] (:billm)

Comment 6

•

11 years ago

Comment on attachment 820465 [details] [diff] [review]
patch

Review of attachment 820465 [details] [diff] [review]:
-----------------------------------------------------------------

I think we could do the GC number stuff a little bit nicer. Basically, I'm thinking that we could add a gcNumber() method to the zone (or maybe the compartment if that's easier). The idea would be that we'd only increment the gcNumber for zones that are being collected. I don't want to ask you to change the incrementing though, since it looks like there are 3 or 4 places where we do it. For this bug, I think we should just have Zone::gcNumber() return the runtime's gcNumber if it's not an exclusive compartment, and otherwise it can return 0. Then we can change the places in this patch to use Zone::gcNumber for the zone in question. And please put a comment on Zone::gcNumber saying that it's a number that gets incremented whenever we collect the given zone (and maybe at other times too).

I'm also worried about lock ordering (for deadlock prevention). It seems like we have three locks in play: the GC lock, the exclusive access lock, and the worker thread state lock. Can we guarantee that they'll be acquired in some order? If so, we need to document the order somewhere and assert that we always acquire in the right order.

This looks good, but I want to know what you have to say about the ordering issue and you're not on IRC now, so I'm just setting feedback+.

::: js/src/gc/RootMarking.cpp
@@ +672,3 @@
>  #ifdef JS_ION
> +            jit::IonRuntime::Mark(trc);
> +        }

Please move the brace outside the #ifdef.

::: js/src/jsgc.cpp
@@ +2834,5 @@
>       * zones that are not being collected, we are not allowed to collect
>       * atoms. Otherwise, the non-collected zones could contain pointers
>       * to atoms that we would miss.
>       */
> +    if (rt->gcIsFull && !rt->keepAtoms()) {

The safety argument is a little tricky here. Can you please comment that:
1. rt->keepAtoms() can't change when we're in GC code because we can only start new threads from the main thread, which is doing GC work.
2. If rt->keepAtoms() changes between GC slices, then we'll cancel the incremental GC.

@@ +3372,5 @@
>  
>  static void
>  FindZoneGroups(JSRuntime *rt)
>  {
> +    AutoLockForExclusiveAccess lock(rt);

What's this for?

@@ +4133,5 @@
>  
>      runtime->gcNumber++;
>  
> +    // Threads other than the main thread can't trigger GC, and operate on
> +    // zones which will not be collected from while they are running.

Can you rephrase this to be of the form "It's okay if other threads have suppressGC set because..."?

::: js/src/vm/Runtime.h
@@ +1803,4 @@
>      }
>      ~AutoKeepAtoms() {
> +        if (pt->isMainThread())
> +            pt->runtimeFromMainThread()->keepAtoms_--;

Please assert it's > 0 before decrementing.

Attachment #820465 - Flags: review?(wmccloskey) → feedback+

Luke Wagner [:luke]

Comment 7

•

11 years ago

Actually, this patch is necessary x2: having the main thread block on a worker thread creates a potential deadlock for bug 929236 where a worker thread waits on a runnable dispatched to the main thread (where various gunk is forced to run).

Brian: do you know if, after these patches, there are any other cases where the main thread blocks on worker threads?

Blocks: 929236

Luke Wagner [:luke]

Comment 8

•

11 years ago

I was just experimenting with this patch applied and a sleep(10) inside the worker thread and I get an assertion when trying to merge the compilation results:
  https://pastebin.mozilla.org/3328204
I was able to reproduce this twice with a couple tries.  Just use any script async and put a sleep(10) somewhere in the parse path.  The main thread is still responsive so you can open/close a few tabs to trigger a gc.

Bill McCloskey [inactive unless it's an emergency] (:billm)

Comment 9

•

11 years ago

Comment on attachment 818766 [details] [diff] [review]
don't gc during atomization

Review of attachment 818766 [details] [diff] [review]:
-----------------------------------------------------------------

Sorry, forgot about this one.

Attachment #818766 - Flags: review?(wmccloskey) → review+

Brian Hackett [Laid off!]

Assignee

Comment 10

•

11 years ago

(In reply to Luke Wagner [:luke] from comment #7)
> Actually, this patch is necessary x2: having the main thread block on a
> worker thread creates a potential deadlock for bug 929236 where a worker
> thread waits on a runnable dispatched to the main thread (where various gunk
> is forced to run).
> 
> Brian: do you know if, after these patches, there are any other cases where
> the main thread blocks on worker threads?

There are several places where the main thread may be forced to wait for a worker to finish --- canceling off thread Ion compilations, waiting for off thread compression to finish, and (I think) waiting for asm.js compilation tasks to finish.  I think that in places where worker threads are waiting on the main thread we should always make sure there is at least one unrestrained worker thread available to process jobs.

Luke Wagner [:luke]

Comment 11

•

11 years ago

(In reply to Brian Hackett (:bhackett) from comment #10)
> There are several places where the main thread may be forced to wait for a
> worker to finish --- canceling off thread Ion compilations, waiting for off
> thread compression to finish, and (I think) waiting for asm.js compilation
> tasks to finish.  I think that in places where worker threads are waiting on
> the main thread we should always make sure there is at least one
> unrestrained worker thread available to process jobs.

Those should all be fine for the reason you mentioned and the fact that helperThreadCount() now either returns 0 or >=2.  What I was asking about is if there are any more places where the main thread indiscriminately waits on all worker threads to halt like the case this bug is removing.

Brian Hackett [Laid off!]

Assignee

Comment 12

•

11 years ago

(In reply to Luke Wagner [:luke] from comment #11)
> Those should all be fine for the reason you mentioned and the fact that
> helperThreadCount() now either returns 0 or >=2.  What I was asking about is
> if there are any more places where the main thread indiscriminately waits on
> all worker threads to halt like the case this bug is removing.

No, this is the only place where the main thread waits on all worker threads.

Jan de Mooij [:jandem]

Updated

•

11 years ago

Attachment #818767 - Flags: review?(jdemooij) → review+

Brian Hackett [Laid off!]

Assignee

Comment 13

•

11 years ago

Attached patch updated — Details — Splinter Review

This looks pretty good on try.  The problem with the isSameNonEmptySpan assertions was that AutoPrepareForTracing would copy freelist stuff for all zones, including those being manipulated off thread.  Fixing this also requires some ZoneIters to avoid inspecting zones being manipulated off thread.

Attachment #820465 - Attachment is obsolete: true

Attachment #824410 - Flags: review?(wmccloskey)

Bill McCloskey [inactive unless it's an emergency] (:billm)

Comment 14

•

11 years ago

Sorry for the delay here. The canLock change is really nice. While I'm working my way through, can you answer a two questions?

1. Why is the operation callback lock acquired before the GC lock? It seems like the other way around would make more sense. I can't think of why we would take the GC lock while holding the operation callback lock, but maybe there's a case?

2. Why is the worker thread state lock used to control access to rt->heapState? The GC lock would make more sense. Were you trying to avoid reentrant locking issues with allocateFromArenaInline?

Brian Hackett [Laid off!]

Assignee

Comment 15

•

11 years ago

(In reply to Bill McCloskey (:billm) from comment #14)
> 1. Why is the operation callback lock acquired before the GC lock? It seems
> like the other way around would make more sense. I can't think of why we
> would take the GC lock while holding the operation callback lock, but maybe
> there's a case?

Ion needs to take the operation callback lock in CodeGenerator::link because it is allocating code in the Ion allocator, which is protected by that lock as it can be mprotected by other threads triggering a callback in Ion code.  This same call also allocates a GC thing, which could end up taking the GC lock to refill the IonCode freelist.

> 2. Why is the worker thread state lock used to control access to
> rt->heapState? The GC lock would make more sense. Were you trying to avoid
> reentrant locking issues with allocateFromArenaInline?

If the heap is busy when another thread tries to refill a freelist, it needs to block until the heap is no longer busy.  We don't have wait/notify plumbing available for the GC lock, but do for the worker thread state lock.  Since the only contention on heapState will involve worker threads this lock also seemed ok to use.

Brian Hackett [Laid off!]

Assignee

Comment 17

•

11 years ago

Attached patch rolled up patch (99c73ffc83d4) — Details — Splinter Review

Bill McCloskey [inactive unless it's an emergency] (:billm)

Comment 18

•

11 years ago

Attached patch atoms-zone-iter (obsolete) — Details — Splinter Review

This patch makes it explicit whether we want to iterate over the atoms zone or not. I also added several calls to acquire the exclusive access lock.

- The ones for memory reporting we already discussed.
- One in TraceRuntime and another in StartVerifyPreBarriers. These functions call MarkRuntime with a JSTracer that's not a GC marker. MarkRuntime calls jit::JitRuntime::Mark(trc), which iterates over all IonCode cells in the atoms zone.
- On for JS_IterateCompartments because it seems better to include the atoms compartment in the iteration. We could probably fix this one if it's an issue, but it seems better to include it for an API function.

I haven't tested with your patch, Brian. I'm a little worried that the extra locks will break the lock ordering assertions, but hopefully they'll be okay.

Attachment #826139 - Flags: review?(bhackett1024)

Bill McCloskey [inactive unless it's an emergency] (:billm)

Comment 19

•

11 years ago

Oh, I just noticed that you already took the exclusive access lock in MarkRuntime. I guess the ordering shouldn't be an issue then.

Bill McCloskey [inactive unless it's an emergency] (:billm)

Comment 20

•

11 years ago

Comment on attachment 824410 [details] [diff] [review]
updated

Review of attachment 824410 [details] [diff] [review]:
-----------------------------------------------------------------

This looks good. Please remove all the canCollect stuff.

::: js/src/gc/RootMarking.cpp
@@ +667,5 @@
>              MarkScriptRoot(trc, &vec[i].script, "scriptAndCountsVector");
>      }
>  
> +    if (!rt->isBeingDestroyed() && !trc->runtime->isHeapMinorCollecting()) {
> +        AutoLockForExclusiveAccess lock(rt);

I think we should require callers of MarkRuntime to take the the exclusive access lock since callers need to call PrepareForTracing(WithAtoms), and that needs the lock anyway.

::: js/src/jsworkers.cpp
@@ +334,2 @@
>  
> +    threads = (WorkerThread*) runtime->calloc_(sizeof(WorkerThread) * numThreads);

While you're here, an you switch this to js_pod_calloc?

::: js/src/jsworkers.h
@@ +74,5 @@
>      /* Worklist for source compression worker threads. */
>      Vector<SourceCompressionTask *, 0, SystemAllocPolicy> compressionWorklist;
>  
> +    WorkerThreadState(JSRuntime *rt)
> +    {

Brace should be up one line.

::: js/src/vm/Runtime.cpp
@@ +331,5 @@
>      operationCallbackLock = PR_NewLock();
>      if (!operationCallbackLock)
>          return false;
> +
> +    gcLock = PR_NewLock();

I'm assuming that some code runs between here and js_InitGC that needs the GC lock. What is it?

@@ +841,5 @@
> +
> +#ifdef DEBUG
> +
> +bool
> +JSRuntime::canLock(RuntimeLock which)

Can you rename this to assertCanLock and have it return void?

@@ +844,5 @@
> +bool
> +JSRuntime::canLock(RuntimeLock which)
> +{
> +#ifdef JS_THREADSAFE
> +    switch (which) {

Maybe this is too clever, but you could implement this with less code by having the cases fall through. That would also be nice because they'd be required to be listed in the order of the lock order relation.

@@ +847,5 @@
> +#ifdef JS_THREADSAFE
> +    switch (which) {
> +      case OperationCallbackLock:
> +        // ExclusiveAccessLock -> OperationCallbackLock
> +        // WorkerThreadStateLock -> OperationCallbackLock

These comments are a little random. I think it would be better to have a larger comment in the header file about how you have to acquire them in a certain order, and then a smaller comment here saying that this function implements the rules in the header. It might also be nice to explain here why the order is the way it is. Your comment 15 was helpful.

::: js/src/vm/Runtime.h
@@ +720,5 @@
> +        ExclusiveAccessLock,
> +        GCLock,
> +        WorkerThreadStateLock
> +    };
> +    bool canLock(RuntimeLock which);

Can you write a comment here giving the ordering rules?

Attachment #824410 - Flags: review?(wmccloskey) → review+

Brian Hackett [Laid off!]

Assignee

Comment 21

•

11 years ago

Comment on attachment 826139 [details] [diff] [review]
atoms-zone-iter

Review of attachment 826139 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/src/gc/Zone.h
@@ +307,5 @@
>  namespace js {
>  
> +/*
> + * Using the atoms zone without holding the exclusive access lock is dangerous
> + * because worker threads may be using it simulataneously. Therefore, it's

typo

Attachment #826139 - Flags: review?(bhackett1024) → review+

Bill McCloskey [inactive unless it's an emergency] (:billm)

Comment 22

•

11 years ago

Attached patch atoms-zone-iter v2 — Details — Splinter Review

I fixed a few problems with this patch and pushed it to try along with some other things. Some winxp jit-test is failing, so now I'm trying to figure out if it's caused by this patch or the other one I pushed with it.

Attachment #826139 - Attachment is obsolete: true

Attachment #827157 - Flags: review+

Bill McCloskey [inactive unless it's an emergency] (:billm)

Comment 23

•

11 years ago

Gary, Christian, can you guys try fuzzing the patch in comment 22 on Linux? There's a WinXP failure that I can't reproduce, and I'm hoping fuzzing might expose something that I can reproduce on Linux.

Flags: needinfo?(gary)

Flags: needinfo?(choller)

John Hopkins (:jhopkins)

Updated

•

11 years ago

Depends on: 935149

Gary Kwong [:gkw] [:nth10sd] (NOT official MoCo now)

Comment 24

•

11 years ago

As a quick result, I haven't seen anything terrible happen in the few minutes of fuzzing with the comment 22 patch, but I will leave it running overnight and will report back if there is anything that pops up.

Flags: needinfo?(gary)

Gary Kwong [:gkw] [:nth10sd] (NOT official MoCo now)

Updated

•

11 years ago

Depends on: 936414

Gary Kwong [:gkw] [:nth10sd] (NOT official MoCo now)

Comment 25

•

11 years ago

Comment on attachment 827157 [details] [diff] [review]
atoms-zone-iter v2

I didn't really find anything particularly bad here.

Attachment #827157 - Flags: feedback+

Bill McCloskey [inactive unless it's an emergency] (:billm)

Comment 26

•

11 years ago

Arghh. I think I figured out the problem. I added a couple extra AutoLockForExclusiveAccess uses, but I didn't push on top of Brian's patches to avoid GC during atomization. The Windows failure (which I finally managed to reproduce) was due to some recursive locking.

I did another try push on top of Brian's patches to avoid GC during atomization and Ion stuff.
https://tbpl.mozilla.org/?tree=Try&rev=418d3162b610

I also filed bug 936681. If the assertion that failed had been printed in the tinderbox log, this would have been a lot easier to track down.

Terrence Cole [:terrence]

Updated

•

11 years ago

Blocks: 927204

Brian Hackett [Laid off!]

Assignee

Comment 27

•

11 years ago

First two parts:

https://hg.mozilla.org/integration/mozilla-inbound/rev/cf059711090f
https://hg.mozilla.org/integration/mozilla-inbound/rev/53b2cdf230dc

Whiteboard: [leave open]

Christian Holler (:decoder)

Comment 28

•

11 years ago

I assume patch-fuzzing isn't required anymore as it's landing already? I didn't get earlier to this because I was still sorting out results for another request, but since gkw already tested this, it should be enough :)

Flags: needinfo?(choller)

Bill McCloskey [inactive unless it's an emergency] (:billm)

Comment 29

•

11 years ago

Atoms iter stuff:
https://hg.mozilla.org/integration/mozilla-inbound/rev/442211bce621

Carsten Book [:Tomcat]

Comment 30

•

11 years ago

https://hg.mozilla.org/mozilla-central/rev/cf059711090f
https://hg.mozilla.org/mozilla-central/rev/53b2cdf230dc

Ryan VanderMeulen [:RyanVM]

Comment 31

•

11 years ago

https://hg.mozilla.org/mozilla-central/rev/442211bce621

Gustavo Luiz Duarte

Updated

•

11 years ago

Depends on: 937605

Luke Wagner [:luke]

Comment 32

•

11 years ago

Maybe this has since been fixed, but I get a deadlock (about 50% of the time) on unrealengine.com/html5 with only the combined patch (based on 99c73ffc83d4) and Bill's atom-zone-iter2 patch applied.  Looks like the main thread is waiting on a lock in AutoLockForExclusiveAccess and a JS worker thread is calling AtomizeChars which, even with NoGC, ends up in refillFreeList which waits on the WorkerThreadState::PRODUCER condvar in the new "If we're off the main thread" block.

Full stacks:
https://pastebin.mozilla.org/3598152

Luke Wagner [:luke]

Comment 33

•

11 years ago

To wit: rt->atomsCompartment()->zone()->isCollecting() is false when the above deadlock occurs.  I don't know if this is necessarily the case when the worker thread is Atomizing, but for kicks I moved the AutoLockForExclusiveAccess to inside the then-branch (right before MarkAtoms)... then I hit a deadlock in the other AutoLockForExclusiveAccess in FindZoneGroups.  If I put *that* in a Maybe and only construct() if rt->atomsCompartment()->zone()->isCollecting(), then I can't repro a deadlock after 10 tries.  But again, I don't know what the invariants are here.

Brian Hackett [Laid off!]

Assignee

Comment 34

•

11 years ago

Hmm, thanks for the report.  I don't think this will be a problem in the patch I'm trying to get through try.  Bill's patch moved some AutoLockForExclusiveAccess around AutoPrepareForTracing which can do full GC slices, so to avoid recursively taking that lock we pretty much need to already have exclusive access throughout any GC session.  The worker is waiting for the GC session to finish, and if the worker holds the exclusive access lock there will necessarily not be any GC session in progress.  We don't allocate GC things while holding the other per-runtime locks (I'll add asserts for this).

Luke Wagner [:luke]

Comment 35

•

11 years ago

Great!  If you throw up a combined patch, I'll apply it and do some testing.

Brian Hackett [Laid off!]

Assignee

Comment 36

•

11 years ago

This try patch is looking pretty good:

https://tbpl.mozilla.org/?tree=Try&rev=96df11692a61

Luke Wagner [:luke]

Comment 37

•

11 years ago

Great!  Seems to be fixed.

:Benjamin Peterson

Updated

•

11 years ago

Depends on: 939464

Landry Breuil (:gaston)

Updated

•

11 years ago

Depends on: 939504

Brian Hackett [Laid off!]

Assignee

Comment 38

•

11 years ago

https://hg.mozilla.org/integration/mozilla-inbound/rev/2bbd3a23d40f

Ed Morley [:emorley]

Comment 39

•

11 years ago

https://hg.mozilla.org/mozilla-central/rev/2bbd3a23d40f

Gustavo Luiz Duarte

Updated

•

11 years ago

Depends on: 939741

Luke Wagner [:luke]

Comment 40

•

11 years ago

\o/

Can we remove [leave open]?

Brian Hackett [Laid off!]

Assignee

Updated

•

11 years ago

Status: NEW → RESOLVED

Closed: 11 years ago

Resolution: --- → FIXED

Whiteboard: [leave open]

Ryan VanderMeulen [:RyanVM]

Updated

•

11 years ago

Target Milestone: --- → mozilla28

Ioana (away)

Updated

•

10 years ago

Whiteboard: [qa-]

You need to log in before you can comment on or make changes to this bug.