Closed Bug 477627 Opened 15 years ago Closed 15 years ago

Deadlock in ClaimTitle with nested contexts and requests

Categories

(Core :: JavaScript Engine, defect, P1)

defect

Tracking

()

RESOLVED FIXED
mozilla1.9.1

People

(Reporter: igor, Assigned: igor)

References

Details

(Keywords: fixed1.9.1, Whiteboard: fixed-in-tracemonkey)

Attachments

(1 file, 10 obsolete files)

ClaimTitle from js/src/jslock.cpp suspends the current request when waiting for the thread that owns the title to finish its request. But the code fails to take into account the requests from other contexts that may also be active on the current thread. Thus when that owner thread calls js_GC before finishing its request, the deadlock happens.

The code should do the same what js_GC itself is doing in this case. That is, it should count the number of active requests from all contexts on the current thread and subtract it from JSRuntime.requestDepth without any manipulations of JSContext.requestDepth.
> thread and subtract it from JSRuntime.requestDepth without any manipulations of

s/Depth/Count/

The possibility of multiple contexts per thread may be ignored or left for the embedding to deal with in other places. This case can't be dealt with by the embedding in general.

On the other hand, a thread running code on several contexts (nesting on the thread's stack) is taking liberties with the request model, since the inner request may yield or suspend itself carefully, but the outer is running for a longer run-to-completion interval than it would if it didn't nest code on a second context. The integrity of the request model depends on bounding the length of a run to completion, and nesting arbitrarily undermines this.

Probably too late to fix the API here. We should lookf or other places that assume (JS_SuspendReques and JS_YieldRequest could also walk the passed-in cx's thread->contextList).

/be
(In reply to comment #1)

> On the other hand, a thread running code on several contexts (nesting on the
> thread's stack) is taking liberties with the request model

This is the main source of deadlocks that I've run into in XPConnect, and it's hard to debug and even harder to avoid.

Maybe we just need to take a hard stance on this issue. Right now we support nested requests in multiple contexts on one thread but deadlocking is easy if you don't suspend properly. IMO we either need to disallow this (enforced through assertions?) or start walking the thread's context list. What do you guys think?
Ben: that's exactly the choice, so it would help to know: how hard it has been to add the suspend/resume calls to the outer contexts (outer=older on the stack)?

/be
(In reply to comment #1)
> Probably too late to fix the API here.

There is no need to fix API here. ClaimTitle just should do exactly the same what js_GC does when it detects that GC is runs on another thread. I will post the patch tomorrow.

Another possibility is to move the requestDepth from JSContext to JSThread and change code to use JSThread, not JSContext, as owner of objects. This removes a possibility for nested requests and AFAICS should not break any API and would make JS_SuspendRequst works reliably accross nested contexts.
(In reply to comment #5)
> (In reply to comment #1)
> Another possibility is to move the requestDepth from JSContext to JSThread and
> change code to use JSThread, not JSContext, as owner of objects.

This will slow down code pervasively by adding a dependent load to the current exclusive-title-ownership test in JS_LOCK_OBJ, js_GetSlotThreadSafe, etc. That overhead is non-trivial already.

Of course we should try to avoid that overhead at runtime until objects become MT-shared, but that is a different bug.

/be
(In reply to comment #6)
> > (In reply to comment #1)
> > Another possibility is to move the requestDepth from JSContext to JSThread and
> > change code to use JSThread, not JSContext, as owner of objects.
> 
> This will slow down code pervasively by adding a dependent load to the current
> exclusive-title-ownership test in JS_LOCK_OBJ, js_GetSlotThreadSafe, etc.

I realized that the move of requestDepth from JSContext to JSThread is almost orthogonal with using JSContext/JSThread as the owner for objects. The orthogonality would break in just few places in jslock.cpp like ClaimTitle that would require some adjustments, but these should be safe.
(In reply to comment #4)

It hasn't been difficult to suspend properly once we've located the context switch but the trick is finding it. And it is completely possible that we're adding more misuses even now given that XPConnect sometimes uses a 'safe' JSContext when it can't find another to use.

I talked with jst about this a little and he was concerned that unless we start waling the context list we'll have bugs lurking in edge cases that debug assertions won't catch.
We should automate. Igor's on the case.

/be
Blocking+ p2 per discussion with brendan and bent.
Flags: blocking1.9.1+
Priority: -- → P2
Target Milestone: --- → mozilla1.9.1
Attached patch v1 (obsolete) — Splinter Review
The patch just fixes the reported bug and the bug 437325. That is, in js_ClaimTitle the patch effectively suspends all the requests from all contexts on the current thread in the same way as js_GC does.

The patch does *not* move requestDepth from JSContext to JSThread to make JS_SuspendRequest to suspend all the requests automatically. The problem is that JSThread can be shared among multiple JSRuntime instances. 

One way to address this is to stop this JSThread sharing. Another way is in JS_BeginRequest to search the current thread for another active context with the matching runtime and increase the requestDepth there. Both approaches are doable but they deserve a separated bug.
Attachment #361580 - Flags: review?(brendan)
recording patch dependency
Blocks: 437325
Blocks: 477999
Comment on attachment 361580 [details] [diff] [review]
v1

>+    /*
>+     * Discount all the requests on the current thread from contributing
>+     * to rt->requestCount before we wait for all other requests to finish.
>+     * JS_NOTIFY_REQUEST_DONE, that will wake us up, is only called on

Nit: s/that/which/

> #ifdef JS_THREADSAFE
>-    /* If we were invoked during a request, pay back the temporary debit. */
>-    if (requestDebit)
>-        rt->requestCount += requestDebit;

Trying to think of bugs in doing this += earlier; can't see a problem. Just noting the change.

>+uint32
>+js_DeactivateRequestsForGC(JSContext *cx)
>+{
>+    uint32 activeCount;
>+
>+    JS_ASSERT(cx->thread);
>+    JS_ASSERT(cx->runtime->gcThread != cx->thread);
>+
>+    activeCount = js_CountRequestContexts(cx);
>+    if (activeCount != 0) {
>+        JSRuntime *rt = cx->runtime;
>+        JS_ASSERT(activeCount <= rt->requestCount);
>+        rt->requestCount -= activeCount;
>+        if (rt->requestCount == 0)
>+            JS_NOTIFY_REQUEST_DONE(rt);
>+    }
>+    return activeCount;

"Activate" and "Deactivate" seem like inaccurate new metaphors, since nothing changes on this thread as far as request activity or even cx->requestDepth (which Suspend would change). How about "Discount" and "Recount"?

>+void
>+js_ActivateRequestsAfterGC(JSRuntime *rt, uint32 activeCount)
>+{
>+    while (rt->gcLevel > 0) {
>+        JS_ASSERT(rt->gcThread);
>+        JS_AWAIT_GC_DONE(rt);
>+    }
>+    if (activeCount != 0)
>+        rt->requestCount += activeCount;
>+}

I still prefer requestDebit over activeCount or activeAnything.

>+/*
>+ * If we're in one or more requests (possibly on more than one context)
>+ * running on the current thread, indicate, temporarily, that all these
>+ * requests are inactive so a possible GC can proceed on another thread.
>+ * This function returns the number of deactivated requests. The number must

s/deactivated/discounted/, etc.

>+ * be passed latter to js_ActivateRequestAfterGC to reactivate the requests.

s/latter/later/

>+         * Deactivate all the requests running on the current thread so a

Discount, etc.

>         /*
>          * We know that some other thread's context owns title, which is now
>          * linked onto rt->titleSharingTodo, awaiting the end of that other
>-         * thread's request.  So it is safe to wait on rt->titleSharingDone.
>+         * thread's request. So it is safe to wait on rt->titleSharingDone.
>+         * But before we force the operation callback for that other thread
>+         * so it can quickly suspend.

"But before waiting, we force ..."

r=me with nits addressed. Thanks,

/be
Attachment #361580 - Flags: review?(brendan) → review+
Attached patch v2 (obsolete) — Splinter Review
In the new patch I addressed the nits and moved js_WaitForGC, js_DiscountRequestsForGC and js_RecountRequestsAfterGC to jscntxt.cpp from jsgc.cpp as they are more related to JSContext manipulations rather to the GC itself.
Attachment #361580 - Attachment is obsolete: true
Attachment #362391 - Flags: review+
Attached patch v3 (obsolete) — Splinter Review
The previous version of the patch does not fix all the problems with ClaimTitle and nested requests. The code assumes that a deadlock can only happen if ownercx waits on a title that the current context owns. But with nested requests on different contexts the check should compare threads, not just JSContext instances.

The new patch fixes that.
Attachment #362391 - Attachment is obsolete: true
Attachment #362480 - Flags: review?(brendan)
Comment on attachment 362480 [details] [diff] [review]
v3

>+    JS_ASSERT(CURRENT_THREAD_IS_ME(cx->thread));

It was fun at the time, but the ungrammatical "is me" is grating on me. Oh well, separate bug.

>-static JSBool
>-WillDeadlock(JSTitle *title, JSContext *cx)
>-{
>-    JSContext *ownercx;
>-
>-    do {
>-        ownercx = title->ownercx;
>-        if (ownercx == cx) {
>-            JS_RUNTIME_METER(cx->runtime, deadlocksAvoided);
>-            return JS_TRUE;
>-        }
>-    } while (ownercx && (title = ownercx->titleToShare) != NULL);

Note loop to find n-cycle for arbitrary n.

>+         * Avoid deadlock if title's owner thread is waiting on a title that
>+         * the current thread own, by revoking title's ownership. This

"the current thread owns"

>-        if (rt->gcThread == cx->thread ||
>-            (ownercx->titleToShare &&
>-             WillDeadlock(ownercx->titleToShare, cx))) {
>+        bool share = (rt->gcThread == cx->thread);
>+        if (!share && ownercx->titleToShare) {

Here ownercx->thread could be cx->thread, unless I missed a change earlier.

>+            JSContext *cx2 = ownercx->titleToShare->ownercx;
>+            if (cx2) {
>+                JS_ASSERT(cx2->thread != ownercx->thread);
>+                share = (cx2->thread == cx->thread);

So I don't see how the assertion can be always valid if the condition assigned to share is true.

In general a deadlock involves an n-cycle for n >=1 of threads waiting for other threads (n==1 is selflock or stuck lock). It would seem necessary to preserve WillDeadlock (whether it gets inlined by hand or not; prefer to let the compiler decide) and traverse the dependency graph, checking for ownercx->thread == cx->thread in its do-while loop body (shown above).

Hope I'm not missing something obvious.

/be
(In reply to comment #16)
> In general a deadlock involves an n-cycle for n >=1 of threads waiting for
> other threads (n==1 is selflock or stuck lock). It would seem necessary to
> preserve WillDeadlock (whether it gets inlined by hand or not; prefer to let
> the compiler decide) and traverse the dependency graph, checking for
> ownercx->thread == cx->thread in its do-while loop body (shown above).

n > 2 case is a bad miss on my part. It should be indeed sufficient to change the loop in the detector to check for the current thread, not cx.
Attached patch v4 (obsolete) — Splinter Review
The new version restores WillDeadlock waiting cycle detector, fixes the issues with claiming the title when some thread waits for it and adds comments to explain all the corner cases.
Attachment #362480 - Attachment is obsolete: true
Attachment #363298 - Flags: review?(brendan)
Attachment #362480 - Flags: review?(brendan)
Attachment #363298 - Attachment description: v2 → v4
Comment on attachment 363298 [details] [diff] [review]
v4

>Index: tm/js/src/jsgc.cpp
>===================================================================
>--- tm.orig/js/src/jsgc.cpp	2009-02-20 14:27:09.000000000 +0100
>+++ tm/js/src/jsgc.cpp	2009-02-20 14:46:45.000000000 +0100
>@@ -3244,16 +3244,22 @@ js_GC(JSContext *cx, JSGCInvocationKind 
> #endif
> #ifdef JS_GCMETER
>     uint32 nlivearenas, nkilledarenas, nthings;
> #endif
> 
>     JS_ASSERT_IF(gckind == GC_LAST_DITCH, !JS_ON_TRACE(cx));
>     rt = cx->runtime;
> #ifdef JS_THREADSAFE

Nit (pre-existing): blank line before #ifdef.

>+ * Return true if waiting on rt->titleSharingDone in ClaimTitle until ownercx
>+ * finishes its request and makes a title shared deadlocks.

Suggest subjunctive mood: "Return true if we would deadlock waiting in ClaimTitle on rt->titleSharingDone until ownercx finishes its request and shares a title."

>+         * The latter check covers the case when the embedding triggers a call
>+         * to js_GC on cx outside a request while having ownercx running a

s/on cx/on a cx/

>+         * request on the same thread and js_GC calls a mark hook or a

s/thread and/thread, and then/

Has MikeM@RetekSolutions.com or Wes (on irc.mozilla.org #jsapi) tested this patch out? It would be great to get testing in a multi-threaded embedding.

r=me with nits -- thanks.

/be
Attachment #363298 - Flags: review?(brendan) → review+
Here is the example that expose the deadlock in the js shell using evalcx and scatter:

function f()
{
    sleep(0.001);
    Date;
}

function thread_1() { evalcx("f()", this); }

function thread_2() { Math; }

scatter([thread_1, thread_2]);

Without the patch it hangs since thread_2 waits to get the shared title to access the Math property of the global object while thread_1 waits to get Date. Now, with the patch the shell asserts since I forgot to remove no longer applicable assert. Without the assert the patch works as expected.
Here is the test case demonstrating the problem from the comment 0 in js shell. Without the patch it deadlocks. Note that it may be necessary to run the shell with -S 0 as currently the stack limit calculations is broken in js shell for non-main threads.

var resource = {};
resource.x = 0;

make_object_shared(this);
scatter([thread_1, thread_2]);

function thread_1()
{
    // We need to make sure thread_2 is already called under_evalcx before                                                                                         
    // we call gc(). That code should never leave the current request as                                                                                           
    // otherwise under_evalcx will succeed in sharing "resource" before we                                                                                         
    // call the GC. Construction of a string with 2**24 chars without loops                                                                                        
    // fits the bill as no operation callback checks will be done in the                                                                                           
    // code.                                                                                                                                                       
    var s = "1";
    s+=s;s+=s;s+=s;s+=s;s+=s;s+=s;s+=s;s+=s;s+=s;s+=s;
    s+=s;s+=s;s+=s;s+=s;s+=s;s+=s;s+=s;s+=s;s+=s;s+=s;
    s+=s;s+=s;s+=s;s+=s;
    gc();
}

function thread_2() { evalcx("under_evalcx()", this); }

function under_evalcx() { resource.y = 1; }

function make_object_shared(obj)
{
    function f() {
        var s = 0;
        for (var i = 0; i != 1000; ++i) s += obj[i];
        return s;
    }
    scatter([f,f,f,f])
}
Even with the patch the current approach for deadlock avoidance is not enough. The problem is that WillDeadlock assumes that if ownercx->titleToShare is null, then ownercx->thread cannot wait in ClaimTitle. But this is not true since ownercx->thread *can* wait on some other context that have titleToShare. So to fix this it is necessary to enumerate all contexts and check if any with cx->titleToShare waits on the current thread. 

The following example demonstrates the problem in js shell. With or without the patch is triggers the deadlock:

this.resource1 = {};
this.resource1.x = 0;

this.resource2 = null;
this.resource3 = null;

make_object_shared(this);
make_object_shared(Object);
make_object_shared(Object.prototype);
scatter([thread_1, thread_2, thread_3]);

function thread_1()
{
    // We need to make sure thread_2 is already called before                                                                                                      
    // we call gc(). That code should never leave the current request as                                                                                           
    // otherwise under_evalcx will succeed in sharing "resource" before we                                                                                         
    // call the GC. Construction of a string with 2**20 chars without loops                                                                                        
    // fits the bill as no operation callback checks will be done in the                                                                                           
    // code.                                                                                                                                                       
    var s = "1";
    s+=s;s+=s;s+=s;s+=s;s+=s;s+=s;s+=s;s+=s;s+=s;s+=s;
    s+=s;s+=s;s+=s;s+=s;s+=s;s+=s;s+=s;s+=s;s+=s;s+=s;
    if (!this.resource2) {
        print("warning: test must be tuned so thread2 at this point will set resource2");
        return;
    }
    this.resource2.x = 0;
}

function thread_2()
{
    this.resource2 = {};
    this.resource2.y = 0;
    evalcx("under_evalcx2()", this);
}

function under_evalcx2() {
    // busy pseudo-wait without triggering operation callbacks                                                                                                     
    var s = "1";
    s+=s;s+=s;s+=s;s+=s;s+=s;s+=s;s+=s;s+=s;s+=s;s+=s;
    s+=s;s+=s;s+=s;s+=s;s+=s;s+=s;s+=s;s+=s;s+=s;s+=s;

    if (!this.resource3) {
        print("warning: test must be tuned so thread3 at this point will set resource3");
        return;
    }
    this.resource3.y = 0;
}

function thread_3() {
    this.resource3 = {};
    this.resource3.z = 0;
    evalcx("under_evalcx3()", this);
}

function under_evalcx3() {
    this.resource1.z = 0;
}

function make_object_shared(obj)
{
    function f() {
        var s = 0;
        for (var i = 0; i != 1000; ++i) s += obj[i];
        return s;
    }
    scatter([f,f,f,f])
}
Attachment #363298 - Attachment is obsolete: true
I simple "way" to address the problem from the comment 22 is to move titleToShare from JSContext to JSThread. But that does not work when multiple runtimes shares the same thread as WillDeadlock may discover on some thread titleToShare from another runtime. So enumerating all the contexts should be the safest option.
I just realized that the current sharing of JSThread among different JSRuntimes is broken. The problem is that js_GC freely accesses JSThread fields for non-current threads. This is not good as a different runtime may access the same fields without any locking. Fixes like those cx->runtime checks that the patches in this bug brings to fix the bug 437325 only hides the problem under the carpet.

As such the idea from the comment 23 is no worse than what js_GC is doing.
No longer blocks: 437325
Depends on: 437325
Attached patch v5 (obsolete) — Splinter Review
Here is the patch that passes the test from the comment 22. It is based on the patch from the bug 437325 comment 12. It takes advantage of the fact that with that patch it is safe to access JSThread.titleToShare for non-current threads as long as the GC lock is held.

The patch also changes NudgeThread to iterate only over contexts from the blocking thread which is not possible when JSThread can be shared between runtimes.
Attached patch v6 (obsolete) — Splinter Review
The new patch applies on top of the patch from bug 437325 comment 69.
Attachment #363561 - Attachment is obsolete: true
Is v6 ready for review?
(In reply to comment #27)
> Is v6 ready for review?

I need to fix bug 437325 first.
Attached patch v7 (obsolete) — Splinter Review
The new version is just v6 re-diffed against the TM tip given that the bug 437325 is fixed now.
Attachment #365912 - Attachment is obsolete: true
I will test the latest patch and, if that would be OK, ask for a review tomorrow, 2009-09-25.
Attached patch v8 (obsolete) — Splinter Review
The new patch passes the tests. See comment 25 for details of the patch structure.
Attachment #369176 - Attachment is obsolete: true
Attachment #369523 - Flags: review?(brendan)
review ping for brendan
Attached patch v9 (obsolete) — Splinter Review
The new version syncs the patch with TM tip changes.
Attachment #369523 - Attachment is obsolete: true
Attachment #371642 - Flags: review?(brendan)
Attachment #369523 - Flags: review?(brendan)
Comment on attachment 371642 [details] [diff] [review]
v9

>+        /*
>+         * If the GC runs on another thread, temporary suspend the current

s/temporary/temporarily/

>+         * ownercx->thread is waiting in ClaimTitle for a context from some
>+         * thread to finish its request. If that thread is the current thread,
>+         * the dealock happens.

s/the deadlock happens/we would deadlock/

>- * members can't be modified by js_ConcatStrings, js_MinimizeDependentStrings,
>- * or js_UndependString.
>+ * FinishSharingTitle is the tail part of ShareTitle, split out to become a
>+ * subroutine of js_ShareWaitingTitles too. The bulk of the work here involves
>+ * making mutable strings in the title's object's slots be immutable. We have
>+ * to do this because such strings will soon be available to multiple threads,
>+ * so their buffers can't be realloc'd any longer in js_ConcatStrings, and
>+ * their members can't be modified by js_ConcatStrings,
>+ * js_MinimizeDependentStrings, or js_UndependString.

Too bad this wrapped less nicely -- could reorder to put the longest method name last.

>+            canClaim = (ownercx->thread == cx->thread
>+                        && cx->requestDepth > 0);

&& at end of continued line.

Looks good otherwise, sorry for the delay.

/be
Attachment #371642 - Flags: review?(brendan) → review+
This blocks a P1, so is a P1
Priority: P2 → P1
Igor: can you land this soon, so we can get it baked and to m-c and then 1.9.1?  It's a b4 blocker. Thanks!
The patch requires some merge work due to landed changes from the bug 480301. I will try to address this today.
I will not be able to update today the patch to merge it with changes from the bug 480301. There are some issues with that that requires some investigation.
Attached patch v10 (merge start) (obsolete) — Splinter Review
Backup before doing merge work with 480301 - that patch just undos the changes from that bug and applies v9 on top of that for a start.
Attachment #371642 - Attachment is obsolete: true
Attached patch v11Splinter Review
The new patch is v9 with nits addressed.
Attachment #372950 - Attachment is obsolete: true
Attachment #372981 - Flags: review+
landed to TM - http://hg.mozilla.org/tracemonkey/rev/c599deb96665
Whiteboard: fixed-in-tracemonkey
To Andreas:

A cause of recent sporadic oranges could be with this bug. So I would suggest backing it out as well if the orange persists.
Did this get backed out of tm?
Whiteboard: fixed-in-tracemonkey → fixed-in-tracemonkey [backout needed?]
(In reply to comment #43)
> Did this get backed out of tm?

No - the TM oranges it seems was not related to this bug.
Whiteboard: fixed-in-tracemonkey [backout needed?] → fixed-in-tracemonkey
Keywords: fixed1.9.1
So glad this stuck. Thanks, Igor for all your work here.

/be
Depends on: 600853
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: