Closed Bug 1297901 Opened 8 years ago Closed 8 years ago

Intermittent js/src/jit-test/tests/wasm/jsapi.js | Timeout (code -9, args "--baseline-eager")

Categories

(Core :: JavaScript Engine, defect, P3)


Tracking


RESOLVED FIXED
mozilla53
Tracking Status
firefox51 --- disabled
firefox52 --- fixed
firefox53 --- fixed

People

(Reporter: intermittent-bug-filer, Assigned: bbouvier)

References

Details

(Keywords: intermittent-failure)

Attachments

(1 file, 2 obsolete files)

Bulk assigning P3 to all open intermittent bugs without a priority set in Firefox components per bug 1298978.
Priority: -- → P3
Flags: needinfo?(bbouvier)
Found it (thanks to rr!), taking.
Assignee: nobody → bbouvier
Status: NEW → ASSIGNED
Flags: needinfo?(bbouvier)
Attached patch fix-ordering.patch (obsolete) — Splinter Review
A first simple fix. Jon says we might even be able to remove the ShellAsyncTasks lock itself, I'm trying this now.
I've tried removing the ShellAsyncTasks mutex; we can't do that easily, because AutoLockHelperThreadState is a LockGuard while ExclusiveData<> hands out its own ExclusiveData::Guard type, and there's no obvious way to convert one into the other.

Found the issue by running wasm/jsapi.js under rr in chaos mode; after a few runs, a deadlock happened, and I could see the main thread was blocked in DrainJobQueue. Then, repeating the WebAssembly.compile tests from this file (which are the only ones to use drainJobQueue()) made it trigger 100% of the time under rr in chaos mode.

Asking review on the initial patch, with an explanation. Consider the following piece of code:

    while (true) {
        if (!sc->asyncTasks.lock()->outstanding)
            break;
        AutoLockHelperThreadState lock;
        HelperThreadState().wait(lock, GlobalHelperThreadState::CONSUMER);
    }

`outstanding` is updated by the ShellFinishAsyncTaskCallback, which is called there: http://searchfox.org/mozilla-central/source/js/src/vm/HelperThreads.cpp#1455

So there's a thread interleaving in which the following happens:
- main thread: takes asyncTasks lock, looks at outstanding: it's positive.
- main thread: unlock asyncTasks (leaving the if)
- helper thread: takes asyncTasks lock, updates outstanding to 0
- helper thread: takes GlobalHelperThreadState lock, notifies consumers
- main thread: wait on consumer

Thus the main thread waits forever, after the notification was sent.

Proposed solution is to lock the HelperThreadState *before* the asyncTasks lock:
- can't cause a deadlock with the other mutex, since the main thread takes the asyncTasks lock and releases it directly in the `if` condition.
- this ensures the helper thread can't notify before the main thread is waiting

Under rr chaos mode, this clears the issue right away.
Comment on attachment 8810469 [details] [diff] [review]
fix-ordering.patch

We've talked about this tonight, but for the record I've explained it precisely in the previous comment.
Attachment #8810469 - Flags: review?(jcoppeard)
Nice!  I actually tried to run wasm/jsapi.js in a loop a few times to try to reproduce this (to no avail); running in rr chaos mode was a fantastic idea.  Thanks so much!
Comment on attachment 8810469 [details] [diff] [review]
fix-ordering.patch

Review of attachment 8810469 [details] [diff] [review]:
-----------------------------------------------------------------

Looks good.

::: js/src/vm/MutexIDs.h
@@ +20,5 @@
>    _(RuntimeExclusiveAccess,      200) \
>                                        \
>    _(GlobalHelperThreadState,     300) \
>                                        \
> +  _(ShellAsyncTasks,             350) \

This works, but can you make this order 500 instead?

It only needs to be greater than GlobalHelperThreadState to work, but reducing the number of distinct values makes it clearer that fewer combinations of mutexes can be held at the same time, because mutexes with the same order cannot be held simultaneously.

I should really have added a comment to this effect.
Attachment #8810469 - Flags: review?(jcoppeard) → review+
Fwiw, this is not pushed yet because a try build shows that an order of 500 can't work:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=dd57aed530a9b5308a01d269d648fd086e9957f9

But even an order of 350 doesn't work, as shown by running the wasm/jsapi.js test with JS_GC_ZEAL=14,10.
Attached patch mutexes.patch (obsolete) — Splinter Review
This works locally and under the initial test case with rr chaos mode. Running a try build just to be sure...
Attachment #8810469 - Attachment is obsolete: true
Attachment #8812730 - Flags: review?(jcoppeard)
Attached patch mutexes.patch — Splinter Review
A better patch, which updates the comment and just moves the finished queue.
Attachment #8812730 - Attachment is obsolete: true
Attachment #8812730 - Flags: review?(jcoppeard)
Attachment #8812747 - Flags: review?(jcoppeard)
Comment on attachment 8812747 [details] [diff] [review]
mutexes.patch

Review of attachment 8812747 [details] [diff] [review]:
-----------------------------------------------------------------

Great, thanks for the updates.
Attachment #8812747 - Flags: review?(jcoppeard) → review+
Pushed by bbouvier@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/aa05debf2237
Fix ordering of taking mutexes for the shell tasks; r=jonco
https://hg.mozilla.org/mozilla-central/rev/aa05debf2237
Status: ASSIGNED → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla53