Intermittent js/src/jit-test/tests/wasm/jsapi.js | Timeout (code -9, args "--baseline-eager")

RESOLVED FIXED in Firefox 52

Status

()

Core
JavaScript Engine
P3
normal
RESOLVED FIXED
a year ago
11 months ago

People

(Reporter: Treeherder Bug Filer, Assigned: bbouvier)

Tracking

({intermittent-failure})

unspecified
mozilla53
intermittent-failure
Points:
---

Firefox Tracking Flags

(firefox51 disabled, firefox52 fixed, firefox53 fixed)

Details

Attachments

(1 attachment, 2 obsolete attachments)

(Reporter)

Description

a year ago
treeherder
Filed by: philringnalda [at] gmail.com

https://treeherder.mozilla.org/logviewer.html#?job_id=34593319&repo=mozilla-inbound

https://queue.taskcluster.net/v1/task/FuR7WtzrQ3e4yD64DHuksw/runs/0/artifacts/public%2Flogs%2Flive_backing.log

Comment 1

a year ago
Bulk assigning P3 to all open intermittent bugs without a priority set in Firefox components per bug 1298978.
Priority: -- → P3

Comment 2

a year ago
17 automation job failures were associated with this bug in the last 7 days.

Repository breakdown:
* autoland: 12
* mozilla-inbound: 4
* try: 1

Platform breakdown:
* linux64: 14
* osx-10-10: 3

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1297901&startday=2016-10-31&endday=2016-11-06&tree=all

Comment 3

a year ago
23 automation job failures were associated with this bug yesterday.

Repository breakdown:
* autoland: 19
* mozilla-inbound: 3
* mozilla-central: 1

Platform breakdown:
* linux64: 22
* windows8-64: 1

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1297901&startday=2016-11-08&endday=2016-11-08&tree=all

Comment 4

a year ago
18 automation job failures were associated with this bug yesterday.

Repository breakdown:
* mozilla-inbound: 8
* autoland: 7
* try: 2
* mozilla-central: 1

Platform breakdown:
* linux64: 14
* windows8-64: 2
* windowsxp: 1
* linux32: 1

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1297901&startday=2016-11-09&endday=2016-11-09&tree=all
(Assignee)

Updated

a year ago
Flags: needinfo?(bbouvier)

Comment 5

11 months ago
60 automation job failures were associated with this bug in the last 7 days.

Repository breakdown:
* autoland: 28
* mozilla-inbound: 23
* mozilla-central: 5
* try: 3
* graphics: 1

Platform breakdown:
* linux64: 51
* windows8-64: 4
* windowsxp: 2
* linux32: 2
* osx-10-10: 1

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1297901&startday=2016-11-07&endday=2016-11-13&tree=all
(Assignee)

Comment 6

11 months ago
Found it (thanks to rr!), taking.
Assignee: nobody → bbouvier
Status: NEW → ASSIGNED
Flags: needinfo?(bbouvier)
(Assignee)

Comment 7

11 months ago
Created attachment 8810469 [details] [diff] [review]
fix-ordering.patch

A first simple fix. Jon says we might even be able to remove the ShellAsyncTasks lock itself, I'm trying this now.
(Assignee)

Comment 8

11 months ago
I've tried removing the ShellAsyncTask's mutex; we can't do that easily because AutoLockHelperThreadState is a LockGuard and ExclusiveData<>::Guard is a ExclusiveData::Guard, and there's no obvious way to convert from one to the other.

Found the issue by running wasm/jsapi.js in rr with chaos mode; after running under rr a few times, one time a deadlock happened, and I could see the main thread was blocked in DrainJobQueue. Then, repeating the WebAssembly.compile tests from this file (which are the only ones to use drainJobQueue()) from this file made it trigger 100% of the time in rr with chaos mode.

Asking review on the initial patch, with an explanation. Considering the following piece of code:

    while (true) {
        if (!sc->asyncTasks.lock()->outstanding)
            break;
        AutoLockHelperThreadState lock;
        HelperThreadState().wait(lock, GlobalHelperThreadState::CONSUMER);
    }

`outstanding` is updated by the ShellFinishAsyncTaskCallback, which is called there: http://searchfox.org/mozilla-central/source/js/src/vm/HelperThreads.cpp#1455

So there's a mutex ordering in which we do the following:
- main thread: takes asyncTasks lock, looks at outstanding: it's positive.
- main thread: unlock asyncTasks (leaving the if)
- helper thread: takes asyncTasks lock, updates outstanding to 0
- helper thread: takes GlobalHelperThreadState lock, notifies consumers
- main thread: wait on consumer

Thus the main thread waits forever, after the notification was sent.

Proposed solution is to lock the HelperThreadState *before* the asyncTasks lock:
- can't cause a deadlock with the other mutex, since the main thread takes the asyncTasks lock and releases it directly in the `if` condition.
- this ensures the helper thread can't notify before the main thread is waiting

Under rr chaos mode, this clears the issue right away.
(Assignee)

Comment 9

11 months ago
Comment on attachment 8810469 [details] [diff] [review]
fix-ordering.patch

We've talked about this tonight, but for the record I've explained it precisely in the previous comment.
Attachment #8810469 - Flags: review?(jcoppeard)

Comment 10

11 months ago
Nice!  I actually tried to run wasm/jsapi.js in a loop a few times to try to reproduce this (to no avail); running in rr chaos mode was a fantastic idea.  Thanks so much!

Comment 11

11 months ago
Comment on attachment 8810469 [details] [diff] [review]
fix-ordering.patch

Review of attachment 8810469 [details] [diff] [review]:
-----------------------------------------------------------------

Looks good.

::: js/src/vm/MutexIDs.h
@@ +20,5 @@
>    _(RuntimeExclusiveAccess,      200) \
>                                        \
>    _(GlobalHelperThreadState,     300) \
>                                        \
> +  _(ShellAsyncTasks,             350) \

This works, but can you make this order 500 instead?

It only needs to be greater than GlobalHelperThreadState to work, but reducing the number of distinct values makes it clear(er) that there are fewer possible combinations of mutexes that can be held at the same time because mutexes with the same order cannot be held at the same time.

I should really have added a comment to this effect.
Attachment #8810469 - Flags: review?(jcoppeard) → review+
status-firefox51: --- → disabled
status-firefox52: --- → affected
status-firefox53: --- → affected

Comment 12

11 months ago
21 failures in 715 pushes (0.029 failures/push) were associated with this bug in the last 7 days.  

Repository breakdown:
* mozilla-inbound: 12
* autoland: 5
* try: 2
* mozilla-central: 1
* mozilla-aurora: 1

Platform breakdown:
* linux64: 11
* osx-10-10: 10

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1297901&startday=2016-11-14&endday=2016-11-20&tree=all
(Assignee)

Comment 13

11 months ago
Fwiw, this is not pushed yet because a try build shows that an order of 500 can't work:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=dd57aed530a9b5308a01d269d648fd086e9957f9

But even an order of 350 doesn't work, as shown by running the wasm/jsapi.js test with JS_GC_ZEAL=14,10.
(Assignee)

Comment 14

11 months ago
Created attachment 8812730 [details] [diff] [review]
mutexes.patch

This works locally and under the initial test case with rr chaos mode. Running a try build just to be sure...
Attachment #8810469 - Attachment is obsolete: true
Attachment #8812730 - Flags: review?(jcoppeard)
(Assignee)

Comment 15

11 months ago
Created attachment 8812747 [details] [diff] [review]
mutexes.patch

A better patch, which updates the comment and just Moves the finished queue.
Attachment #8812730 - Attachment is obsolete: true
Attachment #8812730 - Flags: review?(jcoppeard)
Attachment #8812747 - Flags: review?(jcoppeard)
(Assignee)

Comment 16

11 months ago
Fwiw, latest try build is all green and happy: https://treeherder.mozilla.org/#/jobs?repo=try&revision=1e440558ee52b00a8244fcfdc46dc055ba336f29

Comment 17

11 months ago
Comment on attachment 8812747 [details] [diff] [review]
mutexes.patch

Review of attachment 8812747 [details] [diff] [review]:
-----------------------------------------------------------------

Great, thanks for the updates.
Attachment #8812747 - Flags: review?(jcoppeard) → review+

Comment 18

11 months ago
Pushed by bbouvier@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/aa05debf2237
Fix ordering of taking mutexes for the shell tasks; r=jonco

Comment 19

11 months ago
bugherder
https://hg.mozilla.org/mozilla-central/rev/aa05debf2237
Status: ASSIGNED → RESOLVED
Last Resolved: 11 months ago
status-firefox53: affected → fixed
Resolution: --- → FIXED
Target Milestone: --- → mozilla53

Comment 20

11 months ago
bugherderuplift
https://hg.mozilla.org/releases/mozilla-aurora/rev/45a3bdf8faec
status-firefox52: affected → fixed

Comment 21

11 months ago
11 failures in 623 pushes (0.018 failures/push) were associated with this bug in the last 7 days.  

Repository breakdown:
* mozilla-inbound: 5
* autoland: 5
* mozilla-central: 1

Platform breakdown:
* linux64: 7
* osx-10-10: 4

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1297901&startday=2016-11-21&endday=2016-11-27&tree=all
You need to log in before you can comment on or make changes to this bug.