Closed Bug 1297901 Opened 8 years ago Closed 8 years ago

Intermittent js/src/jit-test/tests/wasm/jsapi.js | Timeout (code -9, args "--baseline-eager")

Categories

(Core :: JavaScript Engine, defect, P3)


Tracking


RESOLVED FIXED
mozilla53
Tracking Status
firefox51 --- disabled
firefox52 --- fixed
firefox53 --- fixed

People

(Reporter: intermittent-bug-filer, Assigned: bbouvier)

References

Details

(Keywords: intermittent-failure)

Attachments

(1 file, 2 obsolete files)

Bulk assigning P3 to all open intermittent bugs without a priority set in Firefox components per bug 1298978.
Priority: -- → P3
Flags: needinfo?(bbouvier)
Found it (thanks to rr!), taking.
Assignee: nobody → bbouvier
Status: NEW → ASSIGNED
Flags: needinfo?(bbouvier)
Attached patch fix-ordering.patch (obsolete) — Splinter Review
A first simple fix. Jon says we might even be able to remove the ShellAsyncTasks lock itself, I'm trying this now.
I've tried removing the ShellAsyncTasks mutex; we can't do that easily, because AutoLockHelperThreadState is a LockGuard while ExclusiveData<> hands out its own ExclusiveData::Guard type, and there's no obvious way to convert one into the other.

Found the issue by running wasm/jsapi.js under rr in chaos mode; after a few runs, a deadlock happened, and I could see the main thread was blocked in DrainJobQueue. Then, repeating the WebAssembly.compile tests from this file (which are the only ones to use drainJobQueue()) made it trigger 100% of the time under rr in chaos mode.

Asking review on the initial patch, with an explanation. Consider the following piece of code:

    while (true) {
        if (!sc->asyncTasks.lock()->outstanding)
            break;
        AutoLockHelperThreadState lock;
        HelperThreadState().wait(lock, GlobalHelperThreadState::CONSUMER);
    }

`outstanding` is updated by the ShellFinishAsyncTaskCallback, which is called there: http://searchfox.org/mozilla-central/source/js/src/vm/HelperThreads.cpp#1455

So there's a thread interleaving in which the following happens:
- main thread: takes asyncTasks lock, looks at outstanding: it's positive.
- main thread: unlock asyncTasks (leaving the if)
- helper thread: takes asyncTasks lock, updates outstanding to 0
- helper thread: takes GlobalHelperThreadState lock, notifies consumers
- main thread: wait on consumer

Thus the main thread waits forever, after the notification was sent.

Proposed solution is to lock the HelperThreadState *before* the asyncTasks lock:
- can't cause a deadlock with the other mutex, since the main thread takes the asyncTasks lock and releases it directly in the `if` condition.
- this ensures the helper thread can't notify before the main thread is waiting

Under rr chaos mode, this clears the issue right away.
Comment on attachment 8810469 [details] [diff] [review]
fix-ordering.patch

We've talked about this tonight, but for the record I've explained it precisely in the previous comment.
Attachment #8810469 - Flags: review?(jcoppeard)
Nice!  I actually tried to run wasm/jsapi.js in a loop a few times to try to reproduce this (to no avail); running in rr chaos mode was a fantastic idea.  Thanks so much!
Comment on attachment 8810469 [details] [diff] [review]
fix-ordering.patch

Review of attachment 8810469 [details] [diff] [review]:
-----------------------------------------------------------------

Looks good.

::: js/src/vm/MutexIDs.h
@@ +20,5 @@
>    _(RuntimeExclusiveAccess,      200) \
>                                        \
>    _(GlobalHelperThreadState,     300) \
>                                        \
> +  _(ShellAsyncTasks,             350) \

This works, but can you make this order 500 instead?

It only needs to be greater than GlobalHelperThreadState to work, but reducing the number of distinct values makes it clearer that fewer combinations of mutexes can be held at the same time, because mutexes with the same order cannot be held simultaneously.

I should really have added a comment to this effect.
Attachment #8810469 - Flags: review?(jcoppeard) → review+
Fwiw, this is not pushed yet because a try build shows that an order of 500 can't work:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=dd57aed530a9b5308a01d269d648fd086e9957f9

But even an order of 350 doesn't work, as shown by running the wasm/jsapi.js test with JS_GC_ZEAL=14,10.
Attached patch mutexes.patch (obsolete) — Splinter Review
This works locally and under the initial test case with rr chaos mode. Running a try build just to be sure...
Attachment #8810469 - Attachment is obsolete: true
Attachment #8812730 - Flags: review?(jcoppeard)
Attached patch mutexes.patch — Splinter Review
A better patch, which updates the comment and just moves the finished queue.
Attachment #8812730 - Attachment is obsolete: true
Attachment #8812730 - Flags: review?(jcoppeard)
Attachment #8812747 - Flags: review?(jcoppeard)
Comment on attachment 8812747 [details] [diff] [review]
mutexes.patch

Review of attachment 8812747 [details] [diff] [review]:
-----------------------------------------------------------------

Great, thanks for the updates.
Attachment #8812747 - Flags: review?(jcoppeard) → review+
Pushed by bbouvier@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/aa05debf2237
Fix ordering of taking mutexes for the shell tasks; r=jonco
https://hg.mozilla.org/mozilla-central/rev/aa05debf2237
Status: ASSIGNED → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla53