Closed Bug 853862 Opened 10 years ago Closed 10 years ago

BaselineCompiler: Frequent M4 test_suspend.html failures


(Core :: JavaScript Engine, defect)

(Reporter: jandem, Assigned: djvj)




(1 file)

Mochitest-4 tests/dom/workers/test/test_suspend.html fails frequently on Linux and Windows 32-bit debug builds on the BC branch. It doesn't seem to happen on OS X or with opt builds.

I just retriggered M4 a few times and 4 out of 6 runs failed like this.

Considering it involves web workers and doesn't fail consistently, it may be threading/timing related somehow.

From the log:

14:23:20     INFO -  1135 INFO TEST-PASS | /tests/dom/workers/test/test_suspend.html | Wrong url!
14:23:20     INFO -  1136 INFO TEST-PASS | /tests/dom/workers/test/test_suspend.html | Not suspended?
14:23:20     INFO -  1137 INFO TEST-PASS | /tests/dom/workers/test/test_suspend.html | Already resumed?!
14:23:20     INFO -  1138 INFO TEST-PASS | /tests/dom/workers/test/test_suspend.html | Received a message while suspended!
14:23:21     INFO -  1139 INFO TEST-PASS | /tests/dom/workers/test/test_suspend.html | Wrong url!
14:23:21     INFO -  1140 INFO TEST-PASS | /tests/dom/workers/test/test_suspend.html | Not suspended?
14:23:21     INFO -  1141 INFO TEST-PASS | /tests/dom/workers/test/test_suspend.html | Already resumed?!
14:23:21     INFO -  1142 INFO TEST-PASS | /tests/dom/workers/test/test_suspend.html | Received a message while suspended!
14:23:22     INFO -  1143 INFO TEST-PASS | /tests/dom/workers/test/test_suspend.html | Wrong url!
14:23:22     INFO -  1144 INFO TEST-PASS | /tests/dom/workers/test/test_suspend.html | Not suspended?
14:23:22     INFO -  1145 INFO TEST-PASS | /tests/dom/workers/test/test_suspend.html | Already resumed?!
14:23:22     INFO -  1146 INFO TEST-PASS | /tests/dom/workers/test/test_suspend.html | Received a message while suspended!
14:23:22     INFO -  1147 INFO TEST-PASS | /tests/dom/workers/test/test_suspend.html | Wrong url!
14:23:22     INFO -  1148 INFO TEST-PASS | /tests/dom/workers/test/test_suspend.html | Got message before resumed!
14:23:22     INFO -  1149 ERROR TEST-UNEXPECTED-FAIL | /tests/dom/workers/test/test_suspend.html | Missed a message, suspend failed! - got 25, expected 75
Just got this on OS X as well.

So it seems to happen with debug builds on all platforms, but 32-bit Linux builds fail most frequently.
After much gnashing of teeth, I have managed to replicate this issue in a 32-bit build inside a chroot env on my workstation.  Yay!

Starting to debug now.
Assignee: general → kvijayan
Ok I have a bit more insight on what's happening.

The test is structured as follows: it loads a page in an iframe and kicks off a counter-incrementing worker (using setInterval with an interval of 100ms) inside that iframe, having the worker call a callback every time it increments.

After receiving 25 such messages, the main page sets the iframe's location to "about:blank" in a way that sends the original iframe page to the bfcache.  This suspends the original iframe page's worker messages.  The main page waits 5 seconds and then calls |back()| on the iframe, which resumes the original page and starts counting again.

Throughout this, the main page ensures that its callback is getting called with strictly successive numbers.

I used an array to log all the relevant info |messageCallback| was getting called with, and here's what that looks like:

[1364240841282, (void 0), 1], "HERE0", "HERE1",
[1364240841381, 1, 2], "HERE0", "HERE1",
[1364240841480, 2, 3], "HERE0", "HERE1",
...
[1364240843580, 23, 24], "HERE0", "HERE1",
[1364240843681, 24, 25], "HERE0", "HERE1",
[1364240848803, 25, 76], "HERE0", "HERE2", "HERE3",
[1364240848828, 25, 26],
[1364240848828, 25, 27],
...
[1364240848845, 25, 74],
[1364240848845, 25, 75]

The "..." just condense lines that are not surprising in any way.  The format of each line is:

[TimeStamp, PreviousCount, CallbackData]

"HERE0" is printed if |messageCallback| gets past the |if (finished)| check.
"HERE1" is printed if the |if (!suspended)| branch is taken.
"HERE2" and "HERE3" are printed if the |if (!suspended)| branch is not taken.

Breakdown of the log:

In the first few log messages, we see that the callback is being called at 100ms intervals, with proper successive numbers.  Then, after 25 is reached, there is a 5-second delay (as expected, since the iframe is on "about:blank" for 5 seconds), after which messages start getting delivered again.

The _FIRST_ message to get delivered contains the number 76, not the expected 26.  _AFTER_ that, messages for 26 to 75 get delivered.  Looking at the timestamps of these messages, we notice that they're all highly clustered.  A mere 25ms after "76" is delivered, the "26" is delivered.  Then, within the next 20ms, messages for all the remaining numbers ("27" .. "75") get delivered.

This suggests the following behaviour:

When the iframe is sent to "about:blank", the worker keeps working, generating messages that are not sent out.

When the iframe is sent back to the original page, the worker's messages are dispatched, all at once.

Somehow, there's a race when the worker's output-queue is hooked back up to the receiver's input-queue: before the old suspended output-queue's messages can be copied to the receiver's input-queue, the worker manages to dispatch one new message (the "76").  Then the other messages are copied into the input-queue on top of that, causing a mismatch in the order of delivery.

This seems like it's probably not a bug in Baseline so much as a race condition uncovered by Baseline: Baseline executes quickly enough to give JS an opportunity to stuff a message into the queue before the reconnection is complete.  I might be wrong, however, and maybe there's some issue with Baseline not correctly interrupting itself somewhere that it needs to interrupt itself.
Ok, after some debugging, it turns out that baseline is missing an interrupt request.  The reason it happens is as follows:

|js_TriggerOperationCallback| is called on the runtime.  This sets the runtime's main thread's |ionStackLimit| to -1.

The main thread kicks off the running of the worker's scheduled interrupt.  The interpreter realizes the function being called has a BaselineScript, and thus does an |EnterBaseline|.  |EnterBaseline| creates an IonActivation on the stack, which "initializes" |ionStackLimit| back to a regular value, masking the interrupt.

Baseline code executes as normal, having ignored the interrupt.

This is a race condition that is hard to eliminate.  If we change IonActivation to check |ionStackLimit| for -1 before setting it, that merely narrows the race window; it doesn't eliminate it.  Ideally, we want |ionStackLimit| to only be set in places where it can never race with |TriggerOperationCallback|.

Currently, the ionStackLimit is set in the following places:
1. Within TriggerOperationCallback, which sets it to -1.
2. By JSRuntime::resetIonStackLimit(), which is only used by js_InvokeOperationCallback, which will never race with js_TriggerOperationCallback.
3. By vm/ForkJoin.cpp, on a private thread which is NOT runtime->mainThread.
4. By the constructor of IonActivation

The only problematic site is the IonActivation constructor, which sets ionStackLimit equal to nativeStackLimit.

There's no easy way to avoid a race here, since an IonActivation can be constructed at almost any time (whenever the interpreter chooses to enter Ion or Baseline).  The best option is to not set |ionStackLimit| here AT ALL.

Instead, we can update ionStackLimit automatically whenever nativeStackLimit is set.  nativeStackLimit is dynamically set only from within JS_SetNativeStackQuota and JSRuntime::clearOwnerThread.

JS_SetNativeStackQuota is almost always called immediately after the runtime is constructed, so it is not at risk of racing with a TriggerOperationCallback.  The only exception is that it's also called by setOwnerThread.  setOwnerThread is only called by JS_SetRuntimeThread, which is only called in contexts where the thread is being exited or entered for cycle-collection purposes.  That doesn't seem like a likely candidate for racing with TriggerOperationCallback on the same thread.

clearOwnerThread just sets nativeStackLimit to a sentinel value anyway, in situations where thread ownership is changing, so it doesn't need to update ionStackLimit.

So the upshot is:
updating ionStackLimit in JS_SetNativeStackQuota "should" be fine.  I think.
This patch should apply onto mozilla-inbound, and I believe it fixes the issues described above.

Major changes:
1. An ionStackLimitLock_ field is added to PerThreadData, holding a lock object.
2. The line that updates ionStackLimit inside IonActivation's constructor is moved to JS_SetNativeStackQuota, guarded by the lock, and made to not overwrite a pending interrupt (i.e. a -1 value).
3. JSRuntime::resetIonStackLimit() is changed to use the lock-guarded set.

The only place where the lock is not used is within the ParallelJS ForkJoin code, because that code does not create the locks in the first place.  Explanatory comments have been added at the two places where ForkJoin code modifies ionStackLimit.

Tested this manually by trying to replicate the issue.  Typically I could replicate it within about 7-10 runs.  With the patch, I ran test_suspend.html 50 times, all successes.

I also manually set breakpoints and ensured that the TriggerOperationCallback was getting handled correctly by baseline.
Attachment #729832 - Flags: review?(sstangl)
Attachment #729832 - Flags: review?(jdemooij)
Comment on attachment 729832 [details] [diff] [review]
Wrap ionStackLimit modifications with a lock

Review of attachment 729832 [details] [diff] [review]:

Great detective work, glad to see this fixed.

::: js/src/jsapi.cpp
@@ +3037,5 @@
>  #endif
> +
> +    // If there's no pending interrupt request set on the runtime's main thread's
> +    // ionStackLimit, then update it so that it reflects the new nativeStackLimit.
> +#if JS_ION

Nit: #ifdef JS_ION

::: js/src/jscntxt.h
@@ +478,5 @@
>      uintptr_t            ionStackLimit;
> +# ifdef JS_THREADSAFE
> +    /* Synchronizes setting of ionStackLimit so signals by triggerOperationCallback don't
> +     * get lost. */

Uber-style-nit: format comment like:

 * AAA
 * BBB.

@@ +485,5 @@
> +    class IonStackLimitLock {
> +        PerThreadData &data_;
> +      public:
> +        IonStackLimitLock(PerThreadData &data) : data_(data) {
> +            PR_Lock(data_.ionStackLimitLock_);

I don't know how PR_Lock is implemented, maybe JS_ASSERT(data_.ionStackLimitLock_) here so that we will definitely fail if the lock is not initialized.
Attachment #729832 - Flags: review?(jdemooij) → review+
Attachment #729832 - Flags: review?(sstangl)
Attachment #729832 - Flags: review?(hv1989)
Attachment #729832 - Flags: review?
Attachment #729832 - Flags: review+
Comment on attachment 729832 [details] [diff] [review]
Wrap ionStackLimit modifications with a lock

Accidentally cleared jandem's r+
Attachment #729832 - Flags: review? → review+
Comment on attachment 729832 [details] [diff] [review]
Wrap ionStackLimit modifications with a lock

Review of attachment 729832 [details] [diff] [review]:

::: js/src/jsapi.cpp
@@ +689,5 @@
>      ionJSContext(NULL),
>      ionStackLimit(0),
> +# ifdef JS_THREADSAFE
> +    ionStackLimitLock_(NULL),
> +# endif

Nit: remove whitespace before ifdef and endif

@@ +694,5 @@
>      ionActivation(NULL),
>      asmJSActivationStack_(NULL),
>  # ifdef JS_THREADSAFE
>      asmJSActivationStackLock_(NULL),
>  # endif

Nit: This wasn't changed by this patch, but could you remove the whitespace here too?

::: js/src/jscntxt.h
@@ +1267,5 @@
>      bool                jitHardening;
>      void resetIonStackLimit() {
> +        mainThread.setIonStackLimit(mainThread.nativeStackLimit);

This makes me nervous, as it would be easy to abuse for other things. Maybe add a comment what this function does? (I.e. remove signals set by triggerOperationCallback)
Attachment #729832 - Flags: review?(hv1989) → review+
Great find btw! Thread racing is always hard to debug.
TBPL looks good.
Closed: 10 years ago
Resolution: --- → FIXED