Last Comment Bug 785560 - IonMonkey: ReportOverRecursed firing incorrectly
: IonMonkey: ReportOverRecursed firing incorrectly
Product: Core
Classification: Components
Component: JavaScript Engine (show other bugs)
: unspecified
: All All
-- normal (vote)
: ---
Assigned To: David Anderson [:dvander]
: Jason Orendorff [:jorendorff]
Depends on:
Blocks: IonGreen
  Show dependency treegraph
Reported: 2012-08-24 17:14 PDT by David Anderson [:dvander]
Modified: 2012-08-28 00:08 PDT (History)
1 user (show)
See Also:
Crash Signature:
QA Whiteboard:
Iteration: ---
Points: ---
Has Regression Range: ---
Has STR: ---

fix (3.45 KB, patch)
2012-08-27 14:08 PDT, David Anderson [:dvander]
no flags Details | Diff | Splinter Review
v2 (6.64 KB, patch)
2012-08-27 15:58 PDT, David Anderson [:dvander]
sstangl: review+
Details | Diff | Splinter Review

Description User image David Anderson [:dvander] 2012-08-24 17:14:54 PDT
This is causing intermittent failures across Talos and even, more rarely, reftests on TBPL. It is most easily reproducible on Windows when running v8 or dromaeo_css, and I have only been able to reproduce using release builds, on any platform.

The failure mode is that the benchmark suite simply stops running. Talos then gets no response from the browser, and the Python harness times out. The browser meanwhile is completely usable. When running v8, it manages to print a score of "NaN".

On a hunch I replaced all "return false" instances in IonMonkey with:
    return (printf("%s %d\n", __FILE__, __LINE__), false);

And changed the Talos harness to immediately write spew to a local file. When the problem occurred again, the last line in the file was:

    VMFunctions.cpp 104

Which happens to be ion::ReportOverRecursed. In IonMonkey, this function is invoked if the stack limit check fails, which can happen either if the watchdog timer sets the interrupt flag, or we actually recurse too deeply. In this case, we've fired the ReportOverRecursed check, but there was no interrupt set, so we reported an actual over-recursion error and returned false.

Bizarrely, no error is being reported to the error console, but I'm not sure what to make of that yet. Either way we should not be over-recursing.

Now that we have something to break on, I can start debugging this.
Comment 1 User image David Anderson [:dvander] 2012-08-24 17:40:27 PDT
Another note is that we didn't have Talos results until June 7th - thankfully, because no browser can page back further than that on TBPL without crashing. The timeout existed since the very first run. On May 25th, we landed interrupt support, so it could be this bug is related to the watchdog timer.
Comment 2 User image David Anderson [:dvander] 2012-08-27 02:29:33 PDT
Confirmed that this bug is a race condition with the watchdog timer and Ion's stack limit mechanism. A test patch seems to have come back green on TBPL; will post the full thing tomorrow.
Comment 3 User image David Anderson [:dvander] 2012-08-27 14:08:00 PDT
Created attachment 655761 [details] [diff] [review]

Comment 4 User image Sean Stangl [:sstangl] 2012-08-27 15:49:03 PDT
This is a summary of and expansion on a discussion in #ionmonkey, for the sake of posterity.

We have to worry about the concurrent interactions between three functions. The pseudocode below is from the form of the functions in the patch in Comment 3. In reading the code below, note that we're overloading the use of the "overrecursed" check to also catch interrupts, to avoid an extra load and compare of the interrupt flag.

> interrupt <- 1
> ionStackLimit <- NULL

> interrupt <- 0
> ionStackLimit <- nativeStackLimit

ion (partially codegen, partially ReportOverRecursed():
> if (ionStack <= ionStackLimit)
>     return
> if (actuallyOverRecursed())
>     return js_ReportOverRecursed()
> if (interrupt)
>     return InterruptCheck()
> if (ionStackLimit == NULL)             // introduced in patch
>     ionStackLimit <- nativeStackLimit  // introduced in patch
>     return                             // introduced in patch
> return js_ReportOverRecursed()

triggerOperationCallback() may execute concurrently with js_InvokeOperationCallback() and ion/ReportOverRecursed(), but js_InvokeOperationCallback() and ion/ReportOverRecursed() must execute sequentially with respect to each other and themselves.

The issue dvander found is the following sequence:
> trigger / interrupt <- 1
> invoke / interrupt <- 0
> invoke / ionStackLimit <- nativeStackLimit
> trigger / ionStackLimit <- NULL

In the old version of ion/ReportOverRecursed() the "ionStackLimit == NULL" check was missing, so ionStackLimit would wind up as NULL, and we would erroneously report recursion. The patch fixes the bug, but introduces another race condition, with the following (way more complicated) sequence:

> trigger / interrupt <- 1
> invoke / interrupt <- 0
> invoke / ionStackLimit <- nativeStackLimit
> trigger / ionStackLimit <- NULL
> ion / ionStack <= ionStackLimit (false, because ionStackLimit == NULL)
> ion / actuallyOverRecursed() (false, because we entered due to NULL ionStackLimit)
> ion / interrupt (false, interrupt == 0)
> ion / ionStackLimit == NULL (true)
> trigger / interrupt <- 1
> trigger / ionStackLimit <- NULL
> ion / ionStackLimit <- nativeStackLimit

And so we can end up with a pending interrupt, but ionStackLimit set such that Ion will never detect it. This implies that it is unsafe for ion/ReportOverRecursed() to attempt to detect this race condition -- we'll have to solve it in another way.

The first observation to make is that ReportOverRecursed() can be entered in two ways: with ionStackLimit == NULL, resulting in InterruptCheck() or nothing; or ionStackLimit == nativeStackLimit, resulting in the actuallyOverRecursed() check tripping. Since the latter case trips the actuallyOverRecursed() check, it does not deal with interrupt or ionStackLimit at all, and therefore cannot race. So the only races we have to deal with are when ionStackLimit == NULL. If we enter with ionStackLimit == NULL, then js_ReportOverRecursed() should never be called. So we can change ion/ReportOverRecursed to the following:

ion (partially codegen, partially ReportOverRecursed():
> if (ionStack <= ionStackLimit)
>     return
> if (actuallyOverRecursed())
>     return js_ReportOverRecursed()
> if (interrupt)
>     return InterruptCheck()
> return

But then we still need to solve the race condition where ionStackLimit is NULL after the interaction between trigger() and invoke(). We can do this by reordering trigger() thusly:

> ionStackLimit <- NULL
> interrupt <- 1

Then it is impossible for ionStackLimit to be NULL after invoke() returns. However, it does mean that the following race is possible:

> trigger / ionStackLimit <- NULL
> ion / ionStack <= ionStackLimit (true)
> ion / actuallyOverRecursed (false)
> ion / interrupt (false)

But this sequence: 1) performs no undesirable behavior, 2) rarely happens, 3) is likely to quickly resolve itself via "trigger / interrupt <- 1" if it does happen.

No other races are possible with the given preconditions. So the above two changes are safe. I think this is the best we can do without a mutex or explicitly checking the interrupt flag.
Comment 5 User image David Anderson [:dvander] 2012-08-27 15:58:45 PDT
Created attachment 655802 [details] [diff] [review]

Thanks for the detailed analysis! This does seem much better.
Comment 6 User image Sean Stangl [:sstangl] 2012-08-27 16:01:38 PDT
Comment on attachment 655802 [details] [diff] [review]

Review of attachment 655802 [details] [diff] [review]:

awesome :)

::: js/src/jscntxt.cpp
@@ +132,5 @@
>  {
> +    /* 
> +     * Invalidate ionTop to trigger its over-recursion check. Note this must be
> +     * set before interrupt, to avoid racing into a weird state where interrupt
> +     * is stuck at 0 but ionStackLimit is NULL.

Might be useful to explicitly mention js_InvokeOperationCallback() as the function it races with.
Comment 7 User image David Anderson [:dvander] 2012-08-27 16:07:18 PDT
Comment 8 User image Sean Stangl [:sstangl] 2012-08-28 00:08:51 PDT
Just an additional note, although we've resolved that it doesn't actually matter: we can still end up with a pending interrupt and ionStackLimit == nativeStackLimit, if trigger() is called twice before invoke() executes. This is still better than ending up with NULL, and nothing incorrect happens if we ignore an interrupt for a while.

The following is an example sequence of that race:

trigger1 / ionStackLimit <- NULL
trigger1 / interrupt <- 1
trigger2 / ionStackLimit <- NULL
invoke / interrupt <- 0
invoke / ionStackLimit <- nativeStackLimit
trigger / interrupt <- 1

Note You need to log in before you can comment on or make changes to this bug.