Closed Bug 870043 Opened 7 years ago Closed 7 years ago

"JS Watchdog" thread runs every second!!

Categories

(Core :: XPConnect, defect)

x86
macOS
defect
Not set

Tracking

()

RESOLVED FIXED
mozilla25

People

(Reporter: dougt, Assigned: bholley)

References

Details

Attachments

(6 files, 1 obsolete file)

In B2G, there is data that shows that the JS Watchdog is firing every 1 second.  Is it possible to increase its interval more than 2.5s and align its wakeup with other threads wakeup?


In bug 568730, we hibernates watch dog thread when there is no JS activity for 2 seconds.  Some apps in B2G are doing some activity once in every 2 to 3 seconds (think music apps).  Increasing the watchdog and aligning it with other threads will reduce battery usage.
Luke, will this increase the length of time that operation callback tests take? Or is that orthogonal to the watchdog thread?
(In reply to Bobby Holley (:bholley) from comment #1)
The tests run for about a second and a half, regardless of the operation callback (to test the interrupted-but-not-killed path).  So increasing that limit would decrease the coverage of the tests.
Adding another exclamation point to make sure this gets looked at. :-)
Summary: "JS Watchdog" thread runs every second! → "JS Watchdog" thread runs every second!!
But yeah, given that dom.max_script_run_time varies between 10 and 20 seconds depending on platform, 1s resolution is probably overkill.

I wonder if we should make this variable per-platform?
Naveed, please find an assignee for this. A watchdog like this is a total atrocity for battery efficiency on anything mobile, including laptops. The watchdog should not run when no JS is actively executing. Please try to get a fix in for FFOS 1.1 and anything mobile via trunk.
Assignee: nobody → nihsanullah
gal: the watchdog timer lives in DOM.
(In reply to Andreas Gal :gal from comment #5)
> Naveed, please find an assignee for this. A watchdog like this is a total
> atrocity for battery efficiency on anything mobile, including laptops. The
> watchdog should not run when no JS is actively executing. Please try to get
> a fix in for FFOS 1.1 and anything mobile via trunk.

So, if I understand correctly (from comment 0 and from reading the code), it sounds like we're already correctly disabling watchdog interrupts when JS is not running. And when it is, leaving the last active request (effectively: when the JS engine goes idle) wakes up the watchdog thread and sets it to hibernate. So while we have to wake up the thread, the CPU is already spinning, so I don't think that really hurts us from a battery perspective.

What confuses me about this bug though is that comment 0 indicates that there's very short-lived JS activity that happens every 2-3 seconds. If each activity burst is less than 1 second, the watch dog should never be firing on its own. What should be happening is:

0 - We're idle. The watchdog is not active.
1 - We wake up, and start to run some JS. The JS engine notifies XPConnect, which arms the watchdog.
2 - JS completes. The JS engine notifies XPConnect, which wakes up the watchdog thread and tells it to hibernate.
3 - We're idle. The watchdog is not active.

This is all to say that the watchdog should only ever wake up on its own when the JS engine is active for more than a second. In that case, it seems fine for it to do that, IMO.

Doug, does the data indicate that the watchdog is firing on its own while the system is otherwise dormant? If so, that _may_ be an indication that the system is keeping a JS Request open while it's supposed to be sleeping (or some other bug). If not, then the fact that we see the watchdog thread running is not necessarily indicative of it waking up on its own, and thus not necessarily a problem.
Flags: needinfo?(doug.turner)
How do DOM workers that basically run *all the time* impact this, i.e. the wifi worker just to name one, always runs, and is blocked on a select call (or something like that) all the time. Will those workers keep the watchdog running all the time?
(In reply to Johnny Stenback (:jst, jst@mozilla.com) from comment #8)
> How do DOM workers that basically run *all the time* impact this, i.e. the
> wifi worker just to name one, always runs, and is blocked on a select call
> (or something like that) all the time. Will those workers keep the watchdog
> running all the time?

Those run on a separate JSRuntime, right? All this stuff is per-runtime, and the XPConnect watchdog is only for the main-thread JSRuntime. So those should be orthogonal.
(In reply to Bobby Holley (:bholley) from comment #9)
> Those run on a separate JSRuntime, right? All this stuff is per-runtime, and
> the XPConnect watchdog is only for the main-thread JSRuntime. So those
> should be orthogonal.

Correct.
> 2 - JS completes. The JS engine notifies XPConnect, which wakes up the watchdog thread and tells 
> it to hibernate.

Is it possible that it's this wakeup that we're seeing here?  Is there any way to tell?

Surely we could cancel the watchdog's wakeup from the main thread.
Ah, interesting! After re-reading the code, it looks like we _don't_ immediately notify the watchdog after the operation callback when the JSRuntime deactivates, but rather just save a timestamp and let the watchdog wake up on its own. Presumably, this is to avoid a bunch of context-switch churn with the watchdog each time we find ourselves at the top of the event loop (with the request count at zero).

So what's our desired behavior, here? Is there some kind of better signal that we can get that the system is going idle? Or alternatively, should we just bump the timer to ~4 seconds on mobile platforms so that it runs with a lower frequency than our expected wakeup rate?
4s wakeups are not acceptable either. Thats 21k wakeups in a 24h period. That completely kills battery life.
(In reply to Andreas Gal :gal from comment #13)
> 4s wakeups are not acceptable either. Thats 21k wakeups in a 24h period.
> That completely kills battery life.

But these wakeups are only happening because the system is already waking up and running JS every 3 seconds. If it the system were to go idle for 24 hours, we'd have exactly 1 wakeup, X seconds into the idle period, after which point the watchdog would hibernate.
Whats the cause for the 3s wakeup?
(In reply to Andreas Gal :gal from comment #15)
> Whats the cause for the 3s wakeup?

see comment 0.
I don't see from comment 0 what exactly the cause is. What causes us to wake up every 3s on an idle phone that is not playing music?
(In reply to Andreas Gal :gal from comment #17)
> I don't see from comment 0 what exactly the cause is. What causes us to wake
> up every 3s on an idle phone that is not playing music?

It's not clear to me that anyone has shown this to be the case.
Whats the logic behind the 4s then?
(In reply to Andreas Gal :gal from comment #19)
> Whats the logic behind the 4s then?

Because it would address the issue of the 3s music interrupts (with some tweaking of the existing code).

The issue is that the only idle signal we're currently working with is the transition to and from zero requests on the JSRuntime. We could in theory cancel the watchdog whenever we hit zero requests and restart it whenever we incremented to 1 request, but that would cause us to do thread churn for each trip through the event loop (where we should always hit 0 requests), which is undesirable.

The key problem here is that, when arming the watchdog thread, we don't know if we're about to enter an infinite loop, or just run 2 lines of script and go indefinitely idle. And since canceling the watchdog isn't free, we only want to do it if we're reasonably confident that we're actually going idle (and the current mechanism doesn't give us that confidence).

Within that set of constraints, our current approach is reasonable. First, we save a timestamp each time we hit zero requests. Then, when the watchdog fires, it checks the timestamp to see if the runtime has been idle for more than a minute. If so, it hibernates itself, costing only one unnecessary watchdog wakeup.

This approach generally works, but causes these annoying interleave issues if the JSRuntime is waking up with a frequency slightly lower than our desired watchdog resolution (ie the once-every-three-seconds music activity). If we decrease the watchdog resolution such that it fires with lower frequency than the JS activity, we can have the JS activity shut down the watchdog while it's at it.

Specifically, given periodic JS running every X seconds and a maximum watchdog resolution of Y seconds,
I'm proposing that we introduce a parameter Q that indicates the minimum time we want the JSRuntime to be idle before hibernating the watchdog thread. Whenever JS runs, it can check the activity timestamp against Q, and if it's greater, bite the bullet and cancel the watchdog.

But all this only works if Q < X < Y. So given X=3s, this is the source of my proposal for Y=4s.
> And since canceling the watchdog isn't free, we only want to do it if we're reasonably confident 
> that we're actually going idle (and the current mechanism doesn't give us that confidence).

Sorry if this is bikeshedding, but do we actually know that canceling the watchdog would be too expensive?  It's got to be on the order of how expensive it is to arm the watchdog, right?  And surely we do not have to context switch to the watchdog thread to disarm it; for example, I recall jduell recently made our nsITimers threadsafe.
(I ask because tuning things so they work well in this particular case, when JS wakes up every 3s, is just asking for trouble when we find a case when JS wakes up every 4s, or whatever.)
Just to add some background info,

We are trying to implement Low power audio (LPA) for FFOS - Bug 846214.
LPA processes huge (3.5s) audio in less time (~0.9s) so processor can sleep for remaining time (~2.5s)
Note that only processor sleeps, but audio continues to play with the help of dsp  during that remaining time.
Again processor wakes up, processes next huge chunk of audio in less time and go to sleep for the remaining time.

LPA saves power by allowing processor to sleep like this while playing audio
But since JS Watchdog causes wakeup at the rate of 1s, LPA can't save power.

I agree that tuning for this specific case may affect other cases.

But if there is a way to change watch dog interval at run time we can use it to increase delay for Music process (while playing in lpa mode) and it can be 1s for other cases.

Even after increasing delay, it may wakeup in-between that 2.5s, to avoid this there should be some way to trigger watch dog while lpa actually process audio. 
So watch dog can also sleep during the lpa sleep interval.
Flags: needinfo?(doug.turner)
(In reply to Justin Lebar [:jlebar] from comment #21)
> Sorry if this is bikeshedding, but do we actually know that canceling the
> watchdog would be too expensive?  It's got to be on the order of how
> expensive it is to arm the watchdog, right?

Currently all the hibernation management happens on the watchdog thread itself. It receives information from the main thread via shared data structures, and arms/disarms itself by waiting on a condvar. Do we have any condvar-like threading primitives that allow timeouts to be adjusted dynamically after another thread is already waiting on the condvar?

> And surely we do not have to
> context switch to the watchdog thread to disarm it; for example, I recall
> jduell recently made our nsITimers threadsafe.

I'm not sure what you're getting at with the nsITimers. Can you explain?

(In reply to Justin Lebar [:jlebar] from comment #22)
> (I ask because tuning things so they work well in this particular case, when
> JS wakes up every 3s, is just asking for trouble when we find a case when JS
> wakes up every 4s, or whatever.)

Yeah, sure. But note that this wouldn't be tuned only for 3s. It would be tuned for anything in a range of (minimum time for us to believe that we're idle, watchdog resolution).

My question still stands as to whether Gecko has any sort of reliable mechanism to signal that it's about to go idle. If we can hook up to that, this whole question is moot. Who knows about these things?
> My question still stands as to whether Gecko has any sort of reliable mechanism to signal that 
> it's about to go idle.

We have MessageLoop::current()->PostIdleTask().  This doesn't work at on Mac (bug 862414), and I doubt it works on Windows either.  Additionally, I have no idea how reliable it is on Linux to fire iff we're about to be idle.  But it might work.

> I'm not sure what you're getting at with the nsITimers. Can you explain?

I meant, if we can cancel an nsITimer across threads, then that's an existence proof that it's possible for one thread to go to sleep and another thread to adjust when it wakes up in a safe way.

But now I've looked at the timer implementation, and I see that it involves a third thread, which wakes up when you cancel the timer.  So that speaks to your point.

> Currently all the hibernation management happens on the watchdog thread itself. It receives 
> information from the main thread via shared data structures, and arms/disarms itself by waiting 
> on a condvar. Do we have any condvar-like threading primitives that allow timeouts to be 
> adjusted dynamically after another thread is already waiting on the condvar?

Maybe we could use /dev/alarm on B2G (see GonkHal.cpp).  But touching that involves syscalls, so it's probably not going to be cheap.
Depends on: 871742
I've filed bug 871742 to get rid of the watchdog thread in processes where we don't need it.
I'll drive the meta issue here.
Assignee: nihsanullah → bobbyholley+bmo
So, focusing on B2G for now, there are really two dimensions to this problem: reducing the set of processes in which we use the watchdog thread, and making the watchdog thread more efficient where we do use it. Let's start by addressing the first.

IIUC, there are three interesting cases for B2G processes here:
(1) The main process. This never runs unprivileged code, and currently uses a pref to ignore OperationCallbacks if they ever get fired. The watchdog thread is doing nothing at all in this process, and should be disabled.
(2) Browser tab processes. We have one of these per browser tab. On-device smoketesting indicates that the slow script dialog works in the B2G browser, and IMO we should make sure it continues to work.
(3) Non-Browser App processes. From my testing with [1], hosted web apps (and presumably other apps) with long-running script will have the script terminated without a prompt. Presumably, this is because they don't have anything hooked up to mozbrowsershowmodalprompt, which presumably causes GetPromptFromContext to return null, which causes DOMOperationCallback to return null (Justin, does this sound plausible?). This is reasonable behavior, but it has its problems: we have the overhead of the watchdog thread, and wallclock time can be deceptive (see bug 870541). So it sounds like it would be preferable to simply let the apps hang, and to freeze apps running in the background. This is bug 872177, and I'm not sure of its level of engineering effort.

(1) and (2) are pretty clear-cut. The biggest question is what to do about (3). The questions as I seem them are:
* Do we want to do bug 872177, and how much work is it?
* If yes to the above, do we want to do anything about the watchdog issue in the mean time? If so, what?
** Disable the watchdog for apps that are "privileged enough" - perhaps packaged apps?
** Disable the watchdog for all apps, since the user can just use the home button to banish the app to a (battery-eating) life in the background. This fixes the wallclock issue we have currently.
** Something else?

Once we figure out our eventual watchdog strategy, we can decide whether we want to expend energy improving the watchdog itself.

[1] http://people.mozilla.com/~bholley/testcases/iloopr/iloopr.webapp
Flags: needinfo?(jonas)
bobby, jonas -- any update here?
(In reply to Doug Turner (:dougt) from comment #29)
> bobby, jonas -- any update here?

I am waiting on input per the flag.
Honestly, I'm not really sure that we *need* to have the watchdog thread at all in B2G. For the browser case, if a tab hangs, the user will notice and can simply close it.

*Ideally* the behavior that would be nice to have is something like this:

By default, when script runs too long inside of a mozbrowser, don't do anything. I.e. let the script keep running.

The page embedding the mozbrowser should be able to opt in to being notified about long-running script. For remote mozbrowsers, when we notify, the script keeps running. If we end up returning to the event loop we send another notification saying that we're no longer stuck in a slow script. At any point the embedding app can simply terminate the tab process.

For non-remote mozbrowsers, assuming the embedding page has opted in, we pause the script and then notify the embedding page. The embedding page could then tell the browser API to throw a non-catchable exception in the paused script.


But this is obviously a non-trivial amount of work.


> (1) and (2) are pretty clear-cut. The biggest question is what to do about
> (3). The questions as I seem them are:
> * Do we want to do bug 872177, and how much work is it?

Yes. But I expect that it's a non-trivial amount of work to do well.

> * If yes to the above, do we want to do anything about the watchdog issue in
> the mean time? If so, what?

I don't really understand the relationship with bug 872177. It'll always be the case that background apps can actively run.

I do think that we should shut off the watchdog thread in app iframes (i.e. mozapp mozbrowser iframes).

I think we should even shut it off in browser iframes (i.e. non-mozapp mozbrowser iframes) unless the embedding page has explicitly opted in to being able to be notified about slow-script.

> ** Disable the watchdog for apps that are "privileged enough" - perhaps
> packaged apps?

We should disable it for all apps.

> ** Disable the watchdog for all apps, since the user can just use the home
> button to banish the app to a (battery-eating) life in the background. This
> fixes the wallclock issue we have currently.

Yes.

> ** Something else?

See above for my ideal solution.
Flags: needinfo?(jonas)
Ok. Then it sounds like the best solution for the mean time is just to disable the watchdog on b2g with a pref. I'll write up some patches and rejigger stuff to do that.
This is a long-overdue change in general. It's now particularly necessary
because we want to start and stop the watchdog, potentially multiple times
during a session. Encapsulating this into a class that we can
destroy/instantiate several times means that we don't end up with inconsistent
state.

As a nice bonus, this centralizes the whole setup and makes it easier to
comprehend. I actually had this nicely split up into several patches, but
I accidentally squashed them while rebasing. :-(
Attachment #777447 - Flags: review?(mrbkap)
We need this even for testing wakeups, because we can't be certain that any
given operation callback was necessarily triggered from the watchdog thread
(since it's triggered from within the JS engine in various cases as well).
Attachment #777448 - Flags: review?(mrbkap)
The JS engine fires this callback when the request count drops to zero, and we
use it determine when we should hibernate the watchdog thread. But since the
request count never drops to zero for nested event loops, the watchdog never
runs in those cases. And since our xpcshell harness runs tests in a nested event
loop, this means we can't test watchdog hibernation from xpcshell. And we don't
want to test it in mochitests, because the non-determinism of timer CCs and GCs
are likely to be problematic.

Really, we should consider finding a way to integrate nested event loops into
the activity callback mechanism, and should probably get a bug on file. But such
a task is out of scope for this bug, so we just add a way to fake it.
Attachment #777450 - Flags: review?(mrbkap)
Attachment #777451 - Flags: review?(mrbkap)
Attachment #777452 - Flags: review?(mrbkap)
Note - a comment in patch 3 mistakenly refers to the timestamp units as milliseconds. It's actually microseconds, and I've updated the patch to fix that locally.
Comment on attachment 777447 [details] [diff] [review]
Part 1 - Factor the watchdog machinery into a helper class. v1

Review of attachment 777447 [details] [diff] [review]:
-----------------------------------------------------------------

r=me with the comments addressed. I didn't dive too deeply into reviewing the threading here, but as far as I can tell this patch hasn't changed the model at all, only made it more explicit.

::: js/xpconnect/src/XPCJSRuntime.cpp
@@ +1043,4 @@
>      }
>  
> +    //
> +    // Invoked by the main thread only.

Can you stick a MOZ_ASSERT(NS_IsMainThread()); in the prolog of all of the functions that are supposed to be only called on the main thread?

@@ +1094,5 @@
> +        mInitialized = false;
> +    }
> +
> +    //
> +    // Invoked by the watchdog thread only.

Likewise, these functions should MOZ_ASSERT(!NS_IsMainThread());

@@ +1116,5 @@
> +  private:
> +    WatchdogManager *mManager;
> +
> +    PRLock *mLock;
> +    PRCondVar *mWakeup;

These two variables should probably become a mozilla::Monitor.

@@ +1117,5 @@
> +    WatchdogManager *mManager;
> +
> +    PRLock *mLock;
> +    PRCondVar *mWakeup;
> +    PRThread *mThread;

Is there a reason not to use an nsThread here?
Attachment #777447 - Flags: review?(mrbkap) → review+
Comment on attachment 777448 [details] [diff] [review]
Part 2 - Add a script-accessible statistics for various watchdog events. v1

Review of attachment 777448 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/xpconnect/src/XPCJSRuntime.cpp
@@ +1188,5 @@
> +    void RecordTimestamp(WatchdogTimestampCategory aCategory)
> +    {
> +        // The watchdog thread always holds the lock when it runs.
> +        Maybe<AutoLockWatchdog> maybeLock;
> +        if (mWatchdog && NS_IsMainThread())

How can NS_IsMainThread return true? I only see calls on the watchdog thread here. It's also a little weird to look at mWatchdog, which is modified on the main thread.

@@ +1196,5 @@
> +    PRTime GetTimestamp(WatchdogTimestampCategory aCategory)
> +    {
> +        // The watchdog thread always holds the lock when it runs.
> +        Maybe<AutoLockWatchdog> maybeLock;
> +        if (mWatchdog && NS_IsMainThread())

Ditto here, but the inverse.
Attachment #777448 - Flags: review?(mrbkap)
Comment on attachment 777449 [details] [diff] [review]
Part 3 - Add support for scripted operation callbacks in XPCShell. v1

Review of attachment 777449 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/xpconnect/shell/xpcshell.cpp
@@ +839,5 @@
> +    if (!JS_CallFunctionValue(cx, nullptr, sScriptedOperationCallback.ref(),
> +                              0, nullptr, rv.address()) || !rv.isBoolean())
> +    {
> +        NS_WARNING("Scripted operation callback failed! Terminating script.");
> +        return false;

Couldn't the function throw an exception that would cause the script to possibly end up in a catch block? I think we need to JS_ClearPendingException here.

@@ +850,5 @@
> +SetOperationCallback(JSContext *cx, unsigned argc, jsval *vp)
> +{
> +    // Sanity-check args.
> +   JS::CallArgs args = JS::CallArgsFromVp(argc, vp);
> +    if (args.length() != 1) {

Nit: Odd indentation.

@@ +863,5 @@
> +    }
> +
> +    // Otherwise, we should have a callable object.
> +    if (!args[0].isObject() || !JS_ObjectIsCallable(cx, &args[0].toObject()))
> +    {

Nit: odd bracing style?
Attachment #777449 - Flags: review?(mrbkap) → review+
Comment on attachment 777450 [details] [diff] [review]
Part 4 - Add a way to simulate the JSRuntime activity callback from xpcshell. v1

Review of attachment 777450 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/xpconnect/shell/xpcshell.cpp
@@ +878,5 @@
> +static JSBool
> +SimulateActivityCallback(JSContext *cx, unsigned argc, jsval *vp)
> +{
> +    // Sanity-check args.
> +   JS::CallArgs args = JS::CallArgsFromVp(argc, vp);

Nit: funny indentation.
Attachment #777450 - Flags: review?(mrbkap) → review+
Attachment #777451 - Flags: review?(mrbkap) → review+
Attachment #777452 - Flags: review?(mrbkap) → review+
Blocks: 896079
I filed bug 896079 to modernize the threading primitives.

As noticed on IRC, we do actually need to read/write statistics on both threads because of the runtime activity timestamp. I'll fix up the ordering of the mWatchdog check.
We need this even for testing wakeups, because we can't be certain that any
given operation callback was necessarily triggered from the watchdog thread
(since it's triggered from within the JS engine in various cases as well).
Attachment #777448 - Attachment is obsolete: true
Attachment #778697 - Flags: review?(mrbkap)
Attachment #778697 - Flags: review?(mrbkap) → review+
Depends on: 897789
What does the dom.use_watchdog pref do?  Should I randomize it during fuzzing?
(In reply to Jesse Ruderman from comment #50)
> What does the dom.use_watchdog pref do?

It (dynamically) controls the watchdog thread, which is a thread that pokes the JS engine every second to deliver OperationCallbacks, which are what trigger the slow script dialog.

> Should I randomize it during fuzzing?

I'd be mildly interested to test the dynamic behavior of turning it on and off (which I implemented in this bug), especially given the potential for race conditions. But I don't think it would be all that useful in the long term. Maybe fuzz it for a week or two and then stop?
See Also: → 870541
You need to log in before you can comment on or make changes to this bug.