Closed Bug 540004 Opened 14 years ago Closed 14 years ago

Implement an IPDL hang detector

Categories

(Core :: IPC, defect)

defect
Not set
normal

Tracking

()

RESOLVED FIXED
Tracking Status
status1.9.2 --- .4-fixed

People

(Reporter: benjamin, Assigned: cjones)

References

Details

(Whiteboard: [fixed-lorentz])

Attachments

(4 files, 1 obsolete file)

If a child process hangs, we should be able to notice that and prompt the user to kill it after some timeout period. It seems that this would be fairly easy to do with sync or RPC style calls... probably less easy if everything is async, e.g. chrome->content. But for now, RPC for plugin processes should be sufficient.

I'd like to get this hooked up to UI for Firefox Lorentz if possible. We're limited to putting up an app-modal prompt, something like:

"The plugin 'Adobe Flash Player' is not responding after 60 seconds. Stopping the plugin allows you to use Firefox again."

[Stop] [Wait]

Limi, can you propose better strings?
The low-level impl of this is actually quite trivial, basically s/condvar.Wait()/condvar.Wait(kTimeout)/.
Edgy proposal: what if we just kill the non-responsive plugin and treat it as a crash? Most users would (arguably) consider a crash and hang to be basically the same thing, anyway.
That is edgy: we would at least need a pref to disable that behavior, for developers using the flash debugger. We'd also need to be really certain of ourselves.
Too edgy for content processes, certainly, which can hang loading tinderbox pages for far longer than a minute: but perhaps doable for plugins. I'd still like strings available for a dialog, just in case we need to revise that decision during beta/RC.
(In reply to comment #0)
> "The plugin 'Adobe Flash Player' is not responding after 60 seconds. Stopping
> the plugin allows you to use Firefox again."
> 
> [Stop] [Wait]
> 

Shouldn't we use basically the same language as the slow-script dialog?

Also, while this dialog is popped up, are we going to completely block the main-thread event loop?  As long as that's the case, the backend for this isn't too terribly hard.
With the slow-script dialog, we just terminate the currently running JS. With the hang detector, I think we have no choice but to terminate the plugin process.

And yes, this should be a nested event loop which blocks the main event loop.
Depends on: 544095
Assignee: nobody → jones.chris.g
OS: Linux → All
Hardware: x86 → All
Just looking for a rubber stamp r+.
Attachment #425272 - Flags: review?(bent.mozilla)
Still need a windows implementation, although we can push everything but the unit test without it.
Blocking this on the new IPDL features that need to land on m-c along with this (to avoid serious merge headaches later).
Depends on: 540111, 540886
No longer depends on: 544095
After playing around with a Gtk dialog for bug 544095, I realized that this patch ends up calling |listener->OnTimeout()| (and thus |actor->ShouldContinueFromReplyTimeout()|) while still holding the *Channel mutex.  (The modal event loop was woken up to deliver an event to the plugin :( .)  Probably not desirable, but I'll think on it overnight to see if there's any benefit.
Now SyncChannel drops mMutex before calling out to OnTimeout.
Attachment #425277 - Attachment is obsolete: true
Attachment #425865 - Flags: review?(bent.mozilla)
Attachment #425277 - Flags: review?(bent.mozilla)
Attachment #425272 - Flags: review?(bent.mozilla) → review+
Comment on attachment 425275 [details] [diff] [review]
Detect hangs while awaiting synchronous IPC replies.

>-        mChannelState = ChannelClosing;

Doesn't seem like you ever use this again... Simple omission? Or can it be removed from the enum?

>diff --git a/ipc/glue/Makefile.in b/ipc/glue/Makefile.in

Unnecessary.

>+const PRIntervalTime SyncChannel::kNoTimeout = PR_INTERVAL_NO_TIMEOUT;

This seems unnecessary.

>@@ -201,25 +215,71 @@ SyncChannel::OnChannelError()
>     if (AwaitingSyncReply())
>         NotifyWorkerThread();
> }
> 
> //
> // Synchronization between worker and IO threads
> //
> 
>+namespace {
>+
>+bool
>+IsTimeoutExpired(PRIntervalTime aStart, PRIntervalTime aTimeout)
>+{
>+    return (aTimeout != SyncChannel::kNoTimeout) &&
>+        (aTimeout <= (PR_IntervalNow() - aStart));
>+}
>+
>+} // namespace <anon>
>+
>+bool
>+SyncChannel::ShouldContinueFromTimeout()

Maybe it's just late or maybe I'm just tired but no matter how I look at this function I can't tell how the if test will ever succeed or how the function can ever return false.
Comment on attachment 425865 [details] [diff] [review]
Add a ShouldContinue() interface to IPDL actors that allows to decide how a hang should be treated, v2

>+    void SetReplyTimeout(PRIntervalTime aTimeout) {
>+        AssertWorkerThread();
>+        mTimeout = (aTimeout <= 0) ? kNoTimeout : aTimeout;

PRIntervalTime is an unsigned int.
Attachment #425865 - Flags: review?(bent.mozilla) → review+
Comment on attachment 425275 [details] [diff] [review]
Detect hangs while awaiting synchronous IPC replies.

Ok, this makes sense now with the followup patch.
Attachment #425275 - Flags: review?(bent.mozilla) → review+
Pushed http://hg.mozilla.org/projects/electrolysis/rev/d0e79b58e9b8
Pushed http://hg.mozilla.org/projects/electrolysis/rev/b85532205c1b
Pushed http://hg.mozilla.org/projects/electrolysis/rev/6d5e409f594a

This is going to temporarily break the windows build.  On it.

The IPDL/C++ test has been pushed back to bug 545053, because it can't land until we have a windows hang detector.
Whiteboard: [land m-c]
Just to note again that we need to land bug 540111 and bug 540886 before this.
This is contained entirely in ipc/*, fixed in Lorentz by virtue of importing that code directly.
Whiteboard: [land m-c] → [fixed-lorentz]
Blanket approval for Lorentz merge to mozilla-1.9.2
a=beltzner for 1.9.2.4 - please make sure to mark status1.9.2:.4-fixed
Depends on: 576384
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: