Closed Bug 910563 Opened 6 years ago Closed 6 years ago

gracefully handle remote process crashing

Categories

(Firefox :: General, defect)

defect
Not set

Tracking

()

RESOLVED FIXED
Firefox 26

People

(Reporter: markh, Assigned: markh)

References

Details

Attachments

(1 file, 3 obsolete files)

If the remote process crashes for some reason, we don't handle it very well.  What currently happens is:

* Current capture times out.
* Next capture starts, but calling sendAsyncMessage throws an uncaught exception.
* Fortunately, unhandled exception prevents the _destroyBrowserTimer from being updated, so we destroy the browser faster than we normally would.
* We recreate the browser and things continue as normal.

The end result is ugly messages in the console, and 2 captures discarded - the one during the crash and the next one.

It turns out there is an observer we can listen for to handle this case.  I'm attaching a patch that handles this a little more gracefully - it would even be possible to re-queue the crashing one, but that's more work.

Sadly, the only way I can find to test this is to manually kill the process via task manager - automated test might be possible when some of larch lands, which should expose the PID of the child process which we could kill via psutil.
Comment on attachment 797079 [details] [diff] [review]
0001-Bug-910563-gracefully-handle-the-thumbnail-process-c.patch

Review of attachment 797079 [details] [diff] [review]:
-----------------------------------------------------------------

Nice, thanks.

Various tests in the tree cause intentional crashes:

OOP plug-ins: http://mxr.mozilla.org/mozilla-central/source/dom/plugins/test/testplugin/nptest.cpp#79

crash reporter: http://mxr.mozilla.org/mozilla-central/source/toolkit/crashreporter/test/nsTestCrasher.cpp#51

crash reporter: http://mxr.mozilla.org/mozilla-central/source/toolkit/crashreporter/test/CrashTestUtils.jsm

hang monitor: http://mxr.mozilla.org/mozilla-central/source/xpcom/threads/HangMonitor.cpp#95

And nsIContentFrameMessageManager has privateNoteIntentionalCrash, which doesn't actually cause a crash but indicates some modicum of support in e10s for intentional crashes.

Presumably there will be Firefox desktop e10s tests that need to crash the content process from JS, if there aren't already.  Maybe we'll end up writing that module in this bug. :-)

::: toolkit/components/thumbnails/BackgroundPageThumbs.jsm
@@ +51,5 @@
>          schedule(() => options.onDone(null));
>        return;
>      }
> +    if (!this._addedObservers) {
> +      Services.obs.addObserver(this, "oop-frameloader-crashed", false);

Conceptually, this observer's lifetime is tied to the browser's.  Plus, it's wasteful to keep it registered while the bg service isn't used for a long period of time.  Not saying it's inefficient or would measurably impact performance, it just doesn't need to be there.  So how about registering it in _ensureBrowser and unregistering it in _destroyBrowser?  You could use the existence of the browser as a test for whether the observer has been added.

@@ +84,5 @@
> +      let frameLoader = subject.QueryInterface(Ci.nsIFrameLoader);
> +      if (this._thumbBrowser.messageManager == frameLoader.messageManager) {
> +        Cu.reportError("BackgroundThumbnails remote process crashed - recovering");
> +        this._destroyBrowser();
> +        let curCapture = this._captureQueue ? this._captureQueue[0] : null;

this._captureQueue shouldn't ever be falsey here, should it?  i.e., a capture has to have been requested to end up here.

@@ +85,5 @@
> +      if (this._thumbBrowser.messageManager == frameLoader.messageManager) {
> +        Cu.reportError("BackgroundThumbnails remote process crashed - recovering");
> +        this._destroyBrowser();
> +        let curCapture = this._captureQueue ? this._captureQueue[0] : null;
> +        // we could retry the existing capture, but it's possible the crash

s/existing/pending/

@@ +89,5 @@
> +        // we could retry the existing capture, but it's possible the crash
> +        // was due directly to it, so trying again might just crash again.
> +        // We could keep a flag to indicate if it previously crashed, but
> +        // "resetting" the capture requires more work - so for now, we just
> +        // discard it.

I think that's fine.
Attachment #797079 - Flags: feedback?(adw) → feedback+
We have ctypes code that you can use to intentionally crash. Currently it loads a test lib so that we crash crash in various specific ways (pure-virtual, etc). But you can also crash just by dereferencing voidptr-to-0x1 without an external lib. See http://mxr.mozilla.org/mozilla-central/source/toolkit/crashreporter/test/CrashTestUtils.jsm
(In reply to Benjamin Smedberg  [:bsmedberg] from comment #3)
> We have ctypes code that you can use to intentionally crash. Currently it
> loads a test lib so that we crash crash in various specific ways
> (pure-virtual, etc). But you can also crash just by dereferencing
> voidptr-to-0x1 without an external lib. See
> http://mxr.mozilla.org/mozilla-central/source/toolkit/crashreporter/test/
> CrashTestUtils.jsm

CrashTestUtils.jsm seems to only be available in xpcshell tests, but I did find another example of a test using ctypes to crash a child process, so just borrowed that.  We should probably consider making CrashTestUtils available to mochi too - but this bug probably isn't the place to do that.

This attachment addresses Drew's feedback comments on the patch itself and adds tests.
Assignee: nobody → mhammond
Status: NEW → ASSIGNED
Attachment #801426 - Flags: review?(adw)
Comment on attachment 801426 [details] [diff] [review]
0001-Bug-910563-gracefully-handle-the-thumbnail-process-c.patch

Review of attachment 801426 [details] [diff] [review]:
-----------------------------------------------------------------

::: toolkit/components/thumbnails/BackgroundPageThumbs.jsm
@@ +77,5 @@
> +      let frameLoader = subject.QueryInterface(Ci.nsIFrameLoader);
> +      if (this._thumbBrowser.messageManager == frameLoader.messageManager) {
> +        Cu.reportError("BackgroundThumbnails remote process crashed - recovering");
> +        this._destroyBrowser();
> +        let curCapture = this._captureQueue ? this._captureQueue[0] : null;

this._captureQueue shouldn't ever be falsey here, should it?  i.e., a capture has to have been requested to end up here.

@@ +78,5 @@
> +      if (this._thumbBrowser.messageManager == frameLoader.messageManager) {
> +        Cu.reportError("BackgroundThumbnails remote process crashed - recovering");
> +        this._destroyBrowser();
> +        let curCapture = this._captureQueue ? this._captureQueue[0] : null;
> +        // we could retry the existing capture, but it's possible the crash

s/existing/pending/

@@ +223,5 @@
>      }
>    },
>  
>    /**
> +   * Called when the current capture completes or failes (eg, times out, remote

s/failes/fails/
Attachment #801426 - Flags: review?(adw) → review+
(In reply to Drew Willcoxon :adw from comment #5)
> Comment on attachment 801426 [details] [diff] [review]
> 0001-Bug-910563-gracefully-handle-the-thumbnail-process-c.patch
> this._captureQueue shouldn't ever be falsey here, should it?  i.e., a
> capture has to have been requested to end up here.

Yeah, that's just my Python background, assuming an empty array is falsey.  I changed this to checking the length is non-zero.

All other comments fixed - thanks.

https://hg.mozilla.org/integration/fx-team/rev/b8a5df436404
Backed out in https://hg.mozilla.org/integration/fx-team/rev/a66653f84ac7 due to the crash being tested causing the b-c tests to fail, even though privateNoteIntentionalCrash() was called.  Obviously I need to research this some more.
Comment on attachment 801426 [details] [diff] [review]
0001-Bug-910563-gracefully-handle-the-thumbnail-process-c.patch

Review of attachment 801426 [details] [diff] [review]:
-----------------------------------------------------------------

::: toolkit/components/thumbnails/BackgroundPageThumbs.jsm
@@ +192,5 @@
> +    // "ipc:browser-destroyed" sent both for normal and abnormal terminations,
> +    // but it's not currently possible to determine what browser it was for
> +    // (although nsITabParent is probably going to grow a way of determining
> +    // it at some point)
> +    Services.obs.addObserver(this, "oop-frameloader-crashed", false);

fwiw, if you don't want to add this observer and have to filter for your _thumbBrowser, you can use the recently added "oop-browser-crashed" event instead which will be dispatched to the right browser only.
This new patch is very similar to the last, but with enough changes a re-review probably makes sense.  We now use the event Felipe suggested (which cleans things up - no need to check it was our browser that crashed, and no need to remove an observer).  The test also had changes to cleanup the minidumps, which is (apparently) the reason the test-runner complained about the crash.

Try at https://tbpl.mozilla.org/?tree=Try&rev=85f4373fbdf6 - hopefully it will be green by the time it gets re-reviewed :)
Attachment #797079 - Attachment is obsolete: true
Attachment #801426 - Attachment is obsolete: true
Attachment #802062 - Flags: review?(adw)
Comment on attachment 802062 [details] [diff] [review]
0001-Bug-910563-gracefully-handle-the-thumbnail-process-c.patch

Try failed
Attachment #802062 - Flags: review?(adw)
The problem with the last patch was test specific - as the remote process terminated normally due to the previous test, the observer looking for crashes saw the termination but failed as it wasn't a crash.

This patch fixes that - this time the try will work (he says confidently ;)

https://tbpl.mozilla.org/?tree=Try&rev=3c4ebfeae1b7
Attachment #802062 - Attachment is obsolete: true
Attachment #802676 - Flags: review?(adw)
Attachment #802676 - Flags: review?(adw) → review+
https://hg.mozilla.org/mozilla-central/rev/181c7727f08b
Status: ASSIGNED → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Target Milestone: --- → Firefox 26
Depends on: 915384
I've seen that browser_thumbnails_background_crash.js causes pretty consistent orange on ASan builds too. Does this test require the crash reporter to be valid? Does it require processes to actually crash, rather than exit?
Flags: needinfo?(mhammond)
(In reply to Christian Holler (:decoder) from comment #14)
> I've seen that browser_thumbnails_background_crash.js causes pretty
> consistent orange on ASan builds too. Does this test require the crash
> reporter to be valid?

It does require the crash reporter to be in place, as it expects to find the minidump files and remove them.  Is this not true on ASan builds?

> Does it require processes to actually crash, rather
> than exit?

It requires it to crash and it uses ctypes to force it to crash.
(In reply to Mark Hammond (:markh) from comment #15)
> 
> It does require the crash reporter to be in place, as it expects to find the
> minidump files and remove them.  Is this not true on ASan builds?

No, ASan has the crash reporter disabled. In general, all tests that need the crash reporter should be #ifdef MOZ_CRASHREPORTER so they are skipped in non-crashreporter builds. I'll clone the bug for that change.
Flags: needinfo?(mhammond)
Depends on: 924651
You need to log in before you can comment on or make changes to this bug.