Closed Bug 1068349 Opened 10 years ago Closed 10 years ago

IPC message handlers that return false cause child process to crash without a crash report

Categories

(Toolkit :: Crash Reporting, defect)

x86_64
All
defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla37
Tracking Status
e10s m4+ ---

People

(Reporter: billm, Assigned: mconley)

References

(Blocks 1 open bug)

Details

Attachments

(3 files, 6 obsolete files)

4.59 KB, patch
ted
: review+
Details | Diff | Splinter Review
5.02 KB, patch
smaug
: review+
Details | Diff | Splinter Review
1.69 MB, application/zip
Details
There are a number of places where ContentParent.cpp calls KillHard(). One of them is in ContentParent::ProcessingError. In this case, the parent has decided that a message sent by the child was malformed or something and we should kill the child. However, we don't submit a crash report in that case. For sync messages, a crash report would tell us why the child sent the message. For async messages, it would crash at a random spot, but at least we would know about the crash.
We basically want to call CreatePairedMinidumps in this situation:
http://hg.mozilla.org/mozilla-central/annotate/4c449dcda453/dom/ipc/CrashReporterParent.h#l112

Not sure if we need other plumbing around that to make it get submitted or if the "child died, look for minidump" logic handles that.
Hey jimm, bug 1099274 (which I duped to this), was an M4. Should I go ahead and carry on the work in this bug as an M4, or do you think we should re-triage? My instinct says the former, but I thought I'd run it by you.
Flags: needinfo?(jmathies)
If this gets ugly for any reason tag it for re-triage. For now though, lets carry on the work in bug 1099274 here.
Assignee: nobody → mconley
tracking-e10s: --- → m4+
Flags: needinfo?(jmathies)
Sounds good.
Comment on attachment 8527898 [details] [diff] [review]
[WIP]: IPC message handlers that return false cause child process to crash without a crash report. r=?

So, it seems like it was this easy. This will cause a minidump to be generated.

Is there any annotating we need to do here to make this report useful for this particular KillHard case?
Attachment #8527898 - Flags: feedback?(ted)
Comment on attachment 8527898 [details] [diff] [review]
[WIP]: IPC message handlers that return false cause child process to crash without a crash report. r=?

Review of attachment 8527898 [details] [diff] [review]:
-----------------------------------------------------------------

::: dom/ipc/ContentParent.cpp
@@ +3049,5 @@
> +    if (ManagedPCrashReporterParent().Length() > 0) {
> +        CrashReporterParent* crashReporter =
> +            static_cast<CrashReporterParent*>(ManagedPCrashReporterParent()[0]);
> +
> +        crashReporter->GeneratePairedMinidump(this);

does this submit a report? some commenting here would be appreciated.
(In reply to Jim Mathies [:jimm] from comment #8)
> Comment on attachment 8527898 [details] [diff] [review]
> [WIP]: IPC message handlers that return false cause child process to crash
> without a crash report. r=?
> 
> Review of attachment 8527898 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> ::: dom/ipc/ContentParent.cpp
> @@ +3049,5 @@
> > +    if (ManagedPCrashReporterParent().Length() > 0) {
> > +        CrashReporterParent* crashReporter =
> > +            static_cast<CrashReporterParent*>(ManagedPCrashReporterParent()[0]);
> > +
> > +        crashReporter->GeneratePairedMinidump(this);
> 
> does this submit a report? some commenting here would be appreciated.

It doesn't submit it (at this point, we still haven't shown about:tabcrashed to the user, so they haven't yet been given the opportunity to opt-out of sending the report).

My understanding is that this call causes Gecko to generate a minidump for the state of the child process before we kill it. We do something very similar here in PluginModuleChromeParent::ShouldContinueFromReplyTimeout[1]. As a paired minidump, it's possible it's generating a minidump for the parent state as well - perhaps ted can fill us in more there.

Later, when the ContentChild actor dies, ContentParent::ActorDestroy is called, which generates the actual report from the minidumps, and passes the information about the report through the ipc:content-shutdown observer notification, which then shows the UI for submitting the crash report.

So that's kind of how it's all laid out - it looks like all we needed to do was generate that minidump before killing the child process.

I'll update my patch with some documentation about this.

[1]: http://hg.mozilla.org/mozilla-central/file/8c02f3280d0c/dom/plugins/ipc/PluginModuleParent.cpp#l592
Attachment #8527898 - Attachment is obsolete: true
Attachment #8527898 - Flags: feedback?(ted)
Comment on attachment 8527926 [details] [diff] [review]
[WIP]: IPC message handlers that return false cause child process to crash without a crash report. r=?

Actually, might as well go full-on review request here.
Attachment #8527926 - Flags: review?(ted)
(In reply to Mike Conley (:mconley) - Needinfo me! from comment #9)
> My understanding is that this call causes Gecko to generate a minidump for
> the state of the child process before we kill it. We do something very
> similar here in PluginModuleChromeParent::ShouldContinueFromReplyTimeout[1].
> As a paired minidump, it's possible it's generating a minidump for the
> parent state as well - perhaps ted can fill us in more there.

This is correct--we write a pair of minidumps, one for the child and one for the parent. If the user submits a crash report we'll submit both.

> Later, when the ContentChild actor dies, ContentParent::ActorDestroy is
> called, which generates the actual report from the minidumps, and passes the
> information about the report through the ipc:content-shutdown observer
> notification, which then shows the UI for submitting the crash report.

This is also correct. We simply store the dumps in a hashtable keyed by the PID of the child process, and retrieve them when it terminates, and then send out a notification to allow the browser chrome to display the crash UI and submit the report.
Status: NEW → ASSIGNED
Comment on attachment 8527926 [details] [diff] [review]
[WIP]: IPC message handlers that return false cause child process to crash without a crash report. r=?

Review of attachment 8527926 [details] [diff] [review]:
-----------------------------------------------------------------

::: dom/ipc/ContentParent.cpp
@@ +3063,4 @@
>      // This ensures the process is eventually killed, but doesn't
>      // immediately KILLITWITHFIRE because we want to get a minidump if
>      // possible.  After a timeout though, the process is forceably
>      // killed.

You should probably remove this comment, it doesn't actually seem to be true and I'm not sure if it ever was.
Attachment #8527926 - Flags: review?(ted) → review+
Argh, so after some testing, this isn't good enough. Sorry to make you review that, ted.

When I say not good enough, I mean that while we're displaying the checkbox to submit a crash report, we're not actually annotating the crash report with the extra minidump from the content process that we've killed - so if the child started spewing garbage at us and we killed it, this crash report is of minimal use.

What we need to do is something akin to what PluginModuleParent does in TerminateChildProcess, where we add an entry to the additional_minidumps list, and annotate it with the minidump generated from the content process.

Another (separate) problem is that we don't get the additional minidumps displayed in the UI of crash stats. We can view them in the unprocessed JSON blob of the crash report, but that's not great. bsmedberg said we have bugs about that somewhere.

bsmedberg - I couldn't find those bugs despite looking for (what I thought were) logical search terms under Socorro:Webapp, but no dice. Can you provide the bug numbers for those crash stats bits I mentioned?

In the meantime, I'll get to start on a proper patch with actual minidumps for the other process.
Attachment #8527926 - Attachment is obsolete: true
Attachment #8535881 - Attachment is obsolete: true
Comment on attachment 8535883 [details] [diff] [review]
IPC message handlers that return false cause child process to crash without a crash report. r=?

This works better - example of a generated crash report:

https://crash-stats.mozilla.com/api/ProcessedCrash/?crash_id=e7a59f07-33a3-4048-b241-aca082141212
Attachment #8535883 - Flags: review?(ted)
Comment on attachment 8535883 [details] [diff] [review]
IPC message handlers that return false cause child process to crash without a crash report. r=?

Review of attachment 8535883 [details] [diff] [review]:
-----------------------------------------------------------------

::: dom/ipc/ContentParent.cpp
@@ +3078,5 @@
> +            // it exists and is being appended.
> +            nsAutoCString additionalDumps("browser");
> +            crashReporter->AnnotateCrashReport(
> +                NS_LITERAL_CSTRING("additional_minidumps"),
> +                additionalDumps);

It would be nice if we got this for free with GeneratePairedMinidump, but it's probably not called enough places to be worthwhile.
Attachment #8535883 - Flags: review?(ted) → review+
OS: Linux → All
:( Any idea what I hit here, ted?
Flags: needinfo?(ted)
It looks like the browser-chrome tests under browser/components/customizableui/test launch a lot of child processes, so maybe one of them is hitting this KillHard path you've added, and it now causes a failure because the minidumps get written?

I do see some spew like this in the log:
11:13:47     INFO -  [Parent 2384] WARNING: pipe error (49): Connection reset by peer: file /builds/slave/fx-team-lx-0000000000000000000/build/src/ipc/chromium/src/chrome/common/ipc_channel_posix.cc, line 457
11:13:47     INFO -  ###!!! [Parent][OnMaybeDequeueOne] Error: Channel error: cannot send/recv
11:13:47     INFO -  ###!!! [Parent][OnMaybeDequeueOne] Error: Channel error: cannot send/recv
11:13:47     INFO -  ###!!! [Parent][OnMaybeDequeueOne] Error: Channel error: cannot send/recv
11:13:47     INFO -  ###!!! [Parent][OnMaybeDequeueOne] Error: Channel error: cannot send/recv

The B2G desktop mochitests seem to be hitting something similar. It's possible this is something that has always happened and your patch turns it into a test failure.
Flags: needinfo?(ted)
Attachment #8538644 - Attachment is obsolete: true
Comment on attachment 8538648 [details] [diff] [review]
Follow-up: Disable KillHard timer for content processes for Mochitests. r=?

So the problem seems to be that some tests rapidly create and destroy single remote tabs, which result in us rapidly creating and destroying content processes.

dom.ipc.tabs.shutdownTimeoutSecs is used by the parent process as a fail-safe - if we just saw our last remote tab go, tell the content process to destroy itself; but if dom.ipc.tabs.shutdownTimeoutSecs goes by without us hearing back, forcibly kill the child.

This timeout is 5 seconds by default in opt builds. On our test machines, this is probably not enough time for the content process to boot up, init XPCOM, get its event loop spinning, and send back a destroy message back to the parent, which is why the parent will periodically force-kill the content process resulting in the minidumps that caused my patch to bounce.

bent suggested that we just disable the timeout on mochitests.

What do you think of that solution, smaug?
Attachment #8538648 - Flags: review?(bugs)
Would it make sense to just use higher value?
Doesn't smell good - bumping timeouts to some magic number feels a bit like we're just kicking the can down the road. :/ Is there any value in having KillHard behaviour in mochitests?
Well, it being something else than 0 would be closer to what we have in normal builds.
I guess? I don't know - I guess I just don't see the value. Then again, I just want to get my patch landed, so I'm not too fussed about the final decision on this bit. :)

I'll see if 10 seconds is sufficient.
Hmm, some failures in b2g. Why exactly? Do we need even longer timeout?
It is a bit worrisome to see those failures.
Timers are a testing nightmare... If we want to make sure that our KillHard timer works we should have a dedicated test for it. We should just expect that our other mochitests are always going to succeed (or timeout and be marked as failures accordingly). Let's try just disabling the timer by default on mochitests.
Comment on attachment 8538648 [details] [diff] [review]
Follow-up: Disable KillHard timer for content processes for Mochitests. r=?

Well, fine, but those b2g tests failing is very worrisome.
Attachment #8538648 - Flags: review?(bugs) → review+
Attachment #8538648 - Attachment is obsolete: true
Attachment #8539711 - Attachment is obsolete: true
Comment on attachment 8539712 [details] [diff] [review]
Follow-up: Disable KillHard timer for content processes for Mochitests. r=?

Ok, so it looks as if we're not hitting the KillHard timeout for B2G Desktop, but that there are, I guess, some expected usages of KillHard that do not expect crash reports to come out of them.

For now, I've hidden the paired minidump stuff behind a MOZ_B2G ifdef, and maybe we can investigate this later. Perhaps KillHard can be passed an "aExpected" bool, or there can be two new functions: ExpectedKillHard, UnexpectedKillHard. Something along those lines, anyhow.

https://treeherder.mozilla.org/ui/#/jobs?repo=try&revision=5a5bb2be42cc
Attachment #8539712 - Flags: review?(bugs)
Attachment #8539712 - Flags: review?(bugs) → review+
https://hg.mozilla.org/mozilla-central/rev/20c494ff9887
https://hg.mozilla.org/mozilla-central/rev/93b79187eeaf
Status: ASSIGNED → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla37
Blocks: 1110498
See Also: → 1499465
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: