Closed
Bug 1068349
Opened 10 years ago
Closed 10 years ago
IPC message handlers that return false cause child process to crash without a crash report
Categories
(Toolkit :: Crash Reporting, defect)
Tracking
()
RESOLVED
FIXED
mozilla37
Tracking | Status | |
---|---|---|
e10s | m4+ | --- |
People
(Reporter: billm, Assigned: mconley)
References
(Blocks 1 open bug)
Details
Attachments
(3 files, 6 obsolete files)
4.59 KB,
patch
|
ted
:
review+
|
Details | Diff | Splinter Review |
5.02 KB,
patch
|
smaug
:
review+
|
Details | Diff | Splinter Review |
1.69 MB,
application/zip
|
Details |
There are a number of places where ContentParent.cpp calls KillHard(). One of them is in ContentParent::ProcessingError. In this case, the parent has decided that a message sent by the child was malformed or something and we should kill the child. However, we don't submit a crash report in that case. For sync messages, a crash report would tell us why the child sent the message. For async messages, it would crash at a random spot, but at least we would know about the crash.
Comment 1•10 years ago
|
||
We basically want to call CreatePairedMinidumps in this situation:
http://hg.mozilla.org/mozilla-central/annotate/4c449dcda453/dom/ipc/CrashReporterParent.h#l112
Not sure if we need other plumbing around that to make it get submitted or if the "child died, look for minidump" logic handles that.
Assignee | ||
Comment 3•10 years ago
|
||
Hey jimm, bug 1099274 (which I duped to this), was an M4. Should I go ahead and carry on the work in this bug as an M4, or do you think we should re-triage? My instinct says the former, but I thought I'd run it by you.
Flags: needinfo?(jmathies)
![]() |
||
Comment 4•10 years ago
|
||
If this gets ugly for any reason tag it for re-triage. For now though, lets carry on the work in bug 1099274 here.
Assignee | ||
Comment 5•10 years ago
|
||
Sounds good.
Assignee | ||
Comment 6•10 years ago
|
||
Assignee | ||
Comment 7•10 years ago
|
||
Comment on attachment 8527898 [details] [diff] [review]
[WIP]: IPC message handlers that return false cause child process to crash without a crash report. r=?
So, it seems like it was this easy. This will cause a minidump to be generated.
Is there any annotating we need to do here to make this report useful for this particular KillHard case?
Attachment #8527898 -
Flags: feedback?(ted)
![]() |
||
Comment 8•10 years ago
|
||
Comment on attachment 8527898 [details] [diff] [review]
[WIP]: IPC message handlers that return false cause child process to crash without a crash report. r=?
Review of attachment 8527898 [details] [diff] [review]:
-----------------------------------------------------------------
::: dom/ipc/ContentParent.cpp
@@ +3049,5 @@
> + if (ManagedPCrashReporterParent().Length() > 0) {
> + CrashReporterParent* crashReporter =
> + static_cast<CrashReporterParent*>(ManagedPCrashReporterParent()[0]);
> +
> + crashReporter->GeneratePairedMinidump(this);
does this submit a report? some commenting here would be appreciated.
Assignee | ||
Comment 9•10 years ago
|
||
(In reply to Jim Mathies [:jimm] from comment #8)
> Comment on attachment 8527898 [details] [diff] [review]
> [WIP]: IPC message handlers that return false cause child process to crash
> without a crash report. r=?
>
> Review of attachment 8527898 [details] [diff] [review]:
> -----------------------------------------------------------------
>
> ::: dom/ipc/ContentParent.cpp
> @@ +3049,5 @@
> > + if (ManagedPCrashReporterParent().Length() > 0) {
> > + CrashReporterParent* crashReporter =
> > + static_cast<CrashReporterParent*>(ManagedPCrashReporterParent()[0]);
> > +
> > + crashReporter->GeneratePairedMinidump(this);
>
> does this submit a report? some commenting here would be appreciated.
It doesn't submit it (at this point, we still haven't shown about:tabcrashed to the user, so they haven't yet been given the opportunity to opt-out of sending the report).
My understanding is that this call causes Gecko to generate a minidump for the state of the child process before we kill it. We do something very similar here in PluginModuleChromeParent::ShouldContinueFromReplyTimeout[1]. As a paired minidump, it's possible it's generating a minidump for the parent state as well - perhaps ted can fill us in more there.
Later, when the ContentChild actor dies, ContentParent::ActorDestroy is called, which generates the actual report from the minidumps, and passes the information about the report through the ipc:content-shutdown observer notification, which then shows the UI for submitting the crash report.
So that's kind of how it's all laid out - it looks like all we needed to do was generate that minidump before killing the child process.
I'll update my patch with some documentation about this.
[1]: http://hg.mozilla.org/mozilla-central/file/8c02f3280d0c/dom/plugins/ipc/PluginModuleParent.cpp#l592
Assignee | ||
Comment 10•10 years ago
|
||
Assignee | ||
Updated•10 years ago
|
Attachment #8527898 -
Attachment is obsolete: true
Attachment #8527898 -
Flags: feedback?(ted)
Assignee | ||
Comment 11•10 years ago
|
||
Comment on attachment 8527926 [details] [diff] [review]
[WIP]: IPC message handlers that return false cause child process to crash without a crash report. r=?
Actually, might as well go full-on review request here.
Attachment #8527926 -
Flags: review?(ted)
Comment 12•10 years ago
|
||
(In reply to Mike Conley (:mconley) - Needinfo me! from comment #9)
> My understanding is that this call causes Gecko to generate a minidump for
> the state of the child process before we kill it. We do something very
> similar here in PluginModuleChromeParent::ShouldContinueFromReplyTimeout[1].
> As a paired minidump, it's possible it's generating a minidump for the
> parent state as well - perhaps ted can fill us in more there.
This is correct--we write a pair of minidumps, one for the child and one for the parent. If the user submits a crash report we'll submit both.
> Later, when the ContentChild actor dies, ContentParent::ActorDestroy is
> called, which generates the actual report from the minidumps, and passes the
> information about the report through the ipc:content-shutdown observer
> notification, which then shows the UI for submitting the crash report.
This is also correct. We simply store the dumps in a hashtable keyed by the PID of the child process, and retrieve them when it terminates, and then send out a notification to allow the browser chrome to display the crash UI and submit the report.
Assignee | ||
Updated•10 years ago
|
Status: NEW → ASSIGNED
Comment 13•10 years ago
|
||
Comment on attachment 8527926 [details] [diff] [review]
[WIP]: IPC message handlers that return false cause child process to crash without a crash report. r=?
Review of attachment 8527926 [details] [diff] [review]:
-----------------------------------------------------------------
::: dom/ipc/ContentParent.cpp
@@ +3063,4 @@
> // This ensures the process is eventually killed, but doesn't
> // immediately KILLITWITHFIRE because we want to get a minidump if
> // possible. After a timeout though, the process is forceably
> // killed.
You should probably remove this comment, it doesn't actually seem to be true and I'm not sure if it ever was.
Attachment #8527926 -
Flags: review?(ted) → review+
Assignee | ||
Comment 14•10 years ago
|
||
Argh, so after some testing, this isn't good enough. Sorry to make you review that, ted.
When I say not good enough, I mean that while we're displaying the checkbox to submit a crash report, we're not actually annotating the crash report with the extra minidump from the content process that we've killed - so if the child started spewing garbage at us and we killed it, this crash report is of minimal use.
What we need to do is something akin to what PluginModuleParent does in TerminateChildProcess, where we add an entry to the additional_minidumps list, and annotate it with the minidump generated from the content process.
Another (separate) problem is that we don't get the additional minidumps displayed in the UI of crash stats. We can view them in the unprocessed JSON blob of the crash report, but that's not great. bsmedberg said we have bugs about that somewhere.
bsmedberg - I couldn't find those bugs despite looking for (what I thought were) logical search terms under Socorro:Webapp, but no dice. Can you provide the bug numbers for those crash stats bits I mentioned?
In the meantime, I'll get to start on a proper patch with actual minidumps for the other process.
Assignee | ||
Comment 15•10 years ago
|
||
Assignee | ||
Updated•10 years ago
|
Attachment #8527926 -
Attachment is obsolete: true
Assignee | ||
Comment 16•10 years ago
|
||
Assignee | ||
Updated•10 years ago
|
Attachment #8535881 -
Attachment is obsolete: true
Assignee | ||
Comment 17•10 years ago
|
||
Comment on attachment 8535883 [details] [diff] [review]
IPC message handlers that return false cause child process to crash without a crash report. r=?
This works better - example of a generated crash report:
https://crash-stats.mozilla.com/api/ProcessedCrash/?crash_id=e7a59f07-33a3-4048-b241-aca082141212
Attachment #8535883 -
Flags: review?(ted)
Comment 18•10 years ago
|
||
Comment on attachment 8535883 [details] [diff] [review]
IPC message handlers that return false cause child process to crash without a crash report. r=?
Review of attachment 8535883 [details] [diff] [review]:
-----------------------------------------------------------------
::: dom/ipc/ContentParent.cpp
@@ +3078,5 @@
> + // it exists and is being appended.
> + nsAutoCString additionalDumps("browser");
> + crashReporter->AnnotateCrashReport(
> + NS_LITERAL_CSTRING("additional_minidumps"),
> + additionalDumps);
It would be nice if we got this for free with GeneratePairedMinidump, but it's probably not called enough places to be worthwhile.
Attachment #8535883 -
Flags: review?(ted) → review+
Assignee | ||
Updated•10 years ago
|
OS: Linux → All
Assignee | ||
Comment 19•10 years ago
|
||
Thanks for the review!
remote: https://hg.mozilla.org/integration/fx-team/rev/779342bb7001
Comment 20•10 years ago
|
||
Backed out for mochitest-e10s orange.
https://hg.mozilla.org/integration/fx-team/rev/74d4cd870867
https://treeherder.mozilla.org/ui/logviewer.html#?job_id=1454635&repo=fx-team
And B2G Desktop:
https://treeherder.mozilla.org/ui/logviewer.html#?job_id=1454367&repo=fx-team
Comment 22•10 years ago
|
||
It looks like the browser-chrome tests under browser/components/customizableui/test launch a lot of child processes, so maybe one of them is hitting this KillHard path you've added, and it now causes a failure because the minidumps get written?
I do see some spew like this in the log:
11:13:47 INFO - [Parent 2384] WARNING: pipe error (49): Connection reset by peer: file /builds/slave/fx-team-lx-0000000000000000000/build/src/ipc/chromium/src/chrome/common/ipc_channel_posix.cc, line 457
11:13:47 INFO - ###!!! [Parent][OnMaybeDequeueOne] Error: Channel error: cannot send/recv
11:13:47 INFO - ###!!! [Parent][OnMaybeDequeueOne] Error: Channel error: cannot send/recv
11:13:47 INFO - ###!!! [Parent][OnMaybeDequeueOne] Error: Channel error: cannot send/recv
11:13:47 INFO - ###!!! [Parent][OnMaybeDequeueOne] Error: Channel error: cannot send/recv
The B2G desktop mochitests seem to be hitting something similar. It's possible this is something that has always happened and your patch turns it into a test failure.
Flags: needinfo?(ted)
Assignee | ||
Comment 23•10 years ago
|
||
Assignee | ||
Comment 24•10 years ago
|
||
Assignee | ||
Updated•10 years ago
|
Attachment #8538644 -
Attachment is obsolete: true
Assignee | ||
Comment 25•10 years ago
|
||
Comment on attachment 8538648 [details] [diff] [review]
Follow-up: Disable KillHard timer for content processes for Mochitests. r=?
So the problem seems to be that some tests rapidly create and destroy single remote tabs, which result in us rapidly creating and destroying content processes.
dom.ipc.tabs.shutdownTimeoutSecs is used by the parent process as a fail-safe - if we just saw our last remote tab go, tell the content process to destroy itself; but if dom.ipc.tabs.shutdownTimeoutSecs goes by without us hearing back, forcibly kill the child.
This timeout is 5 seconds by default in opt builds. On our test machines, this is probably not enough time for the content process to boot up, init XPCOM, get its event loop spinning, and send back a destroy message back to the parent, which is why the parent will periodically force-kill the content process resulting in the minidumps that caused my patch to bounce.
bent suggested that we just disable the timeout on mochitests.
What do you think of that solution, smaug?
Attachment #8538648 -
Flags: review?(bugs)
Comment 26•10 years ago
|
||
Would it make sense to just use higher value?
Assignee | ||
Comment 27•10 years ago
|
||
Doesn't smell good - bumping timeouts to some magic number feels a bit like we're just kicking the can down the road. :/ Is there any value in having KillHard behaviour in mochitests?
Assignee | ||
Comment 28•10 years ago
|
||
Comment 29•10 years ago
|
||
Well, it being something else than 0 would be closer to what we have in normal builds.
Assignee | ||
Comment 30•10 years ago
|
||
I guess? I don't know - I guess I just don't see the value. Then again, I just want to get my patch landed, so I'm not too fussed about the final decision on this bit. :)
I'll see if 10 seconds is sufficient.
Assignee | ||
Comment 31•10 years ago
|
||
Try push with 10 second delay: https://treeherder.mozilla.org/ui/#/jobs?repo=try&revision=2613ff7d534c
Comment 32•10 years ago
|
||
Hmm, some failures in b2g. Why exactly? Do we need even longer timeout?
It is a bit worrisome to see those failures.
Timers are a testing nightmare... If we want to make sure that our KillHard timer works we should have a dedicated test for it. We should just expect that our other mochitests are always going to succeed (or timeout and be marked as failures accordingly). Let's try just disabling the timer by default on mochitests.
Comment 34•10 years ago
|
||
Comment on attachment 8538648 [details] [diff] [review]
Follow-up: Disable KillHard timer for content processes for Mochitests. r=?
Well, fine, but those b2g tests failing is very worrisome.
Attachment #8538648 -
Flags: review?(bugs) → review+
Assignee | ||
Comment 35•10 years ago
|
||
Assignee | ||
Updated•10 years ago
|
Attachment #8538648 -
Attachment is obsolete: true
Assignee | ||
Comment 36•10 years ago
|
||
Assignee | ||
Updated•10 years ago
|
Attachment #8539711 -
Attachment is obsolete: true
Assignee | ||
Comment 37•10 years ago
|
||
Comment on attachment 8539712 [details] [diff] [review]
Follow-up: Disable KillHard timer for content processes for Mochitests. r=?
Ok, so it looks as if we're not hitting the KillHard timeout for B2G Desktop, but that there are, I guess, some expected usages of KillHard that do not expect crash reports to come out of them.
For now, I've hidden the paired minidump stuff behind a MOZ_B2G ifdef, and maybe we can investigate this later. Perhaps KillHard can be passed an "aExpected" bool, or there can be two new functions: ExpectedKillHard, UnexpectedKillHard. Something along those lines, anyhow.
https://treeherder.mozilla.org/ui/#/jobs?repo=try&revision=5a5bb2be42cc
Attachment #8539712 -
Flags: review?(bugs)
Updated•10 years ago
|
Attachment #8539712 -
Flags: review?(bugs) → review+
Assignee | ||
Comment 38•10 years ago
|
||
Thanks! Landed:
remote: https://hg.mozilla.org/integration/fx-team/rev/20c494ff9887
remote: https://hg.mozilla.org/integration/fx-team/rev/93b79187eeaf
Comment 39•10 years ago
|
||
https://hg.mozilla.org/mozilla-central/rev/20c494ff9887
https://hg.mozilla.org/mozilla-central/rev/93b79187eeaf
Status: ASSIGNED → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla37
Assignee | ||
Comment 40•10 years ago
|
||
bugnotes |
You need to log in
before you can comment on or make changes to this bug.
Description
•