Closed Bug 1128454 Opened 5 years ago Closed 4 years ago

KillHard aborts in mozilla::dom::PContentChild::SendLoadPlugin

Categories

(Core :: Plug-ins, defect)

x86_64
All
defect
Not set

Tracking

()

RESOLVED FIXED
mozilla42
Tracking Status
e10s m8+ ---
firefox40 --- affected
firefox41 --- fixed
firefox42 --- fixed

People

(Reporter: jimm, Assigned: jimm)

References

(Blocks 1 open bug)

Details

Attachments

(1 file, 12 obsolete files)

1.46 KB, patch
billm
: review+
Details | Diff | Splinter Review
ContentParent::RecvLoadPlugin fails a lot. bug 1127374 wallpapered this over by removing KillHard return results as a result of failures in plugin loading. Why plugin loading is so failure prone is still an open question though.

+++ This bug was initially created as a clone of Bug #1127374 +++

#1 top KillHard cause (see parent bug)

mozilla::dom::PContentChild::SendLoadPlugin(unsigned int const &)

http://mxr.mozilla.org/mozilla-central/source/dom/ipc/ContentParent.cpp#978
Attached patch diagnostic patch (obsolete) — Splinter Review
Attached patch diagnostic patch (obsolete) — Splinter Review
Assignee: nobody → jmathies
Attachment #8557888 - Attachment is obsolete: true
Attached patch diagnostic patch (obsolete) — Splinter Review
Follow up after bug 1127374 to try and diagnose why this call fails so often. I plan to back this out after a day or two.
Attachment #8557918 - Attachment is obsolete: true
Attachment #8558095 - Flags: review?(wmccloskey)
Comment on attachment 8558095 [details] [diff] [review]
diagnostic patch

Rather than land some temporary hack that causes an uptick in crashes, I think I'll do this through telemetry.
Attachment #8558095 - Flags: review?(wmccloskey)
Attached patch patch (obsolete) — Splinter Review
Attachment #8558095 - Attachment is obsolete: true
Attached patch patch (obsolete) — Splinter Review
Attachment #8558532 - Attachment is obsolete: true
Comment on attachment 8558533 [details] [diff] [review]
patch

I'm thinking we'll only need this for a week or so, then we can back this out.
Attachment #8558533 - Flags: review?(wmccloskey)
Comment on attachment 8558533 [details] [diff] [review]
patch

Review of attachment 8558533 [details] [diff] [review]:
-----------------------------------------------------------------

::: toolkit/components/telemetry/Histograms.json
@@ +7343,5 @@
> +  "PLUGIN_LOAD_FAILURE_REASON": {
> +    "expires_in_version": "42",
> +    "kind": "linear",
> +    "high": 4294967295,
> +    "n_buckets": 10,

I haven't looked this over thoroughly, but are you sure this will work? I think this will divide the range between 0 and 2**32 into 10 buckets. If so, then your three error codes will almost certainly all end up in the same bucket. I think you might need to "decode" the error number into an enum before giving it to telemetry. Then you can use something like this:
http://mxr.mozilla.org/mozilla-central/source/toolkit/components/telemetry/Histograms.json#220
Comment on attachment 8558533 [details] [diff] [review]
patch

I spoke with vladan in #telemetry, I need to sort possible return values into a single linear scale and submit that. :/ I guess we don't have a telemetry type that's compatible with nsresult values.
Attachment #8558533 - Flags: review?(wmccloskey)
ipc errors are all the same for this signature:

 '(msgtype=0x300040,name=PContent::Msg_LoadPlugin) Processing error: message was deserialized, but the handler returned false (indicating failure)'
Assignee: jmathies → nobody
Assignee: nobody → jmathies
Assignee: jmathies → nobody
Combine the two signatures places this crash at the top of the list - 

Bug 1116884 KillHard child signature breakdown:
----------------------------------------------------
42	28.6%	mozilla::dom::PContentChild::SendLoadPlugin(unsigned int const &,nsresult *,unsigned int *)
34	23.1%	mozilla::dom::PBrowserChild::SendCreateWindow(mozilla::dom::PBrowserChild *,unsigned int const &,bool const &,bool const &,bool const &,nsString const &,nsString const &,nsString const &,nsString const &,bool *,nsTArray<mozilla::dom::FrameScriptInfo> *,nsCString *)
13	8.8%	mozilla::dom::PBrowserChild::SendGetRenderFrameInfo(mozilla::layout::PRenderFrameChild *,mozilla::layers::TextureFactoryIdentifier *,unsigned __int64 *)
13	8.8%	mozilla::dom::PContentChild::SendRpcMessage(nsString const &,mozilla::dom::ClonedMessageData const &,nsTArray<mozilla::jsipc::CpowEntry> const &,IPC::Principal const &,nsTArray<mozilla::OwningSerializedStructuredCloneBuffer> *)
12	8.2%	mozilla::dom::PContentChild::SendLoadPlugin(unsigned int const&, nsresult*, unsigned int*)
10	6.8%	mozilla::dom::PBrowserChild::SendCreateWindow(mozilla::dom::PBrowserChild*, unsigned int const&, bool const&, bool const&, bool const&, nsString const&, nsString const&, nsString const&, nsString const&, bool*, nsTArray<mozilla::dom::FrameScriptInfo>*, nsCString*)
Summary: Investigate why ContentParent::RecvLoadPlugin has a high failure rate → KillHard aborts in mozilla::dom::PContentChild::SendLoadPlugin
Assignee: nobody → jmathies
Attached patch patch v.1 (obsolete) — Splinter Review
Attachment #8558533 - Attachment is obsolete: true
Comment on attachment 8631264 [details] [diff] [review]
patch v.1

To diagnose this lets get some data on where the failure occurs. This is the bridging code in plugin module:

https://dxr.mozilla.org/mozilla-central/source/dom/plugins/ipc/PluginModuleParent.cpp?from=mozilla::plugins::SetupBridge&case=true#104

No aborts except for the Bridge call. That code didn't return good error results, so I've added that and annotated the report with the result.
Attachment #8631264 - Flags: review?(wmccloskey)
I was just thinking of a slight alternative to this: rather than keep the bool return results, just return nsresults instead. I'll work that up tomorrow.
Attached patch patch v.1 (obsolete) — Splinter Review
Attachment #8631264 - Attachment is obsolete: true
Attachment #8631264 - Flags: review?(wmccloskey)
Attached patch patch v.1 (obsolete) — Splinter Review
Attachment #8631560 - Attachment is obsolete: true
Attachment #8631561 - Flags: review?(wmccloskey)
Comment on attachment 8631561 [details] [diff] [review]
patch v.1

Review of attachment 8631561 [details] [diff] [review]:
-----------------------------------------------------------------

Seems good. Can you make sure to test it to make sure the minidump gets annotated properly?

::: ipc/glue/ProtocolUtils.cpp
@@ +185,5 @@
>         MessageChannel* aChildChannel, ProcessId aChildPid,
>         ProtocolId aProtocol, ProtocolId aChildProtocol)
>  {
> +  NS_ENSURE_ARG(aParentPid);
> +  NS_ENSURE_ARG(aChildPid);

I think we're moving away from NS_ENSURE_*. Let's just change |return false| to |return NS_ERROR_INVALID_ARG|.

::: ipc/glue/Transport_posix.cpp
@@ +45,5 @@
>    }
>  
>    aOne->mFd = base::FileDescriptor(fd1, true/*close after sending*/);
>    aTwo->mFd = base::FileDescriptor(fd2, true/*close after sending*/);
>    return true;

NS_OK
Attachment #8631561 - Flags: review?(wmccloskey) → review+
(In reply to Bill McCloskey (:billm) from comment #17)
> Comment on attachment 8631561 [details] [diff] [review]
> patch v.1
> 
> Review of attachment 8631561 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> Seems good. Can you make sure to test it to make sure the minidump gets
> annotated properly?

Yes it does:

https://crash-stats.mozilla.com/report/index/93cf73a9-63a2-4906-8712-dd0f82150708

search for BridgePluginError in meta.


> ::: ipc/glue/ProtocolUtils.cpp
> @@ +185,5 @@
> >         MessageChannel* aChildChannel, ProcessId aChildPid,
> >         ProtocolId aProtocol, ProtocolId aChildProtocol)
> >  {
> > +  NS_ENSURE_ARG(aParentPid);
> > +  NS_ENSURE_ARG(aChildPid);
> 
> I think we're moving away from NS_ENSURE_*. Let's just change |return false|
> to |return NS_ERROR_INVALID_ARG|.

I've heard this although I don't understand why, they are convenient macros. Generally I've been ignoring this unless someone flags me, I'll update this patch in that case.

> 
> ::: ipc/glue/Transport_posix.cpp
> @@ +45,5 @@
> >    }
> >  
> >    aOne->mFd = base::FileDescriptor(fd1, true/*close after sending*/);
> >    aTwo->mFd = base::FileDescriptor(fd2, true/*close after sending*/);
> >    return true;
> 
> NS_OK

oops, thanks.
> I've heard this although I don't understand why, they are convenient macros. Generally I've
> been ignoring this unless someone flags me, I'll update this patch in that case.

I think the rationale is that NS_ENSURE_* make it harder to see all the places where a function exits early.
Attached patch patch v.1 (obsolete) — Splinter Review
Attachment #8631561 - Attachment is obsolete: true
Attachment #8631894 - Flags: review+
Comment on attachment 8631894 [details] [diff] [review]
patch v.1

Approval Request Comment
[Feature/regressing bug #]:
new e10s related crash report annotations
[User impact if declined]:
devs need this to address crash rates under e10s
[Describe test coverage new/current, TreeHerder]:
on mc, no issues found.
[Risks and why]: 
low, this is for debugging e10s issues.
[String/UUID change made/needed]:
no
Attachment #8631894 - Flags: approval-mozilla-aurora?
Updating status flags to affected for Aurora.
Comment on attachment 8631894 [details] [diff] [review]
patch v.1

Has been on mc, no issues found.
Attachment #8631894 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Every failure reported over the weekend was NS_ERROR_BRIDGE_OPEN_CHILD, which is the connection opened to the chrome module parent. I'm guessing that probably has some child process initialization behind it that fails.
the two places we can fail here, both in Send -

https://dxr.mozilla.org/mozilla-central/source/ipc/glue/MessageChannel.cpp#534

Both indicate the connection previous set up in the call to GetPluginForContentProcess died for some reason before the call to Bridge. There are a lot of failure points in the child process and module init code so this doesn't surprise me. Unfortunately I haven't had any luck reproducing locally so I'm not sure yet how to deal with this. I think the best bet now will be to allow this failure to happen without the abort, and work on investigating failure issues in a follow up.
Attached patch ipdl test fix (obsolete) — Splinter Review
This fixes up an IPDL test that we don't normally compile.
Attachment #8633696 - Flags: review?(jmathies)
Comment on attachment 8633696 [details] [diff] [review]
ipdl test fix

woops, thanks!
Attachment #8633696 - Flags: review?(jmathies) → review+
Attached patch log the channel state (obsolete) — Splinter Review
I've been trying to reproduce this locally by inserting failures over in the child init code but so far no luck reproducing. Lets add channel state logging to see what the state of our ipc channel is. I'm guessing it'll have an error state set, lets confirm that with some data.
Attachment #8634262 - Attachment is obsolete: true
Attachment #8634267 - Flags: review?(wmccloskey)
Comment on attachment 8634267 [details] [diff] [review]
log the channel state

Review of attachment 8634267 [details] [diff] [review]:
-----------------------------------------------------------------

::: ipc/glue/MessageChannel.h
@@ +128,5 @@
>      bool WaitForIncomingMessage();
>  
>      bool CanSend() const;
>  
> +    ChannelState GetChannelState() const {

Please stick a ridiculous prefix on this, like GetChannelState__TotallyRacy or something. And comment that this is unsafe because it doesn't acquire the monitor.
Attachment #8634267 - Flags: review?(wmccloskey) → review+
All the reports indicate the connection state is closed. I might be able to extract the channel closing reason from the plugin channel and report that with these abort. I think I'll make that a follow up so we can get this abort cleared out of aurora.
Comment on attachment 8636038 [details] [diff] [review]
avoid aborts patch

Review of attachment 8636038 [details] [diff] [review]:
-----------------------------------------------------------------

I'd like get this uplifted to aurora, then file a follow up for trying to figure out what the plugin child channel dies. My guess is it'll be a mixture of reasons, all of which are probably expected via code decisions. In the mean time lets get this abort out of the code base for now. We can bring this back on nightly to do additional debugging / reporting.
Attachment #8636038 - Flags: review?(wmccloskey)
Comment on attachment 8636038 [details] [diff] [review]
avoid aborts patch

Review of attachment 8636038 [details] [diff] [review]:
-----------------------------------------------------------------

This is pretty weird, but the only way I can see to follow up on this is to try to test plugin crashes locally. There's no reason that should block this.
Attachment #8636038 - Flags: review?(wmccloskey) → review+
Attachment #8631894 - Attachment is obsolete: true
Attachment #8633696 - Attachment is obsolete: true
Attachment #8634267 - Attachment is obsolete: true
(In reply to Bill McCloskey (:billm) from comment #41)
> Comment on attachment 8636038 [details] [diff] [review]
> avoid aborts patch
> 
> Review of attachment 8636038 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> This is pretty weird, but the only way I can see to follow up on this is to
> try to test plugin crashes locally. There's no reason that should block this.

We can also bring the abort back on nightly to debug the channel closing.
https://hg.mozilla.org/mozilla-central/rev/d82fc1a3328b
Status: NEW → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla42
Comment on attachment 8636038 [details] [diff] [review]
avoid aborts patch

Approval Request Comment
[Feature/regressing bug #]:
No bug. e10s related api handling tweak to turn off content process aborts when we experience errors in plugin loading.
[User impact if declined]:
crashy aurora
[Describe test coverage new/current, TreeHerder]:
just landed on mc, we can uplift next week once we confirm the fix is doing what we expect.
[Risks and why]: 
none
[String/UUID change made/needed]:
none
Attachment #8636038 - Flags: approval-mozilla-aurora?
Confirmed fixed - stops on the 22nd.

http://www.mathies.com/mozilla/killhard.html
hey blassey, can you hook me up with aurora approval on this?
Flags: needinfo?(blassey.bugs)
Comment on attachment 8636038 [details] [diff] [review]
avoid aborts patch

Aurora+ now that the fix has been confirmed.
Attachment #8636038 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Flags: needinfo?(blassey.bugs)
You need to log in before you can comment on or make changes to this bug.