CookieServiceParent::RecvGetCookieString occasionally fails

VERIFIED FIXED

Status

()

VERIFIED FIXED
4 years ago
3 years ago

People

(Reporter: jimm, Assigned: billm)

Tracking

Trunk
x86_64
All
Points:
---
Dependency tree / graph

Firefox Tracking Flags

(e10sm5+)

Details

Attachments

(2 attachments)

(Reporter)

Description

4 years ago
mozilla::net::PCookieServiceChild::SendGetCookieString occasionally fails, causing an abort of the child process.

https://dxr.mozilla.org/mozilla-central/source/netwerk/cookie/CookieServiceParent.cpp#74
(Reporter)

Comment 1

4 years ago
This might be tear down related.
Depends on: 1005181
(In reply to Jim Mathies [:jimm] from comment #1)
> This might be tear down related.

It sounds like it's just some race where GetAppInfoFromParams is returning false: 
https://dxr.mozilla.org/mozilla-central/source/netwerk/cookie/CookieServiceParent.cpp#95
(Reporter)

Updated

4 years ago
tracking-e10s: --- → ?
(Reporter)

Updated

4 years ago
Assignee: nobody → mrbkap
(Reporter)

Comment 3

4 years ago
This is predominantly caused by these MsgNotAllowed errors:

(msgtype=0x340001,name=???) Message not allowed: cannot be sent/recvd in this state
Assignee: mrbkap → wmccloskey
tracking-e10s: ? → m5+
(Reporter)

Comment 5

4 years ago
noticed this in today's data, for reasons:

'(msgtype=0x340001,name=???) Message not allowed: cannot be sent/recvd in this state'
'HangMonitor'
"HangMonitor" is the reason we give when the user kills a content process using the slow script/plugin hang UI.
Blocks: 1123439
Assignee: wmccloskey → mrbkap
(Reporter)

Updated

4 years ago
Crash Signature: [@ mozalloc_abort(char const* const) | NS_DebugBreak | mozilla::dom::ContentChild::ProcessingError(mozilla::ipc::HasResultCodes::Result, char const*)]
(Reporter)

Comment 7

4 years ago
oops, wrong bug.
Crash Signature: [@ mozalloc_abort(char const* const) | NS_DebugBreak | mozilla::dom::ContentChild::ProcessingError(mozilla::ipc::HasResultCodes::Result, char const*)]
Hey Jim,

Is it possible to figure out if there are particular extensions that cause this or generally to get a list of crash reports that are caused by this indirect signature?
Flags: needinfo?(jmathies)
(Reporter)

Comment 9

4 years ago
(In reply to Blake Kaplan (:mrbkap) (please use needinfo!) from comment #8)
> Hey Jim,
> 
> Is it possible to figure out if there are particular extensions that cause
> this or generally to get a list of crash reports that are caused by this
> indirect signature?

Use supersearch and key in on a part of the ipc error message you're interested in, like '0x340001' - 

https://crash-stats.mozilla.com/search/?product=Firefox&version=39.0a1&ipc_channel_error=~0x340001&_facets=signature&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform#crash-reports

Note this hex message value I'm using can change over time: 0x34 is a message class, 0x01 is the message id. Message classes are defined in obj/ipc/ipdl/_ipdlheaders/IPCMessageStart.h, message ids are defined in the main auto-generated protocol header (like PContent.h and PPluginWidget.h).
Flags: needinfo?(jmathies)
(Reporter)

Updated

4 years ago
Blocks: 1142109
I can reproduce this now. I'm in the middle of debugging it.

The basic STR is to install LastPass and visit a page where JS hangs (which is pretty easy because of bug 1113369). Then... do some stuff after it hangs. I see crashes from SendGetInputContext and SendGetCookieString, both urgent messages.
Assignee: mrbkap → wmccloskey
Status: NEW → RESOLVED
Last Resolved: 4 years ago
Resolution: --- → DUPLICATE
Duplicate of bug: 1142109
Re-opening because of https://bugzilla.mozilla.org/show_bug.cgi?id=1056018#c20.
Status: RESOLVED → REOPENED
Resolution: DUPLICATE → ---
Created attachment 8586350 [details] [diff] [review]
patch for possible problem

It's really hard to tell what's going on here, but I looked over the code again and found a potential problem. WaitForSyncNotify can return try (meaning a timeout) even if we received a message reply. Additionally, we can receive a reply during ShouldContinueFromTimeout since it releases the lock. So we should check for a reply before deciding to time out. Otherwise we'll never unset mTimedOutMessageSeqno.
Attachment #8586350 - Flags: review?(dvander)
Comment on attachment 8586350 [details] [diff] [review]
patch for possible problem

Review of attachment 8586350 [details] [diff] [review]:
-----------------------------------------------------------------

Good catch.
Attachment #8586350 - Flags: review?(dvander) → review+
(Reporter)

Comment 17

4 years ago
fixed:
http://www.mathies.com/mozilla/client-abort-report.txt
Status: REOPENED → RESOLVED
Last Resolved: 4 years ago4 years ago
Resolution: --- → FIXED
(Reporter)

Comment 18

4 years ago
Oops, sorry, I was looking at the wrong list.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
It looks like this fixed the MsgNotAllowed errors for non-urgent messages. It didn't fix the problem for urgent messages like GetCookieString since won't process incoming CPOWs in ProcessPendingMessages. So I think we need a separate solution for that.
Created attachment 8589358 [details] [diff] [review]
patch for urgent messages

This patch complements the patch in bug 1142109. They both address the problem where the parent is in the timed out state (i.e., it sent a CPOW that timed out and it hasn't gotten a reply for that CPOW yet). If we get a message in the timed out state, we need to decide what to do.

Originally, we always returned an error for the message. That was to ensure that we didn't start processing the incoming message at the same time the child started processing the timed out CPOW. In general, both sides processing messages at the same time is bad.

The patch in bug 1142109 ensured that we always process incoming messages before sending a message. That way, if the child sends something while a CPOW is in its input queue, it will handle the CPOW before sending its message.

However, it won't handle the CPOW if it's sending an urgent messages, since we defer everything when waiting for an urgent response. However, there's no reason why the parent can't process the child's message in that case, since the child is guaranteed to not process anything, thus avoiding the problem where both sides dispatch at once.

So this patch causes the parent to process incoming urgent messages when it has a timed out CPOW that it hasn't received a response for.
Attachment #8589358 - Flags: review?(dvander)
Attachment #8589358 - Flags: review?(dvander) → review+
I think we can declare this fixed. There have been 9 crashes with "message not allowed" for the 2015-04-15 build so far. In a comparable time period, the 2015-04-14 build had 84 "message not allowed" crashes.

If the crash volume is still too high, we can make "message not allowed" be non-fatal for the child. I was avoiding doing that so we could monitor how often it happens. But as long as it's not too common, it's fine to make it non-fatal.
Status: REOPENED → RESOLVED
Last Resolved: 4 years ago4 years ago
Resolution: --- → FIXED
(Reporter)

Comment 24

4 years ago
One hit in the last three days, definitely fixed!
Status: RESOLVED → VERIFIED
Keywords: leave-open
You need to log in before you can comment on or make changes to this bug.