Closed Bug 1154036 (e10s-socorro) Opened 9 years ago Closed 8 years ago

Socorro needs to be able to analyze and display information on 'meta' e10s crash signatures

Categories

(Socorro :: General, task)

x86_64
All
task
Not set
normal

Tracking

(firefox40 affected)

RESOLVED DUPLICATE of bug 1269817

People

(Reporter: jimm, Unassigned)

Details

(Keywords: meta)

Attachments

(1 file)

Attached file topcrash.py
With e10s there are a few meta crash signatures that display rather generically in the top crash tables but hide important information in the crashing thread's stack.

We currently process these by hand using a Python script I wrote that digs into these signatures and pulls the important data out.

The following meta bugs represent these signatures:

bug 1116884 - KillHard (windows)
bug 1124064 - KillHard (linux)
bug 1118517 - KillHard (mac)
bug 1130734 - child protocol error aborts
bug 1092216 - cpow send with ipc on the stack aborts

My Python script currently generates a report which we use to file bugs:

http://www.mathies.com/mozilla/client-abort-report.txt

I'll post this script as well.
Could you lay out a signature generation mechanism for those, as well as the conditions to apply that mechanism?
KillHard:

1) load reports using super search

win1   = "WaitForSingleObjectEx | WaitForSingleObject | PR_WaitCondVar | mozilla::CondVar::Wait(unsigned int)"
win2   = "WaitForSingleObjectEx | PR_WaitCondVar | mozilla::CondVar::Wait(unsigned int)"
mac1   = "libsystem_kernel.dylib"
linux1 = "libpthread-"

These show up under a lot of random signatures; the sigs listed above are the most common, so they have the best data under them.

2) walk the crashing thread stack looking for:

"mozilla::ipc::MessageChannel::Send(IPC::Message *,IPC::Message *)" or
"mozilla::ipc::MessageChannel::Send(IPC::Message*, IPC::Message*)"

3) once found, *keep* the next frame you find in a table of crashing frames.

This is the "Bug 1116884 KillHard child signature breakdown".

4) fetch RawCrash meta data report for the same signature, get 'ipc_channel_error'. Store this in a sub table under your crashing frame record. You will have multiple ipc_channel_error reasons, so you'll have to sort this sub table, and make sure it has unique entries.

This is "IPC error breakdown per KillHard signature".
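The four KillHard steps above can be sketched roughly as follows. This is not the attached topcrash.py; it is a minimal illustration that assumes each report has already been fetched and flattened into a dict with a "frames" list (crashing thread, top of stack first) and an optional "ipc_channel_error" string. The dict layout and the function names are hypothetical.

```python
from collections import defaultdict

# The two spellings of the Send frame listed in step 2.
SEND_FRAMES = {
    "mozilla::ipc::MessageChannel::Send(IPC::Message *,IPC::Message *)",
    "mozilla::ipc::MessageChannel::Send(IPC::Message*, IPC::Message*)",
}


def frame_after_send(frames):
    """Return the frame that follows the first Send frame, or None."""
    # Stop one short of the end: a Send frame in last position has no next frame.
    for i, frame in enumerate(frames[:-1]):
        if frame in SEND_FRAMES:
            return frames[i + 1]
    return None


def killhard_breakdown(reports):
    """Map each frame found after Send to (count, sorted unique ipc errors)."""
    counts = defaultdict(int)
    errors = defaultdict(set)
    for report in reports:
        key = frame_after_send(report["frames"])
        if key is None:
            continue  # no Send frame on this stack
        counts[key] += 1
        error = report.get("ipc_channel_error")
        if error:
            errors[key].add(error)  # the set keeps the sub table unique
    return {key: (counts[key], sorted(errors[key])) for key in counts}
```

Using a set per crashing frame covers the "sorted, unique entries" requirement of step 4 without a separate de-duplication pass.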
Bug 1130734 Child protocol error abort:

1) load reports using supersearch:

win1 = "mozalloc_abort(char const* const) | NS_DebugBreak | mozilla::dom::ContentChild::ProcessingError(mozilla::ipc::HasResultCodes::Result, char const*)"
mac = ?
linux = ?

2) load RawCrash meta data for each report

3) retrieve 'ipc_channel_error' meta data for each crash, store this in a table sorted based on occurrence.

Currently we don't have any of these in my report since I just fixed the last one last week. But these can show up at any time, so we need to keep an eye out for them.
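Step 3 here amounts to a frequency count. A minimal sketch, assuming the RawCrash metadata for each report is already available as a dict that may carry an "ipc_channel_error" key (the function name and the dict layout are hypothetical):

```python
from collections import Counter


def ipc_error_table(raw_crashes):
    """Return (ipc_channel_error, count) pairs, most frequent first."""
    errors = (crash.get("ipc_channel_error") for crash in raw_crashes)
    # Skip crashes that carry no annotation at all.
    return Counter(error for error in errors if error).most_common()
```

An empty result is the expected outcome while no child protocol aborts are being reported.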
Bug 1092216 Child abort on send signature breakdown:

These are CPOW aborts where in the content process we have an ipc incall on the stack, and this triggers a CPOW to the parent for some reason. In certain cases this type of operation is not allowed. To fix this we need to rejigger the priorities of these messages in the ipc protocol definitions.

1) load reports using super search

windows = "mozalloc_abort(char const* const) | NS_DebugBreak | mozilla::ipc::MessageChannel::DebugAbort(char const*, int, char const*, char const*, bool) | mozilla::ipc::MessageChannel::Send(IPC::Message*, IPC::Message*)"
mac = ?
linux = ?

2) load "ProcessedCrash" and "RawCrash" reports

3) walk the crashing thread stack looking for:

["mozilla::ipc::MessageChannel::Send(IPC::Message*, IPC::Message*)",
 "mozilla::ipc::MessageChannel::Send(IPC::Message *,IPC::Message *)"]

4) once found, grab the next frame. If the next frame is:

["mozilla::dom::PBrowserChild::SendSyncMessage(nsString const &,mozilla::dom::ClonedMessageData const &,nsTArray<mozilla::jsipc::CpowEntry> const &,IPC::Principal const &,nsTArray<nsString> *)",
"nsFrameMessageManager::SendMessage(nsAString_internal const &,JS::Handle<JS::Value>,JS::Handle<JS::Value>,nsIPrincipal *,JSContext *,unsigned char,JS::MutableHandle<JS::Value>,bool)",
"nsFrameMessageManager::SendSyncMessage(nsAString_internal const &,JS::Handle<JS::Value>,JS::Handle<JS::Value>,nsIPrincipal *,JSContext *,unsigned char,JS::MutableHandle<JS::Value>)",
"NS_InvokeByIndex",
"Interpret",
"js::Invoke(JSContext *,JS::CallArgs,js::MaybeConstruct)"]

skip it and grab the next frame. Repeat.

5) once a good frame is found keep it in a table of crashing frames sorted based on occurrence.
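The walk-and-skip part of this procedure can be sketched as below. One simplification that is an assumption of this sketch, not part of the steps above: frames are compared by function name only (everything before the first "("), which sidesteps the per-platform spacing differences in the argument lists. The input is assumed to be a list of stacks, each a list of frame strings with the top of the stack first; all function names here are hypothetical.

```python
from collections import Counter

SEND = "mozilla::ipc::MessageChannel::Send"

# Function names of the frames to skip once Send has been found.
SKIP = {
    "mozilla::dom::PBrowserChild::SendSyncMessage",
    "nsFrameMessageManager::SendMessage",
    "nsFrameMessageManager::SendSyncMessage",
    "NS_InvokeByIndex",
    "Interpret",
    "js::Invoke",
}


def function_name(frame):
    """Drop the argument list so spacing variants compare equal."""
    return frame.split("(", 1)[0]


def first_interesting_frame(frames):
    """Return the first frame after Send that is not on the skip list."""
    names = [function_name(frame) for frame in frames]
    if SEND not in names:
        return None
    start = names.index(SEND) + 1
    for frame, name in zip(frames[start:], names[start:]):
        if name not in SKIP:
            return frame
    return None


def cpow_breakdown(stacks):
    """Return (frame, count) pairs sorted by occurrence, most frequent first."""
    found = (first_interesting_frame(stack) for stack in stacks)
    return Counter(frame for frame in found if frame).most_common()
```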
Note when I say: "load reports using super search", this implies loading the base list of crashes, then loading the ProcessedCrash report usually. This is where I get the stack data to walk.
I was looking more for instructions on how the processor (i.e. the thing that handles the incoming crashes and creates a signature among other things) could create a better signature for those crashes (so the super search step is superfluous and is actually converted in some form into the condition when to apply this other signature generation algorithm).

That said, can we add some annotation when creating those crashes which we can filter for?
See Also: → 1162703
Let's make this meta and file individual bugs on things we can do here.
Alias: e10s-socorro
Depends on: 1162703
Keywords: meta
Summary: Crash stats needs to be able to analyze and display information on 'meta' e10s crash signatures → Socorro needs to be able to analyze and display information on 'meta' e10s crash signatures
(In reply to Robert Kaiser (:kairo@mozilla.com) from comment #6)
> I was looking more for instructions on how the processor (i.e. the thing
> that handles the incoming crashes and creates a signature among other
> things) could create a better signature for those crashes ( so the super
> search step is superfluous and is actually converted in some form into the
> condition when to apply this other signature generation algorithm).

We can file followups here about simple improvements, but some of this analysis can't be done with tools like skip lists.

> That said, can we add some annotation when creating those crashes which we
> can filter for?

Yes, in a few cases you can filter on 'ipc_channel_error', although that doesn't tell you much about what kind of stack you're dealing with. You could combine this with the signature to group.
Sounds to me like as a first step, we should add those frames to the prefix skiplist: "mozilla::ipc::MessageChannel::Send", "mozilla::dom::PBrowserChild::SendSyncMessage", "nsFrameMessageManager::SendMessage",
"nsFrameMessageManager::SendSyncMessage", "NS_InvokeByIndex", "Interpret", "js::Invoke" (or make sure they are on it) so that the next frame is added to the signature if any of those is encountered.

jimm, does that sound right?

If you have an actual algorithm for a better signature generation (also, with the signatures that should result from it), we can look into that as well but it will take more time to implement.
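For illustration only, the prefix skiplist behaviour described above could look roughly like this. It is a simplified sketch and not Socorro's actual signature generation code: it assumes frames arrive as plain strings with the top of the stack first, matches on the function name before the first "(", and joins frames with " | " as in the signatures quoted earlier in this bug. The function name generate_signature is hypothetical.

```python
# Frames that should pull the following frame into the signature.
PREFIX_FRAMES = {
    "mozilla::ipc::MessageChannel::Send",
    "mozilla::dom::PBrowserChild::SendSyncMessage",
    "nsFrameMessageManager::SendMessage",
    "nsFrameMessageManager::SendSyncMessage",
    "NS_InvokeByIndex",
    "Interpret",
    "js::Invoke",
}


def generate_signature(frames):
    """Join frames until the first one that is not a prefix frame."""
    parts = []
    for frame in frames:
        parts.append(frame)
        # A prefix frame keeps the walk going; any other frame ends it.
        if frame.split("(", 1)[0] not in PREFIX_FRAMES:
            break
    return " | ".join(parts)
```

With this rule a stack that starts with the Send frame ends its signature on the first frame outside the list, so different callers of Send land in different signatures.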

[As a side note, the mozilla::ipc::MessageChannel::Send signature is the majority of crashes on Dev Edition right now, and we're unable to see if it's known cases that are being worked on or not. In the current state of the signatures we get, I see e10s unfit to advance to any other channel (and it's actually disturbing analysis on dev edition already).]
Flags: needinfo?(jmathies)
(In reply to Robert Kaiser (:kairo@mozilla.com) from comment #9)
> Sounds to me like as a first step, we should add those frames to the prefix
> skiplist: "mozilla::ipc::MessageChannel::Send",
> "mozilla::dom::PBrowserChild::SendSyncMessage",
> "nsFrameMessageManager::SendMessage",
> "nsFrameMessageManager::SendSyncMessage", "NS_InvokeByIndex", "Interpret",
> "js::Invoke" (or make sure they are on it) so they the next frame is added
> to the signature if any of those is encountered.
> 
> jimm, does that sound right?
> 
sure, that should help.
Flags: needinfo?(jmathies)
for liz, the three types of e10s meta crashes:

comment 2 - killhard aborts, these crashes will stay around, we need to automate the process of analyzing them. For example, this is the current #2 top signature on aurora right now.
comment 3 - child aborts, these generally don't show up until someone lands something that triggers a bunch of aborts. Then we fix that one-off and they go away until it happens again.
comment 4 - cpow aborts - resolved fixed, we stopped crashing the child process for this a couple of months ago.
Flags: needinfo?(lhenry)
Thanks Jim, I just realized this has been lurking in needinfo for quite a while though I did read it at the time. 
Kairo, is there anything we need to do here or are there new bugs covering this issue in the meantime? Do we have better e10s info now in crash reports?
Flags: needinfo?(lhenry) → needinfo?(kairo)
We have an ipc annotation that you can supersearch for to find killhard aborts. Then you can do sub searches to try and get numbers for individual messages to try and determine frequency of individual aborts. It's messy manual work. Eventually people will clamor for something automated. I think we should keep this bug around for that.
(In reply to Jim Mathies [:jimm] from comment #13)
> We have an ipc annotation that you can supersearch for to find killhard
> aborts. Then you can do sub searches to try and get numbers for individual
> messages to try and determine frequency of individual aborts. It's messy
> manual work. Eventually people will clamor for something automated. I think
> we should keep this bug around for that.

Are there good ways to use the annotations to adjust the signatures right when we process the crashes? Would be good to have an algorithm for that and then integrate it into our signature generation code in https://github.com/KaiRo-at/socorro/blob/master/socorro/processor/signature_utilities.py

Jim, do you have good ideas on how we should mark those crashes in their signatures?
Flags: needinfo?(kairo) → needinfo?(jmathies)
See comment 2. Do a survey of all the frames above and including the Send frame and add those to a skip list or append them to the signature. The key frame is below that MessageChannel::Send call.

example abort:

https://crash-stats.mozilla.com/report/index/15cad3c6-1eed-4e89-8c68-487422160126
Flags: needinfo?(jmathies)
(In reply to Jim Mathies [:jimm] from comment #15)
> See comment 2. Do a survey of all the frames above and including the Send
> frame and add those to a skip 
> list or append them to the signature. The key frame is below that
> MessageChannel::Send call.

So mozilla::ipc::MessageChannel::WaitForSyncNotify and mozilla::ipc::MessageChannel::Send should always append the next frame to the signature? If so, let's file a Socorro bug on that and do it.
Flags: needinfo?(jmathies)
(In reply to Robert Kaiser (:kairo@mozilla.com) from comment #16)
> (In reply to Jim Mathies [:jimm] from comment #15)
> > See comment 2. Do a survey of all the frames above and including the Send
> > frame and add those to a skip 
> > list or append them to the signature. The key frame is below that
> > MessageChannel::Send call.
> 
> So mozilla::ipc::MessageChannel::WaitForSyncNotify and
> mozilla::ipc::MessageChannel::Send should always append the next frame to
> the signature? If so, let's file a Socorro bug on that and do it.

I'm sure there are more frames that can show up there, hence the suggestion of doing a survey of reports to get a complete list. :) In my python scripts I just searched for the Send calls.
Flags: needinfo?(jmathies)
Depends on: 1267306
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → DUPLICATE