Closed Bug 512910 Opened 15 years ago Closed 7 years ago

Make it easier to analyze crashes that share a signature

Categories

(Socorro :: General, task)

task
Not set
normal

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: jruderman, Unassigned)

Details

(Whiteboard: [crashkill][crashkill-metrics])

We usually end up opening 10 random crashes just to get an idea of whether the crashes all look the same.  This is how we learn whether some stacks are more useful than others, or whether the crashes all seem to be due to the same bug.  It's a big waste of time for us humans, and humans aren't very good at it.  Efforts to get the right level of granularity into the initial signatures (e.g. bug 504498) help, but only get us so far.

I have two ideas for how Socorro can make this a lot easier, but someone who's well-versed in information design would probably have better ideas.


A. When viewing a table of crashes sharing a signature, use most of the horizontal space for the top 5 stack frames.  Use text-overflow:ellipsis or overflow:hidden (like TinderboxPushlog) so there's no wrapping.  Compress the more boring data to make room.  If you sort by the stack column, it's easy to skim and see which stacks are identical.

The main column should look like:
     nsCycleCollectingAutoRefCnt::decr > nsGenericElement::Release > nsCOMPtr_base::assign_with_AddRef > nsCOMPtr<nsILoadGroup>::operator= > nsEventTargetChainItem::Destroy



Create space by:
* Not limiting the table width.
* OS: "Windows NT 6.0.6001 Service Pack 1" --> "Windows Vista SP1"
* Reason: "EXCEPTION_ACCESS_VIOLATION" --> "AV"
* Product & Version: combined, because sorting on version alone is pointless
* Comments: an extra row for crashes that have comments
* Time: hide it; it's less important than the build ID and mostly used as a poor substitute for a sentence at the top stating the crash's frequency.  Or, group by date and call out the date as a separate line.
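
A rough sketch of how the compacted row in idea A might be computed, in Python. The crash-report dict layout (keys "frames", "os", "reason", "product", "version") and the function names are assumptions made for illustration, not Socorro's actual schema or API. The abbreviation tables contain only the two examples given above.

```python
# Hypothetical abbreviation tables; the only entries are the examples from this bug.
REASON_ABBREVIATIONS = {"EXCEPTION_ACCESS_VIOLATION": "AV"}
OS_ABBREVIATIONS = {"Windows NT 6.0.6001 Service Pack 1": "Windows Vista SP1"}


def compact_row(crash, max_frames=5):
    """Return a compact one-line view of a crash report (assumed to be a dict)."""
    frames = crash["frames"][:max_frames]
    return {
        # Top frames on a single line, separated the same way as the example column.
        "stack": " > ".join(frames),
        # Fall back to the raw value when no abbreviation is known.
        "os": OS_ABBREVIATIONS.get(crash["os"], crash["os"]),
        "reason": REASON_ABBREVIATIONS.get(crash["reason"], crash["reason"]),
        # Product and version combined into one sortable column.
        "product": f'{crash["product"]} {crash["version"]}',
    }


def sorted_rows(crashes, max_frames=5):
    """Sort compact rows by stack so that identical stacks end up adjacent."""
    rows = [compact_row(crash, max_frames) for crash in crashes]
    return sorted(rows, key=lambda row: row["stack"])
```

Sorting on the joined stack string is what makes identical stacks line up for skimming; the truncation itself (text-overflow:ellipsis) would still be left to CSS.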


B. Show a tree of stacks, perhaps as a separate tab from the list of crash reports.  It could look similar to how Activity Monitor displays hang samples:

6000 nsGlobalWindow::cycleCollection::UnmarkPurple
  5000 nsGenericElement::Release
    1500 nsRefPtr<nsIDOMEventListener>::~nsRefPtr<nsIDOMEventListener>
    1000 nsCOMPtr_base::assign_with_AddRef
      500 nsCOMPtr<nsILoadGroup>::operator=
        500 nsEventTargetChainItem::Destroy
    500 nsRefPtr<nsGenericElement>::~nsRefPtr<nsGenericElement>

I can infer information from a tree like this much more quickly than I can by looking at a bunch of individual reasonable-looking stacks.  It would be useful even if you only do it for 50 crash reports at a time, since that's better than the 10 I can handle on my own, but ideally all would be included.

Each node should be collapsible / expandable, and clicking any node should list crash reports that match the stack down to that point.
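
To make idea B concrete, here is a minimal Python sketch of the aggregation behind such a tree. It assumes each crash report has already been reduced to a list of frame names, outermost shared frame first; StackNode, build_stack_tree, render and reports_matching are hypothetical names, not anything that exists in Socorro.

```python
from dataclasses import dataclass, field


@dataclass
class StackNode:
    """One frame in the tree, with the number of reports passing through it."""
    count: int = 0
    children: dict = field(default_factory=dict)


def build_stack_tree(stacks):
    """Merge lists of frame names into a prefix tree with per-node counts."""
    root = StackNode()
    for frames in stacks:
        root.count += 1
        node = root
        for frame in frames:
            node = node.children.setdefault(frame, StackNode())
            node.count += 1
    return root


def render(node, indent=0):
    """Return indented 'count frame' lines, most frequent child first."""
    lines = []
    ordered = sorted(node.children.items(), key=lambda item: -item[1].count)
    for frame, child in ordered:
        lines.append(f"{'  ' * indent}{child.count} {frame}")
        lines.extend(render(child, indent + 1))
    return lines


def reports_matching(stacks, prefix):
    """Return the stacks that match the given frames down to that point."""
    prefix = list(prefix)
    return [frames for frames in stacks if frames[: len(prefix)] == prefix]
```

Calling "\n".join(render(build_stack_tree(stacks))) yields text in the same shape as the example tree above. reports_matching is the lookup that clicking a node would need. Collapsing and expanding would be a UI concern on top of this.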
Whiteboard: [crashkill][crashkill-metrics]
Something like what I did in the attachment on bug 568487

 https://bug568487.bugzilla.mozilla.org/attachment.cgi?id=447758 to break down nsIFrame::BuildDisplayListForChild 

or bug 568405 for js_CallTracer ( https://bug568405.bugzilla.mozilla.org/attachment.cgi?id=447684 )

might be useful as well. Simply listing out the top 6 or 7 frames of the stack horizontally, then allowing easy sorting and counting of like frames, is a quick way to get at the most frequent stacks that are similar.
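
The counting of like frames described here could be sketched in Python roughly as follows. The input is assumed to be one list of frame names per crash report, and count_top_frames is a hypothetical helper, not how the linked attachments were actually produced.

```python
from collections import Counter


def count_top_frames(stacks, depth=6):
    """Count crash reports by their top `depth` frames, most frequent first."""
    # Tuples are hashable, so each distinct frame prefix becomes one Counter key.
    prefixes = Counter(tuple(frames[:depth]) for frames in stacks)
    return prefixes.most_common()
```

Each entry in the result pairs a frame tuple with the number of reports that share it, so joining the tuple with " > " gives one horizontal line per distinct stack.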
re: comment 0
> We usually end up opening 10 random crashes just to get an idea of whether the
> crashes all look the same.  This is how we learn whether some stacks are more
> useful than others, or whether the crashes all seem to be due to the same bug.

I think the reports that we have now like http://people.mozilla.org/crash_stacks/stack-summary-4.0b10pre.txt help to answer some of these questions.

The key is to get something like that into production and be able to run it over any sample of crashes beyond just the top 300.
Component: Socorro → General
Product: Webtools → Socorro
We have signature summaries which I think help with this. Is the tree-of-stacks thing something you still want?
I personally don't find the signature summaries helpful at all for determining the number of bugs or finding the most useful stack trace. But you should ask someone who spends a lot of time in crash stats (like kairo) or fixes crash bugs (like bsmedberg or dbaron) how much tree-of-stacks would help.
I haven't had much to do with crashes lately, but I never found the signature concept that great (though admittedly better than not having it).  It is a rather poor approximation to clustering, as I wrote in http://dbaron.org/log/20101111-crash-future
Socorro could definitely have a stronger Information Architecture, but in order to proceed in a useful manner we will need more specific bugs and a clearer set of ideas.

Crashes in data platform should enable experimentation with alternative signature generation algorithms. There's also a current project to extract and unify signature generation between crash reports and crash pings that may be interesting to observers of this bug.
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → WONTFIX