Open Bug 1280658 Opened 8 years ago Updated 3 years ago

Figure out a way to generate common crash signatures between separate systems

Categories

(Socorro :: Signature, task)

Tracking

(Not tracked)

People

(Reporter: ted, Unassigned)

Details

I filed bug 828452, which is one way to do this, but after talking to decoder and lonnen I'm not sure.

We'd like to take crashes from our disparate systems (crash-stats, fuzzing, automated tests, etc) and be able to generate a common signature so that we can map crashes to the same bucket between them. This would help us recognize when a fuzzing bug and a crash in an automated test have the same root cause, for example, or when a crash in Mochitest and a crash our users report are the same thing.

This is a little vague right now. We floated a few ideas, and we need to nail down exactly what we want.
This could be done as a library or as a service. Service seems like the better option right now -- it will give us the ability to quickly update all consumers in lockstep rather than waiting for upgrades from each consumer. It also centralizes all inputs to facilitate experimentation with new means of signature generation.

Right now, I favor factoring out the signature generation portion of the crash-stats ingestion pipeline and then enhancing that.
I got started working on a quick prototype for this. I'm going the service way, building a very simple REST API that will accept a stack trace (C or Java) and return a signature. Most of it is copy / pasting from Socorro's signature generation script.
Assignee: nobody → adrian
I think the first step here should be agreeing on a standard for what a crash signature actually looks like. We also have an existing crash signature standard that we use in fuzzing (see https://wiki.mozilla.org/Security/CrashSignatures ) and a library for processing crash input from various sources (Minidump, GDB, ASan, CDB, Apple Crash Reporter and some others) and for creating and processing signatures (see https://github.com/MozillaSecurity/FuzzManager/tree/master/FTB/Signatures ).

Maybe this helps. No matter what kind of standard this should be, it should be available as a library as well because other systems should not depend on some service to generate signatures. But that shouldn't be a problem if the service uses the same library.
Here's my current prototype's code: https://github.com/adngdb/crash-signature-service

It is a simple extraction of the main part of Socorro's signature generation algorithm. It only takes a list of frames as input, and will generate a signature based on that. There's a bit of documentation in the repo's README.
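
As a rough illustration of the "list of frames in, signature out" idea, here is a minimal Python sketch. It is not the prototype's code or Socorro's actual algorithm: the skip prefixes, the prefix-frame set, the frame cap, and the `EMPTY` fallback string are all made up for the example.

```python
# Hypothetical, simplified rule data; not Socorro's real lists.
IRRELEVANT_PREFIXES = ("@0x",)               # assumed: unsymbolicated frames are skipped
PREFIX_FRAMES = {"abort", "mozalloc_abort"}  # assumed: generic frames we keep walking past
MAX_FRAMES = 5                               # assumed cap on signature length


def generate_signature(frames, delimiter=" | "):
    """Build a signature from an ordered list of frame names (top of stack first)."""
    parts = []
    for frame in frames:
        name = frame.strip()
        if not name or name.startswith(IRRELEVANT_PREFIXES):
            continue  # skip frames that carry no useful name
        parts.append(name)
        if name not in PREFIX_FRAMES or len(parts) >= MAX_FRAMES:
            break  # stop at the first non-prefix frame, or once the cap is reached
    return delimiter.join(parts) if parts else "EMPTY"


signature = generate_signature(["mozalloc_abort", "@0x1234", "nsFoo::Bar", "main"])
```

With these assumed lists, the generic `mozalloc_abort` frame is kept as a prefix, the unsymbolicated frame is dropped, and the signature ends at `nsFoo::Bar`.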
Yeah, this is not the kind of signature we had in mind when we started this discussion, and it is most likely insufficient for anything other than Socorro.
Forgot to add: The service could of course serve as a basis for generating signatures based on some other algorithm. But the initial generation part isn't the harder thing, and once you have a library, you might not need a service at all (of course it saves you installing things locally though). The harder part to me is standardizing something that includes all the features that all participants need, and also implementing clustering/optimization algorithms for these signatures on top.
I understand that this is not what you need for the Fuzzer, but this is a first step. Before we commit to making something big that works for everyone, we want to make sure that there is value in creating a way to tie crashes together across multiple systems. 

In order to do that, we are building a service that will be as simple as possible, and can work with all other systems. As far as I know, the smallest common denominator for crash data is the stack frames. So this service starts with just that: give it a list of frames and it will return a signature. 

We want to take some crashes from each system, run them through that service, collect the results, and see if that produces interesting, coherent and useful information. We are worried that, with each system being slightly different, we might end up with misleading, confusing or useless connections between their buckets.

There are many ways in which we can improve this, make a library, etc. We'll work on that later on. :)
One of the outcomes of our work for the fuzzing was that simple stack based signatures do not work. Especially if you try to correlate different crash sources. That is why we came up with a more flexible signature system that allows e.g. partial stack matching etc. I think trying to extend Socorro's way of bucketing things to all other systems won't really help us.
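
One possible reading of "partial stack matching" is sketched below in Python: a pattern matches when its frames appear as a contiguous run anywhere in the stack, with `"?"` standing in for any single frame. This is only an assumption made for illustration; it is not FuzzManager's implementation, and the function name and wildcard convention are hypothetical.

```python
def matches_partial_stack(pattern, stack):
    """Return True if `pattern` occurs as a contiguous run anywhere in `stack`.

    A "?" entry in the pattern matches any single frame (assumed wildcard).
    """
    n = len(pattern)
    for start in range(len(stack) - n + 1):
        window = stack[start:start + n]
        if all(p == "?" or p == f for p, f in zip(pattern, window)):
            return True
    return False


stack = ["abort", "js::Foo", "js::Bar", "js::Baz", "main"]
hit = matches_partial_stack(["js::Foo", "?", "js::Baz"], stack)
miss = matches_partial_stack(["js::Foo", "js::Baz"], stack)
```

Because the pattern does not have to start at the top of the stack, two crash sources that add different wrapper frames above the interesting ones can still land in the same bucket.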
Socorro signature generation is actually more involved than just using stack frames, FWIW. The extra rules are not mentioned in the documentation here:
https://github.com/mozilla/socorro/tree/master/socorro/siglists

...but we definitely have a bunch of rules that generate signatures based on non-stack info, like the one that generates "OOM | small" and "OOM | large" signatures. They all seem to be defined in this file:
https://github.com/mozilla/socorro/blob/ed67c74e4d831f9c5df77dc70348351ed27ec6b6/socorro/processor/signature_utilities.py#L538

I filed bug 1306643 on documenting those, FWIW.
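
To show what a rule based on non-stack info can look like, here is a minimal Python sketch of an OOM-style rule. It is not copied from `signature_utilities.py`: the annotation key, the size threshold, and the choice to keep the stack-based part for large allocations are assumptions made for the example.

```python
SMALL_ALLOCATION_LIMIT = 256 * 1024  # assumed threshold in bytes, for illustration only


def apply_oom_rule(signature, annotations):
    """Rewrite a stack-based signature using a non-stack annotation.

    `annotations` is a dict of crash metadata; "OOMAllocationSize" is the
    assumed key holding the size of the failed allocation.
    """
    size = annotations.get("OOMAllocationSize")
    if size is None:
        return signature  # this rule does not apply; leave the signature alone
    if int(size) <= SMALL_ALLOCATION_LIMIT:
        return "OOM | small"
    # Assumed behavior: keep the stack-based part so large OOMs stay distinguishable.
    return "OOM | large | " + signature


small = apply_oom_rule("nsFoo::Bar", {"OOMAllocationSize": "1024"})
large = apply_oom_rule("nsFoo::Bar", {"OOMAllocationSize": 10 * 1024 * 1024})
```

The point is that the input is more than a list of frames, so any shared signature generator has to accept crash metadata as well.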
I hope that we move away from crash signatures, and move to a more generic categorization/tagging system for crashes. 

I would like to define the following:

* Rule – A named, purely functional expression used to identify a particular type of crash.
* Signature – A (rule, crash) pair.

##More about rules
A rule must be machine readable and reasonably easy on the eyes. Generic candidates already exist. For example, the SQL `where` clause is a human-readable mini-language that can filter on various properties. Elasticsearch and MongoDB both use expressions (in JSON format) that can apply complex filters to whole JSON documents. FuzzManager [1] is doing something like this too, using “crash signature specifications” [2] and “symptoms” representing domain-specific patterns. I am suggesting a slightly more general language be used to represent the complex pattern-matching logic found in both FuzzManager and Socorro.

Rules must be purely functional so they can be safely sent to the (database) server and applied fast at query time. We can continue to annotate crashes with the matching rules for awesome query response time, but it is important to be able to apply rules at query time so people can explore.

Elasticsearch can filter and aggregate on rules very fast, and it is serendipitous that Socorro uses it. Too bad scripting is turned off.

##More about Signatures 
By defining a signature as a (rule, crash) pair, we can have multiple rules tag a single crash. Each team can have their own set of rules that do not interfere with others’. New rules can be defined to explore the usefulness of different filters.
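
A minimal Python sketch of the (rule, crash) idea follows. The `Rule` type, the rule names, and the crash fields are all hypothetical; the only thing taken from the definitions above is that a rule is a named pure predicate and that one crash can be tagged by several rules.

```python
from typing import Callable, NamedTuple


class Rule(NamedTuple):
    name: str
    matches: Callable[[dict], bool]  # must be pure: no I/O, no mutation of the crash


def signatures_for(crash, rules):
    """Return every (rule name, crash id) pair whose rule matches the crash."""
    return [(rule.name, crash["id"]) for rule in rules if rule.matches(crash)]


# Hypothetical rules owned by different teams.
rules = [
    Rule("low-address", lambda c: c.get("address", 0) < 0x1000),
    Rule("linux-only", lambda c: c.get("os") == "linux"),
    Rule("fuzzing-build", lambda c: c.get("product") == "fuzz-shell"),
]

crash = {"id": "crash-1", "os": "linux", "address": 0x10, "product": "firefox"}
tags = signatures_for(crash, rules)
```

Here the single crash ends up with two signatures, one per matching rule, and adding an experimental rule does not change the tags produced by the existing ones.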

##Specific Suggestion
I suggest using JSON expressions [3] for declaring the rules. This is my work, which has evolved from experience using Elasticsearch’s filter expressions; some of the language irregularities have been removed, and some more operators have been added.  I believe it is a good candidate for rule specification because it maps well to Elasticsearch scripts, and has shown itself to be expressive enough to handle categorization in Bugzilla bugs, buildbot jobs, test results, and TC tasks.

FYI: JSON expressions are part of the larger JSON Query Expressions [4], which I am NOT advocating for at this time.

##Example
JSON expressions are used to categorize and partition Bugzilla bugs for various charts at query time [5]. This is much like how I imagine signatures can be handled: there are multiple (“name”, “filter”) pairs defining rules, and rules can be packaged into (binary) “dimensions” that mark bugs as matching or not. Rules can also be packaged into an ordered list, with the first matching rule deciding how a bug is tagged, which creates an effective partition over all bugs. I am not advocating these more complicated dimensions and partitions at this time; I only wanted to point out how reifying rules allows us to compose more complex, and useful, aggregate rules.
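
The ordered-list packaging can be sketched in a few lines of Python. This is only an illustration of "first matching rule wins"; the rule names, the crash fields, and the `"other"` fallback bucket are hypothetical and not taken from the charts code in [5].

```python
from collections import Counter


def first_match(crash, ordered_rules, default="other"):
    """Return the name of the first rule that matches, so each crash gets one label."""
    for name, predicate in ordered_rules:
        if predicate(crash):
            return name
    return default


ordered_rules = [
    ("oom", lambda c: c.get("reason") == "oom"),
    ("null-deref", lambda c: c.get("address", 1) == 0),
]

crashes = [
    {"reason": "oom", "address": 0},    # matches both rules; "oom" wins by order
    {"reason": "segv", "address": 0},
    {"reason": "segv", "address": 64},  # matches nothing; falls into "other"
]
counts = Counter(first_match(c, ordered_rules) for c in crashes)
```

Because every crash receives exactly one label, the per-label counts add up to the total number of crashes, which is what makes the ordered list a partition.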

##Better Example
Crash Signatures [2] is a good place to begin.  I am suggesting being more explicit about the operations being performed, so the rules are more general.

**Crash Signature**
> {
> 	"product":"test",
> 	"platform":"x86",
> 	"os":"linux",
> 	"symptoms":[{"type":"crashAddress","address":"< 0x1000"}]
> }

**JSON Expression**
> {"and":[
> 	{"eq":{
> 		"product":"test",
> 		"platform":"x86",
> 		"os":"linux"
> 	}},
> 	{"lt":{"address":"0x1000"}}
> ]}
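
To make the comparison concrete, here is a minimal Python sketch that evaluates the JSON expression above against a crash represented as a dict. It is not the JSON Expressions implementation from [3]: it covers only `and`, `or`, `eq` and `lt`, it assumes that an `eq` or `lt` with several keys requires all of them to hold, and it assumes addresses may arrive as hex strings.

```python
def to_number(value):
    """Accept ints or strings such as "0x1000" (base auto-detected by int)."""
    return value if isinstance(value, int) else int(value, 0)


def evaluate(expr, doc):
    """Evaluate a tiny, assumed subset of the expression format against a dict."""
    (op, params), = expr.items()  # each expression holds exactly one operator
    if op == "and":
        return all(evaluate(e, doc) for e in params)
    if op == "or":
        return any(evaluate(e, doc) for e in params)
    if op == "eq":
        return all(doc.get(k) == v for k, v in params.items())
    if op == "lt":
        return all(k in doc and to_number(doc[k]) < to_number(v)
                   for k, v in params.items())
    raise ValueError(f"unsupported operator: {op}")


rule = {"and": [
    {"eq": {"product": "test", "platform": "x86", "os": "linux"}},
    {"lt": {"address": "0x1000"}},
]}
crash = {"product": "test", "platform": "x86", "os": "linux", "address": "0x10"}
matched = evaluate(rule, crash)
```

Since `evaluate` only reads its inputs, the same rule could in principle be applied at ingestion time or at query time with identical results.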

##Alternate Suggestion
Given that crash stats data is moving to Telemetry, with all its SQL powers, rules could be defined as a SQL `where` clause. This suggestion has a few drawbacks, though:

1. SQL is bad at filtering hierarchical data, unless you use all of SQL or have some convention about which tables are accessible. NoSQL query filters handle these nested objects better.
2. SQL is not extensible. Inevitably, there will be some common, or complex, filtering logic we will want to give a name to; JSON Expressions, with their {operator: parameters} format, make it easy to add domain-specific filtering operators.
3. SQL is tedious to compose. Safely composing SQL requires text-munging libraries, and compounding SQL rules will require a different language.

##Alternate Suggestion
Use a subset of Common Lisp [10]. IMHO, JSON expressions can be seen as a cheap JSON version of Common Lisp.


Thank you


[1] https://github.com/MozillaSecurity/FuzzManager
[2] https://wiki.mozilla.org/Security/CrashSignatures
[3] JSON Expressions: https://github.com/klahnakoski/ActiveData/blob/dev/docs/jx_expressions.md
[4] Query Expressions (not advocated): https://github.com/klahnakoski/ActiveData/blob/dev/docs/jx.md
[5] JSON Expressions in use defining dimensions over the Bugzilla bugs: https://github.com/mozilla/charts/blob/platform/platform/modevlib/Dimension-Bugzilla.js
[10] https://en.wikipedia.org/wiki/Greenspun%27s_tenth_rule
(In reply to Kyle Lahnakoski [:ekyle] from comment #10)
> ##More about rules
> A rule must be machine readable, and reasonably humane to the eyes.  Generic
> candidates already exist:  For example the SQL `where` clause is a human
> readable mini language that can filter on various properties. 
> Elasticsearch, and MongoDB both use expressions (in JSON format) that can
> apply complex filters to whole JSON documents.  FuzzManager [1] is doing
> something like this too:  Using “crash signature specifications”[2], and
> “symptoms” representing domain specific patterns.  I am suggesting a
> slightly more general language be used to represent complex pattern matching
> logic found in both FuzzManager and Socorro.

I like this part of your proposal. I really liked the simplicity of FuzzManager's definitions, especially compared to the (admittedly powerful, but difficult to change) way that Socorro's signature generation rules are specified.


> ##More about Signatures 
> By defining a signature as a (rule, crash) pair, we can have multiple rules
> tag a single crash.  Each team can have their own set of rules that do not
> interfere with others’.  New rules can be defined to explore the usefulness
> of different filters

I do think this has value--there are certainly going to be things that have different usefulness to different groups. However, I still think there's value to having a canonical signature for a crash, so that we can continue to generate things like the topcrash list. Bucketing crashes is a hard problem, and we'll probably never be perfect at it, but there's value in knowing that certain crashes have a specific volume, and that two crashes fall into the same bucket.
(In reply to Ted Mielczarek [:ted.mielczarek] from comment #11)
> However, I still think there's value to having a canonical signature
> for a crash, so that we can continue to generate things like the 
> topcrash list. 

I see these new definitions being able to support our current bucketing and signature methods; I hope you agree there is no conflict. I would expect that some rules are labeled "canonical", and serve the purpose of performing bucketing, just as we do now.  

I agree, bucketing will probably never be perfect. I would certainly expect there to be uncertainty over which bucket a crash should belong to, if only because we do not have enough information to be sure. In cases of uncertainty, the current strategy of partitioning crashes into buckets may splinter a common problem into many buckets, and that splintering would be invisible to us. Using the (rule, crash) pairs, we can define families of crashes that overlap; a single crash can be double counted to better reflect our uncertainty, and to better coalesce similar crashes. Furthermore, population counts from any two (possibly overlapping) families can be legitimately compared to each other: we can still have a topcrash list of families, even if they overlap.

We do lose the ability to sum the population counts of multiple crash families to get the population of their union, but we can still union families of crashes:

> {"or":[
> 	{"match-rule1": <param1>},
> 	{"match-rule2": <param2>},
> 	{"match-rule3": <param3>}
> ]}
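
The counting point can be checked with a small Python sketch. The family names and crash ids are made up; the sketch only shows that summing the counts of overlapping families overstates the size of their union, while taking the union of the families themselves does not.

```python
# Hypothetical families: the set of crash ids matched by each rule.
families = {
    "rule1": {"c1", "c2", "c3"},
    "rule2": {"c3", "c4"},  # "c3" is matched by both rules
}

counts = {name: len(ids) for name, ids in families.items()}
summed = sum(counts.values())             # double counts "c3"
union = set().union(*families.values())   # what the "or" expression selects
```

Each family's own count is still meaningful and comparable with the others, so a topcrash-style ranking of families works; only the sum across families stops being a population.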
Assignee: adrian → nobody
I've been experimenting with socorro-siggen, which is a Python library with Socorro's signature generation extracted into it. It was used over the summer to generate signatures from crash ping data in Telemetry. It's still "alpha-quality", but usable. It generates Socorro-style signatures.

Based on the comments in this bug, it seems like there's some interest in experimenting with other signature generation algorithms. That's cool, but it's not something I'm going to pursue any time soon. Someone else will have to take that ball and run with it, and we can see where it goes.
Component: General → Signature