Closed Bug 865791 Opened 12 years ago Closed 12 years ago

Allow for arbitrary classification of crashes at processing and on-demand

Categories

(Socorro :: Backend, task)

Platform: x86_64 Windows 7
Type: task
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: benjamin, Assigned: lars)

References

Details

(Whiteboard: [qa-])

Here's what I think I'd like to do about hang processing; it will also help solve a few other problems I've been dealing with.

I would like a way to do arbitrary classification of crash reports. The classifier will be a python function (or class?). The classifier will receive the data from hbase (specifically the processed JSON). The classifier will produce:

    classification: "string"
    associateddata: "valuestring" [optional]

Because we may revise classifications regularly, when we store a classification we should also store the version number of the classifier which produced it. The classifications should be stored in both hbase and postgres. I'm going to throw out these strawmen and let the experts refine as necessary:

* Add an hbase family "classifications": each classification will be stored separately as a little JSON, e.g.

    classifications:flash-hang-classification
        '{version: 3, classification: "bugNNNNNN", associateddata: "highcpu"}'

* In postgres, have a partitioned join table with the data:

    CREATE TABLE classifications (
        uuid text NOT NULL,
        classification text NOT NULL,
        version integer NOT NULL,
        associateddata text NULL
    );

As for applying classifications, I propose the following setup:

* At processing time, each classifier decides whether it wants to process a particular report. For example, the Flash hang classifier would run only for Firefox plugin hang reports which have all four Flash minidumps in the report (see the sketch after this comment).
* The API will expose a way to request classification of other queries of crashes as requested by authenticated users.
* The middleware and the new web-facing API layer will expose classifications.
* Classifiers will be deployed as part of the regular Socorro release cycle. I'm not sure whether we should just check them into the Socorro codebase, or whether they should live in some separate "config" repository (they are going to be very specific to Mozilla's use-cases).
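Purely for illustration, a minimal sketch of such a classifier as a python function. The function name and the input keys ('hang_type', 'dump_names') are hypothetical stand-ins, not actual Socorro fields:

    def classify_flash_hang(processed_json):
        """Return a classification dict, or None to decline this report."""
        # only run for plugin hang reports carrying all four Flash minidumps
        if processed_json.get('hang_type') != 'plugin':
            return None
        dumps = set(processed_json.get('dump_names', []))
        if not {'plugin', 'browser', 'flash1', 'flash2'} <= dumps:
            return None
        return {
            'version': 3,                   # classifier version, stored with the result
            'classification': 'bugNNNNNN',  # placeholder, as in the strawman above
            'associateddata': 'highcpu',    # optional
        }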
great idea and I don't think it'll be difficult at all to implement. would you consider the signature generation classes to be classifiers? They seem to fit your description. They're in some ways a degenerate example because they always apply themselves to every crash:

    { version: 1, classification: 'CSignatureTool', associated_data: 'some | signature' }
    { version: 1, classification: 'JavaSignatureTool', associated_data: 'some Java signature' }

I envision a table in the database that controls what classifiers are loaded. When the processor starts, it queries this active_classifiers table:

    CREATE TABLE active_classifiers (
        name text NOT NULL,
        class text NOT NULL  -- a qualified python class for dynamic loading
    );

The classifier code object (likely a class) would be dynamically loaded at run time by configman. The processor will hold a collection of classifiers. The worker threads would iterate through the classifiers after the basic processing has been done on a crash. Each classifier class implements an API that accepts a raw & processed crash. It either returns None (the classifier chose not to apply itself) or the values appropriate for bsmedberg's 'classifications' table.

This fits so well with my idea of having multiple signature generation systems.
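A sketch of the processor-side loop described above. 'load_class' stands in for configman's dynamic class loading and 'fetch_active_classifiers' for the query against the active_classifiers table; both are hypothetical names, not real APIs:

    class Processor(object):
        def __init__(self, config):
            # one classifier instance per row of active_classifiers
            self.classifiers = [
                load_class(row['class'])(config)        # hypothetical loader
                for row in fetch_active_classifiers(config)
            ]

        def apply_classifiers(self, raw_crash, processed_crash):
            # runs in a worker thread after basic processing of the crash
            results = []
            for classifier in self.classifiers:
                result = classifier.classify(raw_crash, processed_crash)
                if result is not None:      # None means "not applicable"
                    results.append(result)  # rows for the classifications table
            return results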
Yes, I think you could certainly make CSignatureTool and JavaSignatureTool be classifiers. You'd still need logic to pick "the primary signature" somehow.
lars: we'd like to get this done pretty quickly, say in the next few weeks. How will that fit around your PTO? (Beforehand? Should somebody else (selena, Erik) work on it while you're away and get review when you return? Something else?)
I'm in the process of prototyping this right now. I'm pleased to report that I can exploit the existing TransformRules structure to execute the classifiers. That means that the list of classifiers can reside in the existing rules table.

A classifier would exist in two parts: a predicate (that would evaluate whether a classifier is applicable) and the classifier itself, which would accept the raw and processed crash as input. Output of the classifier gets saved back to the processed crash. Saving the classifications to HBase and Postgres will just fall through to the crash storage system (no modifications to the HBase crashstore, minor mods to the Postgres crashstore).

In my first cut implementation of this, the regular signature generation routines will not be changed. However, in the second cut, I'll reimplement the signature generation as a classifier rule. Finally, we can add a final rule in the classifier system that can look through all the classifiers to select what should be the "primary signature".

I expect to have a PR for the first cut by the end of the week (if I'm not interrupted). There will have to be some additional bugs to handle the UI enhancements. Closely related is Bug 864396 ("signature generation tool w/ skiplist from a database"), which ought to ship first.
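To make the two-part shape concrete, a sketch of a predicate/classifier pair in the style of the TransformRule functions. The raw_crash field names ('ProcessType', 'Hang') and the exact signatures are illustrative, not the final API:

    def flash_hang_predicate(raw_crash, processed_crash, processor):
        # predicate: is this classifier applicable to this crash?
        return raw_crash.get('ProcessType') == 'plugin' and 'Hang' in raw_crash

    def flash_hang_classifier(raw_crash, processed_crash, processor):
        # classifier: its output is saved back into the processed crash
        processed_crash.setdefault('classifications', {})['flash_hang'] = {
            'version': '1.0',
            'classification': 'bugNNNNNN',
            'associated_data': 'highcpu',
        }
        return True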
In my prototype of this, within the processed_crash, the result of a set of classifications looks like the example below.

    {
        ...
        'classifications': {
            'my_first_classifier': {
                'version': '1.0',
                'classification': 'misery',
                'associated_data': 'lars made this up'
            },
            'legacy_signature': {
                'version': '3.0',
                'classification': 'traditional | piped | signature',
                'associated_data': 'CSignature'
            },
            'my_second_classifier': {
                'version': '0.0Alpha',
                'classification': 'mayhem',
                'associated_data': 'processor2013:2383 running with 24 threads'
            }
        },
        ...
    }

Classifiers are applied after the MDSW portion of crash processing. The classifier itself consists of two callables: a predicate and an action. Both functions have the same signature:

    raw_crash - a python dictionary of the original submission
    processed_crash - a python dictionary of the processed crash
    processor - a reference to the processor itself; this gives the classifier access to the processor's configuration, logging, name, and any other resources available to the processor (database connection, hbase, elastic search, etc.)

If the predicate callable returns True, then the action will run. Classifications are saved in an ordered collection and therefore run in a predictable order. The results of one classification are available to subsequent classifications.

In this first cut, a classifier is not limited to just adding entries to the 'classifications' element of the processed crash. A classifier may treat the processed_crash as a read/write container and may add/delete/alter any element within.

Feedback, pleeez?
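Because the rules run in a predictable order and the processed_crash is read/write, a later rule can consume an earlier rule's output. A hypothetical final rule that picks the "primary signature", using the key names from the example above:

    def primary_signature_action(raw_crash, processed_crash, processor):
        """Promote an earlier classification to the top-level signature."""
        classifications = processed_crash.get('classifications', {})
        legacy = classifications.get('legacy_signature')
        if legacy is not None:
            # read/write container: free to alter a top-level element directly
            processed_crash['signature'] = legacy['classification']
            return True
        return False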
One important *future* use-case is that we will want to rerun certain classifications later. Certain classifications may be more expensive or only needed on a smaller set of crashes. It would be nice to be able to do this without completely reprocessing a crash (rerunning MDSW). This seems to indicate that it would be better if processed_crash were a read-only container. What's the advantage of it being a read/write container?
> What's the advantage of it being a read/write container?

my prototype implementation is based on the TransformRule system for raw crashes. In order for the raw_crash transform system to work, it has to have complete read/write access to the raw_crash container. It was trivial to extend that system to do your classification request - by extension, this new category of rule has the same read/write abilities on its processed_crash container as the raw_crash transform rule category has on the raw_crash containers.

This feature is a boon to future Socorro processing changes. Adding new processing behaviors becomes as simple as adding a reference to the code for the new behavior into a table in Postgres. The processors will just pick up the new behavior without needing recoding. Your classifications would be the first example of rapidly deploying new behavior. A second example would be to facilitate a smooth transition to jsonMDSW by just adding an interim rule/code for translation of a pipe dump into a json dump.

> One important *future* use-case is that we will want to rerun certain
> classifications later.

does this mean that you may have an ad hoc classification that you wish to run against a subset of crashes that isn't to be applied to the general population by the processors? I suspect I may be able to handle the case of avoiding rerunning MDSW with the predicate function that accompanies each transform rule. If the application of MDSW were to be recoded as one of these rules, it could have a companion predicate that could decline to rerun MDSW when the reprocessing request comes with a tag saying that the reprocessing is for application of classifications (see the sketch after this comment).
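A hypothetical sketch of that MDSW-skipping companion predicate; the tag name 'reprocessing_reason' and its value are made up for illustration:

    def mdsw_predicate(raw_crash, processed_crash, processor):
        """Companion predicate for an MDSW-as-a-rule: decline to rerun the
        minidump stackwalker when reprocessing was requested only so that
        classifications could be applied."""
        if raw_crash.get('reprocessing_reason') == 'classifications_only':
            return False
        return True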
> does this mean that you may have an ad hoc classification that you wish to
> run against a subset of crashes that isn't to be applied to the general
> population by the processors?

Yes. We sometimes currently have a signature which is actually several different bugs, and we'd like to perform a subgrouping analysis on just that signature. It's not an immediate priority, but I'd like to keep that use-case in mind.
> It's not an immediate priority, but I'd like to keep that use-case in mind.

I'm confident that I can accommodate that use case. :bsmedberg, do you have a sample classifier that I can adapt to this scheme? I'd like to see how it looks with something actually useful rather than my contrived examples.
I now have a 1st cut of this system running on a branch of my github repo. I've reimplemented all of the classifiers in bsmedberg's classifier.py file (and added unit tests). I've made some assumptions that might be faulty, but we can change them if necessary. The use of the word "Skunk" in this code is not permanent, unless no one objects. I find it amusing.

see https://github.com/twobraids/socorro/blob/class-json2/socorro/processor/skunk_classifiers.py to get an idea on how they've been transformed to fit Socorro's TransformRule system.

Making a new rule requires making a subclass of the SkunkClassificationRule class and overriding the 'action' method. The action method receives as parameters: the raw crash, the processed crash, and the class that implements the processor itself. The base SkunkClassificationRule class provides helper functions to add a classification to a processed crash, or a derivative may write to the processed crash directly.

This system adds a new branch to the processed_crash json mapping: a new top level key called "classifications". Under that, this system will create a sub-key called "skunk_works". Under that is "classification", "classification_data", and "classification_version".

The rules are added to the processor through the TransformRules table in the database (GUI access pending). To add or remove rules, rows with the fully qualified class name are either added or removed from the table. Eventually, this will allow for rules to be added/removed without restarting the processor.

At this point in time, applying classifiers to crashes after they've initially been processed will require reprocessing (this is likely to change in the future).

It is not necessary for the classification rules to deal with the MDSW Pipe Dump. Since we're eventually going to use the json dump, I've added a PipeDumpToJsonDump conversion - we won't have to reimplement the classifiers when we move to jsonMDSW.
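For orientation, a sketch of what a new rule looks like under this scheme. The 'version' and '_add_classification' names reflect my reading of the branch and may change before the PR lands; the rule itself is contrived, not one of the real classifiers:

    from socorro.processor.skunk_classifiers import SkunkClassificationRule

    class HighCPUFlashHang(SkunkClassificationRule):
        """a made-up example rule, for illustration only"""

        def version(self):
            return '1.0'

        def action(self, raw_crash, processed_crash, processor):
            if raw_crash.get('ProcessType') != 'plugin':
                return False  # rule declines this crash
            # base-class helper fills in the classifications.skunk_works keys
            self._add_classification(
                processed_crash,
                'high-cpu-flash-hang',  # -> "classification"
                'illustrative detail'   # -> "classification_data"
            )
            return True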
from IRC:

(10:03:04 AM) lars: bsmedberg: the skunk classifier output will be part of the processed_crash json that is sent to both ES and HBase. Considering the search capabilities of ES, do you still need the classifier table in PG?
(10:03:22 AM) bsmedberg: lars: I don't know the answer to that. Probably not ...
(10:08:12 AM) lars: bsmedberg: it would be most excellent to not have to do PG support right now. If we find we want PG support in the future the path to do so isn't too difficult.
(10:08:31 AM) bsmedberg: yeah, I think that sounds fine
in the skunk processor's function 'filterUnwantedReports', this code appears to require that every dump has a crashed thread. Is that correct?

    for dump in report.dumps.itervalues():
        if dump.error or not dump.crashthread in dump.threads:
            metadict['error'] = True
            metadict['classifiedas'] = 'processing-error'
            break

In looking at a sampling of plugin crashes in Socorro, I have yet to find a crash where all the dumps identify a crashing thread. The four-dump flash crashes show crashed threads in the 'plugin' and 'browser' dumps, but not the 'flash1' and 'flash2' dumps. It appears to me that these crashes would be given a classification of 'processing-error'. Is that correct, or am I confused?
Flags: needinfo?(benjamin)
I've got this coded and passing unit, smoke and some ad hoc tests. I need actual crashes that would trigger each of the classifications for more thorough testing. Just some Socorro crash_ids from production would be sufficient as I can pull these myself. Or are all these crashes being diverted away from prod into your skunk processor?
Flags: needinfo?(benjamin)
Assignee: nobody → lars
Target Milestone: --- → 53
I'm only diverting the crashes from nightly/aurora, so we should be receiving these for beta/release. I did some poking and:

https://crash-stats.mozilla.com/report/index/5ad8e0cb-9b4f-4af4-b114-f76602130710 should be adbe-3355131
https://crash-stats.mozilla.com/report/index/7b7ac8b8-94e2-4d24-9ed3-580912130710 should be NtUserSetWindowPos | F_1378698112
https://crash-stats.mozilla.com/report/index/bb8a63a0-53a6-4ed3-a06b-517862130710 should be NtUserSetWindowPos | F_468782153

There are a bunch of other NtUserSetWindowPos hangs you can test for "other". I can't really find examples of the "Bug811804" or "Bug812318" classifications without running a jydoop job, and I don't think the webapp shows the multidump dumps anywhere (right?).
Flags: needinfo?(benjamin)
Target Milestone: 53 → 55
Target Milestone: 55 → 56
Target Milestone: 56 → 57
Commit pushed to master at https://github.com/mozilla/socorro

https://github.com/mozilla/socorro/commit/dd2a14c205b7cd3069dcd015823608880793b991
Merge pull request #1331 from twobraids/skunk-3-classifiers

Fixes Bug 865791 - skunk classifier system added to processor
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Depends on: 902447
Depends on: 902448
Changes for this bug are going out in 56.
Target Milestone: 57 → 56
Whiteboard: [qa-]