Add ability to search crash reports by module
Categories
(Socorro :: General, enhancement, P2)
Tracking
(Not tracked)
People
(Reporter: janet, Assigned: willkg)
Details
Attachments
(1 file)
With the impending release of Fenix, we want to be able to easily find and triage crashes reported to Socorro that are caused by the application services shared library.
At the moment, the only way for us to find these is to click through the search list, look at the stack trace of each report, and see whether libfenix.so (the name of the shared library produced by https://github.com/mozilla/application-services for Fenix) appears in the "Module" column of any of the frames on the stack of the crashing thread.
Ideally, we'd be able to search crash reports by module. Alternatively, it would help to be able to filter for crashes originating directly in the module in question, which is likely to still catch most of our problems since we statically link almost everything.
Comment 1 (Assignee) • 6 years ago
There's a proto_signature field, but that's function names--not modules. There's bug #1542964 which covers something related--being able to reprocess crash reports where a specific module/debugid is in the stack. I think the two are likely related enough that they can be solved together.
Also related is that I'm reworking a bunch of the Elasticsearch code so we can upgrade from a really old Elasticsearch to a current one. That'll definitely affect what we can do search-wise.
Janet: Is this urgent? Could we live without it for a month? Pretend I don't know what the Fenix or application services schedules look like.
Comment 3 (Reporter) • 6 years ago
I think a month or so seems perfectly reasonable. The big 2.0 release should be October, so if we are getting crashes and have some time before then to get the info we need to find and resolve them, that would be great.
Comment 4 (Assignee) • 6 years ago
Grabbing this to work on now.
Comment 5 (Assignee) • 6 years ago
I ran out of time to do the Elasticsearch upgrade. We can't index all the frames for all the stacks for all the crash reports--there are too many problems with doing that and I wouldn't be able to get that to work in the next week. So I'm looking at ways to cheat that get you answers to the questions you want.
When you say "module", are you talking about the same kind of modules as in the wiki here: https://wiki.mozilla.org/Modules/All
If so, what module are you working on, and do you have any example crash reports where that module shows up in the stack?
Comment 6 (Assignee) • 6 years ago
Janet: I messed up. I see what you mean by module now. Sorry about that.
So, I'm thinking of doing something like this:
- Write a processor rule that creates a modules_in_stack field. It'd walk the stack of the crashing thread and pull out modules. The value would be a ";" separated set of module/debugid strings. Since it's a set, if a module shows up twice in the stack, it only shows up once in the set.
- Add modules_in_stack to the search fields using the semicolon_analyzer. Then each module/debugid would be a term. It should be searchable using begins, ends, contains, and other string things. It should be facetable.
Then we should be able to do searches like:
- find all crashes with module libfenix.so (this bug)
- find all crashes with libc.so/037B12F7F23D7AD7A9262CB5A6ACCDA10 (bug #1542964)
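The processor rule proposed above could look roughly like the following. This is a minimal sketch, not the rule that was eventually merged: the function name, the layout of the processed crash (a "json_dump" with a "modules" list and a "crashing_thread" holding "frames"), and the sorted output order are all assumptions made for illustration.

```python
def modules_in_stack(processed_crash):
    """Return a ";"-separated set of module/debugid strings for the crashing thread.

    Assumed (hypothetical) input layout:
      processed_crash["json_dump"]["modules"] -> [{"filename": ..., "debug_id": ...}, ...]
      processed_crash["json_dump"]["crashing_thread"]["frames"] -> [{"module": ...}, ...]
    """
    json_dump = processed_crash.get("json_dump", {})

    # Map module filename -> debug id using the list of loaded modules.
    debug_ids = {
        module["filename"]: module.get("debug_id", "")
        for module in json_dump.get("modules", [])
        if "filename" in module
    }

    # Walk the crashing thread's frames and collect each module once.
    seen = set()
    for frame in json_dump.get("crashing_thread", {}).get("frames", []):
        name = frame.get("module")
        if name:
            seen.add(f"{name}/{debug_ids.get(name, '')}")

    # Sorting is an assumption; it just makes the value deterministic.
    return ";".join(sorted(seen))
```

Because the items are collected in a set, a module that appears in several frames contributes a single module/debugid entry to the field.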
I looked at 11,000 crash reports across Fenix and Firefox.
- median number of module/debugid items in the set: 3
- mean number of module/debugid items in the set: 4.7
- 95th percentile of module/debugid items in the set: 13
- max number of module/debugid items in the set: 19
If each module/debugid is 100 characters, then this is at most 2k to add per crash report. I think this is ok. We get a lot of utility out of "reprocess all the crash reports with module/debugid"--that comes up periodically.
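As a rough check of that size estimate, the arithmetic can be sketched like this. The 100-character item length is the assumption stated above, the item counts come from the sample of 11,000 crash reports, and the helper name is hypothetical.

```python
# Item counts observed in the sample; the item length is an assumed upper bound.
MAX_ITEMS = 19
MEDIAN_ITEMS = 3
ASSUMED_ITEM_LENGTH = 100


def field_length(items, item_length):
    """Length of a ";"-joined value: the items plus one separator between each pair."""
    return items * item_length + max(items - 1, 0)


worst_case = field_length(MAX_ITEMS, ASSUMED_ITEM_LENGTH)
typical = field_length(MEDIAN_ITEMS, ASSUMED_ITEM_LENGTH)
print(worst_case, typical)
```

Under these assumptions the worst case is 1918 characters, which is where the "at most 2k" figure comes from, and the median crash report would add about 300.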
Lonnen, Peter, Adrian, John: What do you think? Is this a good-enough stop-gap fix to cover some use cases now? Does this sound like a bad idea based on experience with previous attempts at similar things?
Comment 7 • 6 years ago
This sounds reasonable to me. I wasn't familiar with our semicolon_analyzer, but it is being used already for three other fields (app_init_dlls, topmost_filenames, and useragent_locale), so it should be safe to use it on a new fourth field.
Comment 8 • 6 years ago
I think this sounds useful. If we're concerned about load we could introduce it behind a sample rate, and crank that sample rate up over the course of a week.
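One way the sample-rate idea could be sketched is a deterministic gate keyed on the crash id, so that raising the rate only ever adds crash reports to the sample. This is a hypothetical illustration; the function name and the hashing scheme are assumptions, not anything Socorro is known to use.

```python
import hashlib


def in_sample(crash_id, sample_rate):
    """Return True if this crash id falls inside the sampled fraction (0.0 to 1.0)."""
    digest = hashlib.sha256(crash_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < sample_rate


# Hypothetical rollout: only compute modules_in_stack for sampled crash reports,
# then raise the rate over the course of a week.
for rate in (0.1, 0.5, 1.0):
    print(rate, in_sample("example-crash-id", rate))
```

Since each crash id always maps to the same bucket, a crash report that is sampled at a low rate stays sampled at every higher rate.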
Comment 9 • 6 years ago
I think this sounds quite fine to me. No alarms went up in my head while reading your description. :-)
Comment 10 (Assignee) • 6 years ago
Comment 11 (Assignee) • 6 years ago
willkg merged PR #4996: "bug 1559223, 1542964: add modules_in_stack to processed crash" in 03ffe0f.
I just pushed it out to prod in 2019.07.26. I'll keep an eye on Elasticsearch memory usage.
I'm going to keep this open until Monday. We need to wait for a new Elasticsearch index to be created in prod before searching will work and I want to verify it then.
Comment 12 (Assignee) • 6 years ago
The processor has created a new Elasticsearch index with the mapping we need, so "modules_in_stack" should be searchable for crashes received and processed today and going forward.
I'm using this query:
It's a "starts_with" query because the term will be something like "libfenix.so/ABCDEF123456778890" and we need to match just the first part.
That's currently bringing up no results. I think that's because there haven't been any crash reports reported and processed in the last 12 hours that have "libfenix.so" in the stack. If we search for "libxul.so", then it brings up tons of stuff. So while I can't verify the scenario underlying this bug, I'm pretty sure it should work fine once we get the requisite crash reports in.
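To illustrate why a "starts_with" match is needed, here is a plain-Python model of how the field value breaks into terms and how a prefix match behaves. It only imitates the idea of splitting on semicolons; it is not Elasticsearch or the actual semicolon_analyzer, and both function names are hypothetical.

```python
def semicolon_terms(value):
    """Split a ";"-separated field value into individual terms."""
    return [term.strip() for term in value.split(";") if term.strip()]


def starts_with(value, prefix):
    """Return True if any term in the field value begins with the prefix."""
    return any(term.startswith(prefix) for term in semicolon_terms(value))


# Example value built from the module/debugid strings mentioned in this bug.
value = "libc.so/037B12F7F23D7AD7A9262CB5A6ACCDA10;libfenix.so/ABCDEF123456778890"

print(semicolon_terms(value))
print(starts_with(value, "libfenix.so"))  # matches the term's module part
print(starts_with(value, "fenix"))        # no term begins with "fenix"
```

An exact match on "libfenix.so" would miss in this model because each term also carries the "/debugid" suffix, which is why the prefix form is used.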
I'm going to mark it FIXED. If there are still issues, please reopen. Hope that helps!