Closed Bug 1559223 Opened 6 years ago Closed 6 years ago

Add ability to search crash reports by module

Categories

(Socorro :: General, enhancement, P2)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: janet, Assigned: willkg)

Details

Attachments

(1 file)

With the impending release of Fenix, we want to be able to easily find and triage crashes reported to Socorro that are caused by the application services shared library.

At the moment, the only way for us to find these is to click through the search list and look at the stack trace of the reports, and see if it has libfenix.so (the name of the shared library produced by https://github.com/mozilla/application-services for Fenix) in the "Module" column of any of the frames on the stack of the crashing thread.

Ideally, we'd be able to search crash reports by module. Alternatively, it would help if we could filter for crashes originating directly in the module in question; that is likely to still catch most of our problems since we statically link almost everything.

There's a proto_signature field, but that's function names--not modules. There's bug #1542964 which covers something related--being able to reprocess crash reports where a specific module/debugid is in the stack. I think the two are likely related enough that they can be solved together.

Also related is that I'm reworking a bunch of the Elasticsearch code so we can upgrade from a really old Elasticsearch to a current one. That'll definitely affect what we can do search-wise.

Janet: Is this urgent? Could we live without it for a month? Pretend I don't know what the Fenix or application services schedules look like.

Flags: needinfo?(jdragojevic)

Making this a P2 so it's on my radar.

Priority: -- → P2

I think a month or so seems perfectly reasonable. The big 2.0 release should be October, so if we are getting crashes and have some time before then to get the info we need to find and resolve them, that would be great.

Flags: needinfo?(jdragojevic)

Grabbing this to work on now.

Assignee: nobody → willkg
Status: NEW → ASSIGNED

I ran out of time to do the Elasticsearch upgrade. We can't index all the frames for all the stacks for all the crash reports--there are too many problems with doing that and I wouldn't be able to get that to work in the next week. So I'm looking at ways to cheat that get you answers to the questions you want.

When you say "module", are you talking about the same kind of modules as in the wiki here?: https://wiki.mozilla.org/Modules/All

If so, what module are you working on and do you have any example crash reports where that module shows up in the stack?

Janet: I messed up. I see what you mean by module now. Sorry about that.

So, I'm thinking of doing something like this:

  1. Write a processor rule that creates a modules_in_stack field. It'd walk the stack of the crashing thread and pull out modules. The value would be a ; separated set of module/debugid strings. Since it's a set, if a module shows up twice in the stack, it only shows up once in the set.
  2. Add modules_in_stack to the search fields using the semicolon_analyzer. Then each module/debugid would be a term. It should be searchable using starts with, ends with, contains, and the other string operators. It should be facetable.

Then we should be able to do searches like:

  • find all crashes with module libfenix.so (this bug)
  • find all crashes with libc.so/037B12F7F23D7AD7A9262CB5A6ACCDA10 (bug #1542964)
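A minimal sketch of what step 1 might look like, not the actual processor rule. The input shapes are assumptions made for illustration: each frame is a dict with a "module" key, and each entry in the module list has "filename" and "debug_id" keys. The function name is hypothetical too.

```python
def modules_in_stack(frames, modules):
    """Build a ';'-separated set of module/debugid strings for one thread.

    frames:  list of dicts with an optional "module" key (assumed shape)
    modules: list of dicts with "filename" and "debug_id" keys (assumed shape)
    """
    # Look up the debug id for each module filename.
    debug_ids = {m["filename"]: m.get("debug_id", "") for m in modules}

    items = []
    for frame in frames:
        name = frame.get("module")
        if not name:
            # Frame with no resolved module; nothing to record.
            continue
        item = f"{name}/{debug_ids.get(name, '')}"
        # Set semantics: a module that appears twice is recorded once.
        if item not in items:
            items.append(item)
    return ";".join(items)


frames = [
    {"module": "libfenix.so"},
    {"module": "libc.so"},
    {"module": "libfenix.so"},
    {},
]
modules = [
    {"filename": "libfenix.so", "debug_id": "ABCDEF123456778890"},
    {"filename": "libc.so", "debug_id": "037B12F7F23D7AD7A9262CB5A6ACCDA10"},
]
print(modules_in_stack(frames, modules))
```

This sketch keeps the order in which modules are first seen, which is a choice made here; the proposal only says the value is a set.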

I looked at 11,000 crash reports across Fenix and Firefox.

  • median number of module/debugid items in the set: 3
  • mean number of module/debugid items in the set: 4.7
  • 95th percentile of module/debugid items in the set: 13
  • max number of module/debugid items in the set: 19

If each module/debugid is 100 characters, then this is at most 2k to add per crash report. I think this is ok. We get a lot of utility out of "reprocess all the crash reports with module/debugid"--that comes up periodically.
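The size estimate can be checked with a few lines of arithmetic. This is a back-of-the-envelope sketch: the 100 characters per item is the assumption stated above, and counting one ";" separator between items is an extra assumption added here.

```python
ASSUMED_CHARS_PER_ITEM = 100  # assumption from the estimate above


def field_size(num_items, chars_per_item=ASSUMED_CHARS_PER_ITEM):
    """Approximate length of the field: items plus ';' separators between them."""
    if num_items == 0:
        return 0
    return num_items * chars_per_item + (num_items - 1)


# Item counts observed in the 11,000-report sample.
for label, count in [("median", 3), ("95th percentile", 13), ("max", 19)]:
    print(f"{label}: {count} items -> about {field_size(count)} characters")
```

At the observed maximum of 19 items this comes to 1,918 characters, which matches the "at most 2k" figure.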

Lonnen, Peter, Adrian, John: What do you think? Is this a good-enough stop-gap fix to cover some use cases now? Does this sound like a bad idea based on experience with previous attempts at similar things?

Flags: needinfo?(peterbe)
Flags: needinfo?(jwhitlock)
Flags: needinfo?(chris.lonnen)
Flags: needinfo?(adrian)
Flags: needinfo?(peterbe)

This sounds reasonable to me. I wasn't familiar with our semicolon_analyzer, but it is being used already for three other fields (app_init_dlls, topmost_filenames, and useragent_locale), so it should be safe to use it on a new fourth field.

Flags: needinfo?(jwhitlock)

I think this sounds useful. If we're concerned about load we could introduce it behind a sample rate, and crank that sample rate up over the course of a week.

Flags: needinfo?(chris.lonnen)

I think this sounds quite fine to me. No alarms went up in my head while reading your description. :-)

Flags: needinfo?(adrian)

willkg merged PR #4996: "bug 1559223, 1542964: add modules_in_stack to processed crash" in 03ffe0f.

I just pushed it out to prod in 2019.07.26. I'll keep an eye on Elasticsearch memory usage.

I'm going to keep this open until Monday. We need to wait for a new Elasticsearch index to be created in prod before searching will work and I want to verify it then.

The processor has created a new Elasticsearch index with the mapping we need, so "modules_in_stack" should be searchable for crashes received and processed today and going forward.

I'm using this query:

https://crash-stats.allizom.org/search/?modules_in_stack=%5Elibfenix.so&date=%3E%3D2019-07-22T00%3A00%3A00.000Z&date=%3C2019-07-29T23%3A59%3A00.000Z&_facets=signature&page=1&_sort=-date&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform#facet-signature

It's a "starts_with" query because the term will be something like "libfenix.so/ABCDEF123456778890" and we need to match just the first part.
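A sketch of how a query URL like the one above could be assembled, using only the parameters visible in that URL. The helper name is hypothetical, and the "^" prefix is taken from the query above, where it marks the starts-with match.

```python
from urllib.parse import urlencode

SEARCH_URL = "https://crash-stats.allizom.org/search/"


def build_module_search_url(module, start, end, base=SEARCH_URL):
    """Build a search URL for crashes whose stack includes a module.

    "^" + module asks for terms that start with the module name, so
    "libfenix.so" matches "libfenix.so/<debugid>" terms.
    """
    params = [
        ("modules_in_stack", "^" + module),
        ("date", ">=" + start),
        ("date", "<" + end),
        ("_facets", "signature"),
    ]
    # urlencode percent-encodes "^", ">", "=", "<" and ":" in the values.
    return base + "?" + urlencode(params)


print(
    build_module_search_url(
        "libfenix.so",
        "2019-07-22T00:00:00.000Z",
        "2019-07-29T23:59:00.000Z",
    )
)
```

Swapping "libfenix.so" for "libxul.so" gives the comparison query for a module that shows up in many stacks.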

That's currently bringing up no results. I think that's because there haven't been any crash reports reported and processed in the last 12 hours that have "libfenix.so" in the stack. If we search for "libxul.so", then it brings up tons of stuff. So while I can't verify the scenario underlying this bug, I'm pretty sure it should work fine once we get the requisite crash reports in.

I'm going to mark it FIXED. If there are still issues, please reopen. Hope that helps!

Status: ASSIGNED → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED