add missing module information for panic-related frames
Categories
(Socorro :: Signature, defect, P2)
Tracking
(Not tracked)
People
(Reporter: kats, Assigned: willkg)
Attachments
(1 file)
https://crash-stats.mozilla.org/report/index/c0351dcd-a07b-4e84-a964-7bfe00190413 is an example crash report. The signature is [@ GeckoCrash] but the webrender::prim_store::PrimitiveStore::prepare_prim_for_render frame is the first interesting one.
Reporter
Comment 1 • 6 years ago
Actually I guess adding the panic frame as a sentinel would be better.
Assignee
Comment 2 • 6 years ago
This is super fishy. Here's the same crash, but on Windows:
https://crash-stats.mozilla.org/report/index/b93c3882-eead-4ac8-b923-539410190414
I wrote up bug #1544416 to look into how symbols are getting generated. I want to wait on that before doing anything here.
Assignee
Comment 3 • 6 years ago
In bug #1544416, we talk about how upgrading Rust to 1.34 has changed some symbols. The change in symbols prevents signature generation from matching one of the sentinels in the sentinels list. Hence the crappy signature.
There's more information in there as well as what we're thinking about doing.
The signature report for this bug is here:
https://crash-stats.mozilla.org/signature/?product=Firefox&signature=GeckoCrash
According to that, there are 17 instances of this in the last month. That's ... a very, very small number. I don't think I want to do anything explicitly to fix this bug outside of dealing with bug #1544416, so I'm going to make this a P3 and think about it again later.
If circumstances change and this is a bigger issue we need to deal with now, I'm game--just toss in a comment to bump the bug.
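To illustrate the sentinel mismatch described in this comment, here is a minimal sketch of sentinel-based signature generation. It is not Socorro's actual code: the sentinel list, the frame names, and the `generate_signature` helper are all assumptions made for illustration, and the "after the upgrade" frame name is a made-up stand-in for whatever the changed symbol really looks like.

```python
# Hypothetical, simplified sketch of sentinel-based signature generation.
# The sentinel list and all frame names are illustrative assumptions,
# not Socorro's real lists or real symbols.

SENTINELS = ["core::panicking::panic", "std::panicking::begin_panic"]


def generate_signature(frames, sentinels=SENTINELS, max_frames=2):
    """Build a signature starting at the first frame that matches a sentinel."""
    for index, frame in enumerate(frames):
        if frame in sentinels:
            return " | ".join(frames[index:index + max_frames])
    # No sentinel matched: fall back to the (uninformative) top frame.
    return frames[0] if frames else "EMPTY"


# Frames as they might have looked before the compiler upgrade (assumed).
OLD_FRAMES = [
    "GeckoCrash",
    "core::panicking::panic",
    "webrender::prim_store::PrimitiveStore::prepare_prim_for_render",
]

# Same stack, but the panic frame's symbol no longer matches any sentinel.
# The bare "panic" name is a placeholder, not the real changed symbol.
NEW_FRAMES = [
    "GeckoCrash",
    "panic",
    "webrender::prim_store::PrimitiveStore::prepare_prim_for_render",
]

print(generate_signature(OLD_FRAMES))
print(generate_signature(NEW_FRAMES))
```

With the old frames the signature starts at the panic frame and includes the WebRender frame; with the changed symbol nothing matches and the signature collapses to the top frame.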
Reporter
Comment 4 • 6 years ago
> According to that, there are 17 instances of this in the last month. That's ... very very small number.
This is kind of surprising to me.
I've been looking at release/beta crash numbers for WebRender using Databricks, looking at the crash pings submitted by telemetry, and I see a lot more pings than what shows up on crash-stats. See for instance this notebook where I collect the crash pings from 67 beta builds submitted after Apr 14 (so ~8 days of data) and filter for GPU process crashes with WebRender enabled. In this dataset alone there are 16 crashes with the "GeckoCrash" signature, but I can't find any of them using crash-stats.
For additional context, I had previously done another run (see bug 1540853 comment 6) where I sampled 10% of the crashes since March 1 on beta 67, and the #4 crash on the list was this "GeckoCrash" signature, which is why I was interested in breaking it down to figure out what the root cause was.
Assignee
Comment 5 • 6 years ago
I don't have view access to that notebook. I'd love to see it--I'm curious as to how you're doing symbolication and signature generation.
The crash ping and crash report datasets are wildly divergent for a multitude of reasons. I don't know what might be happening in this specific instance. You could write up a bug in Toolkit :: Crash Reporting for someone to look into it.
How are you looking for crash reports in Crash Stats based on crash ping data?
Reporter
Comment 6 • 6 years ago
Sorry, forgot to give you perms. Should be fixed now. I'm using the fx-crash-sig package to do symbolication, which uses siggen for signature generation. I realize the PyPI packages might be a little out of date relative to what's on GitHub, but it seems to work well enough for my purposes.
> The crash ping and crash report datasets are wildly divergent for a multitude of reasons.
Can you list some of these reasons? I'm mostly interested in if it affects the validity of what I'm doing.
> How are you looking for crash reports in Crash Stats based on crash ping data?
Basically using super search with parameters that match crash pings. For example, this query should give me all the gpu process crashes in 67.0b11 and 67.0b12 in the last week but it only finds 2 which is a far cry from what I'm getting via databricks.
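The original query link did not survive, so the following is only a rough sketch of how such a search could be expressed against the Crash Stats SuperSearch API. The parameter values (versions, process type, start date) are assumptions reconstructed from the description above, the date is a placeholder, and `build_supersearch_url` is a hypothetical helper, not part of any Socorro client library.

```python
from urllib.parse import urlencode

# Public SuperSearch endpoint on Crash Stats.
SUPERSEARCH_URL = "https://crash-stats.mozilla.org/api/SuperSearch/"


def build_supersearch_url(versions, process_type, start_date, product="Firefox"):
    """Build a SuperSearch URL that facets matching crash reports by signature.

    All argument values passed in below are illustrative assumptions.
    """
    params = [("product", product)]
    # Repeating a parameter asks for reports matching any of the values.
    params += [("version", version) for version in versions]
    params += [
        ("process_type", process_type),
        ("date", f">={start_date}"),   # only reports on or after this date
        ("_facets", "signature"),      # aggregate counts per signature
        ("_results_number", "0"),      # skip individual hits, keep facets
    ]
    return SUPERSEARCH_URL + "?" + urlencode(params)


# Placeholder date; the actual "last week" window is not given in the thread.
url = build_supersearch_url(["67.0b11", "67.0b12"], "gpu", "2019-04-17")
print(url)
```

Comparing the signature facet counts from a query like this against the counts from the crash ping dataset is one way to quantify the gap between the two.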
Assignee
Comment 7 • 6 years ago
Interesting--I knew about Ben's work, but never knew about fx-crash-sig. That's an interesting package that's way out of date.
I extracted siggen as an experiment to support Ben Wu's work, but after that project wrapped up, I stopped updating siggen because no one was using it and I'm really strapped for time. Signature generation consists of code, but also (and more importantly) the signature generation lists, which include sentinels. You will definitely see more GeckoCrash signatures using siggen than Socorro, especially with old versions of siggen. Sorry!
My understanding of why crash ping and crash report datasets are wildly divergent is roughly this:
- Crash reports include PII and thus require user consent to send; users can opt not to send them, users may not see the crash report dialog, users may have opted out permanently. I'm pretty sure crash ping data always gets sent unless you opt out of telemetry. Crash pings do not include PII.
- The crash reporting mechanisms are occasionally broken, especially as we upgrade llvm, clang, Rust, and other related things. This can affect crash reports, but not crash pings.
- There are types of crashes where the crash reporting machinery just doesn't send a crash report. These get discovered and fixed periodically. You can follow Toolkit :: Crash Reporting in Bugzilla for examples and current state of things.
- Socorro throttles crash reports for release channel and only processes 10% of them. It's a random 10% but if you're looking at crashes for which there's very low volume, it's possible it doesn't show up in Crash Stats at all. Pretty sure we process all crash ping data.
There are likely other reasons, too. It's possible some of these reasons aren't relevant anymore. Regardless, the two data sets have really different properties.
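As a rough worked example of the throttling point: if each release-channel crash report is independently processed with probability 0.1, the chance that none of n submitted reports for a given signature gets processed is 0.9^n. The sketch below assumes independent sampling and ignores every other source of loss listed above; the function name is made up for illustration.

```python
def prob_none_processed(submitted, sample_rate=0.10):
    """Probability that zero of `submitted` reports survive random throttling.

    Assumes each report is kept independently with probability `sample_rate`.
    """
    return (1.0 - sample_rate) ** submitted


for count in (1, 5, 17, 50):
    print(f"{count:>3} submitted -> P(none processed) = {prob_none_processed(count):.3f}")
```

Under these assumptions a signature with only a handful of submitted reports has a substantial chance of never appearing in Crash Stats, and that chance shrinks quickly as volume grows.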
Hope that helps!
Reporter
Comment 8 • 6 years ago
Ok, thanks! From this it sounds like looking at crash pings is actually the better thing to do if we want to get a better idea of what behaviour in the wild is like, with the caveat that signature generation is using stale code and so the signatures might be suboptimal.
Assignee
Comment 9 • 6 years ago
Here's a PR with a crash report that's similarly affected:
https://github.com/mozilla-services/socorro/pull/4937
There was some movement on the issue in Rust. I'd much prefer it get fixed there and the problem go away.
Do we know if the problematic version of Rust is being used in builds in beta or release channel, yet? If so, that makes it much more likely that we have to do a fix in Socorro, too.
Comment 10 • 6 years ago
bug 1535657: bp-b481e07c-28ef-48ad-adbf-0d0e30190509
[@ GeckoCrash ] has previously been [@ webrender::prim_store::SpaceMapper<T>::map<T> ]
called `Option::unwrap()` on a `None` value
bug 1524427: bp-cc5ef2f9-8502-473a-98a1-9f1f60190508
[@ GeckoCrash ] has previously been [@ core::option::expect_failed | webrender::resource_cache::ResourceCache::get_cached_image ]
Didn't find a cached resource with that ID!
Assignee
Comment 11 • 6 years ago
Jan: Both of those look like crashes in the nightly channel, so I'm confused. How does that answer comment #9?
Comment 12 • 6 years ago
Sorry, that was a mid-air. Originally I planned to file a bug asking for the same as comment 1, but found this one instead.
It is a source of confusion/duplicates and complicates monitoring of bug fixes.
Comment 13 • 6 years ago
Firefox 68 is now on beta.
Are we going to address this in Socorro somehow (I wrote https://github.com/jcristau/socorro/commit/6544986f2e55c96ca00270282932c53812861e43 before seeing this bug and the existing PR)?
Or does https://github.com/rust-lang/rust/pull/61007 fix this and we should update the compiler instead?
Assignee
Comment 14 • 6 years ago
Pull 61007 is great and probably fixes this issue, but I don't think it can land in a Rust release, and then reach us through a compiler update, fast enough.
I'm going to add a hack to Socorro to re-add missing module information so that signature generation works correctly. Hopefully we can take the hack out some day.
Grabbing this to do today.
Once I land a fix, I can reprocess affected crashes and they'll end up with better signatures.
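The actual change is not shown in this thread, so the following is only a guess at the general shape of such a hack: before signature generation runs, frames whose module is missing but whose function looks panic-related get a module filled in. The prefix list, the module name, the frame dictionary layout, and the `add_missing_modules` helper are all hypothetical; the real Socorro fix may work quite differently.

```python
# Illustrative only: these prefixes and the module name are assumptions,
# not values taken from Socorro.
PANIC_FUNCTION_PREFIXES = (
    "core::panicking::",
    "std::panicking::",
    "core::option::expect_failed",
    "core::result::unwrap_failed",
)
ASSUMED_MODULE = "XUL"


def add_missing_modules(frames, prefixes=PANIC_FUNCTION_PREFIXES, module=ASSUMED_MODULE):
    """Return copies of the frames with a module filled in for panic-related
    frames that arrived without one. Other frames are left untouched."""
    fixed = []
    for frame in frames:
        frame = dict(frame)  # copy so the caller's data is not mutated
        function = frame.get("function") or ""
        if not frame.get("module") and function.startswith(prefixes):
            frame["module"] = module
        fixed.append(frame)
    return fixed


frames = [
    {"module": "XUL", "function": "GeckoCrash"},
    {"module": None, "function": "core::panicking::panic"},
    {"module": None, "function": "some_unrelated_function"},
]
fixed_frames = add_missing_modules(frames)
print(fixed_frames)
```

Keeping the rule narrow (only frames with no module and a panic-related function) makes it easier to remove once the compiler-side fix ships.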
Assignee
Comment 15 • 6 years ago
Assignee
Comment 16 • 6 years ago
Assignee
Comment 17 • 6 years ago
I pushed this to prod just now.
I reprocessed all the crashes with "GeckoCrash" in the signature for the last 3 months. There were 200 or so of them.
Marking this as FIXED.