Closed Bug 1547810 Opened 6 years ago Closed 3 years ago

Get some kind of regular reporting of crash ping telemetry

Categories

(Core :: Graphics: WebRender, task, P3)

RESOLVED FIXED

People

(Reporter: jrmuizel, Unassigned)

References

(Depends on 1 open bug, Blocks 1 open bug)

Details

We get far more reports from telemetry crash pings than we get crash reports. We should have some way of routinely looking at the telemetry results. This will be especially helpful for monitoring release 67.

Blocks: wr-67
Priority: -- → P2
Type: defect → task
Blocks: wr-68
No longer blocks: wr-67
Blocks: wr-telemetry
Depends on: 1544246

As an FYI: that workbook uses fx-crash-sig which is (as near as I can tell) unmaintained. It requires a really old version of the experimental signature generation library. That's one of the reasons it's seeing lots of GeckoCrash signatures.

If you're going to go this route, siggen and fx-crash-sig will need updates and active maintenance.

Depends on: 1553671
Blocks: wr-70
No longer blocks: wr-68
Blocks: wr-71
No longer blocks: wr-70

Some new possibilities have opened up here with the new telemetry.crash dataset: https://mail.mozilla.org/pipermail/fx-data-dev/2019-October/000269.html

siggen and fx-crash-sig would need to be rewritten to use this new dataset, but getting the actual crash data should be much easier and faster than it was previously.

How do siggen and fx-crash-sig need to be rewritten?

(In reply to Will Kahn-Greene [:willkg] ET needinfo? me from comment #4)

How do siggen and fx-crash-sig need to be rewritten?

Perhaps "rewritten" is the wrong way of phrasing this. They would need to process the crash payload as it appears in the telemetry.crash dataset, which I think might be a bit different from how you would have fetched it via (e.g.) the python_moztelemetry API.

Ahhh--got it! If the structure of the data set that fx-crash-sig works on has changed, then I think we only need to change fx-crash-sig. It's the library that's responsible for taking a crash ping, extracting the bits that are needed for symbolication, symbolicating using Symbols, and then running the results of that through siggen for signature generation.
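To make the steps described above concrete, here is a rough sketch of that pipeline shape. It is not fx-crash-sig's actual code or API: the function names, the ping key names ("payload", "stackTraces", "crash_info", "crashing_thread", "threads", "frames", "module_index", "ip") and the two stand-in callables are assumptions for illustration. A real version would call the symbolication service and siggen where the stand-ins are.

```python
from typing import Callable, Dict, List

# Assumed frame shape for illustration, e.g. {"module_index": 0, "ip": "0x10"}.
Frame = Dict[str, object]


def extract_frames(crash_ping: dict) -> List[Frame]:
    """Step 1: pull the crashing thread's raw frames out of a crash ping.

    All key names here are assumptions about the ping layout.
    """
    traces = crash_ping.get("payload", {}).get("stackTraces", {})
    threads = traces.get("threads", [])
    index = traces.get("crash_info", {}).get("crashing_thread", 0)
    if not 0 <= index < len(threads):
        return []
    return threads[index].get("frames", [])


def crash_signature(
    crash_ping: dict,
    symbolicate: Callable[[List[Frame]], List[str]],
    generate_signature: Callable[[List[str]], str],
) -> str:
    """Run extract -> symbolicate -> signature generation in order."""
    frames = extract_frames(crash_ping)
    symbols = symbolicate(frames)          # step 2: would be a symbolication call
    return generate_signature(symbols)     # step 3: would be siggen


# Stand-ins so the sketch runs without any network access or extra libraries.
def fake_symbolicate(frames: List[Frame]) -> List[str]:
    return [f"module{f['module_index']}+{f['ip']}" for f in frames]


def fake_signature(symbols: List[str]) -> str:
    return " | ".join(symbols[:2]) if symbols else "EMPTY"


example_ping = {
    "payload": {
        "stackTraces": {
            "crash_info": {"crashing_thread": 0},
            "threads": [
                {"frames": [
                    {"module_index": 0, "ip": "0x10"},
                    {"module_index": 1, "ip": "0x20"},
                ]}
            ],
        }
    }
}

print(crash_signature(example_ping, fake_symbolicate, fake_signature))
```

If the layout of the incoming data changes, only the extraction step should need touching in a structure like this, which matches the point that siggen itself would not need changes.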

Blocks: wr-72
No longer blocks: wr-71
Blocks: wr-73
No longer blocks: wr-72
Blocks: wr-74
No longer blocks: wr-73
Blocks: wr-75
No longer blocks: wr-74
Flags: needinfo?(kats)

That needinfo reminded me, I wrote up a cookbook a few weeks ago on working with crash pings using bigquery: https://docs.telemetry.mozilla.org/cookbooks/crash_pings.html

Much of kats' databricks notebook could be reproduced in sql.tmo as a dashboard using some of the techniques described in there. Getting data on specific signatures is a slightly more complicated beast, but as mentioned above it is much more tractable than it was previously.

I can take a look and figure out next steps here.

Assignee: nobody → kats
Flags: needinfo?(kats)
Blocks: 1621137

I started playing around with the data in STMO. It seems relatively straightforward to get the crash data and plot number of crashes broken down by buildid and/or vendorId. But I'm not sure what would be the most useful data to display. If anybody has thoughts on that please chime in.

So far I'm thinking of plotting number of crashes as well as average uptime based on buildid, for the last 3 months. Different charts for release vs beta vs nightly. And additional charts to break the numbers down by vendorId. So that would be six charts in total (two for each channel - one aggregate and one broken by vendor). But the data is still fairly noisy and it's not obvious that this will produce the desired result of "look at the graph and immediately notice we introduced a crasher bug".
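As a rough illustration of the aggregation behind those charts, here is a small sketch that computes crash count and average uptime per (buildid, vendorId) for one channel. The row field names ("channel", "build_id", "vendor_id", "uptime_s") and the sample rows are made up for the example; they are not the real column names in the dataset.

```python
from collections import defaultdict
from statistics import mean


def summarize_by_build(crashes, channel):
    """Return {(build_id, vendor_id): (crash_count, mean_uptime_s)} for one channel."""
    grouped = defaultdict(list)
    for crash in crashes:
        if crash["channel"] == channel:
            key = (crash["build_id"], crash["vendor_id"])
            grouped[key].append(crash["uptime_s"])
    # Sort by key so the output order is stable across runs.
    return {key: (len(uptimes), mean(uptimes)) for key, uptimes in sorted(grouped.items())}


# Made-up sample rows, purely for illustration.
rows = [
    {"channel": "nightly", "build_id": "20200301", "vendor_id": "0x10de", "uptime_s": 100},
    {"channel": "nightly", "build_id": "20200301", "vendor_id": "0x10de", "uptime_s": 300},
    {"channel": "nightly", "build_id": "20200302", "vendor_id": "0x8086", "uptime_s": 50},
    {"channel": "beta", "build_id": "20200225", "vendor_id": "0x8086", "uptime_s": 10},
]

for key, (count, avg_uptime) in summarize_by_build(rows, "nightly").items():
    print(key, count, avg_uptime)
```

Dropping vendor_id from the key would give the per-channel aggregate chart; keeping it gives the per-vendor breakdown.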

I'd mostly like to see a list of top signatures

(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #9)

So far I'm thinking of plotting number of crashes as well as average uptime based on buildid, for the last 3 months. Different charts for release vs beta vs nightly. And additional charts to break the numbers down by vendorId. So that would be six charts in total (two for each channel - one aggregate and one broken by vendor). But the data is still fairly noisy and it's not obvious that this will produce the desired result of "look at the graph and immediately notice we introduced a crasher bug".

Yeah, this is what missioncontrol v1 and v2 try to do (try to chart/track crashes normalized by other things over time):

v1: https://missioncontrol.telemetry.mozilla.org
v2: https://metrics.mozilla.com/~sguha/mz/missioncontrol/ex1/mc2/missioncontrol_v2.html

It's a bit of a topic on its own: it's very complicated to get a good signal out of this type of normalized error rate, and you need to consider a wide variety of factors (release dates, update schedules, etc.) to properly interpret what's going on. That said, I don't think we've tried to break this down by graphics chipset before, so it's possible that might yield useful results in some cases.

If all we want is a list of top signatures then I'm not sure STMO is the way to go. It's probably better to spruce up my original databricks workbook and dashboard-ize it.

Hm, looks like per https://mail.mozilla.org/pipermail/fx-data-dev/2019-November/000291.html the moztelemetry python thing isn't a thing anymore, so I guess I have to use STMO.

(In reply to William Lachance (:wlach) (use needinfo!) from comment #3)

Some new possibilities have opened up here with the new telemetry.crash dataset: https://mail.mozilla.org/pipermail/fx-data-dev/2019-October/000269.html

Read this, and it talks about the crash stacks being available, but I don't see them anywhere in the payload record in the telemetry.crash table. Am I missing something?

(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #13)

Hm, looks like per https://mail.mozilla.org/pipermail/fx-data-dev/2019-November/000291.html the moztelemetry python thing isn't a thing anymore, so I guess I have to use STMO.

You can access BigQuery from Databricks, but we're not really encouraging use of it these days. I'd encourage you to explore what redash can do. It's pretty powerful, e.g. https://sql.telemetry.mozilla.org/dashboard/windows-10-client-distributions

It's also possible to pull data out of STMO and display it in a different way. This is, for example, what I hooked up for Mike Conley's tab spinner dashboard a few months ago:

https://wlach.github.io/blog/2019/10/using-bigquery-javascript-udfs-to-analyze-firefox-telemetry-for-fun-profit/

Read this, and it talks about the crash stacks being available, but I don't see them anywhere in the payload record in the telemetry.crash table. Am I missing something?

We need to add the stacks to the schema so they have their own BigQuery column; I just filed bug 1623626 for this. For now you should be able to find them (along with other fields that haven't yet made it into the schema) inside the additional_properties column.
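For anyone pulling rows out and post-processing them, a minimal sketch of digging the stacks out of that column might look like the following. It assumes additional_properties arrives as a JSON string and that the stacks sit under payload.stackTraces inside it; both the path and the helper name are assumptions, so check a real row first.

```python
import json
from typing import Optional


def stack_traces_from_row(additional_properties: Optional[str]) -> Optional[dict]:
    """Return the stack traces found in an additional_properties value, or None.

    Assumes the value is a JSON object serialized as a string and that the
    stacks live at payload.stackTraces (an assumed path, not a documented one).
    """
    if not additional_properties:
        return None
    try:
        extra = json.loads(additional_properties)
    except json.JSONDecodeError:
        return None
    if not isinstance(extra, dict):
        return None
    payload = extra.get("payload")
    if not isinstance(payload, dict):
        return None
    traces = payload.get("stackTraces")
    return traces if isinstance(traces, dict) else None


# Made-up value, for illustration only.
sample = json.dumps({"payload": {"stackTraces": {"threads": [{"frames": []}]}}})
print(stack_traces_from_row(sample))
```

Rows where the column is empty, malformed, or missing the stacks simply yield None, so they can be filtered out before symbolication.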

One idea I had was to create a derived dataset of telemetry.crash which included crash signatures derived from this type of information. In theory this shouldn't be terribly difficult, and would enable some interesting things.

Thanks, that helps point me in the right direction. I didn't realize there was more stuff in the additional_properties column. Having a derived dataset with crash signatures would certainly simplify what I'm trying to do!

A note on stacks in crash pings: the ones in Windows crash pings should be every bit as good as the ones on crash-stats. On macOS and Linux not so much.

Blocks: wr-76
No longer blocks: wr-75
Blocks: wr-77
No longer blocks: wr-76
Depends on: 1631563
Blocks: 1607860
Blocks: wr-78
No longer blocks: wr-77
Blocks: wr-79
No longer blocks: wr-78
Blocks: wr-80
No longer blocks: wr-79
Blocks: wr-81
No longer blocks: 1621137
No longer blocks: wr-80
Priority: P2 → P3
No longer blocks: wr-81
No longer blocks: gfx-82
No longer blocks: gfx-83

Quick update on the current status here.

This is waiting on bug 1631563 which in turn is waiting on bug 1636210 for faster symbolication. Once the new symbolication server is deployed, bug 1631563 tracks the work to automatically symbolicate incoming crash pings and create a derived dataset in STMO with that information. I submitted a PoC (with much help from :wlach and :willkg) which demonstrates how it could be done. Once all that is done and the derived dataset is in place, this bug tracks doing whatever gfx-specific thing we want using that data. It should really just be a matter of doing a SQL query against the derived dataset to group by crash signature and sort by descending count, and that will give us the list of top crashing signatures.
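The "group by crash signature and sort by descending count" query could look roughly like the sketch below. Since the derived dataset does not exist yet, this runs against an in-memory SQLite table as a stand-in: the table name crash_signatures, the signature column, and the sample values are all hypothetical, and the real query would run in STMO against whatever schema the derived dataset ends up with.

```python
import sqlite3

# Hypothetical table and column names; the real derived dataset may differ.
TOP_SIGNATURES_SQL = """
SELECT signature, COUNT(*) AS crash_count
FROM crash_signatures
GROUP BY signature
ORDER BY crash_count DESC, signature ASC
LIMIT ?
"""


def top_signatures(conn: sqlite3.Connection, limit: int = 10):
    """Return (signature, crash_count) tuples, most frequent first."""
    return conn.execute(TOP_SIGNATURES_SQL, (limit,)).fetchall()


# In-memory stand-in for the derived dataset, filled with made-up signatures.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE crash_signatures (signature TEXT NOT NULL)")
conn.executemany(
    "INSERT INTO crash_signatures (signature) VALUES (?)",
    [(s,) for s in ["sig_a", "sig_b", "sig_a", "sig_c", "sig_a", "sig_b"]],
)

for signature, crash_count in top_signatures(conn, limit=3):
    print(signature, crash_count)
```

The secondary sort on signature is only there to keep ties in a stable order; filtering by channel, build, or date range would go in a WHERE clause before the GROUP BY.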

So until the dependencies are resolved there's nothing to do here. As it's unlikely to get done in the next week, I'll unassign this bug so somebody else can pick it up when the time comes.

Assignee: kats → nobody
Status: NEW → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED