1547810 - Get some kind of regular reporting of crash ping telemetry

Reporter

Description

•

6 years ago

We get way more reports in telemetry crash pings than we do crash reports. We should try to have some way of routinely looking at the telemetry results. This will be especially helpful for monitoring release 67.

Jeff Muizelaar [:jrmuizel]

Reporter

Updated

•

6 years ago

Blocks: wr-67

Priority: -- → P2

Marco Castelluccio [:marco]

Updated

•

6 years ago

Type: defect → task

Jeff Muizelaar [:jrmuizel]

Reporter

Updated

•

5 years ago

Blocks: wr-68
No longer blocks: wr-67

Darkspirit

Updated

•

5 years ago

Blocks: wr-telemetry

Jeff Muizelaar [:jrmuizel]

Reporter

Comment 1

•

5 years ago

Here's kat's databricks workbook: https://dbc-caf9527b-e073.cloud.databricks.com/#notebook/101076/command/101077

Jeff Muizelaar [:jrmuizel]

Reporter

Updated

•

5 years ago

Depends on: 1544246

Will Kahn-Greene [:willkg] ET needinfo? me

Comment 2

•

5 years ago

As an FYI: that workbook uses fx-crash-sig which is (as near as I can tell) unmaintained. It requires a really old version of the experimental signature generation library. That's one of the reasons it's seeing lots of GeckoCrash signatures.

If you're going to go this route, siggen and fx-crash-sig will need updates and active maintenance.

Jeff Muizelaar [:jrmuizel]

Reporter

Updated

•

5 years ago

Depends on: 1553671

Jeff Muizelaar [:jrmuizel]

Reporter

Updated

•

5 years ago

Blocks: wr-70
No longer blocks: wr-68

Jeff Muizelaar [:jrmuizel]

Reporter

Updated

•

5 years ago

Blocks: wr-71
No longer blocks: wr-70

William Lachance (:wlach)

Comment 3

•

5 years ago

Some new possibilities have opened up here with the new telemetry.crash dataset: https://mail.mozilla.org/pipermail/fx-data-dev/2019-October/000269.html

siggen and fx-crash-sig would need to be rewritten to use this new dataset, but getting the actual crash data should be much easier and faster than it was previously.

Will Kahn-Greene [:willkg] ET needinfo? me

Comment 4

•

5 years ago

How do siggen and fx-crash-sig need to be rewritten?

William Lachance (:wlach)

Comment 5

•

5 years ago

(In reply to Will Kahn-Greene [:willkg] ET needinfo? me from comment #4)

> How do siggen and fx-crash-sig need to be rewritten?

Perhaps "rewritten" is the wrong way of phrasing this. They would need to process the crash payload as it appears in the telemetry.crash dataset, which I think might be a bit different from how you would have fetched it via (e.g.) the python_moztelemetry API.

Will Kahn-Greene [:willkg] ET needinfo? me

Comment 6

•

5 years ago

Ahhh--got it! If the structure of the data set that fx-crash-sig works on has changed, then I think we only need to change fx-crash-sig. It's the library that's responsible for taking a crash ping, extracting the bits that are needed for symbolication, symbolicating using Symbols, and then running the results of that through siggen for signature generation.
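
A minimal sketch of the pipeline described above, under stated assumptions. It is not the fx-crash-sig or siggen API: every function name here is hypothetical, the `payload.stackTraces` layout is an assumed ping shape, the symbol lookup is a caller-supplied stand-in for a symbolication service, and the signature rule (join the top frames) is a deliberate simplification of what siggen does.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical types and helpers for illustration only.
Frame = Dict[str, object]
Lookup = Callable[[Optional[int], Optional[str]], Optional[str]]


def extract_frames(ping: dict) -> List[Frame]:
    """Step 1: pull the crashing thread's raw frames out of a crash ping.

    Assumes the stack lives under payload.stackTraces (assumed layout).
    """
    traces = ping.get("payload", {}).get("stackTraces", {})
    threads = traces.get("threads", [])
    crashing = traces.get("crash_info", {}).get("crashing_thread", 0)
    if not threads or crashing >= len(threads):
        return []
    return threads[crashing].get("frames", [])


def symbolicate(frames: List[Frame], lookup: Lookup) -> List[str]:
    """Step 2: map each (module_index, ip) pair to a function name.

    Frames the lookup cannot resolve fall back to their raw address.
    """
    names = []
    for frame in frames:
        name = lookup(frame.get("module_index"), frame.get("ip"))
        names.append(name or f"@{frame.get('ip')}")
    return names


def make_signature(functions: List[str], max_frames: int = 3) -> str:
    """Step 3: simplified signature, the top frames joined with ' | '."""
    if not functions:
        return "EMPTY: no frames"
    return " | ".join(functions[:max_frames])


def crash_ping_to_signature(ping: dict, lookup: Lookup) -> str:
    """Run extraction, symbolication and signature generation in order."""
    return make_signature(symbolicate(extract_frames(ping), lookup))
```

The point of the split is that a change in the dataset's structure only touches `extract_frames`; the symbolication and signature steps are unaffected.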

Jeff Muizelaar [:jrmuizel]

Reporter

Updated

•

5 years ago

Blocks: wr-72
No longer blocks: wr-71

Jeff Muizelaar [:jrmuizel]

Reporter

Updated

•

5 years ago

Blocks: wr-73
No longer blocks: wr-72

Jeff Muizelaar [:jrmuizel]

Reporter

Updated

•

5 years ago

Blocks: wr-74
No longer blocks: wr-73

Jessie [:jbonisteel] pls NI

Updated

•

5 years ago

Blocks: wr-75
No longer blocks: wr-74

Jessie [:jbonisteel] pls NI

Updated

•

5 years ago

Flags: needinfo?(kats)

William Lachance (:wlach)

Comment 7

•

5 years ago

That needinfo reminded me, I wrote up a cookbook a few weeks ago on working with crash pings using bigquery: https://docs.telemetry.mozilla.org/cookbooks/crash_pings.html

Much of kats' databricks notebook could be reproduced in sql.tmo as a dashboard using some of the techniques described in there. Getting data on specific signatures is a slightly more complicated beast but, as mentioned above, much more tractable than previously.

Kartikaya Gupta (email:kats@mozilla.staktrace.com)

Comment 8

•

5 years ago

I can take a look and figure out next steps here.

Assignee: nobody → kats

Flags: needinfo?(kats)

Jeff Muizelaar [:jrmuizel]

Reporter

Updated

•

5 years ago

Blocks: 1621137

Kartikaya Gupta (email:kats@mozilla.staktrace.com)

Comment 9

•

5 years ago

I started playing around with the data in STMO. It seems relatively straightforward to get the crash data and plot number of crashes broken down by buildid and/or vendorId. But I'm not sure what would be the most useful data to display. If anybody has thoughts on that please chime in.

So far I'm thinking of plotting number of crashes as well as average uptime based on buildid, for the last 3 months. Different charts for release vs beta vs nightly. And additional charts to break the numbers down by vendorId. So that would be six charts in total (two for each channel - one aggregate and one broken by vendor). But the data is still fairly noisy and it's not obvious that this will produce the desired result of "look at the graph and immediately notice we introduced a crasher bug".
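
A rough sketch of the aggregation behind those charts, assuming each crash has already been fetched as a plain dict. The field names (`channel`, `build_id`, `vendor_id`, `uptime_s`) are hypothetical and not the actual column names in the dataset.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def summarize(rows: Iterable[dict],
              keys: Tuple[str, ...] = ("channel", "build_id")) -> Dict[tuple, dict]:
    """Count crashes and average uptime for each combination of `keys`."""
    totals = defaultdict(lambda: [0, 0.0])  # key -> [crash count, uptime sum]
    for row in rows:
        key = tuple(row[k] for k in keys)
        totals[key][0] += 1
        totals[key][1] += row["uptime_s"]
    return {
        key: {"crashes": count, "avg_uptime_s": uptime_sum / count}
        for key, (count, uptime_sum) in totals.items()
    }
```

Calling `summarize(rows)` gives the per-channel aggregate series, and `summarize(rows, ("channel", "build_id", "vendor_id"))` gives the per-vendor breakdown.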

Jeff Muizelaar [:jrmuizel]

Reporter

Comment 10

•

5 years ago

I'd mostly like to see a list of top signatures

William Lachance (:wlach)

Comment 11

•

5 years ago

(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #9)

> So far I'm thinking of plotting number of crashes as well as average uptime based on buildid, for the last 3 months. Different charts for release vs beta vs nightly. And additional charts to break the numbers down by vendorId. So that would be six charts in total (two for each channel - one aggregate and one broken by vendor). But the data is still fairly noisy and it's not obvious that this will produce the desired result of "look at the graph and immediately notice we introduced a crasher bug".

Yeah, this is what missioncontrol v1 and v2 try to do (try to chart/track crashes normalized by other things over time):

v1: https://missioncontrol.telemetry.mozilla.org
v2: https://metrics.mozilla.com/~sguha/mz/missioncontrol/ex1/mc2/missioncontrol_v2.html

It's a bit of a topic on its own-- it's basically very complicated to get a good signal out of this type of normalized error rate and you need to really consider a wide variety of factors (release dates, update schedules, etc.) to be able to properly interpret what's going on. That said, I don't think we've tried to break this down by graphics chipset before -- it's possible that might yield useful results in some cases.
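
To make "normalized error rate" concrete, here is a small sketch. It assumes usage hours as the denominator; that choice, the function name and the `min_hours` cutoff are illustrative assumptions, not a description of how missioncontrol computes its numbers.

```python
from typing import Optional


def crash_rate_per_khours(crashes: int, usage_hours: float,
                          min_hours: float = 1000.0) -> Optional[float]:
    """Crashes per 1000 usage hours, or None when usage is too low to trust.

    Builds with very little usage (e.g. just after a release) produce wildly
    swinging rates, so they are suppressed rather than charted.
    """
    if usage_hours < min_hours:
        return None
    return crashes * 1000.0 / usage_hours
```

Even with a cutoff like this, the rate still has to be read alongside release dates and update schedules, as noted above.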

Kartikaya Gupta (email:kats@mozilla.staktrace.com)

Comment 12

•

5 years ago

If all we want is a list of top signatures then I'm not sure STMO is the way to go. It's probably better to spruce up my original databricks workbook and dashboard-ize it.

Kartikaya Gupta (email:kats@mozilla.staktrace.com)

Comment 13

•

5 years ago

Hm, looks like per https://mail.mozilla.org/pipermail/fx-data-dev/2019-November/000291.html the moztelemetry python thing isn't a thing anymore, so I guess I have to use STMO.

(In reply to William Lachance (:wlach) (use needinfo!) from comment #3)

> Some new possibilities have opened up here with the new telemetry.crash dataset: https://mail.mozilla.org/pipermail/fx-data-dev/2019-October/000269.html

Read this, and it talks about the crash stacks being available, but I don't see them anywhere in the payload record in the telemetry.crash table. Am I missing something?

William Lachance (:wlach)

Comment 14

•

5 years ago

(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #13)

> Hm, looks like per https://mail.mozilla.org/pipermail/fx-data-dev/2019-November/000291.html the moztelemetry python thing isn't a thing anymore, so I guess I have to use STMO.

You can access BigQuery from Databricks, but we're not really encouraging use of it these days. I'd encourage you to explore what redash can do; it's pretty powerful, e.g. https://sql.telemetry.mozilla.org/dashboard/windows-10-client-distributions

It's also possible to pull data out of STMO and display it in a different way. This is, e.g., what I hooked up for Mike Conley's tab spinner dashboard a few months ago:

https://wlach.github.io/blog/2019/10/using-bigquery-javascript-udfs-to-analyze-firefox-telemetry-for-fun-profit/

> Read this, and it talks about the crash stacks being available, but I don't see them anywhere in the payload record in the telemetry.crash table. Am I missing something?

We need to add the stacks to the schema so they have their own BigQuery column; I just filed bug 1623626 for this. For now you should be able to find them (along with other fields that haven't yet made it into the schema) inside the additional_properties column.
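
A hedged sketch of digging the stacks out of that column on the client side. It assumes a row has been fetched as a dict and that additional_properties holds a JSON string mirroring the ping layout, with the stacks under `payload.stackTraces`; that key path and the function name are assumptions, not something the schema guarantees.

```python
import json
from typing import Optional


def stack_traces_from_row(row: dict) -> Optional[dict]:
    """Return the stack traces stored in additional_properties, if any."""
    raw = row.get("additional_properties")
    if not raw:
        return None
    try:
        extra = json.loads(raw)
    except (TypeError, ValueError):
        # Not a JSON string: treat the row as having no stack information.
        return None
    if not isinstance(extra, dict):
        return None
    payload = extra.get("payload")
    if not isinstance(payload, dict):
        return None
    # Assumed key path; adjust once the stacks get their own column.
    return payload.get("stackTraces")
```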

One idea I had was to create a derived dataset of telemetry.crash which included crash signatures derived from this type of information. In theory this shouldn't be terribly difficult, and would enable some interesting things.

Kartikaya Gupta (email:kats@mozilla.staktrace.com)

Comment 15

•

5 years ago

Thanks, that helps point me in the right direction. I didn't realize there was more stuff in the additional_properties column. Having a derived dataset with crash signatures would certainly simplify what I'm trying to do!

Kartikaya Gupta (email:kats@mozilla.staktrace.com)

Updated

•

5 years ago

See Also: → https://github.com/mozilla/fx-crash-sig/issues/10

Gabriele Svelto [:gsvelto]

Comment 16

•

5 years ago

A note on stacks in crash pings: the ones in Windows crash pings should be every bit as good as the ones on crash-stats. On macOS and Linux not so much.

Kartikaya Gupta (email:kats@mozilla.staktrace.com)

Comment 17

•

5 years ago

Just saving some WIP links here so I don't lose them:

https://sql.telemetry.mozilla.org/queries/69281/source
https://iodide.telemetry.mozilla.org/notebooks/470/

Jeff Muizelaar [:jrmuizel]

Reporter

Updated

•

5 years ago

Blocks: wr-76
No longer blocks: wr-75

Jeff Muizelaar [:jrmuizel]

Reporter

Updated

•

5 years ago

Blocks: wr-77
No longer blocks: wr-76

William Lachance (:wlach)

Updated

•

5 years ago

Depends on: 1631563

Jeff Muizelaar [:jrmuizel]

Reporter

Updated

•

4 years ago

Blocks: 1607860

Jeff Muizelaar [:jrmuizel]

Reporter

Updated

•

4 years ago

Blocks: wr-78
No longer blocks: wr-77

Jeff Muizelaar [:jrmuizel]

Reporter

Updated

•

4 years ago

Blocks: wr-79
No longer blocks: wr-78

Kris Taeleman (:ktaeleman)

Updated

•

4 years ago

Blocks: wr-80

Kris Taeleman (:ktaeleman)

Updated

•

4 years ago

No longer blocks: wr-79

Jessie [:jbonisteel] pls NI

Updated

•

4 years ago

Blocks: wr-81
No longer blocks: 1621137

Jessie [:jbonisteel] pls NI

Updated

•

4 years ago

No longer blocks: wr-80

Jessie [:jbonisteel] pls NI

Updated

•

4 years ago

Blocks: 1621137

Jessie [:jbonisteel] pls NI

Updated

•

4 years ago

Priority: P2 → P3

Kris Taeleman (:ktaeleman)

Updated

•

4 years ago

Blocks: gfx-82

Kris Taeleman (:ktaeleman)

Updated

•

4 years ago

No longer blocks: wr-81

Kris Taeleman (:ktaeleman)

Updated

•

4 years ago

Blocks: gfx-83

Kris Taeleman (:ktaeleman)

Updated

•

4 years ago

No longer blocks: gfx-82

Jim Mathies [:jimm]

Updated

•

4 years ago

No longer blocks: gfx-83

Kartikaya Gupta (email:kats@mozilla.staktrace.com)

Comment 18

•

4 years ago

Quick update on the current status here.

This is waiting on bug 1631563 which in turn is waiting on bug 1636210 for faster symbolication. Once the new symbolication server is deployed, bug 1631563 tracks the work to automatically symbolicate incoming crash pings and create a derived dataset in STMO with that information. I submitted a PoC (with much help from :wlach and :willkg) which demonstrates how it could be done. Once all that is done and the derived dataset is in place, this bug tracks doing whatever gfx-specific thing we want using that data. It should really just be a matter of doing a SQL query against the derived dataset to group by crash signature and sort by descending count, and that will give us the list of top crashing signatures.
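
For illustration, the shape of that query, run here against an in-memory SQLite table so the example is self-contained. The table name `symbolicated_crashes` and its columns are hypothetical, since the derived dataset does not exist yet, and the real query would run on BigQuery via STMO rather than on SQLite.

```python
import sqlite3

# Group by signature, then sort by descending count (ties broken by name).
TOP_SIGNATURES_SQL = """
SELECT signature, COUNT(*) AS crash_count
FROM symbolicated_crashes
GROUP BY signature
ORDER BY crash_count DESC, signature ASC
LIMIT ?
"""


def top_signatures(conn: sqlite3.Connection, limit: int = 10) -> list:
    """Return (signature, crash_count) tuples for the top crashers."""
    return conn.execute(TOP_SIGNATURES_SQL, (limit,)).fetchall()


def demo_connection() -> sqlite3.Connection:
    """Build a tiny in-memory stand-in for the derived dataset."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE symbolicated_crashes (signature TEXT, build_id TEXT)"
    )
    conn.executemany(
        "INSERT INTO symbolicated_crashes VALUES (?, ?)",
        [("a", "1"), ("b", "1"), ("a", "2"), ("c", "2"), ("a", "3"), ("b", "3")],
    )
    return conn
```

A gfx-specific view would add a WHERE clause to the same query, for example filtering on a module or vendor column if the derived dataset ends up carrying one.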

So until the dependencies are resolved there's nothing to do here. As it's unlikely to get done in the next week, I'll unassign this bug so somebody else can pick it up when the time comes.

Assignee: kats → nobody

Jeff Muizelaar [:jrmuizel]

Reporter

Comment 19

•

3 years ago

We have this here: https://mathies.com/mozilla/crashes/

Status: NEW → RESOLVED

Closed: 3 years ago

Resolution: --- → FIXED