Open Bug 1631563 Opened 4 years ago Updated 3 months ago

[meta] Add a bigquery-derived dataset with crash signatures for (a subset of?) telemetry.crash

Categories

(Data Platform and Tools :: General, task, P2)

task
Points:
3

Tracking

(Not tracked)

People

(Reporter: wlach, Unassigned)

telemetry.crash includes crash stacks and other data from which we should be able to derive a crash signature using Ben Wu's fx-crash-sig library (see e.g. https://github.com/mozilla/fx-crash-sig/issues/10). We could put those signatures in a table in bigquery (along with the document id) and then use that information for determining how frequently we're seeing various types of signatures.
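As a rough illustration of the frequency question, here is a minimal sketch that assumes the derived table can be read back as (document_id, signature) pairs. The function name and the sample signatures are hypothetical; nothing here is part of fx-crash-sig.

```python
from collections import Counter


def signature_counts(rows):
    """Count signatures in (document_id, signature) pairs.

    Repeated document ids are counted once, on the assumption that a
    document id identifies a single crash ping.
    """
    seen = set()
    counts = Counter()
    for document_id, signature in rows:
        if document_id in seen:
            continue
        seen.add(document_id)
        counts[signature] += 1
    # Most frequent signatures first.
    return counts.most_common()


# Made-up rows, for illustration only.
example_rows = [
    ("doc-1", "OOM | small"),
    ("doc-2", "OOM | small"),
    ("doc-2", "OOM | small"),
    ("doc-3", "mozalloc_abort"),
]
top_signatures = signature_counts(example_rows)
```

In practice the same aggregation would more likely be a GROUP BY in BigQuery; the Python version only shows the shape of the result.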

I can kick-start the data platform basics of this task -- setting up an ETL job, writing to bigquery, etc. in the hope that :kats and others can perform whatever modifications are needed to fx-crash-sig so that we're actually generating useful data.

Rough design parameters:

  • Write this as a Python script which takes a day as input, reads that day's data from telemetry.crash, derives a signature for each crash, and writes the results to a destination table in BigQuery
  • Can probably get away with running this only against a 1% sample of pings in production (especially on release, where we get a ton of pings)
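The design parameters above could be sketched roughly as follows. This is not the actual job: the column names (document_id, payload, submission_timestamp, sample_id), the assumption that sample_id runs from 0 to 99, and the three injected callables (fetch_rows, make_signature, write_rows) are all assumptions made for illustration. A real version would plug in a BigQuery client and fx-crash-sig.

```python
from datetime import date

# Assumed column names; check them against the real telemetry.crash schema.
QUERY_TEMPLATE = """
SELECT document_id, payload
FROM `{table}`
WHERE DATE(submission_timestamp) = '{day}'
  AND sample_id < {sample_percent}
"""


def build_query(day, sample_percent=1, table="telemetry.crash"):
    """Return SQL selecting roughly sample_percent% of one day's crash pings."""
    if not 1 <= sample_percent <= 100:
        raise ValueError("sample_percent must be between 1 and 100")
    return QUERY_TEMPLATE.format(
        table=table, day=day.isoformat(), sample_percent=sample_percent
    )


def process_day(day, fetch_rows, make_signature, write_rows, sample_percent=1):
    """Fetch one day's sampled pings, derive signatures, and write them out.

    fetch_rows(sql) yields dicts with "document_id" and "payload",
    make_signature(payload) returns a signature string (or a falsy value),
    and write_rows(rows) stores the output. All three are hypothetical
    hooks. Returns the number of rows written.
    """
    out = []
    for row in fetch_rows(build_query(day, sample_percent)):
        signature = make_signature(row["payload"])
        if signature:  # skip pings that yield no usable signature
            out.append({
                "document_id": row["document_id"],
                "signature": signature,
                "submission_date": day.isoformat(),
            })
    write_rows(out)
    return len(out)
```

Keeping the I/O behind callables means the day-level logic can be exercised without BigQuery access, and the same function could later be called from an Airflow task.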

When we're happy with the results, we can look into scheduling this on Airflow as a regular job.

Depends on: 1631564

Going to start on this today, sorry for the delay.

Assignee: nobody → wlachance

Quick update: working on this, found out that we weren't processing crash data quite correctly (bug 1635212) -- hope to resume work on this soon.

Any update on this Will?

Flags: needinfo?(wlachance)

(In reply to Jeff Muizelaar [:jrmuizel] from comment #3)

Any update on this Will?

Sorry for being a blocker on this, I feel bad. Anyway, I managed to carve out an hour and get a first draft of this finished; you can see the tentative PR here, along with a bunch of technical details:

https://github.com/mozilla/fx-crash-sig/pull/11

Unfortunately you need permissions to the fx-crash-sig-bigquery project to see the results (:kats has this, others don't -- but I can add you), but they seem sorta kinda reasonable, at least to my untrained eye. We should probably figure out a way forward on this -- I can help with the data engineering / BigQuery bits, but realistically I will need help to get the crash processing machinery into shape. Let's chat about this offline.

Flags: needinfo?(wlachance)

Let me know what I can help with. I'll try to do a siggen release tomorrow to pick up updates.

:wlach - does the updated siggen release unblock you per comment 4?

Flags: needinfo?(wlachance)
Points: --- → 3
Priority: -- → P3

(In reply to Mark Reid [:mreid] from comment #7)

:wlach - does the updated siggen release unblock you per comment 4?

:kats is currently working on this to my knowledge (at least the crash signature generation parts; I volunteered to wire up the Airflow portion once we were happy with the results). I believe there's nothing blocking this short of time/energy to work on it. Unassigning myself for now.

Assignee: wlachance → nobody
Flags: needinfo?(wlachance)

Yeah I can assign this bug to myself to indicate that the next steps are on me.

Assignee: nobody → kats
Blocks: wr-80
No longer blocks: wr-80

We landed :kats' work, next steps belong to me:

  1. Add continuous integration to build a docker container to run this extract/write process
  2. Wire it up to telemetry-airflow

I'll at least start on this today, have been juggling a lot of things lately so not sure how long it will take.

Assignee: kats → wlachance

Thanks! Also just to update on this bug: per https://github.com/mozilla/fx-crash-sig/pull/15#discussion_r477519618 this now (sort of) depends on bug 1636210 to make tecken better handle the symbolication load. It would also be nice to fix bug 1660516, since it can cause a whole batch of crashes to fail symbolication, after which we have to fall back and try each crash individually, which is slow. I can look into that.
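The batch-then-individual fallback described here can be sketched as below. This is not the fx-crash-sig implementation: symbolicate_batch is a hypothetical callable that takes a list of crashes and returns results in the same order, raising an exception if the request fails.

```python
def symbolicate_with_fallback(crashes, symbolicate_batch):
    """Symbolicate crashes as one batch, retrying one at a time on failure.

    Returns a list aligned with `crashes`; entries that still fail when
    tried individually are None.
    """
    try:
        # Fast path: a single request for the whole batch.
        return list(symbolicate_batch(crashes))
    except Exception:
        pass  # fall through to the slow path below

    results = []
    for crash in crashes:
        try:
            # Slow path: one request per crash, so one bad crash
            # cannot take down the others.
            results.append(symbolicate_batch([crash])[0])
        except Exception:
            results.append(None)
    return results
```

The slow path costs one request per crash, which is why fixing the underlying batch failure matters for throughput.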

Depends on: 1636210, 1660516

Not working on this right now, so unassigning myself. :willkg is working on some of the dependencies for getting this implemented.

Assignee: wlachance → nobody

Making this a tracker bug and bumping it up to P2 since esmyth expressed a need for signatures on crash pings.

Priority: P3 → P2
Summary: Add a bigquery-derived dataset with crash signatures for (a subset of?) telemetry.crash → [meta] Add a bigquery-derived dataset with crash signatures for (a subset of?) telemetry.crash

Current status on this:

I'm fixing fx-crash-sig so people can use it as a Python library in notebooks and other places for their ad hoc investigations. (https://github.com/mozilla/fx-crash-sig/issues/25)

When we did some fx-crash-sig work a couple of years ago, Tecken (the Mozilla Symbols Server) fell down and we determined that it was a bad idea for symbolication to be tied to symbols upload. I'm working on spinning off a separate symbolication microservice. That's tracked in bug #1636210. I'll make this bug depend on that one.

Once that's done and we can use a symbolication API that can handle bursts, I think the plan was to run that on a sample of crash pings nightly and capture the symbols and crash signatures. I'm fuzzy on the architecture and how that'll work.

Current status on this:

I still need to finish up fixing fx-crash-sig so people can use it as a Python library in notebooks and other places for ad hoc investigations. I got about half-way with that and then I got side-tracked and never got back to finish it up. (https://github.com/mozilla/fx-crash-sig/issues/25)

I still need to finish standing up Eliot. Eliot is the name of the new symbolication API microservice that I'm spinning out of Tecken. I hit two bugs with the underlying library (Symbolic) that were fixed a few weeks ago. I need to circle back, verify the fixes, and then finish standing up Eliot. That's tracked in bug #1636210.

Once Eliot is stood up, I can move on to the work for this bug.

Current status:

Eliot is in production. https://symbolication.services.mozilla.com/

Next step is to update fx-crash-sig and get it working. I might get to this in 2021. https://github.com/mozilla/fx-crash-sig/issues/25

After that, I'm ready to look at this bug. I'll probably start by writing up a project plan to figure out what the goals are, stakeholders, use cases, and the rest of the details. I think I'll get to this in 2022.

Component: Datasets: General → General