Closed Bug 1575850 Opened 6 years ago Closed 6 years ago

Glean crash recording capabilities are feature complete

Categories

(Data Platform and Tools :: Glean: SDK, task, P2)

task

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: travis_, Assigned: travis_, NeedInfo)

References

Details

(Whiteboard: [telemetry:glean-rs:m8])

Collecting Crash Metrics With Glean

There are several ways in which we currently collect crash data on mobile, including the Google Play Console, the Apple App Store, Sentry, Socorro, and Glean. Each of these collects different amounts and types of information, and some of them rely on third-party libraries and services. A commonly stated problem with our current approach is that we lack the capability to effectively connect crash data to current usage/client/version/experiment data, which limits how we can analyze crashes in relation to other data points.

Recent updates to Glean and android-components have added a way to record crash count telemetry via Glean. Glean helps to solve the problem of connecting crash telemetry with other important data points, like the desired usage/client/version/experiment data. Now is a perfect time to consider what additional crash information we should be able to collect with Glean, as well as what other considerations around crash metrics Glean needs to be able to handle.

The purpose of this document is to answer some questions regarding the requirements of crash telemetry collection, in order to evaluate Glean’s capabilities against them and identify any deficiencies. I will be adding ni? requests for a couple of people I know are interested in this so they can weigh in. If you have been ni?'d and you know someone else who would be interested in commenting, or if you feel someone else could answer better, please add a ni? or flag me so I can.

Stakeholders

  • Telemetry
  • Data Science
  • Engineering (relating to collection of crash data for debugging purposes)

Note: This list is not intended to be comprehensive; there may be other interested parties.

Questions

There are a few initial questions that need to be answered to help define the requirements. This, in turn, will help to determine whether or not Glean’s current capabilities are sufficient for instrumenting crash metrics in applications.

Question 1: What crash specific data needs to be included?

  • There is an existing schema for desktop crash pings that can be found here. Does it make sense to keep anything from this format, or to start with a new, smaller format?
  • An initial look at this produced the requirements found in this doc. There were still many items from the doc that had questions surrounding them, but it should serve as a good starting point for discussion.

Question 2: Should we look for a way to link to the Sentry information, or would it be possible and beneficial to integrate the information/value we get from Sentry into Glean and reduce reliance on third party tools?

  • If so, what information is required, etc...

Question 3: Is it desirable to integrate with current Crash Stats (Socorro) data somehow?

  • What existing data collected by Socorro can be replaced or enhanced by the data we collect?

Question 4: Can this information be sent on the schedule of an existing Glean ping, or are there custom scheduling requirements for collecting crash data?

  • There are lots of reasons to try and send crash data as soon as possible, but what are the actual requirements for the scheduling of crash data upload?

Question 5: With Glean-Core soon to replace Glean-AC, do we still want to rely on android-components, or would consumers be better served if the crash functionality was built into Glean in a more cross-platform way?

  • There will soon be other platforms besides Android supported by Glean, which is why it might make sense for crash data collection to be built into Glean rather than relying on another library. This doesn’t necessarily obsolete the current lib-crash integration, as it could still be used to instrument Android applications that use android-components.

Emily Thompson, can you please weigh in for data science (or designate someone from data science to have an opinion)?

Flags: needinfo?(ethompson)

Stefan, I know you are interested from an Engineering standpoint, would you mind commenting or designating someone who can weigh in on my questions?

Flags: needinfo?(sarentz)
Assignee: nobody → tlong

If anyone else stumbles upon this and wishes to comment, please do! I don't mean to exclude any opinions on the matter.

Whiteboard: [telemetry:glean-rs:m?] → [telemetry:glean-rs:m8]

Saptarshi - can you or Chris or Corey have a look at this?

Flags: needinfo?(ethompson) → needinfo?(sguha)

@Corey to weigh in

Flags: needinfo?(cdowhygelund)
Blocks: 1578473

Handing over to wbeard, as he has much more knowledge of crashes.

Flags: needinfo?(cdowhygelund) → needinfo?(wbeard)

Hi Travis: sguha, Corey and I have discussed this and we'll be putting together a response with input to these questions

(In reply to Chris [:wbeard] (EST) from comment #7)

Hi Travis: sguha, Corey and I have discussed this and we'll be putting together a response with input to these questions

Thanks! Much appreciated and I'm looking forward to the input!

Question 1: What crash specific data needs to be included?

We’ve been working on modeling crash rates recently, which, in addition to the typical attributes (os, channel, crash type, version/pre-release version, build id), requires client id to tie to usage. The complexities of Firefox versioning have made for some hurdles when all of the version information isn’t contained in the crash data set, so including as much version information as possible would be useful (example fields in our data have names like app_version, display_version, build_id).

In hindsight, it would be nice to have had usage metrics in the crash ping (session hours, active hours, # URIs). This could be asking for too much as the costs may outweigh the benefits, but it would improve crash rate estimates in the case where a given client doesn't have a matching main ping with the relevant information. Regardless, ensuring there’s a way to easily join crash pings to usage and experiment (enrollment) data would make this data more useful.

We currently treat the different desktop crash types very differently (content vs main vs startup). I don’t know if these desktop crash types map directly onto mobile crash types, but we would need this kind of identifier. Does mobile have a different taxonomy of crash types that engineering treats differently?

We have had other investigations where more fine-grained crash reasons (“MozCrashReason”) and memory information are important for diagnosing problems with recent releases, so all of that data would be useful in the crash ping.

Regarding experiments, it would be useful to include experiments that clients are enrolled in. There’s currently a bug to tag clients as part of experiments even if they unenroll. This feature should also probably be part of a Glean crash report, but I’ll tag Felix, who is already involved with mobile experiments.

Question 3: Is it desirable to integrate with current Crash Stats (Socorro) data somehow?

Relman can probably answer this better than we can. Data science hasn’t had a lot of experience with Socorro data other than limited deep dives into details or causes of crashes.

Having said that, being able to easily tie crash telemetry to Socorro reports could enable more kinds of analysis in the future, but there could be issues with being able to connect the PII in Socorro with our broader telemetry ecosystem.

Question 4: Can this information be sent on the schedule of an existing Glean ping, or are there custom scheduling requirements for collecting crash data?

As you note, there are lots of reasons to send the crash ping ASAP; the way this most affects us is the difference between the crash time and the submission time. The closer these two are, the better we can model crash rates, and the quicker relman can act on problems with recent releases. Since our data is typically partitioned by submission date, pings with a very different crash time will likely be dropped from most analyses.

I’m NI’ing members of Relman since they’ve had different use cases for crash pings, and Felix in case he wants to add anything, since he has more background in mobile & experiments.

Flags: needinfo?(wbeard)
Flags: needinfo?(sguha)
Flags: needinfo?(rkothari)
Flags: needinfo?(mozillamarcia.knous)
Flags: needinfo?(flawrence)

(In reply to Chris [:wbeard] (EST) from comment #9)

We currently treat the different desktop crash types very differently (content vs main vs startup). I don’t know if these desktop crash types map directly onto mobile crash types, but we would need this kind of identifier. Does mobile have a different taxonomy of crash types that engineering treats differently?

Most crash types will map directly, as mobile has both a main process and content processes just like desktop. Mobile crashes, however, include uncaught Java exceptions, which simply do not exist on desktop.

We have had other investigations where more fine grained crash reasons (“MozCrashReason”) and memory information is important for diagnosing problems with recent releases, so all of that data would be useful in the crash ping.

Note that only some of the data that is present in a crash report ends up in crash pings. “MozCrashReason” is one of those fields; you can find the others in CrashAnnotations.yaml. Every annotation marked with ping: true will be in the crash ping; the others won't.

Having said that, being able to easily tie crash telemetry to Socorro reports could enable more kinds of analysis in the future, but there could be issues with being able to connect the PII in Socorro with our broader telemetry ecosystem.

AFAIK there's been an ongoing effort to do that since we implemented client-side stack-walking (bug 1280469). However, there is still no tool available to easily correlate crash ping data to Socorro reports, even though it is technically possible to do so.

One final note: Fennec did send fully-featured crash pings so the code we used there might be used as a reference.

Ha! Now I see that Travis asked Emily who asked Corey who asked Chris who asked me, and I asked Travis about this on Monday.

From an experiment perspective, it would be lovely if the crash ping could have the same experiments map as the other Glean pings. This, the client_id and the crash ping's submission date are enough for the crash ping to be used to detect the number of crashes a client has experienced in an experiment. It's possible that some esoteric experiments will want to examine the number of crashes meeting some specific criteria (e.g. analyse startup crashes specifically), in which case that information about the crash needs to be in the ping. But we DSes are not the right people to generate that list of information: we won't know what information we need until a product team comes to us asking to measure it!

In any case, experimentation's needs should be served by client_id, submission date and the experiments map plus a subset of whatever data other people want for debugging crashes in a non-experiment context.
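
For reference, the experiments map described here comes from Glean's existing experiments API, so a crash ping would pick it up automatically through ping_info. A minimal sketch of the Kotlin bindings, using a hypothetical experiment id (exact signatures may differ between SDK versions):

import mozilla.telemetry.glean.Glean

// Mark the client as enrolled; the annotation is then included in the
// ping_info.experiments map of every ping Glean sends, including any
// custom crash ping. An optional extras map can carry further annotations.
Glean.setExperimentActive("fenix-crash-experiment", "treatment")

// And on unenrollment:
Glean.setExperimentInactive("fenix-crash-experiment")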

Flags: needinfo?(flawrence)

Hello Travis - It might be interesting to circulate this to the @stability mailing list. I suspect that there may be others interested in adding feedback beyond just the Relman team and the others who are already on the ni list.

Some feedback from the mobile side:

I like having more extensive crash pings - but not to replace Sentry. I don't think it should be a goal to replace Sentry because that is a good working system. I am sure we can send Sentry-like data in a crash ping but I am not convinced we would (should) build a similar dashboarding and analysis on top of that. It works, so why reinvent that.

For me the real power is this: having more metadata in crash pings that lives in a system we control (like Redash), and which can be tied to other telemetry we have, will let us create really interesting dashboards. For example, I want to ask simple questions like:

  • what is the crash ratio difference between ARM64 and ARM32
  • do people with devices with less memory experience more crashes
  • do we get AndroidPermissionErrors on a specific cluster of app versions or device types
  • what are the top 10 crash signatures for the last shipped app version and how does it compare to the previous version
  • do users with a specific feature flag turned on have a different crash ratio than others

So I think for me the power is not so much in the stack traces (for those I would go to Sentry), but in really being able to see aggregate crash metrics over many clients, combined with existing telemetry.

Flags: needinfo?(sarentz)

From my perspective as someone who works in crash analysis and investigation, I am all for integration that would expand our ability to analyze data. However, I am wondering whether the fact that Socorro is now in maintenance mode makes any difference in this discussion. I am not sure if there would be work needed on the Socorro side as part of the integration.

Flags: needinfo?(mozillamarcia.knous)

I've finally had time to go through the input from the needinfo's and I want to thank everyone who contributed. I learned a lot about how we measure and deal with crash information from all of this.

Based on the responses I've gotten, I feel like I can make the following statements in answer to the questions I posed:

Question 1: What information do we need?

  • We need to include a crash reason or other telemetry about the crash.
  • We need metadata about the client, such as client_id, os, version, manufacturer, etc.
  • Having information about active experiments is important.

Based on the above statements, I feel that Glean is already capable of collecting this information (a rough sketch follows the list):

  • Crash reasons can be recorded as a metric just like any other, along with additional crash information as needed.
  • Device info is collected in the client_info section.
  • Experiment enrollment info can be found in the ping_info section.
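
To make the first point concrete, here is a rough sketch of what recording this could look like with metrics generated from a metrics.yaml by glean_parser. The metric names and the GleanMetrics import are purely illustrative assumptions, not an existing API:

import org.mozilla.myapp.GleanMetrics.Crash  // hypothetical generated object

fun recordCrash(processType: String, mozCrashReason: String?) {
    // Count crashes per process type (e.g. main, content, java_exception),
    // assuming a labeled counter named crash.count in metrics.yaml.
    Crash.count[processType].add(1)

    // Record the fine-grained crash reason when available, assuming a
    // string metric named crash.cause.
    mozCrashReason?.let { Crash.cause.set(it) }
}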

Question 2: Can we integrate with or replace Sentry?

Based mostly on Stefan's comments, I don't see any value in trying to supplant Sentry with Glean.

Question 3: Should we integrate with Socorro?

  • Socorro is currently in maintenance mode.
  • Based on Gabriele's comment, it is already technically feasible to correlate ping data to Socorro data.

Based on the above statements, there is either low interest in this right now, or it could be done with some effort and probably without modifying Glean.

Question 4: What are the scheduling requirements for sending collected crash information?

Basically everyone I've talked to about this has said: ASAP. Fortunately, Glean allows consumers to create custom pings, which can be sent on a custom schedule controlled by the integrating application.
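
As a sketch of what that could look like, assuming a pings.yaml entry named "crash" processed by glean_parser (the Pings import and ping name are illustrative assumptions, and older SDK versions name the submission method differently):

import org.mozilla.myapp.GleanMetrics.Pings  // hypothetical generated object

fun submitCrashPing() {
    // The integrating application decides when this runs -- for example
    // right after persisting crash data, or on the next startup following
    // a crash -- instead of waiting for the daily metrics ping.
    Pings.crash.submit()
}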

Summary

Based on the information above, and based on Glean's current capabilities, I conclude that no change to Glean is required to meet the needs surrounding crash telemetry:

  • Glean can create a custom ping, so an integrating application could instrument a "crash ping" with an upload schedule that suits its own needs.
  • Glean already has metric types to accommodate a "crash reason", and potentially other valuable information about the crash, so no new metric types need to be supported in order to accomplish valid crash telemetry.
  • Glean already collects all of the metadata that is desired to be able to answer questions related to crash rates.

What I do propose is that a section be added to the [Glean documentation] documenting an example or strategy for collecting crash telemetry using existing Glean capabilities, such as custom pings and appropriate Glean data types for crash type, etc.

That leaves an open question about what to do with lib-crash... There is an existing implementation of collecting crash counts using Glean in android-components/lib-crash. This implementation has a couple of drawbacks. First, depending on how it is configured, we may not get the telemetry unless the user chooses to send the data with every crash (if configured to prompt the user, a dialog is shown and the user can ignore it). Secondly, this implementation uses the Glean metrics ping, which is only sent at most once a day. I would support upgrading this to use a custom ping for scheduling, but that would also increase the integration overhead, as any application that wanted to use this would need to add the custom ping schema to the pipeline in order to collect it.
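
For context, here is a rough sketch of the lib-crash wiring referenced above, loosely based on how applications like Fenix configure it; the constructor parameters vary between android-components versions, so treat this as illustrative rather than exact:

import android.content.Context
import mozilla.components.lib.crash.CrashReporter
import mozilla.components.lib.crash.service.GleanCrashReporterService

fun setupCrashReporting(context: Context) {
    CrashReporter(
        services = listOf(
            // Records crash counts into Glean (currently via the metrics ping).
            GleanCrashReporterService(context)
            // ...other services (Socorro, Sentry, ...) would be listed here.
        ),
        // When set to prompt, the user can dismiss the dialog, which is the
        // source of the telemetry gap described above.
        shouldPrompt = CrashReporter.Prompt.ALWAYS
    ).install(context)
}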

I've mentioned this before (though not on this bug), but I feel being able to reconstruct a "crash signature" of a crash ping from the stack is something we really, really want (https://bluesock.org/~willkg/blog/mozilla/crash_pings_crash_reports.html#symbolication-and-signatures). The "crash reason" field just isn't enough to tell you what kind of event a crash ping represents. Being able to correlate crash ping data with Socorro is one reason to do this, but it's certainly not the only one.

Would glean support an arbitrary json payload with some size (something like this: https://github.com/mozilla/fx-crash-sig/blob/master/fx_crash_sig/sample_traces.py) for a "metric"?

(In reply to William Lachance (:wlach) (use needinfo!) from comment #16)

I've mentioned this before (though not on this bug), but I feel being able to reconstruct a "crash signature" of a crash ping from the stack is something we really, really want (https://bluesock.org/~willkg/blog/mozilla/crash_pings_crash_reports.html#symbolication-and-signatures). The "crash reason" field just isn't enough to tell you what kind of event a crash ping represents. Being able to correlate crash ping data with Socorro is one reason to do this, but it's certainly not the only one.

Would glean support an arbitrary json payload with some size (something like this: https://github.com/mozilla/fx-crash-sig/blob/master/fx_crash_sig/sample_traces.py) for a "metric"?

I'm not sure Glean will ever support an "arbitrary json payload", as it goes against the design. What Glean would do is support metric types that allow for capturing the information I see in the sample trace you linked. From what I can see, the fields appear to be uniform and consistent, and that is something that Glean can handle quite well; even if it requires us to add a specific Glean data type to represent the data, the effort is relatively low.

Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED

(In reply to Travis Long from comment #17)

(In reply to William Lachance (:wlach) (use needinfo!) from comment #16)

I've mentioned this before (though not on this bug), but I feel being able to reconstruct a "crash signature" of a crash ping from the stack is something we really, really want (https://bluesock.org/~willkg/blog/mozilla/crash_pings_crash_reports.html#symbolication-and-signatures). The "crash reason" field just isn't enough to tell you what kind of event a crash ping represents. Being able to correlate crash ping data with Socorro is one reason to do this, but it's certainly not the only one.

Would glean support an arbitrary json payload with some size (something like this: https://github.com/mozilla/fx-crash-sig/blob/master/fx_crash_sig/sample_traces.py) for a "metric"?

I'm not sure Glean will ever support an "arbitrary json payload", as it goes against the design. What Glean would do is support metric types that allow for capturing the information I see in the sample trace you linked. From what I can see, the fields appear to be uniform and consistent, and that is something that Glean can handle quite well; even if it requires us to add a specific Glean data type to represent the data, the effort is relatively low.

Awesome! It sounds like the right thing to do for now is to move ahead with implementing a Glean-style crash ping without the trace annotation, and add that datatype later when we need it. Thanks for doing this investigation.

See Also: → 1582479

Just one last note to be sure everybody is on the same page: Socorro being in maintenance mode doesn't mean it's going away; it means it's essentially feature-complete. We rely on Socorro for triaging and investigating crashes of native components, and we relied on it for Fennec's Java crashes. I would suggest we keep submitting native crashes to Socorro to make them visible to engineering. At the moment there is no other reliable source for that, and not having crashes in Socorro means we might not see them.

(In reply to Gabriele Svelto [:gsvelto] from comment #19)

Just one last note to be sure everybody is on the same page: Socorro being in maintenance mode doesn't mean it's going away; it means it's essentially feature-complete. We rely on Socorro for triaging and investigating crashes of native components, and we relied on it for Fennec's Java crashes. I would suggest we keep submitting native crashes to Socorro to make them visible to engineering. At the moment there is no other reliable source for that, and not having crashes in Socorro means we might not see them.

Absolutely! I hope I was not implying that there was no value to Socorro. I just wanted to point out that there is nothing driving (or stopping) integrating it with Glean at the current time.

Hi Travis,
Is Fenix submitting crash pings and are they being populated inside telemetry.crash?
As an aside, does Fennec submit crash pings and do they get populated inside telemetry.crash?

I ask because we want to start creating some baselines for Fenix stability.

Flags: needinfo?(tlong)
Flags: needinfo?(sarentz)

Yes, it looks like Fenix is using the lib-crash implementation: https://github.com/mozilla-mobile/fenix/blob/8f97d247a69fd7321a2d22f21fd6a92d4f689cbd/app/src/main/java/org/mozilla/fenix/components/Analytics.kt#L50

Unfortunately, I don't know if they are getting to telemetry.crash or not. I do know that everything should be queried from Redash or BigQuery, and I also think there is a crash dashboard in the works that may help you out too. I think :chenxia may know more about the dashboard; I also found this bug that is related: https://bugzilla.mozilla.org/show_bug.cgi?id=1584351

Let me know if that doesn't get you going in the right direction and I'd be happy to help more, if I am able.

Flags: needinfo?(tlong)

(In reply to "Saptarshi Guha[:joy]" from comment #21)

As an aside, does Fennec submit crash pings and do they get populated inside telemetry.crash?

Fennec did send them (and the ESR68 version still does) and they did get into telemetry.crash.

Just to clarify, Fenix is not sending a "crash ping" via Glean; it's merely sending crash counts via the Glean metrics ping. [:dexter] reminded me of the table names where the data you are interested in can be found:

  • org_mozilla_fenix_nightly.metrics
  • org_mozilla_fenix.metrics

Flags: needinfo?(sarentz)