Closed Bug 1121013 Opened 9 years ago Closed 9 years ago

FHR data migration: org.mozilla.crashes

Categories

(Firefox Health Report Graveyard :: Client: Desktop, defect)

x86_64
Windows 7
defect
Not set
normal

Tracking

(firefox40 fixed, firefox41 fixed)

RESOLVED FIXED
Firefox 41
Tracking Status
firefox40 --- fixed
firefox41 --- fixed

People

(Reporter: Dexter, Assigned: benjamin)

References

Details

(Whiteboard: [uplift])

Attachments

(3 files)

org.mozilla.crashes.crashes has the following data:

  content-crash
  content-crash-submission-succeeded
  content-crash-submission-failed
  content-hang
  content-hang-submission-succeeded
  content-hang-submission-failed
  gmplugin-crash
  gmplugin-crash-submission-succeeded
  gmplugin-crash-submission-failed
  main-crash
  main-crash-submission-succeeded
  main-crash-submission-failed
  main-hang
  main-hang-submission-succeeded
  main-hang-submission-failed
  plugin-crash
  plugin-crash-submission-succeeded
  plugin-crash-submission-failed
  plugin-hang
  plugin-hang-submission-succeeded
  plugin-hang-submission-failed

They all need to be added to Telemetry.
No longer depends on: 1121007
Summary: FHR data migration: org.mozilla.crashes.crashes → FHR data migration: org.mozilla.crashes
Blocks: 1121016
No longer blocks: 1121016
Blocks: 1137160
No longer blocks: 1120356
We need to migrate this. This is relatively easy to add as keyed histograms (type count).
The plugin/content hangs and crashes are histograms.

The main-process crashes are more complicated and need design work. We will have counts of aborted sessions from bug 1133536, but counting the rate that breakpad captures crashes and the rate that users submit them requires design. How will this interact with the crash manager?
And FWIW the plugin/content crash and abort metrics are already recorded in telemetry. We'll only need to record submissions.
(In reply to Benjamin Smedberg  [:bsmedberg] from comment #2)
> The main-process crashes are more complicated and need design work.

All i see in the existing provider are counters:
https://dxr.mozilla.org/mozilla-central/source/services/healthreport/providers.jsm#1144
https://dxr.mozilla.org/mozilla-central/source/services/healthreport/providers.jsm#1186

Am i missing something?
Flags: needinfo?(benjamin)
I'm not sure I understand the question.

We can count plugin/content/gmp crashes and submission as they happen, in histograms.

We can't do that for main-process crashes.
Flags: needinfo?(benjamin)
Ah. Lets say its late for me.
I don't think i will be able to figure this out before i leave, happy for someone else to suggest an approach and mark it as [help] in the meantime.
Depends on: 1140132
For the design of main-process crashes, I think we should use a separate ping type. That way we can record the environment of the process that crashed, rather than lumping it in with the environment of the process which noticed the crash on next startup. It also means that we can attach other interesting data to the new ping, such as the OOM data from bug 1135543, or even other crash annotations, as long as they have appropriate privacy characteristics.

That's rather complicated because it means saving off the old environment and associating it with a crash report. I filed bug 1140132 to make this easier.
Assignee: nobody → benjamin
This gets the environment (and other interesting metadata) into the crash manager. It doesn't seem that there are any automated tests for writing of the crash event files, so I may try to add one, but for now I've verified this manually.
Attachment #8582636 - Flags: review?(dmajor)
Attachment #8582636 - Flags: feedback?(ted)
Comment on attachment 8582636 [details] [diff] [review]
1121013-crashevent

Review of attachment 8582636 [details] [diff] [review]:
-----------------------------------------------------------------

Just took a quick glance for now; will try to dive into the details in a day or two.

::: toolkit/crashreporter/nsExceptionHandler.cpp
@@ +164,5 @@
> +// and are saved in the crash record and sent with Firefox Health Report.
> +static char const * const kCrashEventAnnotations[] = {
> +  "AsyncShutdownTimeout",
> +  "BuildID",
> +  "Environment",

What is the Environment annotation? I haven't seen it before. (I hope it's not environment variables)
(In reply to :dmajor (semi-away, use needinfo) from comment #10)
> What is the Environment annotation? I haven't seen it before. (I hope it's
> not environment variables)

It's from the patch on bug 1140132.
(In reply to Benjamin Smedberg  [:bsmedberg] from comment #9)
> Created attachment 8582636 [details] [diff] [review]
> 1121013-crashevent
> 
> This gets the environment (and other interesting metadata) into the crash
> manager. It doesn't seem that there are any automated tests for writing of
> the crash event files, so I may try to add one, but for now I've verified
> this manually.

I know gps added copious tests when he wrote this functionality, here's one for example:
https://dxr.mozilla.org/mozilla-central/source/toolkit/crashreporter/test/unit/test_event_files.js
Excellent, I was looking in the other test directory. This adds the round-trip test for the annotations.
Attachment #8583256 - Flags: review?(ted)
Comment on attachment 8582636 [details] [diff] [review]
1121013-crashevent

Review of attachment 8582636 [details] [diff] [review]:
-----------------------------------------------------------------

I don't need to do another round if there aren't huge changes, but please wait for ted's feedback.

::: toolkit/crashreporter/nsExceptionHandler.cpp
@@ +532,5 @@
> +  bool Valid() {
> +    return mHandle != INVALID_HANDLE_VALUE;
> +  }
> +
> +  void WriteBuffer(const char* buffer, size_t len)

What do you think about taking a void* to emphasize that this is arbitrary bytes? (Reason I ask: I got a little scared when you called this char* method with a wide-string at 691)

@@ +538,5 @@
> +    if (!Valid()) {
> +      return;
> +    }
> +    DWORD nBytes;
> +    WriteFile(mHandle, bytes, len, &nBytes, nullptr);

What is |bytes|?

@@ +595,5 @@
> +#endif
> +
> +template<int N>
> +void
> +WriteLiteral(PlatformWriter& pw, const char (&str)[N])

IMO these three would feel more natural as member functions.

@@ +786,5 @@
> +    if (isGarbageCollecting) {
> +      WriteAnnotation(apiData, "IsGarbageCollecting",
> +                      isGarbageCollecting ? "1" : "0");
> +      WriteAnnotation(eventFile, "IsGarbageCollecting",
> +                      isGarbageCollecting ? "1" : "0");

I don't love all this duplication... but it's not quiiite painful enough to make it worth doing some clever code thing.
Attachment #8582636 - Flags: review?(dmajor) → review+
Depends on: 1149754
Depends on: 1150978
Comment on attachment 8583256 [details] [diff] [review]
1121013-crashevent-test

Review of attachment 8583256 [details] [diff] [review]:
-----------------------------------------------------------------

Sorry for the delay.
Attachment #8583256 - Flags: review?(ted) → review+
Comment on attachment 8582636 [details] [diff] [review]
1121013-crashevent

Review of attachment 8582636 [details] [diff] [review]:
-----------------------------------------------------------------

Sorry again for the delay.

::: toolkit/crashreporter/nsExceptionHandler.cpp
@@ +532,5 @@
> +  bool Valid() {
> +    return mHandle != INVALID_HANDLE_VALUE;
> +  }
> +
> +  void WriteBuffer(const char* buffer, size_t len)

I would suggest uint8_t* maybe.

@@ +605,5 @@
> +WriteString(PlatformWriter& pw, const char* str) {
> +#ifdef XP_LINUX
> +  size_t len = my_strlen(str);
> +#else
> +  size_t len = strlen(str);

There's already a #define my_strlen strlen for non-linux platforms at the top of the file, so you don't need this ifdef.

@@ +614,5 @@
> +
> +template<int N>
> +static void
> +WriteAnnotation(PlatformWriter& pw, const char (&name)[N],
> +                const char* value) {

This is nice, cleans up the code below significantly. Thanks!

@@ +693,2 @@
>  #elif defined(XP_UNIX)
> +    markerFile.WriteBuffer(minidumpPath, my_strlen(minidumpPath));

Is it worth making a wchar_t* override in WriteBuffer so you don't have to #ifdef this?

@@ +786,5 @@
> +    if (isGarbageCollecting) {
> +      WriteAnnotation(apiData, "IsGarbageCollecting",
> +                      isGarbageCollecting ? "1" : "0");
> +      WriteAnnotation(eventFile, "IsGarbageCollecting",
> +                      isGarbageCollecting ? "1" : "0");

Yeah, it's not fantastic, but I dunno. With the other code cleanup it's tolerable. You could just use a #define to do both writes like we do with WRITE_STATEX_FIELD below.

@@ +1681,5 @@
>  
> +static bool
> +IsInWhitelist(const nsACString& key)
> +{
> +  for (size_t i = 0; i < ArrayLength(kCrashEventAnnotations); ++i) {

I'm not sure that it's worth trying to do something faster here, since I'm not sure this would even show up on a profile with only 9 entries in the list, but note that this will wind up getting called every time someone calls AnnotateCrashReport, so maybe it's worth thinking about, or at least filing a followup. The overkill solution would be to use something like gperf to generate a perfect hashtable for this known set of keys.
Attachment #8582636 - Flags: feedback?(ted) → feedback+
Blocks: 1131836
Try run https://treeherder.mozilla.org/#/jobs?repo=try&revision=5b7a8dea8538

Part A & B pushed to fx-team.
Keywords: leave-open
Blocks: 1155871
sorry had to back this out for test failures like https://treeherder.mozilla.org/logviewer.html#?job_id=2866587&repo=fx-team
Flags: needinfo?(benjamin)
Flags: needinfo?(benjamin)
Depends on: 1166759
No longer depends on: 1166759
Bug 1121013 part C - report a new crash ping type on main-process crashes, and record submission rate/failure information in telemetry. Plugin/content crashes are already recorded via SUBPROCESS_ABNORMAL_ABORT and SUBPROCESS_CRASHES_WITH_DUMP and this patch leaves that unchanged. r?gfritzsche
Attachment #8612976 - Flags: review?(gfritzsche)
Depends on: 1169691
Comment on attachment 8612976 [details]
MozReview Request: Bug 1121013 part C - report a new crash ping type on main-process crashes, and record submission rate/failure information in telemetry. Plugin/content crashes are already recorded via SUBPROCESS_ABNORMAL_ABORT and SUBPROCESS_CRASHES_WIT

https://reviewboard.mozilla.org/r/9683/#review8647

::: toolkit/components/crashes/tests/xpcshell/test_crash_manager.js:26
(Diff revision 1)
> +  SetupTelemetryArchiveTest();

Nit: JS code functions start with lower-case letter.

::: toolkit/components/crashes/CrashManager.jsm:548
(Diff revision 1)
> +              crashDate: date.toISOString().slice(0, 10),

That slicing is a bit opaque, we should have at least a comment on what format this is expected to produce.

::: toolkit/components/crashes/CrashManager.jsm:537
(Diff revision 1)
> +          if ('TelemetryEnvironment' in metadata) {

Nit: This seems confusing, checking on `metaData` but changing `telemetryMeta`, could just do both on `telemetryMeta`.

::: toolkit/components/crashes/CrashManager.jsm:550
(Diff revision 1)
> +              hasCrashEnvironment: (crashEnvironment !== null),

Could we just put `null` or `{}` as `overrideEnvironment` instead of an additional flag?

::: toolkit/components/crashes/CrashManager.jsm:535
(Diff revision 1)
> +          let crashEnvironment = null;

I think this section should get some comment on what it is doing or preferrably be moved to a separate function.

::: toolkit/components/telemetry/TelemetryEnvironment.jsm:908
(Diff revision 1)
> -      applicationId: Services.appinfo.ID,
> -      applicationName: Services.appinfo.name,
> +      applicationId: Services.appinfo.ID || null,
> +      applicationName: Services.appinfo.name || null,
>        architecture: Services.sysinfo.get("arch"),
> -      buildId: Services.appinfo.appBuildID,
> -      version: Services.appinfo.version,
> -      vendor: Services.appinfo.vendor,
> -      platformVersion: Services.appinfo.platformVersion,
> +      buildId: Services.appinfo.appBuildID || null,
> +      version: Services.appinfo.version || null,
> +      vendor: Services.appinfo.vendor || null,
> +      platformVersion: Services.appinfo.platformVersion || null,

These changes are only needed for tests?

::: toolkit/components/telemetry/tests/unit/helper_TelemetryArchive.js:39
(Diff revision 1)
> +   * @param expected: an array of [['prop'...], 'value'] to check

This could use a description line on what it is used for.
I assume the use-case is to find pings that are new in the archive and match certain properties.

::: toolkit/components/telemetry/tests/unit/helper_TelemetryArchive.js:23
(Diff revision 1)
> + * was properly saved. To use, first initializing to collect the starting pings

Nit: initialize

::: toolkit/components/telemetry/tests/unit/helper_TelemetryArchive.js:62
(Diff revision 1)
> +      for (let [props, eval] of expected) {

Re-using the name `eval` in JS seems confusing.

::: toolkit/components/telemetry/tests/unit/helper_TelemetryArchive.js:46
(Diff revision 1)
> +   * @returns the ping data if found, null if not found

"@returns The data of the first ping that matches, null if not found"?

::: toolkit/components/telemetry/tests/unit/helper_TelemetryArchive.js:48
(Diff revision 1)
> +  promiseCheckFor: Task.async(function*(type, expected) {

`promiseFindPingWithProperties` or similar?

::: toolkit/components/telemetry/tests/unit/helper_TelemetryArchive.js:64
(Diff revision 1)
> +        for (let prop of props) {

Can we do that without jumping to labels please?
A helper function can make this much easier to follow, e.g.:
`let value = getPropertyValue(path, ping);`

::: toolkit/components/telemetry/tests/unit/helper_TelemetryArchive.js:1
(Diff revision 1)
> +const {utils: Cu} = Components;

This should have a `.jsm` extension.
`TelemetryArchiveTesting.jsm` in analogy to `AddonManagerTesting.jsm`?

::: toolkit/components/telemetry/Histograms.json:7551
(Diff revision 1)
> +  "PROCESS_CRASH_SUBMIT_ATTEMPT": {
> +    "expires_in_version": "never",
> +    "kind": "count",
> +    "keyed": true,
> +    "description": "An attempt to submit a crash. Keyed on the CrashManager Crash.type."
> +  },
> +  "PROCESS_CRASH_SUBMIT_SUCCESS": {
> +    "expires_in_version": "never",
> +    "kind": "boolean",
> +    "keyed": true,
> +    "description": "The submission status when main/plugin/content crashes are submitted. 1 is success, 0 is failure. Keyed on the CrashManager Crash.type."
> +  },

For FHR data equivalency these should be "releaseChannelCollection":"opt-out"
Attachment #8612976 - Flags: review?(gfritzsche)
https://reviewboard.mozilla.org/r/9683/#review8699

> Could we just put `null` or `{}` as `overrideEnvironment` instead of an additional flag?

I don't understand this comment. I am already passing null, and that has the effect of saving the ping with the current environment instead of the crashing environment.

> These changes are only needed for tests?

Yes, many xpcshell tests don't mock nsIXULRuntime which leads to these properties being undefined.
Flags: needinfo?(gfritzsche)
https://reviewboard.mozilla.org/r/9683/#review8777

> I don't understand this comment. I am already passing null, and that has the effect of saving the ping with the current environment instead of the crashing environment.

I wonder why we submit the current environment in this case and set a flag about its meaning.
Can't we change things so that we submit an empty environment block or null if we don't get the environment data out of the crash data?
That way the environment wouldn't have different meanings for this ping type and we could drop `hasCrashEnvironment`.
Would this lose us anything?
Flags: needinfo?(gfritzsche)
Comment on attachment 8612976 [details]
MozReview Request: Bug 1121013 part C - report a new crash ping type on main-process crashes, and record submission rate/failure information in telemetry. Plugin/content crashes are already recorded via SUBPROCESS_ABNORMAL_ABORT and SUBPROCESS_CRASHES_WIT

Bug 1121013 part C - report a new crash ping type on main-process crashes, and record submission rate/failure information in telemetry. Plugin/content crashes are already recorded via SUBPROCESS_ABNORMAL_ABORT and SUBPROCESS_CRASHES_WITH_DUMP and this patch leaves that unchanged. r?gfritzsche
Attachment #8612976 - Attachment description: , and record submission rate/failure information in telemetry. Plugin/content crashes are already recorded via SUBPROCESS_ABNORMAL_ABORT and SUBPROCESS_CRASHES_WIT → , and record submission rate/failure information in telemetry. Plugin/content crashes are already recorded via SUBPROCESS_ABNORMAL_ABORT and SUBPROCESS_CRASHES_WITH_DUMP and this patch leaves that unchanged. r?gfritzsche
Attachment #8612976 - Flags: review?(gfritzsche)
Comment on attachment 8612976 [details]
MozReview Request: Bug 1121013 part C - report a new crash ping type on main-process crashes, and record submission rate/failure information in telemetry. Plugin/content crashes are already recorded via SUBPROCESS_ABNORMAL_ABORT and SUBPROCESS_CRASHES_WIT

https://reviewboard.mozilla.org/r/9683/#review9159

This looks good with the below addressed.

::: toolkit/components/telemetry/tests/unit/TelemetryArchiveTesting.jsm:85
(Diff revision 2)
> +    TelemetryController.initLogging();

When using do_get_profile(true), this is not needed.

::: toolkit/components/telemetry/tests/unit/TelemetryArchiveTesting.jsm:81
(Diff revision 2)
> +    Services.prefs.setBoolPref("toolkit.telemetry.enabled", true);
> +    Services.prefs.setBoolPref("datareporting.healthreport.service.enabled", true);

I don't think we need either of those for TelemetryArchive testing purposes.

::: toolkit/components/telemetry/tests/unit/TelemetryArchiveTesting.jsm:55
(Diff revision 2)
> +   * @returns a matching ping if found, or null

"The first matching ping ..."
or "The oldest matching ping ..."
Attachment #8612976 - Flags: review?(gfritzsche) → review+
Is there something important about it being the first/oldest ping? I was trying to avoid specifying that, in case that becomes an implementation burden later. And ordering isn't important to any of the clients that I know of.
Flags: needinfo?(gfritzsche)
(In reply to Benjamin Smedberg  [:bsmedberg] from comment #33)
> Is there something important about it being the first/oldest ping? I was
> trying to avoid specifying that, in case that becomes an implementation
> burden later. And ordering isn't important to any of the clients that I know
> of.

Ah, ok. I think it's odd having "random" results in case of multiple matches.
But we can also just address that when there is a need, so let's leave it as is.
Flags: needinfo?(gfritzsche)
Keywords: leave-open
https://hg.mozilla.org/mozilla-central/rev/6034dc30409f
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Target Milestone: --- → Firefox 41
mreid, this added a new "crash" ping type. Do I need to file a bug to accept this ping type, or will that happen automatically?
Flags: needinfo?(mreid)
Whiteboard: [uplift]
The currently accepted types are:
idle-daily, saved-session, android-anr-report, ftu, loop, flash-video, main, activation, deletion

For new types, yes, please file a bug to add it as an allowed value.
Flags: needinfo?(mreid)
(In reply to Mark Reid [:mreid] from comment #38)
> The currently accepted types are:
> idle-daily, saved-session, android-anr-report, ftu, loop, flash-video, main,
> activation, deletion
> 
> For new types, yes, please file a bug to add it as an allowed value.

Just wanted to clarify - any value is accepted, but values that are not in the above list are bucketed into "OTHER".
Blocks: 1174871
Comment on attachment 8612976 [details]
MozReview Request: Bug 1121013 part C - report a new crash ping type on main-process crashes, and record submission rate/failure information in telemetry. Plugin/content crashes are already recorded via SUBPROCESS_ABNORMAL_ABORT and SUBPROCESS_CRASHES_WIT

(Side-note: the other patches here are already on 40)

Approval Request Comment
[Feature/regressing bug #]:
Unified Telemetry, https://wiki.mozilla.org/Unified_Telemetry
This is part of the first (main) batch of uplifts to 40 to enable shipping on that train, see bug 1120356, comment 2.
[User impact if declined]:
Data & measurement insight projects delayed or blocked with direct impact on projects depending on this.
[Describe test coverage new/current, TreeHerder]:
We have good automation coverage of the feature.
We also had manual tests of the main tasks as well as confirmation of correct behavior on Nightly for the patches here.
[Risks and why]:
Low-risk - these patches are rather isolated to Telemetry and have been on Nightly for a while with no bad reports.
We intend to track on-going data quality and other issues during the 40 aurora & beta and flip the new behavior off when it doesn't meet the requirements.
[String/UUID change made/needed]:
The only string changes were to the about:telemetry page.
We decided that we can live with missing translations on that page for a cycle as that page is not exactly user-facing.
Attachment #8612976 - Flags: approval-mozilla-aurora?
Comment on attachment 8612976 [details]
MozReview Request: Bug 1121013 part C - report a new crash ping type on main-process crashes, and record submission rate/failure information in telemetry. Plugin/content crashes are already recorded via SUBPROCESS_ABNORMAL_ABORT and SUBPROCESS_CRASHES_WIT

Unified telemetry is an important new feature. It is blocking some other projects. Taking it.
Attachment #8612976 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Depends on: 1177118
Product: Firefox Health Report → Firefox Health Report Graveyard
You need to log in before you can comment on or make changes to this bug.