Closed Bug 723802 Opened 8 years ago Closed 8 years ago

Add telemetry for startup crash detection

Categories

(Toolkit :: Telemetry, enhancement)

enhancement
Not set

Tracking

()

RESOLVED FIXED
mozilla13

People

(Reporter: MattN, Assigned: MattN)

References

Details

Attachments

(2 files, 1 obsolete file)

Startup crash detection was added in bug 294260.  I think the following pieces of telemetry data would be useful:

1) The uptime when start crash tracking began and ended.
** This is so we can measure the proportion of startup crashes that the feature
   detects by comparing with uptime data in crash-stats.
2) Whether the user was forced into safe mode by the startup crash detection.
** This would let us know how useful this feature is: to know if/when people are
   experiencing N consecutive startup crashes (where N is currently 3).
3) Whether the last startup was a crash.
** This would only be recorded if we didn't crash on the following startup since
   it seems telemetry data is only saved at shutdown.  Could I have telemetry
   save to disk in this case?
** It would be nice to know the number of consecutive startup crashes but then
   we'd be counting users more than once with every additional crash.

There are problems collecting some of this data if there is a crash before telemetry is sent or persisted on disk. Any suggestions?

Note that the startup crash detection may detect some crashes which breakpad doesn't but it can only increment the counter if startup reaches the point where it begins tracking.  #2 & #3 will be useful to compare to Socorro data.

Is it necessary to call accumulate with a false value for boolean telemetry when we just want to know how often something happens ie. #2?  From what I can tell it should be enough to just report the interesting (true) case and then compare that to the number of telemetry reports for the time period.  This is what Telemetry::A11Y_INSTANTIATED currently does but it leads to useless bar graphs showing 100% of reporters with it on without showing the total number of reports that didn't specify a value.  It seems like boolean telemetry probes should have a default value.  The advantage of only reporting in the true case is that it has no perf. impact.
Filed bug 723846 for making boolean histograms more useful.

We have bug 719167 for saving telemetry more often during a session.

It's certainly possible to save telemetry at other, specific times--I guess that's what you're asking for in #3?
This is to be able to compare to Socorro data (like bug 722566) to see how effective the crash detection is and to be able to compare crash trends between the two data sources.
Attachment #595566 - Flags: review?(taras.mozilla)
is the intention here to correlate data points across two data sets (telemetry and socorro) or just monitor trends in a similar fashion?
(In reply to Sid Stamm [:geekboy] from comment #3)
> is the intention here to correlate data points across two data sets
> (telemetry and socorro) or just monitor trends in a similar fashion?

There will be no direct correlation so we would compare the aggregate results to get an approximation for the proportion of startup crashes detected.  Then we would compare trends for other uses such as detecting a spike in crashes that Breakpad/Soccorro may not catch.

You can see in the patch we're just reporting two new timestamps.
Comment on attachment 595566 [details] [diff] [review]
v.1 Implement #1 - timeline portion

This patch needs a comment explaining what startupCrashDetectionBegin/startupCrashDetectionEnd are for. Is the goal to determine how much time elapses between begin/end, whether it sometimes begins but doesn't end?
Attachment #595566 - Flags: review?(taras.mozilla)
(In reply to Taras Glek (:taras) from comment #5)

Do you prefer the comment in the header or with the Record() calls?

The goal is in comment 0 and comment 2 but I'm not sure if you want me to elaborate here or you just want those questions answered in the code comment.

If startupCrashDetectionEnd is not called then that means we crashed in which case telemetry wouldn't be sent.  The elapsed time is not as important as the specific times.

Is there a separate privacy review that is needed for new telemetry or is this bug sufficient?
(In reply to Matthew N. [:MattN] from comment #6)
> (In reply to Taras Glek (:taras) from comment #5)
> 
> Do you prefer the comment in the header or with the Record() calls?
> 
> The goal is in comment 0 and comment 2 but I'm not sure if you want me to
> elaborate here or you just want those questions answered in the code comment.

Code comment in the header.

I'm not clear how having these timestamps will help you make decisions.

> 
> If startupCrashDetectionEnd is not called then that means we crashed in
> which case telemetry wouldn't be sent.  The elapsed time is not as important
> as the specific times.


> 
> Is there a separate privacy review that is needed for new telemetry or is
> this bug sufficient?

This is good.
(In reply to Taras Glek (:taras) from comment #7)
> I'm not clear how having these timestamps will help you make decisions.

Here are some more details for usage of the timestamps:
Assumption: The StartupTimeline timestamps and Breakpad uptime both record t=0 at about the same time.
Let's define a startup crash for now as a crash within the first minute of uptime.

Scenario 1 (example):
* Socorro data shows that 80% of startup crashes happen with uptime <= 4s.
* Telemetry data shows startupCrashDetectionBegin happens 60% of the time with uptime > 4s. 

Potential action: Improve startup crash detection to begin tracking earlier since a large proportion of startup crashes are not being detected (so UI won't be shown to users to offer help).
Attachment #595566 - Attachment is obsolete: true
Attachment #595601 - Flags: review?(taras.mozilla)
Attachment #595601 - Flags: review?(taras.mozilla) → review+
this will help us pinpoint performance problems, the opt-in covers this; removing the privacy flag
Comment on attachment 604623 [details] [diff] [review]
v.1 Implement #2+3. STARTUP_CRASH_DETECTED + SAFE_MODE_USAGE

Review of attachment 604623 [details] [diff] [review]:
-----------------------------------------------------------------

::: toolkit/xre/nsXREDirProvider.cpp
@@ +785,1 @@
>  

I'd consider breaking this out into a conditional, just to make it slightly less magical...

  int mode = 1;
  if (gSafeMode)
    if (safeModeNecessary)
      mode = 3
    else
      mode = 2
  m::T::A::(blah, mode)

But no biggie either way.
Attachment #604623 - Flags: review?(dolske) → review+
Comment on attachment 604623 [details] [diff] [review]
v.1 Implement #2+3. STARTUP_CRASH_DETECTED + SAFE_MODE_USAGE

kudos for using the new flag histograms. bug 732053 will make the linear histogram easier to specify in the future.
Attachment #604623 - Flags: review?(taras.mozilla) → review+
https://hg.mozilla.org/integration/mozilla-inbound/rev/cdad5c4bfad7
https://hg.mozilla.org/integration/mozilla-inbound/rev/d228248c1b5a

(In reply to Taras Glek (:taras) from comment #13)
> kudos for using the new flag histograms. bug 732053 will make the linear
> histogram easier to specify in the future.

I was the one who brought up the boolean/flag issue so it only makes sense :)  
I'm glad we're going to fix linear histograms since the current behaviour was non-obvious especially because the 0 bucket also shows when inspecting the histogramSnapshots property.
Flags: in-testsuite-
https://hg.mozilla.org/mozilla-central/rev/cdad5c4bfad7
https://hg.mozilla.org/mozilla-central/rev/d228248c1b5a
Status: ASSIGNED → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla13
You need to log in before you can comment on or make changes to this bug.