Closed Bug 1676943 Opened 5 months ago Closed 2 months ago

To measure the impact of content crashes on users, record telemetry on how often we show a crash reporter

Categories

(Core :: DOM: Content Processes, task, P3)

task

Tracking

()

RESOLVED FIXED
87 Branch
Fission Milestone M7
Tracking Status
firefox-esr78 --- wontfix
firefox84 --- wontfix
firefox85 --- wontfix
firefox86 --- wontfix
firefox87 --- fixed

People

(Reporter: cpeterson, Assigned: enndeakin)

References

Details

Attachments

(2 files)

To measure the actual impact of crashes on users, we'd like to record how often we show a crash reporter, not just how often we crash.

For example, we submit a lot of ShutdownKills for process shutdown hangs, but we don't actually show a crash reporter to users. Also, if background tab might crash but the user never switches to that tab, we don't actually show a crash reporter to that user.

If a user has opted into automatically submitting crash reports ("Update preferences to automatically submit reports when Firefox crashes" or "Allow Firefox to send backlogged crash reports on your behalf"), we automatically reload the crashed tab if the user switches to it, instead of showing a crash reporter.

We should also differentiate between different crash reporters or automatically reload a crashed page, so we can measure Fission's success on moving tab crashes to subframe crashes:

  • Parent process crash reporter (about:crashparent)
  • Tab crash reporter page (about:crashcontent)
  • Unsubmitted crash notification. This is currently Nightly only, 10 minutes after browser startup. It is also shown in every tab in the first browser window, so we should count that as multiple notifications for every time the user sees the notifications, even if they are all about just one crash!
  • When we automatically reload a crashed page (instead of showing a crash reporter). We don't show the user the crash reporter, but they might be annoyed by the page reload or have lost work in progress in that reloaded tab.

Tracking for Fission Beta milestone (M7).

This work will overlap with Fission subframe crash notification bug 1644911.

Severity: -- → N/A
Fission Milestone: ? → M7
Priority: -- → P3
See Also: → 1644911
Summary: To measure the impact of crashes on users, record how often we show a crash reporter → To measure the impact of crashes on users, record telemetry on how often we show a crash reporter

Neil, since you've been looking into the subframe crash reporter also, would you be willing to take this one up too?

Flags: needinfo?(enndeakin)

If a user has opted into automatically submitting crash reports ("Update preferences to automatically submit reports when Firefox crashes" or "Allow Firefox to send backlogged crash reports on your behalf"), we automatically reload the crashed tab if the user switches to it, instead of showing a crash reporter.

For me, with that option checked, a crash occurring in a background tab just shows a blank page or sometimes the newtab page when I switch to it. (a similar thing happens if I don't use --enable-crash-reporter)

Is this a known issue? With the option unchecked, the tab crashed page appears.

  • Parent process crash reporter (about:crashparent)

This would need to be done by the crash reporter instead I would think, unless there was some clear state that was still around for the next time the browser was launched.

  • Tab crash reporter page (about:crashcontent)

There is currently an existing 'FX_CONTENT_CRASH_PRESENTED' telemetry item, however it only fires once per crash (so a crash that affects multiple windows would only trigger once) and only if the crash is within a currently selected tab. I assume the idea is to add another telemetry item that fires for every window/tab that shows a crash?

  • Unsubmitted crash notification. This is currently Nightly only, 10 minutes after browser startup. It is also shown in every tab in the first browser window, so we should count that as multiple notifications for every time the user sees the notifications, even if they are all about just one crash!

Actually, only one notification is shown, as the unsubmitted crash notification isn't tied to a specific tab/browser.

  • When we automatically reload a crashed page (instead of showing a crash reporter). We don't show the user the crash reporter, but they might be annoyed by the page reload or have lost work in progress in that reloaded tab.

Is this the case described earlier when a crash occurs in a background tab?

Flags: needinfo?(enndeakin)
Assignee: nobody → enndeakin
Status: NEW → ASSIGNED

(In reply to Neil Deakin from comment #3)

If a user has opted into automatically submitting crash reports ("Update preferences to automatically submit reports when Firefox crashes" or "Allow Firefox to send backlogged crash reports on your behalf"), we automatically reload the crashed tab if the user switches to it, instead of showing a crash reporter.

For me, with that option checked, a crash occurring in a background tab just shows a blank page or sometimes the newtab page when I switch to it. (a similar thing happens if I don't use --enable-crash-reporter)

Is this a known issue? With the option unchecked, the tab crashed page appears.

I don't know. mconley, do you know the expected behavior for crashed background tabs?

Flags: needinfo?(mconley)
  • Parent process crash reporter (about:crashparent)

This would need to be done by the crash reporter instead I would think, unless there was some clear state that was still around for the next time the browser was launched.

In that case, I recommend we reduce the scope of this bug to just care about content crashes. Fission's process architecture will change how some content crashes are handled, but shouldn't affect parent process crashes. Since this motivation for this bug was to compare the perceived crashiness of Fission vs e10s, we can ignore parent process crashes.

  • Tab crash reporter page (about:crashcontent)

There is currently an existing 'FX_CONTENT_CRASH_PRESENTED' telemetry item, however it only fires once per crash (so a crash that affects multiple windows would only trigger once) and only if the crash is within a currently selected tab. I assume the idea is to add another telemetry item that fires for every window/tab that shows a crash?

The FX_CONTENT_CRASH_PRESENTED probe was added five years ago (bug 1269961). I doubt anyone is still monitoring it. I don't see a problem with us adapting this existing probe to count each crash notification shown for one crash, if you think that is easier than adding a new probe.

  • Unsubmitted crash notification. This is currently Nightly only, 10 minutes after browser startup. It is also shown in every tab in the first browser window, so we should count that as multiple notifications for every time the user sees the notifications, even if they are all about just one crash!

Actually, only one notification is shown, as the unsubmitted crash notification isn't tied to a specific tab/browser.

IIRC, when I tested the unsubmitted crash scan, a notification was shown in every tab in one of my (multiple) browser windows. Does that count as "only one notification" (because it's associated with the window, not content or a particular tab)?

Do you think the user will perceive this notification as one or many? I don't have a strong opinion. If this behavior is the same for Fission and non-Fission, then we can probably it count either way, whichever is easier.

  • When we automatically reload a crashed page (instead of showing a crash reporter). We don't show the user the crash reporter, but they might be annoyed by the page reload or have lost work in progress in that reloaded tab.

Is this the case described earlier when a crash occurs in a background tab?

Yes. I think this is the same as the background tab scenario. I don't recall if I was thinking of some other special case. So just do what you think make sense.

Flags: needinfo?(enndeakin)
Summary: To measure the impact of crashes on users, record telemetry on how often we show a crash reporter → To measure the impact of content crashes on users, record telemetry on how often we show a crash reporter

Chris, can you confirm the following is ok:

  1. This patch adds telemetry for three things:
  • when a background tab is reloaded: 'dom.contentprocess.crash_tab_reloaded'
  • when the tab crash ui is shown to the user: 'dom.contentprocess.crash_tab_ui_presented'
  • when the ubsubmitted tab crash notification is shown 'dom.contentprocess.unsubmitted_ui_presented'. This happens after 10 minutes after startup.
  1. The telemetry email is cpeterson@mozilla.com

  2. Do we need a separate bug for subframe crash ui being presented, now that bug 1644911 is complete. Do we need other telemetry for subframe crashing? I can add it easily here if we just want a notification when it happens, but it will require a few more steps if we want to add telemetry only when the tab that crashed and shows the ui is switched to.

Flags: needinfo?(enndeakin)
Flags: needinfo?(cpeterson)

Looks good!

(In reply to Neil Deakin from comment #7)

  1. This patch adds telemetry for three things:
  • when a background tab is reloaded: 'dom.contentprocess.crash_tab_reloaded'

IIUC, this is when a background tab has crashed but is auto-reloaded because another tab in the same content process was reloaded by the foreground tab crash UI? If this is different from when a foreground tab is reloaded when a user clicks the "Restore This Tab" button on the foreground crash UI, perhaps the probe name should change to something like dom.contentprocess.crash_background_tab_reloaded?

Does dom.contentprocess.crash_tab_reloaded only count when the user switches to crashed background tab? I assume that's when the actual page reload takes place.

  • when the tab crash ui is shown to the user: 'dom.contentprocess.crash_tab_ui_presented'
  • when the ubsubmitted tab crash notification is shown 'dom.contentprocess.unsubmitted_ui_presented'. This happens after 10 minutes after startup.

Do you know if the unsubmitted crash notification only checks for content processes crashes? Or will it include other background process types (like GPU, Network, or RDD) or parent process crash reports?

I suspect it includes other background processes (since they have no tab in which to show the tab crash reporter UI), but I'm not sure about parent process crashes. Fission should have no impact on crashes from other background process types. I don't think we need to change your probe implementation (unless there is a straightforward way to filter out non-content process crashes), but it would be good to know what we're measuring when we monitor our experiment telemetry.

  1. The telemetry email is cpeterson@mozilla.com

Let's change the email contact to nlayzell@mozilla.com. :) Nika is the Fission tech lead.

  1. Do we need a separate bug for subframe crash ui being presented, now that bug 1644911 is complete.

Without a separate bug, will the new subframe crash UI be counted by dom.contentprocess.crash_tab_ui_presented or not counted at all? I think it would be good to count subframe crash UI in a separate probe from the tab crash UI so we can monitor our experiment to see how many tab crashes become mere subframe crashes with Fission.

Do we need other telemetry for subframe crashing? I can add it easily here if we just want a notification when it happens, but it will require a few more steps if we want to add telemetry only when the tab that crashed and shows the ui is switched to.

Ideally, these probes would only count crash UI that the user actually sees so we can measure "perceived crashiness". If dom.contentprocess.crash_tab_ui_presented and dom.contentprocess.crash_tab_reloaded only measure tabs seen by the user, but the subframe crash UI probe measures all subframe crash UIs, then we won't have a direct comparison of how many tab crashes become mere subframe crashes with Fission.

But if it is a lot of work to only count subframe crash UI in tabs the user sees, then we can probably skip that criteria. In that case, we could compare dom.contentprocess.crash_tab_ui_presented versus overall crash rates for our experiment's Fission branch and non-Fission control branch.

Flags: needinfo?(cpeterson) → needinfo?(enndeakin)

(In reply to Chris Peterson [:cpeterson] from comment #8)

Looks good!

(In reply to Neil Deakin from comment #7)

  1. This patch adds telemetry for three things:
  • when a background tab is reloaded: 'dom.contentprocess.crash_tab_reloaded'

IIUC, this is when a background tab has crashed but is auto-reloaded because another tab in the same content process was reloaded by the foreground tab crash UI? If this is different from when a foreground tab is reloaded when a user clicks the "Restore This Tab" button on the foreground crash UI, perhaps the probe name should change to something like dom.contentprocess.crash_background_tab_reloaded?

Does dom.contentprocess.crash_tab_reloaded only count when the user switches to crashed background tab? I assume that's when the actual page reload takes place.

Honestly, I don't know. The background reloading has always been broken for me as far as I can tell, so I don't know what is supposed to happen. For me when a tab crashes in the background, the background tab when switched to is just blank.

This patch adds the telemetry when the tab crash occurs. If I knew how this was supposed to work, I can move the telemetry to when the tab is switched to.

  • when the ubsubmitted tab crash notification is shown 'dom.contentprocess.unsubmitted_ui_presented'. This happens after 10 minutes after startup.

Do you know if the unsubmitted crash notification only checks for content processes crashes? Or will it include other background process types (like GPU, Network, or RDD) or parent process crash reports?

A single crashed network process appears in the list of unsubmitted crashes and causes the notification to appear. I assume it includes all types as well. I think the check just looks in the crash reports directory for any kind of crashes.

  1. Do we need a separate bug for subframe crash ui being presented, now that bug 1644911 is complete.

Without a separate bug, will the new subframe crash UI be counted by dom.contentprocess.crash_tab_ui_presented or not counted at all? I think it would be good to count subframe crash UI in a separate probe from the tab crash UI so we can monitor our experiment to see how many tab crashes become mere subframe crashes with Fission.

No telemetry for subcrash ui has been added. I'm thinking of just adding a single telemetry probe for when the UI is shown to the user.

Flags: needinfo?(enndeakin)

Does dom.contentprocess.crash_tab_reloaded only count when the user switches to crashed background tab? I assume that's when the actual page reload takes place.

Honestly, I don't know. The background reloading has always been broken for me as far as I can tell, so I don't know what is supposed to happen. For me when a tab crashes in the background, the background tab when switched to is just blank.

This patch adds the telemetry when the tab crash occurs. If I knew how this was supposed to work, I can move the telemetry to when the tab is switched to.

Our goal for these new probes was to count how often the user actually witnesses the crash reporter, so we would not want to count background tab crashes or background subframe crashes unless the user actually re-focuses the crashed tab and sees the background tab's (tab or subframe) crash reporter UI.

I tried testing Firefox's existing behavior for background tab crashes (by killing content processes with Windows Task Manager or Firefox's about:processes), but the results were inconsistent. When I re-focused the crashed background tab, sometimes the page reloaded in front of my eyes, sometimes it was already running (and presumably reloaded in the background before I focused the tab), sometimes it was a blank page (not about:blank, but a white page with no URL), and sometimes it showed the about:tabcrashed page. So I don't know which events or code paths we should be measuring for background tabs...

Do you know if the unsubmitted crash notification only checks for content processes crashes? Or will it include other background process types (like GPU, Network, or RDD) or parent process crash reports?

A single crashed network process appears in the list of unsubmitted crashes and causes the notification to appear. I assume it includes all types as well. I think the check just looks in the crash reports directory for any kind of crashes.

OK. The unsubmitted crash notifications aren't enabled in Beta or Release, so we including the other process types in Nightly is not a big deal.

  1. Do we need a separate bug for subframe crash ui being presented, now that bug 1644911 is complete.

Without a separate bug, will the new subframe crash UI be counted by dom.contentprocess.crash_tab_ui_presented or not counted at all? I think it would be good to count subframe crash UI in a separate probe from the tab crash UI so we can monitor our experiment to see how many tab crashes become mere subframe crashes with Fission.

No telemetry for subcrash ui has been added. I'm thinking of just adding a single telemetry probe for when the UI is shown to the user.

Would "when the UI is shown to the user" include subframe crashes on background tabs that the user has not re-focused? I assume the subframe crash UI will be opened immediately on a background tab if a subframe crashes even if the user hasn't switched back to that tab.

Flags: needinfo?(enndeakin)

Would "when the UI is shown to the user" include subframe crashes on background tabs that the user has not re-focused? I assume the subframe crash UI will be opened immediately on a background tab if a subframe crashes even if the user hasn't switched back to that tab.

This is what I plan on doing:

  • if a subframe crashes in a foreground tab, the telemetry triggers right away. Do we want to be more specific and only trigger when the window is in front if there are multiple windows open?
  • if a subframe crashes in a background (non-selected) tab, the telemetry triggers when the tab is switched to.
  • if multiple tabs use a crashed content process, then the telemetry triggers for all foreground tabs right away but only for non-selected tabs when they are switched to. That means that the same crash could trigger multiple telemetry notifications if the user views multiple tabs that used the same crashed process.
  • closing the notification or submitting the crash report in one tab closes it in all other tabs/windows and doesn't trigger the telemetry anymore.
Flags: needinfo?(enndeakin)

(In reply to Neil Deakin from comment #11)

Would "when the UI is shown to the user" include subframe crashes on background tabs that the user has not re-focused? I assume the subframe crash UI will be opened immediately on a background tab if a subframe crashes even if the user hasn't switched back to that tab.

This is what I plan on doing:

  • if a subframe crashes in a foreground tab, the telemetry triggers right away. Do we want to be more specific and only trigger when the window is in front if there are multiple windows open?

Triggering for any foreground tab sounds fine. No need to make multiple windows a special case.

  • if a subframe crashes in a background (non-selected) tab, the telemetry triggers when the tab is switched to.
  • if multiple tabs use a crashed content process, then the telemetry triggers for all foreground tabs right away but only for non-selected tabs when they are switched to. That means that the same crash could trigger multiple telemetry notifications if the user views multiple tabs that used the same crashed process.
  • closing the notification or submitting the crash report in one tab closes it in all other tabs/windows and doesn't trigger the telemetry anymore.

These steps all sound good. Thanks!

Mike do you know how crashes in background tabs are supposed to be handled? For me, they display no crash reporter and the page is blank. Chris says the behaviour is inconsistent. See comments 9 and 10 for more details.

If the crash reporter isn't supposed to come up with background tabs, maybe we need no extra telemetry for this.

The behaviour for crashing background tabs is supposed to be as such:

No matter what, when background tabs crash, they supposed to immediately go into the "unrestored" state (the state where something loads on-demand when you select them).

So when a user selects a background tab that has crashed and been put into this state, this is the decision tree:

  1. If the user has configured Nightly to send backlogged crash reports on their behalf, the tab restores to its last known location
  2. Otherwise:
    a. If the user has not yet seen an about:tabcrashed tab for any tabs in the selected tabs content process, then about:tabcrashed is shown.
    b. If the user has seen about:tabcrashed already for one of the other tabs in the content process, the tab restores to its last known location

This test is supposed to assert the behaviour: https://searchfox.org/mozilla-central/rev/4f07d49f1c7a823da07e3a231ac87c6435c8fd58/browser/components/sessionstore/test/browser_background_tab_crash.js#3-10

Flags: needinfo?(mconley)

(In reply to Mike Conley (:mconley) (:⚙️) (Catching up on needinfos) from comment #14)

  1. If the user has configured Nightly to send backlogged crash reports on their behalf, the tab restores to its last known location
  2. Otherwise:
    a. If the user has not yet seen an about:tabcrashed tab for any tabs in the selected tabs content process, then about:tabcrashed is shown.
    b. If the user has seen about:tabcrashed already for one of the other tabs in the content process, the tab restores to its last known location

Ideally, this telemetry probe would only count instances of 2a (we show the user the about:tabcrashed UI). We could skip counting cases 1 or 2b because the user doesn't see a "you have crashed" message.

... Unless we think auto-reloading the page in front of the users' eyes counts as the user "experiencing the impact of a crash"? If counting cases 1+2a+2b is less of a hassle than only counting 2a, then counting 1+2a+2b is OK. You don't need to polish a perfect solution for our crash impact experiment. :)

OS: Unspecified → All
Hardware: Unspecified → All
Attachment #9194442 - Attachment description: Bug 1676943, add telemetry for how often tab crash ui is presented to the user, and how often a tab is reloaded after a crash → Bug 1676943, add telemetry for how often tab or subframe crash ui is presented to the user. In addition, add telemetry to indicate that the user is shown the unsubmitted crashes notification bar, r=mconley

Ok, thanks Mike. So, the telemetry added by this bug triggers whenever the tab crashed page is loaded, and is sounds like from Mike's description that the tab crash page is only ever loaded for selected tabs.

The patch attached handles it this way.

Request for data collection review of "crash_tab_ui_presented" and "crash_subframe_ui_presented" telemetry probes

@ chutten: I defaulted to you for this data-review because you're already familiar with Fission's new subframe crash reporter: you reviewed that data collection request in bug 1687039.

Attachment #9202185 - Flags: data-review?(chutten)

Comment on attachment 9202185 [details]
bug-1676943-data-review-request-for-crash_tab_ui_presented.txt

Though the intent in the request is to stop collecting data before the end of the year, the data definitions are set to never expire. Which should it be?

Attachment #9202185 - Flags: data-review?(chutten)

(In reply to Chris H-C :chutten from comment #18)

Though the intent in the request is to stop collecting data before the end of the year, the data definitions are set to never expire. Which should it be?

@ Neil, can you please update your patch to set an explicit expiration for the end of 2021? Fx98 is the last Nightly version of 2021.

Flags: needinfo?(enndeakin)
Flags: needinfo?(enndeakin)

The changes now expire with version 98.

Comment on attachment 9202185 [details]
bug-1676943-data-review-request-for-crash_tab_ui_presented.txt

(In reply to Chris H-C :chutten from comment #18)

Comment on attachment 9202185 [details]
bug-1676943-data-review-request-for-crash_tab_ui_presented.txt

Though the intent in the request is to stop collecting data before the end of the year, the data definitions are set to never expire. Which should it be?

(In reply to Neil Deakin from comment #20)

The changes now expire with version 98.

chutten, Neil has updated his patch to expire the probe in Firefox 98 (December 2021). Do you have any other questions for data review?

Attachment #9202185 - Flags: data-review?(chutten)

Comment on attachment 9202185 [details]
bug-1676943-data-review-request-for-crash_tab_ui_presented.txt

DATA COLLECTION REVIEW RESPONSE:

Is there or will there be documentation that describes the schema for the ultimate data set available publicly, complete and accurate?

Yes.

Is there a control mechanism that allows the user to turn the data collection on and off?

Yes. This collection is Telemetry so can be controlled through Firefox's Preferences.

If the request is for permanent data collection, is there someone who will monitor the data over time?

No. This collection will expire in Firefox 98.

Using the category system of data types on the Mozilla wiki, what collection type of data do the requested measurements fall under?

Category 2, Interaction.

Is the data collection request for default-on or default-off?

Default on for all channels.

Does the instrumentation include the addition of any new identifiers?

No.

Is the data collection covered by the existing Firefox privacy notice?

Yes.

Does there need to be a check-in in the future to determine whether to renew the data?

Yes. Neil Deakin is responsible for renewing or removing the collection before it expires in Firefox 98.


Result: datareview+

Attachment #9202185 - Flags: data-review?(chutten) → data-review+
Pushed by neil@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/465ee033cea5
add telemetry for how often tab or subframe crash ui is presented to the user. In addition, add telemetry to indicate that the user is shown the unsubmitted crashes notification bar, r=mconley
Status: ASSIGNED → RESOLVED
Closed: 2 months ago
Resolution: --- → FIXED
Target Milestone: --- → 87 Branch
You need to log in before you can comment on or make changes to this bug.