Closed Bug 1853108 Opened 2 years ago Closed 1 month ago

Crash Reports On-Demand

Categories

(Toolkit :: Crash Reporting, enhancement)

x86_64
All
enhancement

Tracking

()

RESOLVED FIXED
Tracking Status
firefox119 --- affected

People

(Reporter: gcp, Assigned: gerard-majax)

References

(Blocks 2 open bugs)

Details

Attachments

(13 files, 3 obsolete files)

48 bytes, text/x-phabricator-request
Details | Review
48 bytes, text/x-phabricator-request
Details | Review
48 bytes, text/x-phabricator-request
Details | Review
209.79 KB, image/png
Details
48 bytes, text/x-phabricator-request
Details | Review
310.31 KB, image/png
Details
48 bytes, text/x-phabricator-request
Details | Review
48 bytes, text/x-phabricator-request
Details | Review
48 bytes, text/x-phabricator-request
Details | Review
48 bytes, text/x-phabricator-request
Details | Review
48 bytes, text/x-phabricator-request
Details | Review
48 bytes, text/x-phabricator-request
Details | Review
48 bytes, text/x-phabricator-request
Details | Review

Signature list based UX popup for non-visible processes

From bholley:

The final step here is to build a channel back to the client enabling us to specify (via RemoteSettings) specific signatures for which we do or don’t want full crash reports. This would then enable two key things:
We could substantially reduce the frequency with which we currently prompt users to submit crash reports.
We could start narrowly prompting users in situations where we currently don’t due to UX concerns, i.e. utility process crashes.

The key observation here is that a stack (or even just a MOZ_CRASH_REASON) is often sufficient to diagnose a crash, and even when the extra information in a full report is useful, you usually don’t need very many reports. So if we can decouple our capabilities for aggregate statistics and discovery from the ones we use for advanced diagnostics, we can largely eliminate the UX tension around the latter.

Assignee: nobody → lissyx+mozillians
See Also: → 336872
See Also: → 1830250
See Also: → 1758586
Attachment #9404148 - Attachment description: WIP: Bug 1853108 - WIP Use RemoteSettings for specific crash data → WIP: Bug 1853108 - Extract interesting crash based on Remote Settings request and ask user for submission
Depends on: 1923700
Attached image image.png (obsolete) —
Attachment #9404148 - Attachment description: WIP: Bug 1853108 - Extract interesting crash based on Remote Settings request and ask user for submission → Bug 1853108 - Extract interesting crash based on Remote Settings request and ask user for submission r?afranchuk!,gsvelto!
Attachment #9429360 - Attachment description: WIP: Bug 1853108 - Add testing of interesting crash identification → Bug 1853108 - Add testing of interesting crash identification r?afranchuk!,gsvelto!
Attachment #9429361 - Attachment description: WIP: Bug 1853108 - Add testing of interesting crash notification → Bug 1853108 - Add testing of interesting crash notification r?afranchuk!,gsvelto!
Attached image image.png
Attachment #9429958 - Attachment is obsolete: true
Attached image android.png
Depends on: 1926942
Blocks: 1932389
Attachment #9439401 - Attachment description: WIP: Bug 1853108 - Add a secret settings for never show again remote settinggs crash pull → Bug 1853108 - Add a secret settings for never show again remote settinggs crash pull r?kaya!
Attachment #9431226 - Attachment description: WIP: Bug 1853108 - Extract interesting crash on Android → Bug 1853108 - Extract interesting crash on Android r?kaya!
Attachment #9439291 - Attachment description: WIP: Bug 1853108 - Add testing of interesting crash on Android → Bug 1853108 - Add testing of interesting crash on Android r?kaya!,boek!
Attachment #9431226 - Attachment description: Bug 1853108 - Extract interesting crash on Android r?kaya! → Bug 1853108 - Extract interesting crash on Android r?kaya!,boek!,tthibaud!
Attachment #9439401 - Attachment description: Bug 1853108 - Add a secret settings for never show again remote settinggs crash pull r?kaya! → Bug 1853108 - Add a secret settings for never show again remote settinggs crash pull r?kaya!,boek!,tthibaud!
Attachment #9439291 - Attachment description: Bug 1853108 - Add testing of interesting crash on Android r?kaya!,boek! → Bug 1853108 - Add testing of interesting crash on Android r?kaya!,boek!,tthibaud!
Attachment #9462182 - Attachment is obsolete: true

Marking leave-open since we'll land desktop and android separately

Keywords: leave-open
Pushed by alissy@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/050538bae1ab Extract interesting crash based on Remote Settings request and ask user for submission r=afranchuk,gsvelto,fluent-reviewers,bolsson https://hg.mozilla.org/integration/autoland/rev/d584474e6d53 Add testing of interesting crash identification r=afranchuk,gsvelto https://hg.mozilla.org/integration/autoland/rev/75ec2f4cd495 Add testing of interesting crash notification r=afranchuk,gsvelto https://hg.mozilla.org/integration/autoland/rev/46680c248c45 Add testing of interesting crash nightly only enabling r=afranchuk,gsvelto

Backed out for causing xpcshell and bc failures

Backout link

Push with failures

Failure log xpcshell
Failure log bc

Flags: needinfo?(lissyx+mozillians)

Oh maybe we don't cleanup correctly and this breaks others

Flags: needinfo?(lissyx+mozillians)

Ok so we had a bug around pendingIDs in CrashSubmit, potentially missing correct async-ness and the android failure was just that those tests were skipped on android, and when I added mine I forgot to properly skip the failing test

Blocks: 1950866
Pushed by alissy@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/1f132039993b Extract interesting crash based on Remote Settings request and ask user for submission r=afranchuk,gsvelto,fluent-reviewers,bolsson https://hg.mozilla.org/integration/autoland/rev/6259bfb753d3 Add testing of interesting crash identification r=afranchuk,gsvelto https://hg.mozilla.org/integration/autoland/rev/04e9eed62cbd Add testing of interesting crash notification r=afranchuk,gsvelto https://hg.mozilla.org/integration/autoland/rev/d5d231374513 Add testing of interesting crash nightly only enabling r=afranchuk,gsvelto https://hg.mozilla.org/integration/autoland/rev/43aafd88781a Correct asyncness of UnsubmittedCrashHandler r=afranchuk https://hg.mozilla.org/integration/autoland/rev/5b7262c5abe1 Fix race in CrashSubmit.pendingIDs on file iterations r=afranchuk

Backed out for causing lint failure.

Flags: needinfo?(lissyx+mozillians)
Flags: needinfo?(lissyx+mozillians)
Pushed by alissy@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/3e2dd273e950 Extract interesting crash based on Remote Settings request and ask user for submission r=afranchuk,gsvelto,fluent-reviewers,bolsson https://hg.mozilla.org/integration/autoland/rev/8503113625c6 Add testing of interesting crash identification r=afranchuk,gsvelto https://hg.mozilla.org/integration/autoland/rev/70b252be4e4a Add testing of interesting crash notification r=afranchuk,gsvelto https://hg.mozilla.org/integration/autoland/rev/d8248c5e6705 Add testing of interesting crash nightly only enabling r=afranchuk,gsvelto https://hg.mozilla.org/integration/autoland/rev/fc69e3b55559 Correct asyncness of UnsubmittedCrashHandler r=afranchuk https://hg.mozilla.org/integration/autoland/rev/e277ae28b877 Fix race in CrashSubmit.pendingIDs on file iterations r=afranchuk
Regressions: 1951087
Depends on: 1954896
No longer depends on: 1954896

When rewriting https://github.com/mozilla/remote-settings-ondemand-crashes to use data from bigquery, I saw that we're only partitioning top-crashers by process type and release channel. Should we also partition on os?

(In reply to Alex Franchuk [:afranchuk] from comment #25)

When rewriting https://github.com/mozilla/remote-settings-ondemand-crashes to use data from bigquery, I saw that we're only partitioning top-crashers by process type and release channel. Should we also partition on os?

When I wrote this comment I forgot that I already wrote my query in such a way that, while we do initially partition on just process type and release channel for selecting top crashers, before applying per-configuration limits we also group by platform (where platform is os/osversion/architecture). So, our top crashers are per process type and release channel, but then the reports we request are partitioned in such a way that we guarantee diverse platforms (if there are any). That being said, I wonder:

  1. whether that might be too fine-grained: we currently limit to selecting 100 crash hashes per configuration, which multiplied by the number of unique configurations (os, os version, architecture, process type, release channel) may be a lot of selected hashes for crashes which are not platform-specific (I just checked, in the past week there are 1085 distinct configurations), and
  2. whether we want to get top crashers per-os as well, so that some lower-volume OSes don't have their top crashers pushed down (I'm thinking probably we do).

(2) is a straightforward change. I think we still want to incorporate some platform-diversification when selecting hashes, but maybe to address (1) we should have another limit on the overall hashes selected per top crasher (and we can just shuffle the selection to improve diversity). It's a subtle balance to strike: while we want to select enough hashes that the presumably small percentage of people who follow-up and send crash reports will represent enough diversity to serve our needs, we also don't want to be so liberal in our selection that we end up selecting a huge number of hashes. I suppose one implicit feature we have here is that every day a separate set of hashes is selected for the top crashers, so in that way we could select fewer hashes and rely on daily reselection to potentially poke more people into submitting.

Flags: needinfo?(gpascutto)

we currently limit to selecting 100 crash hashes per configuration, which multiplied by the number of unique configurations (os, os version, architecture, process type, release channel) may be a lot of selected hashes for crashes which are not platform-specific

In the end, we only need 1 report, but multiple reports from different configurations can help shine light on what's wrong (e.g. more visibility into program state because of different optimizations). Maybe 100 per configurations is more than we need, I suspect we'll get more than 1% engagement.

whether we want to get top crashers per-os as well, so that some lower-volume OSes don't have their top crashers pushed down (I'm thinking probably we do).

Yes, although if we lose them it wouldn't be the end of the world either - in the end the user volume affected would not be as large.

Flags: needinfo?(gpascutto)

https://github.com/mozilla/remote-settings-ondemand-crashes/pull/7 incorporates the aforementioned changes (and will also close bug 1937869).

Alex, can you share a feedback from a native speaker point of view on the current wording? With https://phabricator.services.mozilla.com/D251017 we could improve it if required.

Flags: needinfo?(afranchuk)

Did we test the current server<>client setup that would be used on Nightly? I'm slightly wary of accidentally spamming all Nightly users if we have a bug on either side. If not, perhaps we should do that first before enabling for all.

If you enable crashPull on a client, we should see the expected amount of crashes being updated, and see, or not see, a banner, etc.

Flags: needinfo?(lissyx+mozillians)

I added a comment fixing plurality to https://phabricator.services.mozilla.com/D225780, and owlish had a similar comment for desktop which wasn't incorporated before merged.

Besides those grammatical fixes, the only other thing I'd suggest is changing the wording from

... unsent crash report that matches crashes being investigated ...

to

... unsent crash report related to crashes being investigated ...

I think that reads more naturally, personally.

Flags: needinfo?(afranchuk)

(In reply to Alex Franchuk [:afranchuk] from comment #32)

I added a comment fixing plurality to https://phabricator.services.mozilla.com/D225780, and owlish had a similar comment for desktop which wasn't incorporated before merged.

There was so much bogus "NOT DONE" at some point, it looks like i missed this commetn

Besides those grammatical fixes, the only other thing I'd suggest is changing the wording from

... unsent crash report that matches crashes being investigated ...

to

... unsent crash report related to crashes being investigated ...

I think that reads more naturally, personally.

Flags: needinfo?(lissyx+mozillians)

(In reply to Alex Franchuk [:afranchuk] from comment #26)

(2) is a straightforward change. I think we still want to incorporate some platform-diversification when selecting hashes, but maybe to address (1) we should have another limit on the overall hashes selected per top crasher (and we can just shuffle the selection to improve diversity). It's a subtle balance to strike: while we want to select enough hashes that the presumably small percentage of people who follow-up and send crash reports will represent enough diversity to serve our needs, we also don't want to be so liberal in our selection that we end up selecting a huge number of hashes. I suppose one implicit feature we have here is that every day a separate set of hashes is selected for the top crashers, so in that way we could select fewer hashes and rely on daily reselection to potentially poke more people into submitting.

Chiming in late on this. This sounds like a good approach to me but I don't think we should worry about casting a net that's too wide. The type of crashes we'll be trying to get are rarely very-high volume to start with, so I don't think we'll ever find ourselves fetching too many crashes in one go, even if we try for any possible platform combination.

(In reply to Gabriele Svelto [:gsvelto] from comment #34)

Chiming in late on this. This sounds like a good approach to me but I don't think we should worry about casting a net that's too wide. The type of crashes we'll be trying to get are rarely very-high volume to start with, so I don't think we'll ever find ourselves fetching too many crashes in one go, even if we try for any possible platform combination.

I agree, but just in case I'd rather start out more conservative and grow the selection limits once we get a good idea of the response rate. I don't want to have tech articles talking about how users are annoyed by constant prompts to send crash reports (which would look bad in a number of ways). Not to mention it might make a user more inclined to click "never send these" instead of "always send these". Of course, such an incident could be rapidly fixed through Remote Settings anyway, so maybe it's not a big concern. We should probably keep in mind that the Nightly/Beta population may have a different response rate than Release, too.

Attachment #9439369 - Attachment description: WIP: Bug 1853108 - Flip setting for useNewCrashReporter → Bug 1853108 - Expose new crash reporter as a secret settings r?sfamisa!
Attachment #9439369 - Attachment is obsolete: true
Attachment #9439291 - Attachment description: Bug 1853108 - Add testing of interesting crash on Android r?kaya!,boek!,tthibaud! → Bug 1853108 - Add testing of interesting crash on Android r?kaya!,boek!,tthibaud!,sfamisa!
Depends on: 1968573
Attachment #9494106 - Attachment description: Bug 1853108 - Add a one week guard for Crash Pull r?afranchuk! → Bug 1853108 - Fix wording and add a one week guard for Crash Pull r?afranchuk!
Pushed by alissy@mozilla.com: https://github.com/mozilla-firefox/firefox/commit/b0b82af21eb8 https://hg.mozilla.org/integration/autoland/rev/d0e9acc90fa0 Extract interesting crash on Android r=kaya,geckoview-reviewers,android-reviewers,owlish,boek,android-l10n-reviewers,delphine,ohall https://github.com/mozilla-firefox/firefox/commit/a32803735de5 https://hg.mozilla.org/integration/autoland/rev/0b5e8cc9701f Add a secret settings for never show again remote settinggs crash pull r=kaya,android-reviewers https://github.com/mozilla-firefox/firefox/commit/11cd776b3929 https://hg.mozilla.org/integration/autoland/rev/1b95b56ffb7b Add testing of interesting crash on Android r=kaya,boek,geckoview-reviewers,android-reviewers,ohall,sfamisa
Duplicate of this bug: 1830250

It's OK to be enabled now on Nightly

Keywords: leave-open

Everything has landed. Remainder of enabling on all channels is bug 1950866

Status: NEW → RESOLVED
Closed: 1 month ago
Resolution: --- → FIXED

I've created a simple query for checking the crash submission rate for background processes: https://sql.telemetry.mozilla.org/queries/111595.

However, it isn't particularly useful for Nightly because we do get a decent number of reports there. If you look at Beta/Release, those have hardly any. I would be very interested to see what happens to those graphs when we enable this on all channels.

You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: