Open Bug 1989439 Opened 3 months ago Updated 1 day ago

Use a single code path for sending crash pings across all platforms/applications

Tracking

()

Status:

ASSIGNED

People

(Reporter: afranchuk, Assigned: afranchuk)

References

(Blocks 2 open bugs)

Details

(Keywords: leave-open)

Attachments

(9 files)

Bug 1989439 p1 - Create the crashping crate. r=gsvelto 3 months ago Alex Franchuk [:afranchuk] 48 bytes, text/x-phabricator-request		Details \| Review
Bug 1989439 p3 - Use the crashping crate in the crashreporter client. r=gsvelto 3 months ago Alex Franchuk [:afranchuk] 48 bytes, text/x-phabricator-request		Details \| Review
Bug 1989439 p4 - Use the crashping crate for firefox desktop crashes. r=gsvelto 3 months ago Alex Franchuk [:afranchuk] 48 bytes, text/x-phabricator-request		Details \| Review
Bug 1989439 p6 - Use the crashping crate for Android crashes. r=gsvelto 3 months ago Alex Franchuk [:afranchuk] 48 bytes, text/x-phabricator-request		Details \| Review
Bug 1989439 p8 - Remove unused Glean metrics. r=gsvelto 3 months ago Alex Franchuk [:afranchuk] 48 bytes, text/x-phabricator-request		Details \| Review
Bug 1989439 p7 - Add the lib-crash android test to CI. r=tcampbell 1 month ago Alex Franchuk [:afranchuk] 48 bytes, text/x-phabricator-request		Details \| Review
Bug 1989439 p5 - Improve Throwable.getStacktraceAsJsonString type/module derivation r=tcampbell 1 month ago Alex Franchuk [:afranchuk] 48 bytes, text/x-phabricator-request		Details \| Review
[mozilla/probe-scraper] Update repositories.yaml with the crashping library for bug 1989439. (#969) 1 month ago Alex Franchuk [:afranchuk] 49 bytes, text/x-github-pull-request		Details \| Review
Bug 1989439 p2 - Exclusively acquire the Glean store in the crashping crate r=gsvelto 17 days ago Alex Franchuk [:afranchuk] 48 bytes, text/x-phabricator-request		Details \| Review

Alex Franchuk [:afranchuk]

Assignee

Description

•

3 months ago

See bug 1950749 for the motivation here. It would be easiest to write as a Rust crate as we have

precedent for cross-platform Rust code (minidump-analyzer) and
scripts which generate the Glean Rust API (in the crashreporter client), though it's not that complicated for other languages, too.

Alex Franchuk [:afranchuk]

Assignee

Updated

•

3 months ago

Severity: -- → S3

Alex Franchuk [:afranchuk]

Assignee

Comment 1

•

3 months ago

Attached file Bug 1989439 p1 - Create the crashping crate. r=gsvelto — Details

This crate handles converting crash annotations to Glean metrics and
sending the crash ping.

Glean metrics are generated from CrashAnnotations.yaml.

Alex Franchuk [:afranchuk]

Assignee

Comment 2

•

3 months ago

Attached file Bug 1989439 p3 - Use the crashping crate in the crashreporter client. r=gsvelto — Details

Alex Franchuk [:afranchuk]

Assignee

Comment 3

•

3 months ago

Attached file Bug 1989439 p4 - Use the crashping crate for firefox desktop crashes. r=gsvelto — Details

This uses the crash reporter client to send the ping. This will be
cleaner once we remove the CrashService/CrashManager, so that we can
immediately pass the ping ownership to the crash reporter client (and it
can handle storage, minidump analysis, etc). For now, we take this less
elegant approach to preserve the existing behavior.

Resubmission

We should determine whether we should also launch the crashreporter
client in an idle state for a short period when firefox starts to allow
Glean to potentially resubmit pending pings. Otherwise, we may not get
that behavior until another crash occurs. In the future, we will fold
ping sending into the crashhelper, which always launches and stays up.

Alex Franchuk [:afranchuk]

Assignee

Comment 4

•

3 months ago

Attached file Bug 1989439 p6 - Use the crashping crate for Android crashes. r=gsvelto — Details

This combines the minidump-analyzer and crashping crates into one JNI
shared library to be loaded by lib-crash.

Alex Franchuk [:afranchuk]

Assignee

Comment 5

•

3 months ago

Attached file Bug 1989439 p8 - Remove unused Glean metrics. r=gsvelto — Details

Alex Franchuk [:afranchuk]

Assignee

Comment 6

•

1 month ago

Attached file Bug 1989439 p7 - Add the lib-crash android test to CI. r=tcampbell — Details

This requires adding a dependency on geckoview for CI to work (because
the binaries aren't available in the build output; they are only
available in the APK artifact of the geckoview build).

Phabricator Automation

Updated

•

1 month ago

Attachment #9514428 - Attachment description: Bug 1989439 p5 - Remove unused Glean metrics. r=gsvelto → Bug 1989439 p6 - Remove unused Glean metrics. r=gsvelto

Alex Franchuk [:afranchuk]

Assignee

Comment 7

•

1 month ago

Attached file Bug 1989439 p5 - Improve Throwable.getStacktraceAsJsonString type/module derivation r=tcampbell — Details

The typename and module were previously derived from the
Throwable.toString() output, which relies on the toString()
implementation unnecessarily. This change prefers the use of reflection
to get the qualified name of the type, falling back to the toString()
implementation if the type name isn't available.

There are a number of existing tests which cover this behavior (and
continue to pass). This methodology has also been tested in the previous
version of the Java Glean code (hence the change).

Phabricator Automation

Updated

•

1 month ago

Attachment #9514427 - Attachment description: Bug 1989439 p4 - Use the crashping crate for Android crashes. r=gsvelto → Bug 1989439 p5 - Use the crashping crate for Android crashes. r=gsvelto

Phabricator Automation

Updated

•

1 month ago

Attachment #9526155 - Attachment description: Bug 1989439 p5 - Add the lib-crash android test to CI. r=tcampbell → Bug 1989439 p6 - Add the lib-crash android test to CI. r=tcampbell

Phabricator Automation

Updated

•

1 month ago

Attachment #9514428 - Attachment description: Bug 1989439 p6 - Remove unused Glean metrics. r=gsvelto → Bug 1989439 p7 - Remove unused Glean metrics. r=gsvelto

Alex Franchuk [:afranchuk]

Assignee

Comment 8

•

1 month ago

Attached file [mozilla/probe-scraper] Update repositories.yaml with the crashping library for bug 1989439. (#969) — Details

Jan-Erik Rediger [:janerik]

Comment 9

•

25 days ago

Pulling back the discussion from https://github.com/mozilla/probe-scraper/pull/969#pullrequestreview-3487271449 into the bug.

I first want to untangle this a bit, because it's getting confusing to me.

We have a crash ping on Desktop.
- Defined in toolkit/components/crashes/pings.yaml
- Where is this sent from?
We also have a Crash reporter application on Desktop.
- Defined to use the firefox.crashreporter application ID
- It uses the same pings.yaml definition as above
- When is this used instead of the ping above?
We have a crash ping on Android
- defined in mobile/android/android-components/components/lib/crash/pings.yaml
- same definition, but defined so Android can send it?

This patch moves the ping and metrics definitions to toolkit/crashreporter/crashping, a Rust library, that should end up being used by both Desktop and Android.
This library wants to send the crash ping.
On Desktop that is its own application (Crash Reporter from above?).
On Android it's the CrashReporterService.

Another requirement seems to be:
The crash ping should contain a client ID of the application that crashed.
We don't provide a way to extract the client ID from the Glean SDK.
And it's not recommended to re-use the database directory from another application.
What will the client ID be used for?

One observation from me (and why I blocked the probe-scraper PR for now): We can't have the ping defined in multiple places and be marked to be sent from the same applications. Our pipeline won't be happy with that.

Calling in Arkadiusz to help with determining what the right/best app ID is for this use case.

Flags: needinfo?(akomarzewski)

Alex Franchuk [:afranchuk]

Assignee

Comment 10

•

25 days ago

(In reply to Jan-Erik Rediger [:janerik] from comment #9)

We have a crash ping on Desktop.

Defined in toolkit/components/crashes/pings.yaml

Where is this sent from?

Sent from here.

We also have a Crash reporter application on Desktop.

Defined to use the firefox.crashreporter application ID

It uses the same pings.yaml definition as above

When is this used instead of the ping above?

The crash reporter and its ping are used for main process crashes, as opposed to content/background process crashes.

We have a crash ping on Android

defined in mobile/android/android-components/components/lib/crash/pings.yaml

same definition, but defined so Android can send it?

Yes, defined so Android can send it.

This patch moves the ping and metrics definitions to toolkit/crashreporter/crashping, a Rust library, that should end up being used by both Desktop and Android.
This library wants to send the crash ping.
On Desktop that is its own application (Crash Reporter from above?).

Yes, on Desktop the crashreporter will always be used to send the ping.

On Android it's the CrashReporterService.

Another requirement seems to be:
The crash ping should contain a client ID of the application that crashed.
We don't provide a way to extract the client ID from the Glean SDK.
And it's not recommended to re-use the database directory from another application.
What will the client ID be used for?

We need a consistent client ID to be able to calculate crash incidence (# of clients who experience a crash in a given day), for stability measurement purposes.

One observation from me (and why I blocked the probe-scraper PR for now): We can't have the ping defined in multiple places and be marked to be sent from the same applications. Our pipeline won't be happy with that.

This change reduces the definitions to only be in one place. Are you saying that probe-scraper libraries aren't expected to ever define their own pings? Or is the issue that, in the interim between when that PR lands and when this stack lands, there will be multiple definitions of the crash ping because we will not have removed the old definitions in that PR (in an attempt for contiguity)?

Jan-Erik Rediger [:janerik]

Comment 11

•

25 days ago

(In reply to Alex Franchuk [:afranchuk] from comment #10)

This change reduces the definitions to only be in one place. Are you saying that probe-scraper libraries aren't expected to ever define their own pings? Or is the issue that, in the interim between when that PR lands and when this stack lands, there will be multiple definitions of the crash ping because we will not have removed the old definitions in that PR (in an attempt for contiguity)?

The latter. Having both definitions be known to probe-scraper will be problematic, so we just need to make sure to remove the one when we add the other (and then run a dry-run to double-check it works out as expected)

Alex Franchuk [:afranchuk]

Assignee

Comment 12

•

25 days ago

•

Edited

Okay. We can certainly remove the definition at the same time, and either

accept a blip in the data (I assume that would only affect Nightly? Does probe-scraper have some sort of rollout that follows nightly/beta/release?), or
if the application IDs and the ping/metrics definitions are the same, in theory pings would still be routed in the same way (I think?).

But it's not a big deal if we miss some pings on Nightly for a short period either way. The stack is higher risk to land (i.e., it's complicated, and if there's some problem, backing out would be more cumbersome if we also had to back out some probe-scraper changes), so it would make more sense to land that first and make sure it sticks before landing the probe-scraper change.

Chris H-C :chutten|PTO (back Jan 5)

Comment 13

•

24 days ago

For Firefox Desktop, there's nothing in probe-scraper that knows anything about channels. It merely looks at a branch of a repo and uses that history as its input. (For mobile products where channels are different app ids because of app store reasons, probe-scraper does know some things about channels, but not in this way.) If you add a metric or a ping, that night a column or a table is added to that entire application's purview, all at once. So you need not worry.
Yeah, probe-scraper doesn't care what specific files in which specific components a set of metrics and pings are defined, just that they're not duplicated or otherwise manipulated in a way that would ask it to make a non-backwards-compatible schema change.

Alex Franchuk [:afranchuk]

Assignee

Comment 14

•

24 days ago

Okay! On second thought, whether I land the stack first or the probe-scraper PR, probe-scraper will always be out-of-sync with missing files since the stack adds a new file and removes the others, and probe-scraper can only have one or the other. Out of an abundance of caution (or rather, an abundance of making back out simpler if need be), maybe I'll move the (thankfully separate) patch which removes the old files to a separate bug, so that I can land the stack with the old Glean files still existing. Then I can land the probe-scraper PR after, but in the interim it'll still have the pings defined correctly using the old, but identical, files. Though I should note, there will be new metrics which weren't included before. Will it be safe to send those (i.e., will they just be discarded)?

If not, there's also another increment that can be done: landing everything up to the crashping crate addition first (which adds the generated metrics file and ping file), landing probe-scraper changes (to use the new files, but otherwise still ingesting pings sent from clients using the old files), and then land the patches which actually use the crashping crate and remove the old files.

If it's safe to send extra metrics in that interim period before the probe-scraper PR is merged, I'm not sure there's much of a difference otherwise (unless we think the probe-scraper PR will run into issues), so it may be better to be less incremental and as synchronized as possible with the changes.

Chris H-C :chutten|PTO (back Jan 5)

Comment 15

•

24 days ago

Additional properties on known pings that aren't known by the pipeline aren't unsafe. They merely get dumped into the additional_properties column which is present in each ping dataset for exactly this purpose. So any extra metrics that would result in additional properties (ie, all metrics of types other than event) don't cause any problems for us, and only small problems for you if you need to rely on their values for analyses before the pipeline learns about them.

Arkadiusz Komarzewski [:akomar]

Comment 16

•

23 days ago

I was summoned couple of comments above with a question about app ID, but it seems that some aspects were clarified since then.
IIUC proposed plan and assumptions are:

We don't change (or introduce new) app IDs
We remove old ping definition at the same time as we add new one in probe-scraper PR
New ping adds some new metrics

If above are correct, both options described in https://bugzilla.mozilla.org/show_bug.cgi?id=1989439#c14 should be safe.

For awareness: BQ schema generation is decoupled from probe-scraper PRs landing, normally running after midnight UTC. It's the process where clever changes to probe-scraper configuration like this one can hit us and cause problems, for example due to schema backward incompatibility. Because here we are effectively not removing pings, only adding metrics, and we won't have the same ping defined twice - we shouldn't have any issues. However since this is a non-standard change we'll want to manually trigger schema generation soon after we merge the probe-scraper PR to confirm the change propagates correctly.

Flags: needinfo?(akomarzewski)

Alex Franchuk [:afranchuk]

Assignee

Comment 17

•

17 days ago

Attached file Bug 1989439 p2 - Exclusively acquire the Glean store in the crashping crate r=gsvelto — Details

A Glean store cannot be accessed by more than one application at a time.

Phabricator Automation

Updated

•

17 days ago

Attachment #9514425 - Attachment description: Bug 1989439 p2 - Use the crashping crate in the crashreporter client. r=gsvelto → Bug 1989439 p3 - Use the crashping crate in the crashreporter client. r=gsvelto

Phabricator Automation

Updated

•

17 days ago

Attachment #9514426 - Attachment description: Bug 1989439 p3 - Use the crashping crate for firefox desktop crashes. r=gsvelto → Bug 1989439 p4 - Use the crashping crate for firefox desktop crashes. r=gsvelto

Phabricator Automation

Updated

•

17 days ago

Attachment #9527529 - Attachment description: Bug 1989439 p4 - Improve Throwable.getStacktraceAsJsonString type/module derivation r=tcampbell → Bug 1989439 p5 - Improve Throwable.getStacktraceAsJsonString type/module derivation r=tcampbell

Phabricator Automation

Updated

•

17 days ago

Attachment #9514427 - Attachment description: Bug 1989439 p5 - Use the crashping crate for Android crashes. r=gsvelto → Bug 1989439 p6 - Use the crashping crate for Android crashes. r=gsvelto

Phabricator Automation

Updated

•

17 days ago

Attachment #9526155 - Attachment description: Bug 1989439 p6 - Add the lib-crash android test to CI. r=tcampbell → Bug 1989439 p7 - Add the lib-crash android test to CI. r=tcampbell

Phabricator Automation

Updated

•

17 days ago

Attachment #9514428 - Attachment description: Bug 1989439 p7 - Remove unused Glean metrics. r=gsvelto → Bug 1989439 p8 - Remove unused Glean metrics. r=gsvelto

Alex Franchuk [:afranchuk]

Assignee

Comment 18

•

5 days ago

Marking leave-open to land in two separate phases.

Keywords: leave-open

Pulsebot

Comment 19

•

5 days ago

Pushed by afranchuk@mozilla.com: https://github.com/mozilla-firefox/firefox/commit/c4888cef7e32 https://hg.mozilla.org/integration/autoland/rev/6a036937dcbf p1 - Create the crashping crate. r=gsvelto https://github.com/mozilla-firefox/firefox/commit/9c002b5a7061 https://hg.mozilla.org/integration/autoland/rev/924181728eb6 p2 - Exclusively acquire the Glean store in the crashping crate r=gsvelto

Cristian Tuns

Comment 20

•

4 days ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/6a036937dcbf
https://hg.mozilla.org/mozilla-central/rev/924181728eb6

Treeherder Bug Filer

Updated

•

3 days ago

Regressions: 2006492

Treeherder Bug Filer

Updated

•

3 days ago

Regressions: 2006493

Jan-Erik Rediger [:janerik]

Comment 21

•

2 days ago

•

Edited

:akomar and I tried to land the probe-scraper change today.
Unfortunately we hit a small bump: It's currently causing a "duplicated metric identifiers" failed check in our probe-sraper/schema generation (as also evidenced by some emails some of you might have gotten).

I think I know why this is happening in our side of the system, but I will need to verify that and eventually reland this.
However we'll have a production change freeze after today, so this will not land today. I hope I have everything lined up for this to land as soon as we're back at work in January.

Note: the code as landed is fine. We will continue to receive data just fine without issues.

Alex Franchuk [:afranchuk]

Assignee

Updated

•

1 day ago

Blocks: 2007044

You need to log in before you can comment on or make changes to this bug.