Use a single code path for sending crash pings across all platforms/applications
Categories
(Toolkit :: Crash Reporting, enhancement)
Tracking
()
People
(Reporter: afranchuk, Assigned: afranchuk)
References
(Blocks 2 open bugs)
Details
(Keywords: leave-open)
Attachments
(9 files)
|
48 bytes,
text/x-phabricator-request
|
Details | Review | |
|
48 bytes,
text/x-phabricator-request
|
Details | Review | |
|
48 bytes,
text/x-phabricator-request
|
Details | Review | |
|
48 bytes,
text/x-phabricator-request
|
Details | Review | |
|
48 bytes,
text/x-phabricator-request
|
Details | Review | |
|
48 bytes,
text/x-phabricator-request
|
Details | Review | |
|
48 bytes,
text/x-phabricator-request
|
Details | Review | |
|
49 bytes,
text/x-github-pull-request
|
Details | Review | |
|
48 bytes,
text/x-phabricator-request
|
Details | Review |
See bug 1950749 for the motivation here. It would be easiest to write as a Rust crate as we have
- precedent for cross-platform Rust code (
minidump-analyzer) and - scripts which generate the Glean Rust API (in the crashreporter client), though it's not that complicated for other languages, too.
| Assignee | ||
Updated•3 months ago
|
| Assignee | ||
Comment 1•3 months ago
|
||
This crate handles converting crash annotations to Glean metrics and
sending the crash ping.
Glean metrics are generated from CrashAnnotations.yaml.
| Assignee | ||
Comment 2•3 months ago
|
||
| Assignee | ||
Comment 3•3 months ago
|
||
This uses the crash reporter client to send the ping. This will be
cleaner once we remove the CrashService/CrashManager, so that we can
immediately pass the ping ownership to the crash reporter client (and it
can handle storage, minidump analysis, etc). For now, we take this less
elegant approach to preserve the existing behavior.
Resubmission
We should determine whether we should also launch the crashreporter
client in an idle state for a short period when firefox starts to allow
Glean to potentially resubmit pending pings. Otherwise, we may not get
that behavior until another crash occurs. In the future, we will fold
ping sending into the crashhelper, which always launches and stays up.
| Assignee | ||
Comment 4•3 months ago
|
||
This combines the minidump-analyzer and crashping crates into one JNI
shared library to be loaded by lib-crash.
| Assignee | ||
Comment 5•3 months ago
|
||
| Assignee | ||
Comment 6•1 month ago
|
||
This requires adding a dependency on geckoview for CI to work (because
the binaries aren't available in the build output; they are only
available in the APK artifact of the geckoview build).
Updated•1 month ago
|
| Assignee | ||
Comment 7•1 month ago
|
||
The typename and module were previously derived from the
Throwable.toString() output, which relies on the toString()
implementation unnecessarily. This change prefers the use of reflection
to get the qualified name of the type, falling back to the toString()
implementation if the type name isn't available.
There are a number of existing tests which cover this behavior (and
continue to pass). This methodology has also been tested in the previous
version of the Java Glean code (hence the change).
Updated•1 month ago
|
Updated•1 month ago
|
Updated•1 month ago
|
| Assignee | ||
Comment 8•1 month ago
|
||
Comment 9•25 days ago
|
||
Pulling back the discussion from https://github.com/mozilla/probe-scraper/pull/969#pullrequestreview-3487271449 into the bug.
I first want to untangle this a bit, because it's getting confusing to me.
- We have a crash ping on Desktop.
- Defined in
toolkit/components/crashes/pings.yaml - Where is this sent from?
- Defined in
- We also have a Crash reporter application on Desktop.
- Defined to use the
firefox.crashreporterapplication ID - It uses the same pings.yaml definition as above
- When is this used instead of the ping above?
- Defined to use the
- We have a crash ping on Android
- defined in
mobile/android/android-components/components/lib/crash/pings.yaml - same definition, but defined so Android can send it?
- defined in
This patch moves the ping and metrics definitions to toolkit/crashreporter/crashping, a Rust library, that should end up being used by both Desktop and Android.
This library wants to send the crash ping.
On Desktop that is its own application (Crash Reporter from above?).
On Android it's the CrashReporterService.
Another requirement seems to be:
The crash ping should contain a client ID of the application that crashed.
We don't provide a way to extract the client ID from the Glean SDK.
And it's not recommended to re-use the database directory from another application.
What will the client ID be used for?
One observation from me (and why I blocked the probe-scraper PR for now): We can't have the ping defined in multiple places and be marked to be sent from the same applications. Our pipeline won't be happy with that.
Calling in Arkadiusz to help with determining what the right/best app ID is for this use case.
| Assignee | ||
Comment 10•25 days ago
|
||
(In reply to Jan-Erik Rediger [:janerik] from comment #9)
- We have a crash ping on Desktop.
- Defined in
toolkit/components/crashes/pings.yaml- Where is this sent from?
Sent from here.
- We also have a Crash reporter application on Desktop.
- Defined to use the
firefox.crashreporterapplication ID- It uses the same pings.yaml definition as above
- When is this used instead of the ping above?
The crash reporter and its ping are used for main process crashes, as opposed to content/background process crashes.
- We have a crash ping on Android
- defined in
mobile/android/android-components/components/lib/crash/pings.yaml- same definition, but defined so Android can send it?
Yes, defined so Android can send it.
This patch moves the ping and metrics definitions to
toolkit/crashreporter/crashping, a Rust library, that should end up being used by both Desktop and Android.
This library wants to send thecrashping.
On Desktop that is its own application (Crash Reporter from above?).
Yes, on Desktop the crashreporter will always be used to send the ping.
On Android it's the CrashReporterService.
Another requirement seems to be:
The crash ping should contain a client ID of the application that crashed.
We don't provide a way to extract the client ID from the Glean SDK.
And it's not recommended to re-use the database directory from another application.
What will the client ID be used for?
We need a consistent client ID to be able to calculate crash incidence (# of clients who experience a crash in a given day), for stability measurement purposes.
One observation from me (and why I blocked the probe-scraper PR for now): We can't have the ping defined in multiple places and be marked to be sent from the same applications. Our pipeline won't be happy with that.
This change reduces the definitions to only be in one place. Are you saying that probe-scraper libraries aren't expected to ever define their own pings? Or is the issue that, in the interim between when that PR lands and when this stack lands, there will be multiple definitions of the crash ping because we will not have removed the old definitions in that PR (in an attempt for contiguity)?
Comment 11•25 days ago
|
||
(In reply to Alex Franchuk [:afranchuk] from comment #10)
This change reduces the definitions to only be in one place. Are you saying that probe-scraper libraries aren't expected to ever define their own pings? Or is the issue that, in the interim between when that PR lands and when this stack lands, there will be multiple definitions of the
crashping because we will not have removed the old definitions in that PR (in an attempt for contiguity)?
The latter. Having both definitions be known to probe-scraper will be problematic, so we just need to make sure to remove the one when we add the other (and then run a dry-run to double-check it works out as expected)
| Assignee | ||
Comment 12•25 days ago
•
|
||
Okay. We can certainly remove the definition at the same time, and either
- accept a blip in the data (I assume that would only affect Nightly? Does probe-scraper have some sort of rollout that follows nightly/beta/release?), or
- if the application IDs and the ping/metrics definitions are the same, in theory pings would still be routed in the same way (I think?).
But it's not a big deal if we miss some pings on Nightly for a short period either way. The stack is higher risk to land (i.e., it's complicated, and if there's some problem, backing out would be more cumbersome if we also had to back out some probe-scraper changes), so it would make more sense to land that first and make sure it sticks before landing the probe-scraper change.
Comment 13•24 days ago
|
||
- For Firefox Desktop, there's nothing in probe-scraper that knows anything about channels. It merely looks at a branch of a repo and uses that history as its input. (For mobile products where channels are different app ids because of app store reasons, probe-scraper does know some things about channels, but not in this way.) If you add a metric or a ping, that night a column or a table is added to that entire application's purview, all at once. So you need not worry.
- Yeah, probe-scraper doesn't care what specific files in which specific components a set of metrics and pings are defined, just that they're not duplicated or otherwise manipulated in a way that would ask it to make a non-backwards-compatible schema change.
| Assignee | ||
Comment 14•24 days ago
|
||
Okay! On second thought, whether I land the stack first or the probe-scraper PR, probe-scraper will always be out-of-sync with missing files since the stack adds a new file and removes the others, and probe-scraper can only have one or the other. Out of an abundance of caution (or rather, an abundance of making back out simpler if need be), maybe I'll move the (thankfully separate) patch which removes the old files to a separate bug, so that I can land the stack with the old Glean files still existing. Then I can land the probe-scraper PR after, but in the interim it'll still have the pings defined correctly using the old, but identical, files. Though I should note, there will be new metrics which weren't included before. Will it be safe to send those (i.e., will they just be discarded)?
If not, there's also another increment that can be done: landing everything up to the crashping crate addition first (which adds the generated metrics file and ping file), landing probe-scraper changes (to use the new files, but otherwise still ingesting pings sent from clients using the old files), and then land the patches which actually use the crashping crate and remove the old files.
If it's safe to send extra metrics in that interim period before the probe-scraper PR is merged, I'm not sure there's much of a difference otherwise (unless we think the probe-scraper PR will run into issues), so it may be better to be less incremental and as synchronized as possible with the changes.
Comment 15•24 days ago
|
||
Additional properties on known pings that aren't known by the pipeline aren't unsafe. They merely get dumped into the additional_properties column which is present in each ping dataset for exactly this purpose. So any extra metrics that would result in additional properties (ie, all metrics of types other than event) don't cause any problems for us, and only small problems for you if you need to rely on their values for analyses before the pipeline learns about them.
Comment 16•23 days ago
|
||
I was summoned couple of comments above with a question about app ID, but it seems that some aspects were clarified since then.
IIUC proposed plan and assumptions are:
- We don't change (or introduce new) app IDs
- We remove old ping definition at the same time as we add new one in probe-scraper PR
- New ping adds some new metrics
If above are correct, both options described in https://bugzilla.mozilla.org/show_bug.cgi?id=1989439#c14 should be safe.
For awareness: BQ schema generation is decoupled from probe-scraper PRs landing, normally running after midnight UTC. It's the process where clever changes to probe-scraper configuration like this one can hit us and cause problems, for example due to schema backward incompatibility. Because here we are effectively not removing pings, only adding metrics, and we won't have the same ping defined twice - we shouldn't have any issues. However since this is a non-standard change we'll want to manually trigger schema generation soon after we merge the probe-scraper PR to confirm the change propagates correctly.
| Assignee | ||
Comment 17•17 days ago
|
||
A Glean store cannot be accessed by more than one application at a time.
Updated•17 days ago
|
Updated•17 days ago
|
Updated•17 days ago
|
Updated•17 days ago
|
Updated•17 days ago
|
Updated•17 days ago
|
| Assignee | ||
Comment 18•5 days ago
|
||
Marking leave-open to land in two separate phases.
Comment 19•5 days ago
|
||
Comment 20•4 days ago
|
||
| bugherder | ||
Comment 21•2 days ago
•
|
||
:akomar and I tried to land the probe-scraper change today.
Unfortunately we hit a small bump: It's currently causing a "duplicated metric identifiers" failed check in our probe-sraper/schema generation (as also evidenced by some emails some of you might have gotten).
I think I know why this is happening in our side of the system, but I will need to verify that and eventually reland this.
However we'll have a production change freeze after today, so this will not land today. I hope I have everything lined up for this to land as soon as we're back at work in January.
Note: the code as landed is fine. We will continue to receive data just fine without issues.
Description
•