[meta] Improve developer ergonomics of adding new crash annotations to glean pings
Categories
(Toolkit :: Crash Reporting, enhancement)
Tracking
()
People
(Reporter: afranchuk, Assigned: afranchuk)
References
(Depends on 1 open bug, Blocks 1 open bug)
Details
(Keywords: meta)
Attachments
(7 obsolete files)
Because Glean didn't support object metric types when the crash ping was originally created, annotations were manually translated to metrics. This is cumbersome. Ideally, a developer who wants to add an annotation can do so by simply editing CrashAnnotations.yaml and adding code to set the annotation. Instead, they must edit CrashAnnotations.yaml (some code paths will still use it for various purposes besides the legacy ping), the desktop and/or android metrics.yaml glean definitions, and the desktop/crashreporter client code to populate the metrics and/or the android client code to populate the metrics (all in different languages!).
There are a few solutions to this problem (numbered here for easy referencing):
- Bug 1911301 would do a lot to mitigate the overhead here by centralizing the conversion code in one place and in one language. That would arguably help out a lot: asking developers to update multiple programming languages isn't great.
- By adding a little more information to
CrashAnnotations.yaml(where necessary), we may be able to generate the metrics files and/or conversions code. - We could change the crash ping to use an
objectmetric for annotations. This makes querying annotation information in e.g. BigQuery slightly more involved, and also removes the ability to store particular annotations in more intelligent ways (quantitymetrics with units, string lists, etc). However, it would allow the code that stores annotations in glean metrics to be generic. We could potentially also store some things more intelligently by adding extra information toCrashAnnotations.yaml.objectmetrics don't allow arbitrary keys in objects, however we could use a definition like:
type: object
properties:
stringAnnotations:
type: array
items:
type: object
properties:
key: string
value: string
boolAnnotations:
type: array
items:
type: object
properties:
key: string
value: string
...
- Crash annotations could be stored using the glean API, dropping
CrashAnnotations.yamlaltogether. This is unlikely to be tenable since the annotation API is optimized for efficiency and for running in degraded circumstances. The API also features the ability to extract annotations from other processes, which would need to be reworked in some (likely significant) way.
| Assignee | ||
Comment 1•7 months ago
|
||
The ping scope in CrashAnnotations.yaml is manually verified when adding Glean crash ping metrics (i.e., we don't have any way to programmatically verify that annotations put in the metrics are of the scope described in CrashAnnotations.yaml). We could add a lint for this, though it would require enumerating the mapping of annotations to metrics. This is something that would likely be fixed by reworking the ergonomics here.
Another possible solution:
- Use the
metrics.yamlas the newCrashAnnotations.yaml. I.e., it may be possible to define all the annotations within Glean definitions and change the crash annotations codegen to read from the metrics file rather than from theCrashAnnotations.yamlformat. This is similar in spirit to solution 2: combining two separate definitions into one.
| Assignee | ||
Updated•3 months ago
|
| Assignee | ||
Comment 2•3 months ago
|
||
Plan
After some thought, I've come up with the following plan:
- Change
CrashAnnotations.yamlto instead be a Rust source file, so that the more complicated conversion logic from annotations to Glean metrics (e.g., stack traces) can be written once. - Use this new Rust source in:
- a build-time program to generate the language bindings for annotations (we still want language-native annotation enums),
- the crash reporter client,
- either linked into firefox or invoked through the crash reporter client binary, and
- linked into Android. This will be either combined into the same library that has minidump_analyzer, or the same one as crashhelper, to avoid yet another library (we will combine those in the future at some point anyway, when the crash helper does the minidump analysis itself).
- Because Glean uses global statics, also send the crash ping from this new module.
Reasoning
- This allows conversion code to be written in one language, rather than being generated in multiple languages (which would be higher maintenance to test, and more brittle to changes in Glean, language lints, etc).
- Developers adding a crash annotation can do so by adding a definition to the Rust source file of annotations (which can be designed to be fairly declarative). If there is special logic to convert the annotation to a Glean metric, it can be written there, but otherwise default behaviors can handle the rest, and the developers would need only to then set the annotation in their code (no touching Glean
metrics.yamlfiles, nor setting up conversions in many languages). - This is an incremental step toward an improved crash lifecycle. This code could be easily integrated into the crash helper in the future (which would be handling crash pings and reports).
Concession
I have a fairly clear idea of what should be done to implement this, however we could instead keep annotations in CrashAnnotations.yaml and generate all the language-specific code. I estimate this would take more time to develop/test (the only-Rust solution can piggyback off of the existing infrastructure like minidump-analyzer or crashhelper for "distribution" among the language runtimes), however the real cost would be future maintenance.
Comment 3•3 months ago
|
||
I like this idea, and I especially like the fact that we could use the same rust code both at runtime and at compile time. Testing coverage is the cherry on top. We could use this to also further enforce static constraints on crash annotations (such as types and possibly even values, e.g. ensuring some annotation can only hold a subset of values, or the value must have certain constraints, etc...).
| Assignee | ||
Comment 4•3 months ago
•
|
||
I have been working on this, however the build system does not have much support for running compile-time Rust (aside from, perhaps, using a build script to produce side effects). The #build folks advised that I stick to using Python (at least for the time being) for the compile-time aspects. In the future there may be first-class support for running compile-time Rust: it sounded as if they were kicking around ideas to get it to work, because application-services developers need it, too. We might be able to get away with a cargo build script doing it, but for this bug I'm sticking with the Python. There still will be one Rust library that does the work across all platform environments, so we'll have reduced maintenance burden from that.
FYI the main issue #build were concerned with was supporting artifact builds.
| Assignee | ||
Comment 5•3 months ago
|
||
This leaves CrashAnnotations.yaml where it was to not confuse people
and tooling looking for it.
| Assignee | ||
Comment 6•3 months ago
|
||
Since the Java/Kotlin implementation worked based on the file extension,
I figured it could work that way for everthing.
| Assignee | ||
Comment 7•3 months ago
|
||
| Assignee | ||
Comment 8•3 months ago
|
||
This crate handles converting crash annotations to Glean metrics and
sending the crash ping.
Glean metrics are generated directly from CrashAnnotations.yaml.
| Assignee | ||
Comment 9•3 months ago
|
||
| Assignee | ||
Comment 10•3 months ago
|
||
This uses the crash reporter client to send the ping. This will be
cleaner once we remove the CrashService/CrashManager, so that was can
immediately pass the ping ownership to the crash reporter client (and it
can handle storage, minidump analysis, etc). For now, we take this less
elegant approach so to preserve the existing behavior.
| Assignee | ||
Comment 11•3 months ago
|
||
We combine the minidump-analyzer and pingsender crates into one JNI
shared library to be loaded by lib-crash.
Comment 12•3 months ago
|
||
There's a lot of independent bits here. I'd prefer if you'd split them into different bugs which we'll land one at a time instead of as a set given that they don't strictly depend on each other. In particular it feels to me that pt 1 & 2 could land together, then pt 3 and finally pt 4-7. You could turn this into a meta bug given it's broader scope and then move the patches into dependent bugs, each scoping out one of the changes.
| Assignee | ||
Updated•3 months ago
|
Comment 13•3 months ago
|
||
Comment on attachment 9513507 [details]
Bug 1950749 p1 - Move annotations-related things to their own folder. r=gsvelto
Revision D264966 was moved to bug 1989407. Setting attachment 9513507 [details] to obsolete.
Comment 14•3 months ago
|
||
Comment on attachment 9513508 [details]
Bug 1950749 p2 - Refactor generation script to have one entry point. r=gsvelto
Revision D264967 was moved to bug 1989407. Setting attachment 9513508 [details] to obsolete.
Comment 15•3 months ago
|
||
Comment on attachment 9513510 [details]
WIP: Bug 1950749 p4 - Create the pingsender crate. r=gsvelto
Revision D264969 was moved to bug 1989439. Setting attachment 9513510 [details] to obsolete.
Comment 16•3 months ago
|
||
Comment on attachment 9513511 [details]
WIP: Bug 1950749 p5 - Use the pingsender crate in the crashreporter client. r=gsvelto
Revision D264970 was moved to bug 1989439. Setting attachment 9513511 [details] to obsolete.
Comment 17•3 months ago
|
||
Comment on attachment 9513513 [details]
WIP: Bug 1950749 p6 - Use the pingsender crate for firefox desktop crashes. r=gsvelto
Revision D264971 was moved to bug 1989439. Setting attachment 9513513 [details] to obsolete.
Comment 18•3 months ago
|
||
Comment on attachment 9513514 [details]
WIP: Bug 1950749 p7 - Use the pingsender crate for Android crashes. r=gsvelto
Revision D264972 was moved to bug 1989439. Setting attachment 9513514 [details] to obsolete.
Updated•3 months ago
|
Description
•