crash metrics might cause duplicated metrics due to inclusion in gecko and libcrash
Categories
(Data Platform and Tools Graveyard :: Glean Platform, defect)
Tracking
(Not tracked)
People
(Reporter: janerik, Assigned: relud)
References
Details
We're facing a problem of the probe-scraper potentially seing duplicated metrics, and I want to avoid problems before we land code that triggers this.
Current state
Desktop introduced crash metrics (toolkit/components/crashes/metrics.yaml).
These were exposed as part of the gecko library (pr 544).
Through the dependency chain gecko -> firefox_desktop these landed in the Desktop schemas.
Through the dependency chain gecko -> fenix these also landed in the Fenix schemas.
Fenix doesn't actually set these metrics or send the associated crash ping through Gecko.
These files subsequently got moved to the firefox_desktop app.
PR 551 does this for probe-scraper.
It is not merged yet.
Now libcrash from Android Components plans to reimplement the crash metrics for Android (PR 671).
These are an exact copy of the definitions in Desktop and will be kept in sync with it as well.
The dependency chain would be org.mozilla.components:lib-crash -> fenix for these.
probe-scraper testing
I tested the necessary changes in probe-scraper.
First using main and fetching gecko metrics:
python3 -m probe_scraper.runner --dry-run --cache-dir tmp/cache --out-dir tmp/out --glean --glean-repo gecko --glean-limit-date 2023-01-01
Then doing the same on a checkout of PR 551.
Then swapping the lib-crash repo over to https://github.com/afranchuk/firefox-android, branch glean-crash-pings[^1].
And after running a more complete run with some dependencies[^2]:
python3 -m probe_scraper.runner --dry-run --cache-dir tmp/cache --out-dir tmp/out --glean --glean-repo fenix --glean-repo gecko --glean-repo glean-core --glean-repo lib-crash
This results in an error:
- 'crash.process_type' defined more than once in gecko, lib-crash
[...]
[^1]: Requires minimal changes in probe-scraper, because otherwise branch isn't actually allowed for libraries.
[^2]: I removed the exception that would stop the duplicate check to happen.
Expectations
In theory there is no actual overlap where Fenix would get those crash metrics/pings from 2 different dependencies.
gecko never sent them on Android and they won't be exposed going forward.
libcrash is Android-only and will only be introduced after the same metricsa are removed from gecko.
Thus it should work.
Difficulties
I currently don't actually understand all the different options we have on probe-scraper
and how to reproduce what will actually happen on production runs.
The combination of --glean-limit-date, --update, etc. is not clear to me.
So I would need someone to verify that the steps I ran are equivalent to what happens when we land all this.
If this does cause duplication checks to fail we need to figure out a way forward.
Of course the probe-scraper checks could be skipped for specific parts, but I wonder what impact the potential duplication has on generated schemas.
Comment 1•3 years ago
|
||
The latest comments in PR 551 make it appear that we likely will see duplicate metric warnings in production from this.
| Assignee | ||
Comment 2•3 years ago
•
|
||
Yes, we would see duplicate metric errors in prod from this.
After checking out PR 551 and applying PR 558 to fix duplicate metric checks, the following command shows approximately what would actually happen on a production run:
python3 -m probe_scraper.runner --dry-run --cache-dir tmp/cache --out-dir tmp/out --output-bucket=gs://probe-scraper-prod-artifacts --update --glean --glean-repo=firefox-desktop --glean-limit-date=$(date -d -1day +%F) --env=dev
(In reply to Jan-Erik Rediger [:janerik] from comment #0)
The combination of --glean-limit-date, --update, etc. is not clear to me.
as seen in the above example, use --update with --output-bucket=gs://probe-scraper-prod-artifacts and --glean-limit-date=$(date -d -1day +%F) with --glean-repo (or --glean-url) to approximate how prod runs.
--update and --output-bucket=gs://probe-scraper-prod-artifacts ensure that previous results are pulled from the prod output bucket and "updated", instead of starting from scratch. Your normal credentials should be sufficient to read from the output bucket, but not write to it, and --env=dev (which is the default when not specified) ensures that you won't try to write to it.
--glean-limit-date with --glean-repo causes a shallow clone of git repos so that you only scrape commits on or after that utc date, and it's basically like batching all push mode updates that would have been sent in that time.
(In reply to Jan-Erik Rediger [:janerik] from comment #0)
So I would need someone to verify that the steps I ran are equivalent to what happens when we land all this.
The lack of --update and --output-bucket means the steps you ran wouldn't account for gecko still having crash metrics even after it no longer references toolkit/components/crashes/metrics.yaml in repositories.yaml.
Also, before PR 558 there would have incorrectly been no duplicate metric error in prod, which is why you had to go through extra steps to (correctly) trigger one.
| Assignee | ||
Comment 3•3 years ago
•
|
||
(In reply to Jan-Erik Rediger [:janerik] from comment #0)
If this does cause duplication checks to fail we need to figure out a way forward.
One option:
when the metrics and pings definitions are added to lib-crash to cover fenix and focus_android, at the same time move the metrics and pings definition from gecko to firefox_desktop, firefox_desktop_background_update, and pine, and run the probe-scraper-glean-gecko-dev airflow job without --update once.
If done correctly, the metrics and pings defined in gecko will disappear, and their history will appear in the new locations, and the resulting schemas will be compatible.
If we don't get all of those things to happen at once, then I would expect either an error in probe-scraper-glean-geck-dev due to duplicate metrics, or an error in mozilla-schema-generator about an incompatible schema update trying to delete a field or table.
After that is successfully complete, then the crash definitions could be removed from firefox_desktop_background_update and pine if appropriate.
Comment 4•3 years ago
|
||
(In reply to Daniel Thorn [:relud] from comment #3)
If done correctly, the metrics and pings defined in gecko will disappear, and their history will appear in the new locations, and the resulting schemas will be compatible.
If we don't get all of those things to happen at once, then I would expect either an error in
probe-scraper-glean-geck-devdue to duplicate metrics, or an error inmozilla-schema-generatorabout an incompatible schema update trying to delete a field or table.After that is successfully complete, then the crash definitions could be removed from firefox_desktop_background_update and pine if appropriate.
If, for whatever reason, we aren't able to do all those things synchronously (which I think is quantized by day?), would we only get errors in the intermediate state until we are able to get it all synced up? Or would there be some sort of persistent error state created?
The lib-crash PR is ready to merge pending the outcome of this bug, so in theory I don't think it would be hard to coordinate the synchronous change (merge that PR and the probe-scraper PR, and run probe-scraper-glean-geck-dev without --update?).
If this presents too large an uncertainty or obstacle, we could potentially keep the metrics in gecko and just name them differently in lib-crash; it's not ideal but it wouldn't present any real issue for developers. It would only make metric analysis/filtering slightly more awkward (from a human perspective, since identical metrics will have different names).
| Assignee | ||
Comment 5•3 years ago
|
||
(In reply to Alex Franchuk from comment #4)
If, for whatever reason, we aren't able to do all those things synchronously (which I think is quantized by day?), would we only get errors in the intermediate state until we are able to get it all synced up? Or would there be some sort of persistent error state created?
yes it is quantized by day.
Merging the lib-crash PR without the other changes will result in failing to push any updates (i.e. new fields) for any app/library in https://github.com/mozilla-mobile/firefox-android until the rest of it is handled, due to duplicate metrics errors.
Merging the probe-scraper PR will have a similar impact on the gecko-dev repo.
Merging the probe-scraper PR and running probe-scraper-glean-gecko-dev without --update, will block all schema changes for all applications and libraries until we reach a backwards compatible state (i.e. everything merged, no previously present metrics or pings missing).
| Assignee | ||
Comment 6•3 years ago
|
||
running
probe-scraper-glean-gecko-devwithout--update
this requires specific support in the airflow dag: https://github.com/mozilla/telemetry-airflow/pull/1656
| Assignee | ||
Comment 7•3 years ago
|
||
(In reply to Daniel Thorn [:relud] from comment #5)
Merging the probe-scraper PR and running
probe-scraper-glean-gecko-devwithout--update, will block all schema changes for all applications and libraries until we reach a backwards compatible state (i.e. everything merged, no previously present metrics or pings missing).
we could alternatively work around this block by allowing crash ping tables for lib-crash consumers to be deleted (and later recreated when lib-crash pr is merged)
Comment 8•3 years ago
|
||
Currently the only crash pings that actually exist are from desktop, so deleting empty tables wouldn't be a problem.
Comment 9•3 years ago
|
||
Now that the airflow dag support is in, are we good to move forward? Who should take action here?
| Assignee | ||
Updated•3 years ago
|
| Assignee | ||
Comment 10•3 years ago
|
||
i made the plan, and i understand it best, so i'll take action.
i'm going to discuss it tomorrow in the glean wg meeting. Assuming no objections in the meeting, I'll execute the plan, with deleting empty tables, shortly after.
| Assignee | ||
Updated•3 years ago
|
Comment 11•3 years ago
|
||
My understanding is that this has been done, correct? If so, do we have the green light to merge the new lib-crash code that has the previously-duplicated metrics (it will happen on Monday to avoid the weekend).
| Assignee | ||
Comment 12•3 years ago
|
||
not yet, i will mark this bug fixed when that is safe to do
Comment 13•3 years ago
|
||
Okay, I asked because I wasn't sure whether merging lib-crash was a step that would happen before this bug was resolved.
| Assignee | ||
Comment 14•3 years ago
•
|
||
you're good to merge lib-crash
Updated•1 year ago
|
Description
•