Open Bug 1924488 Opened 1 month ago Updated 27 days ago

Mirrored legacy telemetry event may be recorded successfully while related Glean event metric originating the mirrored data may be invalid

Categories

(Toolkit :: Telemetry, task, P1)

task

Tracking

()

ASSIGNED

People

(Reporter: rpl, Assigned: chutten)

References

Details

While working on rebasing Bug 1917859 on top of Bug 1923015, I did notice that if the set of extra keys defined on the Glean metric event and the related legacy telemetry events that it is being mirrored into get out of sync, in particular if the Glean event is missing extra keys that the legacy telemetry event definition includes, then there is a chance that the legacy telemetry event may be collected as expected while the Glean event that has mirrored that data into the legacy telemetry may actually be invalid (because it includes a key unknown to the Glean metric).

I'm marking this issue as a defect mainly because the current behavior may be surprising and it may end up being potentially misleading, e.g. a test case only covering the legacy telemetry event would still be passing as expected while the Glean event wouldn't be collected (it will end up being recoded as an invalid record), and the mismatch between the data collected into the legacy telemetry vs. Glean may go unnoticed (until the data collected may be compared in redash or when tests may be changed to cover the Glean event explicitly).

As an example, to hit this kind of issue through on top of https://phabricator.services.mozilla.com/D224982 changes:

  • comment out blocklist_state addition from the addons_manager's disable_extension Glean metric definition from toolkit/mozapps/extensions/metrics_legacy.yaml
  • run toolkit/mozapps/extensions/test/xpcshell/test_addon_manager_telemetry_events.js test case and confirm that the assertion on the Glean event from line 278 fails (because the data was invalid from a Glean metric schema perspective), while the assertion from line 266 on the related legacy telemetry event mirrored by that same glean metric passes

Alas, the JS event API is weak to failures of this kind because doing what we'd do for all other language bindings -- generating structs with the names and types of even extras and only accepting arguments of that type to record -- was gonna be expensive. (We'd have to codegen a bunch of webidl types (one for each event metric) and we were warned against it).

So we're stuck supplying a generic event metric for every specific event metric in JS, meaning we're stuck supplying a generic record call that takes an optional record<UTF8String, UTF8String?>? aExtra.

As an optimization (to avoid needing a copy of all those string names), FOG relies on the Glean SDK's internal checking of allowed_extra_keys for validity. Unless we could come up with something clever around ToFfiExtra() (example), we just don't have the information of what strings are acceptable extra keys at the JS/C++ layer where GIFFT decisions are being made.

This is all to say that we're a little stuck by the constraints of the existing design.

...at runtime... hrm. But at compile time we could doublecheck that the definitions line up... yeah. Yeah. We could do that.

Okay! This bug is about writing a python test in toolkit/components/telemetry/tests/python that parses all Glean metrics and all Telemetry probes and ensures that any Glean metrics that are mirrored are mirrored to compatible probes. Don't forget to update the "Add new metric type" docs to include "adding any necessary assertions to <test this bug creates>" in the section on GIFFT.

Assignee: nobody → chutten
Severity: -- → N/A
Status: NEW → ASSIGNED
Type: defect → task
Priority: -- → P1

Other things this test can check: that the types line up. I just spent some frustrating minutes trying to figure out why my GIFFT mirror of a boolean metric wasn't working, and it was because the mirrored-to Scalar was of kind: string instead of kind: boolean.

You need to log in before you can comment on or make changes to this bug.