Open Bug 1935160 Opened 2 months ago Updated 13 days ago

Remove support for Histograms with `kind: flag`

Categories

(Toolkit :: Telemetry, task, P1)

task

Tracking

()

ASSIGNED

People

(Reporter: chutten, Assigned: chutten)

References

(Depends on 1 open bug)

Details

(Whiteboard: [fog-migration])

We deprecated histograms with kind: flag eight years ago in bug 1278531. There are a fifteen left:

$ cat toolkit/components/telemetry/Histograms.json | jq 'to_entries | .[] | {key: .key, kind: .value.kind} | select(.kind == "flag") | select(.key | startswith("TELEMETRY_TEST_") | not ) | .key'
"A11Y_INSTANTIATED_FLAG"
"A11Y_ISIMPLEDOM_USAGE_FLAG"
"A11Y_IATABLE_USAGE_FLAG"
"CYCLE_COLLECTOR_OOM"
"CYCLE_COLLECTOR_WORKER_OOM"
"GEOLOCATION_ERROR"
"TELEMETRY_SESSIONDATA_FAILED_LOAD"
"TELEMETRY_SESSIONDATA_FAILED_PARSE"
"TELEMETRY_SESSIONDATA_FAILED_VALIDATION"
"TELEMETRY_SESSIONDATA_FAILED_SAVE"
"STARTUP_CRASH_DETECTED"
"SEARCH_SERVICE_US_COUNTRY_MISMATCHED_TIMEZONE"
"SEARCH_SERVICE_US_TIMEZONE_MISMATCHED_COUNTRY"
"COMPONENTS_SHIM_ACCESSED_BY_CONTENT"
"MASTER_PASSWORD_ENABLED"

Let's see if we can finally be rid of them.

Question for component/probe owners:

  • Are these probes used today? If so, in what ways?
  • Are you aware of how flag histograms are different from other probe types in Legacy Telemetry?
  • I'm willing to supply the work to migrate or remove these histograms to your tastes. What would you like me to do?

ni? James Teh for A11Y_INSTANTIATED_FLAG, A11Y_INSTANTIATED_FLAG, and A11Y_IATABLE_USAGE_FLAG
ni? mccr8 for CYCLE_COLLECTOR_OOM, and CYCLE_COLLECTOR_WORKER_OOM
ni? peterv for GEOLOCATION_ERROR, and COMPONENTS_SHIM_ACCESSED_BY_CONTENT
ni? gsvelto for STARTUP_CRASH_DETECTED
ni? stadard8 for SEARCH_SERVICE_US_COUNTRY_MISMATCHED_TIMEZONE, and SEARCH_SERVICE_US_TIMEZONE_MISMATCHED_COUNTRY
ni? micah for MASTER_PASSWORD_ENABLED
ni? me for TELEMETRY_SESSIONDATA_FAILED_*

Flags: needinfo?(tigleym)
Flags: needinfo?(standard8)
Flags: needinfo?(peterv)
Flags: needinfo?(jteh)
Flags: needinfo?(gsvelto)
Flags: needinfo?(continuation)
Flags: needinfo?(chutten)
Flags: needinfo?(chutten)

(In reply to Chris H-C :chutten from comment #1)

Question for component/probe owners:

  • Are these probes used today? If so, in what ways?

They are unused.

  • Are you aware of how flag histograms are different from other probe types in Legacy Telemetry?

Alas, yes.

  • I'm willing to supply the work to migrate or remove these histograms to your tastes. What would you like me to do?

We can remove them without issue.

I think it is okay if we remove CYCLE_COLLECTOR_OOM and CYCLE_COLLECTOR_WORKER_OOM. How do I even look at telemetry nowadays? Is there some kind of interface to telemetry data that is usable for a humble engineer who isn't looking at dashboards all day?

Flags: needinfo?(continuation)
Depends on: 1935172

(In reply to Chris H-C :chutten from comment #1)

ni? stadard8 for SEARCH_SERVICE_US_COUNTRY_MISMATCHED_TIMEZONE, and SEARCH_SERVICE_US_TIMEZONE_MISMATCHED_COUNTRY

These aren't useful in their current form (there's a few other related probes that can probably also be removed as well). However, it is on my task list to work out what we want to do to replace them. I'd kinda like to do that all at the same time, though I could be convinced that we should just remove the probes that aren't useful and add the new ones in a few weeks (depending on expected timescales).

Flags: needinfo?(standard8)

(In reply to Andrew McCreight [:mccr8] from comment #3)

I think it is okay if we remove CYCLE_COLLECTOR_OOM and CYCLE_COLLECTOR_WORKER_OOM. How do I even look at telemetry nowadays? Is there some kind of interface to telemetry data that is usable for a humble engineer who isn't looking at dashboards all day?

GLAM is the go-to for that: https://glam.telemetry.mozilla.org/

(In reply to Mark Banner (:standard8) from comment #4)

(In reply to Chris H-C :chutten from comment #1)

ni? standard8 for SEARCH_SERVICE_US_COUNTRY_MISMATCHED_TIMEZONE, and SEARCH_SERVICE_US_TIMEZONE_MISMATCHED_COUNTRY

These aren't useful in their current form (there's a few other related probes that can probably also be removed as well). However, it is on my task list to work out what we want to do to replace them. I'd kinda like to do that all at the same time, though I could be convinced that we should just remove the probes that aren't useful and add the new ones in a few weeks (depending on expected timescales).

For now I'm happy to wait. It won't be until the end of the process that we'll get antsy wanting to remove APIs that these probes rely upon. Is there a bug we can block up against?

Flags: needinfo?(standard8)

Morgan, you know more about our dashboards than I. Some questions:

  1. Am I correct that we still use A11Y_INSTANTIATED_FLAG in our dashboard(s)? I think that's how we determine that the a11y engine is running.
  2. Am I correct that we don't have any dashboard queries for A11Y_ISIMPLEDOM_USAGE_FLAG or A11Y_IATABLE_USAGE_FLAG? I know what these measure, but I don't think we've used that data for years, certainly not since I started at Mozilla 7 years ago.
Flags: needinfo?(jteh) → needinfo?(mreschenberg)

(In reply to Chris H-C :chutten from comment #1)

  • Are you aware of how flag histograms are different from other probe types in Legacy Telemetry?

I'll be completely honest: I don't understand the difference between a flag histogram probe vs a boolean scalar probe. What does the former give us that the latter doesn't? I very much suspect that a boolean scalar would give us what we need and that this was only added as a flag histogram because boolean scalars didn't exist when these probes were created. However, I'd like to be certain about the differences between these two before I make that call.

Flags: needinfo?(chutten)
Blocks: 1935420

(In reply to James Teh [:Jamie] from comment #8)

(In reply to Chris H-C :chutten from comment #1)

  • Are you aware of how flag histograms are different from other probe types in Legacy Telemetry?

I'll be completely honest: I don't understand the difference between a flag histogram probe vs a boolean scalar probe. What does the former give us that the latter doesn't? I very much suspect that a boolean scalar would give us what we need and that this was only added as a flag histogram because boolean scalars didn't exist when these probes were created. However, I'd like to be certain about the differences between these two before I make that call.

Let this be a spoiler warning for anyone wanting to preserve the mystery of the ancient and terrible Histogram with kind: flag.

Its differences from a Scalar with kind: boolean are:

  • It's significantly more expensive to store and send, and annoying to use in analyses as a result
    • Flag histogram payload format to record a value of false: "NAME": { "bucket_count": 3, "histogram_type": 3, "sum": 0, "range": [1, 2], "values": { "0": 1, "1": 0 } }. Boolean scalar payload format to record a value of false: "category.name": false
  • Flag histograms are the only kind of histogram exempt from the optimization of "if no value has been set, exclude the probe from the payload"
    • However, their values are still cleared when a ping is sent.
    • This means that most pings will have false in them, and only the ping immediately after the flag is set to true will show that it was ever set to true.
    • It is impossible to detect if the value of the histogram was set to false

I can think of no situations in which someone would ever prefer a Histogram with kind: flag to record data. Before subsessions were invented (ie, the move to the "main" ping from the "saved-session" ping -- aka Telemetry Unification) it might have been an acceptable way to store a single boolean value for something like A11Y_INSTANTIATED_FLAG where, if we never got to instantiate a11y, we'd submit a default value of false. But, honestly, even then you could have inferred that from the ping not containing the value at all.

Flags: needinfo?(chutten)
Flags: needinfo?(tigleym) → needinfo?(mtigley)

I don't think anybody has been looking at STARTUP_CRASH_DETECTED for a while. I'll ask around but I don't think it's in use anymore.

Flags: needinfo?(gsvelto)

For A11Y_INSTANTIATED_FLAG, all we need is to track that the accessibility engine has been instantiated in a Firefox session. So it seems that a boolean scalar would be just fine. Migrating the dashboard queries is another matter entirely.

Let's wait for Morgan's response on whether/how we use these flags on the dashboard. But I think we'll end up migrating A11Y_INSTANTIATED_FLAG and removing the others.

(In reply to Chris H-C :chutten from comment #1)

  • I'm willing to supply the work to migrate or remove these histograms to your tastes. What would you like me to do?

Much appreciated. Are you able to help with dashboard query migration or only Gecko code migration? I suspect the former might be more of a challenge than the latter.

(In reply to James Teh [:Jamie] from comment #11)

(In reply to Chris H-C :chutten from comment #1)

  • I'm willing to supply the work to migrate or remove these histograms to your tastes. What would you like me to do?

Much appreciated. Are you able to help with dashboard query migration or only Gecko code migration? I suspect the former might be more of a challenge than the latter.

I'll help however I can in instrumentation and analysis. I imagine, depending on your query, a COALESCE(payload.processes.parent.scalars.a11y_is_instantiated, <the histogram code you're already using>) would get us 60% of the way there. But I'll wait to hear from Morgan.

Are these probes used today? If so, in what ways?

We're not actively using these probes today. However, it's the only probe we have for recording primary password usage, so we shouldn't remove it altogether.

Are you aware of how flag histograms are different from other probe types in Legacy Telemetry?

I'm not aware what the difference is here, although based on comment 9 it seems like keeping kind: flag isn't necessary for us. We just want to be able to track whether or not primary password is enabled.

I'm willing to supply the work to migrate or remove these histograms to your tastes. What would you like me to do?

Is it possible to migrate this histogram while keeping the history of the old telemetry data? We'd want to be able to view past data and new telemetry from the migrated probe, ideally in one dashboard/query. Similar to Jamie's question in comment 11, are you able to help with putting together that query?

Thank you!

Flags: needinfo?(mtigley)

(In reply to Micah [:mtigley] (she/her) from comment #13)

I'm willing to supply the work to migrate or remove these histograms to your tastes. What would you like me to do?

Is it possible to migrate this histogram while keeping the history of the old telemetry data? We'd want to be able to view past data and new telemetry from the migrated probe, ideally in one dashboard/query. Similar to Jamie's question in comment 11, are you able to help with putting together that query?

Thank you!

Not only will the data kick around for as long as our data retention policies dictate, clients who don't update will forever be sending the information via the MASTER_PASSWORD_ENABLED histogram. Per comment 11 I'm more than happy to help out with bridging the gap. Could you point me at a query that you presently use to investigate this data?

Flags: needinfo?(mtigley)

(In reply to Chris H-C :chutten from comment #14)

Could you point me at a query that you presently use to investigate this data?

The Credential Management team hasn't been using this probe in any queries at present, however, these are some past queries its been used in.

Flags: needinfo?(mtigley)

Oh, yes, these are of the style where COALESCE will do the trick. e.g. for this query's

 mozfun.hist.extract(payload.histograms.master_password_enabled).sum as master_password_enabled,

It'd become

  COALESCE(payload.processes.parent.scalars.master_password_enabled, mozfun.hist.extract(payload.histograms.master_password_enabled).sum > 0) AS master_password_enabled

(Assuming we name the Scalar something like master.password.enabled). I'll file a bug blocking this one for removing MASTER_PASSWORD_ENABLED and supplying a bool scalar to replace it.

Depends on: 1936036
Depends on: 1936125

(In reply to James Teh [:Jamie] from comment #7)

Morgan, you know more about our dashboards than I. Some questions:

  1. Am I correct that we still use A11Y_INSTANTIATED_FLAG in our dashboard(s)? I think that's how we determine that the a11y engine is running.
  2. Am I correct that we don't have any dashboard queries for A11Y_ISIMPLEDOM_USAGE_FLAG or A11Y_IATABLE_USAGE_FLAG? I know what these measure, but I don't think we've used that data for years, certainly not since I started at Mozilla 7 years ago.
  1. I don't see this flag on the dashboard, it looks like we're relying on INSTANTIATORS instead (all our graphs have client strings attached to them, we don't have any graphs that are just "a11y on vs. a11y off".). I'm not sure if these flags are mirrored on Android, but it could be worth checking with the gecko view folks to make sure they don't have dashboards that use this value -- I think cpeterson had an a11y tracker for fenix at one point? (this query looks possibly relevant)
  2. Yes, I don't see any usage of those flags in dashboards or otherwise. I believe they were on the list of telem we reviewed earlier this year / last year and decided they were unused.
Flags: needinfo?(mreschenberg)

(In reply to Morgan Reschenberg [:morgan] from comment #17)

(In reply to James Teh [:Jamie] from comment #7)

Morgan, you know more about our dashboards than I. Some questions:

  1. Am I correct that we still use A11Y_INSTANTIATED_FLAG in our dashboard(s)? I think that's how we determine that the a11y engine is running.
  2. Am I correct that we don't have any dashboard queries for A11Y_ISIMPLEDOM_USAGE_FLAG or A11Y_IATABLE_USAGE_FLAG? I know what these measure, but I don't think we've used that data for years, certainly not since I started at Mozilla 7 years ago.
  1. I don't see this flag on the dashboard, it looks like we're relying on INSTANTIATORS instead (all our graphs have client strings attached to them, we don't have any graphs that are just "a11y on vs. a11y off".). I'm not sure if these flags are mirrored on Android, but it could be worth checking with the gecko view folks to make sure they don't have dashboards that use this value -- I think cpeterson had an a11y tracker for fenix at one point? (this query looks possibly relevant)
  2. Yes, I don't see any usage of those flags in dashboards or otherwise. I believe they were on the list of telem we reviewed earlier this year / last year and decided they were unused.

Geckoview has never shipped with Legacy Telemetry turned on, so I can confirm for you that this histogram is definitely not being used by Android dashboards. That query's using this Glean metric, so it's definitely safe.

I'll file a bug and a patch to remove all three, then!

Depends on: 1936822
Depends on: 1937195

:afarre is looking at GEOLOCATION_ERROR, and COMPONENTS_SHIM_ACCESSED_BY_CONTENT, while peterv is away.

Flags: needinfo?(peterv) → needinfo?(afarre)
Depends on: 1938160

(In reply to Chris H-C :chutten from comment #6)

(In reply to Mark Banner (:standard8) from comment #4)

(In reply to Chris H-C :chutten from comment #1)

ni? standard8 for SEARCH_SERVICE_US_COUNTRY_MISMATCHED_TIMEZONE, and SEARCH_SERVICE_US_TIMEZONE_MISMATCHED_COUNTRY

These aren't useful in their current form (there's a few other related probes that can probably also be removed as well). However, it is on my task list to work out what we want to do to replace them. I'd kinda like to do that all at the same time, though I could be convinced that we should just remove the probes that aren't useful and add the new ones in a few weeks (depending on expected timescales).

For now I'm happy to wait. It won't be until the end of the process that we'll get antsy wanting to remove APIs that these probes rely upon. Is there a bug we can block up against?

I just filed bug 1938160.

Flags: needinfo?(standard8)

So wrt GEOLOCATION_ERROR, we have other GEOLOCATION_* telemetry touched as late as Jul 2024. Maybe Kagami has a more strong opinion?

Flags: needinfo?(krosylight)

GEOLOCATION_* can go, we now have corresponding Glean telemetry including https://searchfox.org/mozilla-central/rev/6f4bc07bd9eb07ed34774841ecf3b1f3678791e8/dom/geolocation/Geolocation.cpp#625. 👍🏻

Flags: needinfo?(krosylight)

And COMPONENTS_SHIM_ACCESSED_BY_CONTENT is very low, but still non-zero. Nika reviewed the move of this probe and posed a good question at bug 1062418 comment 4, but I can't really say that I can make a decision based off of that comment. Maybe you remember Nika, and can have an opinion?

Flags: needinfo?(afarre) → needinfo?(nika)
Depends on: 1938458

(In reply to Andreas Farre [:farre] from comment #23)

And COMPONENTS_SHIM_ACCESSED_BY_CONTENT is very low, but still non-zero. Nika reviewed the move of this probe and posed a good question at bug 1062418 comment 4, but I can't really say that I can make a decision based off of that comment. Maybe you remember Nika, and can have an opinion?

My memory is this is a relic from a long time ago when parts of XPCOM was exposed to the web in Firefox, and when we transitioned away from it, we needed to keep a fake version of the Components object exposed to webpages, which this counter is tracking the use of.

If the problem here is that the counter is using old telemetry, we could probably get it transitioned to a normal DOM usage counter at this point if we can't remove the Components object completely. Would probably make it easier to track using the usecounter tools. I haven't looked at the current usage numbers and how they compare to last time (which was still a remarkably high rate for such a long-deprecated "feature" IIRC?), though.

Flags: needinfo?(nika)
Depends on: 1938601

(In reply to Nika Layzell [:nika] (ni? for response) from comment #24)

If the problem here is that the counter is using old telemetry, we could probably get it transitioned to a normal DOM usage counter at this point if we can't remove the Components object completely. Would probably make it easier to track using the usecounter tools. I haven't looked at the current usage numbers and how they compare to last time (which was still a remarkably high rate for such a long-deprecated "feature" IIRC?), though.

About 4k of a sample of 232k pings received since the beginning of December have a raised flag for COMPONENTS_SHIM_ACCESSED_BY_CONTENT (query)

Current status:

  • A11Y_INSTANTIATED_FLAG, A11Y_ISIMPLEDOM_USAGE_FLAG, A11Y_IATABLE_USAGE_FLAG - Removed in bug 1936822
  • CYCLE_COLLECTOR_OOM, CYCLE_COLLECTOR_WORKER_OOM - Removed in bug 1935172
  • GEOLOCATION_ERROR - Removed in bug 1938458
  • TELEMETRY_SESSIONDATA_FAILED_LOAD, TELEMETRY_SESSIONDATA_FAILED_PARSE, TELEMETRY_SESSIONDATA_FAILED_VALIDATION, TELEMETRY_SESSIONDATA_FAILED_SAVE - Removed in bug 1937195
  • STARTUP_CRASH_DETECTED - Removed in bug 1936125
  • COMPONENTS_SHIM_ACCESSED_BY_CONTENT - Replaced with Use Counter components_shim_resolved in bug 1938601
  • MASTER_PASSWORD_ENABLED - Replaced with Scalar with kind: bool primary.password.enabled and Glean boolean metric primary.password.enabled in bug 1936036

Pending:

  • SEARCH_SERVICE_US_COUNTRY_MISMATCHED_TIMEZONE, SEARCH_SERVICE_US_TIMEZONE_MISMATCHED_COUNTRY - bug 1938160
Whiteboard: [fog-migration]
No longer blocks: 1935420
You need to log in before you can comment on or make changes to this bug.