Closed Bug 1850863 Opened 1 year ago Closed 10 months ago

multiple glean messaging_system metrics have sent no values in the last 28 days

Categories

(Firefox :: Messaging System, task, P2)

task
Points:
2

Tracking

()

RESOLVED FIXED
124 Branch
Iteration:
123.3 - Jan 15 - Jan 19
Tracking Status
firefox124 --- fixed

People

(Reporter: dmosedale, Assigned: dmosedale)

References

(Blocks 1 open bug)

Details

(Whiteboard: [omc])

Attachments

(1 file, 1 obsolete file)

Thanks to Ed for discovering that no values of the messaging_system.action metric Glean Dictionary have been sent: Looker.

I've played around a bit and have yet to reproduce logs or examples of Firefox locally sending telemetry of this form (either with PingCentre or with Glean, but I haven't put too much time into it yet.

This needs to be dug into more.

Summary: messaging_system.action metric has sent no values → messaging_system.action metric has sent no values in the last 28 days
Assignee: nobody → dmosedale
Iteration: --- → 119.2 - Sept 11 - Sept 22
Priority: -- → P1
Whiteboard: [omc]

The next steps are to dig deeper into the original patches and do more extensive tracking. This is slated for 119.2, so things are currently on track.

Iteration: 119.2 - Sept 11 - Sept 22 → 120.1 - Sep 25 - Oct 6
Summary: messaging_system.action metric has sent no values in the last 28 days → multiple glean messaging_system metrics have sent no values in the last 28 days

TL;DR

After doing some digging on the action metric (documented below), I decided it would be a good idea to see if any other columns had the same issue, and found that:

  • There are four Glean metrics which have empty BigQuery columns (i.e. columns with only null values in them) (going back to the start of FxMS Glean data collection in 115 Nightly on 2023-05-24).
  • There does not appear to be any data loss or meaningful negative consequence here.
  • Most of these are the result of a tactic we used when landing the telemetry to avoid data loss.

Relevant details about Glean & BigQuery

  • If a metric is defined for a Glean ping, it will be included with every instance of that ping, even if the metric has not been explicitly set.

  • If nothing is received for a metric in a ping where it is sent, it gets ETLed into BigQuery in a way that causes it to appear as null in SELECTs.

  • The above two lines mean that columns full of nulls are completely consistent with no telemetry being sent by the code for a particular metric.

  • There is effectively no meaningful cost for BigQuery columns with only nulls.

  • When we were creating browser/newtab/components/metrics.yaml, we decided that we wanted to err on the side of having places to put data that we might send (rather than on the side of finding out later that we had discarded sent data that there was no place to put). So, we made sure to add metrics for things in the existing docs that looked plausible, even if we weren't seeing them currently in the debugging output.

Next steps

  • Columns can't be removed from a table without versioning the schema, and this is rarely done for various reasons.
  • After discussion with chutten, the appropriate way to handle this sort of thing is usually to remove unused metrics from metrics.yaml, which will mark them as obsolete in the Glean dictionary. I expect this will make sense to do with all of these except for (maybe) msstoresignedin, which I will spin off another bug for.

The columns:

  • .action

    • likely reason added to metrics.yaml: existing telemetry docs.
    • no data received, appears unused in source code (more details below)
    • proposal: mark as obsolete by removing from metrics.yaml
  • .cfr_action

    • likely reason added to metrics.yaml: unclear, seems like just a bug
    • current & older searches have pretty much no uses of this string in telemetry.
    • proposal: mark as obsolete by removing from metrics.yaml
  • .page

    • likely reason added to metrics.yaml: metrics collection docs about basic shape of user event pings
    • user_events are used in both newtab code and asrouter code. However, having dug through searchfox and vscode, I haven't found instances of page keys being sent from ASRouter user_event pings, and I haven't found any columns in the PingCentre tables that collect it.
    • proposal: mark as obsolete by removing from metrics.yaml
  • .attribution.msstoresignedin

    • likely reason added to metrics.yaml: attribution code
    • no columns in the old telemetry tables collect this
    • null column in FxMS Glean tables presumably because it only gets sent if there's a windows install campaign (e.g. Test new install method for Windows 11 users), and none of those have yet been done.
    • proposal: spin off a bug to decide what to do here, if anything, because a lot depends on how product and desktop-integration want to use this (or not)

More details for .action:

The AS Router part of the data events docs and user_event ping basic shape docs show them as having action fields. An example of what actually gets sent:

Submitting Glean ping for {"source":"CFR","message_id":"CFR_FIREFOX_VIEW","bucket_id":"CFR_FIREFOX_VIEW","event":"IMPRESSION","addon_version":"20230911154121","locale":"en-US","client_id":"a2d2550a-0255-4d3a-b330-983d15d7ae7a","pingType":"cfr"}

It appears that action fields used to get sent, but that changed when AS Router telemetry migrated to a new data pipeline in bug 1585147 via this PR, which added code that deleted the action field for a whole pile of telemetry types.

For more due diligence, I looked through code (via SearchFox and otherwise) for the string action:, finding lots of false positives (links to examples):

Adding a couple of needinfos for feedback on the analysis and proposals about what to do with these columns; comments from others are welcome too.

Flags: needinfo?(edilee)
Flags: needinfo?(chutten)

(In reply to Dan Mosedale (:dmosedale, :dmose) from comment #2)

  • If a metric is defined for a Glean ping, it will be included with every instance of that ping, even if the metric has not been explicitly set.

In actuality we do not send in the ping an empty metric value. You can see this in pings received by the Debug Ping Viewer which will have nothing for these various fields. But this is moot because your second point about how BigQuery has NULL for any column that didn't have an express value set makes it look like every ping is "complete". There is no difference in the data between "Key is present with value null" and "Key is not present", and this is a characteristic of our data platform.

As for removal: if you remove these metrics and values do decide to show up later, they'll get caught in the messaging_system.unknown_keys metric which should prompt an alert via the Messaging System System Health dashboard. As such you won't be unaware if these metrics start being collected again somehow. Which I think is nice.

Everything else in your investigation and report seems groovy to me.

Flags: needinfo?(chutten)
Iteration: 120.1 - Sep 25 - Oct 6 → 120.2 - Oct 9 - Oct 20

(In reply to Dan Mosedale (:dmosedale, :dmose) from comment #2)

If this is supposed to be the documented action, are we getting the data elsewhere? The examples include activity_stream_event, activity_stream_user_event, activity_stream_session, cfr_user_event. Maybe other data obviates this such as some combination of ping type and event?

For about:welcome data, there's also a page under event_context. Should it be at the top level instead? The documentation seems to have examples of pocket, and that data shouldn't be affected if removing this?

Overall the proposals for obsoleting means we're sure the data doesn't need to be sent as opposed to a bug that the data doesn't get sent or gets to other columns?

Flags: needinfo?(edilee)
Priority: P1 → P2
Iteration: 120.2 - Oct 9 - Oct 20 → 121.1 - Oct 23 - Nov 3
Iteration: 121.1 - Oct 23 - Nov 3 → 121.2 - Nov 6 - Nov 17
Iteration: 121.2 - Nov 6 - Nov 17 → 122.1 - Nov 20 - Dec 1
Points: --- → 2
Iteration: 122.1 - Nov 20 - Dec 1 → 122.2 - Dec 4 - Dec 15

Ed wrote:

(In reply to Dan Mosedale (:dmosedale, :dmose) from comment #2)

    .action
       likely reason added to metrics.yaml: existing telemetry docs.

If this is supposed to be the documented action, are we getting the data elsewhere?

My suspicion is that your hypothesis:

Maybe other data obviates this such as some combination of ping type and event?

is correct and is one of the reasons noone has been missing this stuff since we stopped sending it in 2020. (*)

The fact that we stopped sending it in 2020 and we're not aware of any outstanding issues makes me think it's very unlikely that there is still a need for it, beyond what is already being served by existing data.

(*) The action fields used to get sent, but that changed when AS Router telemetry migrated to a new data pipeline in bug 1585147 via this PR, which added code that deleted the action field for every type of user event that is explicitly handled in `TelemetryFeed.jsm:createASRouterUserEvent``. I have yet to find any events that would fall through and hit the default clause (which would not delete the action field).

The examples include activity_stream_event,

There are no static uses of this string in code:
https://searchfox.org/mozilla-central/search?q=activity_stream_event&path=&case=true&regexp=false

It seems very unlikely that it's dynamically constructed either:
https://searchfox.org/mozilla-central/search?q=user_event&path=&case=true&regexp=false

activity_stream_user_event, activity_stream_session,

These two (and other activity/discovery stream stuff that I'm aware of) are handled through different PingCentre namespaces. These two in particular are handled here

cfr_user_event. Maybe other data obviates this such as some combination of ping type and event?

I think so, as discussed above.

   .page
       likely reason added to metrics.yaml: metrics collection docs about basic shape of user event pings

For about:welcome data, there's also a page under event_context. Should it be at the top level instead?

It's actually already promoted to the top-level as event_page:

https://dictionary.telemetry.mozilla.org/apps/firefox_desktop/metrics/messaging_system_event_page
https://searchfox.org/mozilla-central/rev/877adfeec838cdad97369441865bac605109427d/browser/components/aboutwelcome/modules/AboutWelcomeTelemetry.jsm#189-190,194-195

The documentation seems to have examples of pocket, and that data shouldn't be affected if removing this?

The documentation covers all three PingCentre namespaces mentioned & linked to above. My understanding is that pocket & newtab do not depend on anything from the (now defunct) messaging system namespace.

Nan, Scott, do either of you have any reason to think that Pocket/Newtab is missing data that would have been sent under the messaging namespace up until our recent removal of the FxMS use of PingCentre?

Overall the proposals for obsoleting means we're sure the data doesn't need to be sent as opposed to a bug
that the data doesn't get sent or gets to other columns?

To me, it means that we're ~90% sure, based on the research I've documented in this bug, (which I speculate is about as good as it gets in a codebase of this size & complexity with a dynamic language). Given that we have seen nothing in these columns since turning on Glean in July, this seems quite unlikely to happen.

As Chris pointed out in comment 3, if we do happen to have missed a case, this will show up in the unknown_keys metric, which we see in the regular dashboard triage.

Which is to say, I propose we go ahead with this deprecation to minimize confusion.

That said, to be extra certain that no surprises pop up and land in these columns between now and then, we could easily push this bug out to some iteration 6 months in the future and annotate the Glean dictionary entries of these entries as "planning to deprecate" with a pointer to this bug.

Thoughts?

Flags: needinfo?(sdowne)
Flags: needinfo?(najiang)
Flags: needinfo?(edilee)

The documentation covers all three PingCentre namespaces mentioned & linked to above. My understanding is that pocket & newtab do not depend on anything from the (now defunct) messaging system namespace.

Correct, those three are independent without any interactions.

Nan, Scott, do either of you have any reason to think that Pocket/Newtab is missing data that would have been sent under the messaging namespace up until our recent removal of the FxMS use of PingCentre?

I can't think of how those changes to FxMS could affect the Pocket/Newtab data collection as they don't depend on each other.

Flags: needinfo?(najiang)
Iteration: 122.2 - Dec 4 - Dec 15 → 123.1 - Dec 18 - Dec 29

I talked to Scott, and the one thing that came up was the possibility that
https://github.com/mozilla/jetstream/issues/1750 might be related.

That was filed before the first Glean patches for FxMS landed on m-c, so clearly, the Glean migration was not causal to the jetstream issue.

Nan and I looked at the tables the missing metrics in Jetstream came from:

Ad Clicks by Branch:
moz-fx-data-shared-prod.telemetry_derived.experiment_search_aggregates_live_v1

Cumulative Ad Clicks By Branch
mozdata.telemetry.experiment_cumulative_ad_clicks

Search Count with Ads by Branch
moz-fx-data-shared-prod.telemetry_derived.experiment_search_aggregates_live_v1

Cumulative Search Count with Ads by Branch
mozdata.telemetry.experiment_cumulative_search_with_ads_count

These are not messaging system tables, and Nan confirmed that they are also not Activity Stream tables, so it's not data that anyone we know would be missing. And it seems wildly unlikely that data someone else is looking for and is important would suddenly start showing up in the messaging system telemetry (or have ever been likely to have been there in the first place, given the namespacing).

One could imagine a theoretical shared root cause, and I poked around on our Acryl instance, but didn't learn much of interest.

It feels like we're hitting diminishing returns on additional analysis here. And so my opinion hasn't changed:

I propose we go ahead with this deprecation to minimize confusion.

That said, to be extra certain that no surprises pop up and land in these columns between now and then, we could easily push this bug out to some iteration 6 months in the future and annotate the Glean dictionary entries of these entries as "planning to deprecate" with a pointer to this bug.

Thoughts?

Flags: needinfo?(sdowne)
Iteration: 123.1 - Dec 18 - Dec 29 → 123.2 - Jan 1 - Jan 12

(In reply to Dan Mosedale (:dmosedale, :dmose) from comment #5)

Ed wrote:

   .page

For about:welcome data, there's also a page under event_context. Should it be at the top level instead?

It's actually already promoted to the top-level as event_page:

Seems reasonable to remove. I suppose we'll just stick with event_page instead of switching to potentially more general page that could be used for non-events? Very similarly action vs event could reuse the latter for not-events or we can introduce something else if both are needed later or we don't want to mix them up.

More broadly, do we need to care about the specific metric names if I suppose we just update the description to clarify, e.g., "this is actually used for both events and not-events." Seems practical to just go with what we already have and not spend much time on specific naming, but maybe a little bit of metrics architecting could help?

Flags: needinfo?(edilee)

Here's a Look showing that the Microsoft Store column has seen a bit of data (presumably tests), so I'm going to leave that alone, since I'm aware that there are ongoing efforts around store and experiments planned.

@chutten, do you have any thoughts on Ed's questions in comment 8? My inclination is not to spend the effort on small chunks here until we have a higher-level architecture strategy for how we want to move forward with FxMS telemetry as a whole (it's unclear to me when that will make sense to prioritize).

Flags: needinfo?(chutten)

I've attached a patch that will remove the three completely unused columns, and once we come to a conclusion on the questions in comment 8 , I'll request review, or else adjust as appropriate.

Changing descriptions is a good thing (no one'll use them if they fall out of date), though if you make changes to the names or how they're used, you may wish to check the ETL derivations downstream of "messaging-system" pings (like onboarding). Nothing terrible would happen if something went awry -- all the data'd be in firefox_desktop.messaging_system for replay and backfill -- but it'd cause less disruption to make the changes ahead of time rather than post facto.

Flags: needinfo?(chutten)
Iteration: 123.2 - Jan 1 - Jan 12 → 123.3 - Jan 15 - Jan 19

(In reply to Chris H-C :chutten from comment #13)

Changing descriptions is a good thing (no one'll use them if they fall out of date), though if you make changes to the names or how they're used, you may wish to check the ETL derivations downstream of "messaging-system" pings (like onboarding). Nothing terrible would happen if something went awry -- all the data'd be in firefox_desktop.messaging_system for replay and backfill -- but it'd cause less disruption to make the changes ahead of time rather than post facto.

Right, I guess what I'm trying to ask is less about the possibility and order of work for making such changes (though your thoughts there are helpful) and more about whether this is a wise use of energy to do now rather than later. My instinct is that it's not worth making these sorts of changes until and unless we know that we need them, in part because it's a non-zero amount of work competing with lots of other stuff, and it's hard to tell in advance whether the things we want to do in the future are likely to have the close-enough semantics to share a metric or not. More on this in a bit...

Iteration: 123.3 - Jan 15 - Jan 19 → 124.1 - Jan 22 - Feb 2
Iteration: 124.1 - Jan 22 - Feb 2 → 123.3 - Jan 15 - Jan 19
Flags: needinfo?(chutten)

Looking back at the specific renaming of event_page to event, there is a general rule of thumb to prefer specific names over general ones. Obviously, messaging-system is a bit of a special case, but we've found in our experience with data collections and their consumers, in many instances people only have the energy to look at the names of things. Specificity and clarity are to be prized wherever possible.

Flags: needinfo?(chutten)

I think this also applies to the action vs. events: that's also a proposed generalization, so I'm inclined to assume that in the future, we'd be more likely to add something specific than to go out of our way to generalize. So, I'm going to leave things as they are, except for the three metrics in the patch, which I will put up for review.

Attachment #9372039 - Attachment is obsolete: true

Specifically, the plan is to remove messaging_system.{page, action, cfr_action} as described in comment 2 and comment 3. We're leaving .attribution.msstoresignedin, as it turns out that column is not actually empty, and we have reason to believe it will be useful in the future.

Status: NEW → RESOLVED
Closed: 10 months ago
Resolution: --- → FIXED
Target Milestone: --- → 124 Branch
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: