multiple glean messaging_system metrics have sent no values in the last 28 days
Categories
(Firefox :: Messaging System, task, P2)
Tracking
()
Tracking | Status | |
---|---|---|
firefox124 | --- | fixed |
People
(Reporter: dmosedale, Assigned: dmosedale)
References
(Blocks 1 open bug)
Details
(Whiteboard: [omc])
Attachments
(1 file, 1 obsolete file)
Thanks to Ed for discovering that no values of the messaging_system.action metric Glean Dictionary have been sent: Looker.
I've played around a bit and have yet to reproduce logs or examples of Firefox locally sending telemetry of this form (either with PingCentre or with Glean, but I haven't put too much time into it yet.
This needs to be dug into more.
Assignee | ||
Updated•1 year ago
|
Updated•1 year ago
|
Assignee | ||
Updated•1 year ago
|
Assignee | ||
Updated•1 year ago
|
Assignee | ||
Comment 1•1 year ago
|
||
The next steps are to dig deeper into the original patches and do more extensive tracking. This is slated for 119.2, so things are currently on track.
Updated•1 year ago
|
Assignee | ||
Updated•1 year ago
|
Assignee | ||
Comment 2•1 year ago
•
|
||
TL;DR
After doing some digging on the action metric (documented below), I decided it would be a good idea to see if any other columns had the same issue, and found that:
- There are four Glean metrics which have empty BigQuery columns (i.e. columns with only
null
values in them) (going back to the start of FxMS Glean data collection in 115 Nightly on 2023-05-24). - There does not appear to be any data loss or meaningful negative consequence here.
- Most of these are the result of a tactic we used when landing the telemetry to avoid data loss.
Relevant details about Glean & BigQuery
-
If a metric is defined for a Glean ping, it will be included with every instance of that ping, even if the metric has not been explicitly set.
-
If nothing is received for a metric in a ping where it is sent, it gets ETLed into BigQuery in a way that causes it to appear as null in SELECTs.
-
The above two lines mean that columns full of nulls are completely consistent with no telemetry being sent by the code for a particular metric.
-
There is effectively no meaningful cost for BigQuery columns with only nulls.
-
When we were creating
browser/newtab/components/metrics.yaml
, we decided that we wanted to err on the side of having places to put data that we might send (rather than on the side of finding out later that we had discarded sent data that there was no place to put). So, we made sure to add metrics for things in the existing docs that looked plausible, even if we weren't seeing them currently in the debugging output.
Next steps
- Columns can't be removed from a table without versioning the schema, and this is rarely done for various reasons.
- After discussion with chutten, the appropriate way to handle this sort of thing is usually to remove unused metrics from metrics.yaml, which will mark them as obsolete in the Glean dictionary. I expect this will make sense to do with all of these except for (maybe)
msstoresignedin
, which I will spin off another bug for.
The columns:
-
.action
- likely reason added to metrics.yaml: existing telemetry docs.
- no data received, appears unused in source code (more details below)
- proposal: mark as obsolete by removing from metrics.yaml
-
.cfr_action
- likely reason added to metrics.yaml: unclear, seems like just a bug
- current & older searches have pretty much no uses of this string in telemetry.
- proposal: mark as obsolete by removing from metrics.yaml
-
.page
- likely reason added to metrics.yaml: metrics collection docs about basic shape of user event pings
- user_events are used in both newtab code and asrouter code. However, having dug through searchfox and vscode, I haven't found instances of
page
keys being sent from ASRouter user_event pings, and I haven't found any columns in the PingCentre tables that collect it. - proposal: mark as obsolete by removing from metrics.yaml
-
.attribution.msstoresignedin
- likely reason added to metrics.yaml: attribution code
- no columns in the old telemetry tables collect this
- null column in FxMS Glean tables presumably because it only gets sent if there's a windows install campaign (e.g. Test new install method for Windows 11 users), and none of those have yet been done.
- proposal: spin off a bug to decide what to do here, if anything, because a lot depends on how product and desktop-integration want to use this (or not)
More details for .action
:
The AS Router part of the data events docs and user_event ping basic shape docs show them as having action
fields. An example of what actually gets sent:
Submitting Glean ping for {"source":"CFR","message_id":"CFR_FIREFOX_VIEW","bucket_id":"CFR_FIREFOX_VIEW","event":"IMPRESSION","addon_version":"20230911154121","locale":"en-US","client_id":"a2d2550a-0255-4d3a-b330-983d15d7ae7a","pingType":"cfr"}
It appears that action fields used to get sent, but that changed when AS Router telemetry migrated to a new data pipeline in bug 1585147 via this PR, which added code that deleted the action field for a whole pile of telemetry types.
For more due diligence, I looked through code (via SearchFox and otherwise) for the string action:
, finding lots of false positives (links to examples):
- message actions from our JSON messages
- ping keys for non-about:welcome stuff which end up getting deleted in TelemetryFeed.jsm (mostly in policy functions) before being sent
Adding a couple of needinfos for feedback on the analysis and proposals about what to do with these columns; comments from others are welcome too.
Comment 3•1 year ago
|
||
(In reply to Dan Mosedale (:dmosedale, :dmose) from comment #2)
- If a metric is defined for a Glean ping, it will be included with every instance of that ping, even if the metric has not been explicitly set.
In actuality we do not send in the ping an empty metric value. You can see this in pings received by the Debug Ping Viewer which will have nothing for these various fields. But this is moot because your second point about how BigQuery has NULL
for any column that didn't have an express value set makes it look like every ping is "complete". There is no difference in the data between "Key is present with value null" and "Key is not present", and this is a characteristic of our data platform.
As for removal: if you remove these metrics and values do decide to show up later, they'll get caught in the messaging_system.unknown_keys
metric which should prompt an alert via the Messaging System System Health dashboard. As such you won't be unaware if these metrics start being collected again somehow. Which I think is nice.
Everything else in your investigation and report seems groovy to me.
Updated•1 year ago
|
Comment 4•1 year ago
|
||
(In reply to Dan Mosedale (:dmosedale, :dmose) from comment #2)
- .action
- likely reason added to metrics.yaml: existing telemetry docs.
If this is supposed to be the documented action
, are we getting the data elsewhere? The examples include activity_stream_event, activity_stream_user_event, activity_stream_session, cfr_user_event. Maybe other data obviates this such as some combination of ping type and event?
- .page
- likely reason added to metrics.yaml: metrics collection docs about basic shape of user event pings
For about:welcome data, there's also a page
under event_context. Should it be at the top level instead? The documentation seems to have examples of pocket, and that data shouldn't be affected if removing this?
Overall the proposals for obsoleting means we're sure the data doesn't need to be sent as opposed to a bug that the data doesn't get sent or gets to other columns?
Updated•1 year ago
|
Updated•1 year ago
|
Updated•1 year ago
|
Updated•1 year ago
|
Assignee | ||
Updated•1 year ago
|
Updated•1 year ago
|
Assignee | ||
Comment 5•1 year ago
|
||
Ed wrote:
(In reply to Dan Mosedale (:dmosedale, :dmose) from comment #2)
.action likely reason added to metrics.yaml: existing telemetry docs.
If this is supposed to be the documented action, are we getting the data elsewhere?
My suspicion is that your hypothesis:
Maybe other data obviates this such as some combination of ping type and event?
is correct and is one of the reasons noone has been missing this stuff since we stopped sending it in 2020. (*)
The fact that we stopped sending it in 2020 and we're not aware of any outstanding issues makes me think it's very unlikely that there is still a need for it, beyond what is already being served by existing data.
(*) The action fields used to get sent, but that changed when AS Router telemetry migrated to a new data pipeline in bug 1585147 via this PR, which added code that deleted the action field for every type of user event that is explicitly handled in `TelemetryFeed.jsm:createASRouterUserEvent``. I have yet to find any events that would fall through and hit the default clause (which would not delete the action field).
The examples include activity_stream_event,
There are no static uses of this string in code:
https://searchfox.org/mozilla-central/search?q=activity_stream_event&path=&case=true®exp=false
It seems very unlikely that it's dynamically constructed either:
https://searchfox.org/mozilla-central/search?q=user_event&path=&case=true®exp=false
activity_stream_user_event, activity_stream_session,
These two (and other activity/discovery stream stuff that I'm aware of) are handled through different PingCentre namespaces. These two in particular are handled here
cfr_user_event. Maybe other data obviates this such as some combination of ping type and event?
I think so, as discussed above.
.page likely reason added to metrics.yaml: metrics collection docs about basic shape of user event pings
For about:welcome data, there's also a page under event_context. Should it be at the top level instead?
It's actually already promoted to the top-level as event_page
:
https://dictionary.telemetry.mozilla.org/apps/firefox_desktop/metrics/messaging_system_event_page
https://searchfox.org/mozilla-central/rev/877adfeec838cdad97369441865bac605109427d/browser/components/aboutwelcome/modules/AboutWelcomeTelemetry.jsm#189-190,194-195
The documentation seems to have examples of pocket, and that data shouldn't be affected if removing this?
The documentation covers all three PingCentre namespaces mentioned & linked to above. My understanding is that pocket & newtab do not depend on anything from the (now defunct) messaging system namespace.
Nan, Scott, do either of you have any reason to think that Pocket/Newtab is missing data that would have been sent under the messaging namespace up until our recent removal of the FxMS use of PingCentre?
Overall the proposals for obsoleting means we're sure the data doesn't need to be sent as opposed to a bug
that the data doesn't get sent or gets to other columns?
To me, it means that we're ~90% sure, based on the research I've documented in this bug, (which I speculate is about as good as it gets in a codebase of this size & complexity with a dynamic language). Given that we have seen nothing in these columns since turning on Glean in July, this seems quite unlikely to happen.
As Chris pointed out in comment 3, if we do happen to have missed a case, this will show up in the unknown_keys
metric, which we see in the regular dashboard triage.
Which is to say, I propose we go ahead with this deprecation to minimize confusion.
That said, to be extra certain that no surprises pop up and land in these columns between now and then, we could easily push this bug out to some iteration 6 months in the future and annotate the Glean dictionary entries of these entries as "planning to deprecate" with a pointer to this bug.
Thoughts?
Comment 6•1 year ago
|
||
The documentation covers all three PingCentre namespaces mentioned & linked to above. My understanding is that pocket & newtab do not depend on anything from the (now defunct) messaging system namespace.
Correct, those three are independent without any interactions.
Nan, Scott, do either of you have any reason to think that Pocket/Newtab is missing data that would have been sent under the messaging namespace up until our recent removal of the FxMS use of PingCentre?
I can't think of how those changes to FxMS could affect the Pocket/Newtab data collection as they don't depend on each other.
Updated•1 year ago
|
Assignee | ||
Comment 7•11 months ago
|
||
I talked to Scott, and the one thing that came up was the possibility that
https://github.com/mozilla/jetstream/issues/1750 might be related.
That was filed before the first Glean patches for FxMS landed on m-c, so clearly, the Glean migration was not causal to the jetstream issue.
Nan and I looked at the tables the missing metrics in Jetstream came from:
Ad Clicks by Branch:
moz-fx-data-shared-prod.telemetry_derived.experiment_search_aggregates_live_v1
Cumulative Ad Clicks By Branch
mozdata.telemetry.experiment_cumulative_ad_clicks
Search Count with Ads by Branch
moz-fx-data-shared-prod.telemetry_derived.experiment_search_aggregates_live_v1
Cumulative Search Count with Ads by Branch
mozdata.telemetry.experiment_cumulative_search_with_ads_count
These are not messaging system tables, and Nan confirmed that they are also not Activity Stream tables, so it's not data that anyone we know would be missing. And it seems wildly unlikely that data someone else is looking for and is important would suddenly start showing up in the messaging system telemetry (or have ever been likely to have been there in the first place, given the namespacing).
One could imagine a theoretical shared root cause, and I poked around on our Acryl instance, but didn't learn much of interest.
It feels like we're hitting diminishing returns on additional analysis here. And so my opinion hasn't changed:
I propose we go ahead with this deprecation to minimize confusion.
That said, to be extra certain that no surprises pop up and land in these columns between now and then, we could easily push this bug out to some iteration 6 months in the future and annotate the Glean dictionary entries of these entries as "planning to deprecate" with a pointer to this bug.
Thoughts?
Updated•11 months ago
|
Comment 8•11 months ago
•
|
||
(In reply to Dan Mosedale (:dmosedale, :dmose) from comment #5)
Ed wrote:
.page
For about:welcome data, there's also a page under event_context. Should it be at the top level instead?
It's actually already promoted to the top-level as
event_page
:
Seems reasonable to remove. I suppose we'll just stick with event_page
instead of switching to potentially more general page
that could be used for non-events? Very similarly action
vs event
could reuse the latter for not-events or we can introduce something else if both are needed later or we don't want to mix them up.
More broadly, do we need to care about the specific metric names if I suppose we just update the description to clarify, e.g., "this is actually used for both events and not-events." Seems practical to just go with what we already have and not spend much time on specific naming, but maybe a little bit of metrics architecting could help?
Assignee | ||
Comment 9•11 months ago
•
|
||
Here's a Look showing that the Microsoft Store column has seen a bit of data (presumably tests), so I'm going to leave that alone, since I'm aware that there are ongoing efforts around store and experiments planned.
Assignee | ||
Comment 10•11 months ago
•
|
||
@chutten, do you have any thoughts on Ed's questions in comment 8? My inclination is not to spend the effort on small chunks here until we have a higher-level architecture strategy for how we want to move forward with FxMS telemetry as a whole (it's unclear to me when that will make sense to prioritize).
Assignee | ||
Comment 11•11 months ago
|
||
Assignee | ||
Comment 12•11 months ago
|
||
I've attached a patch that will remove the three completely unused columns, and once we come to a conclusion on the questions in comment 8 , I'll request review, or else adjust as appropriate.
Comment 13•11 months ago
|
||
Changing descriptions is a good thing (no one'll use them if they fall out of date), though if you make changes to the names or how they're used, you may wish to check the ETL derivations downstream of "messaging-system" pings (like onboarding
). Nothing terrible would happen if something went awry -- all the data'd be in firefox_desktop.messaging_system
for replay and backfill -- but it'd cause less disruption to make the changes ahead of time rather than post facto.
Updated•11 months ago
|
Assignee | ||
Comment 14•11 months ago
|
||
(In reply to Chris H-C :chutten from comment #13)
Changing descriptions is a good thing (no one'll use them if they fall out of date), though if you make changes to the names or how they're used, you may wish to check the ETL derivations downstream of "messaging-system" pings (like
onboarding
). Nothing terrible would happen if something went awry -- all the data'd be infirefox_desktop.messaging_system
for replay and backfill -- but it'd cause less disruption to make the changes ahead of time rather than post facto.
Right, I guess what I'm trying to ask is less about the possibility and order of work for making such changes (though your thoughts there are helpful) and more about whether this is a wise use of energy to do now rather than later. My instinct is that it's not worth making these sorts of changes until and unless we know that we need them, in part because it's a non-zero amount of work competing with lots of other stuff, and it's hard to tell in advance whether the things we want to do in the future are likely to have the close-enough semantics to share a metric or not. More on this in a bit...
Updated•10 months ago
|
Assignee | ||
Updated•10 months ago
|
Comment 15•10 months ago
|
||
Looking back at the specific renaming of event_page to event, there is a general rule of thumb to prefer specific names over general ones. Obviously, messaging-system is a bit of a special case, but we've found in our experience with data collections and their consumers, in many instances people only have the energy to look at the names of things. Specificity and clarity are to be prized wherever possible.
Assignee | ||
Comment 16•10 months ago
|
||
I think this also applies to the action vs. events: that's also a proposed generalization, so I'm inclined to assume that in the future, we'd be more likely to add something specific than to go out of our way to generalize. So, I'm going to leave things as they are, except for the three metrics in the patch, which I will put up for review.
Assignee | ||
Comment 17•10 months ago
|
||
Updated•10 months ago
|
Assignee | ||
Comment 18•10 months ago
|
||
Comment 19•10 months ago
|
||
Comment 20•10 months ago
|
||
bugherder |
Description
•