Closed Bug 1834618 Opened 1 year ago Closed 1 year ago

Preliminary ad hoc System Validation of Messaging System reinstrumentation

Categories

(Toolkit :: Telemetry, task)

RESOLVED FIXED

People

(Reporter: chutten, Unassigned)

Once the reinstrumentation of Messaging System reaches Nightly and begins sending data, we can start taking a casual ad hoc look at the data that's being received and conduct iterative system validation.

This should include (but is of course not limited to):

  • Ensuring the descriptions are as full and complete and correct as they can be
  • Ensuring we didn't miss any fields (showing up in invalid_nested_data or unknown_keys)
  • Ensuring the data is of comparable shape and content to its predecessors in the messaging_system.* dataset namespace

The validation should also cover JSON parse errors.

Some nuance on the JSON parse error: some event_context values arrive that aren't JSON. They can be bare strings like "whats-new-panel" for the "undesired-events" messaging_system.ping_type. This appears to be deliberate, and it doesn't affect the ability of the basic messaging_system.event_context to report the value, so it isn't itself an error, even though the client is appropriately catching a ParseError.

Put another way: a non-zero value of messaging_system.event_context_parse_error is not sufficient evidence that the reinstrumentation is invalid. The rest of the ping payload needs to be examined to see whether it is a case like "undesired-events".
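The triage rule above could be sketched roughly as follows. This is only an illustration, not the client or pipeline implementation: the helper names `is_valid_json` and `parse_error_explained`, the flat dictionary layout, and the two sample records are all assumptions made up for the example.

```python
import json


def is_valid_json(raw):
    """Return True when raw parses as JSON (hypothetical helper)."""
    try:
        json.loads(raw)
    except (TypeError, ValueError):
        # json.JSONDecodeError is a subclass of ValueError.
        return False
    return True


def parse_error_explained(ping):
    """True when a parse error is explained by a bare-string event_context."""
    raw = ping.get("event_context")
    return isinstance(raw, str) and not is_valid_json(raw)


# Made-up sample records using an assumed flat layout.
pings = [
    {"ping_type": "undesired-events",
     "event_context": "whats-new-panel",
     "event_context_parse_error": 1},
    {"ping_type": "spotlight",
     "event_context": '{"source":"primary_button","page":"spotlight"}',
     "event_context_parse_error": 0},
]

# Only pings whose parse error is NOT explained by a bare string need a closer look.
needs_review = [
    p for p in pings
    if p["event_context_parse_error"] and not parse_error_explained(p)
]
```

With these two records nothing is flagged: the first ping's parse error is explained by its bare-string context, and the second reports no parse error at all.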

(( Note that the scope of this bug includes adding discoveries such as these to the metrics' descriptions or at least as Gleannotations to aid future analysts ))

Initial Exploration Notebook: https://colab.research.google.com/drive/1ieOZziL3QI8mj8xU-xByFDRJPYsBD8TV#scrollTo=kcU16XNrgmiz

This uses a sample of 5000 pings.

Some interesting findings:

First of all, we don't see data that is very obviously wrong. Always a good sign!

It is worth rethinking how we handle the context parse error. Now that we can see the percentage of cases that will error, I will investigate the potential implications of the approach we've chosen: https://bugzilla.mozilla.org/show_bug.cgi?id=1835151

That's mostly it on that front.

Moving on to fields:

Here are the events we're actually getting:

ASR_RS_NO_MESSAGES            4736
ASR_RS_ERROR                   113
IMPRESSION                      85
MOMENTS_PAGE_SET                20
CLICK_BUTTON                    18
DISMISS                         17
INDEXEDDB_OPEN_FAILED            3
TRANSACTION_FAILED               2
TARGETING_EXPRESSION_ERROR       2
SELECT_CHECKBOX                  1
DISMISSED                        1
SESSION_END                      1
ENABLE                           1
Name: string.messaging_system_event
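To put these counts in proportion, here is a small sketch that turns them into shares of the sample. The counts are copied from the listing above (they sum to the 5000 sampled pings); the variable names are just for illustration.

```python
# Event counts copied from the listing above.
event_counts = {
    "ASR_RS_NO_MESSAGES": 4736,
    "ASR_RS_ERROR": 113,
    "IMPRESSION": 85,
    "MOMENTS_PAGE_SET": 20,
    "CLICK_BUTTON": 18,
    "DISMISS": 17,
    "INDEXEDDB_OPEN_FAILED": 3,
    "TRANSACTION_FAILED": 2,
    "TARGETING_EXPRESSION_ERROR": 2,
    "SELECT_CHECKBOX": 1,
    "DISMISSED": 1,
    "SESSION_END": 1,
    "ENABLE": 1,
}

total = sum(event_counts.values())
shares = {name: count / total for name, count in event_counts.items()}

for name, share in shares.items():
    print(f"{name:<28}{share:8.2%}")
```

ASR_RS_NO_MESSAGES alone accounts for roughly 94.7% of the sample, so the other events are drawn from only a few hundred pings.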

While we aren't getting a huge percentage of pings containing attribution, we do see some!

%2528not%2Bset%2529    27
whatsnew                3
Name: string.messaging_system_attribution_campaign

%2528not%2Bset%2529    30
Name: string.messaging_system_attribution_content

mozillaci    21
mozorg        7
Name: string.messaging_system_attribution_dlsource

www.google.com     16
www.bing.com       11
firefox-browser     3
Name: string.messaging_system_attribution_source

chrome     16
edge       11
firefox     3
Name: string.messaging_system_attribution_ua
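The campaign and content values above look percent-encoded twice. As a reading aid, this sketch shows what the string decodes to using Python's standard `urllib.parse`. It makes no claim about where the double encoding comes from or whether it is intended.

```python
from urllib.parse import unquote, unquote_plus

raw = "%2528not%2Bset%2529"

# First pass: "%25" becomes "%" and "%2B" becomes "+".
once = unquote(raw)

# Second pass: unquote_plus also turns "+" into a space.
twice = unquote_plus(once)

print(once)
print(twice)
```

After two passes the value reads "(not set)".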

Our ASR-specific recorded locale matches our expectations for Nightly locales!

Here's where the pings were actually coming from:

undesired-events    4856
moments               20
spotlight             20
whats-new-panel        7
cfr                    6
infobar                2
Name: string.messaging_system_ping_type, dtype: int64

Here's some useful context data. Note that we can tell who clicked primary and secondary buttons!

{"source":"secondary_button","page":"spotlight"}
"message-groups" 
{"page":"about:firefoxview"}
{"page":"about:welcome"} 
{"source":"primary_button","page":"about:firefoxview"} 
{"source":"primary_button","page":"spotlight"}  
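As a rough sketch of how the button information could be pulled out of values like these, the snippet below tallies (page, source) pairs. The input list is copied from the examples above; treating each value as a raw string and skipping anything that is not a JSON object are assumptions of this example.

```python
import json
from collections import Counter

# event_context values copied from the examples above, as raw strings.
contexts = [
    '{"source":"secondary_button","page":"spotlight"}',
    '"message-groups"',
    '{"page":"about:firefoxview"}',
    '{"page":"about:welcome"}',
    '{"source":"primary_button","page":"about:firefoxview"}',
    '{"source":"primary_button","page":"spotlight"}',
]

clicks = Counter()
for raw in contexts:
    try:
        parsed = json.loads(raw)
    except ValueError:
        continue  # not JSON at all, so there is no source or page to read
    # Plain JSON strings and objects without "source" are skipped.
    if isinstance(parsed, dict) and "source" in parsed:
        clicks[(parsed.get("page"), parsed["source"])] += 1

for (page, source), count in clicks.items():
    print(page, source, count)
```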

We even observe some upgrades:
FX_MR_106_UPGRADE 18

Looking at all the data that's come in over the weekend, we have 0 unknown keys and 0 invalid nested data.
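A check along these lines can be re-run as more data arrives. This is a minimal sketch under assumptions: the function name `count_leftover_fields`, the flat dictionary layout, and the sample records are hypothetical. Only the field names `unknown_keys` and `invalid_nested_data` come from this bug.

```python
def count_leftover_fields(pings):
    """Count pings that report unknown keys or invalid nested data."""
    totals = {"unknown_keys": 0, "invalid_nested_data": 0}
    for ping in pings:
        for field in totals:
            # A missing, None, or empty value counts as "nothing reported".
            if ping.get(field):
                totals[field] += 1
    return totals


# Made-up sample: neither ping reports leftovers.
clean_sample = [
    {"event": "IMPRESSION"},
    {"event": "CLICK_BUTTON", "unknown_keys": {}},
]
clean_totals = count_leftover_fields(clean_sample)

# Made-up sample: one ping carries a key the schema does not know about.
dirty_sample = clean_sample + [{"event": "DISMISS", "unknown_keys": {"foo": 1}}]
dirty_totals = count_leftover_fields(dirty_sample)
```

Both totals staying at zero is the outcome reported above. A non-zero count would point to a field the reinstrumentation missed.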

Of note is that somehow I specified invalid_nested_data be sent on the "metrics" ping instead of the "messaging-system" ping. Whoops. (bug 1835656)

We might be able to call this ad hoc approach done and move on to building the system health dashboard. Dan, what do you think?

Flags: needinfo?(dmosedale)

Dan and the group chatted about the ad hoc data validation at today's Messaging System PingCentre Reinstrumentation sync and he's happy to move on to building the system health dashboard.

Status: NEW → RESOLVED
Closed: 1 year ago
Flags: needinfo?(dmosedale)
Resolution: --- → FIXED