Bug 1837230 Comment 2 Edit History

Original comment by

on 2023-06-07 14:04:16 PDT

Things this is _not_:
* Differences in what Glean vs Telemetry think a "channel" or a "version" is. Both get this information from the same place and report them (and have it normalized) in the same way. Haven't checked that they're all using buildids from buildhub2 (ie, moz-issued builds), but I don't expect this to be it.
* One weird client. These are client counts and though most of the difference happened on May 24, that is to be expected given the trigger mechanism.
* Worse than it looks. I've looked at the [set intersection of legacy telemetry client_id as reported by the two systems](https://sql.telemetry.mozilla.org/queries/92545/source) and it shows that the missing segment really is missing (ie, the two populations are not much more disjoint than the basic client counts suggest).

This this _probably_ isn't:
* Queue overflow. Though the event is recorded very early and we do each day have 1300-2k clients affected by preinit queue overflow, that is out of 30-57 _million_ clients (< 0.01%), so it's unlikely to have the size of effect we see.

Things it _could be_:
* At-shutdown scheduling differences. Glean doesn't send "events" pings at shutdown. Legacy Telemetry sends "event" pings at shutdown. If this is a short-lived session (or one with no inactivity) then Glean might not get a chance to send events. To try and work this out, I looked at the population of clients who sent us these events from telemetry _and_ glean, telemetry, glean, telemetry but _not_ glean, and glean but _not_ telemetry.
* How many "baseline" pings did these clients send? This is to give us an idea of about how much these populations interacted with the browser in general. This is a little risky on its own as it's kinda like asking "For some clients we suspect might have difficulty sending Glean stuff, how well represented are they in Glean (ie, _the thing we suspect they might be having difficulty with_)", but that's why I'm also checking "main" pings.
* A refresher: There are 3.8k clients sending events with matching category via Telemetry "event" pings. There are 1.4k sending events with matching category via Glean "events" pings. There are 280ish that are in Glean but not Telemetry and 2.6k in Telemetry but not Glean.
* [In the "baseline" ping record](https://sql.telemetry.mozilla.org/queries/92546/source), the 3.8k clients sending events with a matching category via Telemetry "event" pings sent a total of 1.4M "baseline" pings. The 1.4k with events in Glean pings sent 3.3M. The 280ish in Glean but not Telemetry sent 2.4M. The 2.6k in Telemetry not Glean sent 563k.
* [In the "main" ping record](https://sql.telemetry.mozilla.org/queries/92548/source), the 3.8k clients sending events with a matching category via Telemetry "event" pings sent a total of 484k "main" pings. The 1.4k with events in Glean pings sent 770k. The 280ish in Glean but not Telemetry sent 470k. The 2.6k in Telemetry not Glean sent 183k.
* This means that the clients from whom we received Glean "events" pings with matching category were much chattier than those we received Telemetry "event" pings from. This suggests they are longer-term users, which means they'd be less affected by any at-shutdown scheduling differences.

It's a bit of an odd way to look at the question "Are the clients we're hearing from only in Telemetry not hanging around long enough to send Glean "events" pings?", but it's the way I could think of to do this in comparison to the population we're looking at. I suppose we could count distinct `session_id`s to see how many of the telemetry-but-not-glean crew come back... (gosh this query's taking forever).

Hey, the query finally came back and... wait, there's only [250ish clients in the "main" ping record that were in the "event" ping record with matching category?](https://sql.telemetry.mozilla.org/queries/92552/source). Out of the 2.6k in telemetry but not Glean? That's odd: a "main" ping is submitted at shutdown only [eighteen lines after the event ping is](https://searchfox.org/mozilla-central/rev/887d4b5da89a11920ed0fd96b7b7f066927a67db/toolkit/components/telemetry/app/TelemetryControllerParent.sys.mjs#916). The only reason it might not be (attempted to be) uploaded right then and there would be if this was the first application session (in which case we'd send a "first-shutdown" ping instead).

Tomorrow I'll check "first-shutdown" pings for membership from these telemetry-but-not-glean reports. But so far all this evidence is consistent with the theory that this is an effect of at-shutdown scheduling differences. If this remains the case tomorrow, we'll need to prioritize bug 1837233.

Revision 1 by

Chris H-C :chutten

on 2023-06-07 14:04:51 PDT

Things this is _not_:
* Differences in what Glean vs Telemetry think a "channel" or a "version" is. Both get this information from the same place and report them (and have it normalized) in the same way. Haven't checked that they're all using buildids from buildhub2 (ie, moz-issued builds), but I don't expect this to be it.
* One weird client. These are client counts and though most of the difference happened on May 24, that is to be expected given the trigger mechanism.
* Worse than it looks. I've looked at the [set intersection of legacy telemetry client_id as reported by the two systems](https://sql.telemetry.mozilla.org/queries/92545/source) and it shows that the missing segment really is missing (ie, the two populations are not much more disjoint than the basic client counts suggest).

Things this _probably_ isn't:
* Queue overflow. Though the event is recorded very early and we do each day have 1300-2k clients affected by preinit queue overflow, that is out of 30-57 _million_ clients (< 0.01%), so it's unlikely to have the size of effect we see.

Things it _could be_:
* At-shutdown scheduling differences. Glean doesn't send "events" pings at shutdown. Legacy Telemetry sends "event" pings at shutdown. If this is a short-lived session (or one with no inactivity) then Glean might not get a chance to send events. To try and work this out, I looked at the population of clients who sent us these events from telemetry _and_ glean, telemetry, glean, telemetry but _not_ glean, and glean but _not_ telemetry.
* How many "baseline" pings did these clients send? This is to give us an idea of about how much these populations interacted with the browser in general. This is a little risky on its own as it's kinda like asking "For some clients we suspect might have difficulty sending Glean stuff, how well represented are they in Glean (ie, _the thing we suspect they might be having difficulty with_)", but that's why I'm also checking "main" pings.
* A refresher: There are 3.8k clients sending events with matching category via Telemetry "event" pings. There are 1.4k sending events with matching category via Glean "events" pings. There are 280ish that are in Glean but not Telemetry and 2.6k in Telemetry but not Glean.
* [In the "baseline" ping record](https://sql.telemetry.mozilla.org/queries/92546/source), the 3.8k clients sending events with a matching category via Telemetry "event" pings sent a total of 1.4M "baseline" pings. The 1.4k with events in Glean pings sent 3.3M. The 280ish in Glean but not Telemetry sent 2.4M. The 2.6k in Telemetry not Glean sent 563k.
* [In the "main" ping record](https://sql.telemetry.mozilla.org/queries/92548/source), the 3.8k clients sending events with a matching category via Telemetry "event" pings sent a total of 484k "main" pings. The 1.4k with events in Glean pings sent 770k. The 280ish in Glean but not Telemetry sent 470k. The 2.6k in Telemetry not Glean sent 183k.
* This means that the clients from whom we received Glean "events" pings with matching category were much chattier than those we received Telemetry "event" pings from. This suggests they are longer-term users, which means they'd be less affected by any at-shutdown scheduling differences.

It's a bit of an odd way to look at the question "Are the clients we're hearing from only in Telemetry not hanging around long enough to send Glean "events" pings?", but it's the way I could think of to do this in comparison to the population we're looking at. I suppose we could count distinct `session_id`s to see how many of the telemetry-but-not-glean crew come back... (gosh this query's taking forever).

Hey, the query finally came back and... wait, there's only [250ish clients in the "main" ping record that were in the "event" ping record with matching category?](https://sql.telemetry.mozilla.org/queries/92552/source). Out of the 2.6k in telemetry but not Glean? That's odd: a "main" ping is submitted at shutdown only [eighteen lines after the event ping is](https://searchfox.org/mozilla-central/rev/887d4b5da89a11920ed0fd96b7b7f066927a67db/toolkit/components/telemetry/app/TelemetryControllerParent.sys.mjs#916). The only reason it might not be (attempted to be) uploaded right then and there would be if this was the first application session (in which case we'd send a "first-shutdown" ping instead).

Tomorrow I'll check "first-shutdown" pings for membership from these telemetry-but-not-glean reports. But so far all this evidence is consistent with the theory that this is an effect of at-shutdown scheduling differences. If this remains the case tomorrow, we'll need to prioritize bug 1837233.

Revision 2 by

Chris H-C :chutten

on 2023-06-08 07:29:01 PDT

Things this is _not_:
* Differences in what Glean vs Telemetry think a "channel" or a "version" is. Both get this information from the same place and report them (and have it normalized) in the same way. Haven't checked that they're all using buildids from buildhub2 (ie, moz-issued builds), but I don't expect this to be it.
* One weird client. These are client counts and though most of the difference happened on May 24, that is to be expected given the trigger mechanism.
* Worse than it looks. I've looked at the [set intersection of legacy telemetry client_id as reported by the two systems](https://sql.telemetry.mozilla.org/queries/92545/source) and it shows that the missing segment really is missing (ie, the two populations are not much more disjoint than the basic client counts suggest).
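
For anyone wanting to redo the set comparison outside of the linked query, here is a minimal sketch in Python. It is not the query itself: the function name and the sample ids are made up for illustration, and it assumes you already have the two lists of legacy telemetry `client_id`s in hand.

```python
def partition_clients(telemetry_ids, glean_ids):
    """Split clients by which system(s) reported them.

    Both arguments are iterables of legacy telemetry client_id strings.
    The name and shape of this helper are hypothetical.
    """
    telemetry = set(telemetry_ids)
    glean = set(glean_ids)
    return {
        "both": telemetry & glean,            # seen via Telemetry and Glean
        "telemetry_only": telemetry - glean,  # seen via Telemetry, missing from Glean
        "glean_only": glean - telemetry,      # seen via Glean, missing from Telemetry
    }


# Made-up ids, only to show the shape of the result.
sample = partition_clients(["a", "b", "c", "d"], ["c", "d", "e"])
```

If the two populations were badly mismatched (as opposed to one mostly being a subset of the other), `glean_only` would be large relative to `both`.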

Things this _probably_ isn't:
* Queue overflow. Though the event is recorded very early and we do each day have 1300-2k clients affected by preinit queue overflow, that is out of 30-57 _million_ clients (< 0.01%), so it's unlikely to have the size of effect we see.
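
The share of clients affected by preinit queue overflow can be worked out from the rounded figures quoted above. This is only a back-of-the-envelope sketch: the pairing of the smallest count with the largest population (and vice versa) is an assumption made to bracket the range, and the helper name is hypothetical.

```python
def affected_share(affected_clients, total_clients):
    """Fraction of the daily population hit by preinit queue overflow."""
    return affected_clients / total_clients


# Bracket the range using the rounded figures from the comment.
lowest = affected_share(1_300, 57_000_000)   # fewest affected, largest population
highest = affected_share(2_000, 30_000_000)  # most affected, smallest population

# As fractions both are below 0.0001, i.e. below 0.01% of clients.
print(f"{lowest:.4%} to {highest:.4%}")
```

Even at the top of that range the affected population is far too small to account for a gap of a couple thousand clients in a category that only has a few thousand to begin with.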

Things it _could be_:
* At-shutdown scheduling differences. Glean doesn't send "events" pings at shutdown. Legacy Telemetry sends "event" pings at shutdown. If this is a short-lived session (or one with no inactivity) then Glean might not get a chance to send events. To try and work this out, I looked at the population of clients who sent us these events from telemetry _and_ glean, telemetry, glean, telemetry but _not_ glean, and glean but _not_ telemetry.
* How many "baseline" pings did these clients send? This is to give us an idea of about how much these populations interacted with the browser in general. This is a little risky on its own as it's kinda like asking "For some clients we suspect might have difficulty sending Glean stuff, how well represented are they in Glean (ie, _the thing we suspect they might be having difficulty with_)", but that's why I'm also checking "main" pings.
* A refresher: There are 3.8k clients sending events with matching category via Telemetry "event" pings. There are 1.4k sending events with matching category via Glean "events" pings. There are 280ish that are in Glean but not Telemetry and 2.6k in Telemetry but not Glean.
* [In the "baseline" ping record](https://sql.telemetry.mozilla.org/queries/92546/source), the 3.8k clients sending events with a matching category via Telemetry "event" pings sent a total of 1.4M "baseline" pings. The 1.4k with events in Glean pings sent 3.3M. The 280ish in Glean but not Telemetry sent 2.4M. The 2.6k in Telemetry not Glean sent 563k.
* [In the "main" ping record](https://sql.telemetry.mozilla.org/queries/92548/source), the 3.8k clients sending events with a matching category via Telemetry "event" pings sent a total of 484k "main" pings. The 1.4k with events in Glean pings sent 770k. The 280ish in Glean but not Telemetry sent 470k. The 2.6k in Telemetry not Glean sent 183k.
* This means that the clients from whom we received Glean "events" pings with matching category were much chattier than those we received Telemetry "event" pings from. This suggests they are longer-term users, which means they'd be less affected by any at-shutdown scheduling differences.
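
To make the "chattier" comparison concrete, the rounded totals above can be turned into pings per client. A rough sketch follows: the figures are the rounded ones from the bullets, so the results are approximate, and the dictionary keys and helper name are hypothetical labels, not column names from the queries.

```python
# Rounded client counts from the refresher bullet.
CLIENTS = {
    "telemetry": 3_800,
    "glean": 1_400,
    "glean_only": 280,
    "telemetry_only": 2_600,
}

# Rounded ping totals from the "baseline" and "main" bullets.
BASELINE_PINGS = {
    "telemetry": 1_400_000,
    "glean": 3_300_000,
    "glean_only": 2_400_000,
    "telemetry_only": 563_000,
}
MAIN_PINGS = {
    "telemetry": 484_000,
    "glean": 770_000,
    "glean_only": 470_000,
    "telemetry_only": 183_000,
}


def pings_per_client(pings, clients):
    """Average number of pings sent per client in each population."""
    return {group: pings[group] / clients[group] for group in clients}


baseline_rate = pings_per_client(BASELINE_PINGS, CLIENTS)
main_rate = pings_per_client(MAIN_PINGS, CLIENTS)

for group in CLIENTS:
    print(f"{group}: {baseline_rate[group]:.0f} baseline, {main_rate[group]:.0f} main")
```

On both ping types the Glean populations come out several times higher per client than the Telemetry ones, with telemetry-only the lowest of the four.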

It's a bit of an odd way to look at the question "Are the clients we're hearing from only in Telemetry not hanging around long enough to send Glean "events" pings?", but it's the way I could think of to do this in comparison to the population we're looking at. I suppose we could count distinct `session_id`s to see how many of the telemetry-but-not-glean crew come back... (gosh this query's taking forever).

Hey, the query finally came back and... ~wait, there's only [250ish clients in the "main" ping record that were in the "event" ping record with matching category?](https://sql.telemetry.mozilla.org/queries/92552/source). Out of the 2.6k in telemetry but not Glean? That's odd: a "main" ping is submitted at shutdown only [eighteen lines after the event ping is](https://searchfox.org/mozilla-central/rev/887d4b5da89a11920ed0fd96b7b7f066927a67db/toolkit/components/telemetry/app/TelemetryControllerParent.sys.mjs#916). The only reason it might not be (attempted to be) uploaded right then and there would be if this was the first application session (in which case we'd send a "first-shutdown" ping instead).~

~Tomorrow I'll check "first-shutdown" pings for membership from these telemetry-but-not-glean reports. But so far all this evidence is consistent with the theory that this is an effect of at-shutdown scheduling differences. If this remains the case tomorrow, we'll need to prioritize bug 1837233.~ See below
