Closed Bug 1837230 Opened 1 year ago Closed 1 year ago

Event count discrepancy for Glean `browser.launched_to_handle :: system_notification` event mirrored to legacy event

Categories

(Data Platform and Tools :: Glean: SDK, defect, P1)

Tracking

RESOLVED FIXED
Tracking Status: firefox116 --- fixed

People

(Reporter: nalexander, Assigned: chutten)

Attachments

(2 files)

I have a Glean event mirrored to a legacy telemetry event. The event is only emitted in a single place.

When I query for event counts, I get 3x as many legacy events: ~500 Glean events vs ~1500 legacy events for backgroundtaskmessage-notification-release-4pct-en:treatment-old.

Some context: the browser.launched_to_handle :: system_notification event is emitted when a user clicks a Windows native notification to launch Firefox while it is not currently running. These notifications are a relatively new message surface displayed by the Windows-only background update process when it enrols in certain Nimbus experiments. The event is emitted during command-line handling immediately upon startup, so if we were seeing a race or queue overflow it would likely be related to startup.

It is very unlikely that a user would generate two of these events, or at least two with a common name: the name identifies the Nimbus experiment and enrolled branch of the background update task profile, and the background update task should only show (at most) one notification.

Things this is not:

  • Differences in what Glean vs Telemetry think a "channel" or a "version" is. Both get this information from the same place and report (and normalize) it in the same way. I haven't checked that they're all using buildids from buildhub2 (ie, moz-issued builds), but I don't expect this to be it.
  • One weird client. These are client counts, and though most of the difference happened on May 24, that is to be expected given the trigger mechanism.
  • Worse than it looks. I've looked at the set intersection of legacy telemetry client_ids as reported by the two systems, and it shows that the missing segment genuinely appears to be missing (ie, the two populations are not much more disjoint than the basic client counts suggest).

Things this probably isn't:

  • Queue overflow. Though the event is recorded very early and we do see 1300-2k clients affected by preinit queue overflow each day, that is out of 30-57 million clients (< 0.01%), so it's unlikely to have the size of effect we see.

Things it could be:

  • At-shutdown scheduling differences. Glean doesn't send "events" pings at shutdown; legacy Telemetry sends "event" pings at shutdown. If this is a short-lived session (or one with no inactivity) then Glean might not get a chance to send its events. To try to work this out, I looked at the populations of clients who sent us these events from Telemetry and Glean: via Telemetry, via Glean, via Telemetry but not Glean, and via Glean but not Telemetry.
    • How many "baseline" pings did these clients send? This is to give us an idea of about how much these populations interacted with the browser in general. This is a little risky on its own as it's kinda like asking "For some clients we suspect might have difficulty sending Glean stuff, how well represented are they in Glean (ie, the thing we suspect they might be having difficulty with)", but that's why I'm also checking "main" pings.
    • A refresher: There are 3.8k clients sending events with matching category via Telemetry "event" pings. There are 1.4k sending events with matching category via Glean "events" pings. There are 280ish that are in Glean but not Telemetry and 2.6k in Telemetry but not Glean.
    • In the "baseline" ping record, the 3.8k clients sending events with a matching category via Telemetry "event" pings sent a total of 1.4M "baseline" pings. The 1.4k with events in Glean pings sent 3.3M. The 280ish in Glean but not Telemetry sent 2.4M. The 2.6k in Telemetry not Glean sent 563k.
    • In the "main" ping record, the 3.8k clients sending events with a matching category via Telemetry "event" pings sent a total of 484k "main" pings. The 1.4k with events in Glean pings sent 770k. The 280ish in Glean but not Telemetry sent 470k. The 2.6k in Telemetry not Glean sent 183k.
    • This means that the clients from whom we received Glean "events" pings with matching category were much chattier than those we received Telemetry "event" pings from. This suggests they are longer-term users, which means they'd be less affected by any at-shutdown scheduling differences.

It's a bit of an odd way to look at the question "Are the clients we're hearing from only in Telemetry not hanging around long enough to send Glean "events" pings?", but it's the best way I could think of to do this relative to the population we're looking at. I suppose we could count distinct session_ids to see how many of the telemetry-but-not-glean crew come back... (gosh, this query's taking forever).
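For clarity, the cohort arithmetic above is just set operations over the distinct legacy client_ids observed in each ping table. A minimal sketch of that logic (in Rust, with made-up inputs; the real analysis was done with queries against the ping tables, and the numbers quoted above come from those queries):

```rust
use std::collections::HashSet;

/// Given the distinct legacy client_ids that sent a matching event via
/// Telemetry "event" pings and via Glean "events" pings, derive the four
/// cohorts discussed above.
fn cohorts(telemetry: &HashSet<String>, glean: &HashSet<String>) {
    let telemetry_not_glean: HashSet<_> = telemetry.difference(glean).collect();
    let glean_not_telemetry: HashSet<_> = glean.difference(telemetry).collect();
    println!(
        "telemetry={} glean={} telemetry-not-glean={} glean-not-telemetry={}",
        telemetry.len(),
        glean.len(),
        telemetry_not_glean.len(),
        glean_not_telemetry.len()
    );
}

fn main() {
    // Toy inputs standing in for the real client_id lists.
    let telemetry: HashSet<String> =
        ["a", "b", "c", "d"].iter().map(|s| s.to_string()).collect();
    let glean: HashSet<String> = ["c", "d", "e"].iter().map(|s| s.to_string()).collect();
    cohorts(&telemetry, &glean);
}
```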

Hey, the query finally came back and... wait, there are only 250ish clients in the "main" ping record that were in the "event" ping record with matching category? Out of the 2.6k in telemetry but not Glean? That's odd: a "main" ping is submitted at shutdown only eighteen lines after the event ping is. The only reason it might not be (attempted to be) uploaded right then and there would be if this was the first application session (in which case we'd send a "first-shutdown" ping instead).

Tomorrow I'll check "first-shutdown" pings for membership from these telemetry-but-not-glean reports. But so far all this evidence is consistent with the theory that this is an effect of at-shutdown scheduling differences. If this remains the case tomorrow, we'll need to prioritize bug 1837233 (see below).

Assignee: nobody → chutten
Severity: -- → S4
Status: NEW → ASSIGNED
Priority: -- → P1
See Also: → 1837233

Hrm, of the 2.6k clients sending browser.launched_to_handle events via Telemetry "event" pings but not Glean "events" pings, only 21 are on their first session. Then why aren't the other 2k+ findable in the "main" ping record?

Ah, it's because I made a mistake and only looked at "main" pings after June 1. Whoops. Once I expand it to May 23+ they all show up. I'll edit my prior comment.

As for the number of sessions we see in the "main" ping record: 400k, so it's not just one-off application sessions. There should be plenty of opportunity for the Glean "events" pings to be sent for scheduling reason 3 (startup).

Maybe I need to take another look at preinit queue overflow.

Nope. Another dead end. Only 10 clients report having any queue overflow. (Though that could be an undercount: an inability to deliver the "events" ping might also influence the ability to deliver the "metrics" ping, which is where overflow is reported.)

To sum:

  • We have a long list of things it isn't:
    • Differences in what Glean vs Telemetry think a "channel" or a "version" is.
    • One weird client
    • Worse than it looks.
    • Preinit queue overflow.
  • The best guess for what it could be (at-shutdown ping scheduling differences) doesn't quite pan out, since there are so many sessions that the recorded events should have been sent at startup anyway.
  • Mayyyybe, since the sessions are so short (over half are 2min or shorter), Glean's not getting past preinit? BrowserGlue says it guarantees init will run, though.

I'm at a loss. ...though, if all the sessions are so short that Glean is never initialized until late enough in shutdown that it can't upload a ping... but no, it uploads "baseline" pings just fine.

I don't know why this is happening, and I can't think of a way to look at the data to get any more clues.

The only thing I can think of to do is to broaden the search. Instead of targeting this small population, let's add an event recorded at the top of dch_handle and see in which situations we don't have alignment between Glean and Telemetry when we collect across all profiles. This will also give us some instrumentation to experiment with in bug 1837233.

Attachment #9338133 - Flags: data-review?(tlong)

Comment on attachment 9338133 [details]
data collection review request

Data Review

  1. Is there or will there be documentation that describes the schema for the ultimate data set in a public, complete, and accurate way?

Yes, through the metrics.yaml file and the Glean Dictionary.

  2. Is there a control mechanism that allows the user to turn the data collection on and off?

Yes, through the telemetry preferences in the application settings.

  3. If the request is for permanent data collection, is there someone who will monitor the data over time?

N/A, collection to end or be renewed by version 120

  4. Using the category system of data types on the Mozilla wiki, what collection type of data do the requested measurements fall under?

Category 2, Interaction data

  5. Is the data collection request for default-on or default-off?

Default-on

  6. Does the instrumentation include the addition of any new identifiers (whether anonymous or otherwise; e.g., username, random IDs, etc. See the appendix for more details)?

No

  7. Is the data collection covered by the existing Firefox privacy notice?

Yes

  8. Does the data collection use a third-party collection tool?

No

Result

data-review+

Attachment #9338133 - Flags: data-review?(tlong) → data-review+

I had a weird thought that this could be a metrics configuration (aka Data Control Plane, aka Server Knobs) disabling Glean collection for these clients. I very much doubt it, but I include the idea for completeness. There is presently no way to tell whether these clients were under a metrics configuration (see bug 1837659), but the broad instrumentation in the attached patch will make it clearer.

The only currently active Data Control Plane configuration is in Nightly and Release on Desktop, and is enabling 3 urlbar engagement events. It is not being used to disable anything at the current time.

Pushed by chutten@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/4e710e9028e0 Instrument an early event for FOG validation r=perry.mcmanis,nalexander
Status: ASSIGNED → RESOLVED
Closed: 1 year ago
Resolution: --- → FIXED

Early returns over the weekend don't show a similar effect. I mean, there are still more clients reporting the events in Telemetry "event" pings than in Glean "events" pings, but only about a (shrinking) 12-15% discrepancy (which could be due to scheduling differences). Nothing close to the reported amount.

This could be due to channel effects: maybe we'll only see the reported effect when we reach release.

Or maybe the difference has the same cause, but widening the scope has obscured its magnitude by including a bunch of properly-reported data. In that case I'd expect to see an increased discrepancy when we focus on just Windows, but I didn't see any such thing when I checked.

We'll see whether the trends continue.

(In reply to Chris H-C :chutten from comment #13)

> Early returns over the weekend don't show a similar effect. I mean, there are still more clients reporting the events in Telemetry "event" pings than in Glean "events" pings, but only about a (shrinking) 12-15% discrepancy (which could be due to scheduling differences). Nothing close to the reported amount.
>
> This could be due to channel effects: maybe we'll only see the reported effect when we reach release.
>
> Or maybe the difference has the same cause, but widening the scope has obscured its magnitude by including a bunch of properly-reported data. In that case I'd expect to see an increased discrepancy when we focus on just Windows, but I didn't see any such thing when I checked.
>
> We'll see whether the trends continue.

This is very interesting. Could you look at the notification experiment clients who sent Telemetry events but not Glean events and see if they generally had very short sessions? That seems the most likely explanation at this point.

Flags: needinfo?(chutten)

I did look at them, and we see that the telemetry-not-glean clients' sessions are very likely to be very short. However, the telemetry-and-glean clients have a similar distribution too. I'm not enough of a statistician to tease out whether the distributions are the same or different, and at what confidence, but I think the low population numbers in both cohorts will make significance difficult to assert.

I did, at your prompting, consider a new potential theory of the bug, though. A recap on short sessions and Glean:

  • Short sessions shouldn't matter to Glean, since BrowserGlue will ensure FOG is init before (orderly) shutdowns. (Or at least that was my interpretation of BrowserGlue's _scheduleStartupIdleTasks suggestion that tasks which are okay with not being run in some sessions go in scheduleBestEffortUserIdleTasks instead.)
  • On init, the preinit queue is flushed and pending operations happen. Certain builtin pings can be submitted: "dirty_startup"-reason "baseline" pings, "overdue"/"upgrade"-reason "metrics" pings, and the "events" ping if any events from a prior session are found.
    • This ensures both that (modulo preinit queue overflow, which we ascertained is not a widespread problem in the experiment's client population) all early metric data is applied to storage, and that, if we find ourselves unable to submit the "events" ping containing the browser.launched_to_handle events by the end of the session that recorded them, we'll send it first thing on the next session.
      • This may explain why the day-to-day ping returns that started this investigation began with Telemetry out-sending Glean, but ended with Glean out-sending Telemetry.
  • Though submission is not upload, we (now) wait to process at least one ping even on short sessions. That doesn't guarantee we'll get a chance, across however many startups, to upload all of the submitted and pending pings. But the more sessions there are, the more likely there is to be one long enough to upload them: and we can send fifteen pings a minute at current rates.

Or, to sum: short sessions shouldn't affect Glean's ability to be present and correct. If Telemetry can do it, Glean can do it better. (Or at least that's the intent.)
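To make that recap concrete, here's a toy model of the preinit-queue behaviour described above. This is not the actual SDK code: the type, the queue limit, and the init signature are all illustrative.

```rust
use std::collections::VecDeque;

// Illustrative limit; not the real SDK's preinit queue bound.
const PREINIT_QUEUE_LIMIT: usize = 1_000;

/// Toy model: API calls made before init are queued in memory (up to a
/// limit), then replayed against storage when init runs.
struct ToyGlean {
    initialized: bool,
    preinit_queue: VecDeque<String>,
    overflowed: bool,
    stored_events: Vec<String>,
}

impl ToyGlean {
    fn new() -> Self {
        ToyGlean {
            initialized: false,
            preinit_queue: VecDeque::new(),
            overflowed: false,
            stored_events: Vec::new(),
        }
    }

    /// Record an event. Before init it only lands in the in-memory queue.
    fn record_event(&mut self, name: &str) {
        if self.initialized {
            self.stored_events.push(name.to_string());
        } else if self.preinit_queue.len() < PREINIT_QUEUE_LIMIT {
            self.preinit_queue.push_back(name.to_string());
        } else {
            // This is the "preinit queue overflow" we ruled out above.
            self.overflowed = true;
        }
    }

    /// Init replays the queue. If events from a prior session were found on
    /// disk, a "startup"-reason "events" ping would be submitted (modelled
    /// here as the boolean argument and return value).
    fn init(&mut self, events_pending_from_prior_session: bool) -> bool {
        self.initialized = true;
        while let Some(name) = self.preinit_queue.pop_front() {
            self.stored_events.push(name);
        }
        events_pending_from_prior_session
    }
}

fn main() {
    let mut glean = ToyGlean::new();
    // Recorded during command-line handling, well before init:
    glean.record_event("browser.launched_to_handle.system_notification");
    // Later (if we get there), init flushes the queue; any prior-session
    // events would go out in a "startup"-reason "events" ping.
    glean.init(false);
    assert_eq!(glean.stored_events.len(), 1);
    assert!(!glean.overflowed);
}
```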

So about that theory: there's one wrinkle I didn't cover in the recap: delay_ping_lifetime_io. Telemetry is in-memory and only ever persists full ping payloads when it wants to ensure something sticks around between sessions. The Glean SDK persists to a database because, coming from mobile, it has less faith in the application's ability to stick around long-term without the OS whacking it with an axe. This approach of persisting every sample is less necessary and more expensive on Desktop, with its long-lived sessions, orderly shutdowns, and greater likelihood of running on spinning (not solid-state) disks. Thus, on Desktop, the Glean SDK is instructed to only persist "ping"-lifetime data (which is most of it) when asked to by the embedder: and presently the embedder persists opportunistically on idle or on (orderly) shutdown.
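The embedder's side of that arrangement, sketched roughly. This is not FOG's actual code: the hook names are illustrative, and it assumes the Rust Language Binding (the `glean` crate) exposes persist_ping_lifetime_data() and shutdown() the way glean-core does.

```rust
// Assumes the `glean` Rust Language Binding crate as a dependency.

/// Hypothetical idle hook: with delayed ping-lifetime IO, "ping"-lifetime
/// data stays in memory until the embedder asks the SDK to persist it, so we
/// do that opportunistically when the user goes idle.
fn on_user_idle() {
    let _ = glean::persist_ping_lifetime_data();
}

/// Hypothetical orderly-shutdown hook: flush the same data to disk before we
/// exit. This only helps if the hook actually runs -- i.e. FOG was
/// initialized and the shutdown was orderly.
fn on_orderly_shutdown() {
    let _ = glean::persist_ping_lifetime_data();
    glean::shutdown();
}
```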

But we're not really worried about disorderly shutdown, are we? I mean, we are, as in Mozilla is: but it doesn't explain the effect we're seeing. So why does this matter: won't the event record just persist at shutdown and be sent early on the next session (and I've established most of these clients have plenty of sessions with which to do so)?

Not if shutdown happens before init.

FOG will shut down during ShutdownPhase::XPCOMShutdown. I haven't been able to ascertain during which shutdown phase BrowserGlue's idle task queue will ensure it runs its task to initialize FOG, but it might be ShutdownPhase::XPCOMShutdownThreads. If that is indeed later than FOG's shutdown, we won't persist ping-lifetime data at shutdown. Meaning the preinit queue will indeed process and be all happy and groovy and initialization will be complete: and then the rest of shutdown happens and FOG doesn't get an orderly shutdown.

I might even be able to see evidence of this in the data. I can look in the Glean "baseline" record for legacy client_ids from the telemetry-not-glean set. I would expect to see a higher proportion of "dirty-startup"-reason "baseline" pings in them than their telemetry-and-glean cousins. And, if so, (and maybe even if not) we could bridge this gap by a simple call to persist_ping_lifetime_data() right about here. (Or, maybe more sensibly, in the Glean SDK right about here).
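Concretely, the belt-and-suspenders version of that idea looks something like this sketch. It is only a sketch: flush_preinit_queue() is a hypothetical stand-in for wherever the real flush completes, and whether the persist call belongs in FOG or in the SDK itself is exactly the open question above.

```rust
/// Hypothetical stand-in for the point at which the SDK finishes replaying
/// the pre-initialization queue.
fn flush_preinit_queue() {
    // ... replay queued API calls into storage ...
}

/// Sketch of the proposed fix: once init has flushed the preinit queue, ask
/// the SDK to persist ping-lifetime data immediately. Then even a session
/// that never reaches an orderly FOG shutdown leaves the early events on
/// disk, ready for the next session's "startup"-reason "events" ping.
fn on_init_complete() {
    flush_preinit_queue();
    let _ = glean::persist_ping_lifetime_data();
}
```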

(( And this is to say nothing of how I should probably make a point of ensuring shutdown happens after init as a matter of habit. The only reason it isn't that way today is because of the split between constructing the FOG singleton and running FOG init (which runs Glean init). ))

All this is to say, I've got a couple of tasks:

  1. Check for evidence of "dirty_startup"-reason "baseline" pings in the telemetry-not-glean cohort. I think we ought to be able to get enough of a population to check whether there are surprisingly many of them compared to others in the experiment group.
  2. Close the hole. Even if #1 doesn't come up with anything, we should probably persist the preinit queue when it's done being flushed anyway. We'll belt-and-suspenders it: get it into code faster via FOG, and fix it in the SDK for whenever the next release comes along.
  3. Review FOG's init and shutdown design to see if there's a way to ensure shutdown a) still happens only once, and b) happens only after init. (One possible shape is sketched below.)
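For task 3, one possible shape (purely illustrative, not FOG's actual structure): have shutdown force init to run first, then gate the shutdown work on a state transition so it runs at most once and only ever after init.

```rust
use std::sync::atomic::{AtomicU8, Ordering};

// 0 = not initialized, 1 = initialized, 2 = shut down. Illustrative only.
static STATE: AtomicU8 = AtomicU8::new(0);

fn init() {
    // Only the 0 -> 1 transition does work; repeated calls are no-ops.
    if STATE
        .compare_exchange(0, 1, Ordering::SeqCst, Ordering::SeqCst)
        .is_ok()
    {
        // ... real initialization: flush the preinit queue, etc. ...
    }
}

fn shutdown() {
    // Make sure init has happened, so shutdown always follows init...
    init();
    // ...and only the 1 -> 2 transition does work, so it happens at most once.
    if STATE
        .compare_exchange(1, 2, Ordering::SeqCst, Ordering::SeqCst)
        .is_ok()
    {
        // ... persist ping-lifetime data, stop the uploader, etc. ...
    }
}

fn main() {
    shutdown(); // init() runs first, then the shutdown work, exactly once
    shutdown(); // no-op
}
```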
Flags: needinfo?(chutten)

I'm not seeing the effect I'd expect if "init after shutdown" were a widespread issue. (Either that or "dirty_startup"-reason "baseline" pings are common enough and volatile enough in the population that the effect I'm measuring has been drowned.) 17.7% (814/4603) of the telemetry-not-glean cohort sent a "dirty_startup"-reason "baseline" ping whereas the telemetry-and-glean cohort's at 27.4% (725/2646).

Weirdly, both of these proportions seem high. I'm counting the proportion of Glean clients with matching legacy client_ids... maybe I should try again and count the legacy ids instead?

In legacy client numbers, we find 2.6% (63 of 2424) of the telemetry-not-glean cohort sent a "dirty_startup"-reason "baseline" ping whereas the telemetry-and-glean cohort's at 20% (233/1167). Three notes:

  1. Wow, these proportions are really high? "dirty_startup" is supposed to be a sign of something anomalous. Of course, we don't monitor the proportion of reasons, so it makes sense that this has slipped.
  2. Wow, the total client population counts are wildly different. Does this mean Legacy client_ids are being cloned? Glean client_ids are being regenerated? Something else?
  3. No sign of the effect. Dang.

I'm rather at a loss once again... but y'know what? I've never actually tried this locally. I should do that, especially now that the early instrumentation has landed.

Well, would you look at that. I can start and exit Firefox without FOG ever being initialized. BrowserGlue is not guaranteeing that its init task runs.

I have to be pretty darn quick. But my computer's pretty fast, and this is an opt build.

And wouldn't you know it, in each and every session Telemetry is able to submit and persist an "event" ping with telemetry.early_event#startup in it. (Naturally, only in the app sessions where FOG inits do we record a fog.validation.validate_early_event in a Glean "events" ping.)

This might be it.

Well, poo. What do we do?

See Also: → 1839426
See Also: → 1850873
See Also: → 1850878
See Also: → 1876445
