Open Bug 1847950 Opened 7 months ago Updated 5 months ago

Investigate why Glean-sent "topsites-impression"-type "top-sites" pings are more likely to be received, but from fewer `context_id`s than PingCentre-sent pings

Categories

(Toolkit :: Telemetry, task)

ASSIGNED

People

(Reporter: chutten, Assigned: chutten)

Attachments

(1 file, 1 obsolete file)

In bug 1844357 we found that, as expected, Glean is a more reliable sender of Contextual Services data than PingCentre. Except.

For Glean-sent "topsites-impression"-ping-type "top-sites" pings vs PingCentre-sent "topsites-impression" pings:

  • Glean, as expected, receives more pings
  • Glean, unexpectedly, receives pings from fewer context_ids: 0.78% of the context_ids in the PingCentre-sent data never appear in the Glean-sent data in the sample. (Conversely, 0.21% of the context_ids in the Glean-sent data never appear in the PingCentre-sent data, so this nets to an overall volumetric loss of 0.57%.)
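
For reference, the comparison in the list above boils down to two set differences. A minimal Python sketch, assuming each sender's context_ids have already been collected into sets. The function name `exclusive_share`, the toy ids, and the choice of denominator (each sender's own set of context_ids) are assumptions for illustration, not the query that produced the numbers:

```python
def exclusive_share(own, other):
    """Percentage of context_ids in `own` that never appear in `other`.

    Assumption: the denominator is the sender's own set of context_ids.
    """
    if not own:
        return 0.0
    return 100.0 * len(own - other) / len(own)


# Toy data only; real context_ids are opaque identifiers.
pingcentre_ids = {"a", "b", "c", "d"}
glean_ids = {"b", "c", "d", "e", "f"}

pc_not_glean = exclusive_share(pingcentre_ids, glean_ids)  # 1 of 4 ids
glean_not_pc = exclusive_share(glean_ids, pingcentre_ids)  # 2 of 5 ids

# The net figure quoted above is the difference of the two reported shares.
net_loss = round(0.78 - 0.21, 2)
```

If the two shares really do have different denominators, subtracting them is only an approximation of the net difference in distinct context_ids. At these magnitudes that is probably fine.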

Now, this is below the usual rule-of-thumb 1% "give a care" threshold. So maybe we can "just" live with this difference.

But it's the only class of ping across Messaging System and Contextual Services to exhibit this behaviour. So it might be worth investigating why.

Either way, we need a clear explanation for this and confidence in that explanation before we turn off PingCentre-sent Contextual Services pings.

Hypothesis: bug 1837230 found that Glean isn't guaranteed to record data or send information from short, resource-constrained application sessions. "topsites-impression" pings are sent for sponsored topsites that are visible on the newtab page, which happens as early as we can manage on startup. So it's not unreasonable that ultra-short sessions that never initialize Glean are a contributing factor.

This hypothesis is not directly testable as CS pings cannot be joined with anything but other CS pings. But we can look at the received pings of context_ids that appear in only the PingCentre-sent data and see if anything stands out (geo, channel, build, os, number of pings, ...) compared to the broader population. Maybe it's most usual for us to receive at least six "topsites-impression" pings per context_id, but these context_ids only send three? (I'm spitballin' here)
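
One way that cohort comparison could look, as a rough Python sketch. It assumes the received pings are available as dicts with a `context_id` key; that record shape and the function names `pings_per_context` and `compare_pc_only_cohort` are hypothetical and not taken from any real dataset schema:

```python
from collections import Counter
from statistics import median


def pings_per_context(pings):
    """Count received pings per context_id."""
    return Counter(p["context_id"] for p in pings)


def compare_pc_only_cohort(pc_pings, glean_pings):
    """Compare pings-per-context for PingCentre-only ids vs. all PingCentre ids."""
    pc_counts = pings_per_context(pc_pings)
    glean_ids = {p["context_id"] for p in glean_pings}
    # context_ids that only ever show up in the PingCentre-sent data
    pc_only_counts = [n for cid, n in pc_counts.items() if cid not in glean_ids]
    return {
        "population_median": median(pc_counts.values()),
        "pc_only_median": median(pc_only_counts) if pc_only_counts else None,
        "pc_only_contexts": len(pc_only_counts),
    }
```

The same grouping could be repeated per geo, channel, build or os by swapping the counted key, to see whether the PingCentre-only cohort skews anywhere.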

See Also: → 1851034

We're now running an experiment where we trigger a "user is inactive" edge early-ish in an orderly Firefox Desktop shutdown. This really shouldn't make a difference, but it seems to be reversing the coverage of context_ids in topsites-impression pings so that Glean is once again sending more.

It's still close: Glean hears from 0.54% of contexts that PC doesn't, and PC hears from 0.34% of contexts that Glean doesn't, for a net 0.2% gain in Glean's favour. So this clearly hasn't brought it fully up to the level of topsites-click (Glean not PC at 3.2%, PC not Glean at 0.48%)... but it's heartening that there are things we can do to influence this if we so choose.

But how? How did this make any difference at all?

Well, after digging around in some oldish (~3yo) code in the Glean SDK uploader, I might have an idea. On init, Glean will trigger the upload of any at-startup "events" or "metrics" pings, and the initial client-active-causing "baseline" ping. But that simply spawns the glean.upload thread so it can do its own thing.

One thing it can do is wait at most 3 times for 1s each if the upload manager is still loading pending pings from disk. (This number of times and length of time were chosen in the era when Glean only had consumers on mobile, where disk I/O was fast.) If all those upload triggers, and the preinit dispatcher queue with its topsites-impression "top-sites" pings, were flushed during that waiting period, and the pending pings dir took longer than 3s to scan, the glean.upload thread would exit.

If, then, no other ping was submitted, the glean.upload thread would never be restarted, and so would never try to upload anything.
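
To make the suspected sequence concrete, here is a toy Python model of that wait-then-exit behaviour. It is a simplification written for this bug, not the Glean SDK's actual code: the function names are hypothetical, time is simulated rather than slept, and a "later trigger" is assumed to arrive only after the scan has finished.

```python
MAX_WAIT_ATTEMPTS = 3  # "at most 3 times"
WAIT_SECONDS = 1.0     # "for 1s each"


def upload_thread_outcome(scan_seconds):
    """Return "uploads" if the pending pings scan finishes within the
    waiting budget, or "exits" if the thread gives up first."""
    waited = 0.0
    for _ in range(MAX_WAIT_ATTEMPTS):
        if scan_seconds <= waited:
            return "uploads"
        waited += WAIT_SECONDS  # simulated wait, no real sleeping
    # Final check after the last wait.
    return "uploads" if scan_seconds <= waited else "exits"


def pings_uploaded_this_session(scan_seconds, later_trigger):
    """Assumption: any later upload trigger (e.g. another ping being
    submitted) restarts the thread after the scan is done."""
    if upload_thread_outcome(scan_seconds) == "uploads":
        return True
    return later_trigger
```

In this model a 2.5s scan uploads as normal, while a 3.5s scan uploads only if something later in the session triggers an upload again.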

But if we changed behaviour so that, during shutdown, we triggered a client-inactive edge, which submits some pings, which in turn triggers the upload... well, that might just kick things back into gear and give Firefox a chance to upload those submitted pings that have been lying around.

Maybe. (This code is complicated).

But if this is so, then I think there might be an SDK change here we should try: triggering upload once the pending pings dir has finished being scanned. It might not be a straightforward fix (since we don't want to do this every time we scan the dir, only when we're doing so as part of initializing the global Glean), but it's the closest thing to a lead I've had in some time.
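
The shape of that idea, as a purely illustrative Python sketch. The class `UploadManagerModel`, its methods, and the `during_init` flag are all hypothetical; the real SDK is written in Rust and is structured differently, so this only shows the "trigger once, and only for the init-time scan" condition:

```python
class UploadManagerModel:
    """Toy stand-in for an upload manager; not the Glean SDK's API."""

    def __init__(self):
        self.upload_triggers = 0

    def trigger_upload(self):
        # In the real SDK this would (re)start the upload thread.
        self.upload_triggers += 1

    def scan_pending_pings(self, during_init):
        # ... the directory scan itself would happen here ...
        # Only the init-time scan triggers an upload afterwards, so
        # other scans don't cause extra upload attempts.
        if during_init:
            self.trigger_upload()
```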

Assignee: nobody → chutten
Status: NEW → ASSIGNED
Attachment #9353485 - Attachment is obsolete: true

Keeping the bug open to track whether a subsequent Glean release, once vendored, makes any difference.
