Closed Bug 1552211 Opened 5 years ago Closed 5 years ago

Validate the 'baseline' ping in Fenix

Categories

(Data Platform and Tools :: Glean: SDK, task, P1)

task

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: Dexter, Assigned: chutten)

In bug 1525603 we validated the 'baseline' ping in &browser. However, given its small population size, we can't be too sure about some of its results. We need to perform the same analysis on Fenix, which has a bigger test population.

Assignee: nobody → chutten
Type: defect → task
Priority: -- → P1

"baseline" pings, round 3.

Scope

The queries are limited to the pings received between May 8 and 15. (A week's worth of modern pings)

This is about 150,914 pings from 5,850 clients.

(( A lot of this will be in comparison to the "baseline" ping analyses done for the &browser in bug 1545603 and bug 1520182 ))

Ping and Client Counts

Aggregate

At just under 2k DAU we're seeing just under 20k pings per day. (query)
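For readers who want to reproduce the aggregate numbers, here is a minimal sketch of the per-day counting, assuming each ping has been reduced to a (client_id, submission_day) pair. The function name and input shape are hypothetical; this is not the linked query.

```python
from collections import defaultdict


def daily_counts(pings):
    """Return {day: (ping_count, distinct_client_count)}.

    `pings` is an iterable of (client_id, submission_day) pairs.
    This input shape is an assumption made for illustration.
    """
    ping_count = defaultdict(int)
    clients = defaultdict(set)
    for client_id, day in pings:
        ping_count[day] += 1
        clients[day].add(client_id)
    # Distinct clients seen on a day is that day's DAU.
    return {day: (ping_count[day], len(clients[day])) for day in ping_count}
```

Dividing the first number by the second gives pings per active client for that day, which is roughly 10 at the volumes quoted above.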

Per-client, Per-day

There are more clients sending more pings, both per day and overall this week, compared to &browser counts. This is consistent with the idea that Fenix is used more than &browser is (we'll see if the rest of the analysis is consistent with this idea).

Sequence Numbers

Distribution

If you look at #pings - #clients we're getting a fair number of "half dupes" (duplicate {client_id, seq} pairs). We'll look deeper into that later, but while we're here, notice that the distribution of the dupes matches the distribution of the pings overall, meaning there's no underlying pattern to them in terms of client ids or seq numbers. In other words, it's more evidence consistent with the earlier hypothesis that half dupes don't all happen at high or low seq numbers.
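The "#pings - #clients" reading can be sketched as follows, assuming each ping is reduced to a (client_id, seq) pair. For each seq number, the count of pings minus the count of distinct clients that sent that seq is the number of half dupes at that seq. The function name is hypothetical.

```python
from collections import Counter


def half_dupes_per_seq(pings):
    """Return {seq: pings_with_seq - distinct_clients_with_seq}.

    `pings` is an iterable of (client_id, seq) pairs (assumed shape).
    A nonzero value means some client sent that seq more than once.
    """
    pings = list(pings)
    pings_per_seq = Counter(seq for _, seq in pings)
    # De-duplicating the pairs first leaves one entry per client per seq.
    clients_per_seq = Counter(seq for _, seq in set(pings))
    return {
        seq: pings_per_seq[seq] - clients_per_seq[seq]
        for seq in sorted(pings_per_seq)
    }
```

Plotting these values against seq next to the overall ping counts is one way to check that the dupes follow the same distribution as the pings.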

As for the distribution of pings and seq numbers themselves, it's fine. There's a long tail of some really long-lived clients but we're mostly seeing this nice exponential decay curve of clients with low-seq sequences. This is exactly what we've come to expect.

Holes and Dupes

18 clients (0.3%) have any holes in their sequence record at all. This is a nice confirmation that holes aren't really too much of a problem any more.

Dupes, on the other hand, afflict 17% (983) of clients in the sample. Most of those are single-seq dupes, but there are six clients sending more than 25 dupes over the course of the week. (Looking back at the "pings per client" plots above, this isn't the worst thing as the sequence lengths are getting well over 100, but.)

This data is inconsistent with the idea that dupes are due to "weird clients" that are disproportionately present on &browser builds. This is affecting a more general population than that.
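As a rough illustration of the per-client bookkeeping behind the holes and dupes figures, here is a sketch under two assumptions: pings are (client_id, seq) pairs, and a "hole" is a seq number missing between the lowest and highest seq a client sent within the window. The names are hypothetical and the actual queries may define holes differently.

```python
from collections import defaultdict


def sequence_health(pings):
    """Return {client_id: {"holes": int, "dupes": int}}."""
    seqs = defaultdict(list)
    for client_id, seq in pings:
        seqs[client_id].append(seq)

    report = {}
    for client_id, values in seqs.items():
        distinct = set(values)
        # Holes: seq numbers absent from the observed min..max range.
        holes = (max(distinct) - min(distinct) + 1) - len(distinct)
        # Dupes: pings beyond the first for any {client_id, seq} pair.
        dupes = len(values) - len(distinct)
        report[client_id] = {"holes": holes, "dupes": dupes}
    return report


def affected_share(report, key):
    """Fraction of clients with at least one hole or dupe (key selects which)."""
    affected = sum(1 for entry in report.values() if entry[key] > 0)
    return affected / len(report)
```

With the counts above, 18 of 5,850 clients is about 0.3% for holes and 983 of 5,850 is about 17% for dupes.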

Field Compositions

We're still getting the occasional null duration (very occasional: 50 pings out of the 150k). Aside from that, durations are distributed in the usual way.

As for the rest of the fields, they've all normalized very well. No worries here.

Delay

No HTTP Date header for Fenix's "baseline" pings means we're looking at aggregate submission delay only (a refresher: this means delay from "ping created" until "ping received by our servers"). No clock skew adjustment is possible, and we're at per-minute resolution.

We see the same ~4% of over-3-hour submission delays which is interesting. It also proves out how darn quick Glean is in getting data to us.

0.3% are received before they're recorded. (time travellers. Including one from nearly 19 years in the future)
83% of pings are received within a minute of their recording.

The 95th %ile (143.3k pings) stretches out past 90min in Fenix. Longer than the 61min of &browser, but still way better than Desktop.
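The delay figures above boil down to a handful of summary statistics over "received minus created". A minimal sketch, assuming each ping is a (created, received) pair of datetimes and using a nearest-rank percentile; the function name and the exact bucket boundaries are assumptions for illustration:

```python
from datetime import datetime, timedelta


def delay_summary(pings):
    """Summarise submission delay in minutes for (created, received) pairs."""
    delays = sorted(
        (received - created).total_seconds() / 60 for created, received in pings
    )
    n = len(delays)
    # Nearest-rank 95th percentile, computed with integer arithmetic.
    rank = max(1, -(-95 * n // 100))
    return {
        # Received before being recorded: the "time travellers".
        "negative": sum(1 for d in delays if d < 0) / n,
        "within_minute": sum(1 for d in delays if 0 <= d <= 1) / n,
        "over_3h": sum(1 for d in delays if d > 180) / n,
        "p95_minutes": delays[rank - 1],
    }
```

No clock skew correction is applied here, so client clocks that run ahead show up as negative delays.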

A bit of an artificial hump at the 60-61min mark suggests some potential timezone difficulty, but not with enough volume that it's likely to be a problem with how we're doing our storage and calculations. Time will tell.

More analysis needed (same as before):

  • Clock skew adjustments
  • Checking to see if there are commonalities within the group of long-delayed pings. Maybe they're all sent from certain clients, or at certain times of day, or at certain parts of the app lifecycle. "You only have to wait an hour to get 95% of the pings" is only useful if the 95% we receive in that hour are representative (outside of their delay) of the population of the 100%.
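One way to start on the commonality check is to compare, per group, how many pings arrive late. The sketch below assumes each ping is reduced to a (group_key, delay_minutes) pair, where the key could be a client id, an hour of day, or an app lifecycle stage; the function name and the 60-minute threshold are hypothetical.

```python
from collections import Counter


def delayed_share_by_group(pings, threshold_minutes=60):
    """Return {group_key: fraction of that group's pings delayed past the threshold}."""
    total = Counter()
    delayed = Counter()
    for key, delay in pings:
        total[key] += 1
        if delay > threshold_minutes:
            delayed[key] += 1
    return {key: delayed[key] / total[key] for key in total}
```

If the fractions are roughly equal across groups, the late pings are probably representative; a few groups with much higher fractions would suggest the delay is concentrated in them.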

Conclusion

I conclude that "baseline" pings on Fenix look roughly equivalent in quality to "baseline" pings on &browser: quite good with some dupe problems we really need more analysis on (bug 1547234).

Recommendations

  • Add Date headers to the metadata to enable clock skew calculations. It didn't occur to me until I was performing the delay calculation how much skew could affect things.

Alessio, please take a look and let me know your questions, concerns, and corrections.

Flags: needinfo?(alessio.placitelli)

(In reply to Chris H-C :chutten from comment #1)

Holes and Dupes

18 clients (0.3%) have any holes in their sequence record at all. This is a nice confirmation that holes aren't really too much of a problem any more.

Dupes, on the other hand, afflict 17% (983) of clients in the sample. Most of those are single-seq dupes, but there are six clients sending more than 25 dupes over the course of the week. (Looking back at the "pings per client" plots above, this isn't the worst thing as the sequence lengths are getting well over 100, but.)

This data is inconsistent with the idea that dupes are due to "weird clients" that are disproportionately present on &browser builds. This is affecting a more general population than that.

Gah, this makes me sad :( So duplicates are definitely a problem.

Conclusion

I conclude that "baseline" pings on Fenix look roughly equivalent in quality to "baseline" pings on &browser: quite good with some dupe problems we really need more analysis on (bug 1547234).

\o/

Recommendations

  • Add Date headers to the metadata to enable clock skew calculations. It didn't occur to me until I was performing the delay calculation how much skew could affect things.

Alessio, please take a look and let me know your questions, concerns, and corrections.

I will try to push for having these before the Firefox-tv data comes in. Thanks for your analysis, looks great.

Flags: needinfo?(alessio.placitelli)

Looks like Date is available on the BigQuery tables, so that's taken care of. Clock skew is unlikely to be a problem anyway, given how low the submission delay is across the board. This is consistent with my mental model of mobile clients being particularly good at keeping time since they're always connected to the network. And since they're always connected to the network, our latency is low.

Ultimately we'll want to take a look at it to measure its effect, however small it might end up being... but for now I don't have concerns about it affecting Glean-transported data's completeness or timeliness.
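For when that measurement happens, here is a minimal sketch of a Date-header-based skew adjustment. It assumes the Date header carries the client's clock at send time and that network transit time is negligible, so the gap between server receipt and the Date header approximates the client's clock offset. The function and parameter names are hypothetical.

```python
from datetime import datetime, timedelta


def skew_adjusted_delay(created, date_header, received):
    """Return the submission delay with the client clock offset removed.

    created:     ping creation time, by the client's clock
    date_header: HTTP Date header value, by the client's clock
    received:    receipt time, by the server's clock
    """
    # Assumption: transit time is ~0, so this difference is pure clock skew.
    skew = received - date_header
    adjusted_created = created + skew
    return received - adjusted_created
```

Under these assumptions a "time traveller" ping whose client clock runs two hours fast would come out with a small positive delay instead of a negative one.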

Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED