Closed Bug 1520182 Opened 5 years ago Closed 5 years ago

Validate incoming "baseline" ping data

Category: Toolkit :: Telemetry, enhancement, P1
Status: RESOLVED FIXED
Tracking: firefox66 --- affected
People: Reporter: Dexter; Assigned: chutten
Whiteboard: [telemetry:mobilesdk:m6]

Now that we're sending the baseline ping from the reference-browser, it might be a good time to check that we're receiving data that makes sense.

This bug is about analyzing the ingested data, to check:

  • that we're not seeing holes in the sequence numbers;
  • that all the recorded fields have data which we expect (matching format, matching schema, within range, etc.);
  • that we're not receiving too much unexpected stuff.
Blocks: 1520179
Priority: -- → P3
Whiteboard: [telemetry:mobilesdk:m4]

Georg, Chris - Given our past experience with these kinds of validations, do you have any other suggestions on what to check?

Mike, Travis - This bug is about checking the incoming data to make sure that everything is all right. Can you think of anything else in addition to the things in comment 0?

Flags: needinfo?(tlong)
Flags: needinfo?(mdroettboom)
Flags: needinfo?(gfritzsche)
Flags: needinfo?(chutten)

I thought about this from a couple of different angles and couldn't come up with anything that was directly related to validating the data. I assume that this will be an ongoing thing since certain issues will only crop up once we have a decent sample size to look at, so I may come up with something as we start to see patterns emerge.

Flags: needinfo?(tlong)

Things that come to mind...

  • Frequency: Per-client and in aggregate, are pings coming in at about the rate we expect? (A sketch of such a query follows this list.)
  • Contents: Do pings contain all the things we expect and nothing else?
  • Contents distribution: Do the contents follow expected distributions (i.e., do sequence numbers follow an expected downward curve of value frequencies, are app_build and telemetry_sdk_build both real releases, are the OSes all real)?
  • Times: Do the timestamps "make sense" (are they within a day or so of receipt, from timezones we expect to see, mostly from the daytime)? Are durations "reasonable" (less than the length of time the &-browser has existed; are they supposed to be less than a day)? And those related timestamps start_time and end_time: are they always in the right order? Is meta/Date close to end_time (does glean attach the Date HTTP header)? Is meta/Timestamp close to meta/Date?
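
For the Frequency bullet, here's a minimal sketch of what such a check could look like (the table and column names are assumptions about where these pings will land, not confirmed schema):

    -- Pings per client per day; eyeball the tail for outliers (sketch).
    SELECT submission_date_s3,
           client_id,
           COUNT(*) AS pings
    FROM org_mozilla_reference_browser_baseline_parquet
    GROUP BY submission_date_s3, client_id
    ORDER BY pings DESC
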
Flags: needinfo?(chutten)

I can't think of anything to add to the comments above.

Flags: needinfo?(mdroettboom)
Assignee: nobody → mdroettboom
Assignee: mdroettboom → nobody

Taking this

ni?Dexter for the min version number after which data is ready for validation.

Assignee: nobody → chutten
Status: NEW → ASSIGNED
Flags: needinfo?(alessio.placitelli)
Priority: P3 → P1

(In reply to Chris H-C :chutten from comment #5)

Taking this

ni?Dexter for the min version number after which data is ready for validation.

Ah! Turns out today's build is the minimum build number we should aim for!

Display Version: 1.0.1904
Build #10261219

We should also take a look at the latency of incoming data, to have a number for future reference.

Flags: needinfo?(alessio.placitelli)

(In reply to Alessio Placitelli [:Dexter] from comment #6)

Display Version: 1.0.1904
Build #10261219

Forget about the previous version numbers. The final data is from the build right after that.

Display Version: 1.0.1905
Build #10281206

Let's also have Megan and Frank review this, in addition to me.

So, funny story. Turns out "baseline" pings aren't available in the Dataset API at all. (This is because we're transitioning it all to GCP anyway, so why go through the effort.)

The pings that validate against the schema are put into a parquet table, though: org_mozilla_reference_browser_baseline_parquet

(The pings that don't validate should be shunted to the telemetry-errors stream which is a source that Dataset can grab records from. So there's that.)

Unfortunately the org_mozilla_reference_browser_baseline_parquet table uses types that are incompatible with reading via Spark: https://github.com/apache/spark/pull/1737

You can see this in redash as well if you try to query all the columns. If you omit the metadata column entirely, queries will still go through.
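
For example, a shape like this goes through (a sketch; the column names other than the table's are assumptions):

    -- Everything except the problematic metadata column (sketch).
    SELECT submission_date_s3,
           client_id,
           seq
    FROM org_mozilla_reference_browser_baseline_parquet
    LIMIT 10
    -- Adding the metadata column (or SELECT *) is what trips the type issue.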

So it looks as though the verification analysis will have to be done via redash for now. :frank and :robotblake are confident they can think of a couple of workarounds if I need them, but I think I should be able to do a lot of it with clever applications of SQL.

Whiteboard: [telemetry:mobilesdk:m4] → [telemetry:mobilesdk:m6]

I took a preliminary look into available "baseline" pings. Here's my report.

Scope

I was able to look at over 21 thousand pings from clients running builds at least as new as 10281206 with pings received after 2019-01-28.

Ping and Client Counts

Aggregate

https://sql.telemetry.mozilla.org/queries/61238#157776
In the cohort we've not yet hit 1k DAU, but we can receive nearly 5k pings in a given day.
The curve is consistent with an adoption curve of builds >= 10281206 intersecting with a weekend slump.

Also of note is Frank's query of *AU. Originally I thought it was alarming how many WAU and MAU it's showing, but it makes sense that &browser's engagement rate is lower than Firefox's. The raw numbers (WAU flattening at 2.5k) are a little higher than expected, I'm told. Apparently we have many fewer than that enrolled in the Beta.

It's not that client_ids are cycling on build updates, as hundreds of clients are popping up with the same client_id across multiple builds.
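
Ruling that out is a one-liner, more or less (a sketch; the client_id and app_build column names are assumptions):

    -- Clients whose client_id persists across more than one build (sketch).
    SELECT COUNT(*) AS multi_build_clients
    FROM (
      SELECT client_id
      FROM org_mozilla_reference_browser_baseline_parquet
      GROUP BY client_id
      HAVING COUNT(DISTINCT app_build) > 1
    ) AS multi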

Per-client, Per-day

There are some outliers when looking at the number of pings sent per-client and per-day, but for the most part they're also both the expected exponential decay curves.

Sequence Numbers

Distribution

Aggregate distribution of sequence numbers is exactly the exponential decay we'd expect: https://sql.telemetry.mozilla.org/queries/61239/source#157777
We expect to see exactly one client reporting one ping with a given seq, and we see that. There are only two extra pings, one with seq 0 and another with seq 17, accounting for less than 0.01% of received pings. Nice.
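
(The dupe check itself is simple; a sketch, with column names assumed:)

    -- (client_id, seq) pairs reported more than once (sketch).
    SELECT client_id, seq, COUNT(*) AS copies
    FROM org_mozilla_reference_browser_baseline_parquet
    GROUP BY client_id, seq
    HAVING COUNT(*) > 1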

The highest seq is over 400. I originally thought this was Alessio, but the pings have start_time timezones of -05:00, which points to an Eastern Standard Time culprit. So maybe it's Mike.

It was tempting to try to correlate the number of pings/clients with seq 0 against WAU/DAU across the builds, but it doesn't make sense to do so until modern-enough builds have hit saturation in the population.

Holes

Here's some bad news: at least 16.8% of clients have one or more hole in their seq sequence.

The query undercounts because it doesn't attempt to detect holes at the beginning of the sequence (i.e., sequences that don't start at 0), because I'm not sure that the Scope is clean enough to have caught the beginning of everyone's seq record. It also can't detect holes at the end of the sequence.
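
For reference, the interior-hole detection has roughly this shape (a sketch; column names are assumptions, and it shares the limitations just described):

    -- A client's consecutive seq values should differ by exactly 1 (sketch).
    WITH deltas AS (
      SELECT client_id,
             seq - LAG(seq) OVER (PARTITION BY client_id ORDER BY seq) AS gap
      FROM org_mozilla_reference_browser_baseline_parquet
    )
    SELECT COUNT(DISTINCT client_id) AS holey_clients
    FROM deltas
    WHERE gap > 1  -- gap of 1 is consecutive, gap of 0 is a dupe, gap > 1 is a hole
    -- Like the query above, this misses holes at the start (seq not starting
    -- at 0) and at the end of each client's sequence.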

"Holey" clients are most likely to have a sequence of length 4. This doesn't mean anything, it's just the intersection between lower seq values being more frequent and longer sequence lengths having more opportunities for holes. (though it might be fun to try and determine what distribution holes follow (e.g. uniformly random?) by taking it as a proportion of the number of pings we expect to see with seq values that high)

The sum of the lengths of all holes in the record is most likely to be 1. This is unsurprising given the short lengths of sequences overall and the relative rarity of seq holes.

There is a hole of length -1 from the client who sent two pings with seq 0. Since we didn't catch the duplicate with seq of 17, it either means they have two holes and their sum is undercounted, or they had one hole nullified precisely by the single dupe and thus were missed by the query altogether. This is of curiosity-level value.

Field Compositions

Note: I initially found it difficult to find the "baseline" ping's metrics. The docs identify, for example, the duration metric. But it needs to be found at metrics.timespan['glean.baseline.duration']. Both the timespan and the glean.baseline. namespacing were unclear from the docs.
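
For anyone reproducing this, the access path looks like the following (a sketch; the map key is from the data, but the exact shape of the value is an assumption about the parquet layout):

    -- Pulling duration out of the nested metrics structure (sketch).
    SELECT metrics.timespan['glean.baseline.duration'] AS duration
    FROM org_mozilla_reference_browser_baseline_parquet
    LIMIT 10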

The distribution of durations is a lightly-sloping exponential. There's a bump around the one-minute (60s) mark suggesting maybe there's some automation at play already? Or do apps get sent to the background when the screen dims and 60s is just a common length for that setting on Android?

Most of the "baseline" metrics check out. I worked the query in a way that might be adaptable to regular alerting using re:dash's tooling (a sketch follows the list below).

  • duration: 4 pings have NULL duration. All the rest have unit 'second', as expected. There were also 299 pings with 0 seconds of duration.
  • os: All of the pings have 'Android' for their os. (Note: The docs use 'android' without the capital A. We may wish to update that.)
  • os_version: All pings have an os_version of some value or another. Moreover, they're all >= 16 (in fact, they're all >= 21). I couldn't find out the minimum system requirements of &browser (even the Google Play Store won't tell me), so I went with "at least newer than the API version we check for Fennec".
  • device, device_manufacturer, and device_model: bug 1522552 was merged after build 10281206, so I expected to see many NULL values. But no, all pings have valid device information. The top manufacturers are Xiaomi, Samsung, Google, and OnePlus. The models are scattered to the winds and have no clear winners, really. A quick scan showed nothing too strange (though I thought TP-Link was a router manufacturer...)
  • architecture: All pings have architecture information. Moreover, they all start with arm or x86, with arm64-v8a the overwhelming favourite (though there are a couple of token x86 and x86_64 entries).
  • locale: No pings contain locale, which is weird for a field we're including in the "baseline" ping.
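
Here's the adaptable shape I mean, roughly (a sketch; every field access path except duration's is an assumption about the table layout):

    -- Per-field checks: each count of unexpected values should stay at zero (sketch).
    SELECT COUNT_IF(metrics.timespan['glean.baseline.duration'] IS NULL) AS null_duration,
           COUNT_IF(os != 'Android') AS unexpected_os,
           COUNT_IF(architecture NOT LIKE 'arm%'
                AND architecture NOT LIKE 'x86%') AS unexpected_arch,
           COUNT_IF(locale IS NOT NULL) AS pings_with_locale  -- currently always 0, see above
    FROM org_mozilla_reference_browser_baseline_parquet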

Delay

I didn't study ping delays as it requires the use of the metadata fields which I can't reach using available tooling. Alas.

Conclusion

I conclude that these pings are mostly complete but should not yet be used for any decision-making analyses.

Recommendations

It appears as though there's a widespread problem affecting the ability of hundreds of clients to send their "baseline" pings (or for us to receive them). I did check the telemetry-errors stream for "baseline" pings and found none, suggesting the problem is in transmission. This is the primary problem holding up validation.

duration isn't as reliable as we'd probably like. 4 pings with NULL duration is odd. 299 pings with 0s of duration may throw off analyses. I recommend looking a little into NULL duration values, especially if they increase in number. I also recommend that we consider rounding fractional time units to the next whole second value so that adding pings to an analysis always increases the time over which the measurements were taken. (I presume duration will be used as a denominator for many decision-making metrics)

locale should have a value set, even if it's und-ZZ ("unknown language" ISO 639 + "Unknown or Invalid Territory" Common Locale Data Repository). Otherwise it should be omitted from the docs like other not-yet-implemented fields (I'm looking at you "field that should not be called 'profile_age'").


Please let me know if you have any questions/concerns/corrections.

Flags: needinfo?(gfritzsche)

@dexter I wonder if recording the SDK version (?) like 21 is a good idea for os_version.

There really are three possible values for a device, it seems: FireOS 6.3.0.1, SDK Level 21, Android 7.0.1.

And I bet we want to understand all of them in reporting.

Flags: needinfo?(alessio.placitelli)
Blocks: 1525603

(In reply to Chris H-C :chutten from comment #9)

I took a preliminary look into available "baseline" pings. Here's my report.

First of all, thank you for your efforts and this great in-depth analysis!

Scope

I was able to look at over 21 thousand pings from clients running builds at least as new as 10281206 with pings received after 2019-01-28.

Should we time-box these pings? Should we add a "no older than" clause so that queries are reproducible and frozen in time?

Ping and Client Counts

Aggregate

Also of note is Frank's query of *AU. Originally I thought it was alarming how many WAU and MAU it's showing, but it makes sense that &browser's engagement rate is lower than Firefox's. The raw numbers (WAU flattening at 2.5k) are a little higher than expected, I'm told. Apparently we have many fewer than that enrolled in the Beta.

Do you mean that there are fewer users on &-browser compared to Firefox Beta (Fennec)?

It's not that client_ids are cycling on build updates, as hundreds of clients are popping up with the same client_id across multiple builds.

So is this just a matter of users churning out/re-installing?

Sequence Numbers

Distribution

It was tempting to try to correlate the number of pings/clients with seq 0 against WAU/DAU across the builds, but it doesn't make sense to do so until modern-enough builds have hit saturation in the population.

Yes, I think it makes sense to wait at least a couple more weeks for that to happen.

Holes

Here's some bad news: at least 16.8% of clients have one or more hole in their seq sequence.

Nice. I think we can explain this by the fact that we currently attempt to send every ping once; if that fails, the ping is kaputt.

There is a hole of length -1 from the client who sent two pings with seq 0. Since we didn't catch the duplicate with seq of 17, it either means they have two holes and their sum is undercounted, or they had one hole nullified precisely by the single dupe and thus were missed by the query altogether. This is of curiosity-level value.

The only explanation I have for the double-seq-0 is that we were always sending "seq=0" before sequence numbers were implemented, so some rogue seq=0 might have been sent with the update? This, however, doesn't really make any sense, since the "always seq=0" ping would have an older buildId and version.

Field Compositions

Note: I initially found it difficult to find the "baseline" ping's metrics. The docs identify, for example, the duration metric. But it needs to be found at metrics.timespan['glean.baseline.duration']. Both the timespan and the glean.baseline. namespacing were unclear from the docs.

This could use some more clarity in the docs, agreed. Filed bug 1525578.

The distribution of durations is a lightly-sloping exponential. There's a bump around the one-minute (60s) mark suggesting maybe there's some automation at play already? Or do apps get sent to the background when the screen dims and 60s is just a common length for that setting on Android?

This is interesting and I have no clue. Intuitively, the data makes sense :\

Most of the "baseline" metrics check out. I worked the query in a way that might be adaptable to regular alerting using re:dash's tooling.

  • duration: 4 pings have NULL duration. All the rest have unit 'second', as expected. There were also 299 pings with 0 seconds of duration.

Filed bug 1525600.

  • os: All of the pings have 'Android' for their os. (Note: The docs use 'android' without the capital A. We may wish to update that.)

I added a note in bug 1525578.

  • os_version: All pings have an os_version of some value or another. Moreover, they're all >= 16 (in fact, they're all >= 21). I couldn't find out the minimum system requirements of &browser (even the Google Play Store won't tell me), so I went with "at least newer than the API version we check for Fennec".

The minimum API level supported by a-c components is API21, so that's expected :)

  • locale: No pings contain locale, which is weird for a field we're including in the "baseline" ping.

Gah, that's another thing that slipped through.

Delay

I didn't study ping delays as it requires the use of the metadata fields which I can't reach using available tooling. Alas.

Conclusion

I conclude that these pings are mostly complete but should not yet be used for any decision-making analyses.

Recommendations

It appears as though there's a widespread problem affecting the ability of hundreds of clients to send their "baseline" pings (or for us to receive them). I did check the telemetry-errors stream for "baseline" pings and found none, suggesting the problem is in transmission. This is the primary problem holding up validation.

Bug 1508965 is going to deal with this.

duration isn't as reliable as we'd probably like. 4 pings with NULL duration is odd. 299 pings with 0s of duration may throw off analyses. I recommend looking a little into NULL duration values, especially if they increase in number. I also recommend that we consider rounding fractional time units to the next whole second value so that adding pings to an analysis always increases the time over which the measurements were taken. (I presume duration will be used as a denominator for many decision-making metrics)

Filed bug 1525600.

locale should have a value set, even if it's und-ZZ ("unknown language" ISO 639 + "Unknown or Invalid Territory" Common Locale Data Repository). Otherwise it should be omitted from the docs like other not-yet-implemented fields (I'm looking at you "field that should not be called 'profile_age'").

I filed bug 1525540 (locale) and bug 1525045 (firstRunTime aka profile_age)


Please let me know if you have any questions/concerns/corrections.

I filed bug 1525603 as a follow-up validation bug. I'll make all the newly filed bugs block it, so that we'll know once we're ready to validate again.

(In reply to stefan from comment #10)

@dexter i wonder if recording the SDK version (?) like 21 is a good idea for os_version

there really are three values possible for a devicve it seems: FireOS 6.3.0.1, SDK Level 21, Android 7.0.1

and I bet we want to understand all of them in reporting

Cool, I filed bug 1525606 for discussing this.

Flags: needinfo?(alessio.placitelli)

(In reply to Alessio Placitelli [:Dexter] from comment #11)

(In reply to Chris H-C :chutten from comment #9)

Scope

I was able to look at over 21 thousand pings from clients running builds at least as new as 10281206 with pings received after 2019-01-28.

Should we time-box these pings? Should we add a "no older than" clause so that queries are reproducible and frozen in time?

I think this will be more important for later verification, but yes. For the purposes of reproducing this analysis, the queries can be assumed to have an AND submission_date_s3 < '20190205' clause and not be too far off.
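
Concretely, a frozen window would look something like this (a sketch; app_build's type and the exact bounds are assumptions):

    -- Time-boxing the analysis so queries reproduce (sketch).
    SELECT COUNT(*) AS pings
    FROM org_mozilla_reference_browser_baseline_parquet
    WHERE submission_date_s3 > '20190128'
      AND submission_date_s3 < '20190205'
      AND app_build >= '10281206'  -- string comparison; the column's type is an assumption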

Ping and Client Counts

Aggregate

Also of note is Frank's query of *AU. Originally I thought it was alarming how many WAU and MAU it's showing, but it makes sense that &browser's engagement rate is lower than Firefox's. The raw numbers (WAU flattening at 2.5k) are a little higher than expected, I'm told. Apparently we have many fewer than that enrolled in the Beta.

Do you mean that there are fewer users on &-browser compared to Firefox Beta (Fennec)?

I mean that for the number of clients we have on the &browser, they use it less. Engagement Rate is DAU/MAU and on Firefox Desktop it's usually 0.4-ish (or it was some years ago when I looked into it). We don't have a steady MAU yet on this population (too early) but DAU/WAU is around 0.4, so what I'm concluding is that we have a user base that's not very engaged with the &browser. Which isn't surprising.
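
For the record, the ratio I'm eyeballing can be computed for one day and its trailing week like so (a sketch; column names are assumptions):

    -- DAU/WAU for a single day over its trailing 7-day window (sketch).
    SELECT CAST(COUNT(DISTINCT IF(submission_date_s3 = '20190204', client_id)) AS DOUBLE)
           / COUNT(DISTINCT client_id) AS engagement_ratio
    FROM org_mozilla_reference_browser_baseline_parquet
    WHERE submission_date_s3 BETWEEN '20190129' AND '20190204'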

It's not that client_ids are cycling on build updates, as hundreds of clients are popping up with the same client_id across multiple builds.

So is this just a matter of users churning out/re-installing?

I need solid install/instance numbers to make comparisons between how many users we expect and how many clients we're seeing. Right now I'm only ruling out that the client_id has this particular fault in it that would inflate things.

I'd love to have churn numbers, but as far as I can tell we have no way to tell if multiple client_ids are from the same device (and we probably don't need to tell that, so we shouldn't).

--

I'll close this out as FIXED (Verification analysis complete) and we can take future conversations to future bugs.

Status: ASSIGNED → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED