Closed Bug 1552508 Opened 5 years ago Closed 5 years ago

Validate the 'events' ping in Fenix

Categories

(Data Platform and Tools :: Glean: SDK, task, P1)

task

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: Dexter, Assigned: chutten)

References

Details

This bug is about validating the 'events' ping in Fenix. Some data is available in the &-browser too, if needed for comparison.

Documentation can be found here.

Type: defect → task

Chris, can you take a look at when this could fit into your schedule?
Or alternatively, if you have to many things on your plate, guide Alessio through it and review his analysis?

Flags: needinfo?(chutten)

Validation conditions I can read out from the docs:

  • Number of events must not exceed 500 (the default of Glean.configuration.maxEvents which I think Fenix has left unchanged)
  • Are event timestamps >= 0?
  • What distribution do the numbers of events follow? There's no validation step here, just exploration as its distribution is controlled more by instrumentation than by the mechanics of glean sending the pings.
  • Similarly, what are the most common categories we're seeing?
  • Ping delay and clock skew.
  • Cross-comparison: do the proportions of client_info and ping_info fields seem reasonable compared to those seen in "baseline" pings. (e.g. do we see the same proportion of seq dupes/holes? Do we see the same client count?)

There's not much here to look at (which is a nice change of pace from the "everything but the kitchen sink" approach in some Desktop pings). Alessio, did I miss anything?

Assignee: nobody → chutten
Status: NEW → ASSIGNED
Flags: needinfo?(chutten) → needinfo?(alessio.placitelli)
Priority: -- → P1

(In reply to Chris H-C :chutten from comment #2)

  • Are event timestamps >= 0?

In addition to this, I'd also check if the "timespamp gaps" we see are reasonable: we don't expect events in the same events ping to be thousands of days apart.

All the rest looks great, thanks for taking on this!

Flags: needinfo?(alessio.placitelli)

Here's what I saw when I looked into Fenix "events" pings from May 8-15.

Scope

The period under analysis had 112,365 pings from 6,152 clients which is about the same number of pings and clients we saw from the "baseline" analysis over pings from the same period. This seems reasonable as backgrounding the app sends both an "events" ping (if there are events) and a "baseline" ping (unconditionally).

Ping and Client Counts

We seem to average under 15k pings from under 2k clients.

Most clients send few "events" pings, with over half sending fewer than 7 over the week. Seen another way, we see per-client average ping rates mostly at 2 or 1 pings per day. This is consistent with most clients using the browser twice per day over the work week, or once per day evenly.

The outliers are always fun. 386 clients (6.3%) sent more than 75 pings over the period, and the highest average ping rate was 103 pings per day. Both of these are ridiculous, but only the former is of any concern: why so many with so many pings? Test clients? If so I'd expect their proportion to decrease if the population increases.

Duplicates

"events" pings seem to have more or less the same magnitude of duplicates problem that "baseline" pings does. 3,148 pings (2.8%) coming from 769 clients (12.5%) are any kind of dupe. About half are "full" dupes (where even the document_id is duplicate), and the other half are only "half" dupes (where the document_id differs, but the {client_id, seq} tuple are repeated).

What will be interesting is if the "metrics" ping exhibits the same proportion of duplicates, given its different triggering mechanisms.

Event counts

There are no pings with over 500 events. There are very few with over 50 (there's precisely one ping that went as high as 450).

You should be used to these decay curves by now. What I find neat are the discontinuities between 1 and 2-3 and 4+. It's like they're three separate cohorts from three other plots. For whatever reason "1" is a very popular count of events. Somewhat less so, 2 and 3 are less popular than 1 but about as popular as each other. And then 4+ is this smooth curve.

I was worried there for a while that it might be a symptom that events weren't being cleared and were being resent. Some further probing of the data (mostly visual inspection) has cleared my mind of this. I also noted that app_opened isn't always the first event, coincidentally confirming that we're successfully sending "events" pings from multi-foreground sessions.

Event Timestamps

0 event timestamps are less than 0.

( I'm at EOD, so I'll have to take the rest tomorrow )

(Wait, this isn't tomorrow. Whoops.)

Event Timestamps (part 2)

Originally I tried looking at the distribution of event timestamps as minutes, but that didn't go very far. Even if you exclude the first event of 0, you still get swamped by the sheer number of events that happen within the first minute. That makes it hard to see the long tail of "longer than it should be... right?" times, even if you cap at a day's number of minutes (1440).

So then I looked at something that I could definitely say was wrong: How many events pings contain events that cover a longer span than the ping itself covers? The answer is 1.4% of pings across 12.6% of clients. If I add an extra minute of wriggle room it only drops to 1.0% of pings across 9.1% of clients, so it's not just close calls.

In fact, if we graph the distribution of the differences in "event time" to "ping time" from those pings where "event time" is longer than "ping time" we can see half of the pings show times in their events that are over 3min longer than the times reported by the pings.

I'll reiterate that this is a small proportion of pings (1.4%), but it might be worth thinking up theories about by which mechanism events might be coming in over a wider period than the ping covers.

Event Category Exploration

Fenix sends events from few categories: only quick_action_sheet, find_in_page, bookmarks_management, context_menu, custom_tab, and (the imaginatively-named) events are represented in the sample. Since there are so few, I threw the names in as well (though plotly in re:dash didn't handle it terribly well): https://sql.telemetry.mozilla.org/queries/62960/source#161501

One thing that strikes me is that the only event that has more events than clients is quick_action_sheet.closed. I guess those who want it closed close it often?

Delay and Clock Skew

For this analysis I played with BigQuery. So it took a while. Sorry for the delay.

For clock skew I subtracted the submission_timestamp from the Date header metadata.header.date in milliseconds and then used APPROX_QUANTILES to get the deciles:

{
  "skew_deciles": [
    "-172798612",
    "34",
    "611",
    "896",
    "1108",
    "1308",
    "1522",
    "1764",
    "2083",
    "2685",
    "21882861"
  ]
}

Clock skew seems to mostly be on the order of 0-3 seconds (which might just be transmission delay, not skew) though there are, as always, the weird ones on the outside.

A quick expansion to 100 quantiles shows that ~6% of pings that are time travelling more than one second ahead (only 2% more than 4s), and on the other side of thing only 2% are skewing longer than 7s late.

We might actually be able to (sometimes, broadly) trust these clients' clocks? Less than 4% of pings being more than 7s out on either direction is pretty nice. (Future analysis should explore whether this is predominantly certain clients or not.)

On the delay front, the 95 %ile is under 2 hours. Once again, these pings get to us quickly. No need to wait days.

Conclusion

"events" pings seem to be operating within reasonable parameters and can be confidently used for analysis with the usual caveats around watching for duplicates and a unique caveat of watching for pings reporting events over longer spans than the ping's reporting.

Recommendations

  • Dupes, still. Already being looked into for "baseline" pings in bug 1547234 and bug 1554729.
  • Investigate what's going on with those events reporting outside of the reporting window of the ping. It's possible this is something that is perfectly allowed, but if so it's a little weird and should at least be documented.

Alessio, please take a look and let me know if you have any questions/concerns/corrections/etc.

Flags: needinfo?(alessio.placitelli)

(In reply to Chris H-C :chutten from comment #5)

(Wait, this isn't tomorrow. Whoops.)

Event Timestamps (part 2)

Originally I tried looking at the distribution of event timestamps as minutes, but that didn't go very far. Even if you exclude the first event of 0, you still get swamped by the sheer number of events that happen within the first minute. That makes it hard to see the long tail of "longer than it should be... right?" times, even if you cap at a day's number of minutes (1440).

So then I looked at something that I could definitely say was wrong: How many events pings contain events that cover a longer span than the ping itself covers? The answer is 1.4% of pings across 12.6% of clients. If I add an extra minute of wriggle room it only drops to 1.0% of pings across 9.1% of clients, so it's not just close calls.

In fact, if we graph the distribution of the differences in "event time" to "ping time" from those pings where "event time" is longer than "ping time" we can see half of the pings show times in their events that are over 3min longer than the times reported by the pings.

I'll reiterate that this is a small proportion of pings (1.4%), but it might be worth thinking up theories about by which mechanism events might be coming in over a wider period than the ping covers.

Mike, since you've been working with Events and have touched timestamps, do you have thoughts on this one?

I suspect this might have to do with events that were stored to the disk.

Flags: needinfo?(mdroettboom)

Regarding timestamps falling outside of the duration of the ping. I agree with :Dexter that a good working theory is that these are events that were queued on disk and the process was shutdown before we had a chance to send them (this is a normal and frequent thing on Android). These events are then sent immediately on the next start up of the application, so the ping would have a very short duration, but the events might span some longer period of time.

I don't think it's possible to determine which pings were sent immediately at start up in this case, though they all should have start_time == end_time (unless it spans the minute-rollover moment and end_time - start_time == 1 minute). If you filter for event spans greater than the ping span, but only when the ping span is greater than one minute, there are only 5 pings in the whole data set meeting that criteria. Looking the timespan differences on those 5, it's kind of all over the map.

While that very low number of 5 sounds promising, I don't think it's truly conclusive, since I don't think we can truly separate the events pings sent right at startup and those sent in the normal course of business in under 1 minute. Unless someone has some creative ideas...

Flags: needinfo?(mdroettboom)

(In reply to Chris H-C :chutten from comment #5)

(Wait, this isn't tomorrow. Whoops.)

We uncovered your tricks, time traveller!

Event Timestamps (part 2)

Originally I tried looking at the distribution of event timestamps as minutes, but that didn't go very far. Even if you exclude the first event of 0, you still get swamped by the sheer number of events that happen within the first minute. That makes it hard to see the long tail of "longer than it should be... right?" times, even if you cap at a day's number of minutes (1440).

So then I looked at something that I could definitely say was wrong: How many events pings contain events that cover a longer span than the ping itself covers? The answer is 1.4% of pings across 12.6% of clients. If I add an extra minute of wriggle room it only drops to 1.0% of pings across 9.1% of clients, so it's not just close calls.

In fact, if we graph the distribution of the differences in "event time" to "ping time" from those pings where "event time" is longer than "ping time" we can see half of the pings show times in their events that are over 3min longer than the times reported by the pings.

I'll reiterate that this is a small proportion of pings (1.4%), but it might be worth thinking up theories about by which mechanism events might be coming in over a wider period than the ping covers.

I think that this could could be explained by comment 6 and comment 7. We slightly hint at this in the docs, but we should definitely be more explicit about what that could mean for analysis.

@Chris, one thing I noticed is that you're using ELEMENT_AT(events, -1) to access the last element in the events ping, assuming that that's the one with the highest timestamp. However, we don't really enforce this anywhere. Might be worth checking if events are in the timestamp ascending order too? Maybe the last event isn't the one with the highest timestamp in these cases?

[...]
On the delay front, the 95 %ile is under 2 hours. Once again, these pings get to us quickly. No need to wait days.

This is great!

Conclusion

"events" pings seem to be operating within reasonable parameters and can be confidently used for analysis with the usual caveats around watching for duplicates and a unique caveat of watching for pings reporting events over longer spans than the ping's reporting.

We need to address this duplicates problem. Looks like it's the only real issue, if we can confirm our theories about the longer spans for this ping.

Recommendations

  • Dupes, still. Already being looked into for "baseline" pings in bug 1547234 and bug 1554729.
  • Investigate what's going on with those events reporting outside of the reporting window of the ping. It's possible this is something that is perfectly allowed, but if so it's a little weird and should at least be documented.

Alessio, please take a look and let me know if you have any questions/concerns/corrections/etc.

Everything looks great, thank you for validating this. What do you think about the theories for the spans discussed above?

Flags: needinfo?(alessio.placitelli) → needinfo?(chutten)

(In reply to Michael Droettboom [:mdroettboom] from comment #7)

I don't think it's possible to determine which pings were sent immediately at start up in this case, though they all should have start_time == end_time (unless it spans the minute-rollover moment and end_time - start_time == 1 minute). If you filter for event spans greater than the ping span, but only when the ping span is greater than one minute, there are only 5 pings in the whole data set meeting that criteria. Looking the timespan differences on those 5, it's kind of all over the map.

While that very low number of 5 sounds promising, I don't think it's truly conclusive, since I don't think we can truly separate the events pings sent right at startup and those sent in the normal course of business in under 1 minute. Unless someone has some creative ideas...

Thanks for the confirmation on this, it makes me feel much better. I also have another theory, see comment 8 :)

I looked into the theory from comment 8, but I found evidence inconsistent with it. I used ARRAY_MAX(TRANSFORM(events, x -> x.timestamp)) to get the max timestamp instead of ELEMENT_AT(events, -1).timestamp and got the exact same counts.

Comment 7 seems plausible given mdboom's evidence. I kinda/sorta miss the "reason" field when performing these analyses. I liked being able to tell why the ping was sent, and it would help answer questions like these without guessing.

It seems as though "be clear in the documentation that this is the case" is the best way to handle that bit.

Flags: needinfo?(chutten)

(In reply to Chris H-C :chutten from comment #10)

I looked into the theory from comment 8, but I found evidence inconsistent with it. I used ARRAY_MAX(TRANSFORM(events, x -> x.timestamp)) to get the max timestamp instead of ELEMENT_AT(events, -1).timestamp and got the exact same counts.

Comment 7 seems plausible given mdboom's evidence. I kinda/sorta miss the "reason" field when performing these analyses. I liked being able to tell why the ping was sent, and it would help answer questions like these without guessing.

Thank you for following up on this. I think we can call this complete!

It seems as though "be clear in the documentation that this is the case" is the best way to handle that bit.

Mike, do you mind updating the docs both in glean-ac and glean-core, since you already have a PR open for the second? :)

Flags: needinfo?(mdroettboom)

One outstanding ni for docs seems close enough to done for me.

Status: ASSIGNED → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Flags: needinfo?(mdroettboom)
You need to log in before you can comment on or make changes to this bug.