Closed Bug 1503988 Opened 2 years ago Closed 2 years ago

Review query and claims around Subsession Gaps

Categories

(Data Science :: Review, task, P3)

x86_64
Linux
task

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: chutten, Assigned: flawrence)

Details

I was looking into gaps in the subsession record and developed what I -think- is a metric of proportion of clients having at least one missing main ping in the past week's record of pings.

I would like this Weekly Subsession Gap Rate metric's concept (as in, does it even make sense) and its implementation (as in, does the SQL implement the metric) reviewed.

The query's here: https://sql.telemetry.mozilla.org/queries/59750/source#154467

You'll notice from the query that I am comparing pre-pingsender builds and pingsender builds on the basis of this rate. I would like that reviewed as well: are the differences between the two cohorts real?

(Helpful to validation may be these figures on pingsender saturation over the examined timeframe: https://sql.telemetry.mozilla.org/queries/59766/source#154498 )

Background:
Each "main" ping[1] has a profileSubsessionCounter that monotonically increases for each subsession in a profile's lifetime. It is expected that we should usually receive a perfect subsession record within a week, as pings that failed to send would be retried until they succeed. 

Pingsender[2] is a small application that ships with Firefox and sends "main" pings with reason "shutdown" as of Firefox 55. This is to reduce latency in receiving pings.[3]

[1]: https://firefox-source-docs.mozilla.org/toolkit/components/telemetry/telemetry/data/main-ping.html
[2]: https://firefox-source-docs.mozilla.org/toolkit/components/telemetry/telemetry/internals/pingsender.html?highlight=pingsender
[3]: https://blog.mozilla.org/data/2017/09/19/two-days-or-how-long-until-the-data-is-in/
Priority: -- → P3
We do not currently have anyone available to work on this request this week. I will keep it on our list to triage again.
Assignee: nobody → flawrence
Status: NEW → ASSIGNED
I have never seen SQL taken to such heights before, this felt like an algorithms test! Love it!

There were a few things I wouldn't call "gaps" that caused your metric to be so high:
1) Pings aren't being delivered sequentially. A ping for subsession 3 might have come in on Monday night, and pings for subsessions 2 and 4 might have come in on Tuesday morning. IMO this isn't really a "gap", though it is something we should be aware of.

It looks like you tried to avoid this by partitioning over the preceding 7 days and turning it into a weekly rate, but really you're just pushing the problem back (this week's Monday/Tuesday data collection problem won't be a problem on you graph for Tuesday this week, it'll be a problem for next week when it's at the back of the window). And you're also ignoring the forward looking problem where some of today's gaps will be filled tomorrow.

2) Duplicate pings were being counted as gaps - if we have two pings for one subsession_id then that was being counted as a gap

3) Split sessions were being counted as gaps - if a session "forks" so that two subsessions with different subsession_id both share a previous_subsession_id, then this was counted as a gap.


I fixed all these problems here (and turned it into daily, not weekly):
https://sql.telemetry.mozilla.org/queries/60325/
And here's a dupe-hating version where dupes still count as "gaps":
https://sql.telemetry.mozilla.org/queries/60355/


I discovered these problems (and more!) by sampling raw data from five allegedly-gappy (client_id, submission_date_s3) pairs and working out what was actually going on in those cases.

I also discovered another non-gappy problem that my query doesn't ignore:
4) One client sent 3 subsessions with a 2016 subsession_start_date, as well as some contemporary subsessions.


Are we OK with <1% of clients having a gap on a given day? Should I dive back in and sample more gappy clients to see what's going on now that we filtered out non-sequentials, splits and dupes?

I'm curious what happened around September 10 that seems to have increased the gap rate for pingsender clients.
Let me see if I understand the query... it collects the subsessions reported on a given day, then uses the two days before and after to try and fill the gaps. It reduces over the array of those five days' subsessions and, looking only at the subsession range for the given submission date, checks to see if there are gaps (and, optionally, dupes).

Looks good. I'm generally glad to see things below 1%, though for a daily metric like this I was hoping for lower. For the same reasons MAU is bigger than DAU I expect a weekly version of this metric would be higher than a daily? 

To the points in order:

1) So the edges are likely to be holey? So for a given Monday where we have subsessions 3-6, we often receive 3,4, and 6 but not 5 (which we receive on Tuesday)? That's interesting. I was expecting the holes to be more likely on on the edge (receiving 3-5 on Monday and 6 on Tuesday).

2) Ah, right, whoops. :S

3) Forked sessions shouldn't have been an issue since the profile subsession counter should monotonically increase for both independently (since a forked session implies a copied profile, no?)

4) Yeah, we get some weird stuff. I think we filter out the dates with 5-digit years now, but we didn't always :S

> Are we OK with <1% of clients having a gap on a given day? Should I dive back in and sample more gappy clients to see what's going on now that we filtered out non-sequentials, splits and dupes?

I'd like this number to be lower (who wouldn't) but 1% is a useful rule of thumb for "should we care about this now". I wouldn't mind a knock-on analysis checking:
a) the distribution of these gappy clients. Is it a uniform 1% chance a given client will have a gap today, or is it the same 1% each day?
b) the influence of these gaps on core metrics (dare I say KPIs?) like search counts and aDAU. Is it likely these gappy clients would be considered aDAU if not for the gaps? How many searches are we likely to be missing?
c) the rate calculated weekly so I can more easily see exactly _how_ wrong my initial analysis was :)

> I'm curious what happened around September 10 that seems to have increased the gap rate for pingsender clients.

On Sept 4 we switched our enterprise-focused Extended Support Release (ESR) channel from a pre-pingsender version (52) to a has-pingsender version (60)[1]. It wouldn't surprise me to learn that ESR-using users have a different expected gap rate.

[1]: https://wiki.mozilla.org/Release_Management/Calendar
I also had a thought. My original hypothesis was that a dearth of "shutdown" pings would cause holes in the subsession record. This wouldn't manifest itself on a daily metric as a Monday with subsessions 3, 4, 5, 6 but losing 6 would report 3, 4, 5 which would appear gap-free.

So I reworked your query slightly to use a two-day window so that Monday with 3-6 and Tuesday with 7-9 would catch a missing 6 because it would be looking for 3-9 instead of 3-5: https://sql.telemetry.mozilla.org/queries/60381/source#155923

(( By reworking it this way it means that it's checking the previous and subsequent two _windows_ (2-day periods) of subsessions instead of previous and subsequent two days, but since it didn't affect the query performance I didn't try to change it. ))

The result is more-or-less what I expect from the daily metric: everything's slightly magnified. The gap between has pingsender and no pingsender seems to be closing, so I developed a variant checking the week... which exhausted available resources. So I narrowed the range to the last few weeks and: https://sql.telemetry.mozilla.org/queries/60384/source#155931

Looks like the rate's higher, as expected... but the "has pingsender" crowd is now exceeding the "no pingsender" crowd for gap rates.
1) Receiving 3-5 on Monday and 6 on Tuesday isn't really a "gap", is it? I thought a "gap" was where we don't receive a ping at all?

3) Nothing too nefarious going on here - I just thought it worth calling out "forked sessions" and "duplicated pings" as separate things. If we ignore duplicated counters then we ignore both (2) and (3).

> I wouldn't mind a knock-on analysis
OK before taking these new graphs too seriously as anything other than an upper bound, I'll start sampling again and see if there are any other weird artifacts dominating the data.

> So I reworked your query slightly to use a two-day window so that Monday with 3-6 and Tuesday with 7-9 would catch a missing 6 because it would be looking for 3-9 instead of 3-5: https://sql.telemetry.mozilla.org/queries/60381/source#155923
Ah, good point. I think a more-direct way to do this might be to find the largest pre-day counter, and check whether the day's first counter sequentially follows it. (It's better to have day n be responsible for gaps between n-1 and n, rather than being responsible for gaps between n and n+1, because otherwise we unnecessarily introduce lag).

This query is currently running here: https://sql.telemetry.mozilla.org/queries/60392

Before looking into this further I'll go back to sampling and investigating clients, to see if there's still anything non-gappy being reported as "gaps", or whether the real gaps fall into distinct categories, etc.
I had a look at the gaps being reported by the new query - I randomly sampled 10 clients with gaps with `submission_date_s3 = '20181101' [1].

* 8 of the gaps followed an 'aborted-session' ping.
* There were at least 4 cases where there was an aborted-session ping with an identical `subsession_id` and `profile_subsession_counter` to a non-aborted-session ping, followed by a gap in `profile_subsession_counter`. This smells like a bug.
* The two clients with gaps that did not follow an 'aborted-session' ping had very little activity. One of them only had one ping on 20181115, with the most recent ping before that being on 20181105 (only 1 missing ping according to the counter) - neither pings were aborted-session. The client only had one ping on 20181115 with the most recent previous ping being on 20181106; the newer ping was an aborted-session but the older ping wasn't.
* I did observe a case where an 'aborted-session' ping did not lead to a gap
* All but 1 case above was from a 'has pingsender' version of firefox; the `no pingsender` people are now probably quite different to the 'has pingsender' users.

So it looks like the main cause of gaps is related to aborted sessions; perhaps the following is happening:
* Firefox crashes
* When we try to send the ping, we sometimes re-send a ping that has already been sent (without recomputing its `profile_subsession_counter` value
* When we send this ping we increment the profile_subsession_counter (again)
This could be mitigated by not storing a `profile_subsession_counter` and instead looking at the most recent ping's value and adding 1. Or by tidying up the logic around resending pings?

Or perhaps for some (but not all) aborted-session code-paths, the profile_subsession_counter is incremented before assigning it to the ping, and others where it's incremented after?

Or perhaps if firefox crashes at the wrong moment, an old ping will be re-sent with 'aborted-session' and the current ping will be deleted?

In any event, it looks like most of these 'gaps' are coming from a pretty isolated set of events, and it looks like there's a firefox bug to be fixed - rather than there being some kind of 'indiscriminate packet loss' robbing us of a random sample of our data. Since many of the gaps immediately follow duplicate subsession_ids, I don't think this would be negatively affecting aDAU, search, etc?

[1] https://sql.telemetry.mozilla.org/queries/60397/source
aborted-session pings use the current ping payload (if available) so it would make sense that is has an identical profile_subsession_count.

It doesn't seem too surprising that we might miss ones immediately afterwards. I'd need to dive back in to "aborted-session" to remember exactly how it works.

If gap rates are proportional to aborted-session then I guess we need not investigate further since the presence of an aborted-session in the subsession history is already an indicator that we might be missing things.
Yeah I wonder whether this is a bug where `profile_subsession_count` is incremented unnecessarily (no data loss), or a problem where we lose a subsessions ping in/after an `aborted-session` (has data loss).

So there's a potential bug to chase down that seems to be responsible for the vast majority of "gaps", and we understand the broad circumstances in which this bug occurs. Other gaps seem to be extremely rare.

Do you need anything else from me here?
Status: ASSIGNED → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED
Ack, yes, sorry for the lack of reply. (To better ensure I don't forget next time you can apply a needinfo against me and then bugzilla'll nag me until I get around to it :S )

Yeah, it seems as though this is a non-issue (or slight issue with profile_subsession_count) so there's nothing more to be done here.
You need to log in before you can comment on or make changes to this bug.