Closed Bug 1400351 Opened 7 years ago Closed 7 years ago

[Telemetry Health] Investigate missing subsessions as a measure of Telemetry Client Health

Categories

(Toolkit :: Telemetry, enhancement, P1)

Tracking

RESOLVED FIXED
Tracking Status
firefox57 --- unaffected

People

(Reporter: chutten, Assigned: chutten)

References

(Blocks 1 open bug)

Details

Let's expand the Telemetry Health dashboard to look at additional measures of Telemetry Client Health. 

Specifically: "Missing Subsessions" (are there holes in our data?).

I see the course of this investigation being:
1) Does subsessionCounter reveal holes in our data?
2) How many holes are there, how large are they, and how many clients are they coming from?
3) Based on 1 and 2, are there metrics we should construct and monitor? Are there indications of problems that need deeper investigation?
Assignee: nobody → chutten
Preliminary visualization: https://sql.telemetry.mozilla.org/queries/27302/source#81241

If we treat the data as correct, we're missing a _lot_ of subsessions over a one-week window. We're missing on the order of 200k subsessions every day, from about 10-20k clients.

So each client is missing 10 to 20 subsessions a day? I think not.

Since the measure of "how many subsessions are missing" is so sensitive to what I like to call "shenanigans" (where a client may just submit a subsession counter of 1 then another of MAX_INT to really mess with numbers), I added a count of the number of breaks in the subsession record, instead of the sum of missing subsessions. Turns out it tracks nearly 1:1 with the number of clients reporting any number of missing subsessions at all.
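
To illustrate why counting breaks is more robust than summing missing subsessions, here is a minimal Python sketch. It is not the query behind the dashboard. The function name `summarize_gaps` and the input shape (a plain list of the counter values received from one client, assumed to increase by one per subsession) are assumptions made for the example.

```python
from typing import Iterable, Tuple


def summarize_gaps(counters: Iterable[int]) -> Tuple[int, int]:
    """Return (breaks, missing) for one client's observed counter values.

    breaks  = number of places where consecutive observed counters are not adjacent
    missing = total number of counter values absent between observed ones
    """
    seen = sorted(set(counters))  # de-duplicate resubmissions and order the values
    breaks = 0
    missing = 0
    for prev, cur in zip(seen, seen[1:]):
        gap = cur - prev - 1
        if gap > 0:
            breaks += 1
            missing += gap
    return breaks, missing


# Hypothetical clients: one with a single small hole, one sending 1 then MAX_INT.
well_behaved = [1, 2, 4, 5]
shenanigans = [1, 2**31 - 1]

print(summarize_gaps(well_behaved))  # (1, 1)
print(summarize_gaps(shenanigans))   # (1, 2147483645)
```

Both hypothetical clients contribute exactly one break, but the second one alone would swamp any sum of missing subsessions.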

So a few clients (on the order of 0.01%) report few holes (< 1.2) in the subsession record per day. Not a big deal, probably?

I guess it might be worth looking into how the subsession breaks are distributed amongst the clients.
Distribution of subsession breaks amongst clients: https://sql.telemetry.mozilla.org/queries/43945/source#118779

Much like the distribution of pings per client, this looks like a Power Law distribution again. More than 80% of the clients reporting missing subsessions over the last 20 days are reporting precisely one break. The distribution of the size of the breaks is somewhat more spread out (70% reporting exactly one subsession missing) but still Power Law to the hilt.
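
A rough Python sketch of how such a distribution could be tabulated is shown here. The function name `break_distribution`, the client ids, and the sample numbers are made up for illustration. They are not taken from the linked query or from real data.

```python
from collections import Counter
from typing import Dict, Mapping


def break_distribution(breaks_per_client: Mapping[str, int]) -> Dict[int, float]:
    """Map each break count to the fraction of affected clients with that count.

    Clients with zero breaks are excluded, so the fractions are relative to
    clients reporting at least one break.
    """
    affected = [n for n in breaks_per_client.values() if n > 0]
    if not affected:
        return {}
    histogram = Counter(affected)
    total = len(affected)
    return {n: count / total for n, count in sorted(histogram.items())}


# Hypothetical per-client break counts over some window.
sample = {"a": 1, "b": 1, "c": 1, "d": 1, "e": 3, "f": 0}
print(break_distribution(sample))  # {1: 0.8, 3: 0.2}
```

The same tabulation applied to gap sizes instead of break counts would give the second distribution mentioned above.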
Blocks: 1177737
On :mreid's suggestion I looked into whether the gaps could be partially explained by TELEMETRY_DISCARDED_* probes. These probes increment when we decide to discard pings for being too big without even trying to send them.

It turns out that this happens so rarely that it cannot be the primary, or even a significant, driver of missing subsessions or subsession gaps.
Work has completed. Documented write-up is here: https://docs.google.com/document/d/1o3r2wdi8ndFDgSj7HAWyVL8A1BzpddI51eN1Lc9RptU/edit?ts=59df66a1#

tl;dr - Nothing actionable. The number of gaps is small, and they are typically of a small size. They aren't caused by any obvious mechanism, but maybe future investigations can tease out some commonalities.
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED