Closed Bug 1380298 Opened 7 years ago Closed 7 years ago

[Telemetry Latency] Determine how long it takes main pings to get to us

Categories

(Toolkit :: Telemetry, enhancement, P1)

enhancement

Tracking

()

RESOLVED FIXED

People

(Reporter: chutten, Assigned: chutten)

References

Details

The Summary is willfully inaccurate, so let's get some specifics:

"main" pings carry most of the information that our analyses rely on. (Other pings we like (crash, new-profile) come in more quickly (ie, immediately) so their latencies aren't as relevant anyway) They come in at various speeds based on channel, day of week, presence of pingsender... and no doubt other variables.

The goal here is to measure how quickly we get some "critical mass" number (mean? 95%ile? 99%ile?) of main pings and to provide a dashboard that measures it.

(I fully expect this to be consumed by Mission Control once its training wheels come off.)

This will require exploratory work to see what ranges of values we "typically" receive and what variables are most likely predictors (channel, day of week, and presence of pingsender are big ones. But if we still can't get stable numbers, we may need to go deeper)

The result here is a collection of times it takes certain proportions of certain populations of "main" pings to be received.

For instance, one thing we should be able to say with this is "Yesterday we received 95% of release-channel "main" pings within 23.7 hours" 

This is to provide a concrete resource for people writing or using recurring analyses which lets them know how far back our "incomplete information" window stretches for their population.
Current progress: https://sql.telemetry.mozilla.org/queries/5522

I've been trying different x axes: submitted date, received date, created date, session_start_date

The problem with created/session started is that the more recent data will change, worsening, over time.
The problem with submitted date is that it is a client clock.
The problem with received date (ie, submission_date_s3) is that it reflects when we received the data more than when the data was actually about (ie, we lose information about the time the ping was created)

Of course, with pingsender (and adjusting for clock skew) you'd think all of these should be fuzzy-close enough it wouldn't matter. My fiddling suggests that there are still differences (the nice client submission delay cliff doesn't look so obvious when plotted vs. submission date, for instance)

I think I will go for submission date to align with mission control and to encourage the view that this information is immutable over time. Today we already have all of yesterday's data, so the graph shouldn't budge.

Now that I have an idea of how to work with this data, it's time to narrow down to just latest versions (this is problematic around merge days, but I think I can make it work with some thresholds. All else fails, display all we have appreciable data for and let the viewer sort 'em out)
Alrighty, this part is pretty much handled: https://sql.telemetry.mozilla.org/dashboard/telemetry-health

ni?Dexter, gfritzsche for f? or r? (whichever you prefer)
Flags: needinfo?(gfritzsche)
Flags: needinfo?(alessio.placitelli)
(In reply to Chris H-C :chutten from comment #2)
> Alrighty, this part is pretty much handled:
> https://sql.telemetry.mozilla.org/dashboard/telemetry-health

This looks really good and useful, thanks for the good work on this!

> The problem with created/session started is that the more recent data will
> change, worsening, over time.
> The problem with submitted date is that it is a client clock.
> The problem with received date (ie, submission_date_s3) is that it reflects
> when we received the data more than when the data was actually about (ie, we
> lose information about the time the ping was created)
> 
> Of course, with pingsender (and adjusting for clock skew) you'd think all of
> these should be fuzzy-close enough it wouldn't matter. My fiddling suggests
> that there are still differences (the nice client submission delay cliff
> doesn't look so obvious when plotted vs. submission date, for instance)

Do you mean submission date vs creation date being fuzzy-close?
They should be really, really close, unless:

- dates are wrong (unlikely, as we're checking with server-side dates, right?);
- the ping was generated at shutdown, the pingsender failed, it was sent again at the next startup;

I have one question about the SQL query (I'm probably missing something!).
When computing the 95th percentile approximation, why are you doing
|creation_date - submission_date + subsession_length|? I'm particularly interested
about the last addend, subsession_length.
Flags: needinfo?(alessio.placitelli) → needinfo?(chutten)
This looks clear and informative, cheers.

Some things:
- Lets not show aurora, we killed it already.
- Why is it using only the latest versions per channel, can you expand?
- The ESR 54 delay is significantly lower than release 54.
  Is this due to too bias from small sample size? Should we remove it for now?
Flags: needinfo?(gfritzsche)
(In reply to Alessio Placitelli [:Dexter] from comment #3)
> (In reply to Chris H-C :chutten from comment #2)
> > Alrighty, this part is pretty much handled:
> > https://sql.telemetry.mozilla.org/dashboard/telemetry-health
> 
> This looks really good and useful, thanks for the good work on this!
> 
> > The problem with created/session started is that the more recent data will
> > change, worsening, over time.
> > The problem with submitted date is that it is a client clock.
> > The problem with received date (ie, submission_date_s3) is that it reflects
> > when we received the data more than when the data was actually about (ie, we
> > lose information about the time the ping was created)
> > 
> > Of course, with pingsender (and adjusting for clock skew) you'd think all of
> > these should be fuzzy-close enough it wouldn't matter. My fiddling suggests
> > that there are still differences (the nice client submission delay cliff
> > doesn't look so obvious when plotted vs. submission date, for instance)
> 
> Do you mean submission date vs creation date being fuzzy-close?
> They should be really, really close, unless:
> 
> - dates are wrong (unlikely, as we're checking with server-side dates,
> right?);
> - the ping was generated at shutdown, the pingsender failed, it was sent
> again at the next startup;

Yup

> I have one question about the SQL query (I'm probably missing something!).
> When computing the 95th percentile approximation, why are you doing
> |creation_date - submission_date + subsession_length|? I'm particularly
> interested
> about the last addend, subsession_length.

subsession_length is the Recording Delay. An example: We count tab open events. How long from the tab open event a user performs at the beginning of the session until we are told about it? session_length + Submission Delay.

I can remove it if you'd like this to measure just Submission Delay.

(In reply to Georg Fritzsche [:gfritzsche] from comment #4)
> This looks clear and informative, cheers.
> 
> Some things:
> - Lets not show aurora, we killed it already.

It isn't dead, it's just a subpopulation of Beta :)

But, yes, it is of limited value. Do you prefer I merge it with the rest of beta or omit it completely?

> - Why is it using only the latest versions per channel, can you expand?

Clarity. Showing _all_ versions is noisy. I can expand to include current version - 1.

Do you want this on the table, too, or just the time series?

> - The ESR 54 delay is significantly lower than release 54.
>   Is this due to too bias from small sample size? Should we remove it for
> now?

And higher, depending on the day. As expected, ESR shows a much higher weekday/weekend variance due to the number of users who are only using it at work. Their Monday numbers are atrocious.

Slight issue: ESR is on 52.2.1, not 54. We must have some significant population reporting ESR update path on a Firefox 54 install. I can try to heuristic my way around it... or maybe I can hardcode ESR's version to make my life easier.

Let me see how hard it is to get ESR showing the current version. If it's too hard, we can remove it. If after fixing it, the information's not interesting, we can remove it.
Flags: needinfo?(chutten)
(In reply to Chris H-C :chutten from comment #5)
> > I have one question about the SQL query (I'm probably missing something!).
> > When computing the 95th percentile approximation, why are you doing
> > |creation_date - submission_date + subsession_length|? I'm particularly
> > interested
> > about the last addend, subsession_length.
> 
> subsession_length is the Recording Delay. An example: We count tab open
> events. How long from the tab open event a user performs at the beginning of
> the session until we are told about it? session_length + Submission Delay.
> 
> I can remove it if you'd like this to measure just Submission Delay.

Nope, I think that's fine! Maybe we could just have 2 different plots: recording delay + submission delay and submission delay only.
To be honest, I don't care too much, as long that we're clearly explaining what recording delay is. Otherwise, it could be misleading.
I've fixed up the query for the table so it has the correct ESR version.

Trying to extend the timeseries plot to current_version - 1 is somewhat trickier. The queries are taking a long time to execute (at 48min and counting), so iterating is a slow process. May have to pick this one up after I get back from PTO.
Depends on: 1390936
So, now the long query takes 4h to complete. So that's a thing.

Not sure what's making it so awful of late. It used to be sub-1h and I don't remember doing anything that would have quadrupled its ire.

At any rate, now it shows version and version - 1 for non-ESR channels up to around 60 days in the past. It seems to be informational and mostly complete.

Next step: "Client Chattiness"... or... how many pings _should_ we be receiving per client per day?
Blocks: 1399153
With the general review and acceptance of the document[1], the release of the blog post[2], and the continued operation of the dashboard[3], I think this particular bug is done.

I'll split the Client Chattiness and Missing Subsession Telemetry Health metrics to a follow-up.

[1]: https://docs.google.com/document/d/1_4szwRs4RGeKhkvlIFCHFNJgp9hQ30iAZGoiArtaVDU/edit?ts=59ae93a1#
[2]: https://chuttenblog.wordpress.com/2017/09/12/two-days-or-how-long-until-the-data-is-in/
[3]: https://sql.telemetry.mozilla.org/dashboard/telemetry-health
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Blocks: 1400351
You need to log in before you can comment on or make changes to this bug.