Validate incoming "baseline" ping data after fixes have landed
Categories
(Toolkit :: Telemetry, enhancement, P1)
Tracking
| | Tracking | Status |
| --- | --- | --- |
| firefox67 | --- | affected |
People
(Reporter: Dexter, Assigned: chutten)
References
Details
(Whiteboard: [telemetry:mobilesdk:m7])
+++ This bug was initially created as a clone of Bug #1520182 +++
Bug 1520182 provided an initial report of the data reported by glean and identified a few issues. We should run the analysis again once all blocker bugs have landed/are fixed.
Assignee | Comment 1 • 6 years ago
Time to give "baseline" pings another look.
Scope
The queries are limited to pings coming from builds after April 11 (`app_build > '11010000'`, i.e. builds containing the client_id and first_run_date fixes) received between April 12 and 24.
This is about 6200 pings and 520 clients.
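For reference, here's a rough sketch of that scope as a query (in the SQL dialect used on sql.telemetry.mozilla.org; the table name appears later in this analysis and the exact column names are assumptions, so treat the linked queries as authoritative):

```sql
-- Sketch of the analysis scope: post-fix builds, received April 12-24.
SELECT
  COUNT(*)                  AS pings,
  COUNT(DISTINCT client_id) AS clients
FROM org_mozilla_reference_browser_baseline_parquet
WHERE app_build > '11010000'                               -- builds after April 11
  AND submission_date_s3 BETWEEN '20190412' AND '20190424' -- received Apr 12-24
```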
First things first, let's see what's changed since last time.
Ping and Client Counts
Aggregate
- Last time: https://sql.telemetry.mozilla.org/queries/61238#157776
- This time: https://sql.telemetry.mozilla.org/queries/62437#160364
We're still seeing about 100 DAU and in the range of 500-600 pings per day. Nothing much has changed since March, after the Fenix call for testers.
This is a limited population, but that's what we have to work with so let's get to it.
Per-client, Per-day
- Last time
- This time
Nothing much to say here. We're looking at far fewer clients than before, and they appear to be more dedicated: slightly elevated pings-per-client and average-pings-per-day rates. We seem to be out of the "one and done" clients who pop in for a single session and are never seen again (though there are still a few of those in the population), and we seem to have fewer outrageous ping-per-day numbers (though we have a higher proportion of outrageously-high ping-per-client clients. Best guess: testing profiles)
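For illustration, the per-client, per-day rates boil down to an aggregation along these lines (a sketch only, with the same assumed table and column names as above):

```sql
-- Pings per client per day; averaging the `pings` column gives the
-- average-pings-per-day rate discussed above. (Sketch; names assumed.)
SELECT
  submission_date_s3,
  client_id,
  COUNT(*) AS pings
FROM org_mozilla_reference_browser_baseline_parquet
WHERE app_build > '11010000'
  AND submission_date_s3 BETWEEN '20190412' AND '20190424'
GROUP BY 1, 2
ORDER BY pings DESC
```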
Sequence Numbers
Distribution
- Last time: https://sql.telemetry.mozilla.org/queries/61239/source#157777
- This time: https://sql.telemetry.mozilla.org/queries/62441/source#160373
Now that's what I'm talking about. Look at those sequence numbers distribute... there are two clear cohorts: the old guard with `seq` above 480, and then a wealth of fresher clients mostly below 200.
And if you flip over to `#pings - #clients` we see that there are a few more dupes of {client_id, seq} pairs. The only pattern here seems to mimic the overall population distribution from the first graph (there are more dupes where there are more users sending more pings). This means it's unlikely that there's an underlying bias causing dupes to happen at different parts of the sequence (i.e. we're just as likely to see these dupes from deep in a long-sequence client's lifecycle as we are from the first `seq` of a fresh client. Seems to be random.)
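The `#pings - #clients` view is essentially counting extra pings per {client_id, seq} pair, roughly like this (sketch; names assumed as before):

```sql
-- {client_id, seq} pairs that were received more than once (dupes).
SELECT
  client_id,
  seq,
  COUNT(*) - 1 AS extra_pings  -- pings beyond the first for this pair
FROM org_mozilla_reference_browser_baseline_parquet
WHERE app_build > '11010000'
  AND submission_date_s3 BETWEEN '20190412' AND '20190424'
GROUP BY 1, 2
HAVING COUNT(*) > 1
```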
Holes
- Last time: https://sql.telemetry.mozilla.org/queries/61259/source#157858 (the "bad news" analysis)
- This time: https://sql.telemetry.mozilla.org/queries/62442/source#160375
We're still seeing holes, but many fewer holes. And now we're starting to see dupes. But overall, holes + dupes rates are lower than the holes rates were last time.
Still a bit high, though. 59 clients in the population of 523 had dupes or holes in their sequence record (11.3%, of which 8.6% are dupes). And these numbers should be after being reduced by ingest's deduping.
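For concreteness, the per-client dupes-or-holes check can be sketched like so (column names assumed; the linked query is the real thing). A hole here means a seq value between a client's observed minimum and maximum that we never received; a dupe means a seq value received more than once:

```sql
-- Clients with holes or dupes in their seq record (sketch).
SELECT
  client_id,
  MAX(seq) - MIN(seq) + 1 - COUNT(DISTINCT seq) AS holes,  -- seqs never seen
  COUNT(*) - COUNT(DISTINCT seq)                AS dupes   -- seqs seen 2+ times
FROM org_mozilla_reference_browser_baseline_parquet
WHERE app_build > '11010000'
  AND submission_date_s3 BETWEEN '20190412' AND '20190424'
GROUP BY client_id
HAVING MAX(seq) - MIN(seq) + 1 - COUNT(DISTINCT seq) > 0
    OR COUNT(*) - COUNT(DISTINCT seq) > 0
```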
I hope it's because of population effects (we don't exactly have a lot of clients running &browser, so it's not unreasonable to expect outsized effects of rare and weird clients). To look into this hypothesis we'd probably want to take a look at Fenix (which at the very least I use like a normal browser so should generate reasonable-looking data).
Field Compositions
- Last time
- durations: https://sql.telemetry.mozilla.org/queries/61278/source#157883
- client_info stuff, and locale: https://sql.telemetry.mozilla.org/queries/61273/source
- This time
- durations: https://sql.telemetry.mozilla.org/queries/62444/source
- client_info stuff, and locale: https://sql.telemetry.mozilla.org/queries/62445/source
Now, it's not quite as easy to compare durations' distributions because the bucket layout of `NUMERIC_HISTOGRAM` isn't stable. However, we can tell that it's still an exponential with a heavy emphasis on short (under 14s) durations. There are fewer high values (in the thousands-of-seconds (tens of minutes) range). Seems reasonable to me.
Also, we're starting to see far more null durations. 140 out of the 6200 pings is about 2%. Zero-length durations are up there as well, at 2.5%.
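Those null and zero-length rates amount to something like this (sketch; the duration column's name and units are assumptions):

```sql
-- Share of pings with null or zero-length durations (sketch).
SELECT
  COUNT_IF(durations IS NULL) * 100.0 / COUNT(*) AS pct_null_duration,
  COUNT_IF(durations = 0)     * 100.0 / COUNT(*) AS pct_zero_duration
FROM org_mozilla_reference_browser_baseline_parquet
WHERE app_build > '11010000'
  AND submission_date_s3 BETWEEN '20190412' AND '20190424'
```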
On the plus side, all the other fields (`os` and `os_version`, `device_manufacturer` and `device_model`, and `architecture`) are very well-behaved now, with no obvious faults.
Delay
- Last time: was impossible due to data weirdness
- This time: https://sql.telemetry.mozilla.org/queries/62449/source#160390
Due to reasons, there is no HTTP `Date` header in `org_mozilla_reference_browser_baseline_parquet`, so the delay verification is limited to submission delay (the delay from the ping being created until it's on our servers) without adjustments for clock skew (so if a client's clock is really out to lunch, it'll distort the delay calculation).
And the delay is limited to per-minute resolution, as that's the resolution of `end_time`.
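So the delay figures below are roughly of this shape (sketch; the `submission_timestamp` column and the timestamp types are assumptions, and the real query is linked above):

```sql
-- Submission delay in minutes: server receive time minus client end_time (sketch).
SELECT
  APPROX_PERCENTILE(
    DATE_DIFF('minute', FROM_ISO8601_TIMESTAMP(end_time), submission_timestamp),
    ARRAY[0.50, 0.85, 0.95]
  ) AS delay_minutes_p50_p85_p95
FROM org_mozilla_reference_browser_baseline_parquet
WHERE app_build > '11010000'
  AND submission_date_s3 BETWEEN '20190412' AND '20190424'
```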
2% of pings are received 2-3min before they're recorded. (time travellers)
85% of pings are received within a minute of their recording.
To get to 95% (5900 pings) we need to go out to 61min. Quite a lot quicker than Desktop's "Wait 2 days" rule of thumb.
In fact, under 4% of pings take over 3 hours to be received... Now, given the slight clumping around 60min it isn't unreasonable to assume that there's some artificiality here. Maybe a misconfigured timezone here, an artificially-truncated timestamp there...
In short, I'm not sure how close this is to a real distribution and given the outsized effects weird clients can have in a sample of this size, all I'm willing to venture is that the aggregate delay is a lot lower than Desktop's and doesn't appear to have a systematic issue.
More analysis needed:
- Clock skew adjustments (a rough sketch of the idea follows this list)
- Checking to see if there are commonalities within the group of long-delayed pings. Maybe they're all sent from certain clients, or at certain times of day, or at certain parts of the app lifecycle. "You only have to wait an hour to get 95% of the pings" is only useful if the 95% we receive in that hour are representative (outside of their delay) of the population of the 100%.
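To sketch what the clock skew adjustment would look like once a `Date` header is available: measuring the delay entirely on the client's own clock sidesteps that client's offset from server time. The `header_date` column below is hypothetical; it does not exist in this table yet.

```sql
-- Hypothetical skew-adjusted delay: both timestamps come from the client's
-- clock, so the client's offset from server time cancels out. (Sketch only;
-- `header_date` is a column we'd need the HTTP Date header to populate.)
SELECT
  APPROX_PERCENTILE(
    DATE_DIFF('minute', FROM_ISO8601_TIMESTAMP(end_time), header_date),
    0.95
  ) AS p95_client_clock_delay_minutes
FROM org_mozilla_reference_browser_baseline_parquet
WHERE app_build > '11010000'
  AND submission_date_s3 BETWEEN '20190412' AND '20190424'
```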
Conclusion
I conclude that "baseline" pings are almost ready, though I wouldn't trust analyses against the &browser population to be helpful.
Recommendations
- Figure out what's with these dupes. We're getting too many of them.
- Consider how we might be able to judge another (say, Fenix's) population for suitability as a test bed for further verification analyses.
- Larger is immediately better, but is there some way to take the population distribution out of the equation so we can evaluate the pings without wondering how much is due to "weird clients"?
- Not saying that verifying that pings act properly in the face of "weird clients" isn't valuable on its own (it is), but we should try to split that verification from the more pressing matter of "Do the pings work?"
- Add `Date` headers to the `metadata` to enable clock skew calculations. It didn't occur to me until I was performing the delay calculation how much skew could affect things.
- Unless &browser's population inflates to at least 1k DAU or we find some way to explore the client population's composition of weirdness, perform no further validation analyses against it.
  - Fenix's population has grown to 500 DAU already. But it has an even larger dupe problem (19.2% dupes and holes) so maybe I'm overstating the population effect...
Alessio, please take a look and let me know your questions, concerns, and corrections.
Assignee | Comment 2 • 6 years ago
Taking a closer look at the dupes, only about two-thirds of them are full dupes (i.e. having the same docid). Over a third have different document ids.
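The split can be sketched by counting distinct document ids within each duplicated {client_id, seq} pair (sketch; the `document_id` column name is an assumption):

```sql
-- For duplicated {client_id, seq} pairs: one distinct document_id means the
-- same ping was resent (a full dupe); more than one means the seq was reused.
SELECT
  client_id,
  seq,
  COUNT(*)                    AS pings,
  COUNT(DISTINCT document_id) AS distinct_docs
FROM org_mozilla_reference_browser_baseline_parquet
WHERE app_build > '11010000'
  AND submission_date_s3 BETWEEN '20190412' AND '20190424'
GROUP BY 1, 2
HAVING COUNT(*) > 1
```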
Reporter | Comment 3 • 6 years ago
(In reply to Chris H-C :chutten from comment #2)
Taking a closer look at the dupes, only about two-thirds of them are full dupes (i.e. having the same docid). Over a third have different document ids.
Mh, interesting. I wonder if de-duping is catching stuff on the pipeline at all.
Reporter | Comment 4 • 6 years ago
(In reply to Chris H-C :chutten from comment #1)
Sequence Numbers
Distribution
- Last time: https://sql.telemetry.mozilla.org/queries/61239/source#157777
- This time: https://sql.telemetry.mozilla.org/queries/62441/source#160373
Now that's what I'm talking about. Look at those sequence numbers distribute... there are two clear cohorts: the old guard with `seq` above 480, and then a wealth of fresher clients mostly below 200.
And if you flip over to `#pings - #clients` we see that there are a few more dupes of {client_id, seq} pairs. The only pattern here seems to mimic the overall population distribution from the first graph (there are more dupes where there are more users sending more pings). This means it's unlikely that there's an underlying bias causing dupes to happen at different parts of the sequence (i.e. we're just as likely to see these dupes from deep in a long-sequence client's lifecycle as we are from the first `seq` of a fresh client. Seems to be random.)
This dupes thing is starting to concern me a bit. I think I need to check more in depth that the pipeline is working as we expect and how many dupes it is catching.
Holes
- Last time: https://sql.telemetry.mozilla.org/queries/61259/source#157858 (the "bad news" analysis)
- This time: https://sql.telemetry.mozilla.org/queries/62442/source#160375
[..]
I hope it's because of population effects (we don't exactly have a lot of clients running &browser, so it's not unreasonable to expect outsized effects of rare and weird clients). To look into this hypothesis we'd probably want to take a look at Fenix (which at the very least I use like a normal browser so should generate reasonable-looking data).
Yes, this seems like a good hypothesis that needs to be verified with a bigger population.
Field Compositions
- Last time
- durations: https://sql.telemetry.mozilla.org/queries/61278/source#157883
- client_info stuff, and locale: https://sql.telemetry.mozilla.org/queries/61273/source
- This time
- durations: https://sql.telemetry.mozilla.org/queries/62444/source
- client_info stuff, and locale: https://sql.telemetry.mozilla.org/queries/62445/source
Now, it's not quite as easy to compare durations' distributions because the bucket layout of `NUMERIC_HISTOGRAM` isn't stable. However, we can tell that it's still an exponential with a heavy emphasis on short (under 14s) durations. There are fewer high values (in the thousands-of-seconds (tens of minutes) range). Seems reasonable to me.
Also, we're starting to see far more null durations. 140 out of the 6200 pings is about 2%. Zero-length durations are up there as well, at 2.5%.
With respect to null durations, we should wait until we're further along in the transition to GCP to see if that gets fixed.
Regarding the zero-length durations, which can be actionable now, I see a different figure: 80 pings over 6210... so 1.2%? This seems to be fairly stable compared to the old analysis, which reported 1.1% of pings with 0 duration.
Given the size of the effect, I'm not too concerned. It would still be interesting to see how `start_time` and `end_time` behave compared to `duration`, especially in these weird cases of "null" or "0".
Delay
- Last time: was impossible due to data weirdness
- This time: https://sql.telemetry.mozilla.org/queries/62449/source#160390
Due to reasons, there is no HTTP `Date` header in `org_mozilla_reference_browser_baseline_parquet`, so the delay verification is limited to submission delay (the delay from the ping being created until it's on our servers) without adjustments for clock skew (so if a client's clock is really out to lunch, it'll distort the delay calculation).
Gah, that's sad :( Sorry for not catching this earlier.
In short, I'm not sure how close this is to a real distribution and given the outsized effects weird clients can have in a sample of this size, all I'm willing to venture is that the aggregate delay is a lot lower than Desktop's and doesn't appear to have a systematic issue.
WHOZAA! Great news :)
More analysis needed:
- Clock skew adjustments
- Checking to see if there are commonalities within the group of long-delayed pings. Maybe they're all sent from certain clients, or at certain times of day, or at certain parts of the app lifecycle. "You only have to wait an hour to get 95% of the pings" is only useful if the 95% we receive in that hour are representative (outside of their delay) of the population of the 100%.
These are good points for follow-up analyses, maybe on a bigger population.
Conclusion
I conclude that "baseline" pings are almost ready, though I wouldn't trust analyses against the &browser population to be helpful.
I second your conclusions. Your analysis looks sound.
Recommendations
- Figure out what's with these dupes. We're getting too many of them.
I filed bug 1547234 for tracking the problem down.
- Consider how we might be able to judge another (say, Fenix's) population for suitability as a test bed for further verification analyses.
- Larger is immediately better, but is there some way to take the population distribution out of the equation so we can evaluate the pings without wondering how much is due to "weird clients"?
- Not saying that verifying that pings act properly in the face of "weird clients" isn't valuable on its own (it is), but we should try to split that verification from the more pressing matter of "Do the pings work?"
We "might" have something lined up for FFTV. Or, worst case, there's Fenix Beta lined up.
- Add `Date` headers to the `metadata` to enable clock skew calculations. It didn't occur to me until I was performing the delay calculation how much skew could affect things.
@Frank, how hard would it be to do that? Given the GCP transition, does it make sense to do it?
- Unless &browser's population inflates to at least 1k DAU or we find some way to explore the client population's composition of weirdness, perform no further validation analyses against it.
  - Fenix's population has grown to 500 DAU already. But it has an even larger dupe problem (19.2% dupes and holes) so maybe I'm overstating the population effect...
I agree, unless there's any follow-up analysis that might help us with the dupes.
Assignee | Comment 5 • 6 years ago
You're right about the 0 durations (1.2%). You're also right to ask about durations vs. [`start_time`, `end_time`] (though IIRC the `_time`s are per-minute, so I don't know that we'll be able to get the necessary resolution); I should make a note to check that in future analyses.
Frank filed https://github.com/mozilla-services/mozilla-pipeline-schemas/issues/323 for adding the Date header.
Looks like we're done... FOR NOW.
Comment 6 • 6 years ago
Regarding the zero-length durations, which can be actionable now, I see a different figure: 80 pings over 6210... so 1.2%? This seems to be fairly stable compared to the old analysis, which reported 1.1% of pings with 0 duration.
Keep in mind this just means < 1 second. So I don't think zero-length duration should really be considered anything special.
There was discussion about whether to add +1 to durations in this bug but ultimately it was decided not to.
Comment 7 • 6 years ago
(In reply to Chris H-C :chutten from comment #1)
Holes
- Last time: https://sql.telemetry.mozilla.org/queries/61259/source#157858 (the "bad news" analysis)
- This time: https://sql.telemetry.mozilla.org/queries/62442/source#160375
We're still seeing holes, but many fewer holes. And now we're starting to see dupes. But overall, holes + dupes rates are lower than the holes rates were last time.
Still a bit high, though. 59 clients in the population of 523 had dupes or holes in their sequence record (11.3%, of which 8.6% are dupes). And these numbers should be after being reduced by ingest's deduping.
I hope it's because of population effects (we don't exactly have a lot of clients running &browser, so it's not unreasonable to expect outsized effects of rare and weird clients). To look into this hypothesis we'd probably want to take a look at Fenix (which at the very least I use like a normal browser so should generate reasonable-looking data).
I know this is closed, but I believe I can possibly explain some of the holes. Since we added the "ping tagging" capability in `GleanDebugActivity`, tagged pings are diverted to the Debug View on GCP. If the application is then run again without ping tagging, pings are sent to AWS as normal. From the AWS perspective, where the validation queries were run, the diverted pings would then show up as holes.