Validate new pre-account ping
Categories: Toolkit :: Telemetry, task, P1
People: Reporter: janerik, Assigned: janerik
Once the pre-account ping landed we need to validate it.
- Is the interval as expected?
- Does it contain the right metrics?
Comment 1•6 years ago
[removed, that comment wasn't ready]
Updated•6 years ago
Comment 2•6 years ago
Known issues: validation errors because of missing speedMhz. Will fix that in a PR to the schemas repository.
Summary:
We currently have ~64k pings in (is this low? high? I have no idea).
The majority of pings have reason "shutdown" or "periodic". IMO that is as expected. We have a few logins and even fewer logouts.
80% of the pings contain the one scalar we collect.
Of the ones without the scalar, a majority are from short sessions (<10 minutes), so it seems reasonable that they might just not have opened a URI?
For the durations:
We see a number of pings with negative durations (1.3% of all pings). This is worrisome and I filed bug 1545365.
We see a number of pings with durations over 48 hours, but only "shutdown" and "periodic" pings.
It's less than 1% and even less for builds after we switched from idle-daily
to our own scheduler (though this might also be due to volume).
It's also all* from Windows, where sleep times are included, so it's a little bit expected (maybe we should eventually do something about that...)
(* except one single ping from a Linux machine)
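The duration checks above can be sketched as a short script. This is only an illustration with invented records and field names (the real analysis ran as queries over the ping tables):

```python
from datetime import timedelta

# Hypothetical ping records; the real data lives in the telemetry ping tables.
pings = [
    {"reason": "shutdown", "os": "Windows_NT", "duration_s": -30},
    {"reason": "periodic", "os": "Windows_NT", "duration_s": 200_000},
    {"reason": "shutdown", "os": "Linux", "duration_s": 1_200},
]

def flag_durations(pings, max_hours=48):
    """Partition pings into negative and implausibly long durations."""
    limit = timedelta(hours=max_hours).total_seconds()
    negative = [p for p in pings if p["duration_s"] < 0]
    too_long = [p for p in pings if p["duration_s"] > limit]
    return negative, too_long

negative, too_long = flag_durations(pings)
print(len(negative), len(too_long))  # 1 1
```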
:chutten, could you give this a look and double-check that my analysis makes sense? Did I miss anything or misinterpret the data? Do we need to fix anything besides the 2 bugs mentioned above?
Comment 3•6 years ago
Context-free validation notes:
- Please pin the submission dates that you're using. Either use the same ones everywhere in WHERE clauses, or parameterize them, or restrict your view.
- Drop Cmd 12 in favour of Cmd 13.
- I'd like to see if the scheduler refactor introduced a change in the distribution of negative durations.
- No client_id checks? We could see how many pings per client per day we're receiving (maybe all the nonsense is from a single client?)
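A minimal illustration of the "restrict your view" suggestion, here using sqlite3 with invented table and column names (the real analysis ran against the telemetry ping tables): pin the date range once in a view and point every later command at the view, so all queries agree on the same submission dates.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pre_account (submission_date TEXT, reason TEXT)")
conn.executemany(
    "INSERT INTO pre_account VALUES (?, ?)",
    [("20190409", "shutdown"), ("20190415", "periodic"), ("20190419", "shutdown")],
)

# Pin the date range once, in a view; every later query reads from the view,
# so all commands agree on the same submission dates.
conn.execute(
    """
    CREATE VIEW pre_account_pinned AS
    SELECT * FROM pre_account
    WHERE submission_date > '20190410' AND submission_date < '20190418'
    """
)

rows = conn.execute("SELECT COUNT(*) FROM pre_account_pinned").fetchone()
print(rows[0])  # 1
```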
(In reply to Jan-Erik Rediger [:janerik] from comment #2)
> Known issues: validation errors because of missing speedMhz. Will fix that in a PR to the schemas repository.

What impact does this have on your analysis?

> Summary:
> We have ~64k pings currently in (is this low? high? I have no idea).

Taking a look at the number of "main" pings over the same builds and dates (buildid > 20190326, 20190410 < submission_date_s3 < 20190418) we see about 1M of them. Now, we don't send "main" pings at the same time as "pre-account" pings... but at the very least the number of "shutdown"-reason "main" pings (676K) should be within an order of magnitude of the number of "shutdown"-reason "pre-account" pings (51K) (minus the proportion of users who are logged in to FxA, just under 12%)...

I'm thinking this is rather (an order of magnitude) low. Enough lower that it might impact the entire analysis.

> The majority of pings have reason "shutdown" or "periodic". IMO that is as expected. We have a few logins and even fewer logouts.

We could (and probably should) cross-compare with "sync" pings to see what the rates of logins/logouts were over the population and period under study.

> 80% of the pings contain the one scalar we collect.
> Of the ones without the scalar, a majority are from short sessions (<10 minutes), so it seems reasonable that they might just not have opened a URI?

Perhaps. It's still a little lower than I'd expect. We can cross-check between counts of total_uri_count-missing "pre-account"/"shutdown" and "main"/"shutdown" to ensure we're within an order of magnitude of correct.

> For the durations:
> We see a number of pings with negative durations (1.3% of all pings). This is worrisome and I filed bug 1545365.

Good choice. It's higher than we'd like. Though, given the overall low number of pings, we might find upon reexamination that the proportion is much smaller and that this subpopulation is unusually likely to contain weird pings.

> We see a number of pings with durations over 48 hours, but only "shutdown" and "periodic" pings.
> It's less than 1% and even less for builds after we switched from idle-daily to our own scheduler (though this might also be due to volume).
> It's also all* from Windows, where sleep times are included, so it's a little bit expected (maybe we should eventually do something about that...)
> (* except one single ping from a Linux machine)

Is this far outside the expected proportion of Linux users compared to Windows? 1% seems about right (though maybe it's higher on Nightly).

> :chutten, could you give this a look and double-check that my analysis makes sense? Did I miss anything or misinterpret the data? Do we need to fix anything besides the 2 bugs mentioned above?
One big problem, a couple of small omissions, and some possibilities for tightening up a future edition.
Comment 4•6 years ago
(In reply to Chris H-C :chutten from comment #3)
> - Please pin the submission dates that you're using. Either use the same ones everywhere in WHERE clauses, or parameterize them, or restrict your view.

Done. Now having one view, limited to the right dates.

> - Drop Cmd 12 in favour of Cmd 13.

Uhm, damn. Now I've mixed and edited some of them. I think it was the "Percentage of pings with negative duration across all pings" one I should drop, keeping the one separated by reason?

> - I'd like to see if the scheduler refactor introduced a change in the distribution of negative durations.

I think we don't have enough data for that (17k pings before the change, 600k pings after the change).

> - No client_id checks? We could see how many pings per client per day we're receiving (maybe all the nonsense is from a single client?)

We ... don't have a client ID, that's the whole point.

> > Known issues: validation errors because of missing speedMhz. Will fix that in a PR to the schemas repository.
> What impact does this have on your analysis?
> > Summary:
> > We have ~64k pings currently in (is this low? high? I have no idea).
> Taking a look at the number of "main" pings over the same builds and dates (buildid > 20190326, 20190410 < submission_date_s3 < 20190418) we see about 1M of them. Now, we don't send "main" pings at the same time as "pre-account" pings... but at the very least the number of "shutdown"-reason "main" pings (676K) should be within an order of magnitude of the number of "shutdown"-reason "pre-account" pings (51K) (minus the proportion of users who are logged in to FxA, just under 12%)...
> I'm thinking this is rather (an order of magnitude) low. Enough lower that it might impact the entire analysis.
Turns out: merged doesn't mean deployed. I'll file a bug with the right team and see that this gets documented more clearly (and maybe that there's a way to know what is deployed and running?)
We now have for the pre-account ping in the period 20190417 to 20190423:
- 628k pings total
- 523k reason=shutdown
- 103k reason=periodic
For the main ping in the same period:
- 546k reason=shutdown
- 106k reason=daily
- 228k reason=environment-change
> > The majority of pings are with reason "shutdown" or "periodic". IMO that is as expected. We have a few logins and even fewer logouts.
> We could (and probably should) cross-compare with "sync" pings to see what the rates of logins/logouts were over the population and period under study.

- We saw 1.2k pre-account pings with reason "login".
- For the same timeframe there were 1.8k sync events with why: login.

That's at least in the same order of magnitude.
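The "same order of magnitude" eyeball check can be made explicit with a tiny helper (a sketch, not part of the original analysis; the counts are the ones reported in this thread):

```python
import math

def same_order_of_magnitude(a: int, b: int) -> bool:
    """True if the two counts are within a factor of 10 of each other."""
    return abs(math.log10(a / b)) < 1

print(same_order_of_magnitude(1_200, 1_800))    # True: pre-account logins vs. sync login events
print(same_order_of_magnitude(51_000, 676_000)) # False: more than 10x apart
```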
> > 80% of the pings contain the one scalar we collect.
> > Of the ones without the scalar, a majority are from short sessions (<10 minutes), so it seems reasonable that they might just not have opened a URI?
> Perhaps. It's still a little lower than I'd expect. We can cross-check between counts of total_uri_count-missing "pre-account"/"shutdown" and "main"/"shutdown" to ensure we're within an order of magnitude of correct.

- For pre-account/shutdown we have a ratio of 78%/22% for having/missing the total_uri_count.
- For main/shutdown we have a ratio of 76%/24% for having/missing the total_uri_count.
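As a sketch of that cross-check (the raw counts here are invented; only the resulting 78%/22% and 76%/24% splits come from this thread):

```python
def present_ratio(present: int, missing: int) -> float:
    """Fraction of pings that carry the scalar."""
    return present / (present + missing)

# Hypothetical raw counts chosen to match the reported splits.
pre_account = present_ratio(78, 22)
main = present_ratio(76, 24)

print(f"{pre_account:.0%} vs {main:.0%}")  # 78% vs 76%
```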
> > We see a number of pings with durations over 48 hours, but only "shutdown" and "periodic" pings.
> > It's less than 1% and even less for builds after we switched from idle-daily to our own scheduler (though this might also be due to volume).
> > It's also all* from Windows, where sleep times are included, so it's a little bit expected (maybe we should eventually do something about that...)
> > (* except one single ping from a Linux machine)
> Is this far outside the expected proportion of Linux users compared to Windows? 1% seems about right (though maybe it's higher on Nightly).

Ah right, proportionally this is correct.

Apart from that: looking at the scheduler change is not meaningful; we just don't have enough data from before it (as mentioned above).
Comment 5•6 years ago
(In reply to Jan-Erik Rediger [:janerik] from comment #4)
> > - I'd like to see if the scheduler refactor introduced a change in the distribution of negative durations.
> I think we don't have enough data for that (17k pings before the change, 600k pings after the change).

Ah well, preliminary counts look like it's a lower proportion, but it doesn't go completely away.

> > - No client_id checks? We could see how many pings per client per day we're receiving (maybe all the nonsense is from a single client?)
> We ... don't have a client ID, that's the whole point.
facepalm doi. I knew that, why didn't I think about that. I was too busy hoping we could explain it all away with a single, noisy client.
> > > Known issues: validation errors because of missing speedMhz. Will fix that in a PR to the schemas repository.
> > What impact does this have on your analysis?
> > > Summary:
> > > We have ~64k pings currently in (is this low? high? I have no idea).
> > Taking a look at the number of "main" pings over the same builds and dates (buildid > 20190326, 20190410 < submission_date_s3 < 20190418) we see about 1M of them. Now, we don't send "main" pings at the same time as "pre-account" pings... but at the very least the number of "shutdown"-reason "main" pings (676K) should be within an order of magnitude of the number of "shutdown"-reason "pre-account" pings (51K) (minus the proportion of users who are logged in to FxA, just under 12%)...
> > I'm thinking this is rather (an order of magnitude) low. Enough lower that it might impact the entire analysis.
> Turns out: merged doesn't mean deployed. I'll file a bug with the right team and see that this gets documented more clearly (and maybe that there's a way to know what is deployed and running?)
> We now have for the pre-account ping in the period 20190417 to 20190423:
> - 628k pings total
> - 523k reason=shutdown
> - 103k reason=periodic
> For the main ping in the same period:
> - 546k reason=shutdown
> - 106k reason=daily
> - 228k reason=environment-change
Looking good.
> > > The majority of pings are with reason "shutdown" or "periodic". IMO that is as expected. We have a few logins and even fewer logouts.
> > We could (and probably should) cross-compare with "sync" pings to see what the rates of logins/logouts were over the population and period under study.
> - We saw 1.2k pre-account pings with reason "login".
> - For the same timeframe there were 1.8k sync events with why: login.
> That's at least in the same order of magnitude.
So few it's hard to say for certain, but it's not inconsistent with the hypothesis that both are tracking the same events.
> > > 80% of the pings contain the one scalar we collect.
> > > Of the ones without the scalar, a majority are from short sessions (<10 minutes), so it seems reasonable that they might just not have opened a URI?
> > Perhaps. It's still a little lower than I'd expect. We can cross-check between counts of total_uri_count-missing "pre-account"/"shutdown" and "main"/"shutdown" to ensure we're within an order of magnitude of correct.
> - For pre-account/shutdown we have a ratio of 78%/22% for having/missing the total_uri_count.
> - For main/shutdown we have a ratio of 76%/24% for having/missing the total_uri_count.
This is not at all what I expected. Shows that pre-account is doing as well as main, but... wow. Hm. Didn't realize so few main pings contained that scalar.
> > > We see a number of pings with durations over 48 hours, but only "shutdown" and "periodic" pings.
> > > It's less than 1% and even less for builds after we switched from idle-daily to our own scheduler (though this might also be due to volume).
> > > It's also all* from Windows, where sleep times are included, so it's a little bit expected (maybe we should eventually do something about that...)
> > > (* except one single ping from a Linux machine)
> > Is this far outside the expected proportion of Linux users compared to Windows? 1% seems about right (though maybe it's higher on Nightly).
> Ah right, proportionally this is correct.
Not so much any more. Only 5 too-long pings from Linux and ~2k from Windows is disproportionate. Might be a "bug 1535632 and friends"-type problem.
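As a rough check of that disproportion (the 5 and ~2k too-long-ping counts are from this thread; the ~1% Linux share is the earlier eyeballed population figure):

```python
linux_too_long = 5
windows_too_long = 2_000  # approximate
total = linux_too_long + windows_too_long

observed_linux_share = linux_too_long / total
expected_linux_share = 0.01  # roughly 1% of the desktop population, per the earlier comment

# Linux shows up in far fewer too-long pings than its population share predicts,
# pointing at a Windows-specific cause (sleep time being counted into durations).
print(f"observed {observed_linux_share:.2%}, expected ~{expected_linux_share:.0%}")
# observed 0.25%, expected ~1%
```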
> Apart from that: looking at the scheduler change is not meaningful; we just don't have enough data from before it (as mentioned above).
Fair.
So it seems we now have enough data to say that negative durations are definitely a problem we should look into.
In addition, we have some outrageous claims of numbers of URIs and lengths of sessions, which should at least be documented (in the in-tree docs?).
Aside from those, looks like it's behaving well: the validation checks out.
Now... if you wanted to confirm that multi-store is working well, we're still missing the piece where we make sure the distribution of total_uri_count didn't change as reported by the main ping over the period where pre-account started being sent. For that we can just check TMO, which seems to report findings consistent with that hypothesis: https://mzl.la/2ZrCsYO
Updated•6 years ago