Closed Bug 1554860 Opened 5 years ago Closed 5 years ago

Investigate the low number of clients sending the 'metrics' ping

Categories

(Data Platform and Tools :: Glean: SDK, task, P1)

task

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: Dexter, Assigned: Dexter)

References

Details

(Whiteboard: [telemetry:mobilesdk:m8])

Attachments

(1 file)

We're seeing a low number of clients sending the 'metrics' ping, much lower than the number of clients sending the 'events' and 'baseline' pings. We should figure out why. See bug 1552507 comment 3 for more details.

Priority: -- → P1
Whiteboard: [telemetry:mobilesdk:m8]

If it's actually an uptake curve, looking at this in terms of buildids would be a quick way to check... so I did that.

Here's a look at the client counts over 3 months reported by "metrics" pings per-build-day: https://sql.telemetry.mozilla.org/queries/62922/source#161431

For comparison, here's the same thing with the same scale from the "events" ping and from the "baseline" ping.

Note that even with modern builds, the client count according to "metrics" pings isn't quite at the levels that the "baseline" and "events" pings agree on.

[Edited to add] This slight deflation seen in modern (May 26+) builds could be due to the "metrics" pings' relative infrequency of sending, so maybe we're seeing proper numbers now... but still.

Also of note: there's no obvious "Ah-ha, builds from -this- date are when we start seeing the correct numbers" moment, suggesting that maybe this doesn't have as much to do with builds as I originally surmised.

(In reply to Chris H-C :chutten from comment #1)

> If it's actually an uptake curve, looking at this in terms of buildids would be a quick way to check... so I did that.

> Here's a look at the client counts over 3 months reported by "metrics" pings per-build-day: https://sql.telemetry.mozilla.org/queries/62922/source#161431

> For comparison, here's the same thing with the same scale from the "events" ping and from the "baseline" ping.

> Note that even with modern builds, the client count according to "metrics" pings isn't quite at the levels that the "baseline" and "events" pings agree on.

Thanks for following up on this.

> [Edited to add] This slight deflation seen in modern (May 26+) builds could be due to the "metrics" pings' relative infrequency of sending, so maybe we're seeing proper numbers now... but still.

> Also of note: there's no obvious "Ah-ha, builds from -this- date are when we start seeing the correct numbers" moment, suggesting that maybe this doesn't have as much to do with builds as I originally surmised.

Yes. These pings have radically different schedules: 'baseline' is sent every time we go to the background; 'events' too, but only if some event data is available. 'metrics', on the other hand, is expected to be sent at 4AM local time, or the next time the app is opened if it is overdue and data is available.
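
To make the scheduling difference concrete, here is a rough Kotlin model of the submission triggers described above; the names (Ping, onBackground, hasPendingEvents) are illustrative stand-ins, not the actual Glean SDK API.

```kotlin
// A rough model of the submission triggers; not the real Glean SDK implementation.
enum class Ping { BASELINE, EVENTS, METRICS }

// Called whenever the app goes to the background.
fun onBackground(hasPendingEvents: Boolean): List<Ping> {
    val toSubmit = mutableListOf(Ping.BASELINE)      // 'baseline' is always sent on backgrounding
    if (hasPendingEvents) toSubmit.add(Ping.EVENTS)  // 'events' only if some event data was recorded
    // 'metrics' is not sent here: it is scheduled for 4AM local time, or submitted on the
    // next app start if it is overdue and data is available.
    return toSubmit
}

fun main() {
    println(onBackground(hasPendingEvents = true))   // [BASELINE, EVENTS]
    println(onBackground(hasPendingEvents = false))  // [BASELINE]
}
```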

Theory

Given that the first non-event metric was added 2 months ago (metrics.default_browser), all builds should have something. However, by looking at the code, I can see the following:

  • If a Fenix user disabled and then re-enabled telemetry without restarting the app, we'd not set the fields again. Since they have ping lifetime rather than application lifetime, they would not make it into the ping until the application is restarted.
  • If Fenix is closed and uninstalled before getting to 4AM, we don't receive a metrics ping.
  • We had a bug two months ago which made it impossible to send metrics pings when the scheduler hit 4AM. It's now fixed and unlikely to still wreak havoc, right?

I'm not sure the above justifies the numbers we see, though. Thoughts?
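
To illustrate the first scenario above, here is a minimal in-memory sketch (assumed names, not the real Glean storage) of a ping-lifetime store that is cleared when upload is disabled and only repopulated at process startup:

```kotlin
class PingLifetimeStore {
    private val values = mutableMapOf<String, String>()

    var uploadEnabled: Boolean = true
        set(enabled) {
            if (!enabled) values.clear() // opting out drops the stored values
            field = enabled
        }

    fun set(name: String, value: String) {
        if (uploadEnabled) values[name] = value
    }

    fun snapshotForPing(): Map<String, String> = values.toMap()
}

// Only called once per process start (e.g. from the app's startup path).
fun recordStartupMetrics(store: PingLifetimeStore) {
    store.set("metrics.default_browser", "true")
}

fun main() {
    val store = PingLifetimeStore()
    recordStartupMetrics(store)
    println(store.snapshotForPing()) // {metrics.default_browser=true}

    // The user disables and then re-enables telemetry without restarting the app:
    store.uploadEnabled = false
    store.uploadEnabled = true

    // Startup metrics are not set again, so the next 'metrics' ping would have no data
    // until the application is restarted.
    println(store.snapshotForPing()) // {}
}
```

Toggling upload off and on without a restart leaves the store empty until the next process start, which matches the gap described in the first bullet.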

(In reply to Alessio Placitelli [:Dexter] from comment #2)

> Theory

> Given that the first non-event metric was added 2 months ago (metrics.default_browser), all builds should have something. However, by looking at the code, I can see the following:

>   • If a Fenix user disabled and then re-enabled telemetry without restarting the app, we'd not set the fields again. Since they have ping lifetime rather than application lifetime, they would not make it into the ping until the application is restarted.

I don't know Android application lifecycles that well. What does "application is restarted" mean in this case? Is that a common thing? I've looked at some "events" pings by now and noticed we have some app lifecycle events... do we instrument "application is restarted"? If so, we can get a handle on its frequency and cross-correlate clients if we have to.

>   • If Fenix is closed and uninstalled before getting to 4AM, we don't receive a metrics ping.

This is an interesting one. If we look at "events" pings we find that over half of clients only send a single day's data. For "metrics" pings it's closer to 11%. This is consistent with a theory that the clients missing from the "metrics" sample might be "one and done" clients who only show up once and never again.

>   • We had a bug two months ago which made it impossible to send metrics pings when the scheduler hit 4AM. It's now fixed and unlikely to still wreak havoc, right?

Unlikely indeed. The per-buildid analysis shows that even modern builds show this deflation, and looking at the buildid split in the May 8-15 "events" ping sample, we see that basically no one's running a build older than 1 month.


I think your Theory Part 2 might be a reasonable explanation for what we're seeing. To confirm it maybe we could compile a list of client_ids present in the "events" ping sample from 2+ days and see if any of them are absent from the "metrics" sample?

(In reply to Chris H-C :chutten from comment #3)

> (In reply to Alessio Placitelli [:Dexter] from comment #2)
> > Theory

> > Given that the first non-event metric was added 2 months ago (metrics.default_browser), all builds should have something. However, by looking at the code, I can see the following:

> >   • If a Fenix user disabled and then re-enabled telemetry without restarting the app, we'd not set the fields again. Since they have ping lifetime rather than application lifetime, they would not make it into the ping until the application is restarted.

> I don't know Android application lifecycles that well. What does "application is restarted" mean in this case? Is that a common thing? I've looked at some "events" pings by now and noticed we have some app lifecycle events... do we instrument "application is restarted"? If so, we can get a handle on its frequency and cross-correlate clients if we have to.

By "restarting", I think the best way to think of this is that the application process is stopped and then launched again. Android is weird in that just because you tell it to kill a process doesn't mean it really has to... but in most cases that I have seen, it happens when you open the task switcher and swipe the app away, or when you restart the device. Technically, we shouldn't have to instrument an "app restarted" metric, as the app_opened event is functionally equivalent and should fire any time the app is restarted. There is an Android onRestart event, but that means the app does not go through the onCreate state where the metrics Alessio mentioned get set, so the process must have been killed and the application started again for those metrics to be set.

> I think your Theory Part 2 might be a reasonable explanation for what we're seeing. To confirm it maybe we could compile a list of client_ids present in the "events" ping sample from 2+ days and see if any of them are absent from the "metrics" sample?

Just thinking through this, but if the user used Fenix and then didn't use it for a day or more, there could be perfectly legitimate 'holes' (missing day(s) of 'metrics' pings), so we may not see one for every 24-hour period for every user. Just because we have an 'events' ping but no 'metrics' ping for a given day doesn't mean a user was one and done; they could also be an "I used Fenix today, but did not use it again for a few days" user. So maybe be careful about saying that a missing 'metrics' ping means the app was uninstalled; rather, say that it hasn't been used again.

(In reply to Travis Long from comment #4)

> (In reply to Chris H-C :chutten from comment #3)
> > (In reply to Alessio Placitelli [:Dexter] from comment #2)
> > > Theory

> > > Given that the first non-event metric was added 2 months ago (metrics.default_browser), all builds should have something. However, by looking at the code, I can see the following:

> > >   • If a Fenix user disabled and then re-enabled telemetry without restarting the app, we'd not set the fields again. Since they have ping lifetime rather than application lifetime, they would not make it into the ping until the application is restarted.

> > I don't know Android application lifecycles that well. What does "application is restarted" mean in this case? Is that a common thing? I've looked at some "events" pings by now and noticed we have some app lifecycle events... do we instrument "application is restarted"? If so, we can get a handle on its frequency and cross-correlate clients if we have to.

> By "restarting", I think the best way to think of this is that the application process is stopped and then launched again. Android is weird in that just because you tell it to kill a process doesn't mean it really has to... but in most cases that I have seen, it happens when you open the task switcher and swipe the app away, or when you restart the device. Technically, we shouldn't have to instrument an "app restarted" metric, as the app_opened event is functionally equivalent and should fire any time the app is restarted. There is an Android onRestart event, but that means the app does not go through the onCreate state where the metrics Alessio mentioned get set, so the process must have been killed and the application started again for those metrics to be set.

Correct.

> > I think your Theory Part 2 might be a reasonable explanation for what we're seeing. To confirm it maybe we could compile a list of client_ids present in the "events" ping sample from 2+ days and see if any of them are absent from the "metrics" sample?

> Just thinking through this, but if the user used Fenix and then didn't use it for a day or more, there could be perfectly legitimate 'holes' (missing day(s) of 'metrics' pings), so we may not see one for every 24-hour period for every user. Just because we have an 'events' ping but no 'metrics' ping for a given day doesn't mean a user was one and done; they could also be an "I used Fenix today, but did not use it again for a few days" user. So maybe be careful about saying that a missing 'metrics' ping means the app was uninstalled; rather, say that it hasn't been used again.

We're talking about clients (one or more pings referring to the same client_id), not about ping counts. So even if there was a hole of multiple days after one 'metrics' ping was received, you'd still count 1 client there. Same if we received 10 'metrics' pings for 10 days from that client. Always 1 :)

Alessio, the PI request for this issue is being handled by Andrei Joltan and Andrei Bodea. Could you tell us which scenarios we can try out to narrow down the cause?

Flags: needinfo?(andrei.bodea)
Flags: needinfo?(alessio.placitelli)

Adding Andrei Joltan, as we will work on this issue together.

Flags: needinfo?(andrei.joltan)

(In reply to No-Jun Park [:njpark] from comment #6)

> Alessio, the PI request for this issue is being handled by Andrei Joltan and Andrei Bodea. Could you tell us which scenarios we can try out to narrow down the cause?

Sure! I had a conversation with Andrei Bodea to go through this a little while ago. In general, I'd start by making sure that the documented schedule for the 'metrics' ping is respected:

  • we want to send the ping every day at 4AM local time;
  • we want to send it after the application is started, at whatever time, if no 'metrics' ping was sent on the previous day.

We want to make sure both cases are covered, and to uncover any potential edge cases that compromise the above scheduling.

Flags: needinfo?(alessio.placitelli)

Chris, since you looked at the data, is there anything else you'd suggest testing in addition to the scheduling from comment 8?

Flags: needinfo?(chutten)

It might be nice to also confirm that no "metrics" ping is sent until either 4AM local or the second time the application is opened.

On top of that I don't think there's much else unless we want them to take my laundry list of verification items :)

Flags: needinfo?(chutten)
Blocks: 1552884
Assignee: nobody → alessio.placitelli

Thanks to the work of Andrei Bodea and Andrei Joltan, we were able to identify at least one issue that might contribute to the low number of 'metrics' pings sent by Fenix.

On a fresh install, where no "last date the metrics ping was sent" is available, we set that date to the current day. This triggers the "ping already sent today" case, postponing the ping by at least 24 hours.

If the user never opens Fenix again, no ping will be sent.

The correct behaviour (see the sketch below) should be:

  • if the due time has not passed, schedule the ping for the due time;
  • if the due time has passed, collect and send immediately.
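
Here is a sketch of that corrected logic, assuming the scheduler keeps a nullable "last sent date"; scheduleMetricsPing, scheduleIn and collectAndSendNow are hypothetical stand-ins, not the actual patch:

```kotlin
import java.time.Duration
import java.time.LocalDate
import java.time.LocalDateTime
import java.time.LocalTime

// Due time for the 'metrics' ping: 4AM local.
val DUE_TIME: LocalTime = LocalTime.of(4, 0)

fun scheduleMetricsPing(now: LocalDateTime, lastSentDate: LocalDate?) {
    val dueToday = now.toLocalDate().atTime(DUE_TIME)
    when {
        // A ping was actually sent today: wait for tomorrow's due time.
        lastSentDate == now.toLocalDate() ->
            scheduleIn(Duration.between(now, dueToday.plusDays(1)))
        // Due time has not passed yet (covers fresh installs started before 4AM):
        // schedule the ping for the due time.
        now.isBefore(dueToday) ->
            scheduleIn(Duration.between(now, dueToday))
        // Due time has passed and nothing was sent today (covers fresh installs after 4AM):
        // collect and send immediately instead of silently postponing by 24 hours.
        else -> collectAndSendNow()
    }
}

// Hypothetical stand-ins for the real scheduling and submission machinery.
fun scheduleIn(delay: Duration) = println("scheduling 'metrics' ping in $delay")
fun collectAndSendNow() = println("collecting and sending 'metrics' ping now")

fun main() {
    val firstRunAfter4AM = LocalDateTime.of(2019, 6, 4, 10, 0)
    scheduleMetricsPing(firstRunAfter4AM, lastSentDate = null)             // sends immediately
    scheduleMetricsPing(firstRunAfter4AM.withHour(2), lastSentDate = null) // schedules for 4AM
}
```

With this logic a fresh install (no last sent date) falls into one of the two correct branches instead of being treated as "already sent today".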
Flags: needinfo?(andrei.joltan)
Flags: needinfo?(andrei.bodea)
Attached file GitHub Pull Request

The behaviour within the referenced PR was validated by QA. They are performing a last batch of tests before we can close this off.

Depends on: 1556684
Blocks: 1552507

Hi all, just performed the last batch of tests and can conclude the following:

• For fresh installs:

  • if metrics are available, they will be sent accordingly (logcat)
  • if metrics aren't available, the 'metrics' ping will be scheduled for immediate collection and will not be sent because "No content for ping 'PingType(name=metrics, includeClientId=true).name', therefore no ping queued." (logcat)

• Next day, but due time has not passed:

  • The 'metrics' ping will be scheduled for the next day:
    "The 'metrics' collection scheduled for the next calendar day"
    "MetricsPingWorker doWork()"

• Next day, due time has passed:

  • The 'metrics' ping is successfully sent
    (Google Pixel 3XL logcat, Samsung Galaxy Note9 logcat)

(In reply to andrei.joltan from comment #14)

> Hi all, just performed the last batch of tests and can conclude the following:

Thank you very much for the help, the promptness, and the great work validating this. It helped us uncover a couple of issues in the SDK. I'm calling this bug closed, as all the issues found were validated and fixed. If the 'metrics' ping analysis finds some other problem, we'll file new bugs!

Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
No longer blocks: 1552507