Closed Bug 1554860 Opened 5 years ago Closed 5 years ago

Investigate the low number of clients sending the 'metrics' ping

Categories

(Data Platform and Tools :: Glean: SDK, task, P1)

task

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: Dexter, Assigned: Dexter)

References

Details

(Whiteboard: [telemetry:mobilesdk:m8])

Attachments

(1 file)

We're seeing a low number of clients sending the 'metrics' ping, much lower than the number of clients sending the 'events' and 'baseline' pings. We should figure out why. See bug 1552507 comment 3 for more details.

Priority: -- → P1
Whiteboard: [telemetry:mobilesdk:m8]

If it's actually an uptake curve, looking at this in terms of buildids would be a quick way to check... so I did that.

Here's a look at the client counts over 3 months reported by "metrics" pings per-build-day: https://sql.telemetry.mozilla.org/queries/62922/source#161431

For comparison, here's the same thing with the same scale from the "events" ping and from the "baseline" ping.

Note that even with modern builds, the client count according to "metrics" pings isn't quite at the levels that the "baseline" and "events" pings agree on.

[Edited to add] This slight deflation seen in modern (May 26+) builds could be due to the "metrics" pings' relative infrequency of sending, so maybe we're seeing proper numbers now... but still.

Also of note: there's no obvious "Ah-ha, builds from -this- date are when we start seeing the correct numbers" moment, suggesting that maybe this doesn't have as much to do with builds as I originally surmised.

(In reply to Chris H-C :chutten from comment #1)

> If it's actually an uptake curve, looking at this in terms of buildids would be a quick way to check... so I did that.

> Here's a look at the client counts over 3 months reported by "metrics" pings per-build-day: https://sql.telemetry.mozilla.org/queries/62922/source#161431

> For comparison, here's the same thing with the same scale from the "events" ping and from the "baseline" ping.

> Note that even with modern builds, the client count according to "metrics" pings isn't quite at the levels that the "baseline" and "events" pings agree on.

Thanks for following up on this.

> [Edited to add] This slight deflation seen in modern (May 26+) builds could be due to the "metrics" pings' relative infrequency of sending, so maybe we're seeing proper numbers now... but still.

> Also of note: there's no obvious "Ah-ha, builds from -this- date are when we start seeing the correct numbers" moment, suggesting that maybe this doesn't have as much to do with builds as I originally surmised.

Yes. These pings have radically different schedules: 'baseline' is sent every time we go to the background; 'events' too, but only if some event data is available. 'metrics', on the other hand, is expected to be sent at 4AM local time, or the next time the app is opened if it is overdue and data is available.
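
To make the scheduling difference concrete, here is a rough Kotlin model of the submission triggers described above; the names (Ping, onBackground, hasPendingEvents) are illustrative stand-ins, not the actual Glean SDK API.

```kotlin
// A rough model of the submission triggers; not the real Glean SDK implementation.
enum class Ping { BASELINE, EVENTS, METRICS }

// Called whenever the app goes to the background.
fun onBackground(hasPendingEvents: Boolean): List<Ping> {
    val toSubmit = mutableListOf(Ping.BASELINE)      // 'baseline' is always sent on backgrounding
    if (hasPendingEvents) toSubmit.add(Ping.EVENTS)  // 'events' only if some event data was recorded
    // 'metrics' is not sent here: it is scheduled for 4AM local time, or submitted on the
    // next app start if it is overdue and data is available.
    return toSubmit
}

fun main() {
    println(onBackground(hasPendingEvents = true))   // [BASELINE, EVENTS]
    println(onBackground(hasPendingEvents = false))  // [BASELINE]
}
```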

Theory

Given that the first non-event metric was added 2 months ago (metrics.default_browser), all builds should have something. However, by looking at the code, I can see the following:

  • If a Fenix user disabled and then re-enabled telemetry without restarting the app, we'd not set the fields again. Since they have ping lifetime rather than application lifetime, they would not make it into the ping until the application is restarted.
  • If Fenix is closed and uninstalled before getting to 4AM, we don't receive a metrics ping.
  • We had a bug two months ago which made it impossible to send metrics pings when the scheduler hit 4AM. It's now fixed and unlikely to still wreak havoc, right?

I'm not sure the above justifies the numbers we see, though. Thoughts?
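
To illustrate the first scenario above, here is a minimal in-memory sketch (assumed names, not the real Glean storage) of a ping-lifetime store that is cleared when upload is disabled and only repopulated at process startup:

```kotlin
class PingLifetimeStore {
    private val values = mutableMapOf<String, String>()

    var uploadEnabled: Boolean = true
        set(enabled) {
            if (!enabled) values.clear() // opting out drops the stored values
            field = enabled
        }

    fun set(name: String, value: String) {
        if (uploadEnabled) values[name] = value
    }

    fun snapshotForPing(): Map<String, String> = values.toMap()
}

// Only called once per process start (e.g. from the app's startup path).
fun recordStartupMetrics(store: PingLifetimeStore) {
    store.set("metrics.default_browser", "true")
}

fun main() {
    val store = PingLifetimeStore()
    recordStartupMetrics(store)
    println(store.snapshotForPing()) // {metrics.default_browser=true}

    // The user disables and then re-enables telemetry without restarting the app:
    store.uploadEnabled = false
    store.uploadEnabled = true

    // Startup metrics are not set again, so the next 'metrics' ping would have no data
    // until the application is restarted.
    println(store.snapshotForPing()) // {}
}
```

Toggling upload off and on without a restart leaves the store empty until the next process start, which matches the gap described in the first bullet.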

(In reply to Alessio Placitelli [:Dexter] from comment #2)

> Theory

> Given that the first non-event metric was added 2 months ago (metrics.default_browser), all builds should have something. However, by looking at the code, I can see the following:

>   • If a Fenix user disabled and then re-enabled telemetry without restarting the app, we'd not set the fields again. Since they have ping lifetime rather than application lifetime, they would not make it into the ping until the application is restarted.

I don't know Android application lifecycles that well. What does "application is restarted" mean in this case? Is that a common thing? I've looked at some "events" pings by now and noticed we have some app lifecycle events... do we instrument "application is restarted"? If so, we can get a handle on its frequency and cross-correlate clients if we have to.

>   • If Fenix is closed and uninstalled before getting to 4AM, we don't receive a metrics ping.

This is an interesting one. If we look at "events" pings we find that over half of clients only send a single day's data. For "metrics" pings it's closer to 11%. This is consistent with a theory that the clients missing from the "metrics" sample might be "one and done" clients who only show up once and never again.

>   • We had a bug two months ago which made it impossible to send metrics pings when the scheduler hit 4AM. It's now fixed and unlikely to still wreak havoc, right?

Unlikely indeed. The per-buildid analysis shows that even modern builds show this deflation, and looking at the buildid split in the May 8-15 "events" ping sample, we see that basically no one's running a build older than 1 month.


I think your Theory Part 2 might be a reasonable explanation for what we're seeing. To confirm it maybe we could compile a list of client_ids present in the "events" ping sample from 2+ days and see if any of them are absent from the "metrics" sample?

(In reply to Chris H-C :chutten from comment #3)

> (In reply to Alessio Placitelli [:Dexter] from comment #2)
> > Theory

> > Given that the first non-event metric was added 2 months ago (metrics.default_browser), all builds should have something. However, by looking at the code, I can see the following:

> >   • If a Fenix user disabled and then re-enabled telemetry without restarting the app, we'd not set the fields again. Since they have ping lifetime rather than application lifetime, they would not make it into the ping until the application is restarted.

> I don't know Android application lifecycles that well. What does "application is restarted" mean in this case? Is that a common thing? I've looked at some "events" pings by now and noticed we have some app lifecycle events... do we instrument "application is restarted"? If so, we can get a handle on its frequency and cross-correlate clients if we have to.

By "restarting", I think the best way to think of this is that the application process is stopped and then launched again. Android is weird in that just because you tell it to kill a process doesn't mean it really has to... but in most cases that I have seen, it happens when you open the task switcher and swipe the app away, or when you restart the device. Technically, we shouldn't have to instrument an "app restarted" metric, as the app_opened event is functionally equivalent and should fire any time the app is restarted. There is an Android onRestart event, but that means the app does not go through the onCreate state where the metrics Alessio mentioned get set, so the process must have been killed and the application started again for those metrics to be set.

> I think your Theory Part 2 might be a reasonable explanation for what we're seeing. To confirm it maybe we could compile a list of client_ids present in the "events" ping sample from 2+ days and see if any of them are absent from the "metrics" sample?

Just thinking through this, but if the user used Fenix and then didn't use it for a day or more, there could be perfectly legitimate 'holes' (missing day(s) of 'metrics' pings), so we may not see one for every 24-hour period for every user. Just because we have an 'events' ping but no 'metrics' ping for a given day doesn't mean a user was one and done; they could also be an "I used Fenix today, but did not use it again for a few days" user. So maybe be careful about saying that a missing 'metrics' ping means the app was uninstalled; rather, say that it hasn't been used again.

(In reply to Travis Long from comment #4)

> (In reply to Chris H-C :chutten from comment #3)
> > (In reply to Alessio Placitelli [:Dexter] from comment #2)
> > > Theory

> > > Given that the first non-event metric was added 2 months ago (metrics.default_browser), all builds should have something. However, by looking at the code, I can see the following:

> > >   • If a Fenix user disabled and then re-enabled telemetry without restarting the app, we'd not set the fields again. Since they have ping lifetime rather than application lifetime, they would not make it into the ping until the application is restarted.

> > I don't know Android application lifecycles that well. What does "application is restarted" mean in this case? Is that a common thing? I've looked at some "events" pings by now and noticed we have some app lifecycle events... do we instrument "application is restarted"? If so, we can get a handle on its frequency and cross-correlate clients if we have to.

> By "restarting", I think the best way to think of this is that the application process is stopped and then launched again. Android is weird in that just because you tell it to kill a process doesn't mean it really has to... but in most cases that I have seen, it happens when you open the task switcher and swipe the app away, or when you restart the device. Technically, we shouldn't have to instrument an "app restarted" metric, as the app_opened event is functionally equivalent and should fire any time the app is restarted. There is an Android onRestart event, but that means the app does not go through the onCreate state where the metrics Alessio mentioned get set, so the process must have been killed and the application started again for those metrics to be set.

Correct.

> > I think your Theory Part 2 might be a reasonable explanation for what we're seeing. To confirm it maybe we could compile a list of client_ids present in the "events" ping sample from 2+ days and see if any of them are absent from the "metrics" sample?

> Just thinking through this, but if the user used Fenix and then didn't use it for a day or more, there could be perfectly legitimate 'holes' (missing day(s) of 'metrics' pings), so we may not see one for every 24-hour period for every user. Just because we have an 'events' ping but no 'metrics' ping for a given day doesn't mean a user was one and done; they could also be an "I used Fenix today, but did not use it again for a few days" user. So maybe be careful about saying that a missing 'metrics' ping means the app was uninstalled; rather, say that it hasn't been used again.

We're talking about clients (one or more pings referring to the same client_id), not about ping counts. So even if there was a hole of multiple days after one 'metrics' ping was received, you'd still count 1 client there. Same if we received 10 'metrics' pings for 10 days from that client. Always 1 :)

Alessio, the PI request for this issue is being handled by Andrei Joltan and Andrei Bodea. Could you tell us which scenarios we can try out to narrow down the cause?

Flags: needinfo?(andrei.bodea)
Flags: needinfo?(alessio.placitelli)

Adding Andrei Joltan, as we will work on this issue together.

Flags: needinfo?(andrei.joltan)

(In reply to No-Jun Park [:njpark] from comment #6)

> Alessio, the PI request for this issue is being handled by Andrei Joltan and Andrei Bodea. Could you tell us which scenarios we can try out to narrow down the cause?

Sure! I had a conversation with Andrei Bodea to go through this a little while ago. In general, I'd start by making sure that the documented schedule for the 'metrics' ping is respected:

  • we want to send the ping every day at 4AM local time;
  • we want to send it after the application is started, at whatever time, if no 'metrics' ping was sent on the previous day.

We want to make sure both cases are covered, and to uncover any potential edge cases that compromise the above scheduling.

Flags: needinfo?(alessio.placitelli)

Chris, since you looked at the data, is there anything else you'd suggest testing in addition to the scheduling from comment 8?

Flags: needinfo?(chutten)

It might be nice to also confirm that no "metrics" ping is sent until either 4AM local or the second time the application is opened.

On top of that I don't think there's much else unless we want them to take my laundry list of verification items :)

Flags: needinfo?(chutten)
Blocks: 1552884
Assignee: nobody → alessio.placitelli

Thanks to the work of Andrei Bodea and Andrei Joltan, we were able to identify at least one issue that might contribute to the low number of 'metrics' pings sent by Fenix.

On a fresh install, where no "last date the metrics ping was sent" is available, we set that date to the current day. This triggers the "ping already sent today" case, postponing the ping by at least 24 hours.

If the user never opens Fenix again, no ping will be sent.

The correct behaviour (see the sketch below) should be:

  • if the due time has not passed, schedule the ping for the due time;
  • if the due time has passed, collect and send immediately.
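
Here is a sketch of that corrected logic, assuming the scheduler keeps a nullable "last sent date"; scheduleMetricsPing, scheduleIn and collectAndSendNow are hypothetical stand-ins, not the actual patch:

```kotlin
import java.time.Duration
import java.time.LocalDate
import java.time.LocalDateTime
import java.time.LocalTime

// Due time for the 'metrics' ping: 4AM local.
val DUE_TIME: LocalTime = LocalTime.of(4, 0)

fun scheduleMetricsPing(now: LocalDateTime, lastSentDate: LocalDate?) {
    val dueToday = now.toLocalDate().atTime(DUE_TIME)
    when {
        // A ping was actually sent today: wait for tomorrow's due time.
        lastSentDate == now.toLocalDate() ->
            scheduleIn(Duration.between(now, dueToday.plusDays(1)))
        // Due time has not passed yet (covers fresh installs started before 4AM):
        // schedule the ping for the due time.
        now.isBefore(dueToday) ->
            scheduleIn(Duration.between(now, dueToday))
        // Due time has passed and nothing was sent today (covers fresh installs after 4AM):
        // collect and send immediately instead of silently postponing by 24 hours.
        else -> collectAndSendNow()
    }
}

// Hypothetical stand-ins for the real scheduling and submission machinery.
fun scheduleIn(delay: Duration) = println("scheduling 'metrics' ping in $delay")
fun collectAndSendNow() = println("collecting and sending 'metrics' ping now")

fun main() {
    val firstRunAfter4AM = LocalDateTime.of(2019, 6, 4, 10, 0)
    scheduleMetricsPing(firstRunAfter4AM, lastSentDate = null)             // sends immediately
    scheduleMetricsPing(firstRunAfter4AM.withHour(2), lastSentDate = null) // schedules for 4AM
}
```

With this logic a fresh install (no last sent date) falls into one of the two correct branches instead of being treated as "already sent today".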
Flags: needinfo?(andrei.joltan)
Flags: needinfo?(andrei.bodea)
Attached file GitHub Pull Request

The behaviour within the referenced PR was validated by QA. They are performing a last batch of tests before we can close this off.

Depends on: 1556684
Blocks: 1552507

Hi all, just performed the last batch of tests and can conclude the following:

• For fresh installs:

  • if metrics are available, they will be sent accordingly (logcat)
  • if metrics aren't available, the 'metrics' ping will be scheduled for immediate collection and will not be sent because "No content for ping 'PingType(name=metrics, includeClientId=true).name', therefore no ping queued." (logcat)

• Next day, but due time has not passed:

  • The 'metrics' ping will be scheduled for the next day:
    "The 'metrics' collection scheduled for the next calendar day"
    "MetricsPingWorker doWork()"

• Next day, due time has passed:

  • The 'metrics' ping is successfully sent
    (Google Pixel 3XL logcat, Samsung Galaxy Note9 logcat)

(In reply to andrei.joltan from comment #14)

> Hi all, just performed the last batch of tests and can conclude the following:

Thank you very much for the help, the promptness, and the great work validating this. It helped us uncover a couple of issues in the SDK. I'm calling this bug closed, as all the issues found were validated and fixed. If the 'metrics' ping analysis finds some other problem, we'll file new bugs!

Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
No longer blocks: 1552507