Closed Bug 1257589 Opened 4 years ago Closed 4 years ago

Unexpected low number of Fennec clients seen via "core" pings on release

Categories

(Firefox for Android :: General, defect)

defect
Not set

Tracking

()

RESOLVED DUPLICATE of bug 1244295
Tracking Status
fennec 46+ ---

People

(Reporter: gfritzsche, Assigned: mcomella)

References

(Blocks 1 open bug)

Details

The analysis in bug 1255458 [0] found a surprisingly low number of clients, only about ~201,000 as of today.

Possibilites:
(1) slow uptake on Android
(2) we don't actually send the "core" ping opt-out, requiring opt-in to Telemetry
(3) bugs preventing us from sending "core" pings
(4) clientId bugs, making our get_one_ping_per_client() approach fail for counting clients


0: https://gist.github.com/georgf/c9f4824e5d3bbf112db7
Re (1): Going by the Play Store install numbers, we should see a lot more clients (if i read that chart right?).

Re (2): From Michaels first rough look, the opt-out check seems fine. More details coming?
We're most of the way through our usual adoption curve, so yes, this should be in the millions by now.
I see that the ping-to-client ratio is roughly the same as the numbers from the former analysis run on beta:
https://gist.github.com/Dexterp37/e5176ac397b5ea37ad17
We double-checked things on the pipeline side, it seems ok there:
* mreid looked at the raw data in S3 and we see about 500MB raw data for "core" pings in release yesterday (which matches the numbers of the size analysis i ran on release).
* Overall it's about 1.1GB of raw "core" pings yesterday, wich roughly matches the numbers i have for beta + the release numbers i see.
* There are apparently no "core" pings in the error stream
tracking-fennec: --- → ?
I took another look here "core" pings vs. "saved-session" pings on 45 release:
https://gist.github.com/georgf/4fa2b2cf1223f17e0aaf

* clients submitting "core" pings: 214147
* clients submitting "saved-session" pings: 207868
* clients seen in "core" pings but not in "saved-session": 16613, ~8%
* clients seen in "saved-session" pings but not in "core": 10334, ~5%
Michael, can you recap your findings for whether this could be a settings problem (e.g. opt-out vs. opt-in) here?
Are there other things to investigate client-side, maybe taking it through some manual testing?
Flags: needinfo?(michael.l.comella)
I updated both analysis today, the numbers still look similar (257961 clients seen on release).
tracking-fennec: ? → 46+
Assignee: nobody → michael.l.comella
(In reply to Georg Fritzsche [:gfritzsche] from comment #6)
> Michael, can you recap your findings for whether this could be a settings
> problem (e.g. opt-out vs. opt-in) here?
> Are there other things to investigate client-side, maybe taking it through
> some manual testing?

New hypothesis: We don't upload if Java can't fetch the client ID and Java doesn't create client IDs without bug 1244295 (which is not on 45) – it waits for Gecko to create client IDs. However, in my experience, client IDs are not created by Gecko without opt-in telemetry or users activating sync (I'll investigate for other reasons why this might occur).

The numbers for core ping might mirror saved-session because only these users have enabled opt-in telemetry and thus have client IDs (plus those that have enabled Sync).

We may not have seen this issue on Beta because Gecko telemetry is opt-out (and thus pretty much everyone has client IDs).

Action items:
* see if there is a way to confirm this hypothesis (bad one: use local 45 install to see if a telemetry ID is ever created)
* If this is the error:
  - Identify what we can do so this doesn't happen again
  - Finish integration tests (bug 1249491) so we can uplift Java creates client IDs (bug 1244295) to 46

fwiw, other reasons we might not upload (following code from [1]):
* User FHR opt-out
* Background data uploads are disabled by user
* Programmer error
 - Compile-time disable
 - Malformed Intent to upload
   - Intent is null
   - Intent is missing either doc id, seq number, profile name, or profile path
   - Intent action is not ACTION_UPLOAD_CORE

[1]: https://mxr.mozilla.org/mozilla-central/source/mobile/android/base/java/org/mozilla/gecko/telemetry/TelemetryUploadService.java?rev=a805bd0f0a35#64
---

Initial investigation recap on FHR preference status (copy-pasta from email):

Investigation results that I didn't have time to get confident on: it looks like we're opt-out. That being said, we're re-using an old preference so the users may have already opted-out.

Reasoning:

Places where we write the pref and thus could set "opt-out" vs. "opt-in":
    default value via xml: https://mxr.mozilla.org/mozilla-central/search?string=healthreport.uploadenabled&find=mobile%2Fandroid&findi=&filter=^[^\0]*%24&hitlimit=&tree=mozilla-central
    with parental controls enabled: https://mxr.mozilla.org/mozilla-central/search?string=healthreport.uploadenabled&find=mobile%2Fandroid&findi=&filter=^[^\0]*%24&hitlimit=&tree=mozilla-central
    unclear what this does, but it only sets it to true: https://mxr.mozilla.org/mozilla-central/search?string=putBoolean\%28.*PREFS_HEA&regexp=1&find=mobile%2Fandroid&findi=&filter=^[^\0]*%24&hitlimit=&tree=mozilla-central

Additionally, when we access the pref for determining if we're going to upload, we default to `true` if the preference is not stored yet:

if (!GeckoPreferences.getBooleanPref(context, GeckoPreferences.PREFS_HEALTHREPORT_UPLOAD_ENABLED, true)) {
  return false;

https://mxr.mozilla.org/mozilla-central/search?string=healthreport.uploadenabled&find=mobile%2Fandroid&findi=&filter=^[^\0]*%24&hitlimit=&tree=mozilla-central
Flags: needinfo?(michael.l.comella)
fwiw, I do print out a log statement in 45 when the client ID could not be accessed:
  https://mxr.mozilla.org/mozilla-central/source/mobile/android/base/java/org/mozilla/gecko/telemetry/TelemetryUploadService.java#175

Perhaps we can look at crash logs from FF45 and see if the majority of them contain the log statement.
(In reply to Michael Comella (:mcomella) from comment #9)
> Perhaps we can look at crash logs from FF45 and see if the majority of them
> contain the log statement.

Plot twist! iirc, crash logs only include logcat output if you opt-in to telemetry so we'd only be able to get the logs from users who, under my hypothesis, don't have the problem of null client IDs.
Thoughts:
 * I tested on FF45 locally by doing normal browsing patterns for 5 minutes and have left it idle for another few minutes – I could not get the client ID to generate.
 * afaik, there is no way to access the logs of the potentially affected users
 * We're left to theorize what is possible – e.g. by tracing the call stack of getClientID to see where the client ID could get initialized (and thus used by Java)

To follow where getClientID (and thus client ID creation) is called from [1]:
  TelemetryController.jsm - setupTelemetry [2]
   TelemetryController.setup
    (only seems to be called in tests – maybe it's called from C++ and I'm not following the code correctly?)
   (TelemetryController.jsm) Impl.observe (for "profile-after-change" events, "... is only registered for chrome processes." – I'm guessing this is an e10s thing mobile does not do? "profile-after-change" is called after the profile is selected [3])

Georg, do you think you can help me out? Where can getClientID (and thus Gecko client ID creation) get called from?

This hypothesis seems likely so in the mean time, I will continue finishing the integration tests (bug 1249491) so we can more confidently uplift client ID creation in Java (bug 1244295).

[1]: https://mxr.mozilla.org/mozilla-central/search?string=getClientID&case=on&find=&findi=&filter=%5E%5B%5E%5C0%5D*%24&hitlimit=&tree=mozilla-central
[2]: https://mxr.mozilla.org/mozilla-central/source/toolkit/components/telemetry/TelemetryController.jsm#676
[3]: https://developer.mozilla.org/en-US/docs/Observer_Notifications
Flags: needinfo?(gfritzsche)
(In reply to Michael Comella (:mcomella) from comment #11)
> Thoughts:
>  * I tested on FF45 locally by doing normal browsing patterns for 5 minutes
> and have left it idle for another few minutes – I could not get the client
> ID to generate.

*This* is odd.

>  * We're left to theorize what is possible – e.g. by tracing the call stack
> of getClientID to see where the client ID could get initialized (and thus
> used by Java)

This only happens in one location in the JSM, the one you already found in |TelemetryController.setupTelemetry|. The other function, |TelemetryController.setup|, is only called in XPCSHELL tests (this is probably going to change to make things clearer in bug 1145188).

I *think* "profile-after-change" is also called on mobile, otherwise we wouldn't be receiving "saved-session" pings:

- TelemetryController.setupTelemetry inits TelemetrySend [1]
- TelemetrySend.setup gets the address of the server from the prefs [2]
- We use the address in this._server to send all our pings, including the "saved-session" ping

I think we're bailing out of |setupTelemetry| for other reasons. One could be that |TelemetryController.enableTelemetryRecording| is returning false [3]. That would prevent getClientID() from being called later on.

Does this make any sense to you, Michael?

[1] - https://dxr.mozilla.org/mozilla-central/rev/b942c98f56c4c2926b8b81b98425072a091bbf7b/toolkit/components/telemetry/TelemetryController.jsm#673
[2] - https://dxr.mozilla.org/mozilla-central/rev/b942c98f56c4c2926b8b81b98425072a091bbf7b/toolkit/components/telemetry/TelemetrySend.jsm#584
[3] - https://dxr.mozilla.org/mozilla-central/rev/b942c98f56c4c2926b8b81b98425072a091bbf7b/toolkit/components/telemetry/TelemetryController.jsm#645
Oh, i wasn't aware that bug 1244295 is not on 45.

(In reply to Alessio Placitelli [:Dexter] from comment #12)
> (In reply to Michael Comella (:mcomella) from comment #11)
> > Thoughts:
> >  * I tested on FF45 locally by doing normal browsing patterns for 5 minutes
> > and have left it idle for another few minutes – I could not get the client
> > ID to generate.
> 
> *This* is odd.
[...]
> I think we're bailing out of |setupTelemetry| for other reasons. One could
> be that |TelemetryController.enableTelemetryRecording| is returning false
> [3]. That would prevent getClientID() from being called later on.

This is probably the reason and could explain the strong correlation between the "saved-session" and "core" ping population.
In short: we don't generate a client id from Gecko unless users did opt-in to Telemetry.
(We don't have the "unified" Telemetry pref enabled on mobile [0], which means we bail out at [1] and don't trigger generating a client id from the Gecko side - unless a user opted into Telemetry.)

However, i'd expect most clients to already have a client id on disk (state.json) from FHR.
Does Java read that from disk on 45?
If so then we'd either:
* have a bug with Java reading state.json that affects most clients, or
* state.json was cleaned up for some reason, or
* we are looking at a different bug affecting "core" ping generation or upload

0: https://dxr.mozilla.org/mozilla-central/rev/d5f3da0cfe7ccf846c354014c9b059fad6ba0de5/mobile/android/app/mobile.js#909
1: https://dxr.mozilla.org/mozilla-central/rev/d5f3da0cfe7ccf846c354014c9b059fad6ba0de5/toolkit/components/telemetry/TelemetryController.jsm#605
Flags: needinfo?(gfritzsche)
Thanks for the input, Alessio & Georg.

(In reply to Georg Fritzsche [:gfritzsche] from comment #13)
> However, i'd expect most clients to already have a client id on disk
> (state.json) from FHR.
> Does Java read that from disk on 45?

No – it appears we just read the datareporting/state.json:
 http://mxr.mozilla.org/mozilla-release/source/mobile/android/base/java/org/mozilla/gecko/GeckoProfile.java?rev=1088f9454fc9#619

And it's added with the rest of the client ID code in bug 1244295:
 https://reviewboard.mozilla.org/r/34875/diff/3#index_header

So bug 1244295 should fix this issue, which I'd feel confident uplifting with the integration tests landed (bug 1249491, they're currently under review, passing locally, and failing on try but I suspect emulator issues). Alternatively (or additionally), we could cause the client ID profile init to happen sooner. I'll work on uplifting for now but do you think we should also make client ID init happen sooner, Georg?
Flags: needinfo?(gfritzsche)
(In reply to Michael Comella (:mcomella) from comment #14)
> Thanks for the input, Alessio & Georg.
> 
> (In reply to Georg Fritzsche [:gfritzsche] from comment #13)
> > However, i'd expect most clients to already have a client id on disk
> > (state.json) from FHR.
> > Does Java read that from disk on 45?
> 
> No – it appears we just read the datareporting/state.json:
>  http://mxr.mozilla.org/mozilla-release/source/mobile/android/base/java/org/
> mozilla/gecko/GeckoProfile.java?rev=1088f9454fc9#619

We migrated to the datareporting/state.json a while ago (before FHR was removed), i think most of the existing population that regularly updates should have that file .
Bug 1244295 should mostly affect new installs/profiles, so i think there is another bigger issue here (per the points in comment 13).

> So bug 1244295 should fix this issue, which I'd feel confident uplifting
> with the integration tests landed (bug 1249491, they're currently under
> review, passing locally, and failing on try but I suspect emulator issues).
> Alternatively (or additionally), we could cause the client ID profile init
> to happen sooner. I'll work on uplifting for now but do you think we should
> also make client ID init happen sooner, Georg?

I'm not sure what this means, can you expand a bit?
I was assuming that currently Fennec activations immediately trigger a "core" ping - except without an existing state.json which would trigger the client id to be generated and then send a "core" ping.
Flags: needinfo?(gfritzsche)
> I was assuming that currently Fennec activations immediately trigger a
> "core" ping - except without an existing state.json which would trigger the
> client id to be generated and then send a "core" ping.

Perhaps this is where a possible misunderstanding lies – without bug 1244295, which is not in release, Java will not create state.json if it does not exist.

To summarize and make sure we're on the same page:

Shorthand:
* dr/state = datareporting/state.json
* hr/state = healthreport/state.json

In 45: Java reads existing dr/state but not hr/state. If dr/state does not exist, Java does not create it and does not upload a telemetry core ping.

(via comment 14) We believe dr/state is never created in most of the release population because we leave telemetry init before getClientID() is called (which creates the file). Since dr/state does not exist, Java does not submit a telemetry core ping.
---

However, there is a possibility that dr/state should exist because it was migrated from FHR:

(In reply to Georg Fritzsche [:gfritzsche] from comment #15)
> We migrated to the datareporting/state.json a while ago (before FHR was
> removed), i think most of the existing population that regularly updates
> should have that file .

Georg, where does the migration logic occur? If it uses getClientID(), the migration probably didn't happen.

> Bug 1244295 should mostly affect new installs/profiles,

bug 1244295 should mostly affect new installs/profiles if dr/state already exists. This means:
  1) getClientID was called
  2) hr/state was migrated to dr/state.

However:
  1) Unless the telemetry init logic has changed, this does not happen.
  2) Awaiting on an answer on the migration logic above.

> so i think there is
> another bigger issue here (per the points in comment 13).

I don't think I understand how the points on comment 13 show there is a bigger issue:

The first paragraph in comment 13 seems to agree with my summary above while the second paragraph assumes Java reads the FHR data (which it doesn't in our release population since bug 1244295 has not landed).

Georg, can you elaborate on why you think there is a bigger issue?

> > So bug 1244295 should fix this issue, which I'd feel confident uplifting
> > with the integration tests landed (bug 1249491, they're currently under
> > review, passing locally, and failing on try but I suspect emulator issues).

To summarize: assuming the FHR migration logic wasn't called, bug 1244295 sounds like a likely fix to me. I would feel confident uplifting it when the tests are landed (bug 1249491, under review).

> > Alternatively (or additionally), we could cause the client ID profile init
> > to happen sooner. I'll work on uplifting for now but do you think we should
> > also make client ID init happen sooner, Georg?
>
> I'm not sure what this means, can you expand a bit?

If bug 1244295 doesn't work out (or it's too tentative to rely on for 46), another possible solution is to have getClientID called for every client in the JS code. Is this something you'd like to explore, Georg?
Flags: needinfo?(gfritzsche)
(In reply to Michael Comella (:mcomella) from comment #16)
> > I was assuming that currently Fennec activations immediately trigger a
> > "core" ping - except without an existing state.json which would trigger the
> > client id to be generated and then send a "core" ping.
> 
> Perhaps this is where a possible misunderstanding lies – without bug
> 1244295, which is not in release, Java will not create state.json if it does
> not exist.
> 
> To summarize and make sure we're on the same page:
> 
> Shorthand:
> * dr/state = datareporting/state.json
> * hr/state = healthreport/state.json
> 
> In 45: Java reads existing dr/state but not hr/state. If dr/state does not
> exist, Java does not create it and does not upload a telemetry core ping.
> 
> (via comment 14) We believe dr/state is never created in most of the release
> population because we leave telemetry init before getClientID() is called
> (which creates the file). Since dr/state does not exist, Java does not
> submit a telemetry core ping.
> ---
> 
> However, there is a possibility that dr/state should exist because it was
> migrated from FHR:
> 
> (In reply to Georg Fritzsche [:gfritzsche] from comment #15)
> > We migrated to the datareporting/state.json a while ago (before FHR was
> > removed), i think most of the existing population that regularly updates
> > should have that file .
> 
> Georg, where does the migration logic occur? If it uses getClientID(), the
> migration probably didn't happen.

I think this is where the misunderstanding comes from - the FHR code (that we removed recently) already did that migration.
So, for 45:
(1) most of the release population should already have a datareporting/state.json file from the FHR code migrating it
(2) a few have it from opt-in Telemetry generating datareporting/state.json
(3) a (presumably pretty small) number, updating from old code that used healthreport/state.json to 45 have no datareporting/state.json
(4) new users don't have a state.json unless they opted into Telemetry

Because of (1) i assume that we have identified one issue, but not the one that affects most users.

> > > Alternatively (or additionally), we could cause the client ID profile init
> > > to happen sooner. I'll work on uplifting for now but do you think we should
> > > also make client ID init happen sooner, Georg?
> >
> > I'm not sure what this means, can you expand a bit?
> 
> If bug 1244295 doesn't work out (or it's too tentative to rely on for 46),
> another possible solution is to have getClientID called for every client in
> the JS code. Is this something you'd like to explore, Georg?

I think it would be good to focus on getting bug 1244295 to work and then bug 1255657 to resolve remaining issues.
Given the above points i don't think this is the main issue to focus on at the moment.
Flags: needinfo?(gfritzsche)
(In reply to Georg Fritzsche [:gfritzsche] from comment #17)

> I think this is where the misunderstanding comes from - the FHR code (that
> we removed recently) already did that migration.

But remember that we did not ship any JS FHR code on Android. If that's the migration you're referring to…

> So, for 45:
> (1) most of the release population should already have a
> datareporting/state.json file from the FHR code migrating it

… I imagine that most of our Android release population *does not*, and…

> (2) a few have it from opt-in Telemetry generating datareporting/state.json

… these are the only users who do.

Can you point to the migration code?
(In reply to Richard Newman [:rnewman] from comment #18)
> (In reply to Georg Fritzsche [:gfritzsche] from comment #17)
> 
> > I think this is where the misunderstanding comes from - the FHR code (that
> > we removed recently) already did that migration.
> 
> But remember that we did not ship any JS FHR code on Android. If that's the
> migration you're referring to…

Right, that was the part i was missing, thanks for clearing that up.
So bug 1244295 is probably the main issue here.
Depends on: 1244295
I believe we're starting to see the numbers we were expecting. mfinkle noticed a hitch in today's numbers, however, let's do that in a separate bug.

This was fixed by bug 1244295.
Status: NEW → RESOLVED
Closed: 4 years ago
Resolution: --- → DUPLICATE
Duplicate of bug: 1244295
You need to log in before you can comment on or make changes to this bug.