Closed Bug 1599999 Opened 4 years ago Closed 4 years ago

Investigate why ~2.5% of daily clients have not migrated from glean-ac to glean-core yet

Categories

(Data Platform and Tools :: Glean: SDK, task, P3)

task

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: brizental, Assigned: brizental)

References

Details

(Whiteboard: [telemetry:glean-rs:m11])

As can be verified from this query not all clients that should have recorded a successful migration from glean-ac to glean-core (clients that were in previous versions and updated to version 19 or later) have.

Depends on: 1585587
Summary: Investigate why not all clients have migrated from glean-ac to glean-core yet → Investigate why ~2.5% of daily clients have not migrated from glean-ac to glean-core yet
Assignee: nobody → beatriz.rizental
Whiteboard: [telemetry:glean-rs:m?] → [telemetry:glean-rs:m11]
Assignee: beatriz.rizental → brizental

Ok, I started working on this and my initial discoveries can be seen on this new query: https://sql.telemetry.mozilla.org/queries/66758/source#169217

The first thing I did was add the baseline pings to the sample and see if that changed the results, it did nothing but add (a lot) more pings to the overall results, which made it harder to analyze, so I removed it. If you wish I can put it back so you can see.

Next I decided to count how many metrics pings a client sends before actually making a successful migration. I would expect the client to send one ping without the migration done and then multiple pings with it done, but turns out that is not the case. Clients will send multiple metrics pings before making the migration. One question: as I understand it the metrics ping is sent once a day, why do we see so many ping for each client if my sample is from the 24th of October on? Shouldn't we have maximum ~40 pings per client?

You can uncomment the bottom part of the query to see how many pings the clients that haven't yet migrated sent, but it is not so concerning if you look at the number of pings the clients that have succesfully migrated have sent before migrating...

So, what I am thinking is that there is the 4am trigger to send the metrics ping, right? When the app in woken to send the metrics ping at 4am, will migration occur? Or will it happen only when the user actually opens the app? Because then it might be the case that we are seeing users that have not opened the app yet and show up as "not migrated".

If that is the case, here is another question: shouldn't the 4am trigger happen only when there is data to send? If the user hasn't opened the app for multiple days, it should trigger only on the day or day after the user has opened the app (because there will be data), and then once again the next day (but not send anything because there won't be any data) and then never again until the app is opened, right?

One way to verify I guess would be to check if all this non-migrated pings were sent without any data in them, to validate my hypothesis of these being triggered at 4am without the user having opnened the app. I do know that sending empty pings is a new feature just added by [:janerik], so I am not sure about this approach.

Okay, I'll leave this questions here for now :)

Flags: needinfo?(alessio.placitelli)

Next I decided to count how many metrics pings a client sends before actually making a successful migration. I would expect the client to send one ping without the migration done and then multiple pings with it done, but turns out that is not the case. Clients will send multiple metrics pings before making the migration. One question: as I understand it the metrics ping is sent once a day, why do we see so many ping for each client if my sample is from the 24th of October on? Shouldn't we have maximum ~40 pings per client?
You can uncomment the bottom part of the query to see how many pings the clients that haven't yet migrated sent, but it is not so concerning if you look at the number of pings the clients that have succesfully migrated have sent before migrating...

Okay strach that, there was a bug in my query, which is now fixed and the numbers are kind of like I predicted. The successfully migrated have between 0 or 2 pings before migration, with three outliers with the largest one having 6 pings before migration.

The non successfully migrated are much different, having between 1 and 73 pings each with an average of ~12 pings each. Currently, there 135 clients in this situation.

So, what I am thinking is that there is the 4am trigger to send the metrics ping, right? When the app in woken to send the metrics ping at 4am, will migration occur? Or will it happen only when the user actually opens the app? Because then it might be the case that we are seeing users that have not opened the app yet and show up as "not migrated".
If that is the case, here is another question: shouldn't the 4am trigger happen only when there is data to send? If the user hasn't opened the app for multiple days, it should trigger only on the day or day after the user has opened the app (because there will be data), and then once again the next day (but not send anything because there won't be any data) and then never again until the app is opened, right?
One way to verify I guess would be to check if all this non-migrated pings were sent without any data in them, to validate my hypothesis of these being triggered at 4am without the user having opnened the app. I do know that sending empty pings is a new feature just added by [:janerik], so I am not sure about this approach.

This, can also be scratched now that the numbers (at least for the successfully migrated clients) look kind of like I expected them too (there still many clients with more than ~40 pings each).

My strategy though is still to look more closely at the data from the non migrated clients. One thing I will check first is if they are really not migrated or if they actually just haven't recorded the successfull migration. I'll report back here, anyways.

Talking with [:Dexter] on slack, I came to the conclusion that the fact that the user sent multiple pings after not recording a successful migration, probably means that this user has indeed migrated succesfully, but just hasn't recorded that. We are certain about this, because if a user doesn't migrate, glean will generate a new client id for them, thus making it impossible to relate the client id from previous verisons with the new one.

One thing I realize, is that ideally users shoud be sending 0 pings with migration_succesful = false after update, and the fact that they actually send sometimes 1 or 2 is a problem. Looking closely at that data, I think that is related to another issue that I reported on Bug 1601960.

Finally I will now try and understand what could possibly make a user not add the migration successful to their ping after actually succesfully migrating.

Flags: needinfo?(alessio.placitelli)
Blocks: 1601960

Finally I will now try and understand what could possibly make a user not add the migration successful to their ping after actually succesfully migrating.

Been trying to figure this out without much success. Although I have many hypothesis, I wasn't able to figure out a way to prove any of them. So here are the facts:

  1. After discussion with [:jan-erik], we came to the conclusion that it is possible that users might be sending a metrics ping before finishing the migration (or more than one due to Bug 1601960). We arrived at that conclusion because the scheduling of the ping happens "before" (it's async) the final stage of the migration;
  2. Because of this, we can assume that all clients that haven't recorded a successful migration, but have at most 2 pings in versions above or equal to 19 are users that either didn't migrate succesfully and recreated their client id, or users that migrated succesfully but haven't yet sent the ping with that info. If we remove these clients from the "not migrated" sample, we end up with only 1.4% of all the clients that should have migrated in a "really not migrated" state. Which is not as bad as 2.5%, I suppose;
  3. After finding out that a client that doesn't migrate succesfully will just recreate the client_id, I revisited this query, to see if the number of unique client ids grew abruptly after 24th of October (when we released v19), which could be an indication that many clients had unsuccesful migrations. There is a small increase in the slope of the graph on that date, but not very expressive and could also be due to Skyline, so not too much to worry there;

(Sidebar: I changed the PARTITION BY clause to use the sequence number. Turns out that is more reliable because end_time is weird sometimes. I changed that in all queries, which didn't change much, but did remove the outliers with more than 2 pings prior to migration from the "succesfully migrated" sample.)

Just for means of documentation, my non proven hypothesis as to why clients don't record the successful migration sometimes:

  • Something could be wrong with set. I say specifically set, because it seems we are indeed entering the markAsMigrated function and saving the wasMigrated pref, as these users seem not to be retrying to migrate at later times. But then again, the users could be retrying and failing everytime.
  • Upon collection, ping data is cleared. Say the ping scheduler, which runs async, clears the migration_successful: true before it is collected for the next ping? Already disproven, rkv wouldn't let that happen.
  • Since migration occurs in many steps, and not all at the same time, maybe it runs into problems at some point and fails. (Yes, this one is vague).

Thoughts, [:Dexter] and [:chutten]?

For reference:

Flags: needinfo?(chutten)
Flags: needinfo?(alessio.placitelli)

(In reply to Beatriz Rizental from comment #4)

Finally I will now try and understand what could possibly make a user not add the migration successful to their ping after actually succesfully migrating.

Been trying to figure this out without much success. Although I have many hypothesis, I wasn't able to figure out a way to prove any of them. So here are the facts:

  1. After discussion with [:jan-erik], we came to the conclusion that it is possible that users might be sending a metrics ping before finishing the migration (or more than one due to Bug 1601960). We arrived at that conclusion because the scheduling of the ping happens "before" (it's async) the final stage of the migration;

Oh, snag. That's true. And this is a bug in the code that should be happening 100% of the times if we generate a metrics ping at startup! That's because triggering the 'metrics' ping collection is dispatched to our API Dispatcher, which enqueues stuff, at startup. All API calls are dispatched as well, to the same Dispatcher. This means we have the following behaviour when generating startup 'metrics' pings:

  1. We migrate sequence numbers first, here;
  2. We init the Glean singleton in Rust.
  3. Then we migrate the user lifetime metrics (including the client_id!), here.
  4. Then we check if we need to send the metrics ping at startup, and we dispatch its collection, here
  5. THEN we mark migration as completed, here. Since we dispatch this after the collection of the metrics ping, and since dispatched actions are guaranteed to be executed in order, then we're guaranteed to never have the boolean set, even for successful migrations, on that ping.

However, we should probably see that value in following pings, which appears to not be the case.

  1. Because of this, we can assume that all clients that haven't recorded a successful migration, but have at most 2 pings in versions above or equal to 19 are users that either didn't migrate succesfully and recreated their client id, or users that migrated succesfully but haven't yet sent the ping with that info. If we remove these clients from the "not migrated" sample, we end up with only 1.4% of all the clients that should have migrated in a "really not migrated" state. Which is not as bad as 2.5%, I suppose;

Yeah, not that bad, but still a tiny bit above the ~1% threshold. Not too concerning if we have a solid explanation, though.

  1. After finding out that a client that doesn't migrate succesfully will just recreate the client_id, I revisited this query, to see if the number of unique client ids grew abruptly after 24th of October (when we released v19), which could be an indication that many clients had unsuccesful migrations. There is a small increase in the slope of the graph on that date, but not very expressive and could also be due to Skyline, so not too much to worry there;

Yes, I'm not even sure I can spot the slope there :)

Just for means of documentation, my non proven hypothesis as to why clients don't record the successful migration sometimes:

  • Something could be wrong with set. I say specifically set, because it seems we are indeed entering the markAsMigrated function and saving the wasMigrated pref, as these users seem not to be retrying to migrate at later times. But then again, the users could be retrying and failing everytime.

I think it's unlikely that set is misbehaving, but worth double checking it. It is not fundamentally different compared to other metric types, and we have extensive testing for this.

  • Since migration occurs in many steps, and not all at the same time, maybe it runs into problems at some point and fails. (Yes, this one is vague).

This could be, yes. If we get startup crashes after the client id is migrated but before we execute the dispatched "mark migration as complete", then we'd be migrated but without the flag. I'm afraid this would be hard to validate, as, currently, lib-crash metrics are opt-in.

Unless Chris has other ideas, I'd be ok with closing this investigation.

Flags: needinfo?(alessio.placitelli)

Hm. I decided to approach this problem from a different angle in order to see if I could gain any insights while reproducing the results.

According to my query of the total number of clients who should have migrated (2882) 4.6% (133) have not said they've migrated. (( This might be the same 2.5% we're seeing per day, just counting them all at once instead of per-day. ))

Splitting them by the number of pings they've sent in >= 19, I get 1.9% (54) clients who have sent sufficient (more than 2) pings from newer versions that at least one of them should contain the migrated metric. This number holds steady if I up the threshold to 3.

This is close but not identical to what you've found and can probably be put down to differences in methodology.

Though I'd really like to find an answer to this mystery, the mysterious clients form 0.2% of the total clients in the sample. At this point, the bulk (58.8%) of the population are too new to ever have need of the migration code at all. So let's end it with the educational result that, should there be a next time, we should record migration starts and ends.

Flags: needinfo?(chutten)

This is close but not identical to what you've found and can probably be put down to differences in methodology.

Yes, diferences in methodology. I also got 54 clients that probably did, but didn't record migration. But my "should have migrated" was something like 3640.

Anyways, I agree with both of you that we should close this (with the educational note by [:chutten] :)).

Not happy about it, but will mark this as WONTFIX.

Thanks!

Status: NEW → RESOLVED
Closed: 4 years ago
Resolution: --- → WONTFIX

(In reply to Beatriz Rizental from comment #7)

Not happy about it, but will mark this as WONTFIX.

You should be very happy about your investigation, though :) It moved our knowledge of the problem quite a bit forward!

You need to log in before you can comment on or make changes to this bug.