Android: Run the metrics ping scheduler in `onStart` rather than `onCreate`
Categories
(Data Platform and Tools :: Glean: SDK, defect, P4)
Tracking
(Not tracked)
People
(Reporter: mdroettboom, Unassigned)
Details
Given what we know from bug 1682085, namely that metrics pings are frequently sent on days with no user activity and no baseline ping, we should try to correct this.
The hypothesis is that something is causing the process to start in the background without any user interaction, so metrics pings get sent without any recent user activity. Since the metrics ping scheduler runs inside Glean.initialize, which is called in onCreate, it will always attempt to send a metrics ping (if overdue), even when the app is running in the background. Moving the metrics ping scheduler check to onStart would limit it to only the cases where there is about to be a visible user interaction, and would match the semantics of the baseline ping's foreground reason.
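To make the proposed change concrete, here is a minimal sketch in plain Java with no Android or Glean dependencies. The class and method names are hypothetical, and "overdue" is simplified to "the last ping was sent before today", so the real scheduler logic in Glean may well differ.

```java
import java.time.LocalDate;

// Hypothetical model of the scheduler check; this is NOT the Glean SDK API.
final class MetricsPingGate {
    private LocalDate lastSentDate; // date the last metrics ping was submitted
    private int submitted = 0;      // number of pings submitted by this model

    MetricsPingGate(LocalDate lastSentDate) {
        this.lastSentDate = lastSentDate;
    }

    // Stand-in for process startup. With checkOnCreate == true this mirrors the
    // behavior described above: the overdue check runs even for background starts.
    void onCreate(LocalDate today, boolean checkOnCreate) {
        if (checkOnCreate) {
            submitIfOverdue(today);
        }
    }

    // Stand-in for the app becoming visible. The proposal moves the check here.
    void onStart(LocalDate today) {
        submitIfOverdue(today);
    }

    // Simplifying assumption: a ping is overdue if none was sent today.
    private void submitIfOverdue(LocalDate today) {
        if (lastSentDate.isBefore(today)) {
            submitted++;
            lastSentDate = today;
        }
    }

    int submittedCount() {
        return submitted;
    }
}
```

In this model, a background-only start (onCreate with no onStart) submits a ping only when the check runs in onCreate. That is the behavior the proposal wants to remove.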
This is a high-risk change: if the hypothesis isn't correct, we could lose too many metrics pings. I have looked at ways of adding telemetry to learn when the process is being triggered and sending the "unwanted" metrics pings, but have not found an Android API that would provide the "reason" a process started. At a minimum, we should find a way to roll this out to nightly and prevent it from going to beta / release until it's been verified (the recent change to provide the BuildInfo flag to Glean may be a way to release-flag this fix).
Comment 1•4 years ago
Hmm, this has some weird implications if we make this change on iOS: any notification that appears and is then cleared would trigger this. Let's test this out every way we can think of to ensure it's doing what we expect, but I'm willing to help try this approach!
Comment 2•4 years ago (Reporter)
It occurs to me that we also could send a metrics ping on start up only when glean.validation.foreground_count > 0. It would be a weird thing to store more state to control this, but it might give us exactly what we want.
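A minimal sketch of that gate, in plain Java. All names here are hypothetical, and the field only stands in for glean.validation.foreground_count. Resetting the count when a ping is submitted is an assumption of this sketch, not something confirmed in this bug.

```java
// Hypothetical sketch of gating the startup metrics ping on foreground activity.
final class ForegroundCountGate {
    // Stand-in for glean.validation.foreground_count (assumed to accumulate
    // until the next metrics ping is submitted).
    private int foregroundCount = 0;

    // Call whenever the app comes to the foreground.
    void recordForeground() {
        foregroundCount++;
    }

    // Startup check: submit only if the ping is overdue AND the user has
    // foregrounded the app since the last submitted ping.
    boolean shouldSubmitOnStartup(boolean overdue) {
        return overdue && foregroundCount > 0;
    }

    // Assumption: the count starts over once a ping has been submitted.
    void onPingSubmitted() {
        foregroundCount = 0;
    }
}
```

With this gate, a process that only ever starts in the background never satisfies the check, while a ping that follows real user activity is still sent on the next startup.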
Comment 3•4 years ago
(In reply to Michael Droettboom [:mdroettboom] from comment #2)
> It occurs to me that we also could send a metrics ping on start up only when glean.validation.foreground_count > 0. It would be a weird thing to store more state to control this, but it might give us exactly what we want.
Can we land some metrics to validate this? It should be fairly straightforward to instrument all the known paths (events?) and see what sequence of actions leads to the metrics ping being generated.
I was thinking of something like this:
- Event for onCreate
- Event for onStart
- Event for onPause
- Event for onStop
- Event for the metrics ping collection
To make sure all of them pertain to the same "sequence", we could generate a "flow id" the first time any of these events is generated for the current run and stick to it for as long as the application is open.
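The flow id idea could be sketched roughly like this in plain Java. The recorder class and the event names are hypothetical. A real version would record Glean events rather than strings in a list.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.UUID;

// Hypothetical recorder: tags every lifecycle event with one per-run flow id.
final class LifecycleFlowRecorder {
    private String flowId; // created on the first event, kept for the whole run
    private final List<String> events = new ArrayList<>();

    // Record an event such as "on_create", "on_start", "on_pause", "on_stop"
    // or "metrics_ping_collection" (names are illustrative).
    void record(String eventName) {
        if (flowId == null) {
            flowId = UUID.randomUUID().toString();
        }
        events.add(flowId + ":" + eventName);
    }

    String flowId() {
        return flowId;
    }

    List<String> events() {
        return events;
    }
}
```

Grouping the recorded events by flow id would then show, per run, whether a metrics ping collection happened with or without an on_start in the same sequence.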
What do you think?
Comment 4•4 years ago
I think seeing the event sequence instrumented as :Dexter mentions would be a good way to see what sort of effect this change might have before we make any big changes to when we send the metrics ping. I do kind of like the idea of using the foreground_count as a sort of sanity check, but adding instrumentation to see if this would correct things is pretty cheap. +1 to Dexter's suggestion from me.
Comment 5•4 years ago (Reporter)
I think we already know that onCreate (which sends a metrics ping) is called without onStart (because foreground_count isn't set) from the existing telemetry, so I'm not sure what the new telemetry would tell us that we don't already know.
What we don't know (though I can't find the API to find out) is what activity is triggering the onCreate that is never followed by an onStart. I guess since there isn't an API for that, we could instrument all of Fenix' activities (there are ~20 of them). I think an application-lifetime string list metric where each activity adds its name to the list would tell us which activities are triggered right before the empty metrics pings. It might at least help us narrow it down.
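As a rough sketch of that instrumentation in plain Java: the class below is hypothetical and only stands in for an application-lifetime string list metric. In Fenix, each activity would add its own name from its onCreate.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in for an application-lifetime string list metric.
final class StartedActivities {
    private final List<String> names = new ArrayList<>();

    // Each activity would call this from its own onCreate with its class name.
    void onActivityCreated(String activityName) {
        names.add(activityName);
    }

    // Read-only copy of the names recorded so far in this process,
    // in the order the activities were created.
    List<String> snapshot() {
        return Collections.unmodifiableList(new ArrayList<>(names));
    }
}
```

An empty list on a metrics ping would suggest that no instrumented activity was created before the ping was collected, which is itself a useful signal.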
Comment 6•4 years ago (Reporter)
Unfortunately, this is more complicated than it first appears: Application.onCreate (where Glean.initialize is called from) happens before any of the activity onCreate, so that would probably be too late for any startup metrics pings (source). This means (I think), they would end up on the following metrics ping which would complicate analysis, but at the end of the day may still give us the telemetry we need.
Comment 7•4 years ago
Hmm, I wonder if there is another background task that is causing the app to launch in the background, giving us an onCreate but no onStart?
Comment 8•4 years ago
(In reply to Michael Droettboom [:mdroettboom] from comment #6)
> Unfortunately, this is more complicated than it first appears: Application.onCreate (where Glean.initialize is called from) happens before any of the activity onCreate, so that would probably be too late for any startup metrics pings (source). This means (I think), they would end up on the following metrics ping which would complicate analysis, but at the end of the day may still give us the telemetry we need.
Yes, we first hit the application's onCreate, then the activities' onCreate. I agree that it will still give us some valuable intel :)
Comment 9•4 years ago
Travis, will you be able to push this forward? (I haven't read the full discussion here, so not 100% sure what's next)
Comment 10•4 years ago
I think the next step is to add some telemetry. Although, now that I re-read :mdroettboom's comment about what's doing an onCreate without an onStart... maybe it's something running in a background process, like the crash-reporter service? I'm pretty tied up with Nimbus, but I think I should have time to instrument something like this. What if we added these as events, hooked up to record when the lifecycle events happen? Since we have the context object within Glean, we should be able to get some information about the current process/thread/activity that might be useful in identifying what is at fault.
Does this sound like a reasonable plan? If so, I expect I could take something like this up by the end of this week or the beginning of next.
Comment 11•4 years ago
Just as a thought: Is this something we could instrument in Fenix directly? That would save us from the long glean impl -> glean release -> a-c release -> fenix release cycle, while still giving us the same data.
Otherwise that is a reasonable plan.
Comment 12•4 years ago (Reporter)
> we should be able to get some information about the current process/thread/activity that might be useful in identifying what is at fault.
I haven't been able to find an API to know what activity caused startup (doesn't mean it doesn't exist). If there isn't one, I think the best way is to instrument each of the ~dozen activities in Fenix itself and record an event from there. Then we should know which activity leads to onCreate without onStart.