Open Bug 1546430 Opened 9 months ago Updated 2 months ago

Reduce Telemetry background CPU usage

Categories

(GeckoView :: General, enhancement, P2)

Unspecified
Android
enhancement

Tracking

(geckoview66 wontfix, firefox-esr60 wontfix, firefox66 wontfix, firefox67 wontfix, firefox67.0.1 wontfix, firefox68 fix-optional)

Tracking Status
geckoview66 --- wontfix
firefox-esr60 --- wontfix
firefox66 --- wontfix
firefox67 --- wontfix
firefox67.0.1 --- wontfix
firefox68 --- fix-optional

People

(Reporter: Agi, Unassigned)

References

(Depends on 1 open bug, Blocks 2 open bugs)

Details

Telemetry serializes state and writes to disk every 60 seconds, even when GeckoView is in the background. This is probably unnecessary and we can do better.

Doing CPU-intensive periodic tasks can wake up the CPU and be detrimental to battery life.

See also discussion in Bug 1539263.

Is Telemetry currently using WorkManager? https://developer.android.com/topic/libraries/architecture/workmanager

It is supposed to be a best practice on Android.

(In reply to Asif Youssuff from comment #1)

Is Telemetry currently using WorkManager? https://developer.android.com/topic/libraries/architecture/workmanager

It is supposed to be a best practice on Android.

There's two parts to this. One thing is Telemetry collected by GeckoView (which is, I think, the one we're talking about in this bug); this is gecko code and can't use WorkManager. The other part is whatever is getting the data from GeckoView, which is the Android app/library/embedder that can use the WorkManager. For Fenix, the latter is glean, which uses the WorkManager to do its things.

(In reply to Agi | :agi | ⏰ EST | he/him from comment #0)

Telemetry serializes state and writes to disk every 60 seconds, even when GeckoView is in the background. This is probably unnecessary and we can do better.

Yup! One option could be to:

  • Add a method to GeckoViewTelemetryController to toggle "background mode" on or off (assuming there's no way to know in Gecko other than this, of which I'm not sure :( ).
  • Change the serialization frequency to something longer than 60s.

We should also audit the telemetry code to make sure we're not attempting to write stuff even though there's nothing available for recording.

Doing CPU-intensive periodic tasks can wake up the CPU and be detrimental to battery life.

See also discussion in Bug 1539263.

Given :esawin comment here, did we measure the impact of the telemetry code recording data every 60s? Having data consistency seems still important, even in background.

Blocks: 1545266

Alessio, can we explore your proposed options?

With bug 1545266 we aim to introduce more structured handling of Android lifecycle events like app backgrounding.
I'm thinking of introducing lifecycle-aware handling of scheduled flushes like this.

One possible strategy that could work for this and bug 1536797 is as follows:

  1. Flush periodically when in foreground (current mechanic).
  2. Force-flush when app is getting backgrounded.
  3. No I/O when app is in background or at an increased timeout.

Assuming we can ensure that flushing in #2 is successful, data integrity should be preserved and we would have options to play nice with Android's power management.

All these states and timers should be controlled by GV events, that would give us the most flexibility for future improvements.

Do you think such mechanics would be feasible for telemetry, are there any risks in the interaction with Gecko?

Flags: needinfo?(alessio.placitelli)
Type: defect → enhancement
OS: All → Android
Priority: -- → P2
Whiteboard: [geckoview:fenix:p2]

(In reply to Eugen Sawin [:esawin] from comment #3)

Alessio, can we explore your proposed options?

Sure! See my comments below.

With bug 1545266 we aim to introduce more structured handling of Android lifecycle events like app backgrounding.

This would be very, very useful.

I'm thinking of introducing lifecycle-aware handling of scheduled flushes like this.

One possible strategy that could work for this and bug 1536797 is as follows:

  1. Flush periodically when in foreground (current mechanic).
  2. Force-flush when app is getting backgrounded.
  3. No I/O when app is in background or at an increased timeout.

I'm afraid we should go for "increased timeout", as we can't guarantee that nobody will attempt to record telemetry while in background (think of GC). For "how increased", that depends on two things:

  • how frequent can we be without being bad to Android power management?
  • how much data (being accumulated while in background) are we willing to sacrify?

Assuming we can ensure that flushing in #2 is successful, data integrity should be preserved and we would have options to play nice with Android's power management.

All these states and timers should be controlled by GV events, that would give us the most flexibility for future improvements.

Do you think such mechanics would be feasible for telemetry, are there any risks in the interaction with Gecko?

Yes, it would be feasible. I can imagine the Telemetry core to listen to some event like "GeckoViewGoingToBackground", then we internally increase the timeout for the I/O, and restore it back once we're back to foreground.

One other short-term option, if this is creating problems for the Fenix launch, is to completely disable the mechanism as Fenix is not currently fetching GeckoView telemetry.

I'm happy to have a vidyo chat next week if that helps!

Flags: needinfo?(alessio.placitelli)

It looks like we don't need to introduce a new signal, but could reuse "application-background" and "application-foreground" to notify GV telemetry of app lifecycles. The event is currently not fired in GV, but that would be an easy fix.

It also looks like we've already similar mechanics in place for Fennec in TelemetrySession.

For GV, we would ideally want the TelemetryController to flush to disk on "application-background" and discontinue the periodic flushing until "application-foreground" is observed or GV is restarted.

If possible, we want to prioritize a low CPU/IO profile over consistent telemetry reporting while in background. Essentially, a backgrounded GV app should be considered dead/inactive. We don't think that increasing the flushing timeout would realistically give us more consistent telemetry reports, given that GV would most likely get killed before the conclusion of the initial timeout (for timeouts > 60s).

Georg, would you agree with such GV telemetry lifecycle handling mechanics (also see comment 3)?:

Flags: needinfo?(gfritzsche)

(In reply to Eugen Sawin [:esawin] from comment #5)

It looks like we don't need to introduce a new signal, but could reuse "application-background" and "application-foreground" to notify GV telemetry of app lifecycles. The event is currently not fired in GV, but that would be an easy fix.

It also looks like we've already similar mechanics in place for Fennec in TelemetrySession.

For GV, we would ideally want the TelemetryController to flush to disk on "application-background" and discontinue the periodic flushing until "application-foreground" is observed or GV is restarted.

If possible, we want to prioritize a low CPU/IO profile over consistent telemetry reporting while in background. Essentially, a backgrounded GV app should be considered dead/inactive. We don't think that increasing the flushing timeout would realistically give us more consistent telemetry reports, given that GV would most likely get killed before the conclusion of the initial timeout (for timeouts > 60s).

Georg, would you agree with such GV telemetry lifecycle handling mechanics (also see comment 3)?:

So, going back to comment 3, let me recap to make sure that i got things right here:

  1. Flush periodically when in foreground (current mechanic).
  2. Force-flush when app is getting backgrounded.
  3. No I/O when app is in background or at an increased timeout.
  1. & 2. are clear requirements for any useful Telemetry from GV.

The question is what do about 3., when the app is backgrounded. A backgrounded app might get killed in a few seconds, a few minutes or in a few hours.
If we persist Telemetry data to disk in a regular interval in backgrounded apps, we get Telemetry data from GV in the background.
If we don't persist Telemetry data regularly in backgrounded, we will lose a proportion of the Telemetry data generated in the background (i.e. when the app was killed before it was foregrounded again).

Two questions here to help us make an informed decision:

  • Do we have data on app background lifecycles that would help here?
  • Who will analyze that data or depend on it? Would they be affected by the data quality change?

Some examples that i can think of of what kind of data may be affected, to make it more concrete:

  • GC_MS: "Time spent running JS GC (ms)." This collects many measurements and is presumably mostly interesting in aggregate. If we hit bad edge cases only in the background, that signal may be affected?
  • APPLICATION_REPUTATION_SERVER_VERDICT_2: "Application reputation remote verdict, keyed by file extension." If we hit certain file extension only on some background scenarios, do we care about possibly not seeing them?
  • URLCLASSIFIER_COMPLETE_TIMEOUT2 "This metric is recorded every time a gethash lookup is performed, true is recorded if the lookup times out. Keyed by provider." If this only happens when backgrounded, we're less likely to get that signal.
Flags: needinfo?(gfritzsche)

Also, is this bug a GV 68 blocker?
If yes, maybe we should approach this in two phases - unblocking GV first, then follow up with the proper solution.

(In reply to Georg Fritzsche [:gfritzsche] from comment #7)

Also, is this bug a GV 68 blocker?
If yes, maybe we should approach this in two phases - unblocking GV first, then follow up with the proper solution.

Nope! This bug is not a GV 68 blocker if it is just an enhancement.

Depends on: 1551184

We saw this appear in some cold page load profiles (the browser must have been waiting for 60+ seconds).

Simpleperf profile (all threads captured, looking at TelemetryGVIO)
https://perfht.ml/2YuiECx

Gecko profile, capturing the TelemetryGVIO thread
https://perfht.ml/2K5Vkqu

From the simpleperf profile it looks like there are many small writes to the disk.

Perhaps it would be possible to serialize the telemetry json to memory and then write that out, all at once?

Rank: 25
Whiteboard: [geckoview:fenix:p2]
You need to log in before you can comment on or make changes to this bug.