Open Bug 1898810 Opened 4 months ago Updated 1 month ago

[meta] Glean telemetry is prohibitively expensive on Android

Categories

(Data Platform and Tools :: Glean: SDK, task, P3)

All
Android
task

Tracking

(Not tracked)

People

(Reporter: mstange, Unassigned)

References

(Depends on 2 open bugs, Blocks 1 open bug)

Details

(Keywords: meta)

Attachments

(2 obsolete files)

While investigating bug 1892230, it became clear that the Glean Android performance problem is not limited to startup.

Any update to a Glean telemetry metric comes with a large CPU time cost and a large disk write.

This happens for two reasons:

  • Telemetry metric updates aren’t batched on Android: each update results in a write to the Glean database.
  • The Glean database uses rkv in “safe” mode, which might as well be called “destroy perf” mode. No matter how small the update is, the entire database is serialized to disk on every update.
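
To illustrate how these two factors multiply, here is a minimal toy model. It is not Glean's or rkv's actual code: `WholeFileStore`, `record_unbatched`, `record_batched` and the batch size are all hypothetical names and values chosen for the sketch, and JSON stands in for whatever serialization format the real database uses.

```python
import json


class WholeFileStore:
    """Toy store that re-serializes ALL of its data on every flush.

    Hypothetical stand-in for a "serialize the whole database" backend;
    it only counts flushes and bytes instead of touching the disk.
    """

    def __init__(self):
        self.data = {}
        self.flushes = 0
        self.bytes_written = 0

    def flush(self):
        blob = json.dumps(self.data).encode("utf-8")
        self.flushes += 1
        self.bytes_written += len(blob)


def record_unbatched(store, updates):
    """One flush per metric update (the behavior described above)."""
    for key, value in updates:
        store.data[key] = store.data.get(key, 0) + value
        store.flush()


def record_batched(store, updates, batch_size):
    """Flush only once per `batch_size` updates, plus a final flush."""
    pending = 0
    for key, value in updates:
        store.data[key] = store.data.get(key, 0) + value
        pending += 1
        if pending == batch_size:
            store.flush()
            pending = 0
    if pending:
        store.flush()


# 1000 counter increments spread over 50 hypothetical metrics.
updates = [(f"metric_{i % 50}", 1) for i in range(1000)]

unbatched = WholeFileStore()
record_unbatched(unbatched, updates)

batched = WholeFileStore()
record_batched(batched, updates, batch_size=100)

print("flushes:", unbatched.flushes, "vs", batched.flushes)
print("bytes:  ", unbatched.bytes_written, "vs", batched.bytes_written)
```

Both stores end up with identical data, but the unbatched one serializes the whole dictionary once per update. The model ignores durability: a batched store can lose the updates that were pending at the time of a crash.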

This means that device disks are getting hammered with writes on all occasions: page load, scrolling, video playback, startup. For example, we have observed hundreds of writes per second during scrolling (before bug 1898515 turned off the metric that was causing most of these updates) and ~32000 writes while loading cnn.com. On the Pixel 6 we were testing on, each write took around 0.5 ms of CPU time, so that comes out to roughly 16 seconds of extra CPU time for loading cnn.com.
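
As a back-of-the-envelope check of the cnn.com estimate, the two rounded figures quoted above can be multiplied directly. Both inputs are approximate measurements from one device, so the result is only an order-of-magnitude figure; the variable names are made up for this sketch.

```python
# Rounded figures from the measurements described above.
writes_during_page_load = 32_000   # approximate write count for cnn.com
cpu_ms_per_write = 0.5             # approximate CPU time per write (Pixel 6)

total_cpu_s = writes_during_page_load * cpu_ms_per_write / 1000
print(total_cpu_s)  # prints 16.0
```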

Furthermore, the database serialization allocates and then frees a lot of memory, which causes contention on the malloc lock and slows down allocation in other threads.

The database serialization and write happen on a background thread in the parent process. This thread is not registered with the profiler, so the performance impact is not immediately visible when capturing profiles with the Gecko profiler.

Any time we add a metric or instrument a new code path with telemetry, we need to be aware of two things:

  1. On Android, the cost of the metric update is a lot more than an increment of a number in a histogram.
  2. As more metrics get added, the database grows, and all existing metric collections become more expensive.
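
The second point can be sketched with a small model in which every update rewrites the whole store. This is an assumption-laden illustration, not a measurement: `bytes_per_update` is a hypothetical helper, JSON stands in for the real on-disk format, and the metric counts are arbitrary.

```python
import json


def bytes_per_update(metric_count):
    """Bytes serialized for ONE counter increment when the whole store,
    holding `metric_count` metrics, is rewritten on every update."""
    db = {f"metric_{i}": 0 for i in range(metric_count)}
    db["metric_0"] += 1  # the single update being recorded
    return len(json.dumps(db).encode("utf-8"))


small_db = bytes_per_update(100)
large_db = bytes_per_update(1000)

# The update is the same size in both cases; only the store grew.
print(small_db, large_db)
```

In this model the cost of an individual update scales with the total size of the database rather than with the size of the update, which is why adding unrelated metrics makes every existing metric more expensive to record.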

We are still in the process of quantifying the impact of this problem. We are also investigating how we haven’t discovered the scope of the problem sooner.

Component: Telemetry → Glean: SDK
Product: Toolkit → Data Platform and Tools

> We are still in the process of quantifying the impact of this problem. We are also investigating how we haven’t discovered the scope of the problem sooner.

Probably because this behavior is exceptionally problematic on often-recorded distributions and we recently migrated a bunch of legacy probes to use Glean directly.

(In reply to Markus Stange [:mstange] from comment #0)

> We are still in the process of quantifying the impact of this problem. We are also investigating how we haven’t discovered the scope of the problem sooner.

Hey :mstange, is there a bug for the investigation? Would you kindly link it to this bugtree?

Flags: needinfo?(mstange.moz)
Depends on: 1899169

(In reply to Alessio Placitelli [:Dexter] from comment #2)

> Hey :mstange, is there a bug for the investigation? Would you kindly link it to this bugtree?

I've filed bug 1899169 for the "quantify impact" part, and I've started this document (currently empty) on the "how should we have noticed it" part.

Flags: needinfo?(mstange.moz)

(In reply to Markus Stange [:mstange] from comment #3)

> (In reply to Alessio Placitelli [:Dexter] from comment #2)
>
> > Hey :mstange, is there a bug for the investigation? Would you kindly link it to this bugtree?
>
> I've filed bug 1899169 for the "quantify impact" part, and I've started this document (currently empty) on the "how should we have noticed it" part.

Shouldn't the second be a bug too, linked to this bugtree? Would definitely help with discoverability and tracking progress on that. I'm happy to file one if you tell me where to!

Flags: needinfo?(mstange.moz)
See Also: → 1899995

> Shouldn't the second be a bug too, linked to this bugtree? Would definitely help with discoverability and tracking progress on that. I'm happy to file one if you tell me where to!

Filed bug 1899995

Flags: needinfo?(mstange.moz)
Assignee: nobody → jrediger
Status: NEW → ASSIGNED
Assignee: jrediger → nobody
Status: ASSIGNED → NEW

Comment on attachment 9404989 [details]
Bug 1898810 - Update to Glean v60.1.1 r?TravisLong!

Revision D212253 was moved to bug 1892230. Setting attachment 9404989 [details] to obsolete.

Attachment #9404989 - Attachment is obsolete: true

Comment on attachment 9404990 [details]
Bug 1898810 - Fenix: Test delayPingLifetimeIo with a local Glean r?TravisLong!

Revision D212023 was moved to bug 1892230. Setting attachment 9404990 [details] to obsolete.

Attachment #9404990 - Attachment is obsolete: true
Depends on: 1906664
Priority: -- → P3