Open Bug 1899169 Opened 1 year ago Updated 6 months ago

Quantify the performance impact of glean telemetry on Android

Categories

(Core :: Performance Engineering, task)

All
Android
task

Tracking

()

ASSIGNED

People

(Reporter: mstange, Assigned: acreskey)

References

(Blocks 1 open bug)

Details

See bug 1898810. Let's gather some performance measurements in this bug.

App link startup CPU time:

Using this startup_cpu.py script I gathered the amount of CPU time used during the first 10 seconds of startup when opening Firefox with an intent for cnn.com:

Telemetry off, Firefox Nightly (before bug 1898515): 25 seconds of CPU time during the first 10 seconds of startup+pageload
Telemetry on, Firefox Nightly (before bug 1898515): 35 seconds (40% increase)

Telemetry off, Firefox Release (enrolled in disable clone metric experiment): 21 seconds
Telemetry off, Firefox Release (enrolled in disable clone metric experiment): 29 seconds (40% increase)

Using the Fenix startup tests in CI we measured the impact of telemetry on startup.

This try push's base revision has telemetry disabled by default, and the new revision is the Fenix default, telemetry enabled.

High confidence results show that enabling telemetry delays startup:

  • Increases the mean and median of fenix cold_main_first_frame by 3.4% and 4.1% respectively.
  • Increases the mean and median of fenix cold_view_nav_start by 13.6% and 13.9% respectively.

Since Bug 1898515 and Bug 1898649 disable high-volume probes, we ran a second try push after they landed.
This reduced the impact of telemetry on cold_view_nav_start, but still delays startup. Current high confidence results show that telemetry:

  • Increases the mean and median of fenix cold_main_first_frame by 3.8% and 3.7% respectively.
  • Increases the mean and median of fenix cold_view_nav_start by 5.4% and 5.8% respectively.

With the browsertime tests in CI we also measured the impact of telemetry on Fenix pageload.

As in comment 2, this try push's base revision has telemetry disabled by default, and the new revision is the Fenix default, telemetry enabled.

In 11 sites it shows regressions to visual metrics, cpuTime, and fcp, generally in the 2-5% range, most medium-confidence. (And one improvement to ebay-kleinanzeigen cpuTime).

I'm re-running this compare with the high volume probes disabled, bug 1898515, and bug 1898649.

I've done some measurement of IO caused by writing out the glean DB. During 5 minutes of browsing I see glean writing out the 150-200KB DB enough times to get to about 27GB of IO.

To see how much of these writes actually make it to flash I looked at /proc/[pid]/io for the main process. Using a build with the deserialize metric on and leaving cnn.com open for a little while I saw 22GB of write_bytes and 21GB of cancelled_write_bytes which nets to about 1GB of physical writes. So even though 1GB is pretty bad, it looks like the page cache is keeping things from being really bad.

I also tested to scrolling on news.ycombinator.com with the turned on and was able to get about 400MB of physical writes by scrolling a lot.

Testing startup to cnn.com with telemetry turned on vs. off showed a substantial difference in IO writes but I didn't record the actual values.

Assignee: mstange.moz → acreskey

This try push measures the impact of glean on Fenix pageload with the high volume probes disabled, bug 1898515, and bug 1898649.

I identified 13 sites with medium to high confidence visual metric and CPU time regressions, including Amazon and Google (around 5% on each).

Visual metric regressions range up to 14% on the bellwhether Wikipedia test and up to 35% on Sina (less familiar with that test). Notable outliers include 2-3% CPU time improvements on Booking.com and Instagram.

Due to the stretched android device farm, remaining jobs may not complete, but we have enough data for high-confidence results.

In Bug 1892230 jan-erik batched the DB writes in memory.

This greatly reduces the performance cost of running glean on Fenix.

Startup
The mean and median impact of glean on fenix cold_view_nav_start has been reduced from 5.4% and 5.8% respectively to 1.4%
The mean and median impact of glean on fenix cold_main_first_frame has been reduced from 3.8% and 3.7% respectively to 3.6% and 2.8% (Without having profiled this, I believe the bottleneck here may be library initialization and libXul loading dependencies)

Perf compare try push compare here. .

Another way to view the impact of Bug 1892230 is that it improved fenix cold_view_nav_startup by about 6% (high-confidence), visible in this compare.

Pageload
The batching of DB writes in memory also greatly reduced the performance impact of glean on Fenix pageload.
Prior to this patch we can see 8 site tests with medium and high-confidence regressions when enabling glean.
With jan-erik's patch we see no regressions from enabling glean (default view). (Results may shift slightly as jobs are still running. Sites with 1 run are effectively noise).

If we dig in a little deeper with the "replicates" view (which compares all test values, rather than final results per job) we can see that there is still a measurable performance impact on pageload but it has been reduced.
Whereas in comment 5 we saw 13 sites with medium and high confidence regressions we are now down to 5, and of lower magnitude. (e.g. Wikipedia visual metrics regression down from 14% to about 3% with jobs still running.)

Prior to Jan-Erik's batched DB writes, I measured the impact of glean on Speedometer 3.
With bug 1898515, and bug 1898649 landed, glean did not measurable affect on the overall SP3 score. Try push.

Looking at subtests, there is a 40% hit to Editor-TipTap/Highlight/Async, which may end up invoking DB writes. But this needs to be re-run now that bug 1892230 has landed.

I've re-run a pageload comparison with the June 24th binary to quantify any remaining performance impact from having telemetry enabled in Fenix.

The baseline revision has telemetry disabled in Fenix, while the "New" revision (i.e. right hand side) has telemetry enabled.
Performance Compare

With the "replicates" view enabled I am seeing a series of performance impacts, medium confidence unless otherwise noted.

• allrecipes ContentfulSpeedIndex opt warm, 4.43% slower
• amazon-search largestContentfulPaint opt warm 2.32% slower
• cnn visual metrics, 3-4% slower
• cnn fnbpaint opt warm, 6.71% slower, high confidence
• ebay-kleinanzeigen-search ContentfulSpeedIndex opt warm, 3.85% high confidence
• imdb fcp opt warm, 2.95% slower
• sina visual metrics, up to 18% slower, but high noise site

I've added profiling jobs to sites with regressions. e.g. here[(https://treeherder.mozilla.org/jobs?repo=try&revision=c7b93803ca07b9b7d172d33fa2165f48d3bf61ab&selectedTaskRun=UInI-rGNTfiiVTxuMp5SHQ.0)

On ebay I can see 12ms of samples in the parent process spent on telemetry acculumation
https://share.firefox.dev/45T44c1

glean_core::metrics::timing_distribution::TimingDistributionMetric::accumulate_raw_duration
mozilla::glean::impl::DenominatorMetric::Add

I'm going to spin this out into a bug 1906664

You need to log in before you can comment on or make changes to this bug.