Open Bug 1892234 Opened 6 months ago Updated 16 days ago

org.mozilla.experiments.nimbus.Nimbus$applyLocalExperiments$2$1.invoke blocks startup for 110ms while it's loading libxul

Categories

(Fenix :: Experimentation and Telemetry, defect, P2)

All
Android
defect

Tracking

(Performance Impact:high, firefox125 wontfix, firefox126 wontfix, firefox127 fix-optional)

Performance Impact high
Tracking Status
firefox125 --- wontfix
firefox126 --- wontfix
firefox127 --- fix-optional

People

(Reporter: jrmuizel, Unassigned)

References

(Blocks 4 open bugs)

Details

(Keywords: regression, regressionwindow-wanted)

https://share.firefox.dev/3U3VWzN

We block the main thread on nimbus initialization here:
https://github.com/mozilla/application-services/blob/8df35e4c85df62b2cbdbde371e02f7093c4125bb/components/nimbus/android/src/main/java/org/mozilla/experiments/nimbus/NimbusBuilder.kt#L135

During nimbus init we call:
org.mozilla.experiments.nimbus.GleanMetrics.NimbusHealth$applyPendingExperimentsTime$2.invoke

which calls:
mozilla.telemetry.glean.private.TimingDistributionMetricType.<init>

which calls:
com.sun.jna.NativeLibrary.loadLibrary

which loads libxul.

Flags: needinfo?(jrediger)

Nimbus records some metrics very early on.
There's no way to avoid loading libxul for that, because that's where the implementation of Glean actually is.

So the only way to avoid this during startup of Nimbus is to do the measurements in-memory and submit them to Glean at a later point.
Right now Nimbus uses measure, a convenience function, to do the right thing. It would need to measure the time itself and then later use accumulateSingleSample to record that measurement.

(edit 2024-05-06: woops, I linked to Swift code; fixed that by linking to the change for Kotlin, it's about the same)

Flags: needinfo?(jrediger)

From looking at the profile, nimbus isn't the only offender here. If we eliminate the problem there Fenix is likely going to run into it soon after. It might need a more general solution.

Blocks: perf-android
Keywords: regression

This bug has been marked as a regression. Setting status flag for Nightly to affected.

Roger, could you prioritize this bug, thanks

Flags: needinfo?(royang)
Severity: -- → S2
Flags: needinfo?(royang)

I've set the severity. Unfortunately, I don't set the priority.
Hi Chris, can you please help? Thanks

Flags: needinfo?(cpeterson)
Performance Impact: --- → high

Setting priority to P2 since this bug affects app startup performance.

We block the main thread on nimbus initialization here:
https://github.com/mozilla/application-services/blob/8df35e4c85df62b2cbdbde371e02f7093c4125bb/components/nimbus/android/src/main/java/org/mozilla/experiments/nimbus/NimbusBuilder.kt#L135

Is this bug really a regression? git blame for that runBlocking code points to this Nimbus change from 2022: https://github.com/mozilla/application-services/pull/5239

Flags: needinfo?(cpeterson)
Priority: -- → P2

(In reply to Chris Peterson [:cpeterson] from comment #6)

Setting priority to P2 since this bug affects app startup performance.

We block the main thread on nimbus initialization here:
https://github.com/mozilla/application-services/blob/8df35e4c85df62b2cbdbde371e02f7093c4125bb/components/nimbus/android/src/main/java/org/mozilla/experiments/nimbus/NimbusBuilder.kt#L135

Is this bug really a regression? git blame for that runBlocking code points to this Nimbus change from 2022: https://github.com/mozilla/application-services/pull/5239

It might not be a recent regression but we certainly haven't always accidentally loaded libxul before we wanted to.

We discussed the situation with Jan-Erik off-line. As mentioned in comment 1, the change done on the Glean side was a fundamental change that by the requirements only a single instance of Glean is used to record both the embedders and the libraries side. In order to achieve this, some parts of the glean code is embedded inside gecko library (see /toolkit/components/glean). As Nimbus is recording metrics very early it would require the glean code in GV and as a result it would need to load the xul library.

In addition to Nimbus' loading the library, Fenix is also loading xul library at the start up as well. It does this while creating an engine here (which creates GeckoRuntime and launches the GeckoThread that loads native libraries). This happens after Nimbus' loading it so it does not take much time to reload because of this code iiuc.

If we want to get rid of xul library loading at the startup, we will need to redesign calls that requires Glean. See Edit below and also the creation and launching of GeckoThread on the main process so that it does not load the native libraries at the start, if possible and not a breaking change.

As a next step, I'll discuss this situation with Nimbus team and get their opinion as well.

Edit:
The ideal scenario we want to achieve is that GeckoThread dealing with loading the xul library as it does not block the UI thread.

Here's the writeup I promised with some options, so we're all on the same page on what is going on and what we can do to mitigate the libxul loading:

Fenix - Glean libxul loading at startup

I disabled the nimbus metrics locally and it looks like that moves libxul loading to glean init inside of FenixApplication.initialize. So removing the nimbus metrics alone won't help the situation without a solution to avoid glean init loading libxul or postponing glean init until after the Gecko thread has loaded libxul.

Blocks: 1900106

Here's a profile that shows dlopen being caused by calling gleanEnableLogging during Glean init: https://share.firefox.dev/3V8lFaX. Maybe we can avoid doing that?

(In reply to Jeff Muizelaar [:jrmuizel] from comment #11)

Here's a profile that shows dlopen being caused by calling gleanEnableLogging during Glean init: https://share.firefox.dev/3V8lFaX. Maybe we can avoid doing that?

We can delay that until Glean is initialized.

Depends on: 1900308

If I comment out gleanEnableLogging the next use is in applyServerKnobsConfig: https://share.firefox.dev/4bzjMel

Commenting that out moves it down the line to: mozilla.components.lib.crash.service.GleanCrashReporterService.<init>: https://share.firefox.dev/456FNP4

I've opened a PR to adjust Nimbus' use of Glean to move it to a coroutineScope — not sure how to perf test to validate it helps, but wanted to share https://github.com/mozilla/application-services/pull/6255

(In reply to Charlie Humphreys [:charlie] [:jeddai] from comment #15)

I've opened a PR to adjust Nimbus' use of Glean to move it to a coroutineScope — not sure how to perf test to validate it helps, but wanted to share https://github.com/mozilla/application-services/pull/6255

If you can make a patch against mozilla-central, then we can push that try and compare with perfCompare.

Blocks: 1894804

Jan-Erik do you have an update on this issue and an estimate of when it might get resolved?

Flags: needinfo?(jrediger)

Coming back to this now.
I used Charlie's patch as a starting point and hackily added a forced-crash if libxul is loaded by Glean code before Glean gets initialized.
That helped track down further some of the code.

As is Charlie's patch won't help, because it only pushes off things to a thread, but still runs. I hackily added a way to pause that thread and only unpause it after Glean is initialized.
So that would be viable.

I then ran into one more issue: the crash ping potentially is also triggered early and I tried to find a solution. But later then I realized maybe I can punt on that for now; this only happens after a previous crash, so hopefully that's not the usual case anyway.

I'll clean up my code and put it up.

Flags: needinfo?(jrediger)

Finally getting around to actually post this.
My test stack is a bit more complicated:

This currently results in a build that does not crash on my test device on startup. And for all I know libxul is loaded by Gecko code, not by Glean.
And the thread registration works, which is a good indicator that libxul state is set up correctly.

Note that the Glean & UniFFI changes are for testing, for forcing a crash. I wouldn't want to land this, because loading libxul too early is still better than just crashing the app in production of course.

Crashing in release seems like a good way to avoid having this regress again. Could you use Library.OPTION_OPEN_FLAGS with RTLD_NOLOAD to ensure that UniFFI is never loading libxul?

Flags: needinfo?(jrediger)

(In reply to Jeff Muizelaar [:jrmuizel] from comment #21)

Crashing in release seems like a good way to avoid having this regress again. Could you use Library.OPTION_OPEN_FLAGS with RTLD_NOLOAD to ensure that UniFFI is never loading libxul?

Even if we land those fixes as is (and I won't, it's a hack right now) there's still the crash ping that's potentially calling Glean early and thus triggering a load of libxul.
We certainly don't want to crash Fenix at startup while trying to report the crash from the session before it.

That's why the above test is not a viable solution really right now.
I might need to do this differently and make the Glean APIs avoid the call into glean-core if its not yet initialized.

Flags: needinfo?(jrediger)

One more reason we can't just unconditionally add RTLD_NOLOAD: Glean tests run independently. In that mode we of course need to load the library, there is no Gecko around doing that for us.

After some additional work and testing I have a solution that is addressing the current too-early-loading issue, by delaying certain metrics triggered by nimbus until after Glean is initialized.

I have a Glean PR up for that at https://github.com/mozilla/glean/pull/2942
It requires minimal changes for Nimbus, see https://github.com/mozilla/application-services/pull/6364
I have tested that in Fenix by adding additional as above that forces a crash if Glean tries to load the library too early.
With these patches no crash happens.

There's one more potential early-Glean-loading trigger: The GleanCrashReporterService when it triggers a Glean crash ping. However this only happens if it detects a crash from a previous run (which gets written to a glean_crash_counts file).
That's hopefully not the standard case, so we can address that later.

Note that the above patches explicitly only implement this for the very specific usecase as currently done by Nimbus.
It's a first step, though it could regress when things in Nimbus change or things in Fenix startup change.
I'd be willing to land this soon (we're currently blocked on a pending Glean update, should be done this week though), so we can test that out properly in Nightly (and if necessary have time to roll it back)

You need to log in before you can comment on or make changes to this bug.