Open Bug 1933102 Opened 3 months ago Updated 7 days ago

Stuck WDBA holds profile lock, delays startup by 5s

Categories

(Firefox :: Installer, defect, P3)

Firefox 132
x86_64
Windows 10
defect

Tracking

()

Performance Impact low

People

(Reporter: gcp, Unassigned)

References

Details

(Keywords: perf:startup, Whiteboard: [fidedi])

Attachments

(9 files)

I have a Firefox 132.0.2 release install on an AMD 5800X, Windows 10, 32G RAM RTX3050, SSD with plenty of space. There are no addons installed, browser is up to date.

Launching Firefox from the taskbar takes over 5 seconds.

This seems extremely slow: on the same machine Chrome launches in much less than a second. I checked and even Firefox Nightly launches instantly on the same machine.

OS: Unspecified → Windows 10
Hardware: Unspecified → x86_64
Version: unspecified → Firefox 132
Summary: >5s launcht times on a modern system with SSD → >5s launch times on a modern system with SSD
Attached file aboutsupport.txt

Can you capture a startup profile as outlined here: https://profiler.firefox.com/docs/#/./guide-startup-shutdown

This seems to be

PR_Sleep(unsigned int) [nsprpub/pr/src/threads/prcthr.c]
nsRemoteService::LockStartup() [toolkit/components/remote/nsRemoteService.cpp]
XREMain::XRE_mainStartup(bool*) [toolkit/xre/nsAppRunner.cpp]

We did related changes in 133 (https://bugzilla.mozilla.org/show_bug.cgi?id=1915216), but not 132, confusingly?

Flags: needinfo?(dtownsend)

(In reply to Gian-Carlo Pascutto [:gcp] from comment #4)

This seems to be

PR_Sleep(unsigned int) [nsprpub/pr/src/threads/prcthr.c]
nsRemoteService::LockStartup() [toolkit/components/remote/nsRemoteService.cpp]
XREMain::XRE_mainStartup(bool*) [toolkit/xre/nsAppRunner.cpp]

We did related changes in 133 (https://bugzilla.mozilla.org/show_bug.cgi?id=1915216), but not 132, confusingly?

Even those changes shouldn't have really changed the way we do things on startup. The question is why the attempt to acquire a startup lock is failing. Is there anything interesting about the filesystem your temporary directory is stored on?

Flags: needinfo?(dtownsend) → needinfo?(gpascutto)

It's a normal Windows install, so NTFS. As said, Nightly on the same machine launches instantly.

Flags: needinfo?(gpascutto)

Can you try running Process Monitor and see what Firefox is doing during those 5 seconds?

Flags: needinfo?(gpascutto)

Bug 1933261 sounds somewhat similar.

If I drag the "nothing happens" period in the profiler, it looks to me like it's exactly 5 seconds, which is https://searchfox.org/mozilla-central/source/toolkit/components/remote/nsRemoteService.cpp#35

So I don't think this is computer performance related. That lock isn't getting cleaned up or something?

Flags: needinfo?(gpascutto)

Task manager indicates there are 3 Firefox processes hanging around. They don't stop a new Firefox from launching, but I'm guessing they are stopping it from getting that lock. So that are likely at least 2 bugs here: a sort of shutdown hang where we fail to terminate the process, and old processes affecting this lock. but not the actual Firefox.

(In reply to Gian-Carlo Pascutto [:gcp] from comment #10)

Task manager indicates there are 3 Firefox processes hanging around. They don't stop a new Firefox from launching, but I'm guessing they are stopping it from getting that lock. So that are likely at least 2 bugs here: a sort of shutdown hang where we fail to terminate the process, and old processes affecting this lock. but not the actual Firefox.

We only hold this lock during startup so old processes that haven't shutdown shouldn't impact it. Process monitor could at least tell us what Windows is returning from the CreateFileW call.

Attached image procmonhang.png

I investigated a bit more: two of the Firefox.exe processes are the background update agent, so I suppose that's normal? The other is a content process that seems stuck in a Windows DLL init (interestingly, this is an issue we've seen before in telemetry/crashes). I got WinDbg set up here and got a stack for that.

However, killing the stuck content process didn't resolve the issue, and process monitor says Firefox still hangs on the parent.lock on startup. So the background update agent is what is not freeing the lock correctly?

Attached image windbghang1.png
Attached image windbghang2.png
Attached image windbghang3.png

(In reply to Gian-Carlo Pascutto [:gcp] from comment #13)

I investigated a bit more: two of the Firefox.exe processes are the background update agent, so I suppose that's normal?

I don't think that is normal. Nick?

Flags: needinfo?(nalexander)

(In reply to Dave Townsend [:mossop] from comment #17)

(In reply to Gian-Carlo Pascutto [:gcp] from comment #13)

I investigated a bit more: two of the Firefox.exe processes are the background update agent, so I suppose that's normal?

I don't think that is normal. Nick?

Generally, that's not normal; I'd love to get some state on those tasks. There is a window in which background update tasks take the profile lock on the default profile, but it should be short (absent bad things happening) -- see https://searchfox.org/mozilla-central/rev/dfaf02d68a7cb018b6cad7e189f450352e2cde04/toolkit/mozapps/update/BackgroundTask_backgroundupdate.sys.mjs#253-281.

Oh -- looking at those tasks, I see the defaultagent -- i.e., the WDBA -- and not the background update task. That's possibly even stranger, because the default agent doesn't interact with browsing profiles at all. And it really should never be running twice concurrently.

nrishel: any thoughts?

Flags: needinfo?(nalexander) → needinfo?(nrishel)

Is it possible to reproduce the apparent(?) default agent hang by manually re-running it in the Task Scheduler?

At a glance I would be surprised if the default agent-specific code is the issue here, but if there's some startup resource that isn't being freed then perhaps there's an issue with it's profile? I'm not intimately familiar with those systems so this is just supposition.

Flags: needinfo?(nrishel)
Attached image taskscheduler.png

This is a screenshot of task manager. This machine has Nightly and Release installed. Are there entries here that are not supposed to be there?

Summary: >5s launch times on a modern system with SSD → Stuck WDBA holds profile lock, delays startup by 5s
Attached image proctree.png

If I interpret this correctly, this WBDA task invokes another firefox.exe with the same arguments (is this the launcher process or smth), which then somewhere along the way spawns a socket process that hangs. This hung socket process is never terminated and holds the profile lock alive? Is this a reasonable theory?

Attached file hang-stacks.txt
Breaking into the hanging process doesn't seem to reveal much.
Attached image stackof2.png

Stack of the second firefox.exe in the call tree.

spawns a socket process that hangs

Or maybe it's the main process that is just not doing anything, and the socket process is just something that accidentally gets spawned by default on startup?

(In reply to Gian-Carlo Pascutto [:gcp] from comment #23)

Created attachment 9440473 [details]
stackof2.png

Stack of the second firefox.exe in the call tree.

That seems to be stuck trying to show a modal window stemming from here suggesting that there was some error reading the proxy autoconfig file. That is indeed in the block that the startup lock is held alive for so that is probably the culprit.

If I understand the code and bugs correctly, that code path shouldn't even be hit in release, only on ESR, which could explain why Firefox itself doesn't pop up the error. The warning in the sandboxed path points to a section of our website where I couldn't find the relevant explanation.

I'm not sure if the modal dialog is expected to work in case it's being invoked from Task Scheduler.

Is there some autoconfig file that is messing up things? I'll investigate this part.

I'm not sure what to look for. Network settings appear the same (use system proxy settings) between Nightly and the Release build.

(In reply to Gian-Carlo Pascutto [:gcp] from comment #26)

If I understand the code and bugs correctly, that code path shouldn't even be hit in release, only on ESR, which could explain why Firefox itself doesn't pop up the error. The warning in the sandboxed path points to a section of our website where I couldn't find the relevant explanation.

Looks like it would be hit on anything that isn't "release" or "beta" so that would mean nightly or esr at least.

(In reply to Gian-Carlo Pascutto [:gcp] from comment #27)

I'm not sure what to look for. Network settings appear the same (use system proxy settings) between Nightly and the Release build.

We only hit this codepath if there is a value for the preference general.config.filename. But the default agent uses its own profile so presumably the pref would be set there. The background task profiles are stored in AppData\Roaming\Mozilla\Firefox\Background Tasks Profiles.

Though I think it is also possible to force the use of an autoconfig file via an enterprise policy: https://mozilla.github.io/policy-templates/#proxy

Though I think it is also possible to force the use of an autoconfig file via an enterprise policy: https://mozilla.github.io/policy-templates/#proxy

Can I check this somehow, i.e. whether this is active? I assume I want to find this config file and understand why we're breaking on it (which then causes the hang because we can't pop up the error).

There's a Telemetry.FailedProfileLocks.txt file in those profiles, but I guess that's not unexpected at this point.

(In reply to Gian-Carlo Pascutto [:gcp] from comment #29)

Though I think it is also possible to force the use of an autoconfig file via an enterprise policy: https://mozilla.github.io/policy-templates/#proxy

Can I check this somehow, i.e. whether this is active?

Mike can answer this better than me

Flags: needinfo?(mozilla)

You can go to about:policies and see if one is active.

Flags: needinfo?(mozilla)

The Performance Impact Calculator has determined this bug's performance impact to be low. If you'd like to request re-triage, you can reset the Performance Impact flag to "?" or needinfo the triage sheriff.

Impact on browser: Causes noticeable startup delay
Configuration: Rare

Moving to Fx:General, not sure which component to pick here.

Performance Impact: --- → low
Component: Performance → General
Keywords: perf:startup
Product: Core → Firefox
Component: General → Installer

The severity field is not set for this bug.
:mpohle, could you have a look please?

For more information, please visit BugBot documentation.

Flags: needinfo?(mpohle)

Hi Nicholas! With the mirrored preferences to the background task: Are these preferences stored in a file which could be used to debug mismatches in general.config.filename (see comment 28) between Firefox and its background task?

Flags: needinfo?(mpohle) → needinfo?(nrishel)

Not likely. IIRC mirrored prefs for WDBA are a small set of telemetry-related prefs mirrored to the Windows registry.

Flags: needinfo?(nrishel)
Severity: -- → S3
Priority: -- → P3
Whiteboard: [fidedi]
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: