Closed Bug 1626789 Opened 4 years ago Closed 1 year ago

Firefox becomes extremly unresponsive when doing local gecko build in parallel

Categories

(Core :: Performance, defect, P3)

x86_64
Windows 10
defect

Tracking

()

RESOLVED FIXED
Performance Impact low
Tracking Status
firefox113 --- fixed
firefox114 --- fixed
firefox115 --- fixed

People

(Reporter: saschanaz, Unassigned, NeedInfo)

References

Details

(Whiteboard: [wfh])

No other browsers or apps becomes this unresponsive. Opening a new tab takes several minutes, and even the entire browser freezes time to time. This makes my work-from-home hard, what could be the reasons?

Hi,

I will move this over to a component. If this is not the correct component please feel free to change it to an appropriate one.

Thanks for the report.

Component: General → Gecko Profiler
Product: Firefox → Core

(In reply to Kagami :saschanaz from comment #0)

No other browsers or apps becomes this unresponsive. Opening a new tab takes several minutes, and even the entire browser freezes time to time. This makes my work-from-home hard, what could be the reasons?

I don't know what's different with other browsers, but building Gecko requires a lot of memory and uses a lot of CPU with many processes. Your Firefox could be slow due to memory pressure, or due to inter process communication being slower than usual.

Component: Gecko Profiler → Performance

I don't know what's different with other browsers, but building Gecko requires a lot of memory and uses a lot of CPU with many processes. Your Firefox could be slow due to memory pressure, or due to inter process communication being slower than usual.

For example, Chrome (and the new Edge) just stutters, not becomes fully unresponsive as Firefox does. I mean the whole window is indicated as unresponsive by Windows unlike Chrome.

Hmm, are we perhaps doing too much I/O? For example because of session store. I wonder if profiler could tell where we're spending time.
Could you try to capture a profile https://profiler.firefox.com/

Flags: needinfo?(krosylight)
Priority: -- → P3

Tried: https://perfht.ml/2RiIEjx

Another longer one constantly fails to upload 🤔

Flags: needinfo?(krosylight)

A 853ms long event processing delay in the profile, and the gpu was doing mozilla::layers::LayerManagerComposite::ClearCachedResources. Could it be that the main thread was blocked by the GPU?

The main thread shows a long time spent writing minidumps, under ContentParent::KillHard. This happens when content processes get terminated if they time out during shutdown. If you switch to the "Stack height" mode in the profiler, you can see that most content processes have very few samples, which indicates that their threads didn't get any CPU time allotted to them. It really looks like only three processes (parent, GPU, and web content 2/10) were being regularly scheduled by the OS, and all the other processes were starved. Then it's no wonder that those processes take too long to respond to requests to shut down.
However, it's not good that writing out minidumps happens on the parent process main thread. When we're already in such a bad state, we don't want to make the situation worse.
I don't know why some content processes are completely starved of CPU resources and others are not. It's possible that other tools (e.g. ETW) could shed light on this, but I've never used them.

(In reply to Markus Stange [:mstange] from comment #7)

However, it's not good that writing out minidumps happens on the parent process main thread. When we're already in such a bad state, we don't want to make the situation worse.

We sometimes saw the same problem during startup on very slow hardware. See bug 1550754.

Hmm, why are we killing any processes here?
Kagami, were you perhaps closing some tabs when profiling.
I guess that could trigger this code path.

I wonder if changing https://searchfox.org/mozilla-central/rev/a707541ff423ade0d81cef6488e6ecfa09273886/modules/libpref/init/StaticPrefList.yaml#1919 makes any difference. Try some significantly longer time there. 60?

Kagami, were you perhaps closing some tabs when profiling.
I guess that could trigger this code path.

Yes, as closing tabs when busy frequently causes a freeze.

Try some significantly longer time there. 60?

Will try.

(In reply to Markus Stange [:mstange] from comment #7)

The main thread shows a long time spent writing minidumps, under ContentParent::KillHard. This happens when content processes get terminated if they time out during shutdown. If you switch to the "Stack height" mode in the profiler, you can see that most content processes have very few samples, which indicates that their threads didn't get any CPU time allotted to them. It really looks like only three processes (parent, GPU, and web content 2/10) were being regularly scheduled by the OS, and all the other processes were starved. Then it's no wonder that those processes take too long to respond to requests to shut down.
However, it's not good that writing out minidumps happens on the parent process main thread. When we're already in such a bad state, we don't want to make the situation worse.
I don't know why some content processes are completely starved of CPU resources and others are not. It's possible that other tools (e.g. ETW) could shed light on this, but I've never used them.

The "content process takes too long to spawn, so we kill it and make a minidump of it, and everything becomes even worse" sounds a lot like what I was seeing in bug 1623609.

Whiteboard: [wfh]

Try some significantly longer time there. 60?

Seems it's better but still freezing happens. This one is too big so here is the link: https://1drv.ms/u/s!ArDhQNmhsUj04cMQwwgAVMKfG1FOzw?e=Y23xay

I wonder if profiles would be a bit smaller without screenshots.

Whiteboard: [wfh] → [wfh][qf]

Hmm, various things waiting for something. Parent process waiting for OS accessibility API to return.

Bas, since this is Windows, perhaps you have ideas what else could be improved here.

Flags: needinfo?(bas)

(In reply to Olli Pettay [:smaug] from comment #13)

I wonder if profiles would be a bit smaller without screenshots.

Removing screenshots saved about 1MB, removing a few hidden idle tracks saved some more, and I could upload it: https://perfht.ml/3b8q7y1

(In reply to Olli Pettay [:smaug] from comment #14)

Hmm, various things waiting for something. Parent process waiting for OS accessibility API to return.

mozilla::a11y::AccessibleHandler::get_accChild(tagVARIANT, IDispatch**) is a very frequent hang on BHR data; I think it's the first time I see a profile of it: https://perfht.ml/2Vuq2Ox

Hey Morgan, is there any chance you have noticed this in the a11y code? Can you scope this profile and comment?

Flags: needinfo?(mreschenberg)

(In reply to Benjamin De Kosnik [:bdekoz] from comment #17)

Hey Morgan, is there any chance you have noticed this in the a11y code? Can you scope this profile and comment?

huh... I haven't seen this, but might be something Jamie's seen, he poked around a few performance issues recently. NI'ing him and will ping our channel too. thanks for the tag! Is the profile expired? it hangs at "downloading" for me :(

Flags: needinfo?(mreschenberg) → needinfo?(jteh)

I'm still not very good at matching up hangs between two processes using the profiler. :(

It looks like focus moves in the content process, so we send a focus event via IPC which then gets dispatched to Windows in the parent process. Some client (I think Windows Text Services Framework?) catches that event and tries to get the target accessible using IAccessible::get_accChild. That needs to make a cross-process call to content. And it ends up waiting quite a while.

My guess is that because building Firefox is so intensive and parallel, cross-process calls get a lot slower; it takes a lot longer for a process to get its turn to run. This is probably true for all IPC calls. However, because Windows a11y calls are synchronous, they end up blocking the parent process while they execute, so you notice it a lot more with a11y.

Was this with an installed Nightly or a local build? If a local build, does it happen as much with an installed Nightly?
Can you please provide the information from the accessibility section of about:support?

Flags: needinfo?(jteh)
Flags: needinfo?(krosylight)

It was on installed Nightly. about:support says:

Accessibility
Activated 	true
Prevent Accessibility 	0
Accessible Handler Used 	true
Accessibility Instantiator 	UNKNOWN|
Flags: needinfo?(krosylight)

The about:support information doesn't look all that informational; Jamie, can you advise on how we could get more information about why accessibility is enabled here? Is there an easy way to turn it off?
See also bug 1633543, where loading a big page causes a similar freeze in the parent process.

Flags: needinfo?(jteh)

(In reply to Markus Stange [:mstange] from comment #21)

The about:support information doesn't look all that informational; Jamie, can you advise on how we could get more information about why accessibility is enabled here?

Unfortunately, there are some clients which inject dlls into Firefox itself and we can't detect those without some tricky work with a debugger. Again, in bug 1633543, I see msctf (Microsoft Text Services Framework) in the stack.

:saschanaz, are you by any chance using an IME; e.g. for an East Asian language? Also, does your system have a touch screen?

Is there an easy way to turn it off?

You can disable a11y by going to Options and checking "Prevent accessibility services from accessing your browser". However, this is really a last resort and is not something we want to publicly recommend. This will probably move to about:support eventually.

See also bug 1633543, where loading a big page causes a similar freeze in the parent process.

While a11y is definitely what's blocking the parent process here, it's pretty worrying that the content process is taking that long (20+ seconds!) to respond. Certainly, that would impact other things; e.g. if you scroll outside of the area rendered for APZ, try to click a link, etc.

Flags: needinfo?(jteh) → needinfo?(krosylight)

:saschanaz, are you by any chance using an IME; e.g. for an East Asian language? Also, does your system have a touch screen?

I do use Korean and Japanese IME and also a touchscreen as I'm using Surface Pro 7.

Flags: needinfo?(krosylight)

If you disable a11y (by going to Options and checking "Prevent accessibility services from accessing your browser"), does that negatively impact your IME or touch screen in any way?

Flags: needinfo?(krosylight)

does that negatively impact your IME or touch screen in any way?

I guess I need more test but so far not any negative impact. (It does solve Bug 1633543.)

I think it still shouldn't be the default, right?

Flags: needinfo?(krosylight)
Whiteboard: [wfh][qf] → [wfh][qf:p3]

I think it's important to find out why accessibility is being used here. Aaron, can you help (or forward to somebody who can)?

We shouldn't be comfortable shipping a configuration that can freeze the UI this easily to more users than absolutely necessary.
Just today I saw a profile from another Dark Reader user who also had accessibility freezing their parent process.

Flags: needinfo?(aklotz)

Oh, the previous comment was written with bug 1633543 in mind, not so much about the "freeze while building" issue.

(In reply to Markus Stange [:mstange] from comment #26)

I think it's important to find out why accessibility is being used here. Aaron, can you help (or forward to somebody who can)?

That would be me. Aaron doesn't directly work on a11y any more.

This seems to be specific to certain systems and I don't have a system on which I can reproduce it. If I can get hold of such a system, I can debug this further. Markus, do you have any idea how I might go about getting hold of such a system?

That said, while I agree we shouldn't enable a11y when we don't need to, I think we should also be asking why the content process is taking so long to respond in these cases. That is not an a11y problem, and as I noted in comment 22, that's going to have other impacts as well even without a11y causing the parent process to freeze. To put it another way, the a11y problem here is a symptom of the content process taking forever to respond.

Flags: needinfo?(aklotz)

(In reply to Kagami :saschanaz from comment #25)

I guess I need more test but so far not any negative impact. (It does solve Bug 1633543.)

I think it still shouldn't be the default, right?

A11y needs to be enabled by default because there are users that depend on it. Preventing/disabling it is essentially a troubleshooting step. That said, the fact that you see no negative impact with a11y enabled suggests that you don't need it for your use case, and so it really shouldn't get enabled. The reason for my questions about impact is that if we can figure out why it gets enabled on systems like yours, we now know we can safely block this case.

I also have a patch locally which caches Accessible event targets in the parent process. That means a11y doesn't have to make cross-process calls to request information when responding to an event. I abandoned work on this because it didn't give me the perf benefits I hoped for. However, it might be interesting to see whether this helps here.

However, it might be interesting to see whether this helps here.

Is there anything I can help to get the patch landed?

BTW, Firefox still freezes without a11y, while other apps are fine: https://perfht.ml/3dzOaqv

Flags: needinfo?(jteh)
Depends on: 1637048

(In reply to Kagami :saschanaz from comment #31)

BTW, Firefox still freezes without a11y, while other apps are fine: https://perfht.ml/3dzOaqv

Interesting! This profile no longer shows the accessibility problem, but it still shows the problems from comment 7. I filed bug 1637048 about the issue.

Because this bug's Severity has not been changed from the default since it was filed, and it's Priority is P3 (Backlog,) indicating it has been triaged, the bug's Severity is being updated to S3 (normal.)

Severity: normal → S3

(In reply to Kagami :saschanaz from comment #31)

However, it might be interesting to see whether this helps here.

Is there anything I can help to get the patch landed?

It turns out that this patch wasn't actually achieving what it was intended to achieve, and fixing that is a pretty big lift with no guarantee it will even help here. :(

I still really want to figure out why a11y gets enabled on systems like yours and see if we can prevent that. I guess I'll talk to my manager about trying to get hold of such a machine.

Flags: needinfo?(jteh)

There is reason to believe that bug 1687535 will stop accessibility from being enabled on this machine.

Depends on: 1687535
Depends on: 1737192
Performance Impact: --- → P3
Whiteboard: [wfh][qf:p3] → [wfh]

This is resolved by Cache the World, which is enabled by default in Firefox 113.

Status: NEW → RESOLVED
Closed: 1 year ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.