1626789 - Firefox becomes extremly unresponsive when doing local gecko build in parallel

Kagami Rosylight [:saschanaz] (they/them)

Reporter

Description

•

5 years ago

No other browsers or apps becomes this unresponsive. Opening a new tab takes several minutes, and even the entire browser freezes time to time. This makes my work-from-home hard, what could be the reasons?

Alin Ilea, Desktop Test Engineering [pto until 01.09]

Comment 1

•

5 years ago

Hi,

I will move this over to a component. If this is not the correct component please feel free to change it to an appropriate one.

Thanks for the report.

Component: General → Gecko Profiler

Product: Firefox → Core

Florian Quèze [:florian]

Comment 2

•

5 years ago

(In reply to Kagami :saschanaz from comment #0)

No other browsers or apps becomes this unresponsive. Opening a new tab takes several minutes, and even the entire browser freezes time to time. This makes my work-from-home hard, what could be the reasons?

I don't know what's different with other browsers, but building Gecko requires a lot of memory and uses a lot of CPU with many processes. Your Firefox could be slow due to memory pressure, or due to inter process communication being slower than usual.

Component: Gecko Profiler → Performance

Kagami Rosylight [:saschanaz] (they/them)

Reporter

Comment 3

•

5 years ago

•

Edited

I don't know what's different with other browsers, but building Gecko requires a lot of memory and uses a lot of CPU with many processes. Your Firefox could be slow due to memory pressure, or due to inter process communication being slower than usual.

For example, Chrome (and the new Edge) just stutters, not becomes fully unresponsive as Firefox does. I mean the whole window is indicated as unresponsive by Windows unlike Chrome.

Olli Pettay [:smaug][bugs@pettay.fi]

Comment 4

•

5 years ago

Hmm, are we perhaps doing too much I/O? For example because of session store. I wonder if profiler could tell where we're spending time.
Could you try to capture a profile https://profiler.firefox.com/

Olli Pettay [:smaug][bugs@pettay.fi]

Updated

•

5 years ago

Flags: needinfo?(krosylight)

Priority: -- → P3

Kagami Rosylight [:saschanaz] (they/them)

Reporter

Comment 5

•

5 years ago

Tried: https://perfht.ml/2RiIEjx

Another longer one constantly fails to upload 🤔

Flags: needinfo?(krosylight)

Sean Feng [:sefeng211]

Comment 6

•

5 years ago

A 853ms long event processing delay in the profile, and the gpu was doing mozilla::layers::LayerManagerComposite::ClearCachedResources. Could it be that the main thread was blocked by the GPU?

Markus Stange [:mstange]

Comment 7

•

5 years ago

The main thread shows a long time spent writing minidumps, under ContentParent::KillHard. This happens when content processes get terminated if they time out during shutdown. If you switch to the "Stack height" mode in the profiler, you can see that most content processes have very few samples, which indicates that their threads didn't get any CPU time allotted to them. It really looks like only three processes (parent, GPU, and web content 2/10) were being regularly scheduled by the OS, and all the other processes were starved. Then it's no wonder that those processes take too long to respond to requests to shut down.
However, it's not good that writing out minidumps happens on the parent process main thread. When we're already in such a bad state, we don't want to make the situation worse.
I don't know why some content processes are completely starved of CPU resources and others are not. It's possible that other tools (e.g. ETW) could shed light on this, but I've never used them.

Florian Quèze [:florian]

Comment 8

•

5 years ago

(In reply to Markus Stange [:mstange] from comment #7)

However, it's not good that writing out minidumps happens on the parent process main thread. When we're already in such a bad state, we don't want to make the situation worse.

We sometimes saw the same problem during startup on very slow hardware. See bug 1550754.

Olli Pettay [:smaug][bugs@pettay.fi]

Comment 9

•

5 years ago

Hmm, why are we killing any processes here?
Kagami, were you perhaps closing some tabs when profiling.
I guess that could trigger this code path.

I wonder if changing https://searchfox.org/mozilla-central/rev/a707541ff423ade0d81cef6488e6ecfa09273886/modules/libpref/init/StaticPrefList.yaml#1919 makes any difference. Try some significantly longer time there. 60?

Kagami Rosylight [:saschanaz] (they/them)

Reporter

Comment 10

•

5 years ago

Kagami, were you perhaps closing some tabs when profiling.
I guess that could trigger this code path.

Yes, as closing tabs when busy frequently causes a freeze.

Try some significantly longer time there. 60?

Will try.

Emilio Cobos Álvarez (:emilio)

Comment 11

•

5 years ago

(In reply to Markus Stange [:mstange] from comment #7)

The main thread shows a long time spent writing minidumps, under ContentParent::KillHard. This happens when content processes get terminated if they time out during shutdown. If you switch to the "Stack height" mode in the profiler, you can see that most content processes have very few samples, which indicates that their threads didn't get any CPU time allotted to them. It really looks like only three processes (parent, GPU, and web content 2/10) were being regularly scheduled by the OS, and all the other processes were starved. Then it's no wonder that those processes take too long to respond to requests to shut down.
However, it's not good that writing out minidumps happens on the parent process main thread. When we're already in such a bad state, we don't want to make the situation worse.
I don't know why some content processes are completely starved of CPU resources and others are not. It's possible that other tools (e.g. ETW) could shed light on this, but I've never used them.

The "content process takes too long to spawn, so we kill it and make a minidump of it, and everything becomes even worse" sounds a lot like what I was seeing in bug 1623609.

Simon Giesecke [:sg] [he/him]

Updated

•

5 years ago

Whiteboard: [wfh]

Kagami Rosylight [:saschanaz] (they/them)

Reporter

Comment 12

•

5 years ago

•

Edited

Try some significantly longer time there. 60?

Seems it's better but still freezing happens. This one is too big so here is the link: https://1drv.ms/u/s!ArDhQNmhsUj04cMQwwgAVMKfG1FOzw?e=Y23xay

Olli Pettay [:smaug][bugs@pettay.fi]

Comment 13

•

5 years ago

I wonder if profiles would be a bit smaller without screenshots.

Olli Pettay [:smaug][bugs@pettay.fi]

Updated

•

5 years ago

Whiteboard: [wfh] → [wfh][qf]

Olli Pettay [:smaug][bugs@pettay.fi]

Comment 14

•

5 years ago

Hmm, various things waiting for something. Parent process waiting for OS accessibility API to return.

Bas, since this is Windows, perhaps you have ideas what else could be improved here.

Flags: needinfo?(bas)

Florian Quèze [:florian]

Comment 15

•

5 years ago

(In reply to Olli Pettay [:smaug] from comment #13)

I wonder if profiles would be a bit smaller without screenshots.

Removing screenshots saved about 1MB, removing a few hidden idle tracks saved some more, and I could upload it: https://perfht.ml/3b8q7y1

Florian Quèze [:florian]

Comment 16

•

5 years ago

(In reply to Olli Pettay [:smaug] from comment #14)

Hmm, various things waiting for something. Parent process waiting for OS accessibility API to return.

mozilla::a11y::AccessibleHandler::get_accChild(tagVARIANT, IDispatch**) is a very frequent hang on BHR data; I think it's the first time I see a profile of it: https://perfht.ml/2Vuq2Ox

Benjamin De Kosnik [:bdekoz]

Comment 17

•

5 years ago

Hey Morgan, is there any chance you have noticed this in the a11y code? Can you scope this profile and comment?

Flags: needinfo?(mreschenberg)

Morgan Reschenberg [:morgan]

Comment 18

•

5 years ago

(In reply to Benjamin De Kosnik [:bdekoz] from comment #17)

Hey Morgan, is there any chance you have noticed this in the a11y code? Can you scope this profile and comment?

huh... I haven't seen this, but might be something Jamie's seen, he poked around a few performance issues recently. NI'ing him and will ping our channel too. thanks for the tag! Is the profile expired? it hangs at "downloading" for me :(

Flags: needinfo?(mreschenberg) → needinfo?(jteh)

James Teh [:Jamie]

Comment 19

•

5 years ago

I'm still not very good at matching up hangs between two processes using the profiler. :(

It looks like focus moves in the content process, so we send a focus event via IPC which then gets dispatched to Windows in the parent process. Some client (I think Windows Text Services Framework?) catches that event and tries to get the target accessible using IAccessible::get_accChild. That needs to make a cross-process call to content. And it ends up waiting quite a while.

My guess is that because building Firefox is so intensive and parallel, cross-process calls get a lot slower; it takes a lot longer for a process to get its turn to run. This is probably true for all IPC calls. However, because Windows a11y calls are synchronous, they end up blocking the parent process while they execute, so you notice it a lot more with a11y.

Was this with an installed Nightly or a local build? If a local build, does it happen as much with an installed Nightly?
Can you please provide the information from the accessibility section of about:support?

Flags: needinfo?(jteh)

James Teh [:Jamie]

Updated

•

5 years ago

Flags: needinfo?(krosylight)

Kagami Rosylight [:saschanaz] (they/them)

Reporter

Comment 20

•

5 years ago

It was on installed Nightly. about:support says:

Accessibility
Activated 	true
Prevent Accessibility 	0
Accessible Handler Used 	true
Accessibility Instantiator 	UNKNOWN|

Flags: needinfo?(krosylight)

Markus Stange [:mstange]

Comment 21

•

5 years ago

The about:support information doesn't look all that informational; Jamie, can you advise on how we could get more information about why accessibility is enabled here? Is there an easy way to turn it off?
See also bug 1633543, where loading a big page causes a similar freeze in the parent process.

Flags: needinfo?(jteh)

James Teh [:Jamie]

Comment 22

•

5 years ago

(In reply to Markus Stange [:mstange] from comment #21)

The about:support information doesn't look all that informational; Jamie, can you advise on how we could get more information about why accessibility is enabled here?

Unfortunately, there are some clients which inject dlls into Firefox itself and we can't detect those without some tricky work with a debugger. Again, in bug 1633543, I see msctf (Microsoft Text Services Framework) in the stack.

:saschanaz, are you by any chance using an IME; e.g. for an East Asian language? Also, does your system have a touch screen?

Is there an easy way to turn it off?

You can disable a11y by going to Options and checking "Prevent accessibility services from accessing your browser". However, this is really a last resort and is not something we want to publicly recommend. This will probably move to about:support eventually.

See also bug 1633543, where loading a big page causes a similar freeze in the parent process.

While a11y is definitely what's blocking the parent process here, it's pretty worrying that the content process is taking that long (20+ seconds!) to respond. Certainly, that would impact other things; e.g. if you scroll outside of the area rendered for APZ, try to click a link, etc.

Flags: needinfo?(jteh) → needinfo?(krosylight)

Kagami Rosylight [:saschanaz] (they/them)

Reporter

Comment 23

•

5 years ago

:saschanaz, are you by any chance using an IME; e.g. for an East Asian language? Also, does your system have a touch screen?

I do use Korean and Japanese IME and also a touchscreen as I'm using Surface Pro 7.

Flags: needinfo?(krosylight)

James Teh [:Jamie]

Comment 24

•

5 years ago

If you disable a11y (by going to Options and checking "Prevent accessibility services from accessing your browser"), does that negatively impact your IME or touch screen in any way?

Flags: needinfo?(krosylight)

Olli Pettay [:smaug][bugs@pettay.fi]

Updated

•

5 years ago

Depends on: 1633687

Kagami Rosylight [:saschanaz] (they/them)

Reporter

Comment 25

•

5 years ago

•

Edited

does that negatively impact your IME or touch screen in any way?

I guess I need more test but so far not any negative impact. (It does solve Bug 1633543.)

I think it still shouldn't be the default, right?

Flags: needinfo?(krosylight)

Jim Porter (:squib)

Updated

•

5 years ago

Whiteboard: [wfh][qf] → [wfh][qf:p3]

Markus Stange [:mstange]

Comment 26

•

5 years ago

I think it's important to find out why accessibility is being used here. Aaron, can you help (or forward to somebody who can)?

We shouldn't be comfortable shipping a configuration that can freeze the UI this easily to more users than absolutely necessary.
Just today I saw a profile from another Dark Reader user who also had accessibility freezing their parent process.

Flags: needinfo?(aklotz)

Markus Stange [:mstange]

Comment 27

•

5 years ago

Oh, the previous comment was written with bug 1633543 in mind, not so much about the "freeze while building" issue.

James Teh [:Jamie]

Comment 28

•

5 years ago

(In reply to Markus Stange [:mstange] from comment #26)

I think it's important to find out why accessibility is being used here. Aaron, can you help (or forward to somebody who can)?

That would be me. Aaron doesn't directly work on a11y any more.

This seems to be specific to certain systems and I don't have a system on which I can reproduce it. If I can get hold of such a system, I can debug this further. Markus, do you have any idea how I might go about getting hold of such a system?

That said, while I agree we shouldn't enable a11y when we don't need to, I think we should also be asking why the content process is taking so long to respond in these cases. That is not an a11y problem, and as I noted in comment 22, that's going to have other impacts as well even without a11y causing the parent process to freeze. To put it another way, the a11y problem here is a symptom of the content process taking forever to respond.

Flags: needinfo?(aklotz)

James Teh [:Jamie]

Comment 29

•

5 years ago

(In reply to Kagami :saschanaz from comment #25)

I guess I need more test but so far not any negative impact. (It does solve Bug 1633543.)

I think it still shouldn't be the default, right?

A11y needs to be enabled by default because there are users that depend on it. Preventing/disabling it is essentially a troubleshooting step. That said, the fact that you see no negative impact with a11y enabled suggests that you don't need it for your use case, and so it really shouldn't get enabled. The reason for my questions about impact is that if we can figure out why it gets enabled on systems like yours, we now know we can safely block this case.

James Teh [:Jamie]

Comment 30

•

5 years ago

I also have a patch locally which caches Accessible event targets in the parent process. That means a11y doesn't have to make cross-process calls to request information when responding to an event. I abandoned work on this because it didn't give me the perf benefits I hoped for. However, it might be interesting to see whether this helps here.

Kagami Rosylight [:saschanaz] (they/them)

Reporter

Comment 31

•

5 years ago

However, it might be interesting to see whether this helps here.

Is there anything I can help to get the patch landed?

BTW, Firefox still freezes without a11y, while other apps are fine: https://perfht.ml/3dzOaqv

Flags: needinfo?(jteh)

Markus Stange [:mstange]

Updated

•

5 years ago

Depends on: 1637048

Markus Stange [:mstange]

Comment 32

•

5 years ago

(In reply to Kagami :saschanaz from comment #31)

BTW, Firefox still freezes without a11y, while other apps are fine: https://perfht.ml/3dzOaqv

Interesting! This profile no longer shows the accessibility problem, but it still shows the problems from comment 7. I filed bug 1637048 about the issue.

Firefox Bug Husbandry Bot

Comment 33

•

5 years ago

Because this bug's Severity has not been changed from the default since it was filed, and it's Priority is P3 (Backlog,) indicating it has been triaged, the bug's Severity is being updated to S3 (normal.)

Severity: normal → S3

James Teh [:Jamie]

Comment 34

•

5 years ago

(In reply to Kagami :saschanaz from comment #31)

However, it might be interesting to see whether this helps here.

Is there anything I can help to get the patch landed?

It turns out that this patch wasn't actually achieving what it was intended to achieve, and fixing that is a pretty big lift with no guarantee it will even help here. :(

I still really want to figure out why a11y gets enabled on systems like yours and see if we can prevent that. I guess I'll talk to my manager about trying to get hold of such a machine.

Flags: needinfo?(jteh)

Markus Stange [:mstange]

Comment 35

•

5 years ago

There is reason to believe that bug 1687535 will stop accessibility from being enabled on this machine.

Depends on: 1687535

James Teh [:Jamie]

Updated

•

4 years ago

Depends on: 1737192

Dave Hunt [:davehunt] [he/him] ⌚BST

Updated

•

3 years ago

Performance Impact: --- → P3

Whiteboard: [wfh][qf:p3] → [wfh]

James Teh [:Jamie]

Updated

•

3 years ago

See Also: → https://github.com/nvaccess/nvda/issues/14123

James Teh [:Jamie]

Comment 36

•

2 years ago

This is resolved by Cache the World, which is enabled by default in Firefox 113.

Status: NEW → RESOLVED

Closed: 2 years ago

status-firefox113: --- → fixed

status-firefox114: --- → fixed

status-firefox115: --- → fixed

Resolution: --- → FIXED

Bas Schouten (:bas.schouten)

Updated

•

11 months ago

Flags: needinfo?(bas)

Pascal Chevrel:pascalc (PTO until Sept 2)

Updated

•

6 months ago

Keywords: enterprise

Whiteboard: [wfh]