Open Bug 1572915 Opened 2 years ago Updated 15 hours ago

Intermittent extreme unresponsiveness and high CPU usage after using Firefox for several hours with accessibility enabled

Categories

(Core :: Disability Access APIs, defect, P2)

x86_64
Windows 10
defect


People

(Reporter: MarcoZ, Unassigned)

References

(Depends on 1 open bug)

Details

(Keywords: crash, hang, regression)

Crash Data

Attachments

(4 files)

This bug is for crash report bp-a7adeada-5206-4bf6-8043-4cae10190810.

Top 10 frames of crashing thread:

0 ntdll.dll NtWaitForSingleObject 
1 kernelbase.dll WaitForSingleObjectEx 
2 xul.dll base::Thread::Stop ipc/chromium/src/base/thread.cc:128
3 xul.dll void mozilla::ipc::BrowserProcessSubThread::~BrowserProcessSubThread ipc/glue/BrowserProcessSubThread.cpp:52
4 xul.dll void mozilla::ipc::BrowserProcessSubThread::~BrowserProcessSubThread ipc/glue/BrowserProcessSubThread.cpp:51
5 xul.dll mozilla::ShutdownXPCOM xpcom/build/XPCOMInit.cpp:776
6 xul.dll void ScopedXPCOMStartup::~ScopedXPCOMStartup toolkit/xre/nsAppRunner.cpp:1243
7 xul.dll int XREMain::XRE_main toolkit/xre/nsAppRunner.cpp:4793
8 xul.dll XRE_main toolkit/xre/nsAppRunner.cpp:4847
9 firefox.exe static int NS_internal_main browser/app/nsBrowserApp.cpp:295

Jamie, I got this while shutting down Firefox when it hung once again with the hang we have been chasing for months, if not years. Perhaps you can get something, anything, out of the crash dump for the report above.

Flags: needinfo?(jteh)

Sadly, there doesn't seem to be much of use here. The process is already starting to shut down, so it's hard to know which parts are just the (albeit very slow) shutdown and which parts might be making it slow. One thing that does seem suspicious is this frame in thread 6:
`CROIDTable::ClientBulkUpdateOIDWithPingServer`
When I see this problem, I do see a COM service using a lot of CPU, and we do mess with COM ping behaviour. That still doesn't explain what triggers this or why, though.

Flags: needinfo?(jteh)

Morphing this into a bug to track this extreme unresponsiveness problem, since this crash is more a symptom of trying to exit when things are already broken rather than the problem itself.

Priority: -- → P2
Summary: Crash in [@ shutdownhang | base::Thread::Stop] → Intermittent extreme unresponsiveness and high CPU usage after using Firefox for several hours with accessibility enabled

Asa captured a profile when this problem was happening:
https://perfht.ml/2MVPoDd
Unfortunately, I can't see anything useful. That's very possibly because the problematic behaviour is happening on a background (COM?) thread. I'm not sure how to profile background (potentially system spawned) threads with the Firefox Profiler.

Attached file profiler stack
I see significant time spent in this stack. It looks like it originates from content state change events
(i.e. `RecvStateChangeEvent`).

Note that I also see show/hide and text change events, but they only account for 1-3 ms total each, while the state change events total 1,707 ms.

Thanks. Curious. That either means we're firing a lot of state change events, or we're only firing a few of them but it's taking a long time for us to make cross-proc COM queries to the content process to get info about them. Do you see anything suspicious in the content process?

The profile seems to start after the state change event(s) were fired from content, which is interesting.

I think it would be helpful if we put in an auto-marker in AccessibleWrap::FireWinEvent to see how often that is called, and how long we block there.

The stack in comment 4 shows that we're spending a lot of time in an NVDA code path which was added to work around bug 420845, which was fixed 10 years ago in Firefox 3.6. I've filed an NVDA PR to remove this. That might make a difference here. That said, that either means we're firing a huge number of state change events (why so many?) or we're blocking for a long time when NVDA tries to respond to them (why?).

Also, some notes from Asa on Slack:

I've seen this Firefox + NVDA + RPC hang several times now and while I don't know what's causing it, I can describe it more clearly.
After using Firefox with NVDA for three to four hours, a Windows process called "Service Host: Remote Procedure Call" starts chewing fifteen to twenty percent of my CPU. At that point, all Windows apps start to become intermittently but frequently unresponsive (not just Firefox and NVDA). This RPC process, which has two services listed in its tree, "Remote Procedure Call (RPC)" and "RPC Endpoint Mapper," is using a fair bit of CPU, but it doesn't seem like enough that my whole system would be CPU bound, and Firefox and NVDA are in the low single digits of CPU, so I don't think CPU alone is causing the hangs. It's more likely some functional bottleneck or something like that.
If I kill the NVDA process first, my apps become mostly responsive again, though that RPC process continues to chew about 15% CPU and Firefox will still occasionally jank, but not as often and not as sustained as when NVDA was running. If I then kill the Firefox process, the RPC process winds down to near 0% CPU and the system becomes completely normal. If, on the other hand, when the hangs start, I kill the Firefox process first, my apps become responsive again and the RPC process winds down to near 0% CPU and I can leave NVDA running without any issues. Lastly, if I kill the RPC process at any point, my whole system goes down -- don't bother trying this.
So, it's a three-way interaction causing the worst of the jank. Taking only NVDA out of the system returns things to OK but not great functioning. Taking only Firefox out of the equation returns the system to perfect shape.

I just saw this and tried killing all processes except the parent (UI) process. Obviously, that "crashed" all of my tabs, but it also fixed the unresponsiveness. That suggests the content process is triggering this somehow, though I guess it could also be related to the parent process communicating with the offending content process.

(In reply to James Teh [:Jamie] from comment #7)

The stack in comment 4 shows that we're spending a lot of time in an NVDA code path which was added to work around bug 420845, which was fixed 10 years ago in Firefox 3.6. I've filed an NVDA PR to remove this. That might make a difference here.

That PR got merged quite a while ago... and I can still repro this, so it isn't making a difference, as I suspected.

Yes, I can also still reproduce this on two machines. Sometimes it is enough to just let Firefox sit there; eventually it will start acting up. It is probably a refresh in one of the background tabs that triggers it. I still heavily suspect the Twitter PWA is a reliable way to trigger it if it runs long enough, but it is not confined to that.

Bugbug thinks this bug is a regression, but please revert this change in case of error.

Keywords: regression

After the fix for bug 1633650, I didn't see this for more than a week, whereas before that I was seeing it anywhere from once every couple of days to a few times a day. Marco hasn't seen it at all since that fix. My suspicion is that without that fix, we were leaking COM objects in the content process, which might cause COM to get bogged down in some periodic task.

However, I just saw it again, and Asa still sees it. So, there's more to do here.

Depends on: 1633650
Depends on: 1637474
Depends on: 1650590
Severity: critical → S2

Unfortunately, I just got this hang again a minute ago. So it seems that bug 1650590 does not fix it. :-(

The observation is the same as I described in our 1:1 last week: if I close just the window containing the offending content process, the rest of Firefox resumes operating normally. So it is something in a content process, not the parent process, because I can reopen the window and continue working once the hang has been averted.

I caught this yesterday and did some debugging.

  • Most of the time, there was no Firefox process using high CPU, only the RPC service mentioned in comment 1 and comment 7.
  • There was a CPU spike in one of the content processes at one point, so I grabbed a stack dump for that.
  • I broke into each content process with a debugger. When I broke into one of them, the RPC service CPU usage went down to 0. I got all excited... but when I resumed that content process, the CPU usage didn't go back up. Furthermore, the CPU usage went up again later even after I killed that process. So, I suspect that was just a coincidence. Nevertheless, I grabbed a stack dump for that one too.
  • In the parent process, I see several threads (including the main thread) stuck in combase!CROIDTable::ClientBulkUpdateOIDWithPingServer. That function is consistent with my findings in comment 1. It seems this gets called sometimes (maybe always?) when we acquire or release a COM object. In the stack dump I'm attaching, the main thread is releasing a COM object that isn't accessibility related at all. In fact, it doesn't look like any of these are related to accessibility objects. My guess is that we're cluttering some COM table earlier and COM can't handle it when it's updating things.
  • There is no mention of mscom or a11y in any of the threads in any of the processes.

Unfortunately, this means we haven't learnt anything new; there doesn't seem to be anything useful to be gained from these stacks. This is probably due to some COM table being overwhelmed, but we still don't know why. Once we've hit this point, the damage is already done.

If this is indeed related to the COM ping table, one thing I don't understand is why our objects end up in the ping table at all. We explicitly disable COM ping behaviour!

Depends on: 1654969
Depends on: 1654970

I filed bug 1654970 for the UIA leak (based on comment 18 onward).

I was hoping bug 1654969 would fix this, but Asa just encountered this problem in a build which contained this fix, so no joy. :(

I've asked Asa to test with UIA disabled in NVDA to see whether the UIA leak (bug 1654970) is triggering this. Obviously, disabling UIA isn't feasible for users relying solely on NVDA, and unfortunately, this leak is out of our control. However, if we confirm the UIA leak is the trigger, we can try to get Microsoft to prioritise a fix.

I've since encountered the RPC hang too.

Asa encountered this even with UIA disabled in NVDA. We even confirmed that uiautomationcore.dll wasn't loaded in the Firefox process. I'm not completely ruling out the UIA leak, since it's possible for UIA to run cross-process; that is unlikely, because UIA should only do this if injection of uiautomationcore.dll is blocked. Still, unlikely doesn't mean impossible.

Since then, I asked Asa to run NVDA + Firefox for hours during normal usage with three further configurations:

  1. Firefox build with AccessibleHandler tweaked to specify MSHLFLAGS_NOPING when marshaling. Still got the hang after several hours.
  2. Virtual buffers disabled. Still got the hang after several hours.
  3. Virtual buffers disabled and Firefox accessible.handler.enabled set to false. No hang or RPC CPU spikes for 22 hours.

With that, here's what we can surmise:

  1. This doesn't seem to be caused by in-process clients. Reasoning: this has never been seen with JAWS, which runs in-process.
  2. This doesn't seem to be triggered by NVDA's vbuf code, since it still happens when NVDA's vbufs are disabled.
  3. Therefore, this is probably related to out-of-process clients.
  4. This doesn't happen when AccessibleHandler is disabled.
  5. Therefore, this would seem to be somehow related to the marshaling of objects via AccessibleHandler destined for out-of-process clients.
    • However, having the handler marshal with NOPING doesn't fix it. That's not surprising, since NOPING is supposed to apply to the object as a whole after the first marshal.

I guess the next step is for me to dig into what else might be weird about our marshaling out from the handler.

I discussed this with aklotz and it occurred to us that the marshaler might be querying other interfaces implemented by the handler which don't map directly to proxied interfaces; e.g. IServiceProvider. I added some code to log any QIs which occur while the handler is marshaling out to a client. No joy: no QIs were logged.

When we marshal from the handler to an out-proc client, we strip the handler (because some clients can't load it), but we also override the IID, changing it from IID_IAccessible2_3 to IID_IAccessible. We do this because a client might not know about IID_IAccessible2_3. I tried removing this SetIID call, since NVDA knows about IID_IAccessible2_3 these days. Unfortunately, this didn't help either.

I also verified that:

  • CoReleaseMarshalData is definitely being called when a DocAccessibleParent dies and thus the PreservedStreamPtr gets destroyed.
  • This CoReleaseMarshalData does seem to be working. If we reset DocAccessibleParent::pParentProxyStream immediately after setting it (causing CoReleaseMarshalData to be called before the parent is ever returned to the client), trying to fetch the parent from content results in HRESULT "Object is not registered".
  • Parent process OuterDocAccessibles (which are returned as the parent of content documents via PassthruProxy) do get destroyed when they should. (I logged OuterDocAccessible construction and destruction.)