Closed Bug 1396039 Opened 7 years ago Closed 1 year ago

get_accChild hangs somewhat frequently

Categories

(Core :: Disability Access APIs, enhancement, P3)

enhancement

Tracking

()

RESOLVED FIXED
Tracking Status
firefox113 --- fixed
firefox114 --- fixed
firefox115 --- fixed

People

(Reporter: nika, Unassigned)

References

(Blocks 1 open bug)

Details

(Whiteboard: [aes+][bhr:get_accChild][bhr-html][fxperf:p3])

This is a (hopefully stable-ish) view into some of the hang's we're seeing in the parent process due to acc_getChild. It makes up about 5% of the parent process main thread hangs over 2 seconds long being detected by BHR.

https://squarewave.github.io/?durationSpec=2048_65536&payloadID=131360179ef146ab85e01052d16ebe89&platform=Windows_NT%3A10%3Ax86-64&range=4_4&runnable=%3F%3F%3F&search=get_acc&thread=0
Putting this on our e10s a11y ship radar.
Priority: -- → P1
Whiteboard: [bhr] → [aes?][bhr]
Blocks: 1365655
Priority: P1 → P3
Adding the [bhr-html] tag to match arewesmoothyet's File a Bug autofilling. Nika, what are you using the [bhr] tag for? Should we consolidate these tags? I'm using [bhr-html] to track things that were filed due to arewesmoothyet.
Whiteboard: [aes?][bhr] → [aes?][bhr][bhr-html]
(In reply to Doug Thayer [:dthayer] from comment #2)
> Adding the [bhr-html] tag to match arewesmoothyet's File a Bug autofilling.
> Nika, what are you using the [bhr] tag for? Should we consolidate these
> tags? I'm using [bhr-html] to track things that were filed due to
> arewesmoothyet.

I've been using [bhr] to mark performance issues which have been detected through BHR, although I haven't been super consistent about it. The idea was to have some way to track performance issues which cause hangs in a similar way to how we tracked [qf] bugs.
Whiteboard: [aes?][bhr][bhr-html] → [aes+][bhr][bhr-html]
Jamie just a needinfo for your awareness. Wondering if this is related to freezes Mick was seeing.
Flags: needinfo?(jteh)
Actually, Nika, is this still showing up? I could find it in recent report data. I do get some hits for "Accessible" in the call stack though...
Flags: needinfo?(nika)
This definitely still shows up - not sure if the volume is still comparable to the old one - we seem to get (from nightly) 50-200 reports of >2s hangs in get_accChild per release in the last week. 

http://arewesmoothyet.com/?category=all&durationSpec=2048_65536&payloadID=76f882ef7b6343269088e1acee7ef0a0&search=get_accChild&thread=1
Flags: needinfo?(nika)
I can't read these reports due to inaccessibility. It'd be good if someone could give me an example of a stack where this occurs.

The only reason I can think that this would occur is if the content process is busy handling other stuff.
Flags: needinfo?(jteh)
This stack is still very frequently hanging, yesterday it caused an accumulated 2.8 hours of hang in over 15k occurrences in Nightly alone.

This is the most frequently hanging stack:

mozilla::a11y::AccessibleHandler::get_accChild(tagVARIANT,IDispatch * *) AccessibleHandler
static struct already_AddRefed<IDispatch> GetProxiedAccessibleInSubtree(const class mozilla::a11y::DocAccessibleParent *, const struct tagVARIANT & const) xul
struct already_AddRefed<IAccessible> mozilla::a11y::AccessibleWrap::GetRemoteIAccessibleFor(const struct tagVARIANT & const) xul
mozilla::a11y::AccessibleWrap::GetIAccessibleFor(tagVARIANT const &,bool *) xul
mozilla::a11y::AccessibleWrap::get_accChild(tagVARIANT,IDispatch * *) xul
AccWrap_Base::get_accChild(tagVARIANT,IDispatch * *) oleacc
AccessibleObjectFromEvent oleacc
CThreadInputMgr::OnAccFocusEvent(HWND__ *,long,long) msctf
WinEventProc(HWINEVENTHOOK__ *,unsigned long,HWND__ *,long,long,unsigned long,unsigned long) msctf
_ClientCallWinEventProc user32
KiUserCallbackDispatcherContinue ntdll
NtUserNotifyWinEvent win32u
NotifyWinEvent user32
mozilla::a11y::DocAccessibleParent::RecvFocusEvent(unsigned __int64 const &,mozilla::gfx::IntRectTyped<mozilla::LayoutDevicePixel> const &) xul
mozilla::a11y::PDocAccessibleParent::OnMessageReceived(IPC::Message const &) xul
PDocAccessible::Msg_FocusEvent 
void mozilla::ipc::MessageChannel::DispatchMessageW(class IPC::Message *) xul
mozilla::ipc::MessageChannel::MessageTask::Run() xul
nsThread::ProcessNextEvent(bool,bool *) xul

Jamie, does this help you in any way? It would be great to see this reduced.
Flags: needinfo?(jteh)
Whiteboard: [aes+][bhr][bhr-html] → [aes+][bhr][bhr-html][fxperf]
Are these hangs still > 2 seconds as suggested in comment 0? Do we have any idea of the average hang time, etc.?

From this stack, it looks like the focus changes in content, we fire an accessibility focus event, and then Microsoft Text Services Framework (used for IMEs) wants to query info about the focus (which seems perfectly reasonable). Because the focus is in content, it has to make a cross-process query. That's where the hang seems to occur. That suggests that the content main thread is hung for some reason which may not even be related to accessibility. In order to really figure out what's oging on here, we'd need to correlate this with the stack at that same time from the content process being waited on, but I'm guessing there's no way for us to get that info?

One idea is to send up a COM proxy (along with cached handler data) for the focus along with the focus event. That way, the info TSF needs would hopefully already be there and not require a cross-proc call. I'm not sure whether this is possible, though; it would need investigation. Unfortunately, there's also no guarantee it would work: it's entirely possible TSF will query for something that isn't in the cache, at which point we're back to square 1. And the worst part is that there's no way to even know what TSF will ask for in these situations without being able to reproduce this ourselves.
Flags: needinfo?(jteh)
Yes, there are definitely still cases where this is hanging more than 2 seconds. I'm not sure if just averaging out the number I posted above is helpful here, in the > 2s hang case the mean time seems to be 4s, in the 0.5 - 2s range the mean is about 900ms.
Nika, regarding comment 9, it seems like we could correlate hangs across processes reasonably well just using timestamps, no? We could either send the timestamps with the ping or add some kind of correlatedHangId to hangs based on timestamps when we're building the ping. Thoughts?
Flags: needinfo?(nika)
(In reply to Doug Thayer [:dthayer] from comment #11)
> Nika, regarding comment 9, it seems like we could correlate hangs across
> processes reasonably well just using timestamps, no? We could either send
> the timestamps with the ping or add some kind of correlatedHangId to hangs
> based on timestamps when we're building the ping. Thoughts?

I'm honestly not sure. Perhaps? I haven't touched the code much recently. If you have an idea for how to implement it go for it :-)
Flags: needinfo?(nika)
Whiteboard: [aes+][bhr][bhr-html][fxperf] → [aes+][bhr][bhr-html][fxperf:p3]
Whiteboard: [aes+][bhr][bhr-html][fxperf:p3] → [aes+][bhr:get_accChild][bhr-html][fxperf:p3]

I just ran into this and got a profile (link). It doesn't look tremendously useful - the content process is busy responding to a memory-pressure event during the window where we're blocking on get_accChild. The only conclusion I can draw here is "sometimes content processes are hung and get_accChild means the parent process can hang too," which brings us back to the suggestion in the bottom paragraph of comment 9.

That being said, I don't know why accessibility code is even running on my machine? It may have something to do with the fact that I have a drawing tablet plugged in? James, should this code be running for me if that's the case, and I haven't done anything else on my machine to set up accessibility utilities? The tablet is pretty common, just a Wacom Intuos something or other.

Flags: needinfo?(jteh)

(In reply to Doug Thayer [:dthayer] (he/him) from comment #13)

"sometimes content processes are hung and get_accChild means the parent process can hang too," which brings us back to the suggestion in the bottom paragraph of comment 9.

The a11y team has a (massive) project underway to completely re-architect cross-process a11y to deal with this among other things. The idea is to cache the a11y trees for all content processes in the parent process; see bug 1694563. Unfortunately, it keeps getting de-prioritised due to other work and too few people. :(

That being said, I don't know why accessibility code is even running on my machine? It may have something to do with the fact that I have a drawing tablet plugged in? James, should this code be running for me if that's the case, and I haven't done anything else on my machine to set up accessibility utilities?

The tablet is very likely the cause, yes. However, bug 1687535 dealt with a lot of cases of unwanted instantiation like this. I'm not sure why that tweak isn't working in your case, and without being able to debug on the machine itself, I can't really diagnose that. If you're familiar with Windows debugging, you could try placing a breakpoint on mozilla:a11y::LazyInstantiator::ResolveMsaaRoot. I'd be interested to know what the stack is (and I made need arguments from some functions, so a minidump would be very helpful).

Also, what does about:support show for "Accessibility instantiator"?

Flags: needinfo?(jteh)
Depends on: 1737192

(for comment 14)

Flags: needinfo?(dothayer)

Ack! Sorry for dropping the thread here. Anyway, I can no longer reproduce this. However, the request sounded really familiar so I did some digging, and I found a chat thread with aklotz from a year ago where he asked for the same info, and my response was "about:support says the Accessbility Instantiator is UNKNOWN|" and "I couldn't see anything in the stack - it all looked pretty generic?". Then aklotz said "hmmm... interesting...", and that's the end of the thread... 🤷

Flags: needinfo?(dothayer)
Severity: normal → S3

This is resolved by Cache the World, which is enabled by default in Firefox 113.

Status: NEW → RESOLVED
Closed: 1 year ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.