Closed Bug 1687535 Opened 3 years ago Closed 3 years ago

Accessibility gets initialized by unsupported interfaces for example on windows touch screen and related devices

Categories

(Core :: Disability Access APIs, defect)

defect

Tracking

()

RESOLVED FIXED
86 Branch
Performance Impact high
Tracking Status
firefox86 --- fixed

People

(Reporter: bas.schouten, Assigned: bas.schouten)

References

(Blocks 1 open bug)

Details

(Keywords: perf:responsiveness)

Attachments

(2 files, 1 obsolete file)

When the content process is somewhat busy (100-150ms JS executions, which aren't that uncommon during scenarios like pageload) my windows machines seem to frequently have long pauses in accessibility. This is related to focus events and I can reproduce this in at least one of my testcases.

See also this profile: https://share.firefox.dev/2XUvPPb

This has a huge impact on the user experience for this testcase. And I believe it has the same root cause as other use facing issues like https://bugzilla.mozilla.org/show_bug.cgi?id=1667901#c6

Flags: needinfo?(jteh)

The thing I'm really struggling to understand with issues like these is how the busy content process doesn't cause user-visible problems where a11y isn't a factor.

What happens here is that some a11y client (in a lot of these cases, it looks like some Windows component related to touch screens) receives an a11y focus event from content and wants to query the focus to see what it is. Because it's from content and a11y queries are sync, we have to dispatch a sync call to the busy content process; there's no way we can do it async. So, yes, that blocks the parent process, which definitely isn't ideal.

However, surely that affects other user tasks: typing, scrolling beyond APZ, etc.? Is it just that users don't generally scroll beyond APZ within, say, 150 ms, so the content process has a chance to catch up? Or is there something else I'm missing here? Keep in mind that I'm a totally blind user myself, so my understanding of the finer aspects of the visual UX is somewhat limited.

Flags: needinfo?(jteh)

Btw, are you using an installed Firefox build; i.e. not zip or local? Does about:support show "Accessible Handler Used" as true? If that's false, perf is going to be a great deal worse, since we can't cache anything in that case and there'll be a lot more sync cross-proc calls.

(In reply to James Teh [:Jamie] from comment #1)

However, surely that affects other user tasks: typing, scrolling beyond APZ, etc.? Is it just that users don't generally scroll beyond APZ within, say, 150 ms, so the content process has a chance to catch up? Or is there something else I'm missing here? Keep in mind that I'm a totally blind user myself, so my understanding of the finer aspects of the visual UX is somewhat limited.

If the content process is unresponsive, it does affect many interactions with the unresponsive tab: Typing in that tab is unresponsive, clicks are delayed, and scrolling with APZ reveals unpainted blank areas of the page. However, interactions in the browser chrome are still fully responsive: You can switch tabs, interact with the toolbar buttons, enter URLs etc. The freezing only affects the frozen page. As for scrolling, APZ pre-caches some additional content beyond what's currently visible on the screen, but it doesn't pre-cache the entire page. So while the content page is unresponsive, you can APZ scroll to reveal the cached areas. Anything beyond that is empty. Most importantly, the user still feels in control because their input is immediately respected - scrolling immediately moves the page, even if it doesn't immediately display the content that should be at the revealed location.

One of the main reasons for e10s (i.e. splitting Firefox into multiple processes) was just this: Improving responsiveness of the browser by keeping the browser UI responsive even when content pages are busy.
We can't allow the parent process to block on content. It entirely defeats the purpose.

On one machine I get:

Accessibility
Activated true
Prevent Accessibility 0
Accessible Handler Used false
Accessibility Instantiator UNKNOWN|

On another I get:

Accessibility
Activated true
Prevent Accessibility 0
Accessible Handler Used true
Accessibility Instantiator UNKNOWN|

Both exhibit these responsiveness issues. So the handler doesn't seem to matter. The question is why do these machines think accessibility is activated? Where's the code that determins this so I can try and debug?

A quick glance suggests the accessibility service is being activated here: https://paste.rs/zKx Therefore I expect it to be on for everyone with DPI scaling..... This is.. bad.

(In reply to Bas Schouten (:bas.schouten) from comment #6)

A quick glance suggests the accessibility service is being activated here: https://paste.rs/zKx Therefore I expect it to be on for everyone with DPI scaling..... This is.. bad.

This seems to have been a premature conclusion. The DPI scaling window proc being on the stack seems to be unrelated, the investigation continues.

Basically LaxyInstantiator::ShouldInstantiate is getting an in-process call seemingly from the windows internals. I have not yet been able to determine why. None of the blocklisted DLLs are loaded, and so it proceeds to instantiate a11y.

(In reply to Markus Stange [:mstange] from comment #4)

We can't allow the parent process to block on content. It entirely defeats the purpose.

Ideally, I agree, except we are dealing with OS APIs which are sync and for which there is no async equivalent. The only way to get around this completely would be to cache the entire a11y tree in the parent process. FWIW, that's something we're looking into this year, although we're estimating it'll be about 12 months of work for our team.

(In reply to Bas Schouten (:bas.schouten) from comment #5)

The question is why do these machines think accessibility is activated? Where's the code that determins this so I can try and debug?

Something in msctf.dll (which I believe is related to Microsoft Text Services Framework) seems to respond to accessibility events on certain systems, such as Microsoft Surface machines. I'm unclear as to whether this is all touch screen devices or only some. I don't have access to such a machine, so it's not something I'm able to debug.

If we can get the function signature for TF_InvalidAssemblyListCacheIfExist, we could hook it and prevent a11y in that case. We already do this for tiptsf!ProcessCaretEvents, which is needed to block touch screen a11y instantiations on Windows 8. We also have code to block touch screen a11y instantiations on Windows 10, but either that code isn't working any more, Windows has evolved or some devices are different somehow.

For informative purposes I'm putting up a very rough workaround patch here. It does the job and fixes a number of responsiveness issues. Not sure if it provides practical problems using a11y.

Assignee: nobody → bas
Assignee: bas → nobody

(In reply to James Teh [:Jamie] from comment #9)

(In reply to Bas Schouten (:bas.schouten) from comment #5)

The question is why do these machines think accessibility is activated? Where's the code that determins this so I can try and debug?

Something in msctf.dll (which I believe is related to Microsoft Text Services Framework) seems to respond to accessibility events on certain systems, such as Microsoft Surface machines. I'm unclear as to whether this is all touch screen devices or only some. I don't have access to such a machine, so it's not something I'm able to debug.

Note that one of these devices is a desktop, but with a drawing tablet connected. So that may trigger it. The other is a Thinkpad, without the screen open but the screen is a touch screen.

(In reply to James Teh [:Jamie] from comment #9)

(In reply to Markus Stange [:mstange] from comment #4)

We can't allow the parent process to block on content. It entirely defeats the purpose.

Ideally, I agree, except we are dealing with OS APIs which are sync and for which there is no async equivalent. The only way to get around this completely would be to cache the entire a11y tree in the parent process. FWIW, that's something we're looking into this year, although we're estimating it'll be about 12 months of work for our team.

How do the security folks feel about that?

Fwiw, Visual Studio believes our stack unwinder is wrong, it lists this as the stack: https://paste.rs/Oo6

(In reply to James Teh [:Jamie] from comment #9)

(In reply to Markus Stange [:mstange] from comment #4)

We can't allow the parent process to block on content. It entirely defeats the purpose.

Ideally, I agree, except we are dealing with OS APIs which are sync and for which there is no async equivalent. The only way to get around this completely would be to cache the entire a11y tree in the parent process. FWIW, that's something we're looking into this year, although we're estimating it'll be about 12 months of work for our team.

That's excellent news!

(In reply to Bas Schouten (:bas.schouten) from comment #14)

Fwiw, Visual Studio believes our stack unwinder is wrong, it lists this as the stack: https://paste.rs/Oo6

Looks like it's just some bad symbols for some system DLLs. Our stack unwinder only looks at what's in the DLL, whereas Visual Studio gets extra symbols from the Microsoft Symbol Server.

What are the arguments to LazyInstantiator::QueryService? There is a possibility that the service ID and/or IID are unsupported anyway we could filter out those requests...

(In reply to Aaron Klotz [:aklotz] from comment #16)

What are the arguments to LazyInstantiator::QueryService? There is a possibility that the service ID and/or IID are unsupported anyway we could filter out those requests...

IID is {33F139EE-E509-47F7-BF39-837644F74576}. Only reference to this I can see is https://social.msdn.microsoft.com/Forums/en-US/ff2857bb-b682-4d6f-92ee-68737567be87/windows-8-onscreen-keyboard-on-logon-screen, related to OSK, indeed I have a drawing tablet connected to my desktop, which supports OSK, and my laptop has a touch screen which supports OSK as well. We don't support this IID, it may be best to just explicitly ignore it.

(In reply to Bas Schouten (:bas.schouten) from comment #17)

(In reply to Aaron Klotz [:aklotz] from comment #16)

What are the arguments to LazyInstantiator::QueryService? There is a possibility that the service ID and/or IID are unsupported anyway we could filter out those requests...

IID is {33F139EE-E509-47F7-BF39-837644F74576}. Only reference to this I can see is https://social.msdn.microsoft.com/Forums/en-US/ff2857bb-b682-4d6f-92ee-68737567be87/windows-8-onscreen-keyboard-on-logon-screen, related to OSK, indeed I have a drawing tablet connected to my desktop, which supports OSK, and my laptop has a touch screen which supports OSK as well. We don't support this IID, it may be best to just explicitly ignore it.

Going to go one step further and create a list of IIDs we -do- support. And return E_NOTIMPL for everything else, how do you feel about this Aaron?

Flags: needinfo?(aklotz)

(In reply to Bas Schouten (:bas.schouten) from comment #18)

(In reply to Bas Schouten (:bas.schouten) from comment #17)

(In reply to Aaron Klotz [:aklotz] from comment #16)

What are the arguments to LazyInstantiator::QueryService? There is a possibility that the service ID and/or IID are unsupported anyway we could filter out those requests...

IID is {33F139EE-E509-47F7-BF39-837644F74576}. Only reference to this I can see is https://social.msdn.microsoft.com/Forums/en-US/ff2857bb-b682-4d6f-92ee-68737567be87/windows-8-onscreen-keyboard-on-logon-screen, related to OSK, indeed I have a drawing tablet connected to my desktop, which supports OSK, and my laptop has a touch screen which supports OSK as well. We don't support this IID, it may be best to just explicitly ignore it.

Going to go one step further and create a list of IIDs we -do- support. And return E_NOTIMPL for everything else, how do you feel about this Aaron?

Or we could just reuse https://searchfox.org/mozilla-central/source/accessible/ipc/win/handler/AccessibleHandler.cpp#1317 ... which actually has this GUID in there :)

(In reply to Bas Schouten (:bas.schouten) from comment #18)

Going to go one step further and create a list of IIDs we -do- support. And return E_NOTIMPL for everything else, how do you feel about this Aaron?

That seems reasonable. Unfortunately, from what I've seen in the past, I'd say we're eventually going to hit some other (non-QueryService) call we can't handle simply like that; e.g. some IAccessible call. Let's hope I'm wrong.

Pretty much, assuming Jamie's OK with this plan. We should gate on the service ID (I don't think the IID really matters as much) and only allow the ones we actually support. For reference, the "real" implementation of IServiceProvider::QueryService is here, so any GUIDs we check aGuidService against in that function should be included in such a LazyInstantiator allow list.

Flags: needinfo?(aklotz)

This prevents us from instantiating accessibility (and incurring a significant performance cost later in the runtime) when we're not going to support a requested interface anyway.

Assignee: nobody → bas
Status: NEW → ASSIGNED

(In reply to Aaron Klotz [:aklotz] from comment #21)

Pretty much, assuming Jamie's OK with this plan. We should gate on the service ID (I don't think the IID really matters as much) and only allow the ones we actually support. For reference, the "real" implementation of IServiceProvider::QueryService is here, so any GUIDs we check aGuidService against in that function should be included in such a LazyInstantiator allow list.

For simplicity sake, I used the Unsupported list for now. Depending on the telemetry that comes back from this I'd like to try and uplift it and minimal risk to accessibility seems important. We can do a whitelist based approach later in my mind :).

Summary: Somewhat busy content process leads to long jank in accessibility → Accessibility gets initialized by unsupported interfaces for example on windows touch screen and related devices
Blocks: 1626789
Blocks: 1623221

I'm assuming you tested this on your system and it prevented instantiation, which is great. However, I'm not convinced this is going to solve all the cases of this we know about on touch systems. My investigation in bug 1531537 comment 7 showed that even after QueryService, msctf made a11y calls such as get_accParent, which this won't prevent. Still, we can deal with other cases separately.

(In reply to James Teh [:Jamie] from comment #25)

I'm assuming you tested this on your system and it prevented instantiation, which is great. However, I'm not convinced this is going to solve all the cases of this we know about on touch systems. My investigation in bug 1531537 comment 7 showed that even after QueryService, msctf made a11y calls such as get_accParent, which this won't prevent. Still, we can deal with other cases separately.

So it worked sometimes (the first time), and isn't working others, sadly, I don't know why, but am investigating.

(In reply to James Teh [:Jamie] from comment #25)

I'm assuming you tested this on your system and it prevented instantiation, which is great. However, I'm not convinced this is going to solve all the cases of this we know about on touch systems. My investigation in bug 1531537 comment 7 showed that even after QueryService, msctf made a11y calls such as get_accParent, which this won't prevent. Still, we can deal with other cases separately.

The get_accChild is actually responsible for the number 2 top hang in BHR. If we factor in that only 10% of people have this issue it is by far the number one reason for hangs on these devices. And so will have to be addressed. I will continue to investigate to figure out a good way, if it means some hackery then I guess that will have to happen.

Whiteboard: [qf:p1][responsiveness]

(In reply to Bas Schouten (:bas.schouten) from comment #26)

So it worked sometimes (the first time), and isn't working others, sadly, I don't know why, but am investigating.

Honestly, I think the best way to deal with this is probably to hook MSCTF!WinEventProc as I speculated in bug 1531537 comment 9. I didn't suggest that here because the initial stack didn't include that frame, but the stack from VS does, so I think that's the best path.

It'd help a lot if I could get hold of an x64 machine which exhibits this... but in the meantime, I can only provide guidance.

(In reply to Bas Schouten (:bas.schouten) from comment #13)

The only way to get around this completely would be to cache the entire a11y tree in the parent process.

How do the security folks feel about that?

I haven't asked explicitly, but note that:

  1. The entire a11y tree is already available from the parent process (albeit not in memory), since we link the trees together by way of COM and any client (in-process or out-of-process) can reach into content as needed.
  2. A subset of the info for the entire tree (the role and relationships between objects) is already available in parent.
  3. Caching the entire tree in parent is the approach taken by Chromium.
  4. This is the only way we've been able to come up with (and believe me, we've brainstormed a lot of solutions over the past few years) to truly address these performance concerns.

I suspect I may be able to do something easier. The first request that comes in is for CHILDID_SELF (as you'd already concluded), this is obviously easily handled by returning the LazyInstantiator. The second one is for OBJ_SYSMENU on the LazyInstantiator, I believe that, we can again, handle this without instantiating accessibility. Handling that one now..

I suspect your approach in https://bugzilla.mozilla.org/show_bug.cgi?id=1531537#c7 may be very feasible and actually allow us to be correct with relation to OSKs.

(In reply to James Teh [:Jamie] from comment #29)

(In reply to Bas Schouten (:bas.schouten) from comment #13)

The only way to get around this completely would be to cache the entire a11y tree in the parent process.

How do the security folks feel about that?

I haven't asked explicitly, but note that:

  1. The entire a11y tree is already available from the parent process (albeit not in memory), since we link the trees together by way of COM and any client (in-process or out-of-process) can reach into content as needed.
  2. A subset of the info for the entire tree (the role and relationships between objects) is already available in parent.
  3. Caching the entire tree in parent is the approach taken by Chromium.
  4. This is the only way we've been able to come up with (and believe me, we've brainstormed a lot of solutions over the past few years) to truly address these performance concerns.

It also would be a huge cost to pay for the majority of users getting all of accessibility enabled just because they have a touch screen which they're not using. There's other costs to performance and memory here rather than just this call. And that group is only going to grow.

(In reply to Bas Schouten (:bas.schouten) from comment #30)

I suspect your approach in https://bugzilla.mozilla.org/show_bug.cgi?id=1531537#c7 may be very feasible and actually allow us to be correct with relation to OSKs.

It depends how you define "correct". I really don't know what the OSK wants or why it wants it. If disabling accessibility doesn't break anything in the OSK, then clearly the OSK doesn't really need this info at all and we should just prevent it from using a11y. On the other hand, if we truly want to give the OSK what it wants, we should be firing focus events (which requires a11y to be instantiated) so it can get the focused role, parent or whatever it wants.

My guess is that the OSK wants to look at focus events to work out whether a text box got focus, for example. It will never get the right answer to that question unless a11y is fully instantiated, since we don't fire focus events or have any a11y info at all before that point.

We can't have it both ways. We either want to allow the OSK to use a11y or we don't.

(In reply to Bas Schouten (:bas.schouten) from comment #30)

I suspect I may be able to do something easier. The first request that comes in is for CHILDID_SELF (as you'd already concluded), this is obviously easily handled by returning the LazyInstantiator. The second one is for OBJ_SYSMENU on the LazyInstantiator, I believe that, we can again, handle this without instantiating accessibility. Handling that one now..

I suspect your approach in https://bugzilla.mozilla.org/show_bug.cgi?id=1531537#c7 may be very feasible and actually allow us to be correct with relation to OSKs.

Ah, the OBJ_SYSMENU one is unrelated, that just happens because accessibility has now been instantiated :). It doesn't appear to happen otherwise afaict.

(In reply to James Teh [:Jamie] from comment #32)

(In reply to Bas Schouten (:bas.schouten) from comment #30)

I suspect your approach in https://bugzilla.mozilla.org/show_bug.cgi?id=1531537#c7 may be very feasible and actually allow us to be correct with relation to OSKs.

It depends how you define "correct". I really don't know what the OSK wants or why it wants it. If disabling accessibility doesn't break anything in the OSK, then clearly the OSK doesn't really need this info at all and we should just prevent it from using a11y. On the other hand, if we truly want to give the OSK what it wants, we should be firing focus events (which requires a11y to be instantiated) so it can get the focused role, parent or whatever it wants.

My guess is that the OSK wants to look at focus events to work out whether a text box got focus, for example. It will never get the right answer to that question unless a11y is fully instantiated, since we don't fire focus events or have any a11y info at all before that point.

We can't have it both ways. We either want to allow the OSK to use a11y or we don't.

I suspect, at least, that it's trying to get less information than that, since at this point the OSK isn't loaded at all, it just exists and may be loaded at some point! But it's certainly possible I'm wrong :-).

(In reply to Bas Schouten (:bas.schouten) from comment #31)

[Caching the a11y tree in parent] also would be a huge cost to pay for the majority of users getting all of accessibility enabled just because they have a touch screen which they're not using. There's other costs to performance and memory here rather than just this call. And that group is only going to grow.

Sure. So if you don't want the OSK to pay that penalty, we hook MSCTF!WinEventProc and block it completely.

I'd point out that Windows itself is starting to use a11y for more and more things. Windows Text Cursor Indicator is the most recent example. We can't continue to see a11y as something only used by traditional accessibility tools like screen readers. And if we want those things to perform well, caching in parent is the only option.

(In reply to James Teh [:Jamie] from comment #35)

(In reply to Bas Schouten (:bas.schouten) from comment #31)

[Caching the a11y tree in parent] also would be a huge cost to pay for the majority of users getting all of accessibility enabled just because they have a touch screen which they're not using. There's other costs to performance and memory here rather than just this call. And that group is only going to grow.

Sure. So if you don't want the OSK to pay that penalty, we hook MSCTF!WinEventProc and block it completely.

I'd point out that Windows itself is starting to use a11y for more and more things. Windows Text Cursor Indicator is the most recent example. We can't continue to see a11y as something only used by traditional accessibility tools like screen readers. And if we want those things to perform well, caching in parent is the only option.

For the last couple of years if I look at bugs for now this seems to be the main thing affecting our users. I agree we may need a solution for the longer term. But then we probably have to look at explicitly how to support those specific requests with a minimal amount of work. We go through far more trouble to avoid far smaller performance costs occurring for users :).

The Windows Test Cursor Indicator (to the best of my knowledge!) is disabled by default, when people enable it I think we should be perfectly okay incurring a significantly larger performance costs. (Just as when people are actually using the OSK)

(Fwiw, I'm not against hooking MSCTF WinEventProc I'm just concerned will interfere with others parts of the OSK's regular workings, but maybe I'm wrong, so I'll try that if this fails :-))

This seems to occur on some systems with touch screens.
There doesn't seem to be any negative impact when a11y is disabled on these systems, so we shouldn't instantiate it for msctf alone.

(In reply to Bas Schouten (:bas.schouten) from comment #36)

(Fwiw, I'm not against hooking MSCTF WinEventProc

I'm not able to test it, but I just uploaded a patch which should hopefully do this if you can give it a try.

I'm just concerned will interfere with others parts of the OSK's regular workings, but maybe I'm wrong, so I'll try that if this fails :-))

The goal of everything we're doing here is to not instantiate a11y if triggered by msctf. That's exactly what hooking this does. If this interferes, then that's because the OSK really does need proper a11y info.

Attachment #9198068 - Attachment description: Bug 1687535: Have LazyInstantiator::QueryService bail early for unsupported services. r=jamie → Bug 1687535: Give several LazyInstantiator functions fast paths that avoid a11y instantiation. r=jamie
Attachment #9198082 - Attachment is obsolete: true

It turns out hooking msctf!WinEventProc doesn't work and isn't an option. As we discovered in comment 15 (and I completely failed to realise the significance of this when writing the patch), that symbol isn't exported from the dll; it's only in the symbols provided by the Microsoft symbol server. We obviously can't hard-code the address, so unfortunately, we just can't hook this.

Bas's patch shouldn't hurt real a11y clients and does appear to prevent a11y from being instantiated on his system (and doesn't appear to break anything), even if he explicitly uses the OSK. So, we're going with that approach. Thanks, Bas!

(In reply to James Teh [:Jamie] from comment #39)

It turns out hooking msctf!WinEventProc doesn't work and isn't an option. As we discovered in comment 15 (and I completely failed to realise the significance of this when writing the patch), that symbol isn't exported from the dll; it's only in the symbols provided by the Microsoft symbol server. We obviously can't hard-code the address, so unfortunately, we just can't hook this.

It depends on how nasty you want to get! ;-) Presumably it gets passed as a function pointer to SetWinEventHook, right? If we were really desperate, we could hook the latter and check for the function pointer that comes from the former. And by that point, in fact, you don't even need to hook msctf itself.

Pushed by bschouten@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/7907ec5d7d1f
Give several LazyInstantiator functions fast paths that avoid a11y instantiation. r=Jamie
Status: ASSIGNED → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
Target Milestone: --- → 86 Branch

Telemetry is showing evidence of this working and dropping percentages of a11y instantiation dropping considerably. We'll have exact numbers in ~2 days I suspect.

Whiteboard: [qf:p1][responsiveness] → [qf:p1:responsiveness]

'final numbers' seem to suggest we've cut a11y instantiations roughly in half. On Windows in particular we went from 12-13% to 5-6%.

Blocks: 1686303
Regressions: 1726852
No longer regressions: 1726852
Depends on: 1737192
Performance Impact: --- → P1
Whiteboard: [qf:p1:responsiveness]
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: