Closed Bug 677883 Opened 13 years ago Closed 13 years ago

[e10s] WM_GETOBJECT regularly fails on content process windows in anything but the first tab

Categories

(Core :: Disability Access APIs, defect)

x86_64
Windows 7
defect
Not set
major

Tracking

()

RESOLVED FIXED
mozilla10

People

(Reporter: MarcoZ, Assigned: jimm)

References

()

Details

(Whiteboard: [e10s])

Attachments

(1 file)

Spun off from bug 675155. This bug often manifests itself when opening pages from anything but the first tab. What happens is that NvDA announces the presence of a NetscapeDispatchWnd. NetscapeDispatchWnd is actually the window text of the content process window. The fact that NVDA speaks this window text is an indicator that responding to the WM_GETOBJECT message failed. Jamie suspects that this is due to high IPC load on the content process, similar to what's the cause of bug 599814 (OOPP).
(In reply to Marco Zehe (:MarcoZ) from comment #0)
> Jamie suspects that this is due to high IPC load on the content
> process
Just a clarification: it needn't be "high IPC load". It just has to be any IPC request (no matter how small) that is in progress at the same time that a client sends WM_GETOBJECT. This explains why the behaviour is intermittent.
(In reply to James Teh [:Jamie] from comment #1)
> (In reply to Marco Zehe (:MarcoZ) from comment #0)
> > Jamie suspects that this is due to high IPC load on the content
> > process
> Just a clarification: it needn't be "high IPC load". It just has to be any
> IPC request (no matter how small) that is in progress at the same time that
> a client sends WM_GETOBJECT. This explains why the behaviour is intermittent.

so when nested WM_GETOBJECT happens then we are unable to proceed it? Jamie, can you think of small app that demonstrate it (in means of permanent step to reproduce)?
(In reply to alexander surkov from comment #2)
> > It just has to be any
> > IPC request (no matter how small) that is in progress at the same time that
> > a client sends WM_GETOBJECT.
> so when nested WM_GETOBJECT happens then we are unable to proceed it?
I don't mean nested WM_GETOBJECT; that should never happen. I mean an client sending a WM_GETOBJECT while IPC is happening between the chrome and content processes.

> can you think of small app that demonstrate it (in means of permanent step
> to reproduce)?
Not really. You need to ensure that content->chrome IPC is happening while a client sends WM_GETOBJECT. Because content->chrome IPC requests generally happen pretty fast, this is going to be very hard to trigger reliably. The only way you could possibly manage it is to hack a content->chrome IPC request which takes a long time and then send WM_GETOBJECT while it is blocked.
This try-server build does not fix the problem for me. I still get these failures in second or subsequent tabs.
Do we receive that message in the content process's UI thread window proc?
(In reply to alexander surkov from comment #4)
> try server build containing the patch from bug 599814 will be available here

that wasn't correct port, here's the right one - http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/surkov.alexander@gmail.com-3ffe77bc3b4c
new build - http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/surkov.alexander@gmail.com-5b700a8daa01

the previous one failed because errors in patches in the queue
(In reply to alexander surkov from comment #8)
> new build -
> http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/surkov.
> alexander@gmail.com-5b700a8daa01
> 
> the previous one failed because errors in patches in the queue

This one failed, too, there are no Win32 binaries.
(In reply to alexander surkov from comment #8)
> new build -
> http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/surkov.
> alexander@gmail.com-5b700a8daa01
> 
> the previous one failed because errors in patches in the queue

I'd like to reproduce this locally by building an e10s build. Alex, I was looking at your push here:

5b700a8daa01 – try: -p win32
5635d1d0affe – try: -u all
558898965977 – imported patch 546068
841405860662 – [mq]: 688126
b362ef2fc32a – Bug 682790 - ignore implicit label association when it's associated explicitly, r=marcoz
5c0a8e437339 – Bug 687393 - HTML select options gets relation from containing label, r=tbsaunde
a95f9dc00cf8 – Bug 673958 - rework accessible focus handling try: -u all

Would I need some of these to reproduce or can I get a good reproducible build by simply adding: 

ac_add_options --enable-e10s-compat

to a generic mc build?
Hey, Jim. Thank you for looking. You don't need to any of these patches to reproduce. However I don't see this bug so often as Marco sees, I saw it maybe couple of times.

The patch that might be a fix is here 
https://hg.mozilla.org/try/rev/0e0d51e5c9a6

at least when I debugged it I saw that ProcessOrDeferMessage triggers to process WM_GETOBJECT of the fake windows we created on accessibility side.
(In reply to Marco Zehe (:MarcoZ) from comment #9)

> This one failed, too, there are no Win32 binaries.

I started one more but it failed again due to some reason. It reports:
LINK : fatal error LNK1000: Internal error during IMAGE::BuildImage 
but it doesn't look that I touched anything related. I wonder if e10s build is compiled successfully at all.
Is this only reproducible in NVDA, or might I be able to see it with Windows built in accessibility features?
(In reply to Jim Mathies [:jimm] from comment #13)
> Is this only reproducible in NVDA, or might I be able to see it with Windows
> built in accessibility features?

You may try Accessible Event Watcher which is a part of Windows SDK 7.1. But it should be easy to just download NVDA (nvda-project.org).
Do we support remote tabs yet in firefox with --enable-e10s-compat? I've built this up but don't see tabs running in separate processes. 

I realize reproducing this is intermittent, but could someone post build config / STR for attempting to reproduce?
(In reply to Jim Mathies [:jimm] from comment #15)
> Do we support remote tabs yet in firefox with --enable-e10s-compat? I've
> built this up but don't see tabs running in separate processes. 

Did you open more than one? (The first remains in the main process)
Ok, playing with this now. Didn't realize tabs currently run in the oopp container.
I have a patch I'm testing that's a variant of the test patch that isn't working. Basically it excludes accessible tab windows from our hooks in windows message loop. One open question I have though is whether or not accessible calls on IAccessible can trigger cross process ipc calls for things like dom inspection. Looking at the accessibility code it appears the the information we provide is self contained in the sub process - we expose the current state we have. Can someone confirm this assumption? If this is the case, I think excluding the tab window should be safe.
(In reply to Jim Mathies [:jimm] from comment #18)
> I have a patch I'm testing that's a variant of the test patch that isn't
> working. Basically it excludes accessible tab windows from our hooks in
> windows message loop. One open question I have though is whether or not
> accessible calls on IAccessible can trigger cross process ipc calls for
> things like dom inspection. Looking at the accessibility code it appears the
> the information we provide is self contained in the sub process - we expose
> the current state we have. Can someone confirm this assumption?

Yes, this is my understanding - unless I'm missing something. Alexander?
Jim, can you give me some background on when and what for message loop steals windows message processing. Let's we have two processes: chrome and content process and two windows (for chrome and for tab), when message loop decides to process events instead of default procs for these windows?

(In reply to Jim Mathies [:jimm] from comment #18)
> I have a patch I'm testing that's a variant of the test patch that isn't
> working. Basically it excludes accessible tab windows from our hooks in
> windows message loop. 

to be sure: if you excludes the tab windows created by a11y then is it ok if default windows proc answers these events?

> One open question I have though is whether or not
> accessible calls on IAccessible can trigger cross process ipc calls for
> things like dom inspection. Looking at the accessibility code it appears the
> the information we provide is self contained in the sub process - we expose
> the current state we have.

ipc calls in means that AT calls some method on accessible object for content process returned in response on WM_GETOBJECT message that calls into chrome process? I don't think it's possible, they can operate with DOM but that DOM is in content process.

> Can someone confirm this assumption? If this is
> the case, I think excluding the tab window should be safe.

btw, is bug 686861 completely different issue?
(In reply to alexander surkov from comment #20)
> Jim, can you give me some background on when and what for message loop
> steals windows message processing. Let's we have two processes: chrome and
> content process and two windows (for chrome and for tab), when message loop
> decides to process events instead of default procs for these windows?

The windows message loop code traps messages and then simulates similar actions on the target window. You can view the code under ipc/glue/WindowsMessageLoop.

There isn't a way to predict when this happens. The code is active on both sides of the rpc connection and is "turned on" whenever a process makes an out call and is awaiting a reply. (See WaitForNotify in WindowsMessageLoop.cpp) We enter into this state when there is an rpc outcall on the stack and stay there until the final response is received.

Heavy rpc traffic with plugins is a common trigger for this.

> 
> (In reply to Jim Mathies [:jimm] from comment #18)
> > I have a patch I'm testing that's a variant of the test patch that isn't
> > working. Basically it excludes accessible tab windows from our hooks in
> > windows message loop. 
> 
> to be sure: if you excludes the tab windows created by a11y then is it ok if
> default windows proc answers these events?

That's what I'm wondering. I'll post my patch. Basically it's safe to exclude the accessible tab window from this *if* we can be assured that incoming IAccessible calls will not trigger cross process traffic that can trigger Windows systems calls that trigger windowing events targeted at the accessible window.

Note it appears we parent the accessible tab window to the chrome window, which sets us up for this type of problem. I wonder if we could skip doing this and make the accessible tab window top level?

http://mxr.mozilla.org/mozilla-central/source/accessible/src/msaa/nsDocAccessibleWrap.cpp#287

This would help mitigate the problem of messaging dead locks.

> 
> > One open question I have though is whether or not
> > accessible calls on IAccessible can trigger cross process ipc calls for
> > things like dom inspection. Looking at the accessibility code it appears the
> > the information we provide is self contained in the sub process - we expose
> > the current state we have.
> 
> ipc calls in means that AT calls some method on accessible object for
> content process returned in response on WM_GETOBJECT message that calls into
> chrome process? I don't think it's possible, they can operate with DOM but
> that DOM is in content process.

That's what I was thinking, if that's true then we should be ok.

> 
> > Can someone confirm this assumption? If this is
> > the case, I think excluding the tab window should be safe.
> 
> btw, is bug 686861 completely different issue?

That looks like an invalidation hang - a common problem with our current setup and appears to be unrelated to accessibility.
(In reply to Jim Mathies [:jimm] from comment #21)

> > to be sure: if you excludes the tab windows created by a11y then is it ok if
> > default windows proc answers these events?
> 
> That's what I'm wondering. I'll post my patch. Basically it's safe to
> exclude the accessible tab window from this *if* we can be assured that
> incoming IAccessible calls will not trigger cross process traffic that can
> trigger Windows systems calls that trigger windowing events targeted at the
> accessible window.

I meant is it ok to if default window proc will answer on messages other than WM_GETOBJECT (http://mxr.mozilla.org/mozilla-central/source/accessible/src/msaa/nsAccessNodeWrap.cpp#746).

> Note it appears we parent the accessible tab window to the chrome window,
> which sets us up for this type of problem. I wonder if we could skip doing
> this and make the accessible tab window top level?

I think AT relies on AT that's why I did that. But I don't recall exactly what went wrong. We could disable it and try again.

> http://mxr.mozilla.org/mozilla-central/source/accessible/src/msaa/
> nsDocAccessibleWrap.cpp#287
> 
> This would help mitigate the problem of messaging dead locks.

Can you elaborate on this please? I'm not sure I follow you.

> That's what I was thinking, if that's true then we should be ok.

that should be true, at least I don't see where this rule could be broken.
> > http://mxr.mozilla.org/mozilla-central/source/accessible/src/msaa/
> > nsDocAccessibleWrap.cpp#287
> > 
> > This would help mitigate the problem of messaging dead locks.
> 
> Can you elaborate on this please? I'm not sure I follow you.


hypothetical paint example:

1) chrome: UpdateWindow(main chrome window)

2) Windows: WM_PAINT –> child windows

3) child process: receives WM_PAINT

4) child paint handler decides to inspect something in the dom

5) child: rpc call to chrome process

6) *deadlock*

This can happen in reverse order too where the child is blocked and the parent attempts to communicate with it.
(In reply to alexander surkov from comment #22)
> I meant is it ok to if default window proc will answer on messages other
> than WM_GETOBJECT
> (http://mxr.mozilla.org/mozilla-central/source/accessible/src/msaa/
> nsAccessNodeWrap.cpp#746).

Sure, the less we do that can cause rpc traffic the better. If we find that we need to handle something to avoid a deadlock we can address those problems as they come up.
When a screen reader calls in on the IAccessible interface, does that enter the process as an OLE message on the UI thread?
(In reply to Jim Mathies [:jimm] from comment #25)
> When a screen reader calls in on the IAccessible interface, does that enter
> the process as an OLE message on the UI thread?
As Gecko's UI thread is an STA COM thread (as far as I am aware), then Windows will use OLE window messages to synchronize the COM call. Though I'm pretty sure that COM will create its own hidden message window for this purpose.
(In reply to Michael Curran from comment #26)
> (In reply to Jim Mathies [:jimm] from comment #25)
> > When a screen reader calls in on the IAccessible interface, does that enter
> > the process as an OLE message on the UI thread?

> As Gecko's UI thread is an STA COM thread (as far as I am aware), then
> Windows will use OLE window messages to synchronize the COM call. Though I'm
> pretty sure that COM will create its own hidden message window for this
> purpose.

That makes sense to me, it's similar to the way com handles copy paste and drag and drop.

If this is the case and the accessibility code can cause rpc calls from chrome <-> content we will likely be looking at some pretty major dead lock problems down the road. So I'm really curious how isolated the accessibility code is in each process.
Release/debug try builds with an alternative approach should be up in a bit:

http://dev.philringnalda.com/tbpl/?tree=Try&rev=948b6052a60a

These should have e10s-compat enabled through build config, assuming try's extra build options thing worked.
(In reply to Jim Mathies [:jimm] from comment #28)
> Release/debug try builds with an alternative approach should be up in a bit:
> 
> http://dev.philringnalda.com/tbpl/?tree=Try&rev=948b6052a60a
> 
> These should have e10s-compat enabled through build config, assuming try's
> extra build options thing worked.

Sorry, wrong set of builds which I had to cancel. Here's the correct link:

http://dev.philringnalda.com/tbpl/?tree=Try&rev=252dd31c64a6
(In reply to Jim Mathies [:jimm] from comment #27)

> If this is the case and the accessibility code can cause rpc calls from
> chrome <-> content we will likely be looking at some pretty major dead lock
> problems down the road. So I'm really curious how isolated the accessibility
> code is in each process.

All cross process communications happens in nsDocAccessibleWrap::DoInitialUpdate(). It's never triggered by screen reader.

On the another hand can screen reader send WM_GETOBJECT for chrome process window from content process (when its dll is injected into our content process) and could it lead to deadlock (for example, if they do that during rpc call)?

Btw, we have bug 649720 where screen readers want to be able to navigate between trees of chrome and content processes. This bug is related with isolation of accessible code in each process.
(In reply to alexander surkov from comment #30)
> (In reply to Jim Mathies [:jimm] from comment #27)
> 
> > If this is the case and the accessibility code can cause rpc calls from
> > chrome <-> content we will likely be looking at some pretty major dead lock
> > problems down the road. So I'm really curious how isolated the accessibility
> > code is in each process.
> 
> All cross process communications happens in
> nsDocAccessibleWrap::DoInitialUpdate(). It's never triggered by screen
> reader.
> 
> On the another hand can screen reader send WM_GETOBJECT for chrome process
> window from content process (when its dll is injected into our content
> process) and could it lead to deadlock (for example, if they do that during
> rpc call)?

Not sure and I doubt the behavior is clearly defined on MSDN. Since they are parented I suppose it's possible Windows might walk up the window chain. We should consider breaking that chain and making the content tab windows top level, assuming screen readers can handle that.
(In reply to Jim Mathies [:jimm] from comment #29)
> (In reply to Jim Mathies [:jimm] from comment #28)
> > Release/debug try builds with an alternative approach should be up in a bit:
> > 
> > http://dev.philringnalda.com/tbpl/?tree=Try&rev=948b6052a60a
> > 
> > These should have e10s-compat enabled through build config, assuming try's
> > extra build options thing worked.
> 
> Sorry, wrong set of builds which I had to cancel. Here's the correct link:
> 
> http://dev.philringnalda.com/tbpl/?tree=Try&rev=252dd31c64a6

The link on this release build seg faulted, will fire off a new set and post when they are available.
(In reply to Jim Mathies [:jimm] from comment #31)
> (In reply to alexander surkov from comment #30)
> > (In reply to Jim Mathies [:jimm] from comment #27)
> > 
> > > If this is the case and the accessibility code can cause rpc calls from
> > > chrome <-> content we will likely be looking at some pretty major dead lock
> > > problems down the road. So I'm really curious how isolated the accessibility
> > > code is in each process.
> > 
> > All cross process communications happens in
> > nsDocAccessibleWrap::DoInitialUpdate(). It's never triggered by screen
> > reader.
> > 
> > On the another hand can screen reader send WM_GETOBJECT for chrome process
> > window from content process (when its dll is injected into our content
> > process) and could it lead to deadlock (for example, if they do that during
> > rpc call)?
> 
> Not sure and I doubt the behavior is clearly defined on MSDN. Since they are
> parented I suppose it's possible Windows might walk up the window chain. We
> should consider breaking that chain and making the content tab windows top
> level, assuming screen readers can handle that.

Although, that might cause the window to show up in the task manager since it transparent but visible. Does this window need to be visible or can it be hidden by default?
(In reply to Jim Mathies [:jimm] from comment #33)

> Although, that might cause the window to show up in the task manager since
> it transparent but visible. Does this window need to be visible or can it be
> hidden by default?

yep, it should, especially when screen readers that requiring non zero width/height of windows are running. iirc visibility is something that they rely on.
Hmm, so building release builds with the e10compat flag basically don't succeed. I keep getting link problems:

   Creating library xul.lib and object xul.exp
Generating code
4058 of 156477 (  2.59%) profiled functions will be compiled for speed
e:\builds\moz2_slave\try-w32\build\toolkit\xre\nsapprunner.cpp(3631) : fatal error C1001: An internal error has occurred in the compiler.
(compiler file 'F:\SP\vctools\compiler\utc\src\P2\main.c[0x10C9CFA1:0x006E5EE0]', line 182)
 To work around this problem, try simplifying or changing the program near the locations listed above.
Please choose the Technical Support command on the Visual C++ 
 Help menu, or open the Technical Support help file for more information

LINK : fatal error LNK1000: Internal error during IMAGE::BuildImage

So I will post the patch for people to build locally and test.
Assignee: nobody → jmathies
(In reply to Jim Mathies [:jimm] from comment #35)

> LINK : fatal error LNK1000: Internal error during IMAGE::BuildImage

the same error I've got for my patch.
(In reply to alexander surkov from comment #37)
> (In reply to Jim Mathies [:jimm] from comment #35)
> 
> > LINK : fatal error LNK1000: Internal error during IMAGE::BuildImage
> 
> the same error I've got for my patch.

Looks like it has something to do with PGO, although I don't see any specific build errors, just the LINK barf at the end. Trying a local PGO build now to see if I can get this working on try.
Odd, during the link phase, it barfs on XRE_main(), yet, we don't have any MOZ_E10S_COMPAT ifdef'ing in the app runner src. Apparently that function is too complex for the linker in this particular case, or there is some unknown factor throwing the linker off.
Local profiledbuild worked just fine. Not sure what's up with our build slaves with this config.

If we don't add the e10scompat flag, would a try build still be useful in testing? From what I understand simply flipping the pref an a standard build results in a pretty broken browser.
PGO builds on try went away recently, but unfortunately we also lost mozconfig-extra support via bug 558180, so folks will need to test the patch I've posted manually.
(In reply to Jim Mathies [:jimm] from comment #41)
> PGO builds on try went away recently, but unfortunately we also lost
> mozconfig-extra support via bug 558180, so folks will need to test the patch
> I've posted manually.

Someone clued me on how to get this to work (https://hg.mozilla.org/try/rev/6228bcaf4bca) so we shall see if these builds succeed.

https://tbpl.mozilla.org/?noignore=0&tree=Try&rev=94bccd25779e
Had to rerun the release build as try was using the 'nightly' config vs 'release'. 

release build for testing:

http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/jmathies@mozilla.com-800d9f2b243d/try-win32/
I tested the try-server build and am seeing that the WM_GETOBJECT messages are now being answered each time. I didn't encounter the "NetscapeDispatchWnd" situation at all any more while testing the build for more than 2 hours. I opened and closed tabs and windows, filled out forms, opened and closed Mozilla dialogs and menus. All worked well.

There are, of course, all sorts of keyboard, focus, and history-related bugs I encountered, but none of these have to do with this bug. Jim, if this is the patch you want to go with, I suggest to get the review process going. ;) From my end, this looks good.
Comment on attachment 564554 [details] [diff] [review]
filter tab content windows patch v.1

Ok, kicking of a review -> davidb

I noticed one bug that was rather annoying while testing, the window position reported to the accessible tool for tabs was often positioned at the upper left corner of the desktop, while the first in process tab reported the proper position. If that hasn't been filed yet I'll file it off.
Attachment #564554 - Flags: review?(bolterbugz)
Comment on attachment 564554 [details] [diff] [review]
filter tab content windows patch v.1

Review of attachment 564554 [details] [diff] [review]:
-----------------------------------------------------------------

r=me. Thanks Jim!

::: accessible/src/msaa/nsAccessNodeWrap.cpp
@@ +744,5 @@
>  nsAccessNodeWrap::WindowProc(HWND hWnd, UINT msg, WPARAM wParam, LPARAM lParam)
>  {
> +  // Note, this window's message handling should not invoke any call that
> +  // may result in a cross-process ipc call. Doing so may violate RPC
> +  // message symanticts.

Nit: you probably meant "semantics".

::: ipc/glue/WindowsMessageLoop.cpp
@@ +384,5 @@
> +  // Tab content creates a window that responds to accessible WM_GETOBJECT
> +  // calls. This window can safely be ignored.
> +  if (::GetPropW(hWnd, kPropNameTabContent)) {
> +    return false;
> +  }

Nice.
Attachment #564554 - Flags: review?(bolterbugz) → review+
(In reply to Jim Mathies [:jimm] from comment #45)

> I noticed one bug that was rather annoying while testing, the window
> position reported to the accessible tool for tabs was often positioned at
> the upper left corner of the desktop, while the first in process tab
> reported the proper position. If that hasn't been filed yet I'll file it off.

Please file.
Comment on attachment 564554 [details] [diff] [review]
filter tab content windows patch v.1


>+const PRUnichar* kPropNameTabContent = L"AccessibleTabWindow";
>+

>+  if (hwnd) {
>+    ::SetPropW(hwnd, kPropNameTabContent, (HANDLE)1);
>+  }
>+  return hwnd;

please add a comment for this (comment for kPropNameTabContent is good option) so that I don't need to look the ipc code to figure out what it is for.
Depends on: 693644
https://hg.mozilla.org/mozilla-central/rev/49ccbc7fd631
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla10
Depends on: 693890
It appears this breaks the Windows build with 

  ac_add_options --disable-accessibility
(In reply to Julian Reschke from comment #51)
> It appears this breaks the Windows build with 
> 
>   ac_add_options --disable-accessibility

See bug 693890 :-)
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: