Closed Bug 522414 Opened 10 years ago Closed 10 years ago

RPC and Sync channels will deadlock easily on Windows

Categories

(Core :: IPC, defect)

x86
Windows XP
defect
Not set

Tracking

()

RESOLVED FIXED

People

(Reporter: bent.mozilla, Assigned: bent.mozilla)

References

(Depends on 1 open bug)

Details

Attachments

(4 files, 2 obsolete files)

Attached patch Patch, v1 (obsolete) — Splinter Review
Due to the craziness that is the Windows message system we can very easily deadlock when we have native widgets in two processes that are also allowed to synchronously call each other via IPC. The only way around this as far as I know is to spin a mini-event loop during synchronous calls that will only respond to certain Windows messages. Patch attached that fixes everything, I think.
Attachment #406383 - Flags: review?(jmathies)
Attachment #406383 - Flags: review?(robert.bugzilla)
Checked out the code base today and have a build going so I can get into this to understand what you're trying to do here. Have a blocking top crasher for 1.9.1 though I'm working on, so it will be a day or so until I can get to this review.
Comment on attachment 406383 [details] [diff] [review]
Patch, v1

>diff --git a/ipc/glue/SyncChannel.cpp b/ipc/glue/SyncChannel.cpp
>--- a/ipc/glue/SyncChannel.cpp
>+++ b/ipc/glue/SyncChannel.cpp
>...
>+static inline bool
>+WindowIsMozillaWindow(HWND hwnd)
>+{
>+    PRUnichar buffer[256] = {0};
>+    int length = GetClassNameW(hwnd, (wchar_t*)buffer, sizeof(buffer) - 1);
>+    if (length <= 0) {
>+        NS_WARNING("Failed to get class name?!");
>+        return false;
>+    }
>+
>+    // Cheap test
>+    if (buffer[0] != PRUnichar('M') && buffer[0] != PRUnichar('G') &&
>+        buffer[0] != PRUnichar('n') && buffer[0] != PRUnichar('F')) {
>+        return false;
>+    }
>+
>+    // More expensive test
>+    nsDependentString className(buffer, length);
>+    if (!StringBeginsWith(className, NS_LITERAL_STRING("Mozilla")) &&
>+        !StringBeginsWith(className, NS_LITERAL_STRING("Gecko")) &&
>+        !className.EqualsLiteral("nsToolkitClass") &&
>+        !className.EqualsLiteral("nsAppShell:EventWindowClass") &&
>+        !className.EqualsLiteral("FirefoxMessageWindow")) {
>+        return false;
From nsNativeAppSupportWin.cpp for setting the MessageWindow... it is different for each app
    static const PRUnichar *className() {
        static PRUnichar classNameBuffer[128];
        static PRUnichar *mClassName = 0;
        if ( !mClassName ) {
            ::_snwprintf(classNameBuffer,
                         128,   // size of classNameBuffer in PRUnichars
                         L"%s%s",
                         NS_ConvertUTF8toUTF16(gAppData->name).get(),
                         L"MessageWindow" );
            mClassName = classNameBuffer;
        }
        return mClassName;
    }
(In reply to comment #0)
> Created an attachment (id=406383) [details]
> Patch, v1
> 
> Due to the craziness that is the Windows message system we can very easily
> deadlock when we have native widgets in two processes that are also allowed to
> synchronously call each other via IPC. The only way around this as far as I
> know is to spin a mini-event loop during synchronous calls that will only
> respond to certain Windows messages. Patch attached that fixes everything, I
> think.

Can you be more specific as to how the deadlock occurs? Why are we using window messages for IPC? I don't understand under what circumstances we have two native widgets in different processes sending messages to each other.
After some discussion with cjones over IRC, I have an idea. This all bases on the theory that SendMessage is implemented by queuing an APC (http://msdn.microsoft.com/en-us/library/ms684954(VS.85).aspx) to the thread that owns the window - a call stack would confirm this I think. APCs are not delivered until the thread is in an alertable state and many wait/synchronization calls allow you to specify if you want to be alertable or not. So long as we are not alertable when we block in the child, the APCs will not be dispatched and so we will not have these odd calls into window procedures when we intend to be blocked. After unblocking, a simple call to check the message queue should allow the APCs to be dispatched.

I don't have details on how to do this right now because I don't know your IPC that well and I'd like to confirm the theory first.

So that would solve the child unblocking itself, but we really shouldn't call SendMessage (sync IPC) to the child process - can we figure out where in the parent we are doing this? There's SendMessageCallback which is probably a better option (if not just PostMessage). Note that some SendMessage calls may be hiding inside API functions (SetWindowText is an example).
(In reply to comment #4)
> After some discussion with cjones over IRC, I have an idea. This all bases on
> the theory that SendMessage is implemented by queuing an APC

No.

We're not calling SendMessage manually at all. Here's an example of the problem:

1. Browser wants to create a plugin window. It makes a new HWND in the browser process.
2. Browser calls SetWindow on the plugin process via IPC. This is a synchronous message and so the browser process blocks.
3. Plugin process creates its own plugin window that it can actually subclass, draw into, etc. It then calls SetParent to make the plugin window a child of the browser's plugin window.
4. Windows generates a WM_PARENTNOTIFY message that it sends *synchronously* to the browser's plugin HWND from _within_ the SetParent call. This causes the plugin process to block.
5. Because the browser process is blocked the WM_PARENTNOTIFY message is never serviced. We're hosed.

This is just one example, and, incidentally, I was able to work around it by making the plugin's window set the WS_EX_NOPARENTNOTIFY style bit. But the general problem remains - Windows sends events synchronously between parent and child windows without notifying us beforehand. We cannot avoid deadlocks in all cases unless we have a minimal Windows message loop going while we are waiting for synchronous IPC to complete.
(In reply to comment #5)
> 1. Browser wants to create a plugin window. It makes a new HWND in the browser
> process.
> 2. Browser calls SetWindow on the plugin process via IPC. This is a synchronous
> message and so the browser process blocks.
> 3. Plugin process creates its own plugin window that it can actually subclass,
> draw into, etc. It then calls SetParent to make the plugin window a child of
> the browser's plugin window.

Under what circumstance does this happen? I'm playing with a windowed flash video from you tube. I see the set window call from parent to child, and things return fine and execution continues. I'm not 100% on how the new plugin code works - in the original implementation, the window was created by the browser and set for the plugin, which then subclased that window. The plugin never created its own base 'browser' window to draw in. 

Does this match our current electrolysis impl? - 

1) create the plugin window in the browser process [by browser, I mean the browser shell (firefox.exe), w/mozillaruntime.exe holding the tab in a child process]

2) hand the plugin window hwnd via ipc to the child plugin shim via NPP_SetWindow.

3) child shim loads the plugin, and calls NPP_SetWindow on flash.

4) flash subclasses that window, returns.

5) execution unwinds back up to the browser and continue normally.

I'm asking this because, w/out your patch, I'm not seeing any deadlock issues with my test video.

I haven't tested creating a child window of the base plugin window (in the child process) yet, will try to get that going here in a bit with the test plugin.
I got my example wrong in comment 5. It isn't the SetParent call that causes problems, it's the CreateWindow call when we give a parent HWND. We've since changed the way this works a little, but here's what the docs say for WM_PARENTNOTIFY:

  When the child window is being created, the system sends WM_PARENTNOTIFY
  just before the CreateWindow or CreateWindowEx function that creates the
  window returns.

That happens when we send the parent HWND to the CreateWindow call, which we no longer do, so I guess this wasn't the best example.

The general problem remains, though. Once the parent/child relationship is established Windows will freely send synchronous messages when we're not expecting them.

It's interesting that you don't see any deadlocks... I can't even load a youtube video without hitting one. I'll attach a stack in a sec.
(In reply to comment #6)
> Does this match our current electrolysis impl?

Not exactly. For plugins let's assume that we don't have tabs in their own processes yet. I've never even tried running with both prefs enabled.

So here's how we do it currently:

1. firefox.exe creates a plugin window, let's call it the "parent plugin window". It is created as a child of the main firefox window.
2. HWND of parent plugin window is sent via IPC to the mozilla-runtime.exe process through the NPP_SetWindow call.
3. mozilla-runtime.exe creates its own plugin window, let's call it the "child plugin window". It must be created by the mozilla-runtime.exe process because you cannot subclass windows that belong to another process.
4. mozilla-runtime.exe calls SetParent to make the child plugin window the child of the parent plugin window. The child plugin window is resized to fill the parent plugin window and made visible.
5. mozilla-runtime.exe loads flash and calls NPP_SetWindow with the child plugin window as its parameter.
6. Flash subclasses the child plugin window to do its thing.
Ben and I talked about this over IRC. There are few things that I'd like to see more comments/documentation on:
a) Why MsgWaitForMultipleObjects checks the queue in both processes. Ben pointed me at http://msdn.microsoft.com/en-us/library/ms810439.aspx which I intend to read.
b) A larger block comment somewhere that describes the race and the solution here (namely pumping the message loop with all the WndProcs substituted for DefWindowProc).

Before this gets checked in, I'd like to see an audit of our WndProcs to check for important messages sent to them that we cannot miss in our code (ex: theme change notification). If something turns up, we'll need a better solution (will probably depend on the message).
I've been able to reproduce the deadlock, at least on the destroy message after removing WS_EX_NOPARENTNOTIFY from the styles. The destroy deadlock was easy to fix by setting the parent to null before calling DestroyWindow. 

Generally speaking, I'm not convinced we need this message trap. For example, why remove WS_EX_NOPARENTNOTIFY from the style? The purpose there is to prevent parent notifications, which is exactly what we want to do here on create and destroy. 

I searched around for additional notification messages that might crop up independent of those that WS_EX_NOPARENTNOTIFY filters, but didn't find much. Windows notifications are listed here - 

http://msdn.microsoft.com/en-us/library/dd469354(VS.85).aspx

If the only deadlock problem we have is with create, destroy, and some mouse events, and those can be addressed with WS_EX_NOPARENTNOTIFY filtering, what's the point of the trap? Also, if we see deadlocks from other events caused by NPP calls, why can't we make those calls async?

I'm a little hesitant to sign onto dumping windowing events for periods of time. I suppose if we have to do it, so be it, but I would hope we could address this without that.

bent, I'd be more than happy to do some more debugging, looking for solutions independent of this approach if we can come up with specific cases where they occur. 

One additional note - on Vista, the SetParent deadlock doesn't occur, which obviously doesn't mitigate problems on XP but I find it interesting that changes in the os apparently correct the problem.
> If the only deadlock problem we have is with create, destroy, and some mouse
> events, and those can be addressed with WS_EX_NOPARENTNOTIFY filtering, what's
> the point of the trap?

The WM_PARENTNOTIFY message is only one message that is sent synchronously by Windows. And yes, in this case we can avoid it, though it's always possible that we could one day want to catch those messages.

The general problem, however, remains. The stack I attached is in a paint message, for instance.

> Also, if we see deadlocks from other events caused by NPP calls, why can't
> we make those calls async?

The entire plugin API is synchronous. A web page can call into a plugin and we cannot suspend the execution of the page while we wait for the plugin to respond.
> I'm a little hesitant to sign onto dumping windowing events for periods of
> time. I suppose if we have to do it, so be it, but I would hope we could
> address this without that.

Also, the "nice" thing about this patch is that we have a single spot to handle particular messages that we don't want to discard. So if you're nervous about ignoring a particular message then we don't have to. We can either handle them immediately if we know that it is safe to do so or we can 'fake' them by posting replacement messages to be processed as soon as we return to the message loop. None of that is particularly tricky, we just need to know which messages to care about.
+LRESULT CALLBACK
+NeuteredWindowProc(HWND hwnd,
+                   UINT uMsg,
+                   WPARAM wParam,
+                   LPARAM lParam)
+{
+    WNDPROC oldWndProc = (WNDPROC)GetProp(hwnd, kOldWndProcProp);

Maybe add a warning in here that prints these messages out to the console? I noticed at least a few set focus events that got dropped, some set icon and set cursor events, but nothing critical. We should probably warn people that this is taking place, at least initially.

+        if (!currentWndProc) {
+            DWORD lastError = GetLastError();
+            if (!lastError) {

if (ERROR_SUCCESS != GetLastError()) {

+        if (SetProp(hwnd, kOldWndProcProp, (HANDLE)currentWndProc) &&
+            SetProp(hwnd, kOldWndProcPropSet, (HANDLE)1)) {
+            return TRUE;
+        }

Do we need both of these properties? Looks like you could key off of the presence of kOldWndProcProp.

+    NS_ASSERTION(mEventLoopDepth > 0, "Huh?!");

Generally, make these assert events more detailed.

And as rob pointed out, a nice comment block explaining why we're doing this up top would probably be a good idea.
(In reply to comment #14)
> if (ERROR_SUCCESS != GetLastError()) {

eh!

if (ERROR_SUCCESS == GetLastError()) {
Attached patch Patch, v1.1 (obsolete) — Splinter Review
Here's a patch with loads more comments and fixes for the suggestions given by rstrong and jmathies.
Attachment #406383 - Attachment is obsolete: true
Attachment #408118 - Flags: review?(tellrob)
Attachment #408118 - Flags: review?(robert.bugzilla)
Attachment #408118 - Flags: review?(jmathies)
Attachment #406383 - Flags: review?(robert.bugzilla)
Attachment #406383 - Flags: review?(jmathies)
Attachment #408118 - Flags: review?(jmathies) → review+
Comment on attachment 408118 [details] [diff] [review]
Patch, v1.1

>+static inline bool
>+WindowIsMozillaWindow(HWND hwnd)
>+{
>+    PRUnichar buffer[256] = { 0 };
>+    int length = GetClassNameW(hwnd, (wchar_t*)buffer, sizeof(buffer) - 1);
>+    if (length <= 0) {
>+        NS_WARNING("Failed to get class name!");
>+        return false;
>+    }
>+
>+    nsDependentString className(buffer, length);
>+    if (StringBeginsWith(className, NS_LITERAL_STRING("Mozilla")) ||
>+        StringBeginsWith(className, NS_LITERAL_STRING("Gecko")) ||
>+        className.EqualsLiteral("nsToolkitClass") ||
>+        className.EqualsLiteral("nsAppShell:EventWindowClass") ||
>+        className.EqualsLiteral("FirefoxMessageWindow")) {
Did you mean to keep FirefoxMessageWindow to try to improve perf?
Attached patch Patch, v1.2Splinter Review
This version uses a WH_CALLWNDPROC hook instead of enumerating all the thread's windows to figure out which window procedures should be swapped out. Thanks to Rob Arnold for suggesting it. I also fixed the 'FirefoxMessageWindow' thing that Rob Strong pointed out, it was not supposed to hang around.
Attachment #408118 - Attachment is obsolete: true
Attachment #408481 - Flags: review?(tellrob)
Attachment #408481 - Flags: review?(robert.bugzilla)
Attachment #408481 - Flags: review?(jmathies)
Attachment #408118 - Flags: review?(tellrob)
Attachment #408118 - Flags: review?(robert.bugzilla)
(In reply to comment #10)
> Before this gets checked in, I'd like to see an audit of our WndProcs to check
> for important messages sent to them that we cannot miss in our code (ex: theme
> change notification).

I mentioned this on IRC today but it needs to be recorded here: I don't know of any way to do this systematically. We're talking about an unknown set of windows messages that are delivered within a tiny window of time in response to an unknown set of user and system events. This is not something we can audit conclusively in any way that I know.

As an example, I don't see any messages that get hooked the majority of the time. If I run something that uses IPC heavily (YouTube is a good test) and switch back and forth between several different applications I can see a few focus related messages that get hooked. Sometimes I see IME messages. I don't change themes all that often but I'd be surprised if windows decided that that should be a "nonqueued" message. Trying to change themes at precisely the right moment would most likely lead to a mental illness, in any case.

My point is that I don't believe an audit like you're requesting is possible. I believe that the best we can do is watch for bugs and make sure that the logs get analyzed (they currently note every message that gets dropped).

Here's a list of all the messages I've seen while hooked during my testing, in case any of these are worrisome:

WM_ACTIVATE
WM_ACTIVATEAPP
WM_SETFOCUS
WM_KILLFOCUS
WM_IME_SETCONTEXT
WM_IME_NOTIFY
WM_ERASEBKGND
WM_WINDOWPOSCHANGED
WM_NCPAINT
(In reply to comment #20)
> WM_ACTIVATE
> WM_ACTIVATEAPP
> WM_SETFOCUS
> WM_KILLFOCUS
> WM_IME_SETCONTEXT
> WM_IME_NOTIFY
> WM_ERASEBKGND
> WM_WINDOWPOSCHANGED
> WM_NCPAINT

If we're dropping these on a regular basis, we're going to need a way to queue some of these up and get them sent. (or a different approach all together)  

WM_WINDOWPOSCHANGED is critical for top level windows when we are setting window state, or size. It passes in a WINDOWPOS structure. 

WM_ACTIVATE, WM_SETFOCUS, and WM_KILLFOCUS are critical for focus changes and passes information through flags.

The IME messages get thrown around a lot, not sure how loosing these is going to  effect native and 3rd party imes.
Here's a list of messages that we see with the WH_CALLWNDPROC hook running normally for every window in the system (on Windows 7). I even switched themes a few times and did the taskbar preview thing.

Oh, and the InSendMessage API call returns false for the (vast) majority of these, so there's clearly a separate mechanism at work here.

So what do you guys think? Do we need handlers for some of these? Are there any messages here that we absolutely could not miss?
Comment on attachment 408481 [details] [diff] [review]
Patch, v1.2

r=me for this only way to accomplish this with our current architecture patch that doesn't make me terribly happy and I know makes you even less happy.
Attachment #408481 - Flags: review?(robert.bugzilla) → review+
Blocks: OOPP
I should also mention that I think it would be good to get this landed sooner rather than later and to fix things like perhaps storing deferred messages in an array and handling them prior to handling new messages in a followup bug.
Filed bug 526361, patch attached for deferring messages.

Jim, Rob Arnold, can you please review the patch in this bug so that we can start getting feedback on this approach? Thanks a bunch for spelunking with me.
Attachment #408490 - Flags: review?(jmathies) → review+
Attachment #408481 - Flags: review?(jmathies) → review+
Landed changeset 478360ac91c4 on electrolysis. Rob Arnold said that he would followup if he sees anything lacking.
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Depends on: 527925
Depends on: 1334324
You need to log in before you can comment on or make changes to this bug.